architecting for failure - 4developers 2015
Architecting for failure: building fault-tolerant systems
Jakub Derda
Warsaw, 2015
‘Tree’ component – overview
‘Tree’ component – detailed view
client
network connection
server
‘Tree’ component – detailed view
human factor, software, client library
ISP, protocol stack, network
load balancers, OS, power source
client
network connection
server
Your component – detailed view
What is a fault?
What is not a fault?
Service is not working on our side*
* Caused by e.g. technical failures, outages, corrupted data, attacks
What is a fault?
The real fault is when we don’t deliver value to customers.
Delivering value without a working system
Bring your own wine, we’re waiting for a license.
Last election in Poland
What is fault tolerance not?
It’s NOT making sure your system never goes down.
It (eventually) will.
What is fault tolerance?
It’s making sure that the system can recover quickly and/or the client is not impacted.
How to solve it?
Solving – redundancy
Hot/warm replicas
Caches
Geographical distribution, CDNs
Hardware redundancy
Alternative systems and procedures
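As a rough illustration of how replicas and caches can combine, here is a minimal sketch, not taken from the talk. The function name `read_with_fallback`, the replica callables and the dict-based cache are all hypothetical. The idea is to try each replica in turn and, if all of them fail, serve the last known good value.

```python
from typing import Callable, Dict, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica answered and no cached value exists."""


def read_with_fallback(
    key: str,
    replicas: Sequence[Callable[[str], str]],  # hypothetical replica clients
    cache: Dict[str, str],                     # hypothetical local cache
) -> str:
    """Try each replica in order; fall back to the last cached value."""
    for fetch in replicas:
        try:
            value = fetch(key)
        except Exception:
            continue  # this replica is down; try the next one
        cache[key] = value  # remember the last known good answer
        return value
    if key in cache:
        return cache[key]  # possibly stale, but the client is still served
    raise AllReplicasFailed(key)
```

Serving a stale cached value is itself a tradeoff: the client is not impacted by the outage, at the price of consistency.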
Solving – design
Stateless
Auditing
Idempotent requests
Uniqueness / randomness
Asynchronous and decoupling
EIPs
Commands, not data
Break the rules
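The "idempotent requests" and "uniqueness" points can be sketched together. This is an assumed example, not the speaker's code: `PaymentService`, `deposit` and `new_request_id` are hypothetical names. The client generates one unique ID per logical request and reuses it on every retry, so the server can detect and ignore duplicates.

```python
import uuid
from typing import Dict


class PaymentService:
    """Hypothetical service that applies each request ID at most once."""

    def __init__(self) -> None:
        self.balance = 0
        self._results: Dict[str, int] = {}  # request_id -> balance after applying

    def deposit(self, request_id: str, amount: int) -> int:
        if request_id in self._results:
            # A retry of a request we already handled: replay the stored result.
            return self._results[request_id]
        self.balance += amount
        self._results[request_id] = self.balance
        return self.balance


def new_request_id() -> str:
    """Client-side unique ID, generated once and reused for every retry."""
    return str(uuid.uuid4())
```

With this in place a client that times out can safely resend the same request: the second `deposit` call with the same ID returns the earlier result and leaves the balance unchanged.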
Solving – procedures
Backup creation, cleanup and restore
QA & potential problems
Continuous integration
Deployment
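A backup is only useful if restoring it works, so the backup and restore steps are worth scripting together. The sketch below is one assumed way to do it for a single file; the function names and the `.bak` suffix are hypothetical, and a real system would likely back up to separate storage.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum of the file contents, used to compare source and copy."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def backup_and_verify(source: Path, backup_dir: Path) -> Path:
    """Copy the file into backup_dir and verify the copy by checksum."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / (source.name + ".bak")
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"backup of {source} is corrupted")
    return target


def restore(backup: Path, destination: Path) -> None:
    """Overwrite destination with the backed-up copy."""
    shutil.copy2(backup, destination)
```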
Solving – observe
Dive deep, post-mortems
Identify bottlenecks
Observe key metrics
Verify assumptions
Predict traffic
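One way to observe a key metric is to track the error rate over the most recent requests. The sketch below is an assumption for illustration: the class name `ErrorRateMonitor`, the window size and the 5% threshold are hypothetical, and a production setup would more likely rely on a dedicated monitoring system.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the error rate over the last `window` requests."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        self._outcomes = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold

    def record(self, ok: bool) -> None:
        self._outcomes.append(ok)  # the oldest outcome drops out automatically

    def error_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        failures = sum(1 for ok in self._outcomes if not ok)
        return failures / len(self._outcomes)

    def should_alert(self) -> bool:
        return self.error_rate() > self.threshold
```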
Tradeoffs - simple
cost
time
1/scope
QUALITY
Tradeoffs - real
cost
durability
time
consistency
trust
audit (traceability)
complexity
security
scalability
functionality
stability
reliability
extensibility
performance
maintainability
manageability
Summary
Learn to live with crashes
Automate procedures
Don’t be afraid to cross the line
Fault tolerance is not a property of a design, it’s a process.