Fault tolerance made easy
DESCRIPTION
Fault tolerance in general is a challenging topic. Yet we need fault-tolerant designs more than ever to provide robust, highly available systems, especially now that scale-out systems are becoming more and more popular. Unfortunately, most developers do not pay much attention to fault-tolerant design, either because they are scared by the complexity of the field or because they do not care enough. One of the problems is that a lack of fault-tolerant design does not hurt much in development or in QA, but it hurts a lot in production. As Michael Nygard said (at least figuratively): "It's all about production!"

In this presentation I do *not* try to give a general introduction to fault-tolerant design. Instead I pick a few generic case studies that demonstrate the results of missing fault-tolerant design, try to raise awareness of its relevance in production, and then walk through a few selected patterns. The patterns I picked are surprisingly easy to implement and help to mitigate the problems shown in the case studies. This way I try to show two things:

1. A piece of architecture or design expressed as a pattern is not necessarily hard to implement. Sometimes writing the code takes less time than explaining the pattern beforehand.
2. Even if fault-tolerant design as a general topic might be hard, some parts of it can be implemented very easily, and it is more than worth the coding effort when you see how much better your system behaves in production just from adding those few lines of code.

TRANSCRIPT
![Page 1: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/1.jpg)
Fault tolerance made easy
Patterns for fault tolerance implemented surprisingly easy
Uwe Friedrichsen, codecentric AG, 2013-2014
![Page 2: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/2.jpg)
@ufried Uwe Friedrichsen | http://slideshare.net/ufried | http://ufried.tumblr.com
![Page 3: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/3.jpg)
It's all about production!
![Page 4: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/4.jpg)
Production
Availability
Resilience
Fault Tolerance
![Page 5: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/5.jpg)
Your web server doesn't look good …
![Page 6: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/6.jpg)
Pattern #1
Timeouts
![Page 7: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/7.jpg)
Timeouts (1)

    // Basics
    myObject.wait();          // Do not use this by default
    myObject.wait(TIMEOUT);   // Better use this

    // Some more basics
    myThread.join();          // Do not use this by default
    myThread.join(TIMEOUT);   // Better use this
![Page 8: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/8.jpg)
Timeouts (2)

    // Using the Java concurrent library
    Callable<MyActionResult> myAction = <My Blocking Action>
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<MyActionResult> future = executor.submit(myAction);
    MyActionResult result = null;
    try {
        result = future.get();                    // Do not use this by default
        result = future.get(TIMEOUT, TIMEUNIT);   // Better use this
    } catch (TimeoutException e) { // Only thrown if timeouts are used
        ...
    } catch (...) {
        ...
    }
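The `Future.get(timeout, unit)` call on this slide can be turned into a small runnable sketch. The class name `TimeoutDemo`, the helper `callWithTimeout` and the fallback-value convention are assumptions made for illustration; they are not part of the slides. The sketch also cancels the task and shuts the executor down, which the slide leaves out.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeoutDemo {

    // Runs the action on a worker thread and waits at most timeoutMillis for it.
    // Returns the fallback value if the action times out, fails or is interrupted.
    static String callWithTimeout(Callable<String> action, long timeoutMillis, String fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(action);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so it does not linger
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the action itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return fallback;
        } finally {
            executor.shutdownNow(); // one executor per call keeps the sketch short
        }
    }

    public static void main(String[] args) {
        // Hypothetical actions: one returns immediately, one blocks for 5 seconds.
        String fast = callWithTimeout(() -> "done", 1000, "timed out");
        String slow = callWithTimeout(() -> { Thread.sleep(5000); return "done"; }, 100, "timed out");
        System.out.println(fast);
        System.out.println(slow);
    }
}
```

Creating an executor per call is only acceptable in a sketch; a real service would probably share a bounded pool across calls.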
![Page 9: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/9.jpg)
Timeouts (3)

    // Using Guava SimpleTimeLimiter
    Callable<MyActionResult> myAction = <My Blocking Action>
    SimpleTimeLimiter limiter = new SimpleTimeLimiter();
    MyActionResult result = null;
    try {
        result = limiter.callWithTimeout(myAction, TIMEOUT, TIMEUNIT, false);
    } catch (UncheckedTimeoutException e) {
        ...
    } catch (...) {
        ...
    }
![Page 10: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/10.jpg)
• Determining Timeout Duration
• Configurable Timeouts
• Self-Adapting Timeouts
• Timeouts in JavaEE Containers
![Page 11: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/11.jpg)
Pattern #2
Circuit Breaker
![Page 12: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/12.jpg)
Circuit Breaker (1)
[Diagram: a Client sends a Request to a Resource through a Circuit Breaker. Lifecycle states: Closed, Open, Half-Open. Labels: "Resource available", "Resource unavailable".]
![Page 13: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/13.jpg)
Circuit Breaker (2)

Closed
• on call / pass through
• call succeeds / reset count
• call fails / count failure
• threshold reached / trip breaker

Open
• on call / fail
• on timeout / attempt reset

Half-Open
• on call / pass through
• call succeeds / reset
• call fails / trip breaker

Transitions
• Closed → Open: trip breaker
• Open → Half-Open: attempt reset
• Half-Open → Open: trip breaker
• Half-Open → Closed: reset

Source: M. Nygard, "Release It!"
![Page 14: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/14.jpg)
Circuit Breaker (3)

    public class CircuitBreaker implements MyResource {
        public enum State { CLOSED, OPEN, HALF_OPEN }

        final MyResource resource;
        State state;
        int counter;
        long tripTime;

        public CircuitBreaker(MyResource r) {
            resource = r;
            state = State.CLOSED;
            counter = 0;
            tripTime = 0L;
        }
        ...
![Page 15: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/15.jpg)
Circuit Breaker (4)

        ...
        public Result access(...) { // resource access
            Result r = null;
            if (state == State.OPEN) {
                checkTimeout(); // may switch to HALF_OPEN so that the next call passes through
                throw new ResourceUnavailableException();
            }
            try {
                r = resource.access(...); // should use timeout
            } catch (Exception e) {
                fail();
                throw e;
            }
            success();
            return r;
        }
        ...
![Page 16: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/16.jpg)
Circuit Breaker (5)

        ...
        private void success() {
            reset();
        }

        private void fail() {
            counter++;
            if (counter > THRESHOLD) {
                tripBreaker();
            }
        }

        private void reset() {
            state = State.CLOSED;
            counter = 0;
        }
        ...
![Page 17: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/17.jpg)
Circuit Breaker (6)

        ...
        private void tripBreaker() {
            state = State.OPEN;
            tripTime = System.currentTimeMillis();
        }

        private void checkTimeout() {
            if ((System.currentTimeMillis() - tripTime) > TIMEOUT) {
                state = State.HALF_OPEN;
                counter = THRESHOLD; // a single failure in HALF_OPEN trips the breaker again
            }
        }

        public State getState() {
            return state;
        }
    }
![Page 18: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/18.jpg)
• Thread-Safe Circuit Breaker
• Failure Types
• Tuning Circuit Breakers
• Available Implementations
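The slides only name a thread-safe circuit breaker as a follow-up topic. One possible variant is sketched below, under several assumptions: the class name `SimpleCircuitBreaker`, the `Callable`-based `call` method, the injected `LongSupplier` clock and the use of `IllegalStateException` as the rejection signal are all hypothetical choices, not taken from the deck. Unlike the slide code, this sketch lets the first call after the reset timeout pass through as the trial call instead of rejecting it.

```java
import java.util.concurrent.Callable;
import java.util.function.LongSupplier;

class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int threshold;            // failures tolerated before tripping
    private final long resetTimeoutMillis;  // how long to stay OPEN before a trial call
    private final LongSupplier clock;       // injectable time source (assumption, eases testing)

    private State state = State.CLOSED;
    private int failures = 0;
    private long tripTime = 0L;

    SimpleCircuitBreaker(int threshold, long resetTimeoutMillis, LongSupplier clock) {
        this.threshold = threshold;
        this.resetTimeoutMillis = resetTimeoutMillis;
        this.clock = clock;
    }

    <T> T call(Callable<T> action) throws Exception {
        beforeCall();
        T result;
        try {
            result = action.call(); // runs outside the lock so slow calls do not block other threads
        } catch (Exception e) {
            onFailure();
            throw e;
        }
        onSuccess();
        return result;
    }

    private synchronized void beforeCall() {
        if (state == State.OPEN) {
            if (clock.getAsLong() - tripTime <= resetTimeoutMillis) {
                throw new IllegalStateException("circuit breaker is open");
            }
            state = State.HALF_OPEN; // reset timeout elapsed: let a trial call through
        }
    }

    private synchronized void onSuccess() {
        state = State.CLOSED;
        failures = 0;
    }

    private synchronized void onFailure() {
        failures++;
        if (state == State.HALF_OPEN || failures > threshold) {
            state = State.OPEN;
            tripTime = clock.getAsLong();
        }
    }

    synchronized State getState() {
        return state;
    }
}
```

Only the state transitions are synchronized, so several threads may pass through while the breaker is half-open; limiting that to a single trial call would need extra bookkeeping.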
![Page 19: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/19.jpg)
Pattern #3
Fail Fast
![Page 20: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/20.jpg)
Fail Fast (1)
[Diagram: a Client sends a Request to an Expensive Action, which uses Resources.]
![Page 21: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/21.jpg)
Fail Fast (2)
[Diagram: a Client sends a Request to a Fail Fast Guard. The guard checks the availability of the Resources and forwards the request to the Expensive Action, which uses the Resources.]
![Page 22: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/22.jpg)
Fail Fast (3)

    public class FailFastGuard {
        private FailFastGuard() {}

        public static void checkResources(Set<CircuitBreaker> resources) {
            for (CircuitBreaker r : resources) {
                if (r.getState() != CircuitBreaker.State.CLOSED) {
                    throw new ResourceUnavailableException(r);
                }
            }
        }
    }
![Page 23: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/23.jpg)
Fail Fast (4)

    public class MyService {
        Set<CircuitBreaker> requiredResources;

        // Initialize resources
        ...

        public Result myExpensiveAction(...) {
            FailFastGuard.checkResources(requiredResources);
            // Execute core action
            ...
        }
    }
![Page 24: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/24.jpg)
The dreaded SiteTooSuccessfulException …
![Page 25: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/25.jpg)
Pattern #4
Shed Load
![Page 26: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/26.jpg)
Shed Load (1)
[Diagram: Clients send too many Requests to a Server.]
![Page 27: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/27.jpg)
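Shed Load (2)
[Diagram: Clients send too many Requests to a Gate Keeper in front of the Server. A Monitor monitors the server load, the Gate Keeper requests load data from the Monitor, passes Requests on to the Server and sheds the rest ("Shedded Requests").]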
![Page 28: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/28.jpg)
Shed Load (3)

    public class ShedLoadFilter implements Filter {
        Random random;

        public void init(FilterConfig fc) throws ServletException {
            random = new Random(System.currentTimeMillis());
        }

        public void destroy() {
            random = null;
        }
        ...
![Page 29: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/29.jpg)
Shed Load (4)

        ...
        public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
                throws java.io.IOException, ServletException {
            int load = getLoad();
            if (shouldShed(load)) {
                HttpServletResponse res = (HttpServletResponse) response;
                res.setIntHeader("Retry-After", RECOMMENDATION);
                res.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
                return;
            }
            chain.doFilter(request, response);
        }
        ...
![Page 30: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/30.jpg)
Shed Load (5)

        ...
        private boolean shouldShed(int load) { // Example implementation
            if (load < THRESHOLD) {
                return false;
            }
            double shedBoundary = ((double) (load - THRESHOLD)) / ((double) (MAX_LOAD - THRESHOLD));
            return random.nextDouble() < shedBoundary;
        }
    }
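To see what the linear ramp in `shouldShed` means in numbers, the shedding probability can be pulled out into a pure function. The class `ShedProbabilityDemo`, the method `shedProbability` and the example values (threshold 70, maximum load 100) are assumptions for illustration only. The probability is clamped to 1, which matches the behaviour of `random.nextDouble() < shedBoundary` for loads at or above the maximum.

```java
class ShedProbabilityDemo {

    // Probability with which a request is rejected: 0 below the threshold,
    // rising linearly to 1 at maxLoad, and 1 for anything above it.
    static double shedProbability(int load, int threshold, int maxLoad) {
        if (load < threshold) {
            return 0.0;
        }
        double p = ((double) (load - threshold)) / ((double) (maxLoad - threshold));
        return Math.min(1.0, p);
    }

    public static void main(String[] args) {
        // Hypothetical tuning: start shedding at 70% load, shed everything at 100%.
        for (int load = 60; load <= 100; load += 10) {
            System.out.println("load " + load + " -> shed probability " + shedProbability(load, 70, 100));
        }
    }
}
```

With these example values a load of 85 sheds half of the incoming requests, so the server degrades gradually instead of switching abruptly from "accept all" to "reject all".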
![Page 31: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/31.jpg)
Shed Load (6)
![Page 32: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/32.jpg)
Shed Load (7)
![Page 33: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/33.jpg)
• Shedding Strategy
• Retrieving Load
• Tuning Load Shedders
• Alternative Strategies
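The slides do not show how `getLoad()` obtains its value. One option, sketched here as an assumption rather than as the author's approach, is the system load average exposed by the JDK's `OperatingSystemMXBean`, scaled by the number of processors. The class `LoadProbe` and the helper `toPercent` are hypothetical names.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

class LoadProbe {

    // Returns an approximate load percentage, or -1 if the platform
    // does not report a load average.
    static int getLoad() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double loadAverage = os.getSystemLoadAverage(); // negative if not available
        if (loadAverage < 0) {
            return -1;
        }
        return toPercent(loadAverage, os.getAvailableProcessors());
    }

    // Scales a load average to a percentage of the available processors.
    static int toPercent(double loadAverage, int processors) {
        return (int) Math.round(100.0 * loadAverage / processors);
    }

    public static void main(String[] args) {
        System.out.println("current load: " + getLoad());
    }
}
```

The load average is a coarse, slowly changing signal and can exceed 100% after scaling. Application-level metrics such as the number of in-flight requests or queue length may be better suited, depending on the system.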
![Page 34: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/34.jpg)
Pattern #5
Deferrable Work
![Page 35: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/35.jpg)
Deferrable Work (1)
[Diagram: a Client sends Requests to Request Processing. Request Processing and Routine Work both use the same Resources.]
![Page 36: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/36.jpg)
Deferrable Work (2)
[Two charts, "Without Deferrable Work" and "With Deferrable Work", each showing the shares of Request Processing and Routine Work against a 100% line, with an OVERLOAD marker.]
![Page 37: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/37.jpg)
Deferrable Work (3)

    // Do or wait variant
    ProcessingState state = initBatch();
    while (!state.done()) {
        int load = getLoad();
        if (load > THRESHOLD) {
            waitFixedDuration();
        } else {
            state = processNext(state);
        }
    }

    void waitFixedDuration() {
        Thread.sleep(DELAY); // try-catch left out for better readability
    }
![Page 38: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/38.jpg)
Deferrable Work (4)

    // Adaptive load variant
    ProcessingState state = initBatch();
    while (!state.done()) {
        waitLoadBased();
        state = processNext(state);
    }

    void waitLoadBased() {
        int load = getLoad();
        long delay = calcDelay(load);
        Thread.sleep(delay); // try-catch left out for better readability
    }

    long calcDelay(int load) { // Simple example implementation
        if (load < THRESHOLD) {
            return 0L;
        }
        return (load - THRESHOLD) * DELAY_FACTOR;
    }
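The adaptive variant can be made runnable with a few assumptions. In this sketch the constants (threshold 50, delay factor 10 ms per load point), the class `DeferrableBatch` and the `IntSupplier` used as load source are all hypothetical; the slides leave `getLoad()`, `THRESHOLD` and `DELAY_FACTOR` undefined.

```java
import java.util.function.IntSupplier;

class DeferrableBatch {
    static final int THRESHOLD = 50;       // assumption: start delaying at 50% load
    static final long DELAY_FACTOR = 10L;  // assumption: 10 ms pause per load point above the threshold

    // Linear delay: no pause below the threshold, longer pauses as load rises.
    static long calcDelay(int load) {
        if (load < THRESHOLD) {
            return 0L;
        }
        return (load - THRESHOLD) * DELAY_FACTOR;
    }

    // Processes the given number of work units, pausing before each one
    // according to the current load. Returns the total pause time in ms.
    static long run(int units, IntSupplier loadSource) throws InterruptedException {
        long totalDelay = 0L;
        for (int i = 0; i < units; i++) {
            long delay = calcDelay(loadSource.getAsInt());
            Thread.sleep(delay);
            totalDelay += delay;
            // one unit of routine work would be processed here
        }
        return totalDelay;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical constant load of 52%: each unit waits 20 ms.
        System.out.println("total delay in ms: " + run(3, () -> 52));
    }
}
```

Compared with the do-or-wait variant, the batch keeps making progress under load, just more slowly, so it cannot starve completely during a long busy period.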
![Page 39: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/39.jpg)
• Delay Strategy
• Retrieving Load
• Tuning Deferrable Work
![Page 40: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/40.jpg)
I can hardly hear you …
![Page 41: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/41.jpg)
Pattern #6
Leaky Bucket
![Page 42: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/42.jpg)
Leaky Bucket (1)
[Diagram: a Leaky Bucket. Fill: whenever a problem occurred. Leak: periodically. Overflowed? If so, trigger Error Handling.]
![Page 43: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/43.jpg)
Leaky Bucket (2)

    public class LeakyBucket { // Very simple implementation
        private final int capacity;
        private int level;
        private boolean overflow;

        public LeakyBucket(int capacity) {
            this.capacity = capacity;
            drain();
        }

        public void drain() {
            this.level = 0;
            this.overflow = false;
        }
        ...
![Page 44: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/44.jpg)
Leaky Bucket (3)

        ...
        public void fill() {
            level++;
            if (level > capacity) {
                overflow = true;
            }
        }

        public void leak() {
            level--;
            if (level < 0) {
                level = 0;
            }
        }

        public boolean overflowed() {
            return overflow;
        }
    }
![Page 45: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/45.jpg)
• Thread-Safe Leaky Bucket
• Leaking strategies
• Tuning Leaky Bucket
• Available Implementations
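A thread-safe bucket with a periodic leak is only named on the slides. The following is one possible sketch, assuming plain `synchronized` methods and a `ScheduledExecutorService` as the leak timer. The class name `SynchronizedLeakyBucket`, the `level()` accessor and the `startLeaking` method are hypothetical additions.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class SynchronizedLeakyBucket {
    private final int capacity;
    private int level = 0;
    private boolean overflow = false;

    SynchronizedLeakyBucket(int capacity) {
        this.capacity = capacity;
    }

    // Called whenever a problem occurred.
    synchronized void fill() {
        level++;
        if (level > capacity) {
            overflow = true;
        }
    }

    // Called periodically; never drops below zero.
    synchronized void leak() {
        if (level > 0) {
            level--;
        }
    }

    // Empties the bucket and clears the overflow flag.
    synchronized void drain() {
        level = 0;
        overflow = false;
    }

    synchronized boolean overflowed() {
        return overflow;
    }

    synchronized int level() {
        return level;
    }

    // Starts a daemon timer that leaks one unit per period.
    // The caller is responsible for shutting the returned executor down.
    ScheduledExecutorService startLeaking(long periodMillis) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "leaky-bucket-timer");
            t.setDaemon(true);
            return t;
        });
        timer.scheduleAtFixedRate(this::leak, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        return timer;
    }
}
```

As in the slide version, the overflow flag stays set until `drain()` is called, so a caller can poll `overflowed()`, start error handling and then drain the bucket.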
![Page 46: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/46.jpg)
Pattern #7
Limited Retries
![Page 47: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/47.jpg)
Limited Retries (1)

    // doAction returns true if successful, false otherwise

    // General pattern
    boolean success = false;
    int tries = 0;
    while (!success && (tries < MAX_TRIES)) {
        success = doAction(...);
        tries++;
    }

    // Alternative one-retry-only variant
    success = doAction(...) || doAction(...);
![Page 48: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/48.jpg)
• Idempotent Actions
• Closures / Lambdas
• Tuning Retries
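With lambdas, the general retry loop can be written once and reused for any action. The sketch below assumes the same convention as the slide (the action returns true on success); the class `Retry` and the method `withLimit` are hypothetical names.

```java
import java.util.function.BooleanSupplier;

class Retry {

    // Calls the action until it reports success or maxTries attempts were made.
    // Returns true if one of the attempts succeeded.
    static boolean withLimit(int maxTries, BooleanSupplier action) {
        boolean success = false;
        int tries = 0;
        while (!success && tries < maxTries) {
            success = action.getAsBoolean();
            tries++;
        }
        return success;
    }

    public static void main(String[] args) {
        // Hypothetical action that succeeds on its third attempt.
        int[] attempts = {0};
        boolean ok = withLimit(5, () -> ++attempts[0] >= 3);
        System.out.println("success=" + ok + " after " + attempts[0] + " attempts");
    }
}
```

Retrying is only safe if the action is idempotent or the failure is known to have happened before any side effect took place.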
![Page 49: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/49.jpg)
More Patterns
• Complete Parameter Checking
• Marked Data
• Routine Audits
![Page 50: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/50.jpg)
Further reading
1. Michael T. Nygard, Release It!, Pragmatic Bookshelf, 2007
2. Robert S. Hanmer, Patterns for Fault Tolerant Software, Wiley, 2007
3. James Hamilton, On Designing and Deploying Internet-Scale Services, 21st LISA Conference, 2007
4. Andrew Tanenbaum, Maarten van Steen, Distributed Systems – Principles and Paradigms, Prentice Hall, 2nd Edition, 2006
![Page 51: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/51.jpg)
It's all about production!
![Page 52: Fault tolerance made easy](https://reader034.vdocument.in/reader034/viewer/2022052505/554fb08eb4c905ad218b5205/html5/thumbnails/52.jpg)