how we sleep well at night using hystrix at finn.no

Hystrix - What did we learn?

JavaZone September 2015

Hystrix cristata

Audun Fauchald Strand & Henning Spjelkavik

public int lookup(MapPoint p) {
    return altitude(p);
}

Example

public int lookup(MapPoint p) {
    return new LookupCommand(p).execute();
}

private class LookupCommand extends HystrixCommand<Integer> {
    final MapPoint p;

    LookupCommand(MapPoint p) {
        super(HystrixCommandGroupKey.Factory.asKey("NorkartAltitude"));
        this.p = p;
    }

    @Override
    protected Integer run() throws Exception {
        return altitude(p);
    }

    @Override
    protected Integer getFallback() {
        return -1;
    }
}

Example

Audun Fauchald Strand (@audunstrand)

Henning Spjelkavik (@spjelkavik)

Agenda
- Why?
- Tolerance for failure - How?
- How to create a Hystrix Command
- Monitoring and Dashboard
- Examples from finn
- What did we learn

Service A calls Service B

Map calls User over the network
What can possibly go wrong?

Map calls User
What can possibly go wrong?
1. Connection refused
2. Slow answer
3. Very slow answer (= never)
4. The result causes an exception in the client library

Map calls User
What can possibly go wrong?
1. Connection refused => < 2 ms
2. Slow answer => 5 s
3. Very slow answer => timeout
4. The result causes an exception in the client library => depends

Fails quickly

May kill both the server and the client

Map calls User
Let’s assume:
- Thread per request
- Response time - 4 s
- Map has 60 req/s. Fan-out to User is 2 => 120 req/s
- 240 / 480 threads blocking

mobilewebN has 130 req/s
Let’s assume:
- Thread per request
- RandomApp has 130 req/s. Fan-out to service is 2 => 260 req/s
- 520 / 1040 threads blocking
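The thread counts on these two slides follow from multiplying the request rate by the time each request blocks. A minimal sketch, assuming the 4 s response time applies to both applications and reading each pair of figures as "without fan-out / with fan-out of 2" (the slides do not spell this out; the class and method names are illustrative only):

```java
/** Back-of-the-envelope estimate of blocked threads in a thread-per-request app. */
class BlockedThreads {

    /** Threads held at any moment = arrival rate (req/s) x time each request blocks (s). */
    static long blocked(double requestsPerSecond, double responseTimeSeconds) {
        return Math.round(requestsPerSecond * responseTimeSeconds);
    }

    public static void main(String[] args) {
        double responseTime = 4.0; // seconds, from the slide
        int fanOut = 2;            // downstream calls per incoming request

        System.out.println("Map:       " + blocked(60, responseTime)
                + " / " + blocked(60 * fanOut, responseTime));
        System.out.println("RandomApp: " + blocked(130, responseTime)
                + " / " + blocked(130 * fanOut, responseTime));
    }
}
```

Under that reading the sketch reproduces the 240 / 480 and 520 / 1040 figures from the slides.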

What happens in an app with 500 blocking threads?

Not much. Besides waiting. CPU is idle.
If maximum-threads == 500 => no more connections are allowed
And what about 1040 occupied threads?

And where is the user after 8 s?
At YouTube, Facebook or searching for cute kittens.

The problem we try to solve

An application with 30 dependent services - with 99.99% uptime for each service

99.99%^30 = 99.7% uptime

0.3% of 1 billion requests = 3,000,000 failures

2+ hours downtime/month even if all dependencies have excellent uptime.

98%^30 = 54% uptime

99.99% uptime = about 8 s of downtime per day; 99.7% = about 4 min per day
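The arithmetic on this slide can be checked with a few lines of Java. This is a sketch that assumes the dependencies fail independently and that a month has 30 days; the class and method names are made up for the example:

```java
/** Compound availability of a system that needs all of its dependencies to be up. */
class CompoundUptime {

    /** Availability when every one of n dependencies must be up (independent failures assumed). */
    static double combined(double perServiceAvailability, int dependencies) {
        return Math.pow(perServiceAvailability, dependencies);
    }

    /** Expected downtime in seconds over a period, for a given availability. */
    static double downtimeSeconds(double availability, double periodSeconds) {
        return (1.0 - availability) * periodSeconds;
    }

    public static void main(String[] args) {
        double day = 24 * 60 * 60;
        double month = 30 * day; // 30-day month, an assumption of this sketch

        double good = combined(0.9999, 30);
        double poor = combined(0.98, 30);

        System.out.printf("99.99%%^30 = %.2f%%%n", good * 100);
        System.out.printf("98%%^30    = %.1f%%%n", poor * 100);
        System.out.printf("downtime per day at 99.99%%: %.1f s%n", downtimeSeconds(0.9999, day));
        System.out.printf("downtime per month at the combined uptime: %.1f h%n",
                downtimeSeconds(good, month) / 3600);
    }
}
```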


Control over latency and failure from dependencies

Stop cascading failures in a complex distributed system.

Fail fast and rapidly recover.

Fallback and gracefully degrade when possible.

Enable near real-time monitoring, alerting

What is Hystrix for?

- Fail fast - don’t let the user wait!
- Circuit breaker - don’t bother, it’s already down
- Fallback - can you give a sensible default, show stale data?
- Bulkhead - protect yourself against cascading failure

Principles
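To make the circuit-breaker and fallback principles concrete, here is a minimal sketch in plain Java. It is not Hystrix's implementation: Hystrix trips on an error percentage over a rolling window, while this simplified version trips after a number of consecutive failures. The class name, the threshold and the sleep window are assumptions chosen for illustration.

```java
import java.util.function.Supplier;

/** Simplified circuit breaker: opens after N consecutive failures, retries after a sleep window. */
class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long sleepWindowMillis;
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAt = 0;

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
    }

    /** Runs the action unless the circuit is open; any failure is answered by the fallback. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get();      // short-circuit: the dependency is not called at all
        }
        try {
            T result = action.get();    // runs outside the lock so calls are not serialized
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback.get();
        }
    }

    private synchronized boolean allowRequest() {
        // Closed: always allow. Open: allow trial requests again once the sleep window has passed.
        return !open || System.currentTimeMillis() - openedAt >= sleepWindowMillis;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
        open = false;
    }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAt = System.currentTimeMillis();
        }
    }

    synchronized boolean isOpen() {
        return open;
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(3, 5_000);
        Supplier<String> failing = () -> { throw new IllegalStateException("service down"); };
        for (int i = 0; i < 5; i++) {
            String result = breaker.call(failing, () -> "stale data");
            System.out.println(result + " (open=" + breaker.isOpen() + ")");
        }
    }
}
```

Once the breaker is open, callers get the fallback immediately and the failing dependency is left alone until the sleep window has passed.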

Prevent any single dependency from using up all threads

Shedding load and failing fast instead of queueing

Providing fallbacks wherever feasible

Using isolation techniques (such as bulkhead, swimlane, and circuit breaker patterns) to limit the impact of any one dependency.

Two different ways of isolation

Semaphore
- “at most 5 concurrent calls”
- only for CPU-intensive, local calls

Thread pool (dedicated couriers)
- the call to the underlying service is handled by a pool
- overhead is usually not problematic
- default approach
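The difference between the two isolation styles can be illustrated with the standard library alone. The sketch below is not how Hystrix is implemented or configured; it only shows the idea, and the class name, method names, limits and timeouts are assumptions made for the example:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

class IsolationSketch {

    /** Semaphore isolation: runs on the caller's thread and rejects when the limit is reached. */
    static <T> T withSemaphore(Semaphore permits, Supplier<T> action, T fallback) {
        if (!permits.tryAcquire()) {
            return fallback;                 // shed load instead of queueing
        }
        try {
            return action.get();
        } finally {
            permits.release();
        }
    }

    /** Thread-pool isolation: runs on a dedicated pool, so the caller can give up after a timeout. */
    static <T> T withThreadPool(ExecutorService pool, Callable<T> action,
                                long timeoutMillis, T fallback) {
        Future<T> future;
        try {
            future = pool.submit(action);
        } catch (RejectedExecutionException e) {
            return fallback;                 // pool is saturated or shut down
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);             // interrupt the worker; the caller is free again
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                 // the action itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) {
        Semaphore permits = new Semaphore(5); // "at most 5 concurrent calls"
        System.out.println(withSemaphore(permits, () -> 42, -1));

        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // The slow call exceeds the 50 ms timeout, so the fallback is returned.
            System.out.println(withThreadPool(pool, () -> { Thread.sleep(500); return 42; }, 50, -1));
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The semaphore variant cannot time out a call because the work runs on the caller's own thread. That matches the slide's advice to use it only for local, CPU-bound calls.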

Recommended book: Release it!

Dependencies
Depends on
- rxjava
- archaius (& commons-configuration)

FINN uses Constretto for configuration management, hence:

https://github.com/finn-no/archaius-constretto

Dependencies
There are useful addons:
- hystrix-metrics-event-stream - json/http stream
- hystrix-codahale-metrics-publisher (currently io.dropwizard.metrics)

(Follows the recent trend of really splitting up the dependencies - include only what you need)

Default properties
Quite sensible, “fail fast”
Do your own calculations of
- number of concurrent requests
- timeouts (99.8 percentile)
...by looking at your current performance (latency) per request and adding a little buffer

threads = requests per second at peak when healthy × 99th percentile latency in seconds + some breathing room
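The sizing rule can be written as a small helper. This is only a sketch: the numbers in `main` are hypothetical and not measurements from FINN, and how much breathing room to add is a judgment call the slides leave open.

```java
/** Thread pool sizing rule of thumb from the slide. */
class ThreadPoolSizing {

    /** threads = peak requests per second (when healthy) x 99th percentile latency (s) + breathing room */
    static int poolSize(double peakRequestsPerSecond, double p99LatencySeconds, int breathingRoom) {
        return (int) Math.ceil(peakRequestsPerSecond * p99LatencySeconds) + breathingRoom;
    }

    public static void main(String[] args) {
        // Hypothetical dependency: 40 req/s at peak, p99 latency of 250 ms, 5 spare threads.
        System.out.println("pool size: " + poolSize(40, 0.25, 5));
    }
}
```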

Hystrix - part of NetflixOSS
Netflix OSS
- Hystrix - resilience
- Ribbon - remote calls
- Feign - REST client
- Eureka - Service discovery
- Archaius - Configuration
- Karyon - Starting point

Hystrix at FINN.no


How to create a Hystrix Command
A command class wrapping the “risky” operation.
- must implement run()
- might implement getFallback()

Since version 1.4, an Observable implementation is also available

public int lookup(MapPoint p) {
    return altitude(p);
}

AltitudeSearch - before

private class LookupCommand extends HystrixCommand<Integer> {
    final MapPoint p;

    LookupCommand(MapPoint p) {
        super(HystrixCommandGroupKey.Factory.asKey("NorkartAltitude"));
        this.p = p;
    }

    @Override
    protected Integer run() throws Exception {
        return altitude(p);
    }
}

AltitudeSearch - after

FAQ
Does that mean I have to write a command for (almost) every remote operation in my application?

YES!

Why is it so intrusive?

But Why?

Hystrix-Javanica

@HystrixCommand(fallbackMethod = "defaultUser",
                ignoreExceptions = {BadRequestException.class})
public User getUserById(String id) { }

private User defaultUser(String id) { }

Concurrency - The client decides

T = c.execute()                 synchronous
Future<T> = c.queue()           asynchronous
Observable<T> = c.observe()     reactive streams

Runtime behaviour


Metrics
- Circuit breaker open?
- Calls per second
- Execution time? Median, 90th, 95th and 99th percentile
- Status of thread pool?
- Number of clients in cluster

Publishing the metrics
- Servo - Netflix metrics library
- CodaHale/Yammer/dropwizard - metrics

HystrixPlugins.getInstance().registerMetricsPublisher(HystrixMetricsPublisher impl)

Dashboard toolset

hystrix-metrics-event-stream
- out of the box: servlet
- we use embedded Jetty for Thrift services

turbine-web
- aggregates metrics-event-stream into clusters

hystrix-dashboard
- graphical interface

Dashboard

More Details

Thread Pools

Details


Examples from Finn - Code

- Altitude search
- Fetch several profiles using collapsing
- Operations

private class LookupCommand extends HystrixCommand<Integer> {
    final MapPoint p;

    LookupCommand(MapPoint p) {
        super(HystrixCommandGroupKey.Factory.asKey("NorkartAltitude"));
        this.p = p;
    }

    @Override
    protected Integer run() throws Exception {
        return altitude(p);
    }

    @Override
    protected Integer getFallback() {
        return -1;
    }
}

AltitudeSearch

Migrating a library
- Create commands
- Wrap commands with existing services
- Backwards compatible
- No flexibility

- Fetch a map point
- Fetch several profiles using collapsing
- Operations

Request Collapsing

Fetch one profile takes 10ms

Lots of concurrent requests

Better to fetch multiple profiles

Request Collapsing - why

decouples client model from server interface

reduces network overhead

client container/thread batches requests

Request Collapsing
create two commands

Collapser
- one new() per client request

BatchCommand
- one new() per server request

Request Collapsing
Integrate two commands in two methods

createCommand()
- Create BatchCommand from a list of single commands

mapResponseToRequests()
- Map the list response to single responses

Create Collapser

public Collapser(Query query) {
    this.query = query;
}

Create BatchCommand

@Override
protected HystrixCommand<Map<Query, Profile>> createCommand(Collection<Request> collapsedRequests) {
    return new BatchCommand(collapsedRequests, client);
}

mapResponseToRequests

@Override
protected void mapResponseToRequests(Map<Query, Profile> batchResponse,
                                     Collection<Request> collapsedRequests) {
    // Graceful degradation: a request missing from the batch response gets a default profile.
    collapsedRequests.forEach(c -> c.setResponse(
            batchResponse.getOrDefault(c.getArgument(), new ImmutableProfile(id))));
}

Request Collapsing - experiences
- Each individual request will be slower for the client, is that ok?
- 10 ms operation into 100 ms window
- Max 110 ms for client
- Average 60 ms
- Read documentation first!!
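The collapsing mechanics from the preceding slides (one object per client request, one batch per server request, the batch response mapped back to each request with a default for missing entries) can be sketched without Hystrix. This is a plain-Java illustration, not the HystrixCollapser API. All names are hypothetical, and `flush()` is called by hand where a real collapser would use a timer at the end of each window.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

/** Plain-Java illustration of request collapsing; not the Hystrix API. */
class CollapserSketch<K, V> {

    /** One pending client request: its argument and the future the client waits on. */
    private static final class Pending<K, V> {
        final K argument;
        final CompletableFuture<V> response = new CompletableFuture<>();
        Pending(K argument) { this.argument = argument; }
    }

    private final List<Pending<K, V>> pending = new ArrayList<>();
    private final Function<Set<K>, Map<K, V>> batchCall;
    private final Function<K, V> defaultValue;

    CollapserSketch(Function<Set<K>, Map<K, V>> batchCall, Function<K, V> defaultValue) {
        this.batchCall = batchCall;
        this.defaultValue = defaultValue;
    }

    /** Called once per client request: queue the argument and hand back a future. */
    synchronized CompletableFuture<V> submit(K argument) {
        Pending<K, V> p = new Pending<>(argument);
        pending.add(p);
        return p.response;
    }

    /** Called once per window: one server call, then map the batch response to each request. */
    void flush() {
        List<Pending<K, V>> batch;
        synchronized (this) {
            batch = new ArrayList<>(pending);
            pending.clear();
        }
        if (batch.isEmpty()) {
            return;
        }
        Set<K> arguments = new LinkedHashSet<>();
        for (Pending<K, V> p : batch) {
            arguments.add(p.argument);
        }
        Map<K, V> batchResponse = batchCall.apply(arguments);
        for (Pending<K, V> p : batch) {
            V value = batchResponse.get(p.argument);
            // Graceful degradation: fall back to a default when the batch has no entry.
            p.response.complete(value != null ? value : defaultValue.apply(p.argument));
        }
    }

    public static void main(String[] args) {
        CollapserSketch<Integer, String> collapser = new CollapserSketch<Integer, String>(
                ids -> {
                    Map<Integer, String> found = new HashMap<>();
                    for (Integer id : ids) {
                        if (id != 3) {      // pretend profile 3 is missing on the server
                            found.put(id, "profile-" + id);
                        }
                    }
                    return found;
                },
                id -> "unknown-" + id);

        CompletableFuture<String> a = collapser.submit(1);
        CompletableFuture<String> b = collapser.submit(3);
        collapser.flush();                  // a timer would do this at the end of each window
        System.out.println(a.join() + ", " + b.join());
    }
}
```

The latency figures on the slide come from the same structure. A request arriving at the start of a 100 ms window waits the whole window plus the 10 ms operation (110 ms), and an average request waits half the window plus the operation (60 ms). The sketch does not handle a failing batch call, which would leave the futures incomplete.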

- Fetch a map point
- Fetch several profiles using collapsing
- Operations

Example from Finn - Operations

[2015-06-31T13:37:00,485][ERROR] Forwarding to error page from request due to exception [AdCommand short-circuited and no fallback available.]
com.netflix.hystrix.exception.HystrixRuntimeException: RecommendMoreLikeThisCommand short-circuited and no fallback available.
    at com.netflix.hystrix.AbstractCommand$16.call(AbstractCommand.java:811)

Error happens in production
- Operations gets paged with lots of error messages in logs
- They read the logs
- Lots of [ERROR]
- They restart the application

Learnings - operations
- Error messages mean different things with Hystrix
- What they say, not where they occur
- Built-in error recovery with circuit breaker
- Operations reads logs, not the Hystrix dashboard
- Lots of unnecessary restarts

Conclusions

What did we learn

Experiences from Finn

Hystrix belongs client-side

Nested Hystrix commands are ok

Graceful degradation is a big change in mindset

Little use of proper fallback-values

Tried putting hystrix in low-level http client without great success.

Server side errors are detected clientside

Not all exceptions are errors.

RxJava needs a full rewrite… Still useful without!

Experiences from FINN
Hystrix standardises things we did before:
- Nitty gritty http-client stuff
- Timeouts
- Connection pools
- Tuning thread pools
- Dashboards
- Metrics

Wrap up
Should you start using Hystrix?
- Bulkhead and circuit-breaker - explicit timeout and error handling is useful
- Dashboards

Further reading
Ben Christensen, GOTO Aarhus 2013 - https://www.youtube.com/watch?v=_t06LRX0DV0
Updated for QConSF2014; https://qconsf.com/system/files/presentation-slides/ReactiveProgrammingWithRx-QConSF-2014.pdf

Thanks for listening! audun.fauchald.strand@finn.no & henning.spjelkavik@finn.no

Questions?
