The Dark Art of Production Alerting

The Dark Art of Building a Production Incident System

@AloisReitbauer, www.ruxit.com

No broken cables

No data center fires

Other things can happen as well

Continuous deployments

Infrastructure changes

Other “everyday” stuff

Scaling an incident system

How it feels to do what we do

Do you alert?

Typical error rate of 3 percent at 10,000 transactions/min

During the night we now have 5 errors in 100 requests.

Do you alert?

Typical response time has been around 300 ms.

Now we see response times up to 600 ms.

We are good at fixing problems, but not really good at detecting them.

How can we get better?

It’s all about statistics

Statistics is about objectively lying to yourself in a meaningful way.

How to design an incident

How to calculate this value?

It looks really simple

Which metric to pick?

How to get this baseline?

How to define that this happened?

Which metrics to pick?

Three types of metrics

Capacity Metrics: Define how much of a resource is used.

Discrete Metrics: Simple countable things, like errors or users.

Continuous Metrics: Metrics represented by a range of values at any given time.

Capacity Metrics

Good for capacity planning, not so good for production alerting.

Connection Pools

Better use: connection acquisition time. It tells you whether anyone needed a connection and did not get it.

CPU Usage

Better use: a combination of load average and CPU usage. Even better: correlate them with the response times of applications.

Discrete Metrics

Pretty easy to track and analyze.

Continuous Metrics

Require some extra work as they are not that easy to track.

Continuous Metrics – The hope

Continuous Metrics – The reality

What the average tells us

What the median tells us
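The contrast between the two can be sketched with a few numbers. The response times below are hypothetical (they are not taken from the slides) and were chosen so that two slow outliers distort the average while the median stays close to what most users experience.

```python
from statistics import mean, median

# Hypothetical response times in ms: mostly fast, plus two slow outliers.
response_times_ms = [280, 290, 300, 300, 310, 320, 330, 340, 2500, 4000]

avg = mean(response_times_ms)    # pulled upwards by the two outliers
med = median(response_times_ms)  # middle of the sorted values

print(f"average: {avg:.0f} ms")
print(f"median:  {med:.0f} ms")
```

With this data the average lands near 900 ms although eight out of ten requests finished in under 350 ms, which is why the median is usually the more honest summary for skewed response times.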

How to get a baseline?

A baseline is not a number

Baselines define the range of a value combined with a probability

Normal distribution as baseline

Mean: 500 ms, Std. Dev.: 100 ms

68 %: 400 ms to 600 ms (μ ± 1σ)

95 %: 300 ms to 700 ms (μ ± 2σ)

99.7 %: 200 ms to 800 ms (μ ± 3σ)
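As a minimal sketch of how such a baseline could be computed, assuming response times really were normally distributed with the mean and standard deviation from the slide: the helper name `band` is made up for this illustration, and `NormalDist` comes from Python's standard library.

```python
from statistics import NormalDist

MEAN_MS = 500     # mean from the slide
STD_DEV_MS = 100  # standard deviation from the slide

baseline = NormalDist(mu=MEAN_MS, sigma=STD_DEV_MS)

def band(k):
    """Return (low, high, coverage) for the range mean +/- k standard deviations."""
    low = MEAN_MS - k * STD_DEV_MS
    high = MEAN_MS + k * STD_DEV_MS
    coverage = baseline.cdf(high) - baseline.cdf(low)
    return low, high, coverage

for k in (1, 2, 3):
    low, high, coverage = band(k)
    print(f"{low} ms to {high} ms covers {coverage:.1%} of expected values")
```

The three bands cover roughly 68, 95 and 99.7 percent of the values. This only holds if the normal model actually fits the data.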

This can go really wrong

“Why alerts suck and monitoring solutions need to become better”

How this leads to false alerts

Aggressive baseline: many false alerts

Moderate baseline: no alerts at all

Find the right distribution model

However, this can be really hard to impossible

Your distribution might look like this

… or like this

or completely different, you never know …

How can we solve this problem?

Normal distribution - again

50 percent of values are at or below μ

97.7 percent of values are at or below μ + 2σ

So μ corresponds to the median, and μ + 2σ to roughly the 97th percentile

The 50th and 90th percentile define normal behavior without needing to know anything about the distribution model
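A percentile baseline of this kind can be sketched directly from observed samples. The function name `percentile_baseline` and the sample history are assumptions made for this illustration. The interpolation method (`"inclusive"`) is one of several reasonable choices and not something the slides prescribe.

```python
from statistics import quantiles

def percentile_baseline(samples_ms):
    """Return the (50th, 90th) percentile of the observed response times.

    quantiles(..., n=100) returns the 99 cut points between percentiles,
    so index 49 is the 50th percentile and index 89 the 90th.
    """
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[89]

# Hypothetical history of response times in ms (not from the slides).
history_ms = [150, 200, 200, 250, 300, 300, 350, 350, 400, 400, 500, 600]

p50, p90 = percentile_baseline(history_ms)
print(f"median: {p50:.0f} ms, 90th percentile: {p90:.0f} ms")
```

No distribution model is fitted anywhere. The two numbers come straight from the ordered samples, which is the point the slide makes.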

Median shows the real problem

How to define non-normal behavior?

Fortunately, this is not the problem we need to solve

We are only talking about missed expectations

Let’s look at two scenarios

Errors

Is a certain error rate likely to happen or not?

Response Times

Is a certain increase in response time significant

enough to trigger an incident?

The error rate scenario

We have a typical error rate of 3 percent at 10,000 transactions/minute

During the night we now have 5 errors in 100 requests. Should we alert – or not?

What can we learn

Statistics is everywhere

Binomial Distribution

Tells us how likely it is to see n successes in a certain number of trials

How many errors are ok?

[Chart: likelihood of at least n errors, for n = 1 to 19]

There is an 18 % probability of seeing 5 or more errors in 100 requests, which is within two standard deviations of the expected value. We do not alert.
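The figure can be reproduced with a short calculation. This is a minimal sketch, assuming each request fails independently with the typical 3 percent error rate. The helper `prob_at_least` and the 5 percent cut-off `ALERT_THRESHOLD` are illustrative choices, not values from the slides.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): k or more errors in n requests."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

TYPICAL_ERROR_RATE = 0.03  # 3 percent, from the scenario
ALERT_THRESHOLD = 0.05     # assumed cut-off: alert only on unlikely outcomes

p_tail = prob_at_least(5, 100, TYPICAL_ERROR_RATE)
should_alert = p_tail < ALERT_THRESHOLD

print(f"P(at least 5 errors in 100 requests) = {p_tail:.1%}")
print("alert" if should_alert else "no alert")
```

Seeing 5 errors in 100 requests is about an 18 percent event under the normal error rate, so the observation gives no good reason to open an incident.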

Response Time Example

Our median response time is 300 ms and we measure

200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms

Percentile Drift Detection

Did the median drift significantly?

Check all values above 300 ms

200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms

7 values are higher than the median. Is this normal?

We can again use the Binomial Distribution

Applying the Binomial Distribution

Each sample has a 50 percent likelihood of being above the median.

How likely is it that 7 or more out of 10 samples are higher?

The probability is 17 percent, so we should not alert.
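The drift check can be sketched as a small function over the ten measured values. The function name `median_drifted` and the 5 percent significance level are assumptions made for this illustration. The samples and the 300 ms baseline median are the ones from the example.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def median_drifted(samples_ms, baseline_median_ms, significance=0.05):
    """Count samples above the baseline median and test whether that count is unusual.

    If the median has not moved, each sample lies above it with probability 0.5.
    """
    above = sum(1 for s in samples_ms if s > baseline_median_ms)
    p_value = prob_at_least(above, len(samples_ms), 0.5)
    return above, p_value, p_value < significance

samples_ms = [200, 400, 350, 200, 600, 500, 150, 350, 400, 600]
above, p_value, alert = median_drifted(samples_ms, 300)

print(f"{above} of {len(samples_ms)} samples above the median, p = {p_value:.1%}")
print("alert" if alert else "no alert")
```

Seven or more of ten samples above the median happens about 17 percent of the time by chance alone, so the check does not fire. With a larger sample window the same test becomes more sensitive to real drift.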

How to calculate this value?

… and we are done!

Which metric to pick?

How to get this baseline?

How to define that this happened?

This was just the beginning

There are many more useful things to learn about statistics, probabilities, testing, …

Alois Reitbauer

alois.reitbauer@ruxit.com

@AloisReitbauer

http://bit.ly/nycwebperferf

Image Credits

http://commons.wikimedia.org/wiki/File:Network_switches.jpg

http://commons.wikimedia.org/wiki/File:Wheelock_mt.jpg

http://commons.wikimedia.org/wiki/File:Fire-lite-bg-10.jpg

http://commons.wikimedia.org/wiki/File:Estacaobras.jpg

http://commons.wikimedia.org/wiki/File:Speedo_angle.jpg

http://commons.wikimedia.org/wiki/File:WelcomeToVegasNite.JPG

http://commons.wikimedia.org/wiki/File:Dice_02138.JPG

http://commons.wikimedia.org/wiki/File:Teadlased_j%C3%A4%C3%A4l.jpg
