![Page 1: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/1.jpg)
The Dark Art of Building a Production Incident System
@Alois Reitbauer
Tech. Evangelist & Product Mgr., Compuware
![Page 2: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/2.jpg)
No broken cables
![Page 3: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/3.jpg)
No data center fires
![Page 4: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/4.jpg)
Other things can happen as well
Continuous deployments
Infrastructure changes
Other “everyday” stuff
![Page 5: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/5.jpg)
Scaling an incident system
![Page 6: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/6.jpg)
How it feels to do what we do
![Page 7: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/7.jpg)
Do you alert?
Typical error rate of 3 percent at 10,000 transactions/min
During the night we now have 5 errors in 100 requests.
![Page 8: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/8.jpg)
Do you alert?
Typical response time has been around 300 ms.
Now we see response times up to 600 ms.
![Page 9: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/9.jpg)
We are good at fixing problems, but not really good at detecting them.
![Page 10: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/10.jpg)
How can we get better?
![Page 11: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/11.jpg)
It’s all about statistics
![Page 12: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/12.jpg)
Statistics is about objectively lying to yourself in a meaningful way.
![Page 13: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/13.jpg)
How to design an incident
![Page 14: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/14.jpg)
It looks really simple
Which metric to pick?
How to get this baseline?
How to define that this happened?
How to calculate this value?
![Page 15: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/15.jpg)
Which metrics to pick?
![Page 16: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/16.jpg)
Three types of metrics
Capacity Metrics: Define how much of a resource is used.
Discrete Metrics: Simple countable things, like errors or users.
Continuous Metrics: Metrics represented by a range of values at any given time.
![Page 17: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/17.jpg)
Capacity Metrics
Good for capacity planning, not so good for production alerting
![Page 18: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/18.jpg)
Connection Pools
![Page 19: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/19.jpg)
Better use
Connection acquisition time
Tells you whether anyone needed a connection and did not get it.
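As a rough illustration of what measuring connection acquisition time could look like, here is a minimal sketch. The `TimedPool` class, its method names, and the timeout values are hypothetical and not taken from the talk or from any real pooling library; the only idea it demonstrates is timing how long a caller waits for a connection.

```python
import queue
import time


class TimedPool:
    """Hypothetical minimal pool that records how long each acquire() waited."""

    def __init__(self, connections):
        self._idle = queue.Queue()
        for conn in connections:
            self._idle.put(conn)
        self.wait_times_s = []  # one entry per acquire attempt, in seconds

    def acquire(self, timeout_s=1.0):
        start = time.monotonic()
        try:
            return self._idle.get(timeout=timeout_s)
        except queue.Empty:
            return None  # the caller needed a connection and did not get one
        finally:
            # Record the wait whether or not a connection was handed out.
            self.wait_times_s.append(time.monotonic() - start)

    def release(self, conn):
        self._idle.put(conn)


pool = TimedPool(["conn-1"])
first = pool.acquire()                 # served immediately
second = pool.acquire(timeout_s=0.05)  # pool is exhausted, so this waits and fails
print(pool.wait_times_s)
```

A pool that is 100 percent in use but never makes anyone wait is healthy; a growing acquisition time is what signals that callers are actually being held up.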
![Page 20: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/20.jpg)
CPU Usage
![Page 21: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/21.jpg)
Better use
Combination of load average and CPU usage
Even better: correlate them with the response times of applications
![Page 22: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/22.jpg)
Discrete Metrics
Pretty easy to track and analyze.
![Page 23: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/23.jpg)
Continuous Metrics
Require some extra work as they are not that easy to track.
![Page 24: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/24.jpg)
Continuous Metrics – The hope
42
![Page 25: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/25.jpg)
Continuous Metrics – The reality
![Page 26: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/26.jpg)
What the average tells us
![Page 27: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/27.jpg)
What the median tells us
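To make the difference between the two concrete, here is a small sketch with made-up response times (the numbers are illustrative, not from the talk). Two slow outliers are enough to pull the average far away from what most requests experience, while the median barely moves.

```python
import statistics

# Hypothetical response times in milliseconds: mostly fast, with two slow outliers.
response_times_ms = [120, 130, 125, 140, 135, 128, 132, 138, 2500, 3100]

mean_ms = statistics.mean(response_times_ms)
median_ms = statistics.median(response_times_ms)

print(f"mean:   {mean_ms:.1f} ms")
print(f"median: {median_ms:.1f} ms")
```

In this sample no single request took anywhere near the average, which is why the average alone is a poor description of a continuous metric.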
![Page 28: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/28.jpg)
How to get a baseline?
![Page 29: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/29.jpg)
A baseline is not a number
Baselines define the range of a value combined with a probability
![Page 30: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/30.jpg)
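Normal distribution as baseline
Mean: 500 ms, Std. Dev.: 100 ms
68 %: 400 ms to 600 ms (mean ± 1 std. dev.)
95 %: 300 ms to 700 ms (mean ± 2 std. dev.)
99.7 %: 200 ms to 800 ms (mean ± 3 std. dev.)
[Chart: bell curve over response times from 100 ms to 900 ms]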
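The ranges on this slide can be reproduced with a few lines of Python. This is only a sketch that assumes the response times really are normally distributed with the mean and standard deviation given above; the helper name `band` is made up for illustration.

```python
from statistics import NormalDist

# Baseline from the slide: mean 500 ms, standard deviation 100 ms.
baseline = NormalDist(mu=500, sigma=100)


def band(k):
    """Return (low, high, probability) for the range mean ± k standard deviations."""
    low = baseline.mean - k * baseline.stdev
    high = baseline.mean + k * baseline.stdev
    probability = baseline.cdf(high) - baseline.cdf(low)
    return low, high, probability


for k in (1, 2, 3):
    low, high, probability = band(k)
    print(f"{low:.0f} ms to {high:.0f} ms covers {probability:.1%} of requests")
```

Each range comes with a probability, which is the sense in which a baseline is a range plus a likelihood rather than a single number.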
![Page 31: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/31.jpg)
This can go really wrong
“Why alerts suck and monitoring solutions need to become better”
![Page 32: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/32.jpg)
How this leads to false alerts
![Page 33: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/33.jpg)
Many false alerts
Aggressive Baseline
![Page 34: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/34.jpg)
No alerts at all
Moderate Baseline
![Page 35: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/35.jpg)
Find the right distribution model
However, this can be really hard or even impossible
![Page 36: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/36.jpg)
Your distribution might look like this
![Page 37: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/37.jpg)
… or like this
![Page 38: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/38.jpg)
or completely different
you never know …
![Page 39: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/39.jpg)
How can we solve this problem?
![Page 40: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/40.jpg)
Normal distribution - again
50 percent of requests are faster than μ: this is the median
About 97.7 percent of requests are faster than μ + 2σ: this is roughly the 97th percentile
![Page 41: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/41.jpg)
The 50th and 90th percentile define normal behavior without needing to know anything about the distribution model
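A minimal sketch of computing these two percentiles from raw samples, with no distribution model involved. The sample values are invented for illustration, and the choice of `method="inclusive"` is an assumption; other interpolation methods give slightly different cut points.

```python
import statistics

# Hypothetical, skewed sample of response times in milliseconds.
samples_ms = [110, 120, 125, 130, 135, 140, 150, 160, 180, 200,
              220, 250, 300, 350, 400, 500, 700, 900, 1500, 3000]

# quantiles(n=100) returns the 99 cut points between percentiles:
# index 49 is the 50th percentile, index 89 is the 90th percentile.
cut_points = statistics.quantiles(samples_ms, n=100, method="inclusive")
p50 = cut_points[49]
p90 = cut_points[89]

print(f"50th percentile: {p50:.0f} ms")
print(f"90th percentile: {p90:.0f} ms")
```

The same two lines work whether the data is bell-shaped, long-tailed, or bimodal, which is the point of using percentiles as the baseline.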
![Page 42: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/42.jpg)
Median shows the real problem
![Page 43: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/43.jpg)
How to define non-normal behavior?
![Page 44: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/44.jpg)
Fortunately this is not the problem we need to solve
We are only talking about missed expectations
![Page 45: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/45.jpg)
Let’s look at two scenarios
Errors
Is a certain error rate likely to happen or not?
Response Times
Is a certain increase in response time significant
enough to trigger an incident?
![Page 46: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/46.jpg)
The error rate scenario
We have a typical error rate of 3 percent at 10,000 transactions/minute
During the night we now have 5 errors in 100 requests. Should we alert – or not?
![Page 47: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/47.jpg)
What can we learn
![Page 48: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/48.jpg)
Statistics is everywhere
![Page 49: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/49.jpg)
Binomial Distribution
Tells us how likely it is to see n successes in a certain number of trials
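For reference, the standard textbook form of the binomial distribution is given here. The symbols are assumptions of this note, not notation from the slides: $n$ is the number of trials, $p$ the probability of a "success" in a single trial (here, an error), and $X$ the number of successes observed.

$$
P(X = k) = \binom{n}{k}\, p^{k} (1 - p)^{n - k},
\qquad
P(X \ge k) = \sum_{i = k}^{n} \binom{n}{i}\, p^{i} (1 - p)^{n - i}
$$

The second expression, the probability of seeing at least $k$ successes, is the quantity that matters for an alerting decision.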
![Page 50: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/50.jpg)
How many errors are ok?
[Chart: likelihood of at least n errors, for n = 1 to 19]
There is an 18 % probability of seeing 5 or more errors in 100 requests at a 3 percent error rate. That is still within 2 standard deviations of what we expect, so we do not alert.
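The 18 percent figure can be checked with a short calculation. This is a sketch under the assumption that errors are independent and occur at the baseline rate of 3 percent; the function name `prob_at_least` and the 5 percent alert threshold are choices made for this example, not values from the talk.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed as 1 - P(X <= k - 1)."""
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


baseline_error_rate = 0.03  # typical error rate from the slide
requests = 100              # requests observed during the night
errors = 5                  # errors observed in those requests

p_value = prob_at_least(errors, requests, baseline_error_rate)

ALERT_THRESHOLD = 0.05      # assumed cutoff, not from the talk
should_alert = p_value < ALERT_THRESHOLD

print(f"P(at least {errors} errors) = {p_value:.1%}, alert: {should_alert}")
```

Because seeing 5 or more errors is expected in roughly one out of every five or six such samples, treating it as an incident would mostly produce false alerts.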
![Page 51: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/51.jpg)
Response Time Example
Our median response time is 300 ms
and we measure
200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms
![Page 52: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/52.jpg)
Percentile Drift Detection
![Page 53: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/53.jpg)
Did the median drift significantly?
Check all values above 300 ms
200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms
7 values are higher than the median. Is this normal?
We can again use the Binomial Distribution
![Page 54: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/54.jpg)
Applying the Binomial Distribution
Each sample has a 50 percent likelihood of being above the median.
How likely is it that at least 7 out of 10 samples are higher?
The probability is 17 percent, so we should not alert.
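The drift check for this example can be sketched as follows. It uses the ten measurements and the 300 ms median from the slides; the 5 percent alert threshold and the variable names are assumptions made for illustration.

```python
from math import comb

baseline_median_ms = 300
measured_ms = [200, 400, 350, 200, 600, 500, 150, 350, 400, 600]

# Count how many samples are slower than the baseline median.
above = sum(1 for value in measured_ms if value > baseline_median_ms)
n = len(measured_ms)

# If the median has not drifted, each sample exceeds it with probability 0.5,
# so P(at least `above` of n samples exceed it) is a binomial tail with p = 0.5.
p_drift = sum(comb(n, i) for i in range(above, n + 1)) / 2**n

ALERT_THRESHOLD = 0.05  # assumed cutoff, not from the talk
should_alert = p_drift < ALERT_THRESHOLD

print(f"{above} of {n} above median, probability {p_drift:.1%}, alert: {should_alert}")
```

Note that the 17 percent refers to seeing 7 or more values above the median; the probability of exactly 7 would be lower, but the tail probability is the relevant one for deciding whether the observation is unusual.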
![Page 55: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/55.jpg)
… and we are done!
Which metric to pick?
How to get this baseline?
How to define that this happened?
How to calculate this value?
![Page 56: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/56.jpg)
This was just the beginning
There are many more useful things to learn about statistics, probabilities, testing, …
![Page 57: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/57.jpg)
Alois Reitbauer
[email protected]
@AloisReitbauer
apmblog.compuware.com
![Page 58: The Dark of Building an Production Incident Syste](https://reader035.vdocument.in/reader035/viewer/2022062514/55838dbad8b42a9e528b4aa6/html5/thumbnails/58.jpg)
Image Credits
http://commons.wikimedia.org/wiki/File:Network_switches.jpg
http://commons.wikimedia.org/wiki/File:Wheelock_mt.jpg
http://commons.wikimedia.org/wiki/File:Fire-lite-bg-10.jpg
http://commons.wikimedia.org/wiki/File:Estacaobras.jpg
http://commons.wikimedia.org/wiki/File:Speedo_angle.jpg
http://commons.wikimedia.org/wiki/File:WelcomeToVegasNite.JPG
http://commons.wikimedia.org/wiki/File:Dice_02138.JPG
http://commons.wikimedia.org/wiki/File:Teadlased_j%C3%A4%C3%A4l.jpg