next generation alerting and fault detection, srecon europe 2016

120
Next-generation alerting and fault detection Dieter Plaetinck, raintank SRECon16 Europe – Dublin, Ireland July 12, 2016

Upload: dieter-plaetinck

Post on 14-Apr-2017

484 views

Category:

Data & Analytics


3 download

TRANSCRIPT

Page 1: Next generation alerting and fault detection, SRECon Europe 2016

Next-generation

alerting and fault detection

Dieter Plaetinck, raintank

SRECon16 Europe – Dublin, Ireland July 12, 2016

Page 2: Next generation alerting and fault detection, SRECon Europe 2016

Alerting, fault & anomaly detection through:

Machine learning

event & stream processing

Alerting IDE’s

Page 3: Next generation alerting and fault detection, SRECon Europe 2016
Page 4: Next generation alerting and fault detection, SRECon Europe 2016
Page 5: Next generation alerting and fault detection, SRECon Europe 2016
Page 6: Next generation alerting and fault detection, SRECon Europe 2016
Page 7: Next generation alerting and fault detection, SRECon Europe 2016

Also on

● support for Graphite and Grafana

Page 8: Next generation alerting and fault detection, SRECon Europe 2016

Also on

● support for Graphite and Grafana● hosted Graphite and Grafana

Page 9: Next generation alerting and fault detection, SRECon Europe 2016

Presumptions

● Monitoring using metrics in place

Page 10: Next generation alerting and fault detection, SRECon Europe 2016

Presumptions

● Monitoring using metrics in place● Alerting on metrics

Page 11: Next generation alerting and fault detection, SRECon Europe 2016

Presumptions

● Monitoring using metrics in place● Alerting on metrics● Alerts need high signal/noise ratio

Page 12: Next generation alerting and fault detection, SRECon Europe 2016
Page 13: Next generation alerting and fault detection, SRECon Europe 2016
Page 14: Next generation alerting and fault detection, SRECon Europe 2016
Page 15: Next generation alerting and fault detection, SRECon Europe 2016
Page 16: Next generation alerting and fault detection, SRECon Europe 2016
Page 17: Next generation alerting and fault detection, SRECon Europe 2016

www.quora.com/unanswered

Page 18: Next generation alerting and fault detection, SRECon Europe 2016

Google trends

Page 19: Next generation alerting and fault detection, SRECon Europe 2016

Static thresholds → automated anomaly detection

● Not scaling / too much data

Page 20: Next generation alerting and fault detection, SRECon Europe 2016

Static thresholds → automated anomaly detection

● Not scaling / too much data

● Infrastructure complexity

Page 21: Next generation alerting and fault detection, SRECon Europe 2016

Static thresholds → automated anomaly detection

● Not scaling / too much data

● Infrastructure complexity● Alerting on Patterns

Page 22: Next generation alerting and fault detection, SRECon Europe 2016

Machine learning is a subfield of computer science that evolved from the study of patternrecognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel definedmachine learning as a field of study that

gives computers the ability to learn without being explicitlyprogrammed.Machine learning explores the study and construction of

algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from an example training set of input observations in orderto make data-driven predictions or decisions expressed as outputs, rather than following strictly staticprogram instructions.”

https://en.wikipedia.org/wiki/Machine_learning

Page 23: Next generation alerting and fault detection, SRECon Europe 2016

http://www.extremetech.com/extreme/224445-its-2-0-how-googles-deepmind-is-beating-the-best-in-go-and-why-that-matters

Page 24: Next generation alerting and fault detection, SRECon Europe 2016

https://research.googleblog.com/2014/09/building-deeper-understanding-of-images.html

Page 25: Next generation alerting and fault detection, SRECon Europe 2016
Page 26: Next generation alerting and fault detection, SRECon Europe 2016

Using machine learningfor automated anomaly detection

Page 27: Next generation alerting and fault detection, SRECon Europe 2016
Page 28: Next generation alerting and fault detection, SRECon Europe 2016
Page 29: Next generation alerting and fault detection, SRECon Europe 2016

Challenges.

Page 30: Next generation alerting and fault detection, SRECon Europe 2016
Page 31: Next generation alerting and fault detection, SRECon Europe 2016

Challenges

1● e.g. Amazon, Facebook, LinkedIncontext

Page 32: Next generation alerting and fault detection, SRECon Europe 2016

1● e.g. Amazon, Facebook, LinkedIn● e.g. infrastructure change

context

Challenges

Page 33: Next generation alerting and fault detection, SRECon Europe 2016

2● Games vs your infraChanging rules

Challenges

Page 34: Next generation alerting and fault detection, SRECon Europe 2016

Challenges

2● Games vs your infra● Trained model doesn’t work

on new scenarios

Changing rules

Page 35: Next generation alerting and fault detection, SRECon Europe 2016

3Image recognition, security

vsops metrics

Signal strength

Challenges

Page 36: Next generation alerting and fault detection, SRECon Europe 2016

4● e.g. super fast to fastrelevancy

Challenges

Page 37: Next generation alerting and fault detection, SRECon Europe 2016

4● e.g. super fast to fast● e.g. redundancy failover

relevancy

Challenges

Page 38: Next generation alerting and fault detection, SRECon Europe 2016

4● e.g. super fast to fast● e.g. redundancy failover

operator knows best

relevancy

Challenges

Page 39: Next generation alerting and fault detection, SRECon Europe 2016

5● data prep: filtering, selection, cleaning● statistical modeling, model selection● training, testing● track performance & maintenance● operate infrastructure● fitting UX/UI

Effort

Challenges

Page 40: Next generation alerting and fault detection, SRECon Europe 2016

6● IntrinsicComplexity

Challenges

Page 41: Next generation alerting and fault detection, SRECon Europe 2016

6● Intrinsic

● Incidentalhttps://engineering.quora.com/Avoiding-Complexity-of-Machine-Learning-Systems

Complexity

Challenges

Page 42: Next generation alerting and fault detection, SRECon Europe 2016

ML / AD for operations

has merits

BUT:

Page 43: Next generation alerting and fault detection, SRECon Europe 2016

ML / AD for operations

has merits

BUT: ● Anomalies != Faults. Signal/noise trap

Page 44: Next generation alerting and fault detection, SRECon Europe 2016

ML / AD for operations

has merits

BUT: ● Anomalies != Faults. Signal/noise trap● Significant effort & complexity

Page 45: Next generation alerting and fault detection, SRECon Europe 2016

ML / AD for operations

has merits

BUT: ● Anomalies != Faults. Signal/noise trap● Significant effort & complexity● Limited use cases

Page 46: Next generation alerting and fault detection, SRECon Europe 2016

What might help

● Enrich metric metadata (metrics20.org)

Page 47: Next generation alerting and fault detection, SRECon Europe 2016

What might help

● Enrich metric metadata (metrics20.org)

clustered, stronger signals with more context

Page 48: Next generation alerting and fault detection, SRECon Europe 2016

What might help

● Enrich metric metadata (metrics20.org)

clustered, stronger signals with more context

classification for model selection

Page 49: Next generation alerting and fault detection, SRECon Europe 2016

What might help

● Enrich metric metadata (metrics20.org)

clustered, stronger signals with more context

classification for model selection

derive relevancy

Page 50: Next generation alerting and fault detection, SRECon Europe 2016

What might help

● Enrich metric metadata (metrics20.org)

clustered, stronger signals with more context

classification for model selection

derive relevancy● integration with CM, PaaS.

Page 51: Next generation alerting and fault detection, SRECon Europe 2016

What might help

● Enrich metric metadata (metrics20.org)

clustered, stronger signals with more context

classification for model selection

derive relevancy● integration with CM, PaaS.

awareness of infrastructure

Page 52: Next generation alerting and fault detection, SRECon Europe 2016

What might help

● Enrich metric metadata (metrics20.org)

clustered, stronger signals with more context

classification for model selection

derive relevancy● integration with CM, PaaS.

awareness of infrastructure

awareness of infrastructure change

Page 53: Next generation alerting and fault detection, SRECon Europe 2016

HOW doTHEY do it?

Page 54: Next generation alerting and fault detection, SRECon Europe 2016

https://codeascraft.com/2013/06/11/introducing-kale/

Page 55: Next generation alerting and fault detection, SRECon Europe 2016
Page 56: Next generation alerting and fault detection, SRECon Europe 2016

https://www.oreilly.com/ideas/monitoring-distributed-systems

“it’s important that monitoring systems - especially the criticalpath from the onset of a production problem, through a pageto a human, through basic triage and deep debugging - be kept simple and comprehensible by everyone on theteam.”

“Similarly, to keep noise low and signal high, the elements ofyour monitoring system that direct to a pager need to be verysimple and robust. Rules that generate alerts for humansshould be simple to understand and represent a clearfailure.”

Page 57: Next generation alerting and fault detection, SRECon Europe 2016

Conclusion

Page 58: Next generation alerting and fault detection, SRECon Europe 2016
Page 59: Next generation alerting and fault detection, SRECon Europe 2016

CEP& Stream processing

Page 60: Next generation alerting and fault detection, SRECon Europe 2016

CEP & stream processinge.g. storm, riemann.io, spark streaming

Page 61: Next generation alerting and fault detection, SRECon Europe 2016

in → logic → out

Page 62: Next generation alerting and fault detection, SRECon Europe 2016

Riemann.io

Page 63: Next generation alerting and fault detection, SRECon Europe 2016

CEP & stream processing

Compared to query-based alerting systems:

Page 64: Next generation alerting and fault detection, SRECon Europe 2016

CEP & stream processing

Compared to query-based alerting systems:

● Good scheduling guarantees/execution timeliness

Page 65: Next generation alerting and fault detection, SRECon Europe 2016

CEP & stream processing

Compared to query-based alerting systems:

● Good scheduling guarantees/execution timeliness● Unfamiliar paradigm (maybe)

Page 66: Next generation alerting and fault detection, SRECon Europe 2016

CEP & stream processing

Compared to query-based alerting systems:

● Good scheduling guarantees/execution timeliness● Unfamiliar paradigm (maybe)● Performance/scalability (maybe)

Page 67: Next generation alerting and fault detection, SRECon Europe 2016

CEP & stream processing

Compared to query-based alerting systems:

● Good scheduling guarantees/execution timeliness● Unfamiliar paradigm (maybe)● Performance/scalability (maybe)● operational complexity (maybe)

Page 68: Next generation alerting and fault detection, SRECon Europe 2016

Conclusion

Page 69: Next generation alerting and fault detection, SRECon Europe 2016

Not a bad idea…But doesn’t get to the root of the alerting problems.

Page 70: Next generation alerting and fault detection, SRECon Europe 2016

Aha!

Page 71: Next generation alerting and fault detection, SRECon Europe 2016
Page 72: Next generation alerting and fault detection, SRECon Europe 2016

Picture by Matt Simmons

Page 73: Next generation alerting and fault detection, SRECon Europe 2016

IDE for alerting

Support programmers building and maintaining software

Page 74: Next generation alerting and fault detection, SRECon Europe 2016

IDE for alerting

Support programmers building and maintaining software

Support operators building and maintaining alerting

Page 75: Next generation alerting and fault detection, SRECon Europe 2016
Page 76: Next generation alerting and fault detection, SRECon Europe 2016
Page 77: Next generation alerting and fault detection, SRECon Europe 2016
Page 78: Next generation alerting and fault detection, SRECon Europe 2016
Page 79: Next generation alerting and fault detection, SRECon Europe 2016

1vs traditional alerting, machine learning

[historical] testing

Key features

Page 80: Next generation alerting and fault detection, SRECon Europe 2016

2Arbitrary scopedata juggling

Key features

Page 81: Next generation alerting and fault detection, SRECon Europe 2016

Key features

2Arbitrary scopeArbitrary data

data juggling

Page 82: Next generation alerting and fault detection, SRECon Europe 2016

3dependencies

Key features

Page 83: Next generation alerting and fault detection, SRECon Europe 2016

http://www.slideshare.net/adriancockcroft/gluecon-monitoring-microservices-and-containers-a-challenge

Page 84: Next generation alerting and fault detection, SRECon Europe 2016

Key features

4transcience

Page 85: Next generation alerting and fault detection, SRECon Europe 2016

Key features

5DRY

Page 86: Next generation alerting and fault detection, SRECon Europe 2016

Keyinsights.

Page 87: Next generation alerting and fault detection, SRECon Europe 2016

Key insights

1remove hassle wrt improving signal/noise

Page 88: Next generation alerting and fault detection, SRECon Europe 2016

Key insights

1● ongoing maintenance & tuning is critical

remove hassle wrt improving signal/noise

Page 89: Next generation alerting and fault detection, SRECon Europe 2016

Key insights

1● ongoing maintenance & tuning is critical● code for UI and logic > knobs

remove hassle wrt improving signal/noise

Page 90: Next generation alerting and fault detection, SRECon Europe 2016

Key insights

1● ongoing maintenance & tuning is critical● code for UI and logic > knobs● leveraging additional data

remove hassle wrt improving signal/noise

Page 91: Next generation alerting and fault detection, SRECon Europe 2016

Key insights

2● Author to recipientcommunication

Page 92: Next generation alerting and fault detection, SRECon Europe 2016

Key insights

2● Author to recipient● Alert often primary UI

communication

Page 93: Next generation alerting and fault detection, SRECon Europe 2016

Key insights

3Human > computer

Page 94: Next generation alerting and fault detection, SRECon Europe 2016

Key insights

4attention is scarce, expensive

Page 95: Next generation alerting and fault detection, SRECon Europe 2016

Key insights

4attention is scarce, expensive

"provide monitoring platform that enables operators toefficiently utilize their attention"

Page 96: Next generation alerting and fault detection, SRECon Europe 2016

fault detectionwith bosun

Page 97: Next generation alerting and fault detection, SRECon Europe 2016

Classify series & find KPI’s

Page 98: Next generation alerting and fault detection, SRECon Europe 2016

Smoothly seasonal: good

Page 99: Next generation alerting and fault detection, SRECon Europe 2016

Smoothly seasonal: offset

Page 100: Next generation alerting and fault detection, SRECon Europe 2016

Smoothly seasonal: spikes

Page 101: Next generation alerting and fault detection, SRECon Europe 2016

Smoothly seasonal: erratic

Page 102: Next generation alerting and fault detection, SRECon Europe 2016

Band(), graphiteBand()

bosun.org/expressions.html

Solution 1/2 : strength

Page 103: Next generation alerting and fault detection, SRECon Europe 2016

Solution 1/2 : strength

Page 104: Next generation alerting and fault detection, SRECon Europe 2016

Solution 2/2 : erraticness

Page 105: Next generation alerting and fault detection, SRECon Europe 2016

Deviation-now Erraticness now = --------------------------

Deviation-historical

Solution 2/2 : erraticness

Page 106: Next generation alerting and fault detection, SRECon Europe 2016

Deviation-now median-historicalErraticness now = -------------------------- * -----------------------

Deviation-historical median-now

Solution 2/2 : erraticness

Page 107: Next generation alerting and fault detection, SRECon Europe 2016

deviation-now * median-historical Erraticness now = -------------------------------------------------------

(deviation-historical * median-now) + 0.01

Solution 2/2 : erraticness

Page 108: Next generation alerting and fault detection, SRECon Europe 2016
Page 109: Next generation alerting and fault detection, SRECon Europe 2016
Page 110: Next generation alerting and fault detection, SRECon Europe 2016
Page 111: Next generation alerting and fault detection, SRECon Europe 2016
Page 112: Next generation alerting and fault detection, SRECon Europe 2016

dieter.plaetinck.be/post/practical-fault-detection-on-timeseries-part-2

More details

Bosun macro, template & example

Grafana dashboard

Page 113: Next generation alerting and fault detection, SRECon Europe 2016

Static thresholds → automated anomaly detection

● Not scaling / too much data

Page 114: Next generation alerting and fault detection, SRECon Europe 2016

Static thresholds → automated anomaly detection

● Not scaling / too much data

● Infrastructure complexity

Page 115: Next generation alerting and fault detection, SRECon Europe 2016

Static thresholds → automated anomaly detection

● Not scaling / too much data

● Infrastructure complexity● Alerting on patterns

Page 116: Next generation alerting and fault detection, SRECon Europe 2016

Conclusion

Page 117: Next generation alerting and fault detection, SRECon Europe 2016

● All about the workflow

Page 118: Next generation alerting and fault detection, SRECon Europe 2016

● All about the workflow● An IDE like bosun exponentially boosts ability to maintain high

signal/noise alerting

Page 119: Next generation alerting and fault detection, SRECon Europe 2016

● All about the workflow● An IDE like bosun exponentially boosts ability to maintain high

signal/noise alerting● Build & share!

Page 120: Next generation alerting and fault detection, SRECon Europe 2016

Want more ?● bosun.org/resources presentations by Kyle Brandt (LISA 2014 + Monitorama 2015)● “my philosophy on alerting” by Rob Ewaschuk● kitchensoap.com/2015/05/01/openlettertomonitoringproducts● kitchensoap.com/2013/07/22/owning-attention-considerations-for-alert-design● “monitoring microservices” by Adrian Cockroft● (dieter.plaetinck.be/post/practical-fault-detection-alerting-dont-need-to-be-data-

scientist)● dieter.plaetinck.be/post/practical-fault-detection-on-timeseries-part-2● metrics20.org/media● mabrek.github.io● iwringer.wordpress.com

@Dieter_be - @raintanksaas – slack.raintank.io – raintank.io – bosun.org – grafana.org