learning at scale is hard! - usenix
Post on 27-Apr-2022
2 Views
Preview:
TRANSCRIPT
Learning at Scale is Hard!
Outage Pattern Analysis and Dirty Data
Tanner LundMicrosoft Azure SRE
@101010Lund
@101010Lund Photo:RachelChapman(CC)
@101010Lund
@101010Lund
@101010Lund
@101010Lund
@101010Lund
@101010Lund Photo:MoRiza(CC)
@101010Lund Photo:RachelChapman(CC)
Learning (From Failure) At Scale
@101010Lund
Trends: Identified
@101010Lund
Antipatterns: Quashed
@101010Lund
Reliability Work:Actually Gets Done
Appropriately Prioritized
@101010Lund
@101010Lund
Data Scientists:
@101010Lund
Problem Management
@101010Lund
Problem: “The cause of one or more incidents” – Information Technology
Infrastructure Library (ITIL)
@101010Lund
@101010Lund Photo:RachelChapman(CC)
Sharing is caring!
@101010Lund
Gathering data
@101010Lund
Selecting models
@101010Lund
Training said models
@101010Lund
Evaluating models
@101010Lund
You know what was harder?
@101010Lund
Knowing what we’re actually looking for.
@101010Lund
IDK, something amazing!
¯\(°_o)/¯
@101010Lund
Fundamental Issue: ROOT CAUSES
@101010Lund
@101010Lund
Complex Systems fail in complex ways
@101010Lund
“Each of these small failures is necessary to cause catastrophe
but only a combination is sufficient to permit failure”
-Richard I. Cook, “How Complex Systems Fail”
@101010Lund
Let’s take a step back
@101010Lund
Why do we do RCAs?
@101010Lund
To stop bad stuff from happening (again)
@101010Lund
Hunting for Causes Problems Contributing Factors
@101010Lund
Outage (for our purposes):
Service or platform level issue that impacts customer experience
@101010Lund
Postmortem Text Analysis
@101010Lund
BeautifulSoupNLTK
GensimpyLDAvis
@101010Lund
@101010Lund
Not actionable.
@101010Lund
@101010Lund
Big Deal™
@101010Lund
Metrics!
@101010Lund
@101010Lund Photo:JudyWitts (cc)
Pain Value
@101010Lund
Pain Value=(No.ofoutages)*(duration)*(severity)*
(weightingfactor)
@101010Lund
Customers ImpactedRegions
Hardware SKUsDistance Below SLO
Number of breached SLOs
@101010Lund
Data Scientists:
@101010Lund
Pain Value=(No.ofoutages)*(duration)*(severity)*
(weightingfactor)
@101010Lund
Human interpretation still necessary
@101010LundPhoto:WikimediaCommons
@101010Lund
Missing/InsufficientData
@101010Lund
Incomplete Data
@101010Lund
InaccurateData
ItWasDefinitelyNetwork’sFault
OurCertsExpired
@101010Lund
Irrelevant Data
@101010Lund
Ambiguity
Node – CPUNode – Instance of ProgramNode – Physical Hardware BoxNode – Point on Graph such that G = (V,E)Node – Any device connected to the networkNode – Communication endpointNode – Client, Server, or PeerNode – Bitcoin minerNode – Data TypeNode – Node.js
@101010Lund
Confounding Factors
(like config drift)
@101010Lund
@101010Lund
Dirty data will lie to you.
@101010Lund
What was the (preliminary) result?
@101010Lund
1. Surfaced surprise issues
@101010Lund
2. Debunked production myths
@101010Lund
3. Stronger arguments for prioritization of reliability
work
@101010Lund
What did we learn?
@101010Lund
1. Define your hypotheses
@101010Lund
2. Clean your data
@101010Lund
3. Work your way up the DIKW pyramid
@101010Lund
What else can we do?
@101010Lund
Cross-Correlate Data Sets
@101010Lund
@101010Lund
Study your minor failures
@101010Lund
Intelligently Calculate Risk
@101010Lund
Continue to improve the RCA Process
@101010Lund
@101010Lund Photo:RachelChapman(CC)
@101010Lund
@101010Lund
Tanner Lund@101010Lundtalund@Microsoft.com/in/tannerlund
top related