![Page 1: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/1.jpg)
Sherlock – Diagnosing Problems in the Enterprise
Srikanth Kandula
Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang
![Page 2: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/2.jpg)
Enterprise Management: Between a Rock and a Hard Place
Manageability
• Stick with tried software, never change infrastructure
• Cheap
• Upgrades are hard, forget about innovation!
Usability
• Keep pace with technology
• Expensive
  – IT staff in 1000s
  – 72% of MS IT budget is staff
• Reliability Issues
  – Cost of down-time
![Page 3: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/3.jpg)
Well-Managed Enterprises Still Unreliable
[Figure: distribution of the response time of a Web server (ms, log scale from 10 to 10000) against fraction of requests (0 to .1); 85% Normal, 10% Troubled, 0.7% Down]
10% responses take up to 10x longer than normal
How do we manage evolving enterprise networks?
![Page 4: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/4.jpg)
Current Tools Miss the Forest for the Trees
• Monitor Individual Boxes or Protocols
• Flood admin with alerts
• Don’t convey the end-to-end picture
[Figure: Client, DNS, Authentication Server, Web Server, SQL Backend]
But, the primary goal of enterprise management is to diagnose user-perceived problems!
![Page 5: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/5.jpg)
Instead of looking at the nitty-gritty of individual components, use an end-to-end approach that focuses on user problems
Sherlock
![Page 6: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/6.jpg)
Challenges for the End-to-End Approach
• Don’t know what user’s performance depends on
![Page 7: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/7.jpg)
Challenges for the End-to-End Approach
• Don’t know what user’s performance depends on
  – Dependencies are distributed
  – Dependencies are non-deterministic
• Don’t know which dependency is causing the problem
  – Server CPU 70%, link dropped 10 packets, but which affected user?
[Figure: E.g., Web Connection: Client, DNS, Auth. Server, Web Server, SQL Backend]
![Page 8: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/8.jpg)
Sherlock’s Contributions
• Passively infers dependencies from logs
• Builds a unified dependency graph incorporating network, server and application dependencies
• Diagnoses user problems in the enterprise
• Deployed in a part of the Microsoft Enterprise
![Page 9: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/9.jpg)
Sherlock’s Architecture
![Page 10: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/10.jpg)
Sherlock’s Architecture
[Figure: Clients, Servers and the Inference Engine; Network Dependency Graph + User Observations (Web1 1000ms, Web2 30ms, File1 Timeout) = List Troubled Components]
Sherlock works for various client-server applications
![Page 11: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/11.jpg)
Video Server
Data Store
DNS
How do you automatically learn such distributed dependencies?
![Page 12: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/12.jpg)
Strawman: Instrument all applications and libraries → Not Practical
Sherlock exploits timing info
[Figure: timeline; My Client talks to B, then My Client talks to C, separated by time t]
If the client talks to B whenever it talks to C → Dependent Connections
![Page 13: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/13.jpg)
Strawman: Instrument all applications and libraries → Not Practical
Sherlock exploits timing info
If the client talks to B whenever it talks to C → Dependent Connections
[Figure: timeline with many connections to B, one of them within time t of a connection to C: False Dependence]
![Page 14: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/14.jpg)
Strawman: Instrument all applications and libraries → Not Practical
Sherlock exploits timing info
If the client talks to B whenever it talks to C → Dependent Connections
[Figure: timeline with connections to B and C separated by time t, and the inter-access time between connections to B]
Dependent iff t << Inter-access time
As long as this occurs with probability higher than chance
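The timing rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Sherlock's implementation: the function names, the `ratio` and `margin` thresholds, the Poisson model for "chance", and the sample timestamps are all hypothetical.

```python
import math
from bisect import bisect_left, bisect_right


def mean_inter_access(times):
    """Average gap between consecutive accesses (infinite if fewer than two)."""
    ordered = sorted(times)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps) if gaps else float("inf")


def co_occurrence_fraction(times_c, times_b, window):
    """Fraction of accesses to C preceded by an access to B within `window` seconds."""
    ordered_b = sorted(times_b)
    hits = 0
    for t in times_c:
        lo = bisect_left(ordered_b, t - window)
        hi = bisect_right(ordered_b, t)
        if hi > lo:  # at least one B access falls in [t - window, t]
            hits += 1
    return hits / len(times_c) if times_c else 0.0


def chance_fraction(times_b, window, duration):
    """Chance that a random instant has a B access in the preceding window,
    assuming (hypothetically) that B accesses arrive as a Poisson process."""
    rate = len(times_b) / duration
    return 1.0 - math.exp(-rate * window)


def is_dependent(times_c, times_b, window, duration, ratio=0.1, margin=0.2):
    """Apply both slide conditions: t << inter-access time, and above chance."""
    if window > ratio * mean_inter_access(times_b):
        return False  # B is accessed too often for co-occurrence to mean anything
    observed = co_occurrence_fraction(times_c, times_b, window)
    return observed > chance_fraction(times_b, window, duration) + margin


# Hypothetical logs (seconds): C follows B closely, and B is accessed rarely.
rare_b = [100.0, 300.0, 500.0, 700.0, 900.0]
c_after_b = [100.05, 300.04, 500.06, 700.05]
# Hypothetical logs: B is accessed every half second, so C always looks close to some B.
busy_b = [i * 0.5 for i in range(2000)]
c_unrelated = [100.2, 400.3, 800.1]

print(is_dependent(c_after_b, rare_b, window=1.0, duration=1000.0))
print(is_dependent(c_unrelated, busy_b, window=1.0, duration=1000.0))
```

The first call reports a dependence because the window is far smaller than the gap between accesses to B; the second is rejected, which is the false-dependence case the previous slide warns about.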
![Page 15: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/15.jpg)
Sherlock’s Algorithm to Infer Dependencies
• Infer dependent connections from timing
[Figure: Dependency Graph with nodes Video, DNS, Store]
![Page 16: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/16.jpg)
Sherlock’s Algorithm to Infer Dependencies
• Infer dependent connections from timing
• Infer topology from Traceroutes & configurations
[Figure: Dependency Graph for "Bill Watches Video"; components Bill’s Client, DNS, Video, Store; dependencies Bill→DNS, Bill→Video, Video→Store]
• Works with legacy applications
• Adapts to changing conditions
![Page 17: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/17.jpg)
But hard dependencies are not enough…
![Page 18: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/18.jpg)
But hard dependencies are not enough…
[Figure: Dependency Graph for "Bill watches Video"; components Bill’s Client, DNS, Video, Store; dependencies Bill→DNS, Bill→Video, Video→Store, with edge probabilities p1, p2, p3]
If Bill caches server’s IP → DNS down but Bill gets video
Need Probabilities
Sherlock uses the frequency with which a dependence occurs in logs as its edge probability
p1=10%, p2=100%
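A minimal sketch of the frequency-based edge probability follows. The log format (one set of contacted services per access), the function names, and the independence assumption used to combine edges are illustrative assumptions, not details given on the slide. The sample log is made up to reproduce the slide's 10% and 100% figures, on the assumption that the 10% edge is the DNS lookup.

```python
from collections import Counter


def edge_probabilities(access_log):
    """access_log: list of sets, one per access, naming the services contacted.
    Returns the fraction of accesses in which each service was contacted."""
    counts = Counter()
    for contacted in access_log:
        counts.update(set(contacted))
    total = len(access_log)
    return {svc: n / total for svc, n in counts.items()} if total else {}


def failure_probability(edge_probs, down):
    """Probability that one access fails, assuming (hypothetically) that it fails
    exactly when it uses at least one down component, independently per edge."""
    p_ok = 1.0
    for svc, p in edge_probs.items():
        if svc in down:
            p_ok *= 1.0 - p
    return 1.0 - p_ok


# Hypothetical log: Bill watches a video ten times; the server's IP is cached
# in nine of them, so DNS is contacted only once.
log = [{"DNS", "Video"}] + [{"Video"}] * 9
probs = edge_probabilities(log)

print(probs["DNS"], probs["Video"])
print(failure_probability(probs, down={"DNS"}))
print(failure_probability(probs, down={"Video"}))
```

With DNS down the access fails only about one time in ten, which is the cached-IP case on the slide; with the video server down it always fails.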
![Page 19: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/19.jpg)
How do we use the dependency graph to diagnose user problems?
![Page 20: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/20.jpg)
Diagnosing User Problems
[Figure: Dependency Graph for "Bill Watches Video"; components Bill’s Client, DNS, Video, Store; dependencies Bill→DNS, Bill→Video, Video→Store]
Which components caused the problem?
Need to disambiguate!!
![Page 21: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/21.jpg)
Diagnosing User Problems
[Figure: Dependency Graph with observations "Bill Watches Video", "Bill Sees Sales" and "Paul Watches Video2"; components Bill’s Client, DNS, Video, Video2, Sales, Store; dependencies Bill→DNS, Bill→Video, Bill→Sales, Paul→Video2, Video→Store, Video2→Store]
Which components caused the problem?
Use correlation to disambiguate!!
• Disambiguate by correlating
  – Across logs from same client
  – Across clients
• Prefer simpler explanations
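To make the correlation idea concrete, here is a simplified sketch that scores each single component by how well its failure alone would explain all observations. It is not Sherlock's inference engine: the scoring rule, the `noise` term, the edge probabilities, the "Paul's Client" node, and which observations are troubled are all assumptions made for illustration.

```python
def score(candidate, deps, observations, noise=0.01):
    """Likelihood of the observations if only `candidate` is down.
    deps[activity] maps each component the activity may use to an edge probability."""
    likelihood = 1.0
    for activity, troubled in observations.items():
        p_fail = deps[activity].get(candidate, 0.0)
        # Keep probabilities away from 0 and 1 so one odd sample cannot veto a candidate.
        p_fail = min(max(p_fail, noise), 1.0 - noise)
        likelihood *= p_fail if troubled else 1.0 - p_fail
    return likelihood


def rank(deps, observations):
    """Order candidate components from most to least likely single cause."""
    candidates = {c for used in deps.values() for c in used}
    return sorted(candidates, key=lambda c: score(c, deps, observations), reverse=True)


# Hypothetical dependency graph loosely following the slide's example.
deps = {
    "Bill Watches Video": {"Bill's Client": 1.0, "DNS": 0.1, "Video": 1.0, "Store": 1.0},
    "Bill Sees Sales": {"Bill's Client": 1.0, "DNS": 0.1, "Sales": 1.0},
    "Paul Watches Video2": {"Paul's Client": 1.0, "DNS": 0.1, "Video2": 1.0, "Store": 1.0},
}
# Hypothetical observations: both video activities are troubled, Sales is fine.
observations = {
    "Bill Watches Video": True,
    "Bill Sees Sales": False,
    "Paul Watches Video2": True,
}

ranking = rank(deps, observations)
print(ranking[0])
```

Correlating across Bill's own activities rules out his client and DNS (Sales works), and correlating across clients favors the shared Store over either video server. Considering only single faults is one crude way to "prefer simpler explanations".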
![Page 22: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/22.jpg)
Will Correlation Scale?
![Page 23: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/23.jpg)
Will Correlation Scale?
Microsoft Internal Network
• O(100,000) client desktops
• O(10,000) servers
• O(10,000) apps/services
• O(10,000) network devices
[Figure: network topology showing Building Network, Campus Core, Corporate Core, Data Center]
Dependency Graph is Huge
![Page 24: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/24.jpg)
Will Correlation Scale?
Can we evaluate all combinations of component failures?
The number of fault combinations is exponential!
Impossible to compute!
![Page 25: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/25.jpg)
Scalable Algorithm to Correlate
Only a few faults happen concurrently
But how many is few?
Evaluate enough to cover 99.9% of faults
For MS network, at most 2 concurrent faults → 99.9% accurate
Exponential → Polynomial
![Page 26: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/26.jpg)
Scalable Algorithm to Correlate
Only a few faults happen concurrently
But how many is few?
Evaluate enough to cover 99.9% of faults
For MS network, at most 2 concurrent faults → 99.9% accurate
Only few nodes change state
Exponential → Polynomial
![Page 27: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/27.jpg)
Scalable Algorithm to Correlate
Only a few faults happen concurrently
But how many is few?
Evaluate enough to cover 99.9% of faults
For MS network, at most 2 concurrent faults → 99.9% accurate
Only few nodes change state
Re-evaluate only if an ancestor changes state
Reduces the cost of evaluating a case by 30x-70x
Exponential → Polynomial
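The move from exponential to polynomial can be illustrated by counting hypotheses. The sketch below enumerates every set of at most k simultaneously failed components. The function names are hypothetical, and the 358 figure is the number of components that can fail reported in the results part of this talk.

```python
from itertools import combinations
from math import comb


def fault_hypotheses(components, max_faults):
    """Yield every set of at most `max_faults` components assumed down at once."""
    for k in range(max_faults + 1):
        for combo in combinations(components, k):
            yield frozenset(combo)


def hypothesis_count(n, max_faults):
    """Number of hypotheses over n components: sum of C(n, k) for k = 0..max_faults."""
    return sum(comb(n, k) for k in range(max_faults + 1))


# Small illustrative example: 4 components, at most 2 concurrent faults.
small = list(fault_hypotheses("ABCD", 2))
print(len(small))  # 11 = 1 + 4 + 6

# 358 components that can fail, at most 2 concurrent faults.
print(hypothesis_count(358, 2))  # 64262

# Allowing any combination of faults would require 2**358 hypotheses.
print(len(str(2 ** 358)))  # number of decimal digits in 2**358
```

Capping the number of concurrent faults at 2 keeps the search at roughly n²/2 cases instead of 2ⁿ, which is what makes exhaustive scoring feasible.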
![Page 28: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/28.jpg)
Results
![Page 29: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/29.jpg)
Experimental Setup
• Evaluated on the Microsoft enterprise network
• Monitored 23 clients, 40 production servers for 3 weeks
  – Clients are at MSR Redmond
  – Extra host on server’s Ethernet logs packets
• Busy, operational network
  – Main Intranet Web site and software distribution file server
  – Load-balancing front-ends
  – Many paths to the data-center
![Page 30: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/30.jpg)
What Do Web Dependencies in the MS Enterprise Look Like?
![Page 31: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/31.jpg)
What Do Web Dependencies in the MS Enterprise Look Like?
[Figure: dependency graph for "Client Accesses Portal", including Auth. Server]
![Page 32: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/32.jpg)
What Do Web Dependencies in the MS Enterprise Look Like?
[Figure: dependency graph for "Client Accesses Portal", including Auth. Server]
![Page 33: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/33.jpg)
What Do Web Dependencies in the MS Enterprise Look Like?
[Figure: dependency graphs for "Client Accesses Portal" and "Client Accesses Sales", including Auth. Server]
Sherlock discovers complex dependencies of real apps.
![Page 34: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/34.jpg)
What Do File-Server Dependencies Look Like?
[Figure: dependency graph for "Client Accesses Software Distribution Server"; nodes Auth. Server, WINS, DNS, Proxy, File Server, Backend Servers 1 to 4; edge probabilities shown: 100%, 10%, 6%, 5%, 2%, 8%, 5%, 1%, .3%]
Sherlock works for many client-server applications
![Page 35: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/35.jpg)
Sherlock Identifies Causes of Poor Performance
Dependency Graph: 2565 nodes; 358 components that can fail
[Figure: Component Index vs. Time (days)]
87% of problems localized to 16 components
![Page 36: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/36.jpg)
Sherlock Identifies Causes of Poor Performance
Inference Graph: 2565 nodes; 358 components that can fail
[Figure: Component Index vs. Time (days)]
Corroborated the three significant faults
![Page 37: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/37.jpg)
Sherlock Goes Beyond Traditional Tools
• SNMP-reported utilization on a link flagged by Sherlock
• Problems coincide with spikes
Sherlock identifies the troubled link but SNMP cannot!
![Page 38: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/38.jpg)
Comparing with Alternatives
• Dataset of known (fault, observations) pairs
• Accuracy = 1 – (Prob. False Positives + Prob. False Negatives)
[Figure: accuracy bar chart: SCORE (non-probabilistic) 53%]
![Page 39: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/39.jpg)
Comparing with Alternatives
• Dataset of known (fault, observations) pairs
• Accuracy = 1 – (Prob. False Positives + Prob. False Negatives)
[Figure: accuracy bar chart: Shrink (probabilistic) 59%, SCORE (non-probabilistic) 53%]
![Page 40: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/40.jpg)
Comparing with Alternatives
• Dataset of known (fault, observations) pairs
• Accuracy = 1 – (Prob. False Positives + Prob. False Negatives)
[Figure: accuracy bar chart: Sherlock 91%, Shrink (probabilistic) 59%, SCORE (non-probabilistic) 53%]
Sherlock outperforms existing tools!
![Page 41: Sherlock – Diagnosing Problems in the Enterprise Srikanth Kandula Victor Bahl, Ranveer Chandra, Albert Greenberg, David Maltz, Ming Zhang](https://reader035.vdocument.in/reader035/viewer/2022062511/5514755d550346284e8b6266/html5/thumbnails/41.jpg)
Conclusions
• Sherlock passively infers network-wide dependencies from logs and traceroutes
• It diagnoses faults by correlating user observations
• It works at scale!
• Experiments in Microsoft’s Network show
  – Finds faults missed by existing tools like SNMP
  – Is more accurate than prior techniques
• Steps towards a Microsoft product