MOC 2020 Workshop
John Liagouris Boston University
3 March 2020
PROVIDING FAST AND MEANINGFUL INSIGHTS INTO ENTERPRISE DATACENTERS
alerts, telemetry data, topology updates …
Datacenter
queries, complex analytics, simulations, what-if analysis,…
policy enforcement, re-configuration, …
trace streams
Distributed Streaming Dataflow System
THE BIG PICTURE: UNDERSTANDING THE DATACENTER
DATACENTER STACK IS ALREADY HEAVILY INSTRUMENTED
Figure labels: trace points, end-to-end traces, component boundary; app server, middleware, distributed filesystem.
Individual events, stack traces, and log records each tell only a small part of the story.
Tracking the lineage and dependencies of individual events provides better insights.
USE CASES
ONLINE TRACE TREE RECONSTRUCTION
ONLINE CRITICAL PATH ANALYSIS
TRACE TREE RECONSTRUCTION
Figure labels: Application A (A.1, A.2, A.3); Application B (B.1, B.2, B.3, B.4).
Time: 2015/09/01 10:03:38.599859
Session ID: XKSHSKCBA53U088FXGE7LD8
Transaction ID: 26-3-11-5-1
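The record above carries a timestamp, a session ID, and a dash-separated transaction ID. The following is a minimal sketch of how such records could be grouped into per-session trace trees. It assumes, which the slide does not state, that the transaction ID is a path in the tree, so that dropping the last component gives the parent. The function names and the two extra records are hypothetical.

```python
from collections import defaultdict


def parent_id(txn_id):
    """Return the parent transaction ID, or None for a root.

    Assumption: the ID is a dash-separated path, so dropping the last
    component yields the parent.
    """
    head, sep, _ = txn_id.rpartition("-")
    return head if sep else None


def build_trace_trees(records):
    """Group (session_id, txn_id) pairs into per-session child maps.

    Returns {session_id: {parent_txn_id: sorted list of child IDs}}.
    """
    trees = defaultdict(lambda: defaultdict(set))
    for session_id, txn_id in records:
        trees[session_id][parent_id(txn_id)].add(txn_id)
    return {
        session: {parent: sorted(children) for parent, children in tree.items()}
        for session, tree in trees.items()
    }


# The first record is the one shown on the slide; the other two are
# invented for illustration. Records may arrive in any order.
records = [
    ("XKSHSKCBA53U088FXGE7LD8", "26-3-11-5-1"),
    ("XKSHSKCBA53U088FXGE7LD8", "26-3-11-5"),
    ("XKSHSKCBA53U088FXGE7LD8", "26-3-11-5-2"),
]
trees = build_trace_trees(records)
```

Because the parent is derived from the ID itself, a child can be attached before its parent record has been seen, which suits logs that arrive out of order.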
Foundation for diagnostic, profiling, and monitoring tasks essential to the operation of the datacenter
• User sessions
• Spans
• Call graphs
• Provenance graphs
• Critical path analysis
• Timing charts
• Wait-for graphs
TRACE TREES
F. Zhou et al. OSDI’18
B. H. Sigelman et al. (Google Dapper)
Y. Wu et al. NSDI’19
M. Chow et al. OSDI’14
ONLINE TRACE TREE RECONSTRUCTION
Figure labels: trace points, end-to-end traces, component boundary.
Application servers, Middleware, Distributed filesystem
Log collection
Streaming System (Timely Dataflow): Re-order buffer, Tree re-construction, Tree statistics
UI: Query interface, Live visualization
Logs spread across 1263 streams and 42 servers
Mean input rate: 1.3 million events/sec at 424.3 MB/sec
A single 8-core commodity machine can keep up with this input rate
EuroSys’17: Z. Chothia, J. Liagouris, D. Dimitrova, T. Roscoe. Online Reconstruction of Structural Information from Datacenter Logs. EuroSys 2017.
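The first stage named on this slide is a re-order buffer, which is needed because records from many log streams do not arrive in timestamp order. The following is a minimal sketch of one way to build such a buffer, not the implementation from the EuroSys paper: a min-heap holds events and releases them once they are older than the largest timestamp seen so far minus a fixed slack. The slack policy and all names are assumptions.

```python
import heapq


class ReorderBuffer:
    """Release events in timestamp order, tolerating bounded disorder."""

    def __init__(self, slack):
        self.slack = slack              # assumed bound on how late an event may be
        self.max_seen = float("-inf")   # largest timestamp observed so far
        self.heap = []
        self.seq = 0                    # tie-breaker so payloads are never compared

    def push(self, timestamp, event):
        """Buffer one event and return the events that are now safe to emit."""
        self.max_seen = max(self.max_seen, timestamp)
        heapq.heappush(self.heap, (timestamp, self.seq, event))
        self.seq += 1
        return self._drain(self.max_seen - self.slack)

    def _drain(self, watermark):
        released = []
        while self.heap and self.heap[0][0] <= watermark:
            timestamp, _, event = heapq.heappop(self.heap)
            released.append((timestamp, event))
        return released

    def flush(self):
        """Emit everything that is still buffered, in timestamp order."""
        return self._drain(float("inf"))


buf = ReorderBuffer(slack=2)
out = []
for ts, ev in [(1, "a"), (3, "c"), (2, "b"), (6, "e"), (5, "d")]:
    out.extend(buf.push(ts, ev))
out.extend(buf.flush())
```

An event that arrives more than `slack` time units late would still be emitted, but after newer events, so the slack trades latency against ordering guarantees.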
COMPOSING TRACE ANALYTICS
Exploiting a general framework permits a simple, concise implementation in 1770 lines of code while seamlessly integrating with management applications.
Composition of analytic tasks:
• Online trace tree clustering
• Service dependency extraction
• Inferring call-graph patterns
real time
https://github.com/strymon-system/reconstruction
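Service dependency extraction, one of the analytic tasks listed on this slide, can be illustrated on a single reconstructed trace tree: every parent-child edge whose two spans ran in different services counts as one caller-to-callee dependency. This is a minimal sketch under that assumption. The tree representation, the service names, and the span IDs are hypothetical and are not taken from the linked repository.

```python
from collections import Counter


def service_dependencies(spans):
    """Count caller -> callee service pairs in one trace tree.

    spans: {span_id: (service_name, parent_span_id or None)}
    """
    deps = Counter()
    for service, parent in spans.values():
        if parent is None:
            continue  # the root span has no caller
        parent_service = spans[parent][0]
        if parent_service != service:
            deps[(parent_service, service)] += 1
    return deps


# Hypothetical trace tree with five spans across three services.
spans = {
    "1": ("app-server", None),
    "1-1": ("middleware", "1"),
    "1-2": ("middleware", "1"),
    "1-1-1": ("filesystem", "1-1"),
    "1-1-2": ("middleware", "1-1"),
}
deps = service_dependencies(spans)
```

Summing these counters over many trees gives a weighted service dependency graph.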
USE CASES
ONLINE TRACE TREE RECONSTRUCTION
ONLINE CRITICAL PATH ANALYSIS
DISTRIBUTED EXECUTION
Figure labels: services S1, S2, S3; task A, task B, task C; waiting; message.
CONVENTIONAL PROFILING IN DISTRIBUTED EXECUTION
Task A is the most time-consuming.
What if we optimize it?
Figure labels: services S1, S2, S3; task B, task C; waiting.
No performance benefit for the parallel execution.
A PARALLEL EXECUTION
Figure labels: workers W1, W2, W3; serialization, deserialization, waiting.
No performance benefit for the parallel execution!
Conventional profiling can be misleading.
CRITICAL PATH
Figure labels: S1, S2, S3; a, b, c, d.
Provides good candidates for optimization
Captures execution dependencies
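The contrast drawn in these slides can be made concrete with a small example. Conventional profiling ranks activities by how long they take, while the critical path is the longest chain of dependent activities. The following is a minimal sketch with an invented activity graph; the durations and dependencies are assumptions and do not come from the figure. It shows that the slowest activity need not lie on the critical path.

```python
from functools import lru_cache


def critical_path(durations, deps):
    """Return (length, path) of the longest dependency chain in a DAG.

    durations: {activity: time}
    deps: {activity: [activities that must finish first]}
    """

    @lru_cache(maxsize=None)
    def longest_to(node):
        # Longest chain that ends at `node`, including node's own duration.
        best_len, best_path = 0, ()
        for pred in deps.get(node, []):
            length, path = longest_to(pred)
            if length > best_len:
                best_len, best_path = length, path
        return best_len + durations[node], best_path + (node,)

    return max(longest_to(node) for node in durations)


# Hypothetical execution: a -> b -> d, and c -> d, with c running in parallel.
durations = {"a": 3, "b": 3, "c": 5, "d": 1}
deps = {"b": ["a"], "d": ["b", "c"]}

length, path = critical_path(durations, deps)
slowest = max(durations, key=durations.get)  # what a conventional profile flags

# Removing the cost of the slowest activity entirely does not shorten the run.
length_after, _ = critical_path({**durations, "c": 0}, deps)
```

Here the conventional profile points at `c`, yet the end-to-end time is bounded by the chain `a`, `b`, `d`, so only work on that chain can reduce it.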
ONLINE ANALYSIS OF TRACE SNAPSHOTS
trace stream
trace snapshot construction
Online graph metrics
SnailTrail
performance summaries
NSDI’18: M. Hoffmann, A. Lattuada, J. Liagouris, V. Kalavri, D. Dimitrova, S. Wicki, Z. Chothia, T. Roscoe. SnailTrail: Generalising Critical Paths for the Online Analysis of Distributed Dataflows. NSDI 2018.
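The trace snapshot construction step can be sketched as slicing the trace stream into fixed-width time windows. This is a minimal illustration, not SnailTrail's code: it assumes non-overlapping windows and assumes that an activity crossing a window boundary is cut at the boundary. The function name and the example activities are hypothetical.

```python
def split_into_snapshots(activities, width):
    """Assign (start, end, label) activities to fixed-width snapshots.

    An activity that crosses a snapshot boundary is cut at the boundary,
    so each snapshot only holds the part that falls inside its window.
    Returns {snapshot_index: [(start, end, label), ...]}.
    """
    snapshots = {}
    for start, end, label in activities:
        index = int(start // width)
        while start < end:
            boundary = (index + 1) * width
            piece_end = min(end, boundary)
            snapshots.setdefault(index, []).append((start, piece_end, label))
            start = piece_end
            index += 1
    return snapshots


# Invented activities, with times in seconds and 8-second windows.
activities = [(0, 5, "processing"), (5, 9, "scheduling"), (9, 20, "processing")]
snaps = split_into_snapshots(activities, width=8)
```

Each snapshot can then be analyzed independently as it closes, which is what makes per-snapshot summaries possible on a live stream.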
ONLINE PERFORMANCE SUMMARIES WITH SNAILTRAIL
Figure: two panels, SnailTrail Profiling and Conventional Profiling; x-axis: Snapshot (0 to 15); y-axis labels: CP and % weight; series: Processing, Scheduling.
Apache Spark: Yahoo! Streaming Benchmark, 16 workers, 8s snapshots
Figure labels: DRIVER, W1, W2, W3
See also: Venkataraman, S., Panda, A., Ousterhout, K., Ghodsi, A., Franklin, M. J., Recht, B., and Stoica, I. Drizzle: Fast and Adaptable Stream Processing at Scale. OSDI 2017.
COMPARISON WITH CONVENTIONAL PROFILING
Figure: two panels, SnailTrail Profiling and Conventional Profiling; x-axis: Snapshot (0 to 15); y-axis labels: CP and % weight; series: Processing, Scheduling.
Apache Spark: Yahoo! Streaming Benchmark, 16 workers, 8s snapshots
Ignores critical path dependencies
https://github.com/strymon-system/snailtrail
alerts, telemetry data, topology updates …
Datacenter
queries, complex analytics, simulations, what-if analysis,…
policy enforcement, re-configuration, …
trace streams
Distributed Streaming Dataflow System
THE BIG PICTURE: UNDERSTANDING THE DATACENTER