Page 1:

MOC 2020 Workshop

John Liagouris Boston University

3 March 2020

PROVIDING FAST AND MEANINGFUL INSIGHTS INTO ENTERPRISE DATACENTERS

Page 2:

THE BIG PICTURE: UNDERSTANDING THE DATACENTER

[Figure labels: Datacenter; alerts, telemetry data, topology updates, …; trace streams; Distributed Streaming Dataflow System; queries, complex analytics, simulations, what-if analysis, …; policy enforcement, re-configuration, …]

Page 3:

DATACENTER STACK IS ALREADY HEAVILY INSTRUMENTED

[Figure: App server, Middleware, Distributed filesystem. Legend: Trace points, End-to-end traces, Component boundary.]

Page 4:

[Figure: App server, Middleware, Distributed filesystem. Legend: Trace points, End-to-end traces, Component boundary.]

individual events, stack traces, and log records tell a small part of the story

Page 5:

[Figure: App server, Middleware, Distributed filesystem. Legend: Trace points, End-to-end traces, Component boundary.]

tracking lineage and dependencies of individual events provides better insights

Page 6:

USE CASES

ONLINE TRACE TREE RECONSTRUCTION

ONLINE CRITICAL PATH ANALYSIS

Page 7:

TRACE TREE RECONSTRUCTION

[Figure: Application A (A.1, A.2, A.3) and Application B (B.1, B.2, B.3, B.4)]

Time: 2015/09/01 10:03:38.599859

Session ID: XKSHSKCBA53U088FXGE7LD8

Transaction ID: 26-3-11-5-1
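
The record above carries a session ID and a dash-separated transaction ID. The following is a minimal sketch of how such records could be grouped into one tree per session. It assumes, and the slide does not state this, that the transaction ID encodes a path from the root, so that dropping the last component yields the parent's ID. The function names and the second sample record are hypothetical.

```python
from collections import defaultdict


def parent_of(txn_id):
    """Return the parent transaction ID, or None for a root.

    Assumption: IDs are dash-separated paths, e.g. "26-3-11" is the
    parent of "26-3-11-5".
    """
    head, sep, _ = txn_id.rpartition("-")
    return head if sep else None


def build_trees(records):
    """Group (session_id, txn_id) records into per-session child maps."""
    trees = defaultdict(lambda: defaultdict(set))
    for session_id, txn_id in records:
        node = txn_id
        # Walk up to the root so missing intermediate nodes are still linked.
        while node is not None:
            parent = parent_of(node)
            if parent is not None:
                trees[session_id][parent].add(node)
            node = parent
    return {s: {p: sorted(c) for p, c in t.items()} for s, t in trees.items()}


records = [
    ("XKSHSKCBA53U088FXGE7LD8", "26-3-11-5-1"),
    ("XKSHSKCBA53U088FXGE7LD8", "26-3-12"),  # hypothetical sibling record
]
trees = build_trees(records)
```

Keying the outer map by session ID keeps records of different sessions apart even when their transaction IDs collide.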

Page 8:

TRACE TREES

Foundation for diagnostic, profiling, and monitoring tasks essential to the operation of the datacenter

• User sessions

• Spans

• Call graphs

• Provenance graphs

• Critical path analysis

• Timing charts

• Wait-for graphs

B. H. Sigelman et al. (Google Dapper)

M. Chow et al. OSDI’14

F. Zhou et al. OSDI’18

Y. Wu et al. NSDI’19

Page 9:

ONLINE TRACE TREE RECONSTRUCTION

[Figure labels: Application servers, Middleware, Distributed filesystem; Log collection; Streaming System (Timely Dataflow): Re-order buffer, Tree re-construction, Tree statistics; UI: Query interface, Live visualization. Legend: Trace points, End-to-end traces, Component boundary.]

Logs spread across 1263 streams and 42 servers

Mean input rate: 1.3 million events/sec at 424.3 MB/sec

The system keeps up with this input on a single 8-core commodity machine

Z. Chothia, J. Liagouris, D. Dimitrova, T. Roscoe. Online Reconstruction of Structural Information from Datacenter Logs. EuroSys 2017.
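
The pipeline on this slide includes a re-order buffer ahead of tree reconstruction. One common way to build such a buffer is to hold events in a min-heap keyed by timestamp and release those that are older than a watermark. The sketch below is a generic illustration of that idea, not the Timely Dataflow implementation from the paper. The class name and the `max_delay` parameter are hypothetical.

```python
import heapq


class ReorderBuffer:
    """Releases events in timestamp order once they are older than a watermark."""

    def __init__(self, max_delay):
        self.max_delay = max_delay  # hypothetical bound on how late an event may arrive
        self.heap = []              # min-heap of (timestamp, sequence, event)
        self.latest = None          # largest timestamp seen so far
        self.seq = 0                # tie-breaker so events themselves are never compared

    def push(self, timestamp, event):
        """Buffer one event and return the events that are now safe to emit."""
        heapq.heappush(self.heap, (timestamp, self.seq, event))
        self.seq += 1
        if self.latest is None or timestamp > self.latest:
            self.latest = timestamp
        watermark = self.latest - self.max_delay
        ready = []
        while self.heap and self.heap[0][0] <= watermark:
            ts, _, ev = heapq.heappop(self.heap)
            ready.append((ts, ev))
        return ready

    def flush(self):
        """Emit everything that is left, in timestamp order."""
        ready = []
        while self.heap:
            ts, _, ev = heapq.heappop(self.heap)
            ready.append((ts, ev))
        return ready


buf = ReorderBuffer(max_delay=2)
out = []
for ts, ev in [(1, "a"), (3, "c"), (2, "b"), (6, "d")]:
    out.extend(buf.push(ts, ev))
out.extend(buf.flush())
```

An event that arrives more than `max_delay` behind the latest timestamp would still be emitted out of order here, so a real deployment has to decide whether to drop, log, or repair such stragglers.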

Page 10:

COMPOSING TRACE ANALYTICS

Exploiting a general framework permits a simple, concise implementation in 1770 lines of code while seamlessly integrating with management applications.

Composition of analytic tasks:

• Online trace tree clustering

• Service dependency extraction

• Inferring call-graph patterns

real time

https://github.com/strymon-system/reconstruction
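
Of the analytic tasks listed on this slide, service dependency extraction is the simplest to illustrate. The sketch below assumes each reconstructed trace tree is available as a list of spans, where a span has an ID, a parent ID, and the name of the service that handled it. This span layout and the sample tree are hypothetical and are not taken from the linked repository.

```python
from collections import Counter


def service_dependencies(spans):
    """Count caller -> callee service pairs in one trace tree.

    `spans` is a list of (span_id, parent_id, service) tuples; the root
    span has parent_id None. This layout is an assumption for illustration.
    """
    service_of = {span_id: service for span_id, _, service in spans}
    deps = Counter()
    for span_id, parent_id, service in spans:
        if parent_id is None or parent_id not in service_of:
            continue  # root span, or parent not seen in this tree
        caller = service_of[parent_id]
        if caller != service:  # ignore calls that stay within one service
            deps[(caller, service)] += 1
    return deps


# Hypothetical tree: which span calls which is made up for the example.
tree = [
    ("A.1", None, "Application A"),
    ("A.2", "A.1", "Application A"),
    ("B.1", "A.2", "Application B"),
    ("B.2", "B.1", "Application B"),
    ("B.3", "A.1", "Application B"),
]
deps = service_dependencies(tree)
```

Because the result is a `Counter`, the counts of many trees can be merged with `+` to obtain a dependency graph for a whole stream.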

Page 11:

USE CASES

ONLINE TRACE TREE RECONSTRUCTION

ONLINE CRITICAL PATH ANALYSIS

Page 12:

DISTRIBUTED EXECUTION

[Figure: Services S1, S2, S3. Labels: task A, task B, task C, waiting, message.]

Page 13:

CONVENTIONAL PROFILING IN DISTRIBUTED EXECUTION

[Figure: S1, S2, S3. Labels: task A, task B, task C, waiting, message.]

Task A is the most time-consuming

Page 14:

CONVENTIONAL PROFILING IN DISTRIBUTED EXECUTION

[Figure: S1, S2, S3. Labels: task A, task B, task C, waiting, message.]

What if we optimize it?

Page 15:

CONVENTIONAL PROFILING IN DISTRIBUTED EXECUTION

[Figure: S1, S2, S3. Labels: task B, task C, waiting.]

No performance benefit for the parallel execution
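
The argument of the last three slides can be reproduced with a few lines of arithmetic. The sketch below uses made-up durations and a made-up dependency structure, since the slides give no numbers. It first builds a conventional profile (total time per task) and then checks what happens to the end-to-end time when task A is made twice as fast.

```python
def finish_time(durations, deps):
    """Earliest finish time of every activity in a dependency DAG.

    `durations` maps activity -> time; `deps` maps activity -> list of
    activities that must finish first. An activity starts as soon as all
    of its dependencies are done.
    """
    done = {}

    def visit(act):
        if act not in done:
            start = max((visit(d) for d in deps.get(act, [])), default=0)
            done[act] = start + durations[act]
        return done[act]

    for act in durations:
        visit(act)
    return done


# Hypothetical durations: task A runs on two services, B and C once each.
durations = {"A@S2": 4, "A@S3": 4, "B@S1": 6, "C@S1": 5}
deps = {"C@S1": ["A@S2", "A@S3", "B@S1"]}  # task C waits for all inputs

# Conventional profiling: total time per task, ignoring dependencies.
profile = {}
for act, d in durations.items():
    task = act.split("@")[0]
    profile[task] = profile.get(task, 0) + d

before = max(finish_time(durations, deps).values())

# What-if: make every instance of task A twice as fast.
faster = {a: (d / 2 if a.startswith("A@") else d) for a, d in durations.items()}
after = max(finish_time(faster, deps).values())
```

In this example task A has the largest total in `profile`, yet `before` and `after` are equal, because task C always waits for the slower task B.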

Page 16:

A PARALLEL EXECUTION

[Figure: W1, W2, W3. Labels: serialization, deserialization, waiting.]

No performance benefit for the parallel execution!

Conventional profiling can be misleading

Page 17:

CRITICAL PATH

[Figure: S1, S2, S3. Labels: a, b, c, d.]

Provides good candidates for optimization

Captures execution dependencies
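
A critical path can be computed as the longest chain of dependent activities in the execution graph. The sketch below is a minimal, offline version of that computation over a small dependency DAG. The activity names and durations are hypothetical, and the online, snapshot-based analysis discussed in the talk is more involved than this.

```python
def critical_path(durations, deps):
    """Return (length, path) of the longest dependency chain in a DAG."""
    best = {}  # activity -> (finish time, predecessor on the longest chain)

    def visit(act):
        if act not in best:
            finish, pred = 0, None
            for d in deps.get(act, []):
                t = visit(d)
                if t > finish:
                    finish, pred = t, d
            best[act] = (finish + durations[act], pred)
        return best[act][0]

    for act in durations:
        visit(act)
    # The activity that finishes last ends the critical path.
    end = max(best, key=lambda a: best[a][0])
    length = best[end][0]
    path = []
    while end is not None:
        path.append(end)
        end = best[end][1]
    return length, path[::-1]


# Hypothetical activities: "parse" and "lookup" run in parallel after "recv".
durations = {"recv": 1, "parse": 3, "lookup": 2, "reply": 1}
deps = {"parse": ["recv"], "lookup": ["recv"], "reply": ["parse", "lookup"]}
length, path = critical_path(durations, deps)
```

Only activities on `path` are candidates whose speed-up can shorten the end-to-end time. Speeding up `lookup` here would change nothing.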

Page 18:

ONLINE ANALYSIS OF TRACE SNAPSHOTS

[Figure labels: trace stream; trace snapshot construction; Online graph metrics; SnailTrail; performance summaries]

M. Hoffmann, A. Lattuada, J. Liagouris, V. Kalavri, D. Dimitrova, S. Wicki, Z. Chothia, T. Roscoe. SnailTrail: Generalising Critical Paths for the Online Analysis of Distributed Dataflows. NSDI 2018.
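
The snapshot step can be sketched as follows, assuming a snapshot is simply the set of trace events that fall within a fixed-length time window (the evaluation slides mention 8s snapshots). The event layout and function names are hypothetical and this is not SnailTrail's code. The per-snapshot summary here is a plain share of recorded time per activity type, not SnailTrail's critical-path-based metric.

```python
from collections import defaultdict


def snapshots(events, window):
    """Split (start, end, activity) events into fixed-length snapshots.

    An event that crosses a window boundary is cut so that each piece
    is attributed to the snapshot it overlaps.
    """
    per_snapshot = defaultdict(lambda: defaultdict(float))
    for start, end, activity in events:
        t = start
        while t < end:
            index = int(t // window)
            boundary = (index + 1) * window
            piece_end = min(end, boundary)
            per_snapshot[index][activity] += piece_end - t
            t = piece_end
    return per_snapshot


def summarize(per_snapshot):
    """Per snapshot, the fraction of recorded time spent in each activity."""
    summary = {}
    for index in sorted(per_snapshot):
        total = sum(per_snapshot[index].values())
        summary[index] = {a: d / total for a, d in per_snapshot[index].items()}
    return summary


events = [
    (0.0, 6.0, "processing"),
    (6.0, 10.0, "scheduling"),   # crosses the 8s boundary
    (10.0, 16.0, "processing"),
]
summary = summarize(snapshots(events, window=8.0))
```

Cutting events at window boundaries keeps every snapshot self-contained, so each one can be summarized as soon as its window closes.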

Page 19:

ONLINE PERFORMANCE SUMMARIES WITH SNAILTRAIL

[Figure: two panels, Conventional Profiling and SnailTrail Profiling. x-axis: Snapshot (0 to 15). y-axes: % weight; CP (0.0 to 0.8). Series: Processing, Scheduling. Diagram labels: DRIVER, W1, W2, W3.]

Apache Spark: Yahoo! Streaming Benchmark, 16 workers, 8s snapshots

See also: S. Venkataraman, A. Panda, K. Ousterhout, A. Ghodsi, M. J. Franklin, B. Recht, I. Stoica. Drizzle: Fast and Adaptable Stream Processing at Scale. SOSP 2017.

Page 20:

COMPARISON WITH CONVENTIONAL PROFILING

[Figure: two panels, Conventional Profiling and SnailTrail Profiling. x-axis: Snapshot (0 to 15). y-axes: % weight; CP (0.0 to 0.8). Series: Processing, Scheduling.]

Apache Spark: Yahoo! Streaming Benchmark, 16 workers, 8s snapshots

Conventional profiling ignores critical path dependencies

https://github.com/strymon-system/snailtrail

Page 21:

THE BIG PICTURE: UNDERSTANDING THE DATACENTER

[Figure labels: Datacenter; alerts, telemetry data, topology updates, …; trace streams; Distributed Streaming Dataflow System; queries, complex analytics, simulations, what-if analysis, …; policy enforcement, re-configuration, …]
