Welcome to Always Bee Tracing! · 2020-07-03


Post on 30-Jul-2020


Welcome to Always Bee Tracing!

If you haven’t already, please clone the repository of your choice:

▸ Golang (into your $GOPATH): git clone git@github.com:honeycombio/tracing-workshop-go.git

▸ Node: git clone git@github.com:honeycombio/tracing-workshop-node.git

Please also accept your invites to the "Always Bee Tracing" Honeycomb team and our Slack channel.

Always Bee Tracing

A Honeycomb Tracing workshop

A bit of history

▸ We used to have "one thing" (monolithic application)

▸ Then we started to have "more things" (splitting monoliths into services)

▸ Now we have "yet more things", or even "Death Star" architectures (microservices, containers, serverless)

A bit of history

▸ Now we have N² problems (one slow service bogs down everything, etc.)

▸ 2010 - Google releases the Dapper paper describing how they improve on existing tracing systems

▸ Key innovations: use of sampling, common client libraries decoupling app code from tracing logic

Why should GOOG have all the fun?

▸ 2012 - Zipkin was developed at Twitter for use with Thrift RPC

▸ 2015 - Uber releases Jaeger (also OpenTracing)

▸ Better sampling story, better client libraries, no Scribe/Kafka

▸ Various proprietary systems abound

▸ 2019 - Honeycomb is the best available due to best-in-class queries ;)

A word on standards

▸ Standards for tracing exist: OpenTracing, OpenCensus, etc.

▸ Pros: Collaboration, preventing vendor lock-in

▸ Cons: Slower innovation, political battles/drama

▸ Honeycomb has integrations to bridge standard formats with the Honeycomb event model

How Honeycomb fits in

Understand how your production systems are behaving, right now

QUERY BUILDER

INTERACTIVE VISUALS | RAW DATA | TRACES | BUBBLEUP + OUTLIERS

BEELINES (AUTOMATIC INSTRUMENTATION + TRACING APIS)

DATA STORE: High Cardinality Data | High Dimensionality Data | Efficient storage

Tracing is…

▸ For software engineers who need to understand their code

▸ Better when visualized (preferably first in aggregate)

▸ Best when layered on top of existing data streams (rather than adding another data silo to your toolkit)

Instrumentation (and tracing) should evolve alongside your code

Our path today

▸ Establish a baseline: send simple events

▸ Customize: enrich with custom fields and extend into traces

▸ Explore: learn to query a collection of traces, to find the most interesting one

[Architecture diagram]

USER

WALL

ANALYSIS: our (second) service

TWITTER.COM: a third-party dependency

(LAMBDA FN: PERSIST): a black-box service

EXERCISE: Run the wall service

Go: go run ./wall.go

Node: node ./wall.js

‣ Open up http://localhost:8080 in your browser and post some messages to your wall.

‣ Try writing messages like these:

‣ "hello #test #hashtag"

‣ "seems @twitteradmin isn’t a valid username but @honeycombio is"

→ let’s see what we’ve got

Node

Go

→ let’s see what we’ve got

Custom Instrumentation

▸ Identify metadata that will help you isolate unexpected behavior in custom logic:

▸ Bits about your infrastructure (e.g. which host)

▸ Bits about your deploy (e.g. which version/build, which feature flags)

▸ Bits about your business (e.g. which customer, which shopping cart)

▸ Bits about your execution (e.g. payload characteristics, sub-timers)
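
The four kinds of metadata above can be sketched as fields on a single event. This is a minimal, standard-library-only illustration: the Event type, the annotate helper, and every field name and value are assumptions made up for this example, not the workshop code or the Beeline API.

```go
package main

import (
	"fmt"
	"os"
)

// Event is a hypothetical stand-in for a tracing library's event type:
// a flat bag of named fields describing one unit of work.
type Event struct {
	Fields map[string]interface{}
}

// NewEvent returns an empty event ready to receive fields.
func NewEvent() *Event {
	return &Event{Fields: make(map[string]interface{})}
}

// AddField attaches one piece of metadata to the event.
func (e *Event) AddField(key string, value interface{}) {
	e.Fields[key] = value
}

// annotate adds one example field per category from the list above.
// All field names here are illustrative assumptions.
func annotate(ev *Event, buildID, customer string, payload []byte, parseMs float64) {
	host, _ := os.Hostname() // best effort; empty string on failure

	ev.AddField("infra.hostname", host)                // infrastructure: which host
	ev.AddField("deploy.build_id", buildID)            // deploy: which version/build
	ev.AddField("app.customer", customer)              // business: which customer
	ev.AddField("request.payload_bytes", len(payload)) // execution: payload characteristics
	ev.AddField("timers.parse_ms", parseMs)            // execution: a sub-timer
}

func main() {
	ev := NewEvent()
	annotate(ev, "build-123", "acme", []byte("hello #test #hashtag"), 1.5)
	fmt.Println(len(ev.Fields), "fields recorded")
}
```

Keeping every field on the same event is what later lets you break down or filter by any of them in one query.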

EXERCISE: Find Checkpoint 1

Go

Node

→ let’s see what we’ve got

trace.trace_id: The ID of the trace this span belongs to

trace.span_id: A unique ID for each span

trace.parent_id: The ID of this span’s parent span, the call location the current span was called from

service_name: The name of the service that generated this span

name: The specific call location (like a function or method name)

duration_ms: How much time the span took, in milliseconds

TRACE 1

EVENT ID: A

EVENT ID: B, PARENTID: A

EVENT ID: C, PARENTID: B
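
To make the parent/child relationship concrete, here is a small sketch that rebuilds the A → B → C tree from nothing but span IDs and parent IDs. The Span struct, the helper names, and the span names ("root", "child", "grandchild") are assumptions for illustration, not part of any tracing library.

```go
package main

import (
	"fmt"
	"strings"
)

// Span mirrors the fields listed above. The struct itself is an
// illustrative assumption, not a type from a tracing library.
type Span struct {
	TraceID    string // trace.trace_id
	SpanID     string // trace.span_id
	ParentID   string // trace.parent_id; empty for the root span
	Name       string // name
	DurationMs float64
}

// exampleTrace returns the three events from TRACE 1.
func exampleTrace() []Span {
	return []Span{
		{TraceID: "1", SpanID: "A", ParentID: "", Name: "root"},
		{TraceID: "1", SpanID: "B", ParentID: "A", Name: "child"},
		{TraceID: "1", SpanID: "C", ParentID: "B", Name: "grandchild"},
	}
}

// childrenByParent indexes spans by parent ID so the tree can be walked.
func childrenByParent(spans []Span) map[string][]Span {
	idx := make(map[string][]Span)
	for _, s := range spans {
		idx[s.ParentID] = append(idx[s.ParentID], s)
	}
	return idx
}

// render returns one indented line per span, depth-first from parentID.
func render(idx map[string][]Span, parentID string, depth int) []string {
	var lines []string
	for _, s := range idx[parentID] {
		indent := strings.Repeat(" ", depth*2)
		lines = append(lines, fmt.Sprintf("%s%s (%s)", indent, s.SpanID, s.Name))
		lines = append(lines, render(idx, s.SpanID, depth+1)...)
	}
	return lines
}

func main() {
	// Start from the empty parent ID, i.e. the root span.
	for _, line := range render(childrenByParent(exampleTrace()), "", 0) {
		fmt.Println(line)
	}
}
```

Nothing else is needed to draw a trace: the events stay flat, and the tree is recovered at read time from the two ID fields.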

EXERCISE: Find Checkpoint 2

‣ Try writing messages like these:

‣ "seems @twitteradmin isn’t a valid username but @honeycombio is"

‣ "have you tried @honeycombio for @mysql #observability?"

→ let’s see what we’ve got

Our first, simple trace

→ let’s see what we’ve got

Checkpoint 2 Takeaways

▸ Events can be used to trace across functions within a single service just as easily as across "distributed" services

▸ Store useful metadata on any event in a trace — and query against it!

▸ To aggregate per trace, filter to trace.parent_id does-not-exist (or break down by unique trace.trace_id values)
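
The per-trace aggregation tip can be sketched outside the query builder as well. The example below assumes a simplified Event struct and made-up sample data; it shows that keeping only events without a parent ID and counting distinct trace IDs both yield one result per trace.

```go
package main

import "fmt"

// Event is a simplified, hypothetical event carrying only the trace fields.
type Event struct {
	TraceID    string // trace.trace_id
	ParentID   string // trace.parent_id; empty means "does not exist"
	DurationMs float64
}

// sampleEvents returns two made-up traces: t1 with three spans, t2 with two.
func sampleEvents() []Event {
	return []Event{
		{TraceID: "t1", ParentID: "", DurationMs: 120},
		{TraceID: "t1", ParentID: "a", DurationMs: 40},
		{TraceID: "t1", ParentID: "a", DurationMs: 60},
		{TraceID: "t2", ParentID: "", DurationMs: 300},
		{TraceID: "t2", ParentID: "b", DurationMs: 280},
	}
}

// rootSpans keeps only events with no parent ID: exactly one per trace.
func rootSpans(events []Event) []Event {
	var roots []Event
	for _, e := range events {
		if e.ParentID == "" {
			roots = append(roots, e)
		}
	}
	return roots
}

// distinctTraces counts unique trace IDs across all events.
func distinctTraces(events []Event) int {
	seen := make(map[string]struct{})
	for _, e := range events {
		seen[e.TraceID] = struct{}{}
	}
	return len(seen)
}

func main() {
	events := sampleEvents()
	fmt.Println("events:", len(events))
	fmt.Println("root spans:", len(rootSpans(events)))
	fmt.Println("distinct trace IDs:", distinctTraces(events))
}
```

Aggregating over all events would weight a trace by how many spans it has; filtering to root spans first avoids that.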

EXERCISE: ID sources of latency

▸ Who’s experienced the longest delay when talking to Twitter?

▸ Hint: app.username, MAX(duration_ms), and name = check_twitter

▸ Who’s responsible for the most cumulative time talking to Twitter?

▸ Hint: Use SUM(duration_ms) instead
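
For intuition about what those two queries compute, here is a rough in-memory equivalent: filter to name = check_twitter, group by username, and track MAX and SUM of duration_ms. The Event struct, the usernames, and the durations are invented for this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// Event is a simplified, hypothetical span with the fields the hints mention.
type Event struct {
	Name       string  // name
	Username   string  // app.username
	DurationMs float64 // duration_ms
}

// Stats holds the two aggregates from the exercise.
type Stats struct {
	Max float64 // MAX(duration_ms)
	Sum float64 // SUM(duration_ms)
}

// sampleEvents returns made-up data: one user with one slow call,
// another with several faster calls.
func sampleEvents() []Event {
	return []Event{
		{Name: "check_twitter", Username: "alice", DurationMs: 100},
		{Name: "check_twitter", Username: "alice", DurationMs: 250},
		{Name: "check_twitter", Username: "bob", DurationMs: 300},
		{Name: "analyze", Username: "bob", DurationMs: 900}, // filtered out
	}
}

// twitterLatencyByUser filters to name = check_twitter and groups by username.
func twitterLatencyByUser(events []Event) map[string]Stats {
	out := make(map[string]Stats)
	for _, e := range events {
		if e.Name != "check_twitter" {
			continue
		}
		s := out[e.Username]
		if e.DurationMs > s.Max {
			s.Max = e.DurationMs
		}
		s.Sum += e.DurationMs
		out[e.Username] = s
	}
	return out
}

func main() {
	stats := twitterLatencyByUser(sampleEvents())

	// Sort usernames so the output order is stable.
	users := make([]string, 0, len(stats))
	for u := range stats {
		users = append(users, u)
	}
	sort.Strings(users)

	for _, u := range users {
		fmt.Printf("%s max=%.0fms sum=%.0fms\n", u, stats[u].Max, stats[u].Sum)
	}
}
```

In this sample the two questions have different answers: bob has the single longest call, while alice accounts for more total time.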

[Architecture diagram]

USER

WALL

ANALYSIS: our (second) service

TWITTER.COM: a third-party dependency

EXERCISE: Run the analysis service

Go: go run ./analysis.go

Node: node ./analysis.js

‣ Open up http://localhost:8080 in your browser and post some messages to your wall.

‣ Try these:

‣ "everything is awesome!"

‣ "the sky is dark and gloomy and #winteriscoming"

→ let’s see what we’ve got

EXERCISE: Find Checkpoint 3

Go

Node

→ let’s see what we’ve got

Checkpoint 3 Takeaways

▸ Tracing across services just requires serialization of tracing context over the wire

▸ Wrapping outbound HTTP requests is a simple form of tracing dependencies
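
As a rough picture of "serialization of tracing context over the wire", the sketch below packs a trace ID and parent span ID into one HTTP header on the caller side and unpacks it on the receiving side. The header name X-Trace-Context, the "trace;parent" format, and all helper names are assumptions for this example; the Beelines and other tracing libraries define their own header formats.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// traceHeader is a hypothetical header name chosen for this sketch.
const traceHeader = "X-Trace-Context"

// TraceContext is the minimum a downstream service needs to join a trace.
type TraceContext struct {
	TraceID  string
	ParentID string // span ID of the caller, which becomes the callee's parent
}

// Encode serializes the context into a single header value.
func (tc TraceContext) Encode() string {
	return tc.TraceID + ";" + tc.ParentID
}

// Decode parses a header value produced by Encode.
func Decode(v string) (TraceContext, error) {
	parts := strings.Split(v, ";")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return TraceContext{}, errors.New("malformed trace context")
	}
	return TraceContext{TraceID: parts[0], ParentID: parts[1]}, nil
}

// callDownstream sends a request carrying the caller's trace context.
func callDownstream(url string, tc TraceContext) (string, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set(traceHeader, tc.Encode())
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	// A stand-in downstream service that continues the caller's trace.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tc, err := Decode(r.Header.Get(traceHeader))
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		fmt.Fprintf(w, "continuing trace %s under parent %s", tc.TraceID, tc.ParentID)
	}))
	defer srv.Close()

	out, err := callDownstream(srv.URL, TraceContext{TraceID: "trace-1", ParentID: "span-A"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

The receiving service then stamps the decoded trace ID and parent ID onto its own spans, which is what stitches the two services into one trace.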

Stretch Break

Mosey back to seats, please :)

[Architecture diagram]

USER

WALL

ANALYSIS: our (second) service

TWITTER.COM: a third-party dependency

(LAMBDA FN: PERSIST): a black-box service

EXERCISE: Find Checkpoint 4

Go

Node

→ let’s see what we’ve got

Checkpoint 4 Takeaways

▸ Working with a black box? Instrument from the perspective of the code you can control.

▸ Similar to identifying test cases in TDD: capture fields to let you refine your understanding of the system.
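
One way to instrument a black box from the caller's side is to wrap the HTTP transport and record what you can observe about each outbound call. The sketch below is an assumption-laden illustration: recordingTransport, callBlackBox, and the URL are hypothetical, and a stub stands in for the real black-box service so the example runs without a network. The field names follow the ones used in the exercises (request.host, request.content_length, response.status_code, duration_ms).

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// roundTripFunc lets a plain function act as an http.RoundTripper.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(req *http.Request) (*http.Response, error) { return f(req) }

// recordingTransport wraps another transport and reports fields for each call.
type recordingTransport struct {
	next   http.RoundTripper
	record func(fields map[string]interface{})
}

func (t recordingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now()
	resp, err := t.next.RoundTrip(req)

	fields := map[string]interface{}{
		"request.host":           req.URL.Host,
		"request.content_length": req.ContentLength,
		"duration_ms":            float64(time.Since(start)) / float64(time.Millisecond),
	}
	if err != nil {
		fields["error"] = err.Error()
	} else {
		fields["response.status_code"] = resp.StatusCode
	}
	t.record(fields)
	return resp, err
}

// callBlackBox POSTs payload through the recording transport and returns the
// captured fields. The stub transport and the URL are stand-ins for the real
// black-box service.
func callBlackBox(status int, payload string) (map[string]interface{}, error) {
	stub := roundTripFunc(func(req *http.Request) (*http.Response, error) {
		return &http.Response{
			StatusCode: status,
			Header:     make(http.Header),
			Body:       io.NopCloser(strings.NewReader("")),
			Request:    req,
		}, nil
	})

	var captured map[string]interface{}
	client := &http.Client{Transport: recordingTransport{
		next:   stub,
		record: func(f map[string]interface{}) { captured = f },
	}}

	resp, err := client.Post("https://blackbox.example.invalid/persist", "text/plain", strings.NewReader(payload))
	if err != nil {
		return nil, err
	}
	resp.Body.Close()
	return captured, nil
}

func main() {
	fields, err := callBlackBox(500, "hello #test")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(fields["request.host"], fields["response.status_code"], fields["request.content_length"])
}
```

In real code the next transport would be http.DefaultTransport and the record callback would add the fields to the current span instead of a local variable.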

EXERCISE: Who’s knocking over my black box?

▸ First: what does "knocking over" mean? We know that we talk to our black box via an HTTP call. What are our signals of health?

▸ What’s the "usual worst" latency for this call out to AWS? (Explore different calculations: P95 = 95th percentile, MAX, HEATMAP)

▸ Hint: P95(duration_ms), and request.host contains aws
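
To see why P95 is a better "usual worst" than MAX, here is a small sketch that filters events to hosts containing "aws" and computes a percentile with the nearest-rank method. The Event struct, the host names, and the choice of nearest-rank are assumptions for illustration; Honeycomb's own P95 calculation may use a different estimation method.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Event is a simplified, hypothetical span for an outbound HTTP call.
type Event struct {
	Host       string  // request.host
	DurationMs float64 // duration_ms
}

// durationsForHost returns durations of events whose host contains substr,
// mirroring a "request.host contains aws" filter.
func durationsForHost(events []Event, substr string) []float64 {
	var out []float64
	for _, e := range events {
		if strings.Contains(e.Host, substr) {
			out = append(out, e.DurationMs)
		}
	}
	return out
}

// percentile returns the p-th percentile (0-100) using the nearest-rank
// method: the smallest value with at least p% of the values at or below it.
// It returns 0 for empty input.
func percentile(values []float64, p int) float64 {
	if len(values) == 0 {
		return 0
	}
	sorted := append([]float64(nil), values...) // copy so the caller's slice is untouched
	sort.Float64s(sorted)

	rank := (p*len(sorted) + 99) / 100 // integer ceiling of p*n/100
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Made-up data: 19 ordinary calls to a hypothetical AWS host, one extreme
	// outlier, and one call to a different host that the filter excludes.
	var events []Event
	for i := 1; i <= 19; i++ {
		events = append(events, Event{Host: "lambda.example.amazonaws.com", DurationMs: float64(i * 10)})
	}
	events = append(events, Event{Host: "lambda.example.amazonaws.com", DurationMs: 5000})
	events = append(events, Event{Host: "api.twitter.com", DurationMs: 9000})

	d := durationsForHost(events, "aws")
	fmt.Printf("calls=%d P95=%.0fms MAX=%.0fms\n", len(d), percentile(d, 95), percentile(d, 100))
}
```

A single outlier dominates MAX but leaves P95 close to what most slow requests look like, which is why the slide suggests comparing the two (and a HEATMAP) rather than picking one.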

Puzzle Time

Scenario #1

Symptoms: we pulled in that last POST in order to persist messages somewhere, but we’re hearing from customer support that behavior has felt buggy lately — like it works sometimes but not always. What’s going on?

Think about:

‣ Verify this claim. Are we sure persist has been flaky? What does failure look like?

‣ Look through all of the metadata we have to try and find some correlation across those failing requests.

response.status_code

request.content_length

HEATMAPs are great :)

Scenario #2

Symptoms: everything feels slowed down, but more importantly the persistence behavior seems completely broken. What gives?

Think about:

‣ What might failure mean in this case?

‣ Once you’ve figured out what these failures look like, can we do anything to stop the bleeding? What might we need to find out to answer that question?

response.status_code

app.username

Scenario #3

Symptoms: persistence seems fine, but all requests seem to have slowed down to a snail’s pace. What could be impacting our overall latency so badly?

Prompts:

‣ Hint! Think about adding a num_hashtags or num_handles field to your events if you’d like to capture more about the characteristics of your payload.

‣ It may be helpful to zoom in (aka add a filter) to just requests talking to amazonaws.com

response.status_code

request.host contains aws
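
If you want to try the num_hashtags / num_handles hint, something like the following could compute the two values before you add them to an event. The regular expressions and the payloadFields helper are simplifying assumptions; the workshop code may detect hashtags and handles differently.

```go
package main

import (
	"fmt"
	"regexp"
)

// Simple patterns assumed for this sketch: "#" or "@" followed by word characters.
var (
	hashtagRe = regexp.MustCompile(`#\w+`)
	handleRe  = regexp.MustCompile(`@\w+`)
)

// payloadFields counts hashtags and handles in a message, keyed by the
// field names suggested in the hint.
func payloadFields(message string) map[string]int {
	return map[string]int{
		"num_hashtags": len(hashtagRe.FindAllString(message, -1)),
		"num_handles":  len(handleRe.FindAllString(message, -1)),
	}
}

func main() {
	msg := "have you tried @honeycombio for @mysql #observability?"
	fields := payloadFields(msg)
	fmt.Println("num_hashtags:", fields["num_hashtags"])
	fmt.Println("num_handles:", fields["num_handles"])
}
```

With these on every event, you can break down or heatmap latency by payload shape and see whether slow requests share a characteristic.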

Thank you & Office Hours