welcome to always bee tracing!...now we have n2 problems (one slow service bogs down everything,...



Post on 17-Aug-2020


Page 1

Welcome to Always Bee Tracing!

If you haven’t already, please clone the repository of your choice:

▸ Golang (into your $GOPATH): git clone git@github.com:honeycombio/tracing-workshop-go.git

▸ Node: git clone git@github.com:honeycombio/tracing-workshop-node.git

Please also accept your invites to the "Always Bee Tracing" Honeycomb team and our Slack channel.

Page 2

Always Bee Tracing

A Honeycomb Tracing workshop

Page 3

A bit of history

▸ We used to have "one thing" (monolithic application)

▸ Then we started to have "more things" (splitting monoliths into services)

▸ Now we have "yet more things", or even "Death Star" architectures (microservices, containers, serverless)

Page 4

A bit of history

▸ Now we have N² problems (one slow service bogs down everything, etc.)

▸ 2010 - Google releases the Dapper paper describing how they improve on existing tracing systems

▸ Key innovations: use of sampling, common client libraries decoupling app code from tracing logic

Page 5

Why should GOOG have all the fun?

▸ 2012 - Zipkin was developed at Twitter for use with Thrift RPC

▸ 2015 - Uber releases Jaeger (also OpenTracing)

▸ Better sampling story, better client libraries, no Scribe/Kafka

▸ Various proprietary systems abound

▸ 2019 - Honeycomb is the best available due to best-in-class queries ;)

Page 6

A word on standards

▸ Standards for tracing exist: OpenTracing, OpenCensus, etc.

▸ Pros: Collaboration, preventing vendor lock-in

▸ Cons: Slower innovation, political battles/drama

▸ Honeycomb has integrations to bridge standard formats with the Honeycomb event model

Page 7

How Honeycomb fits in

Understand how your production systems are behaving, right now

QUERY BUILDER

INTERACTIVE VISUALS | RAW DATA | TRACES | BUBBLEUP + OUTLIERS

BEELINES (AUTOMATIC INSTRUMENTATION + TRACING APIS)

DATA STORE: High Cardinality Data | High Dimensionality Data | Efficient storage

Page 8

Tracing is…

▸ For software engineers who need to understand their code

▸ Better when visualized (preferably first in aggregate)

▸ Best when layered on top of existing data streams (rather than adding another data silo to your toolkit)

Page 9

Page 10

Instrumentation (and tracing) should evolve alongside your code

Page 11

Our path today

▸ Establish a baseline: send simple events

▸ Customize: enrich with custom fields and extend into traces

▸ Explore: learn to query a collection of traces, to find the most interesting one

Page 12

[Architecture diagram]

USER

WALL

ANALYSIS: our (second) service

TWITTER.COM: a third-party dependency

(LAMBDA FN: PERSIST): a black-box service

Page 13

EXERCISE: Run the wall service

Go: go run ./wall.go

Node: node ./wall.js

‣ Open up http://localhost:8080 in your browser and post some messages to your wall.

‣ Try writing messages like these:

‣ "hello #test #hashtag"

‣ "seems @twitteradmin isn’t a valid username but @honeycombio is"

Page 14

→ let’s see what we’ve got

Page 15

Node

Go

Page 16

→ let’s see what we’ve got

Page 17

Custom Instrumentation

▸ Identify metadata that will help you isolate unexpected behavior in custom logic:

▸ Bits about your infrastructure (e.g. which host)

▸ Bits about your deploy (e.g. which version/build, which feature flags)

▸ Bits about your business (e.g. which customer, which shopping cart)

▸ Bits about your execution (e.g. payload characteristics, sub-timers)

Page 18

EXERCISE: Find Checkpoint 1

Go

Node

Page 19

→ let’s see what we’ve got

Page 20

trace.trace_id: The ID of the trace this span belongs to

trace.span_id: A unique ID for each span

trace.parent_id: The ID of this span’s parent span, the call location the current span was called from

service_name: The name of the service that generated this span

name: The specific call location (like a function or method name)

duration_ms: How much time the span took, in milliseconds

EVENT ID: A

EVENT ID: B, PARENTID: A

EVENT ID: C, PARENTID: B

TRACE 1

Page 21

EVENT ID: A

EVENT ID: B, PARENTID: A

EVENT ID: C, PARENTID: B

TRACE 1

Page 22

EXERCISE: Find Checkpoint 2

‣ Try writing messages like these:

‣ "seems @twitteradmin isn’t a valid username but @honeycombio is"

‣ "have you tried @honeycombio for @mysql #observability?"

Page 23

→ let’s see what we’ve got

Page 24

Our first, simple trace

Page 25

→ let’s see what we’ve got

Page 26

Checkpoint 2 Takeaways

▸ Events can be used to trace across functions within a service just as easily as they can be used for "distributed" tracing across services

▸ Store useful metadata on any event in a trace — and query against it!

▸ To aggregate per trace, filter to trace.parent_id does-not-exist (or break down by unique trace.trace_id values)
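The last takeaway can be sketched in Go: counting spans overcounts traces, while either keeping only spans with no parent ID or counting distinct trace IDs gives one result per trace. The `Span` type and sample data below are assumptions for illustration, and the sketch assumes every trace has exactly one root span in the data set.

```go
package main

import "fmt"

// Span holds only the fields needed for this illustration.
type Span struct {
	TraceID, SpanID, ParentID string
	DurationMs                float64
}

// rootSpans keeps spans whose parent ID does not exist, one per trace.
func rootSpans(spans []Span) []Span {
	var roots []Span
	for _, s := range spans {
		if s.ParentID == "" {
			roots = append(roots, s)
		}
	}
	return roots
}

// distinctTraceIDs counts unique trace IDs across all spans.
func distinctTraceIDs(spans []Span) int {
	seen := make(map[string]struct{})
	for _, s := range spans {
		seen[s.TraceID] = struct{}{}
	}
	return len(seen)
}

// sampleSpans returns two made-up traces with five spans in total.
func sampleSpans() []Span {
	return []Span{
		{TraceID: "t1", SpanID: "A", DurationMs: 30},
		{TraceID: "t1", SpanID: "B", ParentID: "A", DurationMs: 20},
		{TraceID: "t1", SpanID: "C", ParentID: "B", DurationMs: 15},
		{TraceID: "t2", SpanID: "D", DurationMs: 8},
		{TraceID: "t2", SpanID: "E", ParentID: "D", DurationMs: 3},
	}
}

func main() {
	spans := sampleSpans()
	fmt.Println("spans:", len(spans))
	fmt.Println("traces via root spans:", len(rootSpans(spans)))
	fmt.Println("traces via distinct trace IDs:", distinctTraceIDs(spans))
}
```

Filtering to root spans has the extra benefit that the root's duration_ms covers the whole request, so latency aggregates over root spans are per-request figures.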

Page 27

EXERCISE: ID sources of latency

▸ Who’s experienced the longest delay when talking to Twitter?

▸ Hint: app.username, MAX(duration_ms), and name = check_twitter

▸ Who’s responsible for the most cumulative time talking to Twitter?

▸ Hint: Use SUM(duration_ms) instead
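For anyone unsure what the two hinted queries compute, here is a Go sketch of the same arithmetic done by hand: filter to name = check_twitter, group by app.username, then take MAX and SUM of duration_ms. The types, function name, and sample numbers are invented for illustration; the real answer comes from querying your own data.

```go
package main

import (
	"fmt"
	"sort"
)

// Event holds the three fields the hints refer to.
type Event struct {
	Name, Username string
	DurationMs     float64
}

// Stats is the per-user result of MAX(duration_ms) and SUM(duration_ms).
type Stats struct {
	Max, Sum float64
}

// twitterStatsByUser filters to name = check_twitter and groups by username.
func twitterStatsByUser(events []Event) map[string]Stats {
	out := make(map[string]Stats)
	for _, e := range events {
		if e.Name != "check_twitter" {
			continue
		}
		s := out[e.Username]
		if e.DurationMs > s.Max {
			s.Max = e.DurationMs
		}
		s.Sum += e.DurationMs
		out[e.Username] = s
	}
	return out
}

// sampleEvents returns made-up data where MAX and SUM point at different users.
func sampleEvents() []Event {
	return []Event{
		{"check_twitter", "alice", 900},
		{"check_twitter", "bob", 300},
		{"check_twitter", "bob", 400},
		{"check_twitter", "bob", 350},
		{"analyze", "alice", 5000}, // ignored: not a check_twitter span
	}
}

func main() {
	stats := twitterStatsByUser(sampleEvents())
	users := make([]string, 0, len(stats))
	for u := range stats {
		users = append(users, u)
	}
	sort.Strings(users)
	for _, u := range users {
		fmt.Printf("%s: MAX=%.0f ms SUM=%.0f ms\n", u, stats[u].Max, stats[u].Sum)
	}
}
```

In this sample the single slowest call belongs to one user while the largest cumulative time belongs to another, which is why the exercise asks the two questions separately.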

Page 28

[Architecture diagram]

USER

WALL

ANALYSIS: our (second) service

TWITTER.COM: a third-party dependency

Page 29

EXERCISE: Run the analysis service

Go: go run ./analysis.go

Node: node ./analysis.js

‣ Open up http://localhost:8080 in your browser and post some messages to your wall.

‣ Try these:

‣ "everything is awesome!"

‣ "the sky is dark and gloomy and #winteriscoming"

Page 30

→ let’s see what we’ve got

Page 31

EXERCISE: Find Checkpoint 3

Go

Node

Page 32

→ let’s see what we’ve got

Page 33

Checkpoint 3 Takeaways

▸ Tracing across services just requires serialization of tracing context over the wire

▸ Wrapping outbound HTTP requests is a simple form of tracing dependencies

Page 34

Stretch Break

Page 35

Mosey back to seats, please :)

Page 36

[Architecture diagram]

USER

WALL

ANALYSIS: our (second) service

TWITTER.COM: a third-party dependency

(LAMBDA FN: PERSIST): a black-box service

Page 37

EXERCISE: Find Checkpoint 4

Go

Node

Page 38

→ let’s see what we’ve got

Page 39

Checkpoint 4 Takeaways

▸ Working with a black box? Instrument from the perspective of the code you can control.

▸ Similar to identifying test cases in TDD: capture fields to let you refine your understanding of the system.

Page 40

EXERCISE: Who’s knocking over my black box?

▸ First: what does "knocking over" mean? We know that we talk to our black box via an HTTP call. What are our signals of health?

▸ What’s the "usual worst" latency for this call out to AWS? (Explore different calculations: P95 = 95th percentile, MAX, HEATMAP)

▸ Hint: P95(duration_ms), and request.host contains aws

Page 41

Puzzle Time

Page 42

Scenario #1

Symptoms: we pulled in that last POST in order to persist messages somewhere, but we’re hearing from customer support that behavior has felt buggy lately, like it works sometimes but not always. What’s going on?

Think about:

‣ Verify this claim. Are we sure persist has been flaky? What does failure look like?

‣ Look through all of the metadata we have to try and find some correlation across those failing requests.

response.status_code, request.content_length; HEATMAPs are great :)

Page 43

Scenario #2

Symptoms: everything feels slowed down, but more importantly the persistence behavior seems completely broken. What gives?

Think about:

‣ What might failure mean in this case?

‣ Once you’ve figured out what these failures look like, can we do anything to stop the bleeding? What might we need to find out to answer that question?

response.status_code, app.username

Page 44

Scenario #3

Symptoms: persistence seems fine, but all requests seem to have slowed down to a snail’s pace. What could be impacting our overall latency so badly?

Prompts:

‣ Hint! Think about adding a num_hashtags or num_handles field to your events if you’d like to capture more about the characteristics of your payload.

‣ It may be helpful to zoom in (aka add a filter) to just requests talking to amazonaws.com

response.status_code, request.host contains aws
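The num_hashtags and num_handles hint could be sketched in Go as below: count the "#word" and "@word" tokens in a message and return them as fields ready to attach to an event. The regular expressions and the `payloadFields` helper are assumptions for illustration; the workshop code may parse messages differently.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	hashtagRe = regexp.MustCompile(`#\w+`) // assumed shape of a hashtag
	handleRe  = regexp.MustCompile(`@\w+`) // assumed shape of a Twitter handle
)

// payloadFields returns payload characteristics to add to an event.
func payloadFields(message string) map[string]int {
	return map[string]int{
		"num_hashtags": len(hashtagRe.FindAllString(message, -1)),
		"num_handles":  len(handleRe.FindAllString(message, -1)),
	}
}

func main() {
	msg := "have you tried @honeycombio for @mysql #observability?"
	fields := payloadFields(msg)
	fmt.Println("num_hashtags:", fields["num_hashtags"])
	fmt.Println("num_handles:", fields["num_handles"])
}
```

Once these counts are on every event, you can break down or heatmap duration_ms by them to check whether latency grows with the number of handles or hashtags in a message.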

Page 45

Thank you & Office Hours