
Welcome to Always Bee Tracing!

If you haven’t already, please clone the repository of your choice:

▸ Golang (into your $GOPATH): git clone git@github.com:honeycombio/tracing-workshop-go.git

▸ Node: git clone git@github.com:honeycombio/tracing-workshop-node.git

Please also accept your invites to the "Always Bee Tracing" Honeycomb team and our Slack channel.

Always Bee Tracing: A Honeycomb Tracing workshop

A bit of history

▸ We used to have "one thing" (monolithic application)

▸ Then we started to have "more things" (splitting monoliths into services)

▸ Now we have "yet more things", or even "Death Star" architectures (microservices, containers, serverless)

A bit of history

▸ Now we have N² problems (one slow service bogs down everything, etc.)

▸ 2010 - Google releases the Dapper paper describing how it improved on existing tracing systems

▸ Key innovations: use of sampling, and common client libraries that decouple app code from tracing logic

Why should GOOG have all the fun?

▸ 2012 - Zipkin was developed at Twitter for use with Thrift RPC

▸ 2015 - Uber releases Jaeger (also OpenTracing)

▸ Better sampling story, better client libraries, no Scribe/Kafka

▸ Various proprietary systems abound

▸ 2019 - Honeycomb is the best available due to best-in-class queries ;)

A word on standards

▸ Standards for tracing exist: OpenTracing, OpenCensus, etc.

▸ Pros: Collaboration, preventing vendor lock-in

▸ Cons: Slower innovation, political battles/drama

▸ Honeycomb has integrations to bridge standard formats with the Honeycomb event model

How Honeycomb fits in

Understand how your production systems are behaving, right now

(Diagram: Query Builder; interactive visuals, raw data, traces, BubbleUp + outliers; Beelines (automatic instrumentation + tracing APIs); a data store built for high-cardinality data, high-dimensionality data, and efficient storage.)

Tracing is…

▸ For software engineers who need to understand their code

▸ Better when visualized (preferably first in aggregate)

▸ Best when layered on top of existing data streams (rather than adding another data silo to your toolkit)

Instrumentation (and tracing) should evolve alongside your code

Our path today

▸ Establish a baseline: send simple events (a minimal sketch follows this list)

▸ Customize: enrich with custom fields and extend into traces

▸ Explore: learn to query a collection of traces, to find the most interesting one
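For the baseline step, here is a minimal sketch of sending one simple event with Honeycomb's libhoney-go library. The write key, dataset name, and fields below are placeholders, and the workshop repos likely use the Beelines, which do this wiring for you:

package main

import (
	libhoney "github.com/honeycombio/libhoney-go"
)

func main() {
	// Point the client at your team/dataset (placeholder credentials).
	libhoney.Init(libhoney.Config{
		WriteKey: "YOUR_WRITE_KEY",
		Dataset:  "always-bee-tracing",
	})
	defer libhoney.Close() // flush any pending events on exit

	// Build and send one event with a couple of fields.
	ev := libhoney.NewEvent()
	ev.AddField("message", "hello #test #hashtag")
	ev.AddField("duration_ms", 42)
	ev.Send()
}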

(Architecture diagram: the WALL service and its USER; the ANALYSIS service, our second service; TWITTER.COM, a third-party dependency; and a Lambda function (PERSIST), a black-box service.)

EXERCISE: Run the wall service

Go: go run ./wall.go

‣ Open up http://localhost:8080 in your browser and post some messages to your wall.

‣ Try writing messages like these:

‣ "hello #test #hashtag"

‣ "seems @twitteradmin isn’t a valid username but @honeycombio is"

Node: node ./wall.js

→ let’s see what we’ve got


Custom Instrumentation

▸ Identify metadata that will help you isolate unexpected behavior in custom logic (a sketch follows this list):

▸ Bits about your infrastructure (e.g. which host)

▸ Bits about your deploy (e.g. which version/build, which feature flags)

▸ Bits about your business (e.g. which customer, which shopping cart)

▸ Bits about your execution (e.g. payload characteristics, sub-timers)
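As a concrete illustration, here is roughly what attaching that kind of metadata looks like with the Go Beeline. This is a hypothetical sketch, not the workshop's actual code; the field names and values are made up, and the Beeline typically namespaces custom fields (so "username" would show up as app.username).

package wall

import (
	"net/http"

	beeline "github.com/honeycombio/beeline-go"
)

// handleWallPost attaches request-scoped metadata to the current event.
func handleWallPost(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()

	// Bits about your deploy
	beeline.AddField(ctx, "build_id", "2019-08-17.1")
	beeline.AddField(ctx, "feature_flag.new_parser", true)

	// Bits about your business
	beeline.AddField(ctx, "username", r.FormValue("user"))

	// Bits about your execution
	beeline.AddField(ctx, "payload_bytes", r.ContentLength)

	w.WriteHeader(http.StatusOK)
}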

EXERCISE: Find Checkpoint 1 (Go / Node)

→ let’s see what we’ve got

trace.trace_id: The ID of the trace this span belongs to

trace.span_id: A unique ID for each span

trace.parent_id: The ID of this span’s parent span, i.e. the call location the current span was called from

service_name: The name of the service that generated this span

name: The specific call location (like a function or method name)

duration_ms: How much time the span took, in milliseconds

(Diagram, TRACE 1: event A is the root; event B has parent A; event C has parent B. A Go sketch of the corresponding spans follows.)
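To make those fields concrete, here is a rough Go Beeline sketch. The span names are placeholders, and it assumes StartSpan begins a new trace when the context has none: span A becomes the root with a fresh trace.trace_id and no trace.parent_id, while B and C share that trace.trace_id and record their parent's span ID as trace.parent_id.

package main

import (
	"context"

	beeline "github.com/honeycombio/beeline-go"
)

func main() {
	beeline.Init(beeline.Config{WriteKey: "YOUR_WRITE_KEY", Dataset: "always-bee-tracing"})
	defer beeline.Close()

	ctx := context.Background()

	// Event A: root span -- gets a fresh trace.trace_id, no trace.parent_id.
	ctx, spanA := beeline.StartSpan(ctx, "handle_post")

	// Event B: child of A -- same trace.trace_id, trace.parent_id = A's span ID.
	ctx, spanB := beeline.StartSpan(ctx, "check_twitter")

	// Event C: child of B -- same trace.trace_id, trace.parent_id = B's span ID.
	_, spanC := beeline.StartSpan(ctx, "persist")

	// Send children before the root; duration_ms is filled in at Send time.
	spanC.Send()
	spanB.Send()
	spanA.Send()
}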

EXERCISE: Find Checkpoint 2

‣ Try writing messages like these:

‣ "seems @twitteradmin isn’t a valid username but @honeycombio is"

‣ "have you tried @honeycombio for @mysql #observability?"

→ let’s see what we’ve got

Our first, simple trace

→ let’s see what we’ve got

Checkpoint 2 Takeaways

▸ Events can be used to trace across functions within a service just as easily as they can be used for "distributed" tracing

▸ Store useful metadata on any event in a trace — and query against it!

▸ To aggregate per trace, filter to trace.parent_id does-not-exist (or break down by unique trace.trace_id values)

EXERCISE: ID sources of latency

▸ Who’s experienced the longest delay when talking to Twitter?

▸ Hint: app.username, MAX(duration_ms), and name = check_twitter

▸ Who’s responsible for the most cumulative time spent talking to Twitter?

▸ Hint: Use SUM(duration_ms) instead

(Architecture diagram: the WALL service and its USER; the ANALYSIS service, our second service; and TWITTER.COM, a third-party dependency.)

EXERCISE: Run the analysis service

‣ Open up http://localhost:8080 in your browser and post some messages to your wall.

‣ Try these:

‣ "everything is awesome!"

‣ "the sky is dark and gloomy and #winteriscoming"

Go: go run ./analysis.go

Node: node ./analysis.js

→ let’s see what we’ve got

EXERCISE: Find Checkpoint 3 (Go / Node)

→ let’s see what we’ve got

Checkpoint 3 Takeaways

▸ Tracing across services just requires serializing the tracing context over the wire

▸ Wrapping outbound HTTP requests is a simple way to trace dependencies (see the sketch below)
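As a sketch of what that can look like in Go with the Beeline's net/http wrappers (assuming the hnynethttp wrapper package; the URLs, service names, and fields here are placeholders): the calling side wraps its HTTP transport so the trace context is serialized into request headers, and the receiving side wraps its handler so those headers are deserialized and its spans join the same trace.

package tracingdemo

import (
	"context"
	"net/http"
	"strings"

	beeline "github.com/honeycombio/beeline-go"
	"github.com/honeycombio/beeline-go/wrappers/hnynethttp"
)

// Caller side (e.g. the wall service): a client whose RoundTripper
// serializes the active trace context into outgoing request headers.
var tracedClient = &http.Client{Transport: hnynethttp.WrapRoundTripper(http.DefaultTransport)}

func callAnalysis(ctx context.Context, message string) (*http.Response, error) {
	req, err := http.NewRequest("POST", "http://localhost:8088/analyze", strings.NewReader(message))
	if err != nil {
		return nil, err
	}
	// Attach ctx so the wrapper can find the active span to propagate.
	return tracedClient.Do(req.WithContext(ctx))
}

// Callee side (e.g. the analysis service): the wrapped handler reads the
// trace context back out of the headers, so its spans join the caller's
// trace instead of starting a new one.
var analysisHandler = hnynethttp.WrapHandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	beeline.AddField(r.Context(), "analysis.ok", true) // placeholder field
	w.WriteHeader(http.StatusOK)
})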

Stretch Break

Mosey back to seats, please :)

(Architecture diagram: the WALL service and its USER; the ANALYSIS service, our second service; TWITTER.COM, a third-party dependency; and a Lambda function (PERSIST), a black-box service.)

EXERCISE: Find Checkpoint 4 (Go / Node)

→ let’s see what we’ve got

Checkpoint 4 Takeaways

▸ Working with a black box? Instrument from the perspective of the code you can control (see the sketch below).

▸ Similar to identifying test cases in TDD: capture fields to let you refine your understanding of the system.
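For instance, a hypothetical Go sketch of instrumenting from our side of the black box: wrap the outbound call to the persist endpoint in a span and record what we can observe about it. The endpoint URL and field names below are assumptions, not the workshop's code.

package wall

import (
	"bytes"
	"context"
	"net/http"

	beeline "github.com/honeycombio/beeline-go"
)

// persistMessage calls the black-box persistence endpoint. We can't see
// inside it, so we record what we can observe: what we sent, what came
// back, and how long it took (the span's duration_ms).
func persistMessage(ctx context.Context, payload []byte) error {
	ctx, span := beeline.StartSpan(ctx, "persist_message")
	defer span.Send()

	beeline.AddField(ctx, "persist.payload_bytes", len(payload))

	req, err := http.NewRequest("POST", "https://example.execute-api.us-east-1.amazonaws.com/persist", bytes.NewReader(payload))
	if err != nil {
		beeline.AddField(ctx, "persist.error", err.Error())
		return err
	}
	resp, err := http.DefaultClient.Do(req.WithContext(ctx))
	if err != nil {
		beeline.AddField(ctx, "persist.error", err.Error())
		return err
	}
	defer resp.Body.Close()

	beeline.AddField(ctx, "persist.response_status", resp.StatusCode)
	return nil
}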

EXERCISE: Who’s knocking over my black box?

▸ First: what does "knocking over" mean? We know that we talk to our black box via an HTTP call. What are our signals of health?

▸ What’s the "usual worst" latency for this call out to AWS? (Explore different calculations: P95 = 95th percentile, MAX, HEATMAP)

▸ Hint: P95(duration_ms), and request.host contains aws

Puzzle Time

Scenario #1

Symptoms: we pulled in that last POST in order to persist messages somewhere, but we’re hearing from customer support that behavior has felt buggy lately — like it works sometimes but not always. What’s going on?

Think about:

‣ Verify this claim. Are we sure persist has been flaky? What does failure look like?

‣ Look through all of the metadata we have to try and find some correlation across those failing requests.

Hints: response.status_code, request.content_length. HEATMAPs are great :)

Scenario #2

Symptoms: everything feels slowed down, but more importantly the persistence behavior seems completely broken. What gives?

Think about:

‣ What might failure mean in this case?

‣ Once you’ve figured out what these failures look like, can we do anything to stop the bleeding? What might we need to find out to answer that question?

Hints: response.status_code, app.username

Scenario #3

Symptoms: persistence seems fine, but all requests seem to have slowed down to a snail’s pace. What could be impacting our overall latency so badly?

Prompts:

‣ Hint! Think about adding a num_hashtags or num_handles field to your events if you’d like to capture more about the characteristics of your payload (see the sketch below).

‣ It may be helpful to zoom in (aka add a filter) to just requests talking to amazonaws.com

Hints: response.status_code, request.host contains aws
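For that first prompt, a small helper along these lines (hypothetical names; Go shown, and the Node version would be analogous) can compute the counts before attaching them as fields:

package wall

import (
	"context"
	"strings"

	beeline "github.com/honeycombio/beeline-go"
)

// addPayloadFields counts #hashtags and @handles in a wall message and
// attaches them to the current event, so we can correlate latency with
// payload characteristics in Honeycomb.
func addPayloadFields(ctx context.Context, message string) {
	var numHashtags, numHandles int
	for _, word := range strings.Fields(message) {
		switch {
		case strings.HasPrefix(word, "#"):
			numHashtags++
		case strings.HasPrefix(word, "@"):
			numHandles++
		}
	}
	beeline.AddField(ctx, "num_hashtags", numHashtags)
	beeline.AddField(ctx, "num_handles", numHandles)
}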

Thank you & Office Hours
