observability for developers€¦ · dev ops not all ... is it behaving that way?" monitoring...

50
observability for developers How to Get from Here to There @cyen @honeycombio

Upload: others

Post on 09-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

observability for developers How to Get from Here to There

@cyen @honeycombio

Page 2: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

DEVChristine

Page 3: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

▸Design documents

▸Architecture review

▸Test-driven development

▸ Integration tests

▸Code review

▸Continuous integration

▸Continuous deployment

▸🎉🥂🍾🎊

▸ (Wait for exception tracker to complain)

DEVTheSoftware

Process

Page 4: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

"Works on my machine"

DEV"The only good diff is a red diff"

OPS

"What does it look like for the user?"

Page 5: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

OPSDEV▸ How to build those features / fix those bugs

▸ How features and fixes are scoped

▸ How to verify correctness or completion

▸ How to roll out that feature or fix

Page 6: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

OPSDEV▸ Not all interesting things are problems

▸ Not all interesting things are known ahead of time… or express themselves as anomalies

▸ Not all problems manifest as exceptions

… but lots of these interesting answers live in production

Page 7: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

observabilitya.k.a. understanding the behavior of a system based on knowledge of its

external outputs.

a.k.a. "what is my software doing, and why is it behaving that way?"

Page 8: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

monitoring observabilityThe system as black box magic. Thresholds, alerts, system signals like CPU and memory.Checking and rechecking for known bad behaviors.

The system as a living, adaptable thing. A culture of instrumentation and metadata rather than strictly-defined counters. Being able to tease out previously-unknown bad behaviors and outliers.

Page 9: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

The First Wave of DevOps:teaching ops folks to code

The Second Wave of DevOps:teaching devs to own code in production

OPSDEV +

Page 10: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

observability for developers

should be a core part of the processis a superpoweris an indispensable

part of decisionmaking^

Page 11: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

DEVDEBUG & decide what to buildBUILD the darn thing

VERIFY that it works (on my machine)

WATCH it for errors

VERIFY that it works (in production)

▸Design documents

▸Architecture review

▸Test-driven development

▸ Integration tests

▸Code review

▸Continuous integration

▸Continuous deployment

▸🎉🥂🍾🎊

▸ (Wait for exception tracker to complain)

TheSoftware

Process

Page 12: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

DEBUG PRODUCTION SYSTEMS

Page 13: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

DEBUG

▸ Locally: log lines, printfs, debuggers attached to our IDEs

▸ In production: we only have the data we captured when it happened

▸ Make it as easy as possible to add new data as needed

Page 14: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

DEBUG "My data isn’t showing up in Honeycomb!"

+ event_time_delta_sec

Page 15: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

DEBUG EXAMPLE

Page 16: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

BUILD THE DARN THING

Page 17: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

BUILD

▸ Know what the status quo is

▸ Events (instrumentation) can be like DEBUG statements in prod

▸ What and how we build should be informed by reality

Page 18: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

VERIFY ON MY MACHINE

Page 19: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like
Page 20: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

VERIFY IN PROD

Page 21: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

VERIFY

▸ "Test in Prod"… doesn’t mean only testing in prod

▸ "What happens when…"

▸ "… you know the answer" → continuous integration

▸ "… you want to test a hypothesis" → roll it out (carefully) and verify

▸ Complex systems have an infinitely long list of black swan failure scenarios

Page 22: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

FEATURE FLAGS 💞

VERIFY

Page 23: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

VERIFY (PROD)

// Dataset-keyed feature flags

FlagColdStorageDataset = BoolDatasetFlag{"cold-storage-dataset", false}

FlagColdStorageQuery = BoolDatasetFlag{"cold-storage-query", true} // note default true

FlagHiresInternalHeatmaps = BoolDatasetFlag{"hires-internal-heatmaps", false}

FlagTwoPassHeatmaps = BoolDatasetFlag{"two-pass-heatmaps", false}

FlagVarstringDictWrite = BoolDatasetFlag{"varstring-compression-write", false}

FlagVarstringDictRead = BoolDatasetFlag{"varstring-compression-read", false}

FEATURE FLAGS 💞

Page 24: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

WATCHIS IT STILL WORKING? LET’S

Page 25: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

WATCH

▸ Watch to make sure reality lines up with expectations

▸ … in the terms that we understand intimately

Page 26: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

WATCH

Page 27: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

▸ Form hypotheses about what code will do in prod

▸ Add/tweak instrumentation as necessary

▸ Query data to (in)validate hypotheses

▸ Take action (and repeat as necessary)

OPSDEV

Page 28: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

getting from here to there

▸ Best Practices for Instrumentation ▸ Taking the First Few Steps ▸ Migrating from Unstructured Text Logs ▸ Stop Searching, Start Analyzing ▸ Tracing as a New Frontier

Page 29: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

BEST PRACTICES FOR INSTRUMENTATION

▸ Capture contextual, structured data

{ Timestamp: "2018-03-20T00:47:25.339Z", content_length: 172, database_dur_ms: 15.79283, endpoint: "/posts/15", method: "PUT", request_dur_ms: 72.446625, render_dur_ms: 25.31729, service_name: "api", user_token: "2e6cfd4" }

Page 30: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

BEST PRACTICES FOR INSTRUMENTATION

▸ Capture contextual, structured data

▸ Common set of nouns and consistent naming

Page 31: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

BEST PRACTICES FOR INSTRUMENTATION

▸ Capture contextual, structured data

▸ Common set of nouns and consistent naming

▸ Instrument from the perspective of what you can control

APPUSER DATABASE

user_id

endpoint

params

hostname

active_queue

request_dur_ms

response_status_code

🚫query_sql caller_fn

database_dur_ms

num_rows_returned

Page 32: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

TAKING THE FIRST FEW STEPS

▸ Describe your basic "unit of work" and identify where it "enters" the system

Page 33: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

TAKING THE FIRST FEW STEPS

▸ Describe your basic "unit of work" and identify where it "enters" the system

▸ Identify metadata to help you isolate unexpected behavior in your business logic

Your Infra Your Deploy Your Business Your Execution

- hostname - machine type

- version / build - feature flags

- customer - shopping cart

- payload characteristics

- timers

Page 34: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

TAKING THE FIRST FEW STEPS

▸ Describe your basic "unit of work" and identify where it "enters" the system

▸ Identify metadata to help you isolate unexpected behavior in your business logic

▸ Experiment! Add temporary fields when needed to validate hypotheses

Page 35: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

TAKING THE FIRST FEW STEPS

▸ Describe your basic "unit of work" and identify where it "enters" the system

▸ Identify metadata to help you isolate unexpected behavior in your business logic

▸ Experiment! Add temporary fields when needed to validate hypotheses

▸ Prune stale fields (if necessary)

Page 36: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

MIGRATING FROM UNSTRUCTURED TEXT LOGS

2019-01-25T01:30:29.953Z Task timed out after 6.01 seconds

2019-01-25T01:30:23.743Z Enqueued task 2019-01-25T01:30:24.120Z Task processed, returning 42 entries2019-01-25T01:30:24.212Z Task complete (email sent to [email protected])2019-01-25T01:30:26.014Z Enqueued task 2019-01-25T01:30:26.214Z Enqueued task 2019-01-25T01:30:24.120Z Task errored: unknown constant ::Fixnum

2019-01-25T01:30:32.762Z Enqueued task 2019-01-25T01:30:32.791Z Enqueued task 2019-01-25T01:30:32.993Z Task processed, returning 7 entries2019-01-25T01:30:33.132Z Task complete (email not found, noop)2019-01-25T01:30:34.243Z Task processed, returning 0 entries2019-01-25T01:30:34.243Z Task complete, (email sent to [email protected])

Page 37: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

MIGRATING FROM UNSTRUCTURED TEXT LOGS

▸ Identify entities that are relevant to your business logic (and include them in your logs!)

2019-01-25T01:30:29.953Z Task timed out after 6.01 seconds task_id=72 type=process

Page 38: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

MIGRATING FROM UNSTRUCTURED TEXT LOGS

▸ Identify entities that are relevant to your business logic (and include them in your logs!)

▸ Start introducing structure into your logs

Timestamp=2019-01-25T01:30:29.953Z message=Task timed out after 6.01 seconds task_id=72 type=process

2019-01-25T01:30:29.953Z Task timed out after 6.01 seconds task_id=72 type=process

Page 39: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

MIGRATING FROM UNSTRUCTURED TEXT LOGS

▸ Identify entities that are relevant to your business logic (and include them in your logs!)

▸ Start introducing structure into your logs

▸ Build up context instead of outputting disjoint lines

Timestamp=2019-01-25T01:30:29.953Z message=Task timed out after 6.01 seconds task_id=72

2019-01-25T01:30:29.953Z Task timed out after 6.01 seconds task_id=72 type=process2019-01-25T01:30:23.743Z Enqueued task task_id=72 type=enqueue target=email

target=email queue_dur_ms=200 timeout_dur_ms=6010

Page 40: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

STOP SEARCHING, START ANALYZING

▸ Logs were conceived to store and find history, not for analytics

@example.com

@example.com

2019-01-25T01:30:29.953Z Task timed out after 6.01 seconds

2019-01-25T01:30:23.743Z Enqueued task 2019-01-25T01:30:24.120Z Task processed, returning 42 entries2019-01-25T01:30:24.212Z Task complete (email sent to [email protected])2019-01-25T01:30:26.014Z Enqueued task 2019-01-25T01:30:26.214Z Enqueued task 2019-01-25T01:30:24.120Z Task errored: unknown constant ::Fixnum

2019-01-25T01:30:32.762Z Enqueued task 2019-01-25T01:30:34.243Z Task processed, returning 0 entries2019-01-25T01:30:34.243Z Task complete, (email sent to [email protected])

Page 41: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

STOP SEARCHING, START ANALYZING

▸ Logs were conceived to store and find history, not for analytics

▸ Logs are no longer human-scale — they are machine-scale

“There is no PageRank for log data.” — @michaelwilde

Page 42: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

STOP SEARCHING, START ANALYZING

▸ Logs were conceived to store and find history, not for analytics

▸ Logs are no longer human-scale — they are machine-scale

▸ Visualizations are necessary to identify an outlier as a trend or an anomaly

Page 43: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

TRACING AS A NEW FRONTIER

▸ Tracing: not just for concurrent or distributed systems

Page 44: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

TRACING AS A NEW FRONTIER

▸ Tracing: not just for concurrent or distributed systems

2019-01-25T01:30:29.953Z Task timed out after 6.01 seconds task=72

2019-01-25T01:30:23.743Z Enqueued task task=72

2019-01-25T01:30:24.212Z Task processed, returning 42 entries task=742019-01-25T01:30:26.014Z Task complete (email sent to [email protected]) task=74

2019-01-25T01:30:24.120Z Enqueued task task=74

2019-01-25T01:30:26.214Z Enqueued task task=772019-01-25T01:30:24.120Z Task errored: unknown constant ::Fixnum task=77

2019-01-25T01:30:32.762Z Enqueued task task=782019-01-25T01:30:34.243Z Task processed, returning 0 entries task=782019-01-25T01:30:34.243Z Task complete, (email sent to [email protected]) task=78

Page 45: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

TRACING AS A NEW FRONTIER

▸ Tracing: not just for concurrent or distributed systems

▸ A series of related log lines can, in fact, share a lot in common with a trace

service_name

name

duration_ms

trace_id

span_id

parent_id

trace_id: 1

span_id: A

span_id: B, parent_id: A

span_id: C, parent_id: B

Page 46: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

TRACING AS A NEW FRONTIER

Page 47: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

TRACING AS A NEW FRONTIER

▸ Tracing: not just for concurrent or distributed systems

▸ A series of related log lines can, in fact, share a lot in common with a trace

▸ Tracing will be commonplace in 2019 [0]

0: https://monitoring.love/articles/2019-predictions/

Page 48: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

DEBUG VERIFY (WFM 🤘)

VERIFY (PROD)

BUILD

ASK NEW QUESTIONS

WATCH

SHIP BETTER SOFTWARE

Page 49: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

DEVS, OUR MISSION:▸ Stop writing software based on intuition, start

backing it up with data

▸ Teach observability tools to speak more than "Ops"

▸ ??? (← ask lots of questions and validate hypotheses)

▸ Profit!

Page 50: observability for developers€¦ · DEV OPS Not all ... is it behaving that way?" monitoring observability The system as black box magic. Thresholds, alerts, system signals like

ASK NEW QUESTIONS

SHIP BETTER SOFTWARE

thanks! @cyen@honeycombio