observability for startups - amazon s3fo… · observability for startups. luke demi @luke_demi_ no...

61
Observability for Startups

Upload: others

Post on 20-Jun-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Observability for Startups

Page 2: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Luke Demi

@luke_demi_

Page 3: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

No vendors were harmed in the making of this

presentation

Page 4: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM
Page 5: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM
Page 6: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Backend RPM

Page 7: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Backend RPM

Page 8: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Capacity Planning for Crypto Mania

Page 9: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Web Service Time Breakdown

Page 10: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Web Service Time Breakdown

Page 11: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Web Service Time Breakdown

Page 12: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM
Page 13: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

We finally began to ask the hard questions about what was happening in our environment.

Page 14: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

User TrafficBackend RPM

Page 15: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Lesson: Good instrumentation will surface your problems. Bad instrumentation will

obscure them.

Page 16: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

What is good instrumentation?

all we want is to know when and why our services break

Page 17: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Which really means:

● Low latency (under 1 minute)● Long retention● Low granularity● High cardinality● In-house ("secure"?)● Fast aggregations● Intuitive interface

Page 18: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM
Page 19: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Logging Rich, open-ended context around discrete events

Pros

Cons

Example

● Audit trails● Debug/error messages

● High cardinality events

● Extreme granularity (down to the microsecond!)

● Very expensive (not optimized for storage)● Slow aggregations

Page 20: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Tracing End-to-end visibility into real requests

Pros

Cons

Example

● Entire lifecycle of a request or transaction

● Granular timings from the perspective of the application

● Magical

● Lacks full context - aggregations are approximations● Performance penalties

Page 21: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Metrics Structured, lightweight and easy to aggregate

Pros

Cons

Example

● Depth of a queue (gauge)● Count of incoming requests (counter)

● Fast aggregations

● Storage/retention optimized (burst-proof!)

● Lacks granularity and cardinality for deeper analysis

Page 22: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM
Page 23: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM
Page 24: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Logs provide many, many, many of the features we wanted:

● Low latency (under 1 minute)● Long retention● Low granularity● High cardinality● In-house ("secure"?)● Fast aggregations● Intuitive interface

Page 25: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM
Page 26: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

enrichment

Page 27: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

We piled so much into logs...

Page 28: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM
Page 29: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

40 nodes * 64GB RAM = 2.5TB= 12-24 hours of logs

… so aggregating over 7 days is a little bit painful

Page 30: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM
Page 31: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

What was wrong with tripling down on Elasticsearch/Kibana?

1. Bad interface for building alerts2. Challenges around long term retention and

management 3. Iffy aggregation support

Page 32: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM
Page 33: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Lesson: Hammers don't work for everything. Sometimes you need a screwdriver?

Page 34: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

● Low latency (under 1 minute)● Long retention● Low granularity● High cardinality● In-house ("secure"?)● Fast aggregations● Intuitive interface● + Alerts● + Stability

Page 35: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Maybe Kibana was just a bad metrics

provider?

Page 36: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM
Page 37: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM
Page 38: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

What does Datadog Provide?

● Low latency (under 1 minute)● Long retention● Low granularity● High cardinality● In-house ("secure"?)● Fast aggregations● Intuitive interface● + Alerts● + Stability

Page 39: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Looks like we're compensating for something?

● Low latency (under 1 minute)● Long retention● Low granularity● High cardinality● In-house ("secure"?)● Fast aggregations● Intuitive interface● + Alerts● + Stability

Page 40: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

app1

app2

EC2 Instance

DB

udp

/proc/loadavg/proc/meminfo

Coinbase VPC

/tmp/safe-docker.sock

Page 41: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM
Page 42: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Cardinality?!

Page 43: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM
Page 44: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Δt

Page 45: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Δt

10s flush time for aggregations

Page 46: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Δt

10s flush time for aggregations

aggregations stored pertag

Page 47: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM
Page 48: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

definitely a darkersquare

definitely40 pxx40 px

Page 49: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

definitely a darkersquare

definitely40 pxx40 px

STILL

Page 50: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM
Page 51: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM
Page 52: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

● A lot can happen in 10 seconds...● Pre aggregations feel kind of dumb

○ We're limiting the questions we can ask in the future

● No built in context/discoverability● What about high cardinality?

Page 53: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Lesson: The grass isn't always greener on the other side of the observability fence.

Page 54: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM
Page 55: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM
Page 56: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM
Page 57: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

When Nirvana?

● Low latency (under 1 minute)● Long retention● Low granularity● High cardinality● In-house ("secure"?)● Fast aggregations● Intuitive interface● + Alerts● + Stability

Page 58: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

sampling

Alternative 1

Page 59: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Is a perfect solution too much to ask?

Page 60: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Double down into events

Alternative 2

Page 61: Observability for Startups - Amazon S3fo… · Observability for Startups. Luke Demi @luke_demi_ No vendors were harmed in the making of this presentation. Backend RPM. Backend RPM

Questions?

@luke_demi_