Counting with Prometheus (CloudNativeCon + KubeCon Europe 2017)
TRANSCRIPT
Brian Brazil, Founder
Counting with Prometheus
Who am I?
Engineer passionate about running software reliably in production.
Core developer of Prometheus
Studied Computer Science in Trinity College Dublin.
Google SRE for 7 years, working on high-scale reliable systems.
Contributor to many open source projects, including Prometheus, Ansible, Python, Aurora and Zookeeper.
Founder of Robust Perception, provider of commercial support and consulting for Prometheus.
What is Prometheus?
Prometheus is a metrics-based time series database, designed for whitebox monitoring.
It supports labels (dimensions/tags).
Alerting and graphing are unified, using the same language.
Architecture
Counting
Counting is easy right?
Just 1, 2, 3?
Counting… to the Extreme!!!
What if we're counting events for a metrics-based monitoring system, though?
We're usually sampling data over the network => potential data loss.
What happens with the data we transfer at the other end?
Counters and Gauges
There are two base metric types.
Gauges are a snapshot of current state, such as memory usage or temperature.
They can go up and down.
Counters are the other base type.
To explain them, we need to go on a "small" detour.
Events are key
Events are the thing we want to track with Counters.
An event might be an HTTP request hitting our server, a function call being made or an error being returned.
An event logging system would record each event individually.
A metrics-based system like Prometheus or Graphite has events aggregated across time before they get to the TSDB. Therein lies the rub.
Approach #1: Resetting Count
There are a few common approaches to providing this aggregate over time.
The first and simplest is the resetting count.
You start at 0, and every time there's an event you increment the count.
On a regular interval you transfer the current count, and reset to 0.
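The resetting count above can be sketched in a few lines of Python. The names here are illustrative, not any real client library's API:

```python
# Sketch of Approach #1: a resetting count.
class ResettingCounter:
    def __init__(self):
        self.count = 0

    def inc(self):
        self.count += 1

    def transfer(self):
        """Return the current count and reset to 0. If delivery of the
        returned value fails, those events are lost forever."""
        value = self.count
        self.count = 0
        return value

c = ResettingCounter()
for _ in range(5):
    c.inc()
print(c.transfer())  # 5 events this interval
print(c.transfer())  # 0 - the count was reset
```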
Counter Reset, Normal Operation
Approach #1: Resetting Count Problems
If a transfer fails, you've lost that data.
Can't presume this effect will be random and unbiased, e.g. if a big spike in traffic also saturates network links used for monitoring.
Doesn't work if you want to transfer data to more than one place for redundancy. Each would get 1/n of the data.
Counter Reset, Failed Transfer
Approach #2: Exponential moving average
A number of instrumentation libraries offer this, such as Dropwizard's Meter.
Basically the same way Unix load averages work:
result(t) = (1 - w) * result(t-1) + (w) * events_this_period
Where t is the tick, and w is a weighting factor.
The weighting factor determines how quickly new data is incorporated.
Dropwizard evaluates the above every 5 seconds.
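The formula above can be sketched directly. The weight here is illustrative; Dropwizard derives its weights from the 1m/5m/15m decay periods and the 5s tick:

```python
# Sketch of Approach #2: an exponential moving average per the formula
# result(t) = (1 - w) * result(t-1) + w * events_this_period.
def ema(events_per_tick, w):
    result = 0.0
    history = []
    for events in events_per_tick:
        result = (1 - w) * result + w * events
        history.append(result)
    return history

# A burst of 10 events, then silence: the average decays away, and it
# never tells you exactly when the events happened.
print(ema([10, 0, 0, 0], w=0.5))  # [5.0, 2.5, 1.25, 0.625]
```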
Exponential Moving Average, Normal Operation
Approach #2: Exponential moving average Problems
Events aren't uniformly considered. If you're transferring data every 10s, then the most recent 5s matter more.
Thus reconstructing what happened is hard for debugging, unless you get every 5s update.
You're bound to the 1m, 5m and 15m weightings that the implementation has chosen.
Also means that it's not particularly resilient to missing a scrape.
Aside: Graphite's summarize() function
summarize() returns the number of events during e.g. the last hour.
Some have a belief that summarize() is accurate. It isn't.
The problem is that with, say, 15m granularity, a data point at 13:02 includes 13 minutes of data from before 13:00, so a 13:00-14:00 summary is off at both edges.
If you want this accurately, need to use logs.
No metrics based system can report this accurately in the general case.
Graphite's summarize() and non-aligned data
Aliasing
Depending on the exact time offsets between the process start, metric initialisation, data transfers and when the user makes a query you can get different results.
A second in either direction can make a big difference to the patterns you see in your graphs.
This is an expected signal processing effect, be aware of it.
Expressive Power
Both previous solutions are reasonable if the monitoring system is a fairly dumb data store, often with little math beyond addition (if even that).
Losing data or having no redundancy are better than having nothing at all.
What if you have the option for your monitoring system to do math?
What if you control both ends?
Approach #3: Prometheus Counters
Like Approach #1, we have a counter that starts at 0 and increments at each event.
This is transferred regularly to Prometheus.
It's not reset on every transfer though; it keeps on increasing.
The rate() function in Prometheus takes this in and calculates how quickly it increased over the given time period.
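The idea can be sketched like this - the client keeps a single ever-increasing number, and the server derives the rate from any two samples of it. This is simplified; the real rate() also handles resets and extrapolation, as covered later:

```python
# Sketch of Approach #3: a monotonically increasing counter, with the
# rate computed server-side from the samples.
def simple_rate(samples):
    """samples: time-ordered (timestamp_seconds, counter_value) pairs.
    Per-second rate of increase across the whole window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Scraped every 10s; the scrape at t=20 was lost. No events are lost -
# the increase is still visible at t=30, we just lose resolution.
print(simple_rate([(10, 100), (30, 140)]))  # 2.0 events/second
```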
Prometheus rate(), Normal Operation
Approach #3: Prometheus Counters
Resilient to failed transfers (lose resolution, not data)
Can handle multiple places to transfer to
Can choose the time period you want to calculate over in monitoring system, thus choose your level of smoothing e.g. rate(my_metric[5m]) or rate(my_metric[10m])
Uniform consideration of data
Easy to implement on client side
Prometheus rate(), Failed Transfer
Prometheus Counters: Rate()
There are many details to getting the rate() function right.
Processes might restart
Scrapes might be missed
Time periods rarely align perfectly
Time series stop and start
Prometheus Counters: Resets
Counters can be reset to 0, such as if a process restarts or a network switch gets rebooted.
If we see the value decreasing, then that's a counter reset so presume it went to 0.
So seeing 0 -> 10 -> 5 is 10 + 5 = 15 of an increase.
Graphite/InfluxDB's nonNegativeDerivative() function would ignore the drop and report based on just the increase of 10.
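The reset-handling rule can be sketched as follows (a simplification of what rate() and increase() do internally):

```python
# Whenever a sample is lower than its predecessor, assume the counter
# reset to 0 and count the full new value as an increase.
def increase_with_resets(values):
    total = 0
    for prev, cur in zip(values, values[1:]):
        if cur >= prev:
            total += cur - prev
        else:           # reset detected: counter restarted from 0
            total += cur
    return total

print(increase_with_resets([0, 10, 5]))  # 15, as in the example above
```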
Prometheus Counters: Missed scrapes
If we miss a scrape in the middle of a time period, no problem as we still have the data points around it.
Little more complicated around the edges of the time period we're looking at though.
Prometheus Counters: Alignment
It is rare that the user will request data exactly on the same cycle as the scrapes.
Especially when you're monitoring multiple servers with staggered scrapes.
Or given that timestamps have millisecond resolution, while the endpoint that graphs use accepts only second-granularity input.
Thus we need to extrapolate out to the end of the rate()'s range.
increase() can return non-integers on integral data
This is why one of the more surprising behaviours of increase() happens.
So if we have data which is:
t=1: 10
t=6: 12
t=11: 13
Request an increase() from t=0 for 15s and you'll get an increase of 3 over 10s. Extrapolating over the 15s, that's a result of 4.5.
This is the correct result on average. If you want exact answers, use logs.
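The arithmetic above can be sketched as naive extrapolation to the full range. The real increase() caps how far it extrapolates, as described on the following slides, but the principle is the same:

```python
# Naive extrapolation: scale the observed increase out to the full
# requested range.
def naive_increase(samples, range_start, range_end):
    """samples: time-ordered (timestamp_seconds, counter_value) pairs."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    observed = v1 - v0                # increase between first and last sample
    covered = t1 - t0                 # seconds those samples cover
    window = range_end - range_start  # seconds we were asked about
    return observed * window / covered

print(naive_increase([(1, 10), (6, 12), (11, 13)], 0, 15))  # 4.5
```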
Non-integral increase due to extrapolation
Prometheus Counters: Time series lifecycle
Time series are created and destroyed. If we always extrapolated out to the edge of the rate() range we'd get much bigger results than we should.
So we detect that. We calculate the average interval between data points.
If the first/last data point is within 110% of the average interval of the start/end of the range, then we extrapolate to the start/end. This allows for failed scrapes.
Otherwise we extrapolate 50% of the average interval.
We also know counters can't go negative, so don't extrapolate before the point they'd be 0 at.
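A rough sketch of the windowing rule as stated above. This is illustrative only; the real logic lives in promql/functions.go and differs in details (for example, it is also where counters are stopped from extrapolating below zero):

```python
# How far a rate's window extends beyond the first/last sample, per the
# 110% / 50% rule described above.
def extrapolated_window(sample_ts, range_start, range_end):
    """sample_ts: timestamps of the samples inside the range.
    Returns the effective window (seconds) the rate is computed over."""
    avg = (sample_ts[-1] - sample_ts[0]) / (len(sample_ts) - 1)
    start_gap = sample_ts[0] - range_start
    end_gap = range_end - sample_ts[-1]
    # Within 110% of the average interval of a boundary: extrapolate all
    # the way to it (allows for a failed scrape). Otherwise extrapolate
    # only half an average interval - the series likely started or
    # ended inside the range.
    start_ext = start_gap if start_gap <= 1.1 * avg else avg / 2
    end_ext = end_gap if end_gap <= 1.1 * avg else avg / 2
    return start_ext + (sample_ts[-1] - sample_ts[0]) + end_ext

# Scrapes every 10s, neatly filling the range: the full 40s window.
print(extrapolated_window([5, 15, 25, 35], 0, 40))   # 40.0
# Series ends long before the range does: only half an interval added.
print(extrapolated_window([5, 15, 25, 35], 0, 100))  # 40.0
```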
rate() extrapolation
Problem: Timeseries not always existing
The previous logic handles all the edge cases around counters resets, process restarts and rolling restarts, on average.
What if a counter appears with the value 1 long after the process has started, though, and doesn't increase again?
No increase in the history, so rate() doesn't see it. Can't tell when the increase happened. Prometheus is designed to be available, not catch 100% of events.
Solution: logs, or make sure all your counters are initialised on process start so the first increment goes 0->1. Then you'll only miss increments prior to the first scrape.
Problem: Lag
All these solutions produce results that lag the actual data - already seen with summarize().
A 5m Prometheus rate() at a given time is really the average from 5 minutes ago to now. Similarly with resetting counters.
Exponential moving averages are more complicated to explain, but have the same issue.
Always compare like with like, stick to one range for your rate()s.
Client Implications
The Prometheus Counter is very easy to implement, only need to increment a single number.
Concurrency handling varies by language. Mutexes are the slowest, then atomics, then per-processor values - which the Prometheus Java client approximates.
Dropwizard Meter has to increment 4 numbers and do the decay logic, so about 6x slower per benchmarks.
Dropwizard Counter (which is really a Gauge, as it can go down) is as fast as Prometheus Counter.
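A minimal thread-safe counter, using the mutex variant mentioned above (the slowest of the three options; real clients use atomics or per-processor values where the language allows):

```python
import threading

class Counter:
    """A Prometheus-style counter: one number, incremented under a lock."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def inc(self, amount=1):
        with self._lock:
            self._value += amount

    def get(self):
        with self._lock:
            return self._value

# Four threads each incrementing 1000 times; the lock makes it exact.
c = Counter()
threads = [threading.Thread(target=lambda: [c.inc() for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(c.get())  # 4000
```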
Other performance considerations
Values for each label value (called a "Child") are kept in a map in each metric.
That map lookup can be relatively expensive (~100ns), keep a pointer to the Child if that could matter. Need to know the labels you'll be using in advance though.
Similarly, don't create a map from metric names to metric objects. Store metric objects as pointers in simple variables after you create them.
Best Practices
Use seconds for timing. Prometheus values are all floats, so developers don't need to choose and deal with a mix of ns, us, ms, s and minutes.
increase() function handy for display, but similarly for consistency only use it for display. Use rate() for recording rules.
increase() is only syntactic sugar for rate().
irate(): The other rate function
Prometheus also has irate().
This looks at the last two points in the given range, and returns how fast they're increasing per second.
Great for seeing very up to date data.
Can be hard to read if data is very spiky.
Need to be careful you're not missing data.
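A sketch of the idea behind irate(): only the freshest pair of samples in the range matters, however long the range is. Simplified, but it includes the same reset handling as rate():

```python
def irate(samples):
    """samples: time-ordered (timestamp_seconds, counter_value) pairs.
    Per-second rate between the last two samples only."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    if v1 < v0:          # counter reset between the two samples
        v0 = 0
    return (v1 - v0) / (t1 - t0)

# Even with a long range, only the last two points are used:
print(irate([(0, 100), (10, 120), (20, 180)]))  # 6.0 events/second
```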
irate(), Normal Operation
irate(), Failed Transfer
Steps and rate durations
The query_range HTTP endpoint has a step parameter, this is the resolution of the data returned.
If you have a 10m step and 5m rate, you're going to be ignoring half your data.
To avoid this, make sure your rate range is at least your step for graphs.
For irate(), your step should be no more than your sample resolution.
Compound Types: Summary
How to track average latency? With two counters!
One for total requests (_count), one for total latency of those requests (_sum).
Take the rates, divide and you have average latency.
This is how the compound Summary metric works. It's a more convenient API over doing the above by hand.
Some clients also offer quantiles. Beware, slow and unaggregatable.
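The two-counter idea can be sketched like this. Names are illustrative; a real client-library Summary exposes _count and _sum series, and the division is done server-side with rate():

```python
class LatencySummary:
    """Two counters: total requests and total latency of those requests."""
    def __init__(self):
        self.count = 0    # exported as _count
        self.sum = 0.0    # exported as _sum (seconds)

    def observe(self, seconds):
        self.count += 1
        self.sum += seconds

s = LatencySummary()
for latency in [0.25, 0.5, 0.75]:
    s.observe(latency)

# Server side this would be rate(my_sum[5m]) / rate(my_count[5m]):
print(s.sum / s.count)  # 0.5 - average latency in seconds
```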
Compound Types: Histogram
Histogram also includes the _count and _sum.
The main purpose is calculating quantiles in Prometheus.
The histogram has buckets, which are counters. You can take the rate() of these, aggregate them and then use histogram_quantile() to calculate arbitrary quantiles.
Be wary of cardinality explosion, use sparingly.
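A sketch of how histogram_quantile() estimates a quantile from cumulative buckets, interpolating linearly within the bucket the quantile falls in. Simplified; the real function also handles the +Inf bucket and various edge cases:

```python
def histogram_quantile(q, buckets):
    """buckets: (upper_bound, cumulative_count) pairs, sorted by bound."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            # Assume observations are spread uniformly inside the bucket.
            width = upper_bound - lower_bound
            within = count - lower_count
            return lower_bound + width * (rank - lower_count) / within
        lower_bound, lower_count = upper_bound, count

# 90th percentile of 100 observations spread over three buckets:
print(histogram_quantile(0.9, [(0.1, 50), (0.5, 80), (1.0, 100)]))  # 0.75
```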
Resources
Official Project Website: prometheus.io
User Mailing List: [email protected]
Developer Mailing List: [email protected]
Source code: https://github.com/prometheus/prometheus/blob/master/promql/functions.go
Robust Perception Blog: www.robustperception.io/blog