metrics - coda hale€¦ · the enterprise social network. saturday, april 9, 2011
TRANSCRIPT
METRICS
METRICS EVERYWHERESaturday, April 9, 2011
METRICS
METRICS EVERYWHERESaturday, April 9, 2011
Make better decisions by using numbers.
Saturday, April 9, 2011
Coda Hale@coda
github.com/codahale
Saturday, April 9, 2011
www.yammer.comThe enterprise social network.
Saturday, April 9, 2011
I write code.
Saturday, April 9, 2011
But that’s notactually my job.
Saturday, April 9, 2011
code
Saturday, April 9, 2011
codebusiness
value
Saturday, April 9, 2011
What the hell is business value?
Saturday, April 9, 2011
A new feature.
Saturday, April 9, 2011
An improvedexisting feature.
Saturday, April 9, 2011
Fewer bugs.
Saturday, April 9, 2011
Not pissing our users off with a slow site.
Saturday, April 9, 2011
Not pissing our users off with a slow site.
ugly
Saturday, April 9, 2011
Not pissing our users off with a slow site.
uglypretty
Saturday, April 9, 2011
Making futurechanges easier.
Saturday, April 9, 2011
Adding a unit test before fixing that bug.
Saturday, April 9, 2011
Business value is anything which makes people more likely to
give us money.
Saturday, April 9, 2011
We want to generate more business value.
Saturday, April 9, 2011
We need to makebetter decisionsabout our code.
Saturday, April 9, 2011
Our code generates business valuewhen it runs.
Saturday, April 9, 2011
Our code generates business valuewhen it runs,
not when we write it.
Saturday, April 9, 2011
We need to knowwhat our code does
when it runs.
Saturday, April 9, 2011
We can’t do this unless we measure it.
Saturday, April 9, 2011
Why measure it?
Saturday, April 9, 2011
territorymap ≠
Saturday, April 9, 2011
cityofSanFrancisco
mapof
SanFrancisco
≠
Saturday, April 9, 2011
thewayitis
thewaywetalk
≠
Saturday, April 9, 2011
thethinginitself
thething
wethink of
≠
Saturday, April 9, 2011
realityperception ≠
Saturday, April 9, 2011
MIND THE GAP
Saturday, April 9, 2011
We have amental model
of what our code does.
Saturday, April 9, 2011
It’s a mental model.It’s not the code.
Saturday, April 9, 2011
It is often wrong.
Saturday, April 9, 2011
Confusion.
Saturday, April 9, 2011
“This code can’t possibly work.”
Saturday, April 9, 2011
(It works.)
Saturday, April 9, 2011
MIND THE GAP
Saturday, April 9, 2011
“This code can’t possibly fail.”
Saturday, April 9, 2011
(It fails.)
Saturday, April 9, 2011
MIND THE GAP
Saturday, April 9, 2011
Which is faster?
Saturday, April 9, 2011
Which is faster?items.sort_by { |i| i.name }
Saturday, April 9, 2011
Which is faster?items.sort_by { |i| i.name }
items.sort { |a, b| a.name <=> b.name }
Saturday, April 9, 2011
We don’t know.
Saturday, April 9, 2011
We don’t know.
def sort_by(&blk) sleep(100) # FIXME: I AM POISON super(&blk)end
Saturday, April 9, 2011
We don’t know.
def sort_by(&blk) sleep(100) # FIXME: I AM POISON super(&blk)end
def sort(&blk) # TODO: make not explode raise Exception.new("Haw haw!")end
Saturday, April 9, 2011
We can’t know untilwe measure it.
Saturday, April 9, 2011
This affects how we make decisions.
Saturday, April 9, 2011
“Our application is slow. This page takes 500ms.
Fix it.”
Saturday, April 9, 2011
Find the bottleneck!
Saturday, April 9, 2011
Find the bottleneck!
SQL Query
Saturday, April 9, 2011
Find the bottleneck!
SQL Query
Template Rendering
Saturday, April 9, 2011
Find the bottleneck!
SQL Query
Template Rendering
Session Storage
Saturday, April 9, 2011
We don’t know.
Saturday, April 9, 2011
Find The Bottleneck 2.0!
SQL Query
Template Rendering
Session Storage
Saturday, April 9, 2011
Find The Bottleneck 2.0!
SQL Query
Template Rendering
Session Storage
53ms
Saturday, April 9, 2011
Find The Bottleneck 2.0!
SQL Query
Template Rendering
Session Storage
53ms
1ms
Saturday, April 9, 2011
Find The Bottleneck 2.0!
SQL Query
Template Rendering
Session Storage
53ms
1ms
315ms
Saturday, April 9, 2011
Find The Bottleneck 2.0!
SQL Query
Template Rendering
Session Storage
53ms
1ms
315ms
Saturday, April 9, 2011
Confusion.
Saturday, April 9, 2011
Saturday, April 9, 2011
We made a better decision.
Saturday, April 9, 2011
We improve our mental model by measuringwhat our code does.
Saturday, April 9, 2011
territorymap ≠
Saturday, April 9, 2011
territorymap→
Saturday, April 9, 2011
We use ourmental model
to decide what to do.
Saturday, April 9, 2011
A bettermental model
makes us better at deciding what to do.
Saturday, April 9, 2011
A bettermental model
makes us better at generating
business value.
Saturday, April 9, 2011
Measuring makes your decisions better.
Saturday, April 9, 2011
But only if we’re measuring
the right thing.
Saturday, April 9, 2011
We need to measure our code where it
matters.
Saturday, April 9, 2011
In the wild.
Saturday, April 9, 2011
Generatingbusiness value.
Saturday, April 9, 2011
Saturday, April 9, 2011
PRODUCTION
Saturday, April 9, 2011
Continuously measuring code in production.
Saturday, April 9, 2011
Metrics
Saturday, April 9, 2011
MetricsJava/Scala
Saturday, April 9, 2011
github.com/codahale/metrics
MetricsJava/Scala
Saturday, April 9, 2011
GaugesCountersMeters
HistogramsTimers
Saturday, April 9, 2011
Each metric is associated with a class
and has a name.
Saturday, April 9, 2011
An autocomplete service for city names.
Saturday, April 9, 2011
An autocomplete service for city names.
> GET /complete?q=San%20Fra
Saturday, April 9, 2011
An autocomplete service for city names.
> GET /complete?q=San%20Fra
< HTTP/1.1 200 RAD<< ["San Francisco"]
Saturday, April 9, 2011
What does this code do that affects its business value?
Saturday, April 9, 2011
And how can we measure that?
Saturday, April 9, 2011
GaugesCountersMeters
HistogramsTimers
Saturday, April 9, 2011
GaugesCountersMeters
HistogramsTimers
Saturday, April 9, 2011
GaugeThe instantaneous value of something.
Saturday, April 9, 2011
# of cities
Saturday, April 9, 2011
metrics.gauge("cities") { cities.size }
Saturday, April 9, 2011
metrics.gauge("cities") { cities.size }
Saturday, April 9, 2011
metrics.gauge("cities") { cities.size }
Saturday, April 9, 2011
“The service has 589 cities registered.”
Saturday, April 9, 2011
GaugesCountersMeters
HistogramsTimers
Saturday, April 9, 2011
GaugesCounters
MetersHistograms
Timers
Saturday, April 9, 2011
CounterAn incrementing and decrementing value.
Saturday, April 9, 2011
# of open connections
Saturday, April 9, 2011
val counter = metrics.counter("connections")
counter.inc()
counter.dec()
Saturday, April 9, 2011
val counter = metrics.counter("connections")
counter.inc()
counter.dec()
Saturday, April 9, 2011
val counter = metrics.counter("connections")
counter.inc()
counter.dec()
Saturday, April 9, 2011
val counter = metrics.counter("connections")
counter.inc()
counter.dec()
Saturday, April 9, 2011
“There are 594 active sessions on that server.”
Saturday, April 9, 2011
GaugesCounters
MetersHistograms
Timers
Saturday, April 9, 2011
GaugesCountersMeters
HistogramsTimers
Saturday, April 9, 2011
MeterThe average rate of events
over a period of time.
Saturday, April 9, 2011
# of requests/sec
Saturday, April 9, 2011
val meter = metrics.meter("requests", SECONDS)
meter.mark()
Saturday, April 9, 2011
val meter = metrics.meter("requests", SECONDS)
meter.mark()
Saturday, April 9, 2011
val meter = metrics.meter("requests", SECONDS)
meter.mark()
Saturday, April 9, 2011
val meter = metrics.meter("requests", SECONDS)
meter.mark()
Saturday, April 9, 2011
mean rate = # of events
elapsed time
Saturday, April 9, 2011
time
# ofrequests
Saturday, April 9, 2011
# ofrequests
time
Saturday, April 9, 2011
# ofrequests
time
Saturday, April 9, 2011
MIND THE GAP
Saturday, April 9, 2011
Recency.
Saturday, April 9, 2011
mean rate = # of events
elapsed time
Saturday, April 9, 2011
mean rate = # of events
elapsed time
Saturday, April 9, 2011
COGNITIVE HAZARD
Saturday, April 9, 2011
Exponentially weighted moving average.
Saturday, April 9, 2011
-(1-α)kmt-1 + (1-(1-α)k)Yt
k
Saturday, April 9, 2011
-(1-α)kmt-1 + (1-(1-α)k)Yt
k
Saturday, April 9, 2011
# ofrequests
time
Saturday, April 9, 2011
# ofrequests
time
Saturday, April 9, 2011
# ofrequests
time
Saturday, April 9, 2011
# ofrequests
time
Saturday, April 9, 2011
1-minute rate
Saturday, April 9, 2011
1-minute rate5-minute rate
Saturday, April 9, 2011
1-minute rate5-minute rate15-minute rate
Saturday, April 9, 2011
“We went from 3,000 requests/sec to<500 a second.”
Saturday, April 9, 2011
GaugesCountersMeters
HistogramsTimers
Saturday, April 9, 2011
GaugesCountersMeters
HistogramsTimers
Saturday, April 9, 2011
HistogramThe statistical distribution of values in a stream of data.
Saturday, April 9, 2011
# of cities returned
Saturday, April 9, 2011
val histogram = metrics.histogram("response-sizes")
histogram.update(response.cities.size)
Saturday, April 9, 2011
val histogram = metrics.histogram("response-sizes")
histogram.update(response.cities.size)
Saturday, April 9, 2011
val histogram = metrics.histogram("response-sizes")
histogram.update(response.cities.size)
Saturday, April 9, 2011
minimum
Saturday, April 9, 2011
minimummaximum
Saturday, April 9, 2011
minimummaximum
mean
Saturday, April 9, 2011
minimummaximum
meanstandard deviation
Saturday, April 9, 2011
Quantiles
Saturday, April 9, 2011
Quantilesmedian
Saturday, April 9, 2011
Quantilesmedian
75th percentile
Saturday, April 9, 2011
Quantilesmedian
75th percentile95th percentile
Saturday, April 9, 2011
Quantilesmedian
75th percentile95th percentile98th percentile
Saturday, April 9, 2011
Quantilesmedian
75th percentile95th percentile98th percentile99th percentile
Saturday, April 9, 2011
Quantilesmedian
75th percentile95th percentile98th percentile99th percentile
99.9th percentileSaturday, April 9, 2011
We can’t keep all of these values.
Saturday, April 9, 2011
1,000 req/sec
Saturday, April 9, 2011
1,000 req/sec
×
Saturday, April 9, 2011
1,000 req/sec
×1,000 actions/req
Saturday, April 9, 2011
1,000 req/sec
×1,000 actions/req
×
Saturday, April 9, 2011
1,000 req/sec
×1,000 actions/req
×1 day
Saturday, April 9, 2011
1,000 req/sec
×1,000 actions/req
×1 day
=
Saturday, April 9, 2011
1,000 req/sec
×1,000 actions/req
×1 day
=>86 billion values
Saturday, April 9, 2011
1,000 req/sec
×1,000 actions/req
×1 day
=>86 billion values
>640GB of data/day
Saturday, April 9, 2011
1,000 req/sec
×1,000 actions/req
×1 day
=>86 billion values
>640GB of data/dayNot gonna happen.
Saturday, April 9, 2011
COGNITIVE HAZARD
Saturday, April 9, 2011
Reservoir sampling.Keep a statistically representative sample
of measurements as they happen.
Saturday, April 9, 2011
Vitter’s Algorithm R.
Vitter, J. (1985).Random sampling with a reservoir.
ACM Transactions on Mathematical Software (TOMS), 11(1), 57.Saturday, April 9, 2011
time
# ofcities
Saturday, April 9, 2011
time
# ofcities
Saturday, April 9, 2011
time
# ofcities
Saturday, April 9, 2011
time
# of cities
Saturday, April 9, 2011
time
# of cities
Saturday, April 9, 2011
MIND THE GAP
Saturday, April 9, 2011
Vitter’s Algorithm R produces uniform
samples.
Saturday, April 9, 2011
Recency.
Saturday, April 9, 2011
SUPER-DUPERCOGNITIVE HAZARD
Saturday, April 9, 2011
Saturday, April 9, 2011
Forward-decaying priority sampling.
Cormode, G., Shkapenyuk, V., Srivastava, D., & Xu, B. (2009).Forward Decay: A Practical Time Decay Model for Streaming Systems.
ICDE '09: Proceedings of the 2009 IEEE International Conference on Data Engineering.Saturday, April 9, 2011
Maintain a statistically representative sample of the last 5 minutes.
Saturday, April 9, 2011
time
# of cities
Saturday, April 9, 2011
time
# of cities
Saturday, April 9, 2011
time
# of cities
Saturday, April 9, 2011
time
# of cities
Saturday, April 9, 2011
Uniform Biased
Saturday, April 9, 2011
“95% of autocomplete results return 3 cities or
less.”
Saturday, April 9, 2011
GaugesCountersMeters
HistogramsTimers
Saturday, April 9, 2011
GaugesCountersMeters
HistogramsTimers
Saturday, April 9, 2011
TimerA histogram of durations and
a meter of calls.
Saturday, April 9, 2011
# of ms to respond
Saturday, April 9, 2011
val timer = metrics.timer("requests", MILLISECONDS, SECONDS)
timer.time { handle(req, resp) }
Saturday, April 9, 2011
val timer = metrics.timer("requests", MILLISECONDS, SECONDS)
timer.time { handle(req, resp) }
Saturday, April 9, 2011
val timer = metrics.timer("requests", MILLISECONDS, SECONDS)
timer.time { handle(req, resp) }
Saturday, April 9, 2011
val timer = metrics.timer("requests", MILLISECONDS, SECONDS)
timer.time { handle(req, resp) }
Saturday, April 9, 2011
val timer = metrics.timer("requests", MILLISECONDS, SECONDS)
timer.time { handle(req, resp) }
Saturday, April 9, 2011
“At ~2,000 req/sec, our 99% latency jumps
from 13ms to 453ms.”
Saturday, April 9, 2011
GaugesCountersMeters
HistogramsTimers
Saturday, April 9, 2011
GaugesCountersMeters
HistogramsTimers
Saturday, April 9, 2011
Now what?
Saturday, April 9, 2011
Instrument it.
Saturday, April 9, 2011
Instrument it.If it could affect your code’s
business value, add a metric.
Saturday, April 9, 2011
Instrument it.If it could affect your code’s
business value, add a metric.Our services have 40-50 metrics.
Saturday, April 9, 2011
Collect it.
Saturday, April 9, 2011
Collect it.JSON via HTTP.
Saturday, April 9, 2011
Collect it.JSON via HTTP.Every minute.
Saturday, April 9, 2011
Monitor it.
Saturday, April 9, 2011
Monitor it.Nagios/Zabbix/Whatever
Saturday, April 9, 2011
Monitor it.Nagios/Zabbix/Whatever
If it affects business value, someone should get woken up.
Saturday, April 9, 2011
Aggregate it.
Saturday, April 9, 2011
Aggregate it.Ganglia/Graphite/Cacti/Whatever
Saturday, April 9, 2011
Aggregate it.Ganglia/Graphite/Cacti/Whatever
Place current values in historical context.
Saturday, April 9, 2011
Aggregate it.Ganglia/Graphite/Cacti/Whatever
Place current values in historical context.See long-term patterns.
Saturday, April 9, 2011
Go faster.
Saturday, April 9, 2011
Shorten ourdecision-making cycle.
Saturday, April 9, 2011
Observe
Saturday, April 9, 2011
ObserveOrient
Saturday, April 9, 2011
ObserveOrientDecide
Saturday, April 9, 2011
ObserveOrientDecideAct
Saturday, April 9, 2011
ObserveOrientDecideAct
Saturday, April 9, 2011
Observe
What is the 99% latency of our autocomplete service right now?
Saturday, April 9, 2011
Observe
What is the 99% latency of our autocomplete service right now?
~500ms
Saturday, April 9, 2011
Orient
How does this compare toother parts of our system,
both currently and historically?
Saturday, April 9, 2011
Orient
How does this compare toother parts of our system,
both currently and historically?
way slower
Saturday, April 9, 2011
Decide
Should we make it faster?Or should we add feature X?
Saturday, April 9, 2011
Decide
Should we make it faster?Or should we add feature X?
make it faster
Saturday, April 9, 2011
Act!
Write some code.
Saturday, April 9, 2011
Act!
Write some code.
def sort_by(&blk) #sleep(100) # WTF DUDE super(&blk)end
Saturday, April 9, 2011
10 Print "Rinse"20 Print "Repeat"30 Goto 10
Saturday, April 9, 2011
If we do this fasterwe will win.
Saturday, April 9, 2011
Fewer bugs.
Saturday, April 9, 2011
More features.
Saturday, April 9, 2011
Happier users.
Saturday, April 9, 2011
Money.Saturday, April 9, 2011
tl;dr
Saturday, April 9, 2011
We might write code.
Saturday, April 9, 2011
We have to generatebusiness value.
Saturday, April 9, 2011
In order to know how well our code is generating
business value, we need metrics.
Saturday, April 9, 2011
GaugesCountersMeters
HistogramsTimers
Saturday, April 9, 2011
Monitor them for current problems.
Saturday, April 9, 2011
Aggregate them for historical perspective.
Saturday, April 9, 2011
territorymap ≠
Saturday, April 9, 2011
territorymap→
Saturday, April 9, 2011
Improve our mental model of our code.
Saturday, April 9, 2011
MIND THE GAP
Saturday, April 9, 2011
ObserveOrientDecideAct
Saturday, April 9, 2011
If you’re on the JVM, use Metrics.
Saturday, April 9, 2011
If you’re on the JVM, use Metrics.
github.com/codahale/metrics
Saturday, April 9, 2011
If not,you can build this.
Saturday, April 9, 2011
Please build this.
Saturday, April 9, 2011
Make better decisions by using numbers.
Saturday, April 9, 2011
Thank you.
Saturday, April 9, 2011
Saturday, April 9, 2011