rethinking metrics: metrics 2.0


Upload: dieterbe

Post on 11-Jul-2015


TRANSCRIPT

Page 1: Rethinking metrics: metrics 2.0

rethinking metrics:

metrics 2.0

Page 2: Rethinking metrics: metrics 2.0

by niteroi @ panoramio.com

Page 3: Rethinking metrics: metrics 2.0

vimeo.com/43800150

Page 4: Rethinking metrics: metrics 2.0
Page 5: Rethinking metrics: metrics 2.0
Page 6: Rethinking metrics: metrics 2.0
Page 7: Rethinking metrics: metrics 2.0

problems

Metrics 2.0 concepts

implementations, uses & ideas

Page 8: Rethinking metrics: metrics 2.0

terminology

sync

Page 9: Rethinking metrics: metrics 2.0

(1234567890, 82)

(1234567900, 123)

(1234567910, 109)

(1234567920, 77)

db15.mysql.queries_running

host=db15 mysql.queries_running
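A minimal sketch of the terminology on this slide: a time series is an identifier plus a list of (timestamp, value) points. The `Series` class and its field names are illustrative assumptions, not taken from any tool in the talk; the points and the key are the ones shown above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Series:
    # Identifier of the series, e.g. the dotted path from the slide.
    key: str
    # Points are (unix_timestamp, value) pairs.
    points: List[Tuple[int, float]] = field(default_factory=list)

    def interval(self) -> int:
        """Spacing in seconds between the first two points."""
        return self.points[1][0] - self.points[0][0]


queries_running = Series(
    key="db15.mysql.queries_running",
    points=[(1234567890, 82), (1234567900, 123), (1234567910, 109), (1234567920, 77)],
)
```

Here `queries_running.interval()` gives the 10-second spacing of the sample points.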

Page 10: Rethinking metrics: metrics 2.0

Problems

Page 11: Rethinking metrics: metrics 2.0

Vimeo.com page requests/s?

server X write perf?

Page 12: Rethinking metrics: metrics 2.0

stats.hits.vimeo_com

stats_counts.hits.vimeo_com

stats.*.vimeo_requests

collectd.db.disk.sda1.disk_time.write

Page 13: Rethinking metrics: metrics 2.0

Terminology? Meaning?

Prefix?

Unit?

Aggregation?

Source?

Understanding metrics

Page 14: Rethinking metrics: metrics 2.0
Page 15: Rethinking metrics: metrics 2.0

Unclear, inconsistent terminology, format

tightly coupled

lack information

Page 16: Rethinking metrics: metrics 2.0

http://litlquest.com/forest-trees/see-forest-trees-2

Page 17: Rethinking metrics: metrics 2.0
Page 18: Rethinking metrics: metrics 2.0

O(S*P*A*C)

S = # Sources

P = # People

A = # Aggregators

C = # Complexity

Page 19: Rethinking metrics: metrics 2.0
Page 20: Rethinking metrics: metrics 2.0

Graphs and dashboards are a huge time sink.

Page 21: Rethinking metrics: metrics 2.0

metrics 2.0

concepts

Page 22: Rethinking metrics: metrics 2.0

Self-describing

Standardized

Orthogonal dimensions

Page 23: Rethinking metrics: metrics 2.0

stats.timers.dfs5.proxy-server.object.GET.200.timing.upper_90

Page 24: Rethinking metrics: metrics 2.0

{

server: dfvimeodfsproxy5,

http_method: GET,

http_code: 200,

unit: ms,

metric_type: gauge,

stat: upper_90,

swift_type: object

}
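As a rough illustration of the contrast between the dotted name on the previous slide and the tag set above, the sketch below holds the same metric both ways. The tag values are copied from the slide; `metric_id` and its sort-by-key convention are assumptions made for this example, not part of the metrics 2.0 spec.

```python
# Legacy identifier: the meaning of each field depends on its position.
legacy_name = "stats.timers.dfs5.proxy-server.object.GET.200.timing.upper_90"

# The same metric as independent key/value tags (values from the slide).
tags = {
    "server": "dfvimeodfsproxy5",
    "http_method": "GET",
    "http_code": "200",
    "unit": "ms",
    "metric_type": "gauge",
    "stat": "upper_90",
    "swift_type": "object",
}


def metric_id(tags: dict) -> str:
    """Build an identifier that does not depend on tag order (hypothetical convention)."""
    return " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
```

Because the keys are sorted, two sources that emit the same tags in a different order end up with the same identifier.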

Page 25: Rethinking metrics: metrics 2.0

allow more characters

unit: Req/s, site: vimeo.com, ...

Page 26: Rethinking metrics: metrics 2.0

Metadata

meta: {

src: proxy.py:458,

from: diamond

}

Page 27: Rethinking metrics: metrics 2.0

Datamodel

Page 28: Rethinking metrics: metrics 2.0

Any protocol

Page 29: Rethinking metrics: metrics 2.0

Source format

…service=foo instance=host unit=B 123 1234567890

{s}foo.{i}host.{u}B 123 1234567890

<uuid> 125 1234567890 #separate data
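A minimal sketch of parsing the first format on this slide, assuming the line is a run of key=value tags followed by a value and a unix timestamp. `parse_line` is a hypothetical helper, not the API of any tool named in the talk, and the sample line leaves out the elided leading part shown on the slide.

```python
def parse_line(line: str):
    """Split 'k=v k=v ... <value> <timestamp>' into (tags, value, timestamp)."""
    # The last two fields are the value and the timestamp; the rest are tags.
    *tag_fields, value, timestamp = line.split()
    tags = {}
    for tag_field in tag_fields:
        key, sep, val = tag_field.partition("=")
        if not sep or not key:
            raise ValueError(f"not a key=value tag: {tag_field!r}")
        tags[key] = val
    return tags, float(value), int(timestamp)


tags, value, ts = parse_line("service=foo instance=host unit=B 123 1234567890")
```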

Page 30: Rethinking metrics: metrics 2.0

metrics20.org

Page 31: Rethinking metrics: metrics 2.0

SI + IEC

B Err Warn Conn Job File Req ...

MB/s Err/d Req/h ...

Page 32: Rethinking metrics: metrics 2.0

Immediate understanding

of metrics

Minimize time to graphs,

alerting rules, debugging

compatibility & flexibility

in tooling

Page 33: Rethinking metrics: metrics 2.0

Implementation examples

Page 34: Rethinking metrics: metrics 2.0
Page 35: Rethinking metrics: metrics 2.0

Carbon-tagger

…stats.gauges.host.foo 125 1234567890

service=foo instance=host target_type=gauge unit=B 123 1234567890

Page 36: Rethinking metrics: metrics 2.0
Page 37: Rethinking metrics: metrics 2.0

Statsdaemon

unit=B

unit=B

...

unit=ms

unit=ms

...

unit=B/s

unit=ms stat=mean

unit=ms stat=upper_90

...
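The slide suggests that an aggregator should update the tags when it changes what the numbers mean: counted bytes become a rate (unit=B to unit=B/s) and timings gain a stat tag. The sketch below is one way to read that, under stated assumptions: the function names are hypothetical, and the percentile cut-off rule is a simplification, not Statsdaemon's actual code.

```python
def flush_counter(tags: dict, values: list, interval_s: float):
    """Sum the samples of one flush interval and emit a per-second rate."""
    out = dict(tags)
    out["unit"] = tags["unit"] + "/s"  # B -> B/s
    return out, sum(values) / interval_s


def flush_timer(tags: dict, values: list):
    """Emit mean and upper_90 for one flush interval, each tagged with its stat."""
    ordered = sorted(values)
    mean = sum(ordered) / len(ordered)
    # Keep the lowest 90% of samples; upper_90 is the largest of those.
    keep = max(1, (len(ordered) * 90) // 100)
    upper_90 = ordered[keep - 1]
    return [
        ({**tags, "stat": "mean"}, mean),
        ({**tags, "stat": "upper_90"}, upper_90),
    ]
```

For example, three samples of 100, 200 and 300 B in a 10-second flush come out as 60 B/s, with the unit tag rewritten to match.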

Page 38: Rethinking metrics: metrics 2.0

Keep metric

tags in sync with data

Page 39: Rethinking metrics: metrics 2.0

Graphing & dashboarding

Visualization

Alerting

Page 40: Rethinking metrics: metrics 2.0

Graphing & Dashboarding

Page 41: Rethinking metrics: metrics 2.0

Graph-Explorer

Page 42: Rethinking metrics: metrics 2.0
Page 43: Rethinking metrics: metrics 2.0

Graph-Explorer queries 101

proxy-server swift server:regex unit=ms

(AND)
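A simplified reading of the query above, as a sketch: every space-separated term must match (AND), `key=value` is an exact tag match, `key:regex` is a regular-expression match on that tag's value, and a bare word matches anywhere in the tags. This is not Graph-Explorer's actual parser; `matches` and the sample tag names `service` and `what` are assumptions for illustration.

```python
import re


def matches(query: str, tags: dict) -> bool:
    """Return True only if every term of the query matches the tag set."""
    flat = " ".join(f"{k}={v}" for k, v in tags.items())
    for term in query.split():
        if "=" in term:
            key, _, val = term.partition("=")
            ok = tags.get(key) == val
        elif ":" in term:
            key, _, pattern = term.partition(":")
            ok = key in tags and re.search(pattern, tags[key]) is not None
        else:
            ok = term in flat
        if not ok:
            return False
    return True


# Hypothetical tag set for one metric.
example = {"server": "dfvimeodfsproxy5", "service": "proxy-server", "what": "swift", "unit": "ms"}
```

With this tag set, `matches("proxy-server swift server:proxy[0-9] unit=ms", example)` holds, while changing the last term to `unit=B` makes the whole query fail.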

Page 44: Rethinking metrics: metrics 2.0
Page 45: Rethinking metrics: metrics 2.0
Page 46: Rethinking metrics: metrics 2.0
Page 47: Rethinking metrics: metrics 2.0
Page 48: Rethinking metrics: metrics 2.0
Page 49: Rethinking metrics: metrics 2.0
Page 50: Rethinking metrics: metrics 2.0
Page 51: Rethinking metrics: metrics 2.0

upper_90 (or stat=upper_90)

from <datetime> to <datetime>

avg over <timespec> (5M, 1h, 3d, ...)

Page 52: Rethinking metrics: metrics 2.0

Compare object put/get

stack …

http_method:(PUT|GET)

swift_type=object

avg by http_code,server

Page 53: Rethinking metrics: metrics 2.0
Page 54: Rethinking metrics: metrics 2.0

Comparing servers

http_method:(PUT|GET)

group by unit,target_type

avg by http_code,swift_type,http_method
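One way to picture "avg by": series that become identical once the listed tags are ignored get averaged into a single line. The sketch below assumes each series is a (tags, values) pair with equally long value lists; `avg_by` is a hypothetical helper, not Graph-Explorer code.

```python
from collections import defaultdict


def avg_by(series, keys):
    """Average together all series that differ only in the tags named in `keys`."""
    buckets = defaultdict(list)
    for tags, values in series:
        # Identity of the bucket: every tag except the ones being averaged away.
        rest = tuple(sorted((k, v) for k, v in tags.items() if k not in keys))
        buckets[rest].append(values)
    out = []
    for rest, group in buckets.items():
        averaged = [sum(col) / len(col) for col in zip(*group)]
        out.append((dict(rest), averaged))
    return out


series = [
    ({"server": "a", "http_code": "200"}, [10, 20]),
    ({"server": "a", "http_code": "404"}, [30, 40]),
    ({"server": "b", "http_code": "200"}, [5, 5]),
]
per_server = avg_by(series, {"http_code"})
```

The two series for server `a` collapse into one averaged line, leaving one line per server to compare.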

Page 55: Rethinking metrics: metrics 2.0
Page 56: Rethinking metrics: metrics 2.0

transcode unit=Job/s avg over <time>

from <datetime> to <datetime>

Page 57: Rethinking metrics: metrics 2.0

Note: data is obfuscated

Page 58: Rethinking metrics: metrics 2.0

Bucketing

sum by zone:eu-west|us-east|ap-southeast|us-west|sa-east|vimeo-df|vimeo-lv

group by state

Page 59: Rethinking metrics: metrics 2.0

Note: data is obfuscated

Page 60: Rethinking metrics: metrics 2.0

Compare job states per region (zones bucket)

group by zone

Page 61: Rethinking metrics: metrics 2.0

Note: data is obfuscated

Page 62: Rethinking metrics: metrics 2.0

Unit conversion

unit=Mb/s network server:regex sum by server

Page 63: Rethinking metrics: metrics 2.0
Page 64: Rethinking metrics: metrics 2.0
Page 65: Rethinking metrics: metrics 2.0

Integration

Metric unit=B/s Query unit=TB
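Because the unit is part of the metric, a query for TB against a B/s metric tells the tool to integrate over time. A minimal numeric sketch, assuming evenly spaced samples and SI prefixes (1 TB = 10^12 B); the function names are illustrative only.

```python
def integrate(points, interval_s):
    """Turn (timestamp, B/s) samples into a running total in B."""
    total = 0.0
    out = []
    for ts, rate in points:
        total += rate * interval_s
        out.append((ts, total))
    return out


def bytes_to_tb(value):
    """SI terabytes: 1 TB = 10**12 B."""
    return value / 1e12


rate_points = [(0, 1e9), (10, 2e9), (20, 1e9)]  # B/s, one sample every 10 s
total_tb = [(ts, bytes_to_tb(v)) for ts, v in integrate(rate_points, 10)]
```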

Page 66: Rethinking metrics: metrics 2.0
Page 67: Rethinking metrics: metrics 2.0

Deriving

Metric unit=B Query unit=GB/d
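The opposite direction works the same way: asking for GB/d from a metric stored in B implies taking a derivative. A sketch under the assumption of SI gigabytes and 86400-second days; `derive_per_day` is a hypothetical name.

```python
def derive_per_day(points):
    """Turn (timestamp, B) samples into (timestamp, GB/d) growth rates."""
    out = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        # Change in bytes, scaled from the sample spacing up to a full day.
        bytes_per_day = (v1 - v0) * 86400 / (t1 - t0)
        out.append((t1, bytes_per_day / 1e9))
    return out


usage = [(0, 0), (3600, 1e9), (7200, 3e9)]  # B, one sample per hour
growth = derive_per_day(usage)
```

Growing by 1 GB in the first hour corresponds to 24 GB/d, and by 2 GB in the second hour to 48 GB/d.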

Page 68: Rethinking metrics: metrics 2.0
Page 69: Rethinking metrics: metrics 2.0
Page 70: Rethinking metrics: metrics 2.0

Future work

Facet-based suggestions

Custom trees

Page 71: Rethinking metrics: metrics 2.0

Dashboard definition

queries = [

'cpu usage sum by core',

'mem unit=B !total group by type:swap',

'stack network unit=Mb/s',

'unit=B (free|used) group by =mountpoint'

]

Page 72: Rethinking metrics: metrics 2.0
Page 73: Rethinking metrics: metrics 2.0
Page 74: Rethinking metrics: metrics 2.0
Page 75: Rethinking metrics: metrics 2.0
Page 76: Rethinking metrics: metrics 2.0

Equivalence

servers.host.cpu.total.iowait → “core” : “_sum_”

servers.host.cpu.<core-number>.iowait

servers.host.loadavg.15

Page 77: Rethinking metrics: metrics 2.0

Future Work

Page 78: Rethinking metrics: metrics 2.0

● Storage aggregation rules

● graphite API functions such as cumulative, summarize and smartSummarize

● consolidateBy & Graph renderers

Page 79: Rethinking metrics: metrics 2.0
Page 80: Rethinking metrics: metrics 2.0

Self-describing & standardized

stat=upper/lower/mean/...

target_type=counter ...

Page 81: Rethinking metrics: metrics 2.0

Visualizations

Page 82: Rethinking metrics: metrics 2.0

From: dygraphs.com

Page 83: Rethinking metrics: metrics 2.0

Select your view

Page 84: Rethinking metrics: metrics 2.0

bin=10

bin=20

bin=30

bin=40

bin=50

bin=100

Page 85: Rethinking metrics: metrics 2.0
Page 86: Rethinking metrics: metrics 2.0

Alerting

Page 87: Rethinking metrics: metrics 2.0
Page 88: Rethinking metrics: metrics 2.0
Page 89: Rethinking metrics: metrics 2.0
Page 90: Rethinking metrics: metrics 2.0

unit=Err/s

Page 91: Rethinking metrics: metrics 2.0
Page 92: Rethinking metrics: metrics 2.0
Page 93: Rethinking metrics: metrics 2.0
Page 94: Rethinking metrics: metrics 2.0
Page 95: Rethinking metrics: metrics 2.0

Automatic cause & effect

Page 96: Rethinking metrics: metrics 2.0

Different algo's for different

things

Page 97: Rethinking metrics: metrics 2.0

Alert criticality & routing based

on tags

Page 98: Rethinking metrics: metrics 2.0

integrating logs & metrics

Page 99: Rethinking metrics: metrics 2.0
Page 100: Rethinking metrics: metrics 2.0
Page 101: Rethinking metrics: metrics 2.0

Algorithms leverage both

logs and metrics

Page 102: Rethinking metrics: metrics 2.0

Changing software

Page 103: Rethinking metrics: metrics 2.0

Conclusion

structured, self-describing, standardized

metrics = enabler

Page 104: Rethinking metrics: metrics 2.0

Conclusion

What are your concerns? Ideas?

Let's make this better

Ready for early adopters!

Work with me on next-gen telemetry!

Tips on coordinating spec development?

How do FB/G/AMZ/MS/APL/... do this stuff?

Page 105: Rethinking metrics: metrics 2.0

Seen in this presentation:

metrics20.org

vimeo.github.io/graph-explorer

github.com/vimeo/timeserieswidget

github.com/vimeo/carbon-tagger

github.com/vimeo/statsdaemon

github.com/graphite-ng/carbon-relay-ng

github.com/Dieterbe/anthracite

Page 106: Rethinking metrics: metrics 2.0

You might also like:

github.com/vimeo/graphite-influxdb

github.com/vimeo/graphite-api-influxdb-docker

github.com/vimeo/whisper-to-influxdb

github.com/Dieterbe/influx-cli

github.com/graphite-ng/graphite-ng

github.com/vimeo/smoketcp

github.com/vimeo/tailgate

Page 107: Rethinking metrics: metrics 2.0

Stay in touch!

groups.google.com/forum/#!forum/metrics20

groups.google.com/forum/#!forum/it-telemetry

twitter.com/Dieter_be

dieter.plaetinck.be

Page 108: Rethinking metrics: metrics 2.0

Q&A