openTSDB - Metrics for a distributed world
DESCRIPTION
These are the slides for my talk on openTSDB at IPC13/WTC13 in Munich. openTSDB is the software that we at gutefrage.net use to store about 200 million data points per day in several thousand time series. I will talk about how openTSDB stores the data so that it can be queried efficiently afterwards. Some cultural issues and some myths are also covered.
TRANSCRIPT
![Page 1: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/1.jpg)
openTSDB - Metrics for a distributed world
Oliver Hankeln / gutefrage.net
@mydalon
Wednesday, 30 October 2013
![Page 2: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/2.jpg)
Who am I?
Senior Engineer - Data and Infrastructure at gutefrage.net GmbH
Was doing software development before
DevOps advocate
![Page 3: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/3.jpg)
Who is Gutefrage.net?
Germany's biggest Q&A platform
#1 German site (mobile) about 5M Unique Users
#3 German site (desktop) about 17M Unique Users
> 4 million page impressions/day
Part of the Holtzbrinck group
Running several platforms (Gutefrage.net, Helpster.de, Cosmiq, Comprano, ...)
![Page 4: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/4.jpg)
What you will get
Why we chose openTSDB
What is openTSDB?
How does openTSDB store the data?
Our experiences
Some advice
![Page 5: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/5.jpg)
Why we chose openTSDB
![Page 6: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/6.jpg)
We were looking at some options
| | Munin | Graphite | openTSDB | Ganglia |
|---|---|---|---|---|
| Scales well | no | sort of | yes | yes |
| Keeps all data | no | no | yes | no |
| Creating metrics | easy | easy | easy | easy |
![Page 7: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/7.jpg)
We have a winner!
| | Munin | Graphite | openTSDB | Ganglia |
|---|---|---|---|---|
| Scales well | no | sort of | yes | yes |
| Keeps all data | no | no | yes | no |
| Creating metrics | easy | easy | easy | easy |

Bingo!
![Page 8: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/8.jpg)
Separation of concerns
![Page 9: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/9.jpg)
Separation of concerns
UI was not important for our decision
Alerting is not what we are looking for in our time series database
$ unzip|strip|touch|finger|grep|mount|fsck|more|yes|fsck|fsck|fsck|umount|sleep
![Page 10: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/10.jpg)
The ecosystem
App feeds metrics in via RabbitMQ
We base Icinga checks on the metrics
We are evaluating Skyline and Oculus by Etsy for anomaly detection
We deploy sensors via chef
![Page 11: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/11.jpg)
openTSDB
Written by Benoît Sigoure at StumbleUpon
OpenSource (get it from github)
Uses HBase (which is built on HDFS) for storage
Distributed system (multiple TSDs)
![Page 12: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/12.jpg)
The big picture
[Diagram: UI, API and tcollector talk to several TSDs (four shown), which all sit on top of HBase. Annotation: "This is really a cluster"]
![Page 13: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/13.jpg)
Putting data into openTSDB
$ telnet tsd01.acme.com 4242
put proc.load.avg5min 1382536472 23.2 host=db01.acme.com
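The same `put` line can also be built and sent from a program. The following is a minimal Python sketch, not part of the talk: the function names `format_put` and `send_put` are made up for illustration, and it assumes the TSD accepts newline-terminated `put` lines on its telnet port, as the example above suggests.

```python
import socket


def format_put(metric, timestamp, value, tags):
    """Build one line in the telnet-style format shown above:
    put <metric> <unix timestamp> <value> <tag>=<value> ...
    """
    tag_str = " ".join(f"{name}={val}" for name, val in sorted(tags.items()))
    return f"put {metric} {int(timestamp)} {value} {tag_str}\n"


def send_put(host, port, line, timeout=5.0):
    """Open a TCP connection to a TSD and send a single put line.

    Hypothetical helper: host and port are whatever your TSD listens on
    (tsd01.acme.com:4242 in the slide).
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))


line = format_put("proc.load.avg5min", 1382536472, 23.2,
                  {"host": "db01.acme.com"})
print(line, end="")
```

Calling `send_put("tsd01.acme.com", 4242, line)` would then deliver the data point, assuming a TSD is reachable at that address.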
![Page 14: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/14.jpg)
It gets even better
tcollector is a Python script that runs your collectors
handles network connection, starts your collectors at set intervals
does basic process management
adds host tag, does deduplication
![Page 15: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/15.jpg)
A simple tcollector script
#!/usr/bin/php
<?php
# Cast a die
$die = rand(1,6);
echo "roll.a.d6 " . time() . " " . $die . "\n";
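For comparison, here is roughly the same collector in Python, as a sketch. It assumes what the PHP script implies: a collector only has to print `metric timestamp value` lines to standard output, and tcollector takes care of the rest (including the host tag). The function name `roll_line` is illustrative.

```python
#!/usr/bin/env python3
import random
import sys
import time


def roll_line(now=None, rng=random):
    """Return one 'metric timestamp value' line for a six-sided die roll."""
    ts = int(time.time() if now is None else now)
    return f"roll.a.d6 {ts} {rng.randint(1, 6)}"


def main():
    # One data point per run; tcollector decides how often to run us.
    print(roll_line())
    sys.stdout.flush()


if __name__ == "__main__":
    main()
```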
![Page 16: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/16.jpg)
What was that HDFS again?
HDFS is a distributed filesystem suitable for Petabytes of data on thousands of machines.
Runs on commodity hardware
Takes care of redundancy
Used by e.g. Facebook, Spotify, eBay,...
![Page 17: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/17.jpg)
Okay... and HBase?
HBase is a NoSQL database / data store on top of HDFS
Modeled after Google's BigTable
Built for big tables (billions of rows, millions of columns)
Automatic sharding by row key
![Page 18: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/18.jpg)
How openTSDB stores the data
![Page 19: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/19.jpg)
Keys are key!
Data is sharded across regions based on their row key
You query data based on the row key
You can query row key ranges (e.g. A...D)
So: think about key design
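To see why key design matters, it helps to picture the store as one big list of rows sorted by key. The sketch below is a toy in-memory model, not HBase code: `SortedStore` and its methods are invented for illustration. The start-inclusive, stop-exclusive range mirrors how HBase scans behave.

```python
import bisect


class SortedStore:
    """Toy key-value store that keeps its row keys sorted."""

    def __init__(self):
        self._keys = []    # sorted list of row keys
        self._values = {}  # row key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start, stop):
        """Return all rows with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


store = SortedStore()
for key in [b"E", b"B", b"A", b"D", b"C"]:
    store.put(key, key.lower())

print(store.scan(b"A", b"D"))  # rows A, B and C
```

A range scan only touches the slice between two binary-search positions, so data that is queried together should share a key prefix.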
![Page 20: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/20.jpg)
Take 1
Row key format: timestamp, metric id
![Page 21: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/21.jpg)
Take 1
Row key format: timestamp, metric id
1382536472, 5 -> 17
Server A
Server B
![Page 22: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/22.jpg)
Take 1
Row key format: timestamp, metric id
1382536472, 5 -> 17
1382536472, 6 -> 24
Server A
Server B
![Page 23: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/23.jpg)
Take 1
Row key format: timestamp, metric id
1382536472, 5 -> 17
1382536472, 6 -> 24
1382536472, 8 -> 12
1382536473, 5 -> 134
1382536473, 6 -> 10
1382536473, 8 -> 99
Server A
Server B
![Page 24: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/24.jpg)
Take 1
Row key format: timestamp, metric id
1382536472, 5 -> 17
1382536472, 6 -> 24
1382536472, 8 -> 12
1382536473, 5 -> 134
1382536473, 6 -> 10
1382536473, 8 -> 99
1382536474, 5 -> 12
1382536474, 6 -> 42
Server A
Server B
![Page 25: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/25.jpg)
Solution: Swap timestamp and metric id
Row key format: metric id, timestamp
5, 1382536472 -> 17
6, 1382536472 -> 24
8, 1382536472 -> 12
5, 1382536473 -> 134
6, 1382536473 -> 10
8, 1382536473 -> 99
5, 1382536474 -> 12
6, 1382536474 -> 42
Server A
Server B
![Page 26: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/26.jpg)
Solution: Swap timestamp and metric id
Row key format: metric id, timestamp
5, 1382536472 -> 17
6, 1382536472 -> 24
8, 1382536472 -> 12
5, 1382536473 -> 134
6, 1382536473 -> 10
8, 1382536473 -> 99
5, 1382536474 -> 12
6, 1382536474 -> 42
Server A
Server B
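The effect of the swap can be simulated in a few lines. With the timestamp first, every new row key is larger than all earlier ones, so all current writes fall into the same key range and hit one server. With the metric id first, writes spread across ranges. The sketch below is an illustration only: the two servers, the split points and the helper names are assumptions, while the writes are the ones from the slides.

```python
import bisect
from collections import Counter

# Hypothetical cluster: two servers, one split point per key layout.
SERVERS = ["Server A", "Server B"]

# (timestamp, metric id) pairs in arrival order, taken from the slides.
WRITES = [(1382536472, 5), (1382536472, 6), (1382536472, 8),
          (1382536473, 5), (1382536473, 6), (1382536473, 8),
          (1382536474, 5), (1382536474, 6)]


def server_for(key, split_points):
    """Pick the server whose sorted key range contains `key`."""
    return SERVERS[bisect.bisect_right(split_points, key)]


def load_timestamp_first(writes):
    split = [(1382536000, 0)]  # assumed split, older than all new writes
    return Counter(server_for((ts, m), split) for ts, m in writes)


def load_metric_first(writes):
    split = [(6, 0)]  # assumed split between metric 5 and metric 6
    return Counter(server_for((m, ts), split) for ts, m in writes)


print(load_timestamp_first(WRITES))  # every write lands on one server
print(load_metric_first(WRITES))     # writes are shared between both
```

In this toy setup the timestamp-first layout sends all eight writes to Server B, while the metric-first layout sends three to Server A and five to Server B.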
![Page 27: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/27.jpg)
Take 2
Metric ID first, then timestamp
Searching through many rows is slower than searching through fewer rows. (Obviously)
So: Put multiple data points into one row
![Page 28: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/28.jpg)
Take 2 continued
5, 1382608800: +23 -> 17, +35 -> 1, +94 -> 23, +142 -> 42
5, 1382612400: +13 -> 3, +25 -> 44, +88 -> 12, +89 -> 2
![Page 29: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/29.jpg)
Take 2 continued
5, 1382608800: +23 -> 17, +35 -> 1, +94 -> 23, +142 -> 42
5, 1382612400: +13 -> 3, +25 -> 44, +88 -> 12, +89 -> 2
Row key
![Page 30: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/30.jpg)
Take 2 continued
5, 1382608800: +23 -> 17, +35 -> 1, +94 -> 23, +142 -> 42
5, 1382612400: +13 -> 3, +25 -> 44, +88 -> 12, +89 -> 2
Row key
Cell Name
![Page 31: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/31.jpg)
Take 2 continued
5, 1382608800: +23 -> 17, +35 -> 1, +94 -> 23, +142 -> 42
5, 1382612400: +13 -> 3, +25 -> 44, +88 -> 12, +89 -> 2
Row key
Cell Name Data point
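The grouping can be sketched as follows. Each timestamp is split into a base time, which goes into the row key, and an offset in seconds, which becomes the cell name. The sketch assumes one row per hour, which fits the two base timestamps in the example being 3600 seconds apart. The function names are illustrative and not taken from the openTSDB source.

```python
ROW_SPAN = 3600  # seconds covered by one row (assumed: one hour)


def split_timestamp(ts):
    """Split a Unix timestamp into (row base time, offset within the row)."""
    base = ts - ts % ROW_SPAN
    return base, ts - base


def group_into_rows(metric_id, points):
    """Group (timestamp, value) pairs into rows keyed by (metric id, base).

    Each row maps cell name (offset in seconds) -> data point.
    """
    rows = {}
    for ts, value in points:
        base, offset = split_timestamp(ts)
        rows.setdefault((metric_id, base), {})[offset] = value
    return rows


points = [(1382608823, 17), (1382608835, 1), (1382608894, 23),
          (1382608942, 42), (1382612413, 3), (1382612425, 44),
          (1382612488, 12), (1382612489, 2)]
for key, cells in group_into_rows(5, points).items():
    print(key, cells)
```

The eight data points end up in just two rows, matching the two rows on the slide.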
![Page 32: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/32.jpg)
Where are the tags stored?
They are put at the end of the row key
Both tag names and tag values are represented by IDs
![Page 33: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/33.jpg)
The Row Key
3 Bytes - metric ID
4 Bytes - timestamp (rounded down to the hour)
3 Bytes - tag name ID (per tag)
3 Bytes - tag value ID (per tag)
Total: 7 Bytes + 6 Bytes * Number of tags
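Put together, a row key with this layout could be assembled as in the sketch below. The field widths come from the slide. Big-endian byte order and sorting the tag pairs are assumptions made so that keys sort predictably, and `build_row_key` is a made-up name, not an openTSDB function.

```python
import struct


def build_row_key(metric_id, timestamp, tags):
    """Pack a row key: 3-byte metric ID, 4-byte hour-aligned timestamp,
    then 3 + 3 bytes per (tag name ID, tag value ID) pair.
    """
    base = timestamp - timestamp % 3600  # round down to the hour
    key = metric_id.to_bytes(3, "big") + struct.pack(">I", base)
    for name_id, value_id in sorted(tags):  # assumed ordering
        key += name_id.to_bytes(3, "big") + value_id.to_bytes(3, "big")
    return key


key = build_row_key(5, 1382608823, [(1, 2), (3, 4)])
print(len(key), key.hex())  # 7 + 6 * 2 = 19 bytes
```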
![Page 34: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/34.jpg)
Let's look at some graphs
![Page 35: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/35.jpg)
Busting some Myths
![Page 36: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/36.jpg)
Myth: Keeping Data is expensive
Gartner put the price of enterprise SSDs at $1/GB in 2013
A data point gets compressed to 2-3 Bytes
A metric that you measure every second then uses disk space worth 18.9 ct per year.
Usually it is even cheaper
![Page 37: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/37.jpg)
If your work costs $50 per hour and it takes you only one minute to think about and configure your RRD compaction settings, you could have collected that metric on a second-by-second basis for 4.4 YEARS instead.
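The arithmetic behind these figures can be checked with a short sketch. The slides do not show how 18.9 ct was derived. At $1/GB (taking 1 GB as 10^9 bytes) it corresponds to about 6 bytes stored per second, whereas 2 to 3 bytes per data point alone would give roughly 6 to 9 ct. One possible explanation, which is an assumption and not from the talk, is that replication is included. The function names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000


def storage_cost_per_year(bytes_per_point, usd_per_gb=1.0,
                          points_per_second=1):
    """Yearly disk cost in USD for one metric (1 GB taken as 10**9 bytes)."""
    bytes_per_year = bytes_per_point * points_per_second * SECONDS_PER_YEAR
    return bytes_per_year / 1e9 * usd_per_gb


def years_paid_by_labour(minutes, usd_per_hour, yearly_cost):
    """How many years of storage the cost of `minutes` of work would buy."""
    return (minutes / 60 * usd_per_hour) / yearly_cost


for size in (2, 3, 6):
    print(size, "bytes/point:", round(storage_cost_per_year(size), 4), "USD")

# One minute at $50/hour versus 18.9 ct per year of storage.
print(round(years_paid_by_labour(1, 50, storage_cost_per_year(6)), 1))
```

With 6 bytes per second the sketch reproduces both numbers from the slides: about $0.189 per year, and about 4.4 years of storage for the price of one minute of work.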
![Page 38: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/38.jpg)
Myth: the number of metrics is too limited
Don't confuse Graphite metric count with openTSDB metric count.
3 Bytes of metric ID = 16.7M possibilities
3 Bytes tag value ID = 16.7M possibilities
=> at least 280 trillion metrics (counted the Graphite way)
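The numbers follow directly from the ID widths. A quick sketch, assuming that "Graphite counting" means each combination of metric and tag value counts as a separate metric:

```python
def id_space(n_bytes):
    """Number of distinct IDs that fit into n_bytes bytes."""
    return 256 ** n_bytes


metric_ids = id_space(3)     # 16,777,216, about 16.7M
tag_value_ids = id_space(3)  # 16,777,216, about 16.7M

# One metric combined with one tag value, counted the Graphite way.
combinations = metric_ids * tag_value_ids
print(f"{metric_ids:,} x {tag_value_ids:,} = {combinations:,}")
```

The product is 2^48, a little over 281 trillion, and that is with a single tag only.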
![Page 39: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/39.jpg)
Cultural issues
![Page 40: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/40.jpg)
Tools shape culture shapes tools
It is time for a new monitoring culture!
Embrace machine learning!
Monitor everything in your organisation!
Throw off the shackles of fixed intervals!
Come, join the revolution!
![Page 41: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/41.jpg)
Our experiences
![Page 42: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/42.jpg)
What works well
We store about 200M data points per day in several thousand time series with no issues
tcollector decouples measurement from storage
Creating new metrics is really easy
You are free to choose your rhythm
![Page 43: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/43.jpg)
Challenges
The UI is seriously lacking
no annotation support out of the box
no metadata for time series
Only 1s time resolution (and only 1 value/s/time series)
![Page 44: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/44.jpg)
Salvation is coming
OpenTSDB 2 is around the corner
millisecond precision
annotations and metadata
improved API
improved UI
![Page 45: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/45.jpg)
Friendly advice
Pick a naming scheme and stick to it
Use tags wisely (not more than 6 or 7 tags per data point)
Use tcollector
Wait for openTSDB 2 ;-)
![Page 46: openTSDB - Metrics for a distributed world](https://reader038.vdocument.in/reader038/viewer/2022110306/554dcd74b4c905c7488b5604/html5/thumbnails/46.jpg)
Questions?
Please contact me:
@mydalon
I'll upload the slides and tweet about it