CAT Graphite replacement lightning talk
![Page 1: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/1.jpg)
Riemann + InfluxDB + Grafana
A more performant, easier to deploy and hopefully more scalable replacement for Graphite
Nick Chappell
https://github.com/nickchappell
![Page 2: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/2.jpg)
Graphite
![Page 3: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/3.jpg)
Where did it come from?
Developed internally at Orbitz, then open-sourced
Why did it catch on?
Queryable HTTP API, which RRD didn’t (and AFAIK still doesn’t) have an equivalent of
Once installed, easier to work with than RRD
Coolness factor!
Lawl
![Page 4: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/4.jpg)
Graphite
What is it?
Metrics receiver (carbon-cache)
Metrics relay (carbon-relay)
Metrics data point storage (Whisper)
Metrics data graphing (Graphite web)
Metrics data API (Graphite web)
![Page 5: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/5.jpg)
Graphite architecture
3 main components
carbon (carbon-cache/carbon-relay): takes in metrics
whisper: stores metrics
graphite-web: retrieves metrics (API)
![Page 6: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/6.jpg)
Other tools in the graphing/metrics space
Old school tools
RRD for data storage
RRDtool for generating graphs
Cacti for managing dashboards of RRD graphs
![Page 7: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/7.jpg)
Graphite
What’s wrong with it?
A few different things, mainly with 2 parts of the stack
carbon-cache/carbon-relay
whisper
graphite-web
![Page 8: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/8.jpg)
graphite-web
graphite-web is actually quite good and featureful (is that a word?) as an API
The built-in dashboard and graph builder isn’t very stylish, but it is great for exploring and finding out which metrics you want to query via the API or graph in another dashboard tool
![Page 9: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/9.jpg)
What’s wrong with Whisper?
TONS of files, 1 file per metric
Disk IO problems when running queries for lots of metrics!
graphite-web ends up having to touch lots and lots of files
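To make the "1 file per metric" point concrete, here is a minimal sketch of how a query fans out into files. It assumes Whisper's usual convention of turning each dot-separated part of a metric name into a directory and ending the path in `.wsp`; the storage root, host names and metric names are made up for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical storage root; a real install configures its own location.
WHISPER_ROOT = PurePosixPath("/opt/graphite/storage/whisper")


def whisper_path(metric_name):
    """Map a dotted metric name to the single file assumed to hold its data."""
    parts = metric_name.split(".")
    return WHISPER_ROOT.joinpath(*parts[:-1], parts[-1] + ".wsp")


def files_touched(metric_names):
    """Return the distinct files a query over these metrics would have to open."""
    return {whisper_path(name) for name in metric_names}


# Made-up example: 50 hosts with 4 metrics each.
hosts = [f"web{n:02d}" for n in range(1, 51)]
metrics = ["cpu-system", "cpu-user", "load", "disk_io"]
names = [f"{host}.{metric}" for host in hosts for metric in metrics]

print(whisper_path("foo.bar.baz"))
print(len(files_touched(names)), "files for one dashboard-sized query")
```

Even this small made-up fleet means a couple of hundred separate files for a single broad query, which is where the disk IO pressure comes from.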
![Page 10: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/10.jpg)
Whisper derpage
This is your timeseries data on catnip
![Page 11: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/11.jpg)
What’s wrong with carbon?
The stock Python interpreter, CPython; specifically, the GIL (global interpreter lock)
“Threading” doesn’t really happen because multiple native threads are not allowed to execute Python bytecode at the same time
https://wiki.python.org/moin/GlobalInterpreterLock
Why? According to the Python folks, CPython’s memory management is not thread safe
Your time series data is getting munged on somewhere in here
![Page 12: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/12.jpg)
Does Graphite have any redeeming qualities?
The format that Graphite receives metrics in is dead-simple
metric_name value timestamp\n
foo.bar.baz 42 74857843
walle.ece.cecs.pdx.edu.cpu-system 423 74857843
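The format above is simple enough that a client fits in a few lines. The following is a minimal sketch, not an official client: the function names are made up, and the default host and port assume a carbon plaintext listener on localhost:2003, which may not match your setup.

```python
import socket


def format_metric(name, value, timestamp):
    """Build one line in the 'metric_name value timestamp\\n' format."""
    return f"{name} {value} {int(timestamp)}\n"


def parse_metric(line):
    """Split one plaintext line back into (name, value, timestamp)."""
    name, value, timestamp = line.split()
    return name, float(value), int(timestamp)


def send_metric(line, host="localhost", port=2003):
    """Send one line over TCP. Host and port are assumptions about the deployment."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


line = format_metric("foo.bar.baz", 42, 74857843)
print(repr(line))
print(parse_metric(line))
```

`send_metric` is defined but deliberately not called here, so the example runs without a listener.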
![Page 13: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/13.jpg)
Metrics formats
For better or worse, Graphite’s format is the lingua franca of the metrics world right now
Most everything that outputs metrics can output them in Graphite’s format
![Page 14: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/14.jpg)
Metrics 2.0
Graphite’s format is too limited to store more data or more complex types of metrics
http://metrics20.org/
Metrics 2.0 is an informal proposal by Dieterbe, the monitoring/metrics dude at Vimeo
![Page 15: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/15.jpg)
What can we do?
We can tune and scale Graphite by architecting things differently, or we can replace some or all of the components
![Page 16: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/16.jpg)
Scaling Graphite
Some strategies include…
…decoupling components
Split up the roles for Carbon into Carbon relays and Carbon caches
Put HAProxy in front of a few Carbon instances to spread the load around
To this end, set up multiple Carbon instances on the same machine, listening on different ports, and tie them to separate CPU cores
…throwing more hardware at it
Get a faster CPU
Get faster disks (SSDs or set up RAM drives)
![Page 17: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/17.jpg)
Will it work?
Maybe. But you get a more complex setup
Once you’ve hit a ceiling in terms of how much disk IO you can provide, Whisper runs out of juice
If you have multiple carbon caches, do they write to the same Whisper data store?
Is it even possible to write to the same Whisper data store?
If not, how do we set up a graphite-web instance to query multiple Whispers?
If we can do that, how does each instance know which Whisper to query for what metric series?
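One possible answer to the last question is to make the choice deterministic: hash the metric name, so that the writer and the reader independently pick the same instance. The sketch below uses a plain hash-modulo scheme for illustration only. It is not how carbon itself routes metrics, and the instance names are hypothetical.

```python
import hashlib


def pick_instance(metric_name, instances):
    """Pick the same instance for the same metric name, on any machine."""
    digest = hashlib.md5(metric_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(instances)
    return instances[index]


# Hypothetical carbon-cache instances, each with its own Whisper store.
instances = ["carbon-a:2004", "carbon-b:2004", "carbon-c:2004"]

for name in ["foo.bar.baz", "walle.ece.cecs.pdx.edu.cpu-system"]:
    print(name, "->", pick_instance(name, instances))
```

The catch with plain modulo is that adding or removing an instance remaps most metrics, which is part of why this kind of setup gets complex.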
![Page 18: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/18.jpg)
What are other people doing?
Some people are trying to rewrite parts of the stack
https://github.com/graphite-ng/
Others have set up some pretty impressive (but complex) Graphite architectures:
http://grey-boundary.com/the-architecture-of-clustering-graphite/
![Page 19: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/19.jpg)
http://grey-boundary.com/the-architecture-of-clustering-graphite/
![Page 20: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/20.jpg)
What are the Graphite people doing?
The Graphite folks have 2 projects in the works to overcome some of its problems
megacarbon: replacement for Carbon, supposedly will perform better
ceres: replacement for Whisper, supposedly will perform better and natively allow writes from multiple carbon/megacarbon instances
![Page 21: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/21.jpg)
Will these help?
Probably not.
megacarbon is a branch of the main carbon repository that is ahead of carbon’s master branch, but hasn’t yet been merged back in and may never be
ceres hasn’t been touched since December 2013
For that matter, whisper hasn’t been touched since January 10th, 2014
Issues and pull requests keep piling up
![Page 22: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/22.jpg)
One more thing…
Graphite is a PITA to install/deploy
It’s somewhat easier now that carbon, whisper and graphite-web are installable with pip
You still need to do a bunch of manual setup, though: use apt/yum to install a bunch of prerequisite Python packages, and set up Apache/Nginx with WSGI or Gunicorn to run graphite-web, which is actually a Django app
Before, you had to clone each component’s Git repo and check out a tagged release
![Page 23: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/23.jpg)
OK, what next?
Outside of planet Graphite, other things have been happening in the monitoring/metrics/logging space
![Page 24: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/24.jpg)
Let’s take a step back
For a long time, “monitoring” was just this:
What a sleepless night will look like
![Page 25: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/25.jpg)
And if it’s all green, things are OK, right?
![Page 26: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/26.jpg)
Proactive vs reactive
I get alerts about disk space going below a certain threshold or load average going above a certain threshold
Can I get alerts about those things before they become problems?
Can I get alerted about disk usage when it may not be a problem now, but is on its way up and will soon be critical?
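A proactive alert like that boils down to looking at the trend instead of the current value. Here is a minimal sketch that fits a straight line through recent disk usage samples and estimates how long until the disk is full. The function names, the sample data and the 1-hour warning horizon are all made up for illustration, and a straight-line fit is only one of many ways to model growth.

```python
def seconds_until_full(samples, capacity):
    """Estimate seconds until usage reaches capacity from (timestamp, used) samples.

    Uses a least-squares slope; returns None if usage is flat, shrinking,
    or there are too few samples to say anything.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    spread = sum((t - mean_t) ** 2 for t, _ in samples)
    if spread == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / spread
    if slope <= 0:
        return None
    _, last_used = samples[-1]
    return (capacity - last_used) / slope


def should_warn(samples, capacity, horizon_seconds=3600):
    """Warn if the disk is projected to fill up within the horizon."""
    remaining = seconds_until_full(samples, capacity)
    return remaining is not None and remaining < horizon_seconds


# Made-up samples: usage in percent, one sample per minute, growing 1% per minute.
samples = [(0, 50), (60, 51), (120, 52), (180, 53)]
print(seconds_until_full(samples, capacity=100))
print(should_warn(samples, capacity=100))
```

With these numbers the disk is at only 53%, so a static threshold at, say, 90% stays quiet, while the trend says it fills up in well under an hour.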
![Page 27: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/27.jpg)
“Traditional” monitoring tools
Nagios, Icinga, Sensu, Shinken, etc. are great at monitoring state changes
Things going from good (server is up, site is responding, database is running) to bad, or from bad to worse
They’re not very good at gathering metrics or performance data, and they’re not built to look for trends over time
They can gather performance stats, but the intervals are too long to be useful
Nagios/Icinga with plugins can generate reports about trends (how many services were critical in the past month), but only after the fact
![Page 28: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/28.jpg)
3 newer monitoring tools
Logstash
Heka
Riemann
![Page 29: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/29.jpg)
Logstash
Accepts logs, processes and stores them in Elasticsearch
A web app, Kibana, accesses the index data in Elasticsearch
Sound familiar?
There are tons of other ways to use it, though!
![Page 30: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/30.jpg)
Heka
Newest monitoring tool of the bunch
Written in Go
https://github.com/mozilla-services/heka
![Page 31: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/31.jpg)
Riemann http://riemann.io
![Page 32: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/32.jpg)
What is an event?
“Event stream processor” sounds really abstract
Instead of metrics (CPU usage on webserver2 at 10:45:09am), log lines and monitoring alerts (MySQL was down at 12:54am) being thought of as separate things, why not think of them as different variations of the same thing?
Logstash was one of the first tools to work with this assumption
Think about it: a user hitting a 503 error in your web app (that you find out about via logs) may not trigger a Nagios alert, but don’t you still want to know about it?
![Page 33: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/33.jpg)
Riemann
An “event stream” processor
http://riemann.io
![Page 34: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/34.jpg)
Riemann
Can take in almost anything (log line, Graphite-format metric, etc.)
Like Logstash/Heka, can process it and then send it elsewhere
Outputs include…
…IM protocols/services like IRC, Slack, Hipchat
…a more traditional monitoring system like Nagios/Icinga
![Page 35: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/35.jpg)
Riemann
Can also output to Graphite and Logstash
I’ll talk about integrations with other tools later on
One advantage over carbon: Riemann has Debian and RPM packages, including init scripts!
![Page 36: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/36.jpg)
Riemann as part of a Graphite replacement
Riemann can actually take Graphite-format metrics as an input
http://riemann.io/api/riemann.transport.graphite.html
OK, so can Riemann be used in the place of Carbon?
Yes. It can be instructed to listen on arbitrary TCP and UDP ports
![Page 37: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/37.jpg)
Riemann
Written in Clojure, a functional programming language
Riemann being written in Clojure is kinda neat, but what makes it special is what Clojure runs on: the JVM
Clojure also brings one benefit that Graphite desperately needs: safe threading!
Unlike CPython, the JVM can actually run more than 1 thread at a time and can use more than 1 CPU core
Clojure has software transactional memory and other tools for parallel/concurrent programming
![Page 38: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/38.jpg)
Riemann events
Like Logstash, events in Riemann are pieces of text with multiple fields
{ :host riemann1.local, :service cpu-0.cpu-wait, :metric 3.399911, :tags collectd, :time 1405715017, :ttl 30 }
![Page 39: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/39.jpg)
Other parts of the stack
So, we have Carbon replaced. What else do we need?
A replacement for Whisper
A replacement for the API component of graphite-web
A replacement for the web UI component of graphite-web
![Page 40: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/40.jpg)
InfluxDB
![Page 41: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/41.jpg)
InfluxDB
A fairly young project (about a year old)
A time series database written in Go
Uses LevelDB as the underlying datastore, but about to switch to RocksDB as a default (LevelDB, HyperLevelDB and LMDB can also be used)
RocksDB and HyperLevelDB are based on LevelDB
All 3 LevelDB variants compress data on disk
All 4 storage engines can compact storage engine files when getting rid of old metrics
![Page 42: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/42.jpg)
InfluxDB specifics
Even though it’s called a time series database, it isn’t like a regular SQL database
A database in InfluxDB is just like a DB in a SQL system
A series in InfluxDB is like a table in SQL
A point in a series is like a row in a table
Points can have columns of values
Points in a series don’t all have to have the same columns, so InfluxDB is sort of schema-less
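The database / series / point / column model can be sketched as plain data. The helper below is hypothetical. The name/columns/points layout mirrors, as far as I understand it, the JSON body accepted by the early InfluxDB HTTP write API; treat that shape as an assumption and check the docs for the version you run.

```python
import json


def build_series(name, points):
    """Turn a list of point dicts into one series with a shared column list.

    Points may carry different keys; missing values become None (null in JSON),
    which is the "sort of schema-less" part.
    """
    columns = sorted({key for point in points for key in point})
    rows = [[point.get(column) for column in columns] for point in points]
    return {"name": name, "columns": columns, "points": rows}


# Made-up points: the second one has an extra "host" column.
points = [
    {"time": 32141234, "value": 78},
    {"time": 32141235, "value": 45, "host": "web01"},
]
series = build_series("cpu", points)
print(json.dumps([series], indent=2))
```

Note how the first point simply gets a null for the column it never had.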
![Page 43: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/43.jpg)
InfluxDB advantages over Whisper
It has a query language (usable via an HTTP API)
All of the underlying storage engines can perform better than Whisper
Storage engine benchmarks:
http://influxdb.com/blog/2014/06/20/leveldb_vs_rocksdb_vs_hyperleveldb_vs_lmdb_performance.html
InfluxDB is going to move to RocksDB as the default in the next version
They have Debian and RPM packages! (with init scripts too!)
![Page 44: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/44.jpg)
Can InfluxDB replace Whisper?
Yes!
One particular feature it has that Whisper does not is clustering
A group of InfluxDB nodes can communicate via a Raft-based protocol to coordinate writes and reads and split up data into shards
Whisper is still single-instance only
InfluxDB ends up taking the place of Whisper and graphite-web
![Page 45: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/45.jpg)
InfluxDB disadvantages
API and built-in functions are not as featureful as graphite-web, at least not yet
Data retention specifics and best practices are still being worked out
But both of these can be solved with development work, and unlike the Graphite projects, InfluxDB is being actively developed!
![Page 46: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/46.jpg)
Riemann + InfluxDB
Guess what Riemann can write data out to?
![Page 47: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/47.jpg)
InfluxDB data partitioning
Data partitioning is deciding how to organize your data (1 series per host? 1 series per metric? 1 series per hostname + metric name combo?)
InfluxDB works best with large numbers of series with fewer columns in each one
Why? Points are indexed by time, not by any other columns.
Arbitrary column indexes are going to be added in the future, though
Having only 1 hostname+metric combo’s set of data in a series means InfluxDB only has to do fast indexed lookups by time, rather than fast indexed lookups by time followed by slower non-indexed lookups on other columns through several hosts’ and metrics’ worth of data
![Page 48: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/48.jpg)
InfluxDB data partitioning
Bad: only using 1 series in a DB
Time Name Host Metric Service
32141234 cpu web01 78 cpu
32141235 disk_io web02 98844 disk_io
32141236 load db1 5 load
32141237 eth0_in ldap0 35875 eth0_in
Good: using multiple series in a DB
Time Name Host Metric Service
32141234 cpu web01 78 cpu
32141235 cpu web01 45 cpu
32141236 cpu web01 38 cpu
32141237 cpu web01 92 cpu
Time Name Host Metric Service
32141234 disk_io web01 87323 disk_io
32141235 disk_io web01 98844 disk_io
32141236 disk_io web01 9233 disk_io
32141237 disk_io web01 93262 disk_io
Time Name Host Metric Service
32141234 cpu web02 78 cpu
32141235 cpu web02 45 cpu
32141236 cpu web02 38 cpu
32141237 cpu web02 92 cpu
Time Name Host Metric Service
32141234 disk_io web02 87323 disk_io
32141235 disk_io web02 98844 disk_io
32141236 disk_io web02 9233 disk_io
32141237 disk_io web02 93262 disk_io
![Page 49: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/49.jpg)
InfluxDB data partitioning
Isn't a series for every host+service combo excessive?
Because of the way InfluxDB's storage engines work, no!
We can include the series name as the first part of our query:
By doing that, InfluxDB can ignore all of the data in the other series and will only have to access 1 LevelDB file on disk per Grafana query
(show Riemann config that creates a series per host+service)
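As an illustration of putting the series name first, here is a sketch that builds such a query and the URL to run it. The SQL-like syntax, the `value` column name, the `/db/<database>/series` endpoint and port 8086 are my assumptions about the early InfluxDB HTTP API and this particular schema, not something taken from the slides; verify them against your version.

```python
from urllib.parse import urlencode


def series_query(host, service, window="1h"):
    """Build a query that names one host+service series up front."""
    series = f"{host}.{service}"
    return f'select value from "{series}" where time > now() - {window}'


def query_url(base_url, database, query):
    """Build the GET URL for a query (endpoint layout is an assumption)."""
    return f"{base_url}/db/{database}/series?" + urlencode({"q": query})


query = series_query("riemann1.local", "cpu-0.cpu-wait")
print(query)
print(query_url("http://localhost:8086", "metrics", query))
```

Because the series is selected by name, there is no `where host = ...` clause at all; the only filter left is on time, which is the indexed column.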
![Page 50: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/50.jpg)
InfluxDB data partitioning
By doing this, we’re not really querying by hostname or metric name, i.e. querying by those columns
It’s kind of a hack, but it takes advantage of the characteristics of how InfluxDB’s storage engines work
![Page 51: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/51.jpg)
InfluxDB data partitioning
I did this originally to get around a quirk of Grafana’s UI
For InfluxDB data sources, Grafana doesn’t let you use where or do selects by more than 1 column
Without splitting data up into more than 1 series, there’s no way to get metric values for an individual host, metric or host+metric combo
![Page 52: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/52.jpg)
Riemann and InfluxDB data partitioning
Riemann’s InfluxDB output function looks like this:
![Page 53: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/53.jpg)
Riemann and InfluxDB data partitioning
:series "host.service"
…tells Riemann to take this:
{ :host riemann1.local, :service cpu-0.cpu-wait, :metric 3.399911, :tags collectd, :time 1405715017, :ttl 30 }
…and write it to InfluxDB with riemann1.local.cpu-0.cpu-wait as the automatically generated series name
InfluxDB behaves like Graphite with new metrics: it will automatically create a new series for any hostname.metric combo it doesn’t already have a series for
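The host.service naming rule is easy to mimic outside Riemann. The sketch below models the event above as a Python dict and groups events into per-series point lists. It only illustrates the naming scheme; it is not Riemann's actual output code, and the `time`/`value` column names are an assumption.

```python
def series_name(event):
    """Join host and service with a dot, as in the example above."""
    return f"{event['host']}.{event['service']}"


def to_point(event):
    """Keep only what the series needs: a timestamp and a value."""
    return {"time": event["time"], "value": event["metric"]}


def group_by_series(events):
    """Bucket events by series name; unseen names create a new bucket."""
    grouped = {}
    for event in events:
        grouped.setdefault(series_name(event), []).append(to_point(event))
    return grouped


event = {
    "host": "riemann1.local",
    "service": "cpu-0.cpu-wait",
    "metric": 3.399911,
    "tags": ["collectd"],
    "time": 1405715017,
    "ttl": 30,
}
print(series_name(event))
print(group_by_series([event]))
```

The `setdefault` call plays the role of the auto-creation behaviour: the first event for a new host+service combo quietly starts a new series.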
![Page 54: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/54.jpg)
Grafana
Now, we just need a dashboard...
![Page 55: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/55.jpg)
Grafana data sources
Based on Kibana 3 (it's just HTML, JS and CSS)
Deploy it by unpacking a tarball to a place where your webserver can serve the contents
Edit config.js to add data sources:
Grafana can graph data from Graphite, InfluxDB and OpenTSDB
![Page 56: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/56.jpg)
A new Graphite stack
collectd collectd collectd collectd
statsd statsd
Riemann
InfluxDB
Grafana
![Page 57: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/57.jpg)
Show me some graphs!
(do a demo)
![Page 58: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/58.jpg)
Demo time!
(do an install of Riemann)
(do an install of InfluxDB)
(do an install of Grafana)
![Page 59: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/59.jpg)
Disadvantages of this new stack
Grafana doesn't take advantage of all of InfluxDB's features
InfluxDB doesn't have as many built-in functions as graphite-web
Riemann's documentation is sparse, and if you've never written Clojure, there's a learning curve for writing configs to do more than basic stuff
There are a bunch of dashboard tools that can talk to graphite-web's API
Not as many can talk to InfluxDB
![Page 60: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/60.jpg)
Why use this new stack, then?
Each of the components is easier to deploy than Graphite and is still being developed and maintained!
![Page 61: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/61.jpg)
InfluxDB and Graphite
InfluxDB actually has a Graphite listener built in
So why use Riemann in the middle? Alerting on metrics!
Riemann actually holds events that it processes in memory for a short time in what it calls the index
Having events in the index means we can keep track of metrics over short periods of time
![Page 62: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/62.jpg)
Riemann and alerting
Having events in the index means we can keep track of metrics over short periods of time
In the collectd metrics for load average across every machine, calculate an average over the last 5 mins and send an email alert if it's over a certain threshold
In the collectd metrics for load average on an individual machine, calculate a derivative over the last 5 mins and send a Nagios alert if the derivative is above a certain level
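The two alerting ideas above can be sketched in plain Python to show the arithmetic involved. This is not Riemann configuration (that would be written in Clojure using Riemann's own stream functions); the class name, thresholds and sample values are made up.

```python
from collections import deque


class Window:
    """Keep the samples from the last `seconds` seconds, like a short-lived index."""

    def __init__(self, seconds=300):
        self.seconds = seconds
        self.samples = deque()

    def add(self, timestamp, value):
        """Record a sample and drop anything older than the window."""
        self.samples.append((timestamp, value))
        cutoff = timestamp - self.seconds
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()

    def average(self):
        if not self.samples:
            return None
        return sum(value for _, value in self.samples) / len(self.samples)

    def derivative(self):
        """Change per second between the oldest and newest sample in the window."""
        if len(self.samples) < 2:
            return None
        (t0, v0), (t1, v1) = self.samples[0], self.samples[-1]
        if t1 == t0:
            return None
        return (v1 - v0) / (t1 - t0)


# Made-up load averages; the sample at t=0 falls out of the 5-minute window.
window = Window(seconds=300)
for timestamp, load in [(0, 1.0), (100, 2.0), (200, 3.0), (400, 5.0)]:
    window.add(timestamp, load)

if window.average() > 3.0:  # hypothetical threshold
    print("would send an email alert, average load:", window.average())
if window.derivative() > 0.005:  # hypothetical threshold
    print("would send a Nagios alert, load rising at:", window.derivative())
```

The window only ever holds a few minutes of data, which is why this kind of check fits in memory and does not need the long-term store.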
![Page 63: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/63.jpg)
Native Riemann outputs
Some tools, like collectd, can output data to Riemann in Riemann's native binary protobuf format
Graphite's format for metrics has become the most commonly used format
For things that only output Graphite-format metrics, Riemann's Graphite server functionality is incredibly useful
![Page 64: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/64.jpg)
Next steps: Scaling Riemann
Because of the way it works internally, Riemann doesn't have support for clustering the way InfluxDB does
Specifically, there's no way to share the in-memory index across nodes
https://github.com/jdmaturen/reimann/blob/master/riemann.config.guide#L234
You can set up multiple Riemann servers, and they can forward events to a central one, or to one of several behind HAProxy
![Page 65: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/65.jpg)
Next steps: Scaling Riemann
This sounds like how you would scale Graphite, but because we have InfluxDB, each Riemann instance can write to the same InfluxDB instance, or to one of many nodes in a cluster
Because InfluxDB can take data in over a network connection, can cluster and is not plain-file-based like Whisper, multiple Riemanns writing to 1 or more InfluxDBs in a cluster shouldn't be an issue
Some of the monitoring uses of Riemann (calculating moving averages or derivatives) will break because the in-memory indexes can’t be shared, though
![Page 66: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/66.jpg)
Next steps: Scaling InfluxDB
InfluxDB has out-of-the-box (OotB) support for clustering
Uses Raft as its consensus protocol
Databases and series can be broken up into shards
Sharding is done by blocks of time (time periods are configurable)
Metadata shared via Raft lets each node know which shards, covering which time periods and series/DBs, are on each node
Also has a WAL (write ahead log)
http://sssslide.com/speakerdeck.com/pauldix/the-internals-of-influxdb
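Sharding by blocks of time is simple arithmetic: round each timestamp down to the start of its block. The sketch below shows the idea only. It is not InfluxDB's implementation, and the 7-day default shard length is an arbitrary placeholder for the configurable period.

```python
WEEK_SECONDS = 7 * 24 * 3600  # placeholder default; the real period is configurable


def shard_start(timestamp, shard_seconds=WEEK_SECONDS):
    """Round a Unix timestamp down to the start of the shard that contains it."""
    return timestamp - (timestamp % shard_seconds)


def shards_for_range(start, end, shard_seconds=WEEK_SECONDS):
    """List the start times of every shard a query over [start, end] must touch."""
    first = shard_start(start, shard_seconds)
    return list(range(first, end + 1, shard_seconds))


# With 1-hour shards, a query spanning just over two hours touches three shards.
print(shard_start(7250, 3600))
print(shards_for_range(10, 7300, 3600))
```

A query for a recent time range only needs the nodes holding the matching shards, which is what the shared metadata is for.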
![Page 67: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/67.jpg)
Tool integrations
Riemann and Logstash can output events to each other
Both can output events to Graphite
This can get really confusing, really quickly
![Page 68: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/68.jpg)
Pies in the sky
What I want to experiment with and get working:
Email alerts from Riemann
Riemann telling Nagios/Icinga to send alerts based on thresholds for averages or derivatives of metric values
Send web logs to Logstash and make Logstash output metrics from them to Riemann (how many HTTP 404, 503, etc. responses is my web server sending out?)
![Page 69: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/69.jpg)
Links
http://riemann.io/
http://influxdb.com/
http://grafana.org/
![Page 70: CAT Graphite replacement lightning talk](https://reader030.vdocument.in/reader030/viewer/2022020217/540d967b8d7f72767e8b4a98/html5/thumbnails/70.jpg)
Links
Monitorama PDX 2014 Grafana workshop: http://vimeo.com/95316672
Monitorama PDX 2014 InfluxDB talk: http://vimeo.com/95311877
Monitorama Boston 2013 Riemann talk: http://vimeo.com/67181466