
InfluxDB and the Raft consensus protocol
Philip O'Toole, Director of Engineering (SF)
InfluxDB San Francisco Meetup, December 2015

Agenda

● What is InfluxDB and why is it distributed?

● What is the problem with distribution?

● How can it be solved?

● InfluxDB and distributed consensus

What is InfluxDB?

● Open-source time-series database

● Distributed by design

● Written in Go

Why distribute InfluxDB?

● A distributed database provides reliability
– Your data is located in multiple places

● A distributed database offers scalability
– For both write and query load

What is the problem?

It's easy to set the value of a single node reliably

[Diagram: a client sets X=5 on a single node and receives an ACK]

What if the value should be replicated?

Multi-node system - the receiving node replicates the data

[Diagram: the client writes X=5 to Node A, which replicates X=5 to Node B and then ACKs the client; a second panel shows the replication link broken, leaving Node B's value unknown]

What if nodes can't communicate?

This failure is known as a partition

Client replication: multi-node system - each node must support changing state

[Diagram: with the link between Node A and Node B partitioned, each client writes X=5 to its own node and receives an ACK, but the nodes cannot replicate to each other]

What if the value should be replicated, but the nodes can't communicate?

Multi-node system - each node receives a mutually exclusive operation

[Diagram: during the partition, Client 1 sets X=6 on Node A and Client 2 sets X=5 on Node B; both writes are ACKed locally, but neither can be replicated]

Communication restored

[Diagram: communication between Node A and Node B is restored, but Node A holds X=6 while Node B holds X=5]

Which value should a client now read?

Real world scenario

[Diagram: during a partition, Client 1 modifies the database through Node A while Client 2 backs up and deletes the database through Node B; both operations are ACKed, and the two histories cannot be reconciled]

This problem is known as Distributed Consensus

What is the solution?

● Paxos
– Famously difficult to understand

● ZAB (ZooKeeper Atomic Broadcast)
– Created by Yahoo! Research

● Raft
– Diego Ongaro and John Ousterhout at Stanford

The Secret Lives of Data, by Ben Johnson (@benbjohnson)
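The hraftd project in the references shows the usual way to embed Raft in a Go program: the application supplies a finite state machine, and Raft delivers committed log entries to it in the same order on every node. The sketch below illustrates that pattern with a toy key-value FSM built on hashicorp/raft; the kvFSM and command types are illustrative only and are not InfluxDB's meta types.

    package main

    import (
        "encoding/json"
        "fmt"
        "io"
        "sync"

        "github.com/hashicorp/raft"
    )

    // command is a hypothetical log-entry payload: every change to the
    // replicated state is encoded, appended to the Raft log by the leader,
    // and applied on every node once a quorum has persisted it.
    type command struct {
        Op    string `json:"op"` // e.g. "set"
        Key   string `json:"key"`
        Value string `json:"value"`
    }

    // kvFSM is a toy replicated key-value store implementing raft.FSM.
    type kvFSM struct {
        mu   sync.Mutex
        data map[string]string
    }

    // Apply is called for each committed log entry; it must be deterministic
    // so that every node ends up with identical state.
    func (f *kvFSM) Apply(l *raft.Log) interface{} {
        var c command
        if err := json.Unmarshal(l.Data, &c); err != nil {
            return err
        }
        f.mu.Lock()
        defer f.mu.Unlock()
        if c.Op == "set" {
            f.data[c.Key] = c.Value
        }
        return nil
    }

    // Snapshot and Restore are also required by raft.FSM; stubbed for brevity.
    func (f *kvFSM) Snapshot() (raft.FSMSnapshot, error) {
        return nil, fmt.Errorf("snapshots not implemented in this sketch")
    }

    func (f *kvFSM) Restore(rc io.ReadCloser) error {
        return rc.Close()
    }

    func main() {
        // In a real node the FSM is handed to raft.NewRaft and entries arrive
        // via replication; here one entry is applied directly to show the flow.
        fsm := &kvFSM{data: make(map[string]string)}
        b, _ := json.Marshal(command{Op: "set", Key: "x", Value: "5"})
        fsm.Apply(&raft.Log{Data: b})
        fmt.Println("x =", fsm.data["x"]) // same result on every node that applies this entry
    }

In a running cluster the client sends the command to the leader, the leader appends it via Raft, and the write completes only after a quorum has persisted the entry, which is where the costs on the next slide come from.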

What is the cost?

● Latency in decision making

● Low throughput

● System rigidity

● Consensus under load is tricky

InfluxDB and Raft

What follows is subject to change

Accurate as of 0.9.6

What is stored via consensus?

● Cluster membership

● Databases

● Retention Policies
– Enforcement run by the leader

● Users

● Continuous Queries
– Queries initiated by the leader

● Shard Metadata
– Locations on the cluster (assignment performed by the leader)
– Start and end times

[Diagram: a three-node Raft cluster, one leader and two followers]
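To make the list above concrete, the cluster-level changes that go through the Raft log can be pictured as a small set of command types like the hypothetical ones below. InfluxDB's actual meta package (linked in the references) uses its own command encoding, so these names are illustrative only.

    package metasketch

    import "time"

    // Hypothetical stand-ins for the kinds of cluster-level changes that go
    // through the Raft log; time-series points never take this path.

    // CreateDatabaseCommand adds a database to the cluster meta data.
    type CreateDatabaseCommand struct {
        Name string
    }

    // CreateRetentionPolicyCommand adds a retention policy; enforcement
    // (dropping expired shard groups) is run by the leader.
    type CreateRetentionPolicyCommand struct {
        Database string
        Name     string
        Duration time.Duration
        ReplicaN int
    }

    // CreateShardGroupCommand records shard metadata: the time range the
    // group covers and which nodes own its shards. Assignment is performed
    // by the leader.
    type CreateShardGroupCommand struct {
        Database  string
        Policy    string
        StartTime time.Time
        EndTime   time.Time
        OwnerIDs  []uint64
    }

Each such command is appended by the leader, replicated to the followers, and applied to every node's copy of the cluster meta data.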

What is not stored via consensus?

● Time-series data
– And time-series indexes

● Time-series data schema

System Implications

Data Ingestion

● Write traffic can stall if a shard does not exist
– Because the cluster must reach consensus

– This delay is usually very short

– Shard pre-creation mitigates this issue (see the sketch below)
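A rough sketch of shard pre-creation, assuming a hypothetical metaClient that fronts the consensus-backed shard metadata; the real InfluxDB service differs, but the idea is the same: pay the consensus round trip in the background before the write path needs the shard.

    package shardprecreate

    import (
        "log"
        "time"
    )

    // metaClient is a hypothetical interface to the consensus-backed shard
    // metadata; creating a shard group requires a Raft round trip.
    type metaClient interface {
        ShardGroupExists(db, rp string, t time.Time) (bool, error)
        CreateShardGroup(db, rp string, t time.Time) error
    }

    // Run periodically makes sure the shard group covering the near future
    // already exists, so incoming writes never stall waiting on consensus.
    func Run(mc metaClient, db, rp string, advance, every time.Duration) {
        for range time.Tick(every) {
            future := time.Now().Add(advance)
            exists, err := mc.ShardGroupExists(db, rp, future)
            if err != nil {
                log.Printf("precreate check failed: %v", err)
                continue
            }
            if !exists {
                // Pay the consensus latency here, in the background,
                // instead of on the hot write path.
                if err := mc.CreateShardGroup(db, rp, future); err != nil {
                    log.Printf("precreate failed: %v", err)
                }
            }
        }
    }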

Schema varies by shard and time

Shard A: cpu,host=a load=100    (integer)
Shard B: cpu,host=b load=99.4   (float)

What is the type of the field load on the measurement cpu? It depends! Each shard is its own local authority

No global schema exists

    > CREATE DATABASE foo
    > DROP DATABASE fooer
    ERR: database not found: fooer
    > USE foo
    Using database foo
    > INSERT cpu value=100
    > SELECT value FROM cpu
    name: cpu
    ---------
    time                            value
    2015-12-12T05:39:11.51832836Z   100
    > SELECT valuexxx FROM cpu
    > SELECT value FROM cpuxxx
    >

Unexpected error-free cases

● The DROP of a non-existent database returns an error: there is consensus regarding existing databases

● The SELECTs against a non-existent field or measurement return nothing, error-free: there is no consensus, i.e. no central authority, for schema
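A toy illustration of "each shard is its own local authority": field types are tracked per shard, so a type conflict can only ever be detected within one shard. The types and names below are illustrative, not InfluxDB's.

    package shardschema

    import "fmt"

    // FieldType is the value type a shard has recorded for a field.
    type FieldType int

    const (
        Float FieldType = iota
        Integer
    )

    // Shard tracks field types locally; there is no cluster-wide copy.
    type Shard struct {
        fieldTypes map[string]FieldType // keyed by "measurement.field"
    }

    func NewShard() *Shard {
        return &Shard{fieldTypes: make(map[string]FieldType)}
    }

    // WriteField accepts the first type it sees for a field and rejects a
    // later write of a different type, but only within this shard. A
    // different shard may hold the same field with a different type.
    func (s *Shard) WriteField(measurement, field string, t FieldType) error {
        key := measurement + "." + field
        existing, ok := s.fieldTypes[key]
        if !ok {
            s.fieldTypes[key] = t
            return nil
        }
        if existing != t {
            return fmt.Errorf("field type conflict for %s in this shard", key)
        }
        return nil
    }

With this model, Shard A can hold cpu.load as an integer while Shard B holds it as a float, and nothing in the cluster ever reconciles the two.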

Fault Tolerance


● 3-node Raft clusters can tolerate failure of 1 node

● 5-node Raft clusters can tolerate failure of 2 nodes (quorum arithmetic sketched below)
– 0.9.6 clusters are restricted to 3 Raft nodes

● Write and query traffic may tolerate further failure
– Until the data held in consensus must be changed
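These tolerance figures follow from Raft's majority quorum: a cluster of N voting nodes needs floor(N/2) + 1 of them to commit an entry, so it can lose the rest and keep working.

    package main

    import "fmt"

    // quorum is how many voting nodes Raft needs to commit an entry.
    func quorum(n int) int { return n/2 + 1 }

    // tolerated is how many voting nodes can fail while a quorum remains.
    func tolerated(n int) int { return n - quorum(n) }

    func main() {
        for _, n := range []int{1, 3, 5} {
            fmt.Printf("%d-node cluster: quorum %d, tolerates %d failure(s)\n",
                n, quorum(n), tolerated(n))
        }
    }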

Previous Approaches (v0.9-rc20)

● All data, including write load, went through Raft
– A streaming model

● AppendEntries decomposed into two calls
– Potentially high performance

– Global schema, which meant an improved user experience

– Easier to reason about when all data is in consensus

● In practice it was inefficient and complex to implement correctly
– The approach was abandoned by 0.9.0

Future Designs

0.9.6 Design

[Diagram: a cluster in which Raft roles (leader and followers) and data storage are co-located on the same nodes]

● Each node may be a data node, a Raft node, or both

0.10.0 Design

[Diagram: a distinct Raft group (leader and two followers) alongside data-only nodes]

● Distinct Raft sub-system

● Data-only nodes

A lightly-loaded consensus system may result in more robust clusters.
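One way to picture the 0.10.0 split is as two disjoint node roles: a small consensus group that handles only meta changes, and data-only nodes that take writes and queries directly. The sketch below is a hypothetical topology model, not InfluxDB's actual configuration.

    package clustertopology

    // Role separates the consensus sub-system from data-only nodes.
    type Role int

    const (
        RaftOnly Role = iota // votes in consensus, stores no series data
        DataOnly             // stores series data, never votes in Raft
    )

    type Node struct {
        ID   uint64
        Role Role
    }

    // RaftGroup returns the nodes that handle meta changes (databases,
    // retention policies, users, CQs, shard assignment). Writes and queries
    // for series data go straight to the data nodes that own the shards.
    func RaftGroup(nodes []Node) []Node {
        var group []Node
        for _, n := range nodes {
            if n.Role == RaftOnly {
                group = append(group, n)
            }
        }
        return group
    }

Keeping the Raft group small and off the write path is what keeps the consensus system lightly loaded.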

References

● https://groups.google.com/forum/#!msg/raft-dev/6GK22lL_Y1c/vSR3_XrRVpMJ

● http://thesecretlivesofdata.com/raft/

● https://raft.github.io/

● https://github.com/hashicorp/raft

● https://github.com/otoolep/hraftd

● https://github.com/influxdb/influxdb/tree/0.9.6/meta
