
InfluxDB and the Raft consensus protocol
Philip O'Toole, Director of Engineering (SF)
InfluxDB San Francisco Meetup, December 2015

Agenda

● What is InfluxDB and why is it distributed?

● What is the problem with distribution?

● How can it be solved?

● InfluxDB and distributed consensus

What is InfluxDB?

● Open-source time-series database

● Distributed by design

● Written in Go

Why distribute InfluxDB?

● A distributed database provides reliability
– Your data is located in multiple places

● A distributed database offers scalability
– For both write and query load

What is the problem?

It's easy to set the value of a single node reliably

[Diagram: a client sets X=5 on a single node and receives an ACK]

What if the value should be replicated?

Multi-node system - the receiving node replicates the data

[Diagram: the client writes X=5 to Node A, which replicates X=5 to Node B and then ACKs the client; a second panel shows the replication link broken, leaving Node B's value unknown]

What if nodes can't communicate?

This failure is known as a partition

Client replication: multi-node system - each node must support changing state

[Diagram: with the link between Node A and Node B partitioned, each client writes X=5 to its own node and receives an ACK, but the nodes cannot replicate to each other]

What if the value should be replicated, but the nodes can't communicate?

Multi-node system - each node receives a mutually exclusive operation

[Diagram: during the partition, Client 1 sets X=6 on Node A and Client 2 sets X=5 on Node B; both writes are ACKed locally, but neither can be replicated]

Communication restored

[Diagram: communication between Node A and Node B is restored, but Node A holds X=6 while Node B holds X=5]

Which value should a client now read?

Real world scenario

[Diagram: during a partition, Client 1 modifies the database through Node A while Client 2 backs up and deletes the database through Node B; both operations are ACKed, and the two histories cannot be reconciled]

This problem is known as Distributed Consensus

What is the solution?

● Paxos
– Famously difficult to understand

● ZAB (ZooKeeper Atomic Broadcast)
– Created by Yahoo! Research

● Raft
– Diego Ongaro and John Ousterhout at Stanford

The Secret Lives of Data, by Ben Johnson (@benbjohnson)
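The hraftd project in the references shows the usual way to embed Raft in a Go program: the application supplies a finite state machine, and Raft delivers committed log entries to it in the same order on every node. The sketch below illustrates that pattern with a toy key-value FSM built on hashicorp/raft; the kvFSM and command types are illustrative only and are not InfluxDB's meta types.

    package main

    import (
        "encoding/json"
        "fmt"
        "io"
        "sync"

        "github.com/hashicorp/raft"
    )

    // command is a hypothetical log-entry payload: every change to the
    // replicated state is encoded, appended to the Raft log by the leader,
    // and applied on every node once a quorum has persisted it.
    type command struct {
        Op    string `json:"op"` // e.g. "set"
        Key   string `json:"key"`
        Value string `json:"value"`
    }

    // kvFSM is a toy replicated key-value store implementing raft.FSM.
    type kvFSM struct {
        mu   sync.Mutex
        data map[string]string
    }

    // Apply is called for each committed log entry; it must be deterministic
    // so that every node ends up with identical state.
    func (f *kvFSM) Apply(l *raft.Log) interface{} {
        var c command
        if err := json.Unmarshal(l.Data, &c); err != nil {
            return err
        }
        f.mu.Lock()
        defer f.mu.Unlock()
        if c.Op == "set" {
            f.data[c.Key] = c.Value
        }
        return nil
    }

    // Snapshot and Restore are also required by raft.FSM; stubbed for brevity.
    func (f *kvFSM) Snapshot() (raft.FSMSnapshot, error) {
        return nil, fmt.Errorf("snapshots not implemented in this sketch")
    }

    func (f *kvFSM) Restore(rc io.ReadCloser) error {
        return rc.Close()
    }

    func main() {
        // In a real node the FSM is handed to raft.NewRaft and entries arrive
        // via replication; here one entry is applied directly to show the flow.
        fsm := &kvFSM{data: make(map[string]string)}
        b, _ := json.Marshal(command{Op: "set", Key: "x", Value: "5"})
        fsm.Apply(&raft.Log{Data: b})
        fmt.Println("x =", fsm.data["x"]) // same result on every node that applies this entry
    }

In a running cluster the client sends the command to the leader, the leader appends it via Raft, and the write completes only after a quorum has persisted the entry, which is where the costs on the next slide come from.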

What is the cost?

● Latency in decision making

● Low throughput

● System rigidity

● Consensus under load is tricky

InfluxDB and Raft

What follows is subject to change

Accurate as of 0.9.6

What is stored via consensus?

● Cluster membership

● Databases

● Retention Policies
– Enforcement run by the leader

● Users

● Continuous Queries
– Queries initiated by the leader

● Shard Metadata
– Locations on the cluster (assignment performed by the leader)
– Start and end times

[Diagram: a three-node Raft cluster, one leader and two followers]
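To make the list above concrete, the cluster-level changes that go through the Raft log can be pictured as a small set of command types like the hypothetical ones below. InfluxDB's actual meta package (linked in the references) uses its own command encoding, so these names are illustrative only.

    package metasketch

    import "time"

    // Hypothetical stand-ins for the kinds of cluster-level changes that go
    // through the Raft log; time-series points never take this path.

    // CreateDatabaseCommand adds a database to the cluster meta data.
    type CreateDatabaseCommand struct {
        Name string
    }

    // CreateRetentionPolicyCommand adds a retention policy; enforcement
    // (dropping expired shard groups) is run by the leader.
    type CreateRetentionPolicyCommand struct {
        Database string
        Name     string
        Duration time.Duration
        ReplicaN int
    }

    // CreateShardGroupCommand records shard metadata: the time range the
    // group covers and which nodes own its shards. Assignment is performed
    // by the leader.
    type CreateShardGroupCommand struct {
        Database  string
        Policy    string
        StartTime time.Time
        EndTime   time.Time
        OwnerIDs  []uint64
    }

Each such command is appended by the leader, replicated to the followers, and applied to every node's copy of the cluster meta data.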

What is not stored via consensus?

● Time-series data
– And time-series indexes

● Time-series data schema

System Implications

Data Ingestion

● Write traffic can stall if a shard does not exist
– Because the cluster must reach consensus

– This delay is usually very short

– Shard pre-creation mitigates this issue (see the sketch below)
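A rough sketch of shard pre-creation, assuming a hypothetical metaClient that fronts the consensus-backed shard metadata; the real InfluxDB service differs, but the idea is the same: pay the consensus round trip in the background before the write path needs the shard.

    package shardprecreate

    import (
        "log"
        "time"
    )

    // metaClient is a hypothetical interface to the consensus-backed shard
    // metadata; creating a shard group requires a Raft round trip.
    type metaClient interface {
        ShardGroupExists(db, rp string, t time.Time) (bool, error)
        CreateShardGroup(db, rp string, t time.Time) error
    }

    // Run periodically makes sure the shard group covering the near future
    // already exists, so incoming writes never stall waiting on consensus.
    func Run(mc metaClient, db, rp string, advance, every time.Duration) {
        for range time.Tick(every) {
            future := time.Now().Add(advance)
            exists, err := mc.ShardGroupExists(db, rp, future)
            if err != nil {
                log.Printf("precreate check failed: %v", err)
                continue
            }
            if !exists {
                // Pay the consensus latency here, in the background,
                // instead of on the hot write path.
                if err := mc.CreateShardGroup(db, rp, future); err != nil {
                    log.Printf("precreate failed: %v", err)
                }
            }
        }
    }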

Schema varies by shard and time

Shard A: cpu,host=a load=100    (integer)
Shard B: cpu,host=b load=99.4   (float)

What is the type of the field load on the measurement cpu? It depends! Each shard is its own local authority

No global schema exists

    > CREATE DATABASE foo
    > DROP DATABASE fooer
    ERR: database not found: fooer
    > USE foo
    Using database foo
    > INSERT cpu value=100
    > SELECT value FROM cpu
    name: cpu
    ---------
    time                            value
    2015-12-12T05:39:11.51832836Z   100
    > SELECT valuexxx FROM cpu
    > SELECT value FROM cpuxxx
    >

Unexpected error-free cases

● The DROP of a non-existent database returns an error: there is consensus regarding existing databases

● The SELECTs against a non-existent field or measurement return nothing, error-free: there is no consensus, i.e. no central authority, for schema
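A toy illustration of "each shard is its own local authority": field types are tracked per shard, so a type conflict can only ever be detected within one shard. The types and names below are illustrative, not InfluxDB's.

    package shardschema

    import "fmt"

    // FieldType is the value type a shard has recorded for a field.
    type FieldType int

    const (
        Float FieldType = iota
        Integer
    )

    // Shard tracks field types locally; there is no cluster-wide copy.
    type Shard struct {
        fieldTypes map[string]FieldType // keyed by "measurement.field"
    }

    func NewShard() *Shard {
        return &Shard{fieldTypes: make(map[string]FieldType)}
    }

    // WriteField accepts the first type it sees for a field and rejects a
    // later write of a different type, but only within this shard. A
    // different shard may hold the same field with a different type.
    func (s *Shard) WriteField(measurement, field string, t FieldType) error {
        key := measurement + "." + field
        existing, ok := s.fieldTypes[key]
        if !ok {
            s.fieldTypes[key] = t
            return nil
        }
        if existing != t {
            return fmt.Errorf("field type conflict for %s in this shard", key)
        }
        return nil
    }

With this model, Shard A can hold cpu.load as an integer while Shard B holds it as a float, and nothing in the cluster ever reconciles the two.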

Fault Tolerance


● 3-node Raft clusters can tolerate failure of 1 node

● 5-node Raft clusters can tolerate failure of 2 nodes (quorum arithmetic sketched below)
– 0.9.6 clusters are restricted to 3 Raft nodes

● Write and query traffic may tolerate further failure
– Until the data held in consensus must be changed
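These tolerance figures follow from Raft's majority quorum: a cluster of N voting nodes needs floor(N/2) + 1 of them to commit an entry, so it can lose the rest and keep working.

    package main

    import "fmt"

    // quorum is how many voting nodes Raft needs to commit an entry.
    func quorum(n int) int { return n/2 + 1 }

    // tolerated is how many voting nodes can fail while a quorum remains.
    func tolerated(n int) int { return n - quorum(n) }

    func main() {
        for _, n := range []int{1, 3, 5} {
            fmt.Printf("%d-node cluster: quorum %d, tolerates %d failure(s)\n",
                n, quorum(n), tolerated(n))
        }
    }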

Previous Approaches (v0.9-rc20)

● All data, including write load, went through Raft
– A streaming model

● AppendEntries decomposed into two calls
– Potentially high performance

– Global schema, which meant an improved user experience

– Easier to reason about when all data is in consensus

● In practice it was inefficient and complex to implement correctly
– The approach was abandoned by 0.9.0

Future Designs

0.9.6 Design

[Diagram: a cluster in which Raft roles (leader and followers) and data storage are co-located on the same nodes]

● Each node may be a data node, a Raft node, or both

0.10.0 Design

[Diagram: a distinct Raft group (leader and two followers) alongside data-only nodes]

● Distinct Raft sub-system

● Data-only nodes

A lightly-loaded consensus system may result in more robust clusters.
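One way to picture the 0.10.0 split is as two disjoint node roles: a small consensus group that handles only meta changes, and data-only nodes that take writes and queries directly. The sketch below is a hypothetical topology model, not InfluxDB's actual configuration.

    package clustertopology

    // Role separates the consensus sub-system from data-only nodes.
    type Role int

    const (
        RaftOnly Role = iota // votes in consensus, stores no series data
        DataOnly             // stores series data, never votes in Raft
    )

    type Node struct {
        ID   uint64
        Role Role
    }

    // RaftGroup returns the nodes that handle meta changes (databases,
    // retention policies, users, CQs, shard assignment). Writes and queries
    // for series data go straight to the data nodes that own the shards.
    func RaftGroup(nodes []Node) []Node {
        var group []Node
        for _, n := range nodes {
            if n.Role == RaftOnly {
                group = append(group, n)
            }
        }
        return group
    }

Keeping the Raft group small and off the write path is what keeps the consensus system lightly loaded.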

References

● https://groups.google.com/forum/#!msg/raft-dev/6GK22lL_Y1c/vSR3_XrRVpMJ

● http://thesecretlivesofdata.com/raft/

● https://raft.github.io/

● https://github.com/hashicorp/raft

● https://github.com/otoolep/hraftd

● https://github.com/influxdb/influxdb/tree/0.9.6/meta
