high availability postgresql with zalando patroni

HA PostgreSQL with PatroniOleksii Kliukin, Zalando SE

@alexeyklyukinFOSDEM PGDay 2016

January 29th, 2016, Brussels

What happens if the master is down?

● Built-in streaming replication is great!

● Only one writable node (primary, master)

● Multiple read-only standbys (replicas)

● Manual failover

pg_ctl promote -D /home/postgres/data

Re-joining the former masterBefore 9.3:

rm -rf /home/postgres/data && pg_basebackup …

Before 9.5

git clone -b PGREWIND1_0_0_PG9_4 --depth 1 https://github.

com/vmware/pg_rewind.git \ && cd pg_rewind && apt-get source

postgresql-9.4 -y && USE_PGXS=1 make top_srcdir=$(find . -name

"postgresql*" -type d) install;

pg_rewind in 9.5 and above

● pg_rewind available in contrib (apt-get install postgresql-contrib-9.5)

● wal_log_hints = ‘on’ or enable data checksums

● rewind your former master to be able to follow the current one:

pg_rewind -D /home/postgres/data --source-server=’

host=localhost port=5433 sslmode=prefer’

● requires superuser access

No fixed address

● Pgbouncer

● Pgpool

● HAProxy

● Floating IP/DNS

MASTER REPLICA

FORMERMASTER

WAL storage

connection router

CLIENTS

Streaming replication

pg_rewind

archiv

e com

mand restore command

How much downtime can you tolerate?

Automatic failover

master

replica

master

replica

promote

replica

master

Network issues

master

replica

master

replica

promote

master

master

?

What about an arbiter?

replica

master

arbiterping

ping

master

master

arbiter

vote

master

replica

Do we need a distributed consensus?

Master election

The consensus problem requires agreement among a number of processes (or agents) for a single data value.

● leader (master) value defines the current master

● no leader - which node takes the master key

● leader is present - should be the same for all nodes

● leader has disappeared - should be the same for all nodes

● etcd from CoreOS

● distributed key-value storage

● directory-tree like

● implements RAFT

● talks REST

● key expiration with TTL and test and set operations

3-rd party to enforce a consensus

RAFT

● Distributed consensus algorithm (like Paxos)

● Achieves consensus by directing all changes to the leader

● Only commit the change if it’s acknowledged by the majority of nodes

● 2 stages○ leader election

○ log replication

● Implemented in etcd, consul.

http://thesecretlivesofdata.com/raft/



Patroni

● Manages a single PostgreSQL node

● Commonly runs on the same host as PostgreSQL

● Talks to etcd

● Promotes/demotes the managed node depending on the leader key

PostgreSQL master election

set leader lock

set leader lo

ck set leader lock

● every node tries to set the leader lock (key)

● the leader lock can only be set when it’s not present

● once the leader lock is set - no one else can obtain it


http -f PUT http://127.0.0.1:2379/v2/keys/service/fosdem/leader?prevExist=false value="postgresql0"

ttl=30

HTTP/1.1 201 Created

...

X-Etcd-Cluster-Id: 7e27652122e8b2ae

X-Etcd-Index: 2045

X-Raft-Index: 13006

X-Raft-Term: 2

{

"action": "create",

"node": {

"createdIndex": 2045,

"expiration": "2016-01-28T13:38:19.717822356Z",

"key": "/service/fosdem/leader",

"modifiedIndex": 2045,

"ttl": 30,

"value": "postgresql0"

}

}

ELECTED

http -f PUT http://127.0.0.1:2379/v2/keys/service/fosdem/leader?prevExist=false value="postgresql1"

ttl=30

HTTP/1.1 412 Precondition Failed

...


X-Etcd-Index: 2047

{

"cause": "/service/fosdem/leader",

"errorCode": 105,

"index": 2047,

"message": "Key already exists"

}

Only one leader at a time


I’m the member

I’m the leader w

ith the lockI’m the member

Streaming replication

How do you know the leader is alive?

● leader updates its key periodically (by default every 10 seconds)

● only the leader is allowed to update the key (via compare and swap)

● if the key is not updated in 30 seconds - it expires (via TTL)

http -f PUT http://127.0.0.1:2379/v2/keys/service/fosdem/leader?prevValue="bar" value="bar"

HTTP/1.1 412 Precondition Failed

Content-Length: 89

Content-Type: application/json

Date: Thu, 28 Jan 2016 13:45:27 GMT


X-Etcd-Index: 2090

{

"cause": "[bar != postgresql0]",

"errorCode": 101,

"index": 2090,

"message": "Compare failed"

}

Only the leader can update the lock

http -f PUT http://127.0.0.1:2379/v2/keys/service/fosdem/leader?prevValue="postgresql0" value="postgresql0" ttl=30

{

"action": "compareAndSwap",

"node": {


"expiration": "2016-01-28T13:47:05.38531821Z",



"ttl": 30,


},

"prevNode": {


"expiration": "2016-01-28T13:47:05.226784451Z",



"ttl": 22,


}

}

How do you know where to connect?$ etcdctl ls --recursive /service/fosdem

/service/fosdem/members

/service/fosdem/members/postgresql0

/service/fosdem/members/postgresql1

/service/fosdem/initialize

/service/fosdem/leader

/service/fosdem/optime

/service/fosdem/optime/leader

$ http http://127.0.0.1:2379/v2/keys/service/fosdem/members/postgresql0

HTTP/1.1 200 OK

...


X-Etcd-Index: 3114

X-Raft-Index: 20102

X-Raft-Term: 2

{

"action": "get",

"node": {


"expiration": "2016-01-28T14:28:25.221011955Z",

"key": "/service/fosdem/members/postgresql0",


"ttl": 22,

"value": "{\"conn_url\":\"postgres://replicator:[email protected]:5432/postgres\",\"

api_url\":\"http://127.0.0.1:8008/patroni\",\"tags\":{\"nofailover\":false,\"noloadbalance\":false,

\"clonefrom\":false},\"state\":\"running\",\"role\":\"master\",\"xlog_location\":234881568}"

}

}

Avoiding the split brain

Worst case scenario

Streaming replication in 140 characters

Patroni configuration parameters● YAML file with sections● general parameters

○ ttl: time to leave for the leader and member keys○ loop_wait: minimum time one iteration of the eventloop takes○ scope: name of the cluster to run○ auth: ‘username:password’ string for the REST API

● postgresql section○ name - name of the postgresql member (should be unique)○ listen - address:port to listen to (or multiple, i.e. 127.0.0.1,127.0.0.2:5432)○ connect_address: address:port to advertise to other members (only one, i.e. 127.0.0.5:5432)○ data_dir: PGDATA (can be initially not empty)○ maximum_lag_on_failover: do not failover if slave is more than this number of bytes behind○ use_slots: whether to use replication slots (9.4 and above)

postgresql subsections● initdb: section to specify initdb options (i.e. encoding, default auth mode)● pg_rewind: section with username/password for the user used by pg_rewind● pg_hba: entries to be added to pg_hba.conf● replication: replication user, password, and network (for pg_hba.conf)● superuser: username/password for the superuser account (to be created)● admin: username/password for the user with createdb/createrole permissions● create_replica_methods: list of methods to image replicas from the master:● recovery.conf: parameters put into the recovery.conf (primary_conninfo is

written automatically)● parameters: postgresql.conf parameters (i.e. wal_log_hints or shared_buffers)

tags (patroni configuration)tags modify behavior of the node they are applied to

● nofailover: the node should not participate in elections or ever become the master

● noloadbalance: the node should be excluded from the load balancer (TODO)● clonefrom: this node should be bootstrapped from (TODO)● replicatefrom: this node should do streaming replication from (pull request)

REST API● command and control interface● GET /master and /replica endpoints for the load balancer● GET /patroni in order to get system information● POST /restart in order to restart the node● POST /reinitialize in order to remove the data directory and reinitialize from

the master● POST /failover with leader and optional member names in order to do a

controlled failover● patronictl to do it in a more user-friendly way

REST API (master)$ http http://127.0.0.1:8008/masterHTTP/1.0 200 OK...Server: BaseHTTP/0.3 Python/2.7.10

{ "postmaster_start_time": "2016-01-27 23:23:21.873 CET", "role": "master", "state": "running", "tags": { "clonefrom": false, "nofailover": false, "noloadbalance": false }, "xlog": { "location": 301990984 }}

REST API (replica)http http://127.0.0.1:8009/masterHTTP/1.0 503 Service Unavailable...Server: BaseHTTP/0.3 Python/2.7.10

{ "postmaster_start_time": "2016-01-27 23:23:24.367 CET", "role": "replica", "state": "running", "tags": { "clonefrom": false, "nofailover": false, "noloadbalance": false }, "xlog": { "paused": false, "received_location": 301990984, "replayed_location": 301990984 }

Configuring HA Proxy for Patroniglobalmaxconn 100

defaultslog globalmode tcpretries 2timeout client 30mtimeout connect 4stimeout server 30mtimeout check 5s

frontend ft_postgresqlbind *:5000default_backend bk_db

backend bk_dboption httpchk

server postgresql_127.0.0.1_5432 127.0.0.1:5432 maxconn 100 check port 8008 server postgresql_127.0.0.1_5433 127.0.0.1:5433 maxconn 100 check port 8009

Implementation details

Separate nodes for etcd and patroni

Multi-threading to avoid blocking the event loop

Use synchronous_standby_names=’*’ for synchronous replication

Use etcd/Zookeeper watches to speed up the failover

CallbacksCall monitoring code or do some application-specific actions (i.e. change pgbouncer configuration)

User-defined scripts set in the configuration file.

● on start

● on stop

● on restart

● on change role

pg_rewind support● remove recovery.conf if present

● run a checkpoint on a promoted master (due to the fast promote)

● remove archive status to avoid losing archived segments to be removed

● start in a single-user mode with archive_command set to false

● stop to produce a clean shutdown

● only if checksums or enabled or wal_log_hints are set (via pg_controldata)

● Many installations already have Zookeeper running

● No TTL

● Session-specific (ephemeral) keys

● No dynamic nodes (use Exhibitor)

Zookeeper support

Spilo: Patroni on AWS

Up next

● scheduled failovers

● full support for cascading replication

● consul joins etcd and zookeeper

● manage BDR nodes

Thank you!Feedback: @alexeyklyukin

[email protected]

Links

github.com/zalando/patroni

spilo.readthedocs.org

coreos.com/etcd/docs/latest/getting-started-with-etcd.html

raft.github.io

https://github.com/zalando/patroni

https://github.com/zalando/patroni

http://spilo.readthedocs.org/

http://spilo.readthedocs.org/

https://coreos.com/etcd/docs/latest/getting-started-with-etcd.html

https://coreos.com/etcd/docs/latest/getting-started-with-etcd.html

https://raft.github.io/

https://raft.github.io/

high availability postgresql with zalando patroni

Technology