High availability PostgreSQL with Zalando Patroni
HA PostgreSQL with Patroni
Oleksii Kliukin, Zalando SE
@alexeyklyukin
FOSDEM PGDay 2016
January 29th, 2016, Brussels
What happens if the master is down?
● Built-in streaming replication is great!
● Only one writable node (primary, master)
● Multiple read-only standbys (replicas)
● Manual failover
pg_ctl promote -D /home/postgres/data
Re-joining the former master
Before 9.3:
rm -rf /home/postgres/data && pg_basebackup …
Before 9.5:
git clone -b PGREWIND1_0_0_PG9_4 --depth 1 https://github.com/vmware/pg_rewind.git \
&& cd pg_rewind && apt-get source postgresql-9.4 -y \
&& USE_PGXS=1 make top_srcdir=$(find . -name "postgresql*" -type d) install;
pg_rewind in 9.5 and above
● pg_rewind available in contrib (apt-get install postgresql-contrib-9.5)
● wal_log_hints = 'on' or enable data checksums
● rewind your former master to be able to follow the current one:
pg_rewind -D /home/postgres/data --source-server='host=localhost port=5433 sslmode=prefer'
● requires superuser access
No fixed address
● Pgbouncer
● Pgpool
● HAProxy
● Floating IP/DNS
[Diagram: CLIENTS, connection router, MASTER, REPLICA, FORMER MASTER, WAL storage; arrows labelled streaming replication, pg_rewind, archive command, restore command]
How much downtime can you tolerate?
Automatic failover
[Diagram: master and replica; the replica is promoted to master]
Network issues
[Diagram: master and replica; the replica is promoted; two masters?]
What about an arbiter?
[Diagram: replica, master, arbiter (ping); master, master, arbiter (vote); master, replica]
Do we need a distributed consensus?
Master election
The consensus problem requires agreement among a number of processes (or agents) for a single data value.
● leader (master) value defines the current master
● no leader - which node takes the master key
● leader is present - should be the same for all nodes
● leader has disappeared - should be the same for all nodes
Third party to enforce a consensus
● etcd from CoreOS
● distributed key-value storage
● directory-tree like
● implements RAFT
● talks REST
● key expiration with TTL and test and set operations
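The two etcd features the election relies on, key expiration and test-and-set, can be modelled in a few lines. This is a minimal in-memory sketch and not etcd's API: the class TinyKV and its method names are hypothetical, and the current time is passed in explicitly so the behaviour is easy to follow.

```python
class TinyKV:
    """Toy single-process model of a key-value store with TTL and test-and-set."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def get(self, key, now):
        entry = self._data.get(key)
        if entry is None or entry[1] <= now:
            self._data.pop(key, None)  # expired keys simply disappear
            return None
        return entry[0]

    def create(self, key, value, ttl, now):
        """Set the key only if it does not exist (or has expired)."""
        if self.get(key, now) is not None:
            return False
        self._data[key] = (value, now + ttl)
        return True

    def compare_and_swap(self, key, prev_value, value, ttl, now):
        """Replace the key only if it exists and its value equals prev_value."""
        current = self.get(key, now)
        if current is None or current != prev_value:
            return False
        self._data[key] = (value, now + ttl)
        return True


kv = TinyKV()
print(kv.create("leader", "postgresql0", ttl=30, now=0))   # first writer wins
print(kv.create("leader", "postgresql1", ttl=30, now=5))   # key exists: rejected
print(kv.create("leader", "postgresql1", ttl=30, now=31))  # key expired: accepted
```

In a real cluster the atomicity of both operations comes from etcd itself; the sketch only shows the semantics a client can rely on.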
RAFT
● Distributed consensus algorithm (like Paxos)
● Achieves consensus by directing all changes to the leader
● Only commit the change if it’s acknowledged by the majority of nodes
● 2 stages
○ leader election
○ log replication
● Implemented in etcd, consul.
http://thesecretlivesofdata.com/raft/
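The majority rule is simple arithmetic, sketched below. The function names are hypothetical and the code only illustrates the quorum calculation, not Raft's election or log replication.

```python
def majority(cluster_size):
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def is_committed(acks, cluster_size):
    """A change commits once a majority (leader included) has acknowledged it."""
    return acks >= majority(cluster_size)


for size in (3, 5):
    print(size, "nodes: need", majority(size), "acks, tolerate",
          size - majority(size), "failures")
```

This is why such clusters are usually sized with an odd number of nodes: going from 3 to 4 nodes raises the required majority without tolerating any additional failure.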
Patroni
● Manages a single PostgreSQL node
● Commonly runs on the same host as PostgreSQL
● Talks to etcd
● Promotes/demotes the managed node depending on the leader key
PostgreSQL master election
[Diagram: three nodes, each labelled "set leader lock"]
● every node tries to set the leader lock (key)
● the leader lock can only be set when it’s not present
● once the leader lock is set - no one else can obtain it
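The race for the lock can be simulated locally. In this sketch a plain dict and dict.setdefault stand in for etcd's atomic create-if-absent; the function names are hypothetical, and the shuffle only imitates the fact that arrival order is not known in advance.

```python
import random


def try_acquire(store, key, candidate):
    """Create-if-absent: returns True only for the node that holds the key."""
    return store.setdefault(key, candidate) == candidate


def run_election(nodes, seed=None):
    store = {}
    order = list(nodes)
    random.Random(seed).shuffle(order)  # arrival order is not deterministic
    results = {node: try_acquire(store, "leader", node) for node in order}
    return store["leader"], results


leader, results = run_election(["postgresql0", "postgresql1", "postgresql2"])
print("leader:", leader)
print("members:", sorted(n for n, won in results.items() if not won))
```

Whichever node arrives first becomes the leader; every other attempt fails and those nodes continue as members.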
PostgreSQL master election
http -f PUT http://127.0.0.1:2379/v2/keys/service/fosdem/leader?prevExist=false value="postgresql0" ttl=30
HTTP/1.1 201 Created
...
X-Etcd-Cluster-Id: 7e27652122e8b2ae
X-Etcd-Index: 2045
X-Raft-Index: 13006
X-Raft-Term: 2
{
"action": "create",
"node": {
"createdIndex": 2045,
"expiration": "2016-01-28T13:38:19.717822356Z",
"key": "/service/fosdem/leader",
"modifiedIndex": 2045,
"ttl": 30,
"value": "postgresql0"
}
}
ELECTED
http -f PUT http://127.0.0.1:2379/v2/keys/service/fosdem/leader?prevExist=false value="postgresql1" ttl=30
HTTP/1.1 412 Precondition Failed
...
X-Etcd-Cluster-Id: 7e27652122e8b2ae
X-Etcd-Index: 2047
{
"cause": "/service/fosdem/leader",
"errorCode": 105,
"index": 2047,
"message": "Key already exists"
}
Only one leader at a time
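The same create-if-absent request can be built from Python's standard library. The sketch below only constructs the request and does not send it; the function names are hypothetical, the URL layout and form fields mirror the http commands above, and the 201/412 interpretation is taken from the two responses shown.

```python
from urllib.parse import urlencode
from urllib.request import Request


def leader_create_request(etcd_url, scope, name, ttl):
    """Build (but do not send) the create-if-absent PUT for the leader key."""
    url = "{}/v2/keys/service/{}/leader?prevExist=false".format(etcd_url, scope)
    body = urlencode({"value": name, "ttl": ttl}).encode("ascii")
    return Request(url, data=body, method="PUT",
                   headers={"Content-Type": "application/x-www-form-urlencoded"})


def won_election(status_code):
    """201 means the key was created; 412 means another node already holds it."""
    return status_code == 201


req = leader_create_request("http://127.0.0.1:2379", "fosdem", "postgresql0", 30)
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

Passing the request to urllib.request.urlopen would send it; a 412 response surfaces there as an HTTPError, so a caller would read the status from the exception rather than from the normal return value.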
PostgreSQL master election
[Diagram: one node labelled "I'm the leader with the lock" and two nodes labelled "I'm the member", connected by streaming replication]
How do you know the leader is alive?
● leader updates its key periodically (by default every 10 seconds)
● only the leader is allowed to update the key (via compare and swap)
● if the key is not updated in 30 seconds - it expires (via TTL)
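The interplay of the 10-second refresh and the 30-second TTL determines how long a dead leader can still appear to hold the lock. The sketch below works through that arithmetic; the function names are hypothetical, and it assumes refreshes happen exactly every loop_wait seconds.

```python
def lock_expiry(refresh_times, ttl=30):
    """Time at which the leader key expires, given when it was last refreshed."""
    return max(refresh_times) + ttl


def detection_window(loop_wait=10, ttl=30):
    """(shortest, longest) time a crashed leader's key can outlive the crash.

    If the leader dies right before its next refresh, loop_wait seconds of the
    TTL are already used up; if it dies right after a refresh, the key lives
    for the full TTL.
    """
    return ttl - loop_wait, ttl


# Leader refreshes at t=0, 10, 20 and then crashes.
print(lock_expiry([0, 10, 20]))  # 50
print(detection_window())        # (20, 30)
```

With the defaults, other nodes can take over between 20 and 30 seconds after the leader stops refreshing its key.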
http -f PUT http://127.0.0.1:2379/v2/keys/service/fosdem/leader?prevValue="bar" value="bar"
HTTP/1.1 412 Precondition Failed
Content-Length: 89
Content-Type: application/json
Date: Thu, 28 Jan 2016 13:45:27 GMT
X-Etcd-Cluster-Id: 7e27652122e8b2ae
X-Etcd-Index: 2090
{
"cause": "[bar != postgresql0]",
"errorCode": 101,
"index": 2090,
"message": "Compare failed"
}
Only the leader can update the lock
http -f PUT http://127.0.0.1:2379/v2/keys/service/fosdem/leader?prevValue="postgresql0" value="postgresql0" ttl=30
{
"action": "compareAndSwap",
"node": {
"createdIndex": 2052,
"expiration": "2016-01-28T13:47:05.38531821Z",
"key": "/service/fosdem/leader",
"modifiedIndex": 2119,
"ttl": 30,
"value": "postgresql0"
},
"prevNode": {
"createdIndex": 2052,
"expiration": "2016-01-28T13:47:05.226784451Z",
"key": "/service/fosdem/leader",
"modifiedIndex": 2116,
"ttl": 22,
"value": "postgresql0"
}
}
How do you know where to connect?
$ etcdctl ls --recursive /service/fosdem
/service/fosdem/members
/service/fosdem/members/postgresql0
/service/fosdem/members/postgresql1
/service/fosdem/initialize
/service/fosdem/leader
/service/fosdem/optime
/service/fosdem/optime/leader
$ http http://127.0.0.1:2379/v2/keys/service/fosdem/members/postgresql0
HTTP/1.1 200 OK
...
X-Etcd-Cluster-Id: 7e27652122e8b2ae
X-Etcd-Index: 3114
X-Raft-Index: 20102
X-Raft-Term: 2
{
"action": "get",
"node": {
"createdIndex": 3111,
"expiration": "2016-01-28T14:28:25.221011955Z",
"key": "/service/fosdem/members/postgresql0",
"modifiedIndex": 3111,
"ttl": 22,
"value": "{\"conn_url\":\"postgres://replicator:[email protected]:5432/postgres\",\"api_url\":\"http://127.0.0.1:8008/patroni\",\"tags\":{\"nofailover\":false,\"noloadbalance\":false,\"clonefrom\":false},\"state\":\"running\",\"role\":\"master\",\"xlog_location\":234881568}"
}
}
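Note that the member value is itself a JSON document stored as a string, so a client has to decode twice. The sketch below picks the master's connection URL out of a members listing. The function name is hypothetical, the sample data is made up (hosts and credentials are placeholders), and the input shape assumes a recursive GET on the members directory in the etcd v2 API.

```python
import json


def find_master(etcd_response):
    """Return the conn_url of the running member whose role is 'master', if any.

    Expects a directory node with a list of child nodes, each carrying a JSON
    document in its 'value' field.
    """
    for child in etcd_response["node"].get("nodes", []):
        member = json.loads(child["value"])  # the value is JSON inside JSON
        if member.get("role") == "master" and member.get("state") == "running":
            return member["conn_url"]
    return None


# Hypothetical sample data; hosts and credentials are placeholders.
sample = {"node": {"key": "/service/fosdem/members", "dir": True, "nodes": [
    {"key": "/service/fosdem/members/postgresql0",
     "value": json.dumps({"conn_url": "postgres://user:secret@10.0.0.1:5432/postgres",
                          "role": "master", "state": "running"})},
    {"key": "/service/fosdem/members/postgresql1",
     "value": json.dumps({"conn_url": "postgres://user:secret@10.0.0.2:5432/postgres",
                          "role": "replica", "state": "running"})},
]}}
print(find_master(sample))
```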
Avoiding the split brain
Worst case scenario
Streaming replication in 140 characters
Patroni configuration parameters
● YAML file with sections
● general parameters
○ ttl: time to live for the leader and member keys
○ loop_wait: minimum time one iteration of the event loop takes
○ scope: name of the cluster to run
○ auth: 'username:password' string for the REST API
● postgresql section
○ name: name of the postgresql member (should be unique)
○ listen: address:port to listen to (or multiple addresses, e.g. 127.0.0.1,127.0.0.2:5432)
○ connect_address: address:port to advertise to other members (only one, e.g. 127.0.0.5:5432)
○ data_dir: PGDATA (may initially be non-empty)
○ maximum_lag_on_failover: do not fail over if the replica is more than this number of bytes behind
○ use_slots: whether to use replication slots (9.4 and above)
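To make the parameter list concrete, here is a sketch of such a configuration as a Python dict mirroring the YAML structure. Key names come from the list above; the nesting and all values are assumptions for illustration, and the checks are illustrative sanity rules, not Patroni's own validation.

```python
config = {
    "scope": "fosdem",   # cluster name, as in the etcd paths in this talk
    "ttl": 30,           # seconds before leader and member keys expire
    "loop_wait": 10,     # seconds per event-loop iteration
    "postgresql": {
        "name": "postgresql0",
        "listen": "127.0.0.1:5432",
        "connect_address": "127.0.0.1:5432",
        "data_dir": "/home/postgres/data",
        "maximum_lag_on_failover": 1048576,  # bytes; illustrative value
        "use_slots": True,
    },
}


def check_config(cfg):
    """Return a list of problems found; an empty list means no complaints."""
    problems = []
    if cfg["ttl"] <= cfg["loop_wait"]:
        problems.append("ttl must exceed loop_wait or the leader key "
                        "expires between refreshes")
    for key in ("name", "listen", "connect_address", "data_dir"):
        if key not in cfg["postgresql"]:
            problems.append("postgresql.{} is missing".format(key))
    return problems


print(check_config(config))
```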
postgresql subsections
● initdb: section to specify initdb options (e.g. encoding, default auth mode)
● pg_rewind: section with username/password for the user used by pg_rewind
● pg_hba: entries to be added to pg_hba.conf
● replication: replication user, password, and network (for pg_hba.conf)
● superuser: username/password for the superuser account (to be created)
● admin: username/password for the user with createdb/createrole permissions
● create_replica_methods: list of methods to image replicas from the master
● recovery.conf: parameters put into the recovery.conf (primary_conninfo is written automatically)
● parameters: postgresql.conf parameters (e.g. wal_log_hints or shared_buffers)
tags (patroni configuration)
tags modify behavior of the node they are applied to
● nofailover: the node should not participate in elections or ever become the master
● noloadbalance: the node should be excluded from the load balancer (TODO)
● clonefrom: this node should be bootstrapped from (TODO)
● replicatefrom: this node should do streaming replication from (pull request)
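The nofailover tag and the maximum_lag_on_failover parameter both restrict which nodes may run for the leader key. The sketch below combines them; it is an illustration under assumptions, not Patroni's actual code. The function name, the member dict layout, and the sample positions are hypothetical, with leader_xlog standing for the last leader position recorded in etcd.

```python
def failover_candidates(members, leader_xlog, maximum_lag_on_failover):
    """Names of members allowed to run for the leader key, least lag first."""
    eligible = []
    for m in members:
        if m["tags"].get("nofailover", False):
            continue  # tagged nodes never take part in the election
        lag = leader_xlog - m["xlog_location"]
        if lag > maximum_lag_on_failover:
            continue  # too far behind the last known leader position
        eligible.append((lag, m["name"]))
    return [name for _, name in sorted(eligible)]


members = [
    {"name": "postgresql1", "xlog_location": 234881000, "tags": {"nofailover": False}},
    {"name": "postgresql2", "xlog_location": 234881568, "tags": {"nofailover": True}},
    {"name": "postgresql3", "xlog_location": 100000000, "tags": {}},
]
print(failover_candidates(members, leader_xlog=234881568,
                          maximum_lag_on_failover=1048576))
```

Here only postgresql1 qualifies: postgresql2 is tagged nofailover and postgresql3 lags by more than the allowed number of bytes.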
REST API
● command and control interface
● GET /master and /replica endpoints for the load balancer
● GET /patroni in order to get system information
● POST /restart in order to restart the node
● POST /reinitialize in order to remove the data directory and reinitialize from the master
● POST /failover with leader and optional member names in order to do a controlled failover
● patronictl to do it in a more user-friendly way
REST API (master)
$ http http://127.0.0.1:8008/master
HTTP/1.0 200 OK
...
Server: BaseHTTP/0.3 Python/2.7.10
{
"postmaster_start_time": "2016-01-27 23:23:21.873 CET",
"role": "master",
"state": "running",
"tags": {
"clonefrom": false,
"nofailover": false,
"noloadbalance": false
},
"xlog": {
"location": 301990984
}
}
REST API (replica)
http http://127.0.0.1:8009/master
HTTP/1.0 503 Service Unavailable
...
Server: BaseHTTP/0.3 Python/2.7.10
{
"postmaster_start_time": "2016-01-27 23:23:24.367 CET",
"role": "replica",
"state": "running",
"tags": {
"clonefrom": false,
"nofailover": false,
"noloadbalance": false
},
"xlog": {
"paused": false,
"received_location": 301990984,
"replayed_location": 301990984
}
}
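The two responses show the contract a load balancer depends on: the status code alone says whether a node currently has the requested role. A minimal sketch of that mapping follows. The function name is hypothetical; the /master behaviour is taken from the examples above, while the mirrored /replica behaviour, the 404 for other paths, and the state check are assumptions.

```python
def health_status(path, role, state="running"):
    """HTTP status a load balancer check would see for a node with this role."""
    wanted = {"/master": "master", "/replica": "replica"}.get(path)
    if wanted is None:
        return 404  # assumption: other paths are not health endpoints
    return 200 if state == "running" and role == wanted else 503


for role in ("master", "replica"):
    print(role, health_status("/master", role), health_status("/replica", role))
```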
Configuring HAProxy for Patroni
global
    maxconn 100

defaults
    log global
    mode tcp
    retries 2
    timeout client 30m
    timeout connect 4s
    timeout server 30m
    timeout check 5s

frontend ft_postgresql
    bind *:5000
    default_backend bk_db

backend bk_db
    option httpchk
    server postgresql_127.0.0.1_5432 127.0.0.1:5432 maxconn 100 check port 8008
    server postgresql_127.0.0.1_5433 127.0.0.1:5433 maxconn 100 check port 8009
Implementation details
Separate nodes for etcd and patroni
Multi-threading to avoid blocking the event loop
Use synchronous_standby_names='*' for synchronous replication
Use etcd/Zookeeper watches to speed up the failover
Callbacks
Call monitoring code or do some application-specific actions (e.g. change pgbouncer configuration)
User-defined scripts set in the configuration file.
● on start
● on stop
● on restart
● on change role
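A callback dispatcher could look roughly like the sketch below. Everything specific here is an assumption for illustration: the function name, the on_role_change key spelling, the script path, and the calling convention of passing the event, the role, and the cluster name as arguments.

```python
import shlex


def callback_command(callbacks, event, role, scope):
    """Return the argv for the script registered for this event, or None."""
    script = callbacks.get(event)
    if not script:
        return None
    # Assumed calling convention: <script> <event> <role> <cluster name>
    return shlex.split(script) + [event, role, scope]


# Hypothetical configuration; the script path is a placeholder.
callbacks = {"on_role_change": "/usr/local/bin/update_pgbouncer.sh"}
print(callback_command(callbacks, "on_role_change", "master", "fosdem"))
print(callback_command(callbacks, "on_stop", "master", "fosdem"))
```

The returned list can be handed to subprocess.Popen so that a slow script does not block the caller.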
pg_rewind support
● remove recovery.conf if present
● run a checkpoint on a promoted master (due to the fast promote)
● remove archive status to avoid losing archived segments to be removed
● start in a single-user mode with archive_command set to false
● stop to produce a clean shutdown
● only if data checksums are enabled or wal_log_hints is set (checked via pg_controldata)
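The pg_controldata precondition can be sketched as a small parser. The function names are hypothetical, the sample is hand-written and abbreviated, and the field labels are assumed to match what pg_controldata prints in 9.5; the endswith match is there because older releases are assumed to prefix the label with "Current".

```python
def parse_controldata(output):
    """Turn 'name:   value' lines of pg_controldata output into a dict."""
    data = {}
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            data[name.strip()] = value.strip()
    return data


def can_rewind(controldata):
    """pg_rewind needs data checksums or wal_log_hints on the target cluster."""
    hints = any(k.endswith("wal_log_hints setting") and v == "on"
                for k, v in controldata.items())
    checksums = controldata.get("Data page checksum version", "0") != "0"
    return hints or checksums


# Abbreviated, hand-written sample in the style of pg_controldata output.
sample = """\
Database cluster state:               in production
wal_log_hints setting:                on
Data page checksum version:           0
"""
print(can_rewind(parse_controldata(sample)))
```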
Zookeeper support
● Many installations already have Zookeeper running
● No TTL
● Session-specific (ephemeral) keys
● No dynamic nodes (use Exhibitor)
Spilo: Patroni on AWS
Up next
● scheduled failovers
● full support for cascading replication
● consul joins etcd and zookeeper
● manage BDR nodes
Thank you!
Feedback: @alexeyklyukin
Links
github.com/zalando/patroni
spilo.readthedocs.org
coreos.com/etcd/docs/latest/getting-started-with-etcd.html
raft.github.io