Automating PostgreSQL failover for high availability
David Pacheco (@dapsays), Joyent


Page 1

Automating PostgreSQL failover for high availability

David Pacheco (@dapsays), Joyent

Page 2

This talk

• Using PostgreSQL to build Joyent’s Manta service

• Operational experiences

• Incidents

• Tooling

• Wish list

Page 3

Background

• Most of this work was done in 2012 - 2013, with a major rework in late 2014

• None of us is primarily a database engineer or database administrator (at least not when this work was done)

• We learned a lot since then!

Page 4

The Problem

• Building Joyent’s Manta storage service

• HTTP-based, multi-tenant object store

• PUT/GET/DELETE of arbitrary byte streams

• ... with in situ, Unix-like compute as a first-class operation

Page 5

Manta: storage architecture

• Front door handles API requests toPUT/GET/DELETE objects

• Storage tier stores user data on ZFS.

• Metadata tier maps user-facing paths to backend storage objects

Page 6

Metadata tier

• Durability is top priority.

• Strongly consistent (CP): because you can build AP atop CP, but not the reverse

• Workload:

• heavy read/write (small records)

• 24/7/365 duty cycle

Page 7

Modern cloud system

• Fault tolerant

• Horizontally scalable

Page 8

Modern cloud system

• Fault tolerant (replication)

• Horizontally scalable (sharding)

Page 9

PostgreSQL for metadata

• Going-in position: use PostgreSQL, at least until we know why it won’t work

• Known to scale well vertically

• Good operability (tools, observability)

• Built-in replication

Page 10

Fault tolerance

• Three-peer cluster:

• Primary peer services all requests

• PostgreSQL synchronous replication to a synchronous peer (“secondary”, or “S”)

• Asynchronous replication to a third peer (“async”, or “A”)

• In three-datacenter config, each peer is in a separate datacenter.
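
On the primary, the roles can be confirmed with a standard PostgreSQL view (this is plain PostgreSQL, not anything Manatee-specific):

    -- On the primary: which standby is synchronous and which is asynchronous
    -- (sync_state is 'sync' for the sync peer, 'async' for the async peer)
    SELECT application_name, client_addr, state, sync_state
    FROM pg_stat_replication;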

Page 11

Manatee cluster

[Diagram: Clients connect to the Primary; sync replication runs from the Primary to the Sync peer, and async replication continues on to Async 1.]

Page 12

Sync fails

initial: A (primary), B (sync), C (async)

Page 13

Sync fails

initial: A (primary), B (sync), C (async)
B fails, A promotes C: A (primary), C (sync)

Page 14

Sync fails

initial: A (primary), B (sync), C (async)
B fails, A promotes C: A (primary), C (sync)
later, B returns: A (primary), C (sync), B (async)

Page 15

Unexpected constraint

• If P is synchronously replicating to S, it’s not guaranteed that switching P and S will work. And it often doesn’t.

Page 16

Primary fails

initial: A (primary), B (sync), C (async)

Page 17

Primary fails

initial: A (primary), B (sync), C (async)
A fails, B takes over and promotes C: B (primary), C (sync)

Page 18

Primary fails

initial: A (primary), B (sync), C (async)
A fails, B takes over and promotes C: B (primary), C (sync), A (deposed)

Page 19

Primary fails

initial: A (primary), B (sync), C (async)
A fails, B takes over and promotes C: B (primary), C (sync), A (deposed)
later, A rebuilt: B (primary), C (sync), A (async)

Page 20

Fault tolerance

• If P fails, S takes over and promotes A.

• If S fails, P promotes A.

• If A fails, no immediate impact.

• Writes sent to S or A will simply fail, but the client still needs to know which peer is primary in order to work.

• Moray takes care of this (more later).

Page 21

Cluster components

[Diagram: Clients issue pg queries through Moray to the PostgreSQL peers (Primary, Sync, Async 1), which are connected by sync and async replication. Each peer's Manatee monitors and updates the cluster state in ZooKeeper; Moray monitors that state to find the current primary.]

Page 22

Consensus

• PostgreSQL peers run by a babysitter process called Manatee

• Manatee instances are connected to ZooKeeper for consensus.

• Cluster state stored in ZooKeeper.

• Clients read ZooKeeper state.

• Still non-trivial.

Page 23

Demo

Page 24

Cluster state

• Identities of primary peer, sync peer, async peers (in order), deposed peers

• Generation number

• WAL position at start of generation

Page 25

Page 26

Cluster state management

• Distributed algorithm guarantees no unsafe transitions

• Tools clearly report current state and history of all transitions

• From-scratch rebuild required when primary becomes deposed.

Page 27

Moray: custom client

• Connection pooling

• Decouple application (Manta) from database

• Handles client side of cluster state (keeps track of which peer is primary)

Page 28

Moray

• Moray: key-value store built atop PostgreSQL

• “buckets”: tables

• “objects”: rows

• Operations

• put/get/delete bucket config

• put/get/delete/update/find objects

• transactions of the above
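
As a rough illustration of that mapping (the table and column names below are hypothetical, not Moray's actual schema), a bucket might correspond to a table like:

    -- Hypothetical sketch of a Moray-style bucket as a PostgreSQL table
    CREATE TABLE example_bucket (
        _key    TEXT PRIMARY KEY,  -- object key
        _value  TEXT NOT NULL,     -- serialized object (JSON)
        _etag   TEXT,              -- used for conditional put/delete
        _mtime  BIGINT             -- last-modified timestamp
    );

    -- put/get/delete object then reduce to statements keyed on _key, e.g.:
    --   get:    SELECT _value FROM example_bucket WHERE _key = $1
    --   delete: DELETE FROM example_bucket WHERE _key = $1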

Page 29

Horizontal scalability

• Shard based on object’s directory name

• Keeping all of a directory’s objects on a single peer makes listing possible.

• Use consistent hashing implementation

• Hidden behind Electric Moray service.

Page 30

All open-source

• Manatee: https://github.com/joyent/manatee

• Much love still needed to make this useful for others (sorry we’re behind on PRs!)

• Manatee state machine: https://github.com/joyent/manatee-state-machine (includes more formal write-up)

• Moray: https://github.com/joyent/moray

• Manta: https://github.com/joyent/manta

Page 31

Manatee today

• Still running version 9.2. (Are we stuck?)

• Moray/Manatee stack incorporated into SDC (SmartDataCenter) for storing all SDC data: users, compute nodes, containers, networks, ...

Page 32

Manatee/Moray for jobs

• We also use the Manatee/Moray/PostgreSQL stack to manage execution of compute jobs

• All state in database: easy to debug, easy to reason about fault tolerance

• Arbitrary numbers of inputs, tasks, and outputs (jobs purged after 24h)

• Most queries based on jobId or workerId (all state is denormalized)

Page 33

Page 34

Incident 1: job performance

• Job queries became very slow

• 7-second job takes 10 minutes

Page 35

Incident 1: job performance

• Typical reason: insufficient vacuum/analyze (exacerbated by long transactions)

• Problem: with a 24/7 duty cycle, cron-based “vacuum analyze” is not okay.
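
The usual remedies are to make autovacuum itself keep up and to hunt down the long transactions that prevent cleanup; a sketch (the table name and thresholds are illustrative, not the values used in Manta):

    -- Make autovacuum/autoanalyze fire more aggressively on a hot table
    ALTER TABLE manta_jobs SET (
        autovacuum_vacuum_scale_factor = 0.05,
        autovacuum_analyze_scale_factor = 0.02
    );

    -- Find long-running transactions that keep vacuum from reclaiming dead rows
    SELECT pid, now() - xact_start AS xact_age, state, query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
    ORDER BY xact_age DESC;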

Page 36

Incident 2: queries hung

• Observation: queries hung

Page 37

Incident 2: queries hung

• Observation: queries hung

• Cause: poor client error handling (uncaughtException handler FTL)

Page 38

Incident 3: queries hung(again)

• Observation: queries hung

• Observation: autovacuum running: “to prevent wraparound”
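
Two standard queries make this situation visible: how close each database is to the wraparound threshold, and which autovacuum workers are running:

    -- Transaction-ID age per database; an anti-wraparound autovacuum starts
    -- once a table's age exceeds autovacuum_freeze_max_age (200M by default)
    SELECT datname, age(datfrozenxid) AS xid_age
    FROM pg_database
    ORDER BY xid_age DESC;

    -- Running autovacuum workers ("to prevent wraparound" appears in the query text)
    SELECT pid, query
    FROM pg_stat_activity
    WHERE query LIKE 'autovacuum:%';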

Page 39

Incident 3: queries hung

• We spent a while trying to speed up the autovacuum

• prefetching

• disabling sleeping (!)

Page 40

Incident 3: queries hung

• Five hours later: autovacuum completes

• Service is back online

Page 41

Incident 3: queries hung

• 5 minutes later: autovacuum starts again (“to prevent wraparound”)

Page 42

Incident 3: queries hung

• Solution: tune up autovacuum_freeze_max_age (from 200M to 500M)
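
For reference, the change looks like this; on 9.2 the parameter lives in postgresql.conf and only takes effect after a restart (ALTER SYSTEM exists only in 9.4 and later):

    -- Check the current threshold (default: 200 million)
    SHOW autovacuum_freeze_max_age;

    -- Raise it to 500 million (9.4+ syntax; requires a server restart)
    ALTER SYSTEM SET autovacuum_freeze_max_age = 500000000;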

Page 43

Incident 3: queries hung

• But why were we blocked?

• In the heat of battle, saved pg_locks information
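
A much-simplified version of the kind of summary described on the next slide can be pulled straight from pg_locks, pairing each waiter with the sessions holding a lock on the same relation:

    -- Who is waiting on a relation-level lock, and who holds a lock on that
    -- relation (simplified: ignores transaction-ID and tuple locks)
    SELECT w.pid                  AS waiting_pid,
           w.mode                 AS wanted_mode,
           h.pid                  AS holding_pid,
           h.mode                 AS held_mode,
           h.relation::regclass   AS relation
    FROM pg_locks w
    JOIN pg_locks h
      ON h.relation = w.relation
     AND h.database = w.database
     AND h.granted
     AND h.pid <> w.pid
    WHERE w.locktype = 'relation'
      AND NOT w.granted;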

Page 44

Incident 3: queries hung

• Wrote “pglockanalyze” to summarize locks

• Root cause: DROP TRIGGER

Page 45

PG locking

    Queries         Locks
    data path       ACCESS SHARE
    autovacuum      SHARE UPDATE EXCLUSIVE
    DROP TRIGGER    ACCESS EXCLUSIVE
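
Because ACCESS EXCLUSIVE conflicts with every other mode, a DROP TRIGGER that queues behind the autovacuum's SHARE UPDATE EXCLUSIVE lock also blocks every later ACCESS SHARE reader; and an anti-wraparound autovacuum, unlike an ordinary one, does not cancel itself when it blocks other work. A sketch with hypothetical names:

    -- Session 1: DROP TRIGGER needs ACCESS EXCLUSIVE on the table, so it
    -- first waits for the wraparound autovacuum to finish.
    BEGIN;
    DROP TRIGGER row_count_trigger ON manta_directory_counts;
    -- ... transaction still open ...

    -- Session 2: an ordinary read now queues behind the pending ACCESS
    -- EXCLUSIVE request, even though it only needs ACCESS SHARE.
    SELECT count(*) FROM manta_directory_counts;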

Page 46

Incident 3: queries hung

• Root cause was our application

• … but we could not hit this without a wraparound autovacuum

• … and now that we know about it, there’s no way to verify that exclusive locks are not being used (DTrace probes are close!)

Page 47

Incident: 70K autovacuums

• At one point, a dubious design choice resulted in 70,000 tiny tables created around the same time

• Months later, nearly all of them ran into transaction wraparound autovacuum

• Surprisingly: this was mostly fine, except for the huge amount of I/O needed to update the 130MB stats collector file
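
A quick way to see a pile-up like this coming is to count tables by transaction-ID age (the threshold here is illustrative):

    -- Tables whose relfrozenxid age is approaching autovacuum_freeze_max_age
    SELECT count(*) AS tables_near_wraparound
    FROM pg_class
    WHERE relkind = 'r'
      AND age(relfrozenxid) > 150000000;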

Page 48

Tools we built

• https://github.com/joyent/pgsqlstat (DTrace-based tools for watching overall activity, queries, transactions, lock events, and latency)

• https://github.com/joyent/pglockanalyze (Summarize lock dependencies)

• https://github.com/joyent/pgstatsmon (Periodically report basic stats to statsd)

Page 49

Silly stuff we did

• Used ZK join order for initial cluster config

• Tried to use PostgreSQL as a queue

• Autovacuum configuration

• Connection management (especially for super-users, replication)

• Created 70,000 tables

• Periodic CREATE/DROP TRIGGER

• Didn’t look carefully at existing ecosystem.

Page 50

Still not root-caused

• manual “vacuum” as root succeeded, but didn’t update table’s vacuum time

• “freeze” autovacuum succeeded, but didn’t update freeze age

• Several times: PostgreSQL crashes on startup during WAL replay.

• Typically: primary fine, sync crashes

• Rebuild sync from latest primary snapshot. Scary.
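
The vacuum times and freeze ages mentioned above are presumably the ones visible in the standard catalogs:

    -- Last (auto)vacuum / (auto)analyze time per table
    SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
    FROM pg_stat_user_tables
    ORDER BY relname;

    -- Per-table freeze age
    SELECT relname, age(relfrozenxid) AS freeze_age
    FROM pg_class
    WHERE relkind = 'r'
    ORDER BY freeze_age DESC;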

Page 51

Stuff we’ve loved

• Durability, performance

• Observability:

• EXPLAIN

• DTrace probes

• internal views (pg_stat_activity, pg_stat_*_tables, pg_locks)

• We really value root-cause-analysis!

Page 52

Wish list

• Some of these seem to be implemented...

• Online upgrade (or migration)

• What locks are being taken? (not based on sampling)

• Option for reads to go through sync replication path (to guarantee read-after-write consistency)

Page 53

Automating PostgreSQL failover for high availability

David Pacheco (@dapsays), Joyent