Post on 25-Jun-2020

WHO WANTS A SERVICE WITH ZERO DOWNTIME?

… EVERYBODY

IS IT THAT GOOD?

NOT JUST TECHNOLOGY. RISKS, PROCEDURES, PEOPLE

FROM 0 TO ~100: BUSINESS CONTINUITY WITH POSTGRESQL

Giulio Calacoci, Senior Developer @ 2ndQuadrant

DataOps 2019, Barcelona

2ndquadrant.com

@asdmaster @2ndQuad #PostgreSQL #DataOps #Barcellona #BusinessContinuity

ABOUT MYSELF

▸ Passionate about open source since the early 2000s

▸ Member of the Italian and European PostgreSQL community

▸ Lean and DevOps practitioner

▸ Open Source Developer

▸ Member of the Barman team

▸ Continuous Delivery Architect @2ndQuadrant

▸ 24/7 support engineer @2ndQuadrant

BUSINESS CONTINUITY

▸ Disaster Recovery

▸ High Availability

▸ Types of disaster/failures

▸ Availability = Uptime / (Uptime + Downtime)
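A minimal sketch of the availability formula above, plus the inverse question (how much downtime a target allows). The function names and the 365-day year are assumptions made for illustration, not part of the talk.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: non-leap year


def availability(uptime: float, downtime: float) -> float:
    """Availability = Uptime / (Uptime + Downtime), both in the same unit."""
    total = uptime + downtime
    if total <= 0:
        raise ValueError("uptime + downtime must be positive")
    return uptime / total


def downtime_budget_seconds(target: float,
                            period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Maximum downtime allowed in a period for a target availability (0..1)."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    return (1.0 - target) * period_seconds
```

For example, availability(99, 1) is 0.99, and downtime_budget_seconds(0.9999) / 60 comes to roughly 52.6 minutes per year.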

OBJECTIVES

▸ Recovery Point Objective (RPO)

▸ How much data can I afford to lose?

▸ Recovery Time Objective (RTO)

▸ How long will it take me to recover?

SERVICE RELIABILITY

▸ Cost of downtime

▸ How many €/$/£/AUD/AED/…?

▸ Risk management

▸ SLI, SLO and SLA

SOME NOTES FOR THIS PRESENTATION

▸ PostgreSQL on Linux

▸ Servers can be either physical or virtual

▸ Storage must be redundant

▸ RAID is required

▸ VOLUME: redundant disk mounted on a system

LET’S START

0. ONE POSTGRES SERVER

ARCHITECTURE

Server name: HOPE

RECAP

▸ Why is RPO = ∞?

▸ Why is RTO = n/a?

▸ “Hope is not a strategy” (cit. Google)

▸ More common than you’d expect

10. ONE POSTGRES SERVER + LOGICAL BACKUPS

ARCHITECTURE

Add systematic backups with pg_dump

[Diagram: one logical backup per day, taken at 4 AM: Day 0, Day -1, Day -2, …]
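A sketch of how a daily pg_dump job could build its command line, one custom-format dump file per day. The database name, backup directory and file naming scheme are hypothetical; scheduling (for example a cron entry at 4 AM) is left to the operating system.

```python
from datetime import date
from pathlib import PurePosixPath


def pg_dump_command(dbname: str, backup_dir: str, day: date) -> list[str]:
    """Build a pg_dump invocation that writes one custom-format dump per day."""
    # Hypothetical naming scheme: <dbname>-<YYYY-MM-DD>.dump
    target = PurePosixPath(backup_dir) / f"{dbname}-{day.isoformat()}.dump"
    # --format=custom produces an archive that pg_restore can read.
    return ["pg_dump", "--format=custom", f"--file={target}", dbname]
```

The returned list can be passed to subprocess.run(..., check=True), so that a failed dump raises instead of passing silently.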

RECAP

▸ How do you feel now?

▸ Still: RPO = ∞ and RTO = n/a. Why?

▸ A backup is valid only if you have tested it

▸ Unfortunately, this is very common

20. ONE POSTGRES SERVER + LOGICAL BACKUPS + LOGICAL RESTORES

ARCHITECTURE

Test your backups with pg_restore

[Diagram: logical backup, Day 0, 4 AM]
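A sketch of a restore test that also measures how long the restore takes. It assumes the target database already exists on the test server; the file and database names are hypothetical.

```python
import subprocess
import sys
import time


def pg_restore_command(dump_file: str, target_db: str, jobs: int = 1) -> list[str]:
    """Build a pg_restore invocation for a custom-format dump."""
    return ["pg_restore", f"--dbname={target_db}", f"--jobs={jobs}", dump_file]


def timed_run(command: list[str]) -> float:
    """Run a command, raise if it fails, return elapsed wall-clock seconds."""
    started = time.monotonic()
    subprocess.run(command, check=True)  # CalledProcessError on non-zero exit
    return time.monotonic() - started
```

Calling timed_run(pg_restore_command("shop.dump", "shop_restore_test", jobs=4)) both validates the backup and gives a measured restore time to feed into the RTO.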

DEFINING SOME OBJECTIVES

▸ Measure time for pg_restore

▸ RPO = backup frequency

▸ RTO = maximum time of recovery

▸ Provision another server

▸ Configure another server (automated, right?)

▸ Time to restore the last backup (measure it)
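The arithmetic behind these objectives can be written down as a small sketch. The durations used in the example are hypothetical; the restore time in particular should come from a real measurement.

```python
from datetime import timedelta


def rpo_from_backup_interval(interval: timedelta) -> timedelta:
    """With periodic logical backups, the worst-case data loss is one interval."""
    return interval


def rto_estimate(provision: timedelta,
                 configure: timedelta,
                 restore: timedelta) -> timedelta:
    """RTO as the sum of the three recovery phases listed above."""
    return provision + configure + restore
```

With daily backups the RPO is 24 hours. With 30 minutes to provision, 15 minutes to configure and a measured 2-hour restore, rto_estimate returns 2 hours 45 minutes.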

HAVE WE REALLY THOUGHT ABOUT EVERYTHING?

TIME OF REACTION

RECAP

▸ Can this architecture work for you?

▸ We need reliable monitoring

▸ From now on, we assume we have it in place!

▸ We need to reduce both RPO and RTO

HOW?

POINT-IN-TIME RECOVERY

Using a time machine

POSTGRESQL’S PITR

▸ Part of core (fully open source)

▸ Rebuild a cluster at a point in time

▸ From crash recovery to synchronous streaming replication (physical/logical)

▸ RPO = 0 (zero data loss)

▸ Hot base backup, continuous WAL archiving, Recovery

▸ API

BASIC CONCEPTS

▸ Continuous copy of WAL data (continuous archiving)

▸ Physical base backups

▸ Recovery:

▸ copy base backup to another location

▸ recovery mode (replay of WALs until target)
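The recovery steps above can be illustrated with a toy model: start from a copy of the base backup, then replay log records in order until the target time. This is a simplified illustration, not how PostgreSQL stores or replays WAL; the record layout and integer positions are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WalRecord:
    lsn: int          # toy log position, not a real PostgreSQL LSN
    timestamp: float  # commit time
    key: str
    value: str


def recover(base_backup: dict[str, str], base_lsn: int,
            wal: list[WalRecord], target_time: float) -> dict[str, str]:
    """Rebuild the state as of target_time from a base backup plus WAL."""
    state = dict(base_backup)            # copy the base backup elsewhere
    for record in sorted(wal, key=lambda r: r.lsn):
        if record.lsn <= base_lsn:
            continue                     # already contained in the base backup
        if record.timestamp > target_time:
            break                        # target reached: stop replaying
        state[record.key] = record.value
    return state
```

Changing target_time moves the recovered state backwards or forwards along the archived log, which is the "time machine" idea.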

BARMAN

▸ Latest version: Barman 2.8

▸ Open Source (GNU GPL 3)

▸ Written in Python

▸ Developed and maintained by 2ndQuadrant

▸ Available at www.pgbarman.org

40. ONE POSTGRES SERVER + ONE BARMAN SERVER

ARCHITECTURE

Continuous backup

BASIC CONCEPTS

▸ Remote backup and recovery

▸ Multiple server management

▸ Backup catalogue and WAL archive

▸ Retention policies

COPY METHOD

▸ PostgreSQL streaming

▸ Practical/Windows/Docker

▸ Rsync/SSH

▸ Incremental backup and recovery (via hard links)

▸ Parallel backup and recovery

▸ Network compression and bandwidth limitation

WAL SHIPPING METHOD

▸ “archiving”, through “archive_command”:

▸ RPO ~ 16MB of WAL data, or

▸ “archive_timeout”

▸ “streaming”, through streaming replication:

▸ “pg_receivewal” or “pg_receivexlog”

▸ continuous stream, RPO ~ 0

▸ PostgreSQL 9.2+ required
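A rough sketch of the exposure window when WAL is shipped through archive_command: a segment is archived when it fills up (16 MB by default) or when archive_timeout forces a switch, whichever comes first. The WAL write rate is a hypothetical input, and the time needed to copy the segment is ignored.

```python
from typing import Optional

WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # default WAL segment size


def archiving_exposure_seconds(wal_bytes_per_second: float,
                               archive_timeout_seconds: Optional[float] = None) -> float:
    """Worst-case age, in seconds, of WAL that has not been archived yet."""
    if wal_bytes_per_second <= 0:
        raise ValueError("wal_bytes_per_second must be positive")
    fill_time = WAL_SEGMENT_BYTES / wal_bytes_per_second
    # archive_timeout = 0 (or unset) means no forced segment switch.
    if not archive_timeout_seconds or archive_timeout_seconds <= 0:
        return fill_time
    return min(fill_time, archive_timeout_seconds)
```

A quiet database writing 1 MB of WAL per minute would take 16 minutes to fill a segment, so an archive_timeout of 60 seconds brings the window down to one minute. Streaming avoids the wait altogether.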

EXAMPLE FROM POSTGRESQL.CONF

archive_mode = on
wal_level = logical
max_wal_senders = 10
max_replication_slots = 10
archive_command = 'rsync -a %p barman@HOST:/var/lib/barman/ID/incoming'

EXAMPLE FROM BARMAN.CONF

[stark]
description = "Tony Stark database"
ssh_command = ssh postgres@stark
conninfo = user=barman-avengers dbname=postgres host=stark
retention_policy = RECOVERY WINDOW OF 6 MONTHS
copy_method = rsync
reuse_backup = link
parallel_jobs = 4
archiver = true
streaming_archiver = true
slot_name = barman_streaming
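Since the Barman configuration is an INI-style file, a short sketch can load the section above and run a couple of sanity checks before deployment. The two rules in check_server are this sketch's own simplifications (a missing option is treated as disabled), not Barman's validation logic; "barman check" remains the authoritative test.

```python
import configparser

BARMAN_CONF = """\
[stark]
description = "Tony Stark database"
ssh_command = ssh postgres@stark
conninfo = user=barman-avengers dbname=postgres host=stark
retention_policy = RECOVERY WINDOW OF 6 MONTHS
copy_method = rsync
reuse_backup = link
parallel_jobs = 4
archiver = true
streaming_archiver = true
slot_name = barman_streaming
"""


def load_server(conf_text: str, server: str) -> dict[str, str]:
    """Return the options of one server section as a plain dictionary."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    return dict(parser[server])


def is_on(value: str) -> bool:
    return value.strip().lower() in {"true", "on", "yes", "1"}


def check_server(options: dict[str, str]) -> list[str]:
    """Hypothetical sanity rules; returns a list of human-readable problems."""
    problems = []
    archiver = is_on(options.get("archiver", "false"))
    streaming = is_on(options.get("streaming_archiver", "false"))
    if not (archiver or streaming):
        problems.append("no WAL shipping method enabled")
    if streaming and not options.get("slot_name"):
        problems.append("streaming_archiver enabled without a slot_name")
    return problems
```

check_server(load_server(BARMAN_CONF, "stark")) returns an empty list for the example configuration.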

RECAP

▸ How do you feel now?

▸ Still: RPO = ∞ and RTO = n/a. Why?

▸ A backup is valid only if you have tested it

▸ Barman reduces backup risks, does not exclude them

▸ Systematic tests (especially custom scripts)

▸ Business risk is very high

60. ONE POSTGRES SERVER + ONE BARMAN SERVER + ONE RECOVERY SERVER

ARCHITECTURE

Test your backups with barman recover

WHAT A WASTE!

HAVE YOU EVER THOUGHT OF USING IT FOR TESTING OR BI?

HOOK SCRIPTS

▸ Barman has hook scripts:

▸ pre and post backup

▸ pre and post archiving

▸ with retry option (until the script returns SUCCESS)

EXAMPLE OF RECOVERY SCRIPT

▸ Write a bash script that:

▸ connects to a remote server via SSH

▸ stops the PostgreSQL server

▸ issues a “barman recover” with target “immediate”

▸ starts the PostgreSQL server

▸ Set it as post-backup script
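The slide suggests a bash script; the same steps are sketched here in Python, only as a plan of commands rather than a finished hook. The host name, data directory and the use of pg_ctl over SSH are assumptions. The environment variable names (BARMAN_SERVER, BARMAN_BACKUP_ID, BARMAN_STATUS) follow Barman's hook-script environment as I understand it and should be checked against the manual for your version.

```python
import shlex


def recovery_test_plan(server: str, backup_id: str,
                       recovery_host: str, data_dir: str) -> list[list[str]]:
    """Commands to stop PostgreSQL, recover the backup remotely, and restart."""
    ssh_target = f"postgres@{recovery_host}"      # assumed remote user
    quoted_dir = shlex.quote(data_dir)
    return [
        ["ssh", ssh_target, f"pg_ctl -D {quoted_dir} stop -m fast"],
        ["barman", "recover",
         "--remote-ssh-command", f"ssh {ssh_target}",
         "--target-immediate",                    # stop at first consistent point
         server, backup_id, data_dir],
        ["ssh", ssh_target, f"pg_ctl -D {quoted_dir} start"],
    ]


def plan_from_hook_env(env: dict[str, str],
                       recovery_host: str, data_dir: str) -> list[list[str]]:
    """Build the plan from the hook environment; empty if the backup failed."""
    if env.get("BARMAN_STATUS") != "DONE":
        return []
    return recovery_test_plan(env["BARMAN_SERVER"], env["BARMAN_BACKUP_ID"],
                              recovery_host, data_dir)
```

Each command in the plan can be run with subprocess.run(command, check=True). Timing the middle step gives the recovery time to use in the RTO.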

SOME FOOD FOR THOUGHT

▸ Outcomes:

▸ Systematically test your backup

▸ Measure your recovery time

▸ Identical server? This is a backup server ready to start

▸ You can use a different data centre

▸ Be creative, PostgreSQL gives you infinite freedom!

RECAP

▸ RPO ~ 0 (your backups work, every time)

▸ RTO = Time of reaction + Recovery time

▸ Example: RPO ~0 and RTO < 1 day

▸ Acceptable or not acceptable?

▸ Entry level architecture for business continuity

▸ Priority now: improve RTO

HOW?

REPLICATION

POSTGRESQL’S REPLICATION

▸ Part of core (fully open source)

▸ One master, multiple standby servers

▸ Evolution of PITR

▸ Standby server is in continuous recovery mode

▸ Hot standby (read-only)

▸ Both streaming (9.0+) and file-based pulling of WAL

▸ Cascading from a standby

SYNCHRONOUS REPLICATION

▸ Fine control (from global down to transaction level)

▸ 2-safe replication

▸ COMMIT of a write transaction waits until it is written on both the master and a standby (or more than one standby, from 9.6)

▸ Read consistency of a cluster

▸ RPO = 0 (zero data loss)

80. TWO POSTGRES SERVERS + ONE BARMAN SERVER + ONE RECOVERY SERVER

ARCHITECTURE

[Diagram: symmetric cluster with master STARK and standby ROGERS; arrows labelled barman_restore_wal and barman recover]

EXCERPT FROM ROGERS POSTGRESQL’S CONFIGURATION

postgresql.conf:

hot_standby = on

recovery.conf:

standby_mode = 'on'
# Streaming
primary_conninfo = 'host=stark user=replica application_name=ha sslmode=require'
# Fallback via Barman
restore_command = 'barman-wal-restore -U barman avengers stark %f %p'

SWITCHOVER (PLANNED)

▸ Applications are paused (start of downtime)

▸ Shut down the master

▸ Allow the standby to catch up with the master

▸ Promote the standby

▸ Switch virtual IPs

▸ Resume applications (end of downtime)

▸ Reconfigure the former master as standby

FAILOVER (UNPLANNED)

▸ The master is down (start of downtime)

▸ Promote the standby

▸ Change the virtual IP

▸ DEGRADED SYSTEM

MANUAL SWITCHOVER AND FAILOVER

▸ Manual switchover != manual switchover procedure

▸ Manual switchover = manually triggered

▸ Automate the procedure!!!

▸ bash (good)

▸ Ansible (better)

▸ Enhance gradually
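"Manually triggered, automatically executed" can start very small. The sketch below is a generic step runner, not a switchover tool: the step names are taken from the switchover slide, while the actions are placeholders that a real procedure would replace with bash or Ansible calls.

```python
from typing import Callable

Step = tuple[str, Callable[[], bool]]

SWITCHOVER_STEP_NAMES = [
    "pause applications",
    "shut down the master",
    "let the standby catch up",
    "promote the standby",
    "switch virtual IPs",
    "resume applications",
    "reconfigure the former master as standby",
]


def run_procedure(steps: list[Step]) -> list[str]:
    """Run steps in order and stop at the first failure."""
    completed: list[str] = []
    for name, action in steps:
        if not action():
            raise RuntimeError(f"step failed: {name}; completed: {completed}")
        completed.append(name)
    return completed


def dry_run_steps(names: list[str]) -> list[Step]:
    """Placeholder actions that only print the step and report success."""
    def make(name: str) -> Callable[[], bool]:
        def action() -> bool:
            print(f"[dry run] {name}")
            return True
        return action
    return [(name, make(name)) for name in names]
```

run_procedure(dry_run_steps(SWITCHOVER_STEP_NAMES)) prints the seven steps in order. Replacing one placeholder at a time with a real action is one way to "enhance gradually".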

RECAP

▸ RPO ~ 0 (your backups work, every time)

▸ RTO = Time of reaction + Time of promotion

▸ Criticality: manual intervention

▸ Reliable monitoring

▸ Trained people (practice & docs!)

MANUAL FAILOVER VS AUTOMATED FAILOVER

▸ Risk management

▸ Split brain nightmare

▸ Automated is built on manual (test!)

▸ Your choice

▸ Very good solution for business continuity

▸ Uptime > 99.99% in a year

90. TWO POSTGRES SYNC SERVERS + ONE BARMAN SERVER + ONE RECOVERY SERVER

ARCHITECTURE

[Diagram: synchronous replication; arrows labelled barman_restore_wal and barman recover]

ZERO DATA LOSS

SYNCHRONOUS REPLICATION

▸ Primary: Barman

▸ Zero data loss backup

▸ Primary: Standby

▸ Zero data loss cluster (reduce RTO)

▸ Just one configuration line in PostgreSQL

▸ synchronous_standby_names = '1 (ha, barman_receive_wal)'
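A toy model of how the setting above is read: the number is how many standbys must confirm a commit, and the list is in priority order, so the first connected names become synchronous and the rest are potential candidates. The parser only handles this simple "N (a, b)" form and is an illustration, not PostgreSQL's implementation.

```python
import re


def parse_sync_names(setting: str) -> tuple[int, list[str]]:
    """Parse 'N (name, name, ...)' (optionally prefixed with FIRST)."""
    match = re.fullmatch(r"\s*(?:FIRST\s+)?(\d+)\s*\(([^)]*)\)\s*",
                         setting, flags=re.IGNORECASE)
    if not match:
        raise ValueError(f"unsupported setting: {setting!r}")
    names = [n.strip() for n in match.group(2).split(",") if n.strip()]
    return int(match.group(1)), names


def classify(setting: str, connected: set[str]) -> dict[str, str]:
    """Map each connected, listed standby to 'sync' or 'potential'."""
    num_sync, names = parse_sync_names(setting)
    roles: dict[str, str] = {}
    for name in names:                    # list order is priority order
        if name not in connected:
            continue
        roles[name] = "sync" if list(roles.values()).count("sync") < num_sync else "potential"
    return roles
```

With both "ha" and "barman_receive_wal" connected, "ha" is synchronous and Barman is the potential one. If the standby disconnects, Barman takes over the synchronous role, so commits keep a second copy.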

~100. TWO POSTGRES SYNC SERVERS + ONE BARMAN SERVER + ONE RECOVERY SERVER + REPMGR (AUTO-FAILOVER)

ARCHITECTURE

[Diagram: two nodes running repmgr plus a repmgr witness; replication links labelled "Synchronous" and "Potential synchronous"]
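Why a third, witness node helps with automatic failover can be shown with a generic majority rule: a standby promotes itself only if it can still see more than half of the cluster. This is a textbook quorum check written for illustration, not repmgr's actual decision logic; the node names come from the earlier slides.

```python
def has_majority(visible_nodes: set[str], all_nodes: set[str]) -> bool:
    """True if the visible nodes (including self) are a strict majority."""
    return len(visible_nodes & all_nodes) * 2 > len(all_nodes)


CLUSTER = {"stark", "rogers", "witness"}

# The master "stark" is unreachable, but the standby still sees the witness.
can_promote = has_majority({"rogers", "witness"}, CLUSTER)

# The standby is the one cut off from everybody else: it must not promote.
isolated = has_majority({"rogers"}, CLUSTER)
```

In the first case promotion is allowed; in the second it is refused, which is what keeps a network partition from producing two masters (split brain).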

WHAT’S MORE?

PUSH THE BOUNDARIES

▸ Repeatable architectures in multiple data centres

▸ PgBouncer

▸ Virtual IPs

▸ S3 relay via Barman hook scripts

▸ Multiple standby servers and cascading replication

▸ Docker containers

▸ Logical replication backups

CONCLUSIONS

▸ Baby steps and KISS

▸ New? Explore and learn

▸ Practice is the only way to mastery (drills)

▸ Plan regular healthy downtimes

▸ Use switchovers to perform PostgreSQL updates

▸ Smart downtimes increase long-term uptime

ANY QUESTIONS?

▸ PostgreSQL: www.postgresql.org

▸ Barman: www.pgbarman.org #pgbarman

▸ PgBouncer: pgbouncer.github.io

▸ Repmgr: www.repmgr.org

▸ Our blog: blog.2ndquadrant.com

LICENCE

Attribution 4.0 International (CC BY 4.0)

You are free to:

▸ Share — copy and redistribute the material in any medium or format

▸ Adapt — remix, transform, and build upon the material for any purpose, even commercially.

The licensor cannot revoke these freedoms as long as you follow the license terms.
