How not to screw up when building HA cluster
FOSDEM PGDay 2019, Brussels
Alexander Kukushkin
01-02-2019




ABOUT ME

Alexander Kukushkin
Database Engineer @ZalandoTech

The Patroni guy

alexander.kukushkin@zalando.de

Twitter: @cyberdemn


WE BRING FASHION TO PEOPLE IN 17 COUNTRIES

17 markets

7 fulfillment centers

23 million active customers

4.5 billion € net sales 2017

200 million visits per month

15,000 employees in Europe

FACTS & FIGURES

> 300 databases on premise

> 650 clusters in the Cloud (AWS)


AGENDA

What is High Availability?

What HA will not solve?

Disaster recovery

Automatic failover done right

Examples of real incidents

Wrap it up


What is High Availability?


Availability


Causes of Downtime

● Scheduled downtime (often excluded from availability)
○ Hardware/BIOS/Firmware upgrade
○ Software update

● Unscheduled downtime
○ Datacenter failure (natural disasters, fire, power outage)
○ Network splits
○ Hardware failure (CPU, network card, disk controller, disk)
○ Software/Data corruption (bugs in application/OS code)
○ User error (rm -fr $PGDATA, DROP/TRUNCATE table, UPDATE/DELETE without WHERE clause)


Availability                        Downtime per
                                    Year       Month       Week        Day
99% (“Two nines”)                   3.65 d     7.31 h      1.68 h      14.4 m
99.9% (“Three nines”)               8.77 h     43.83 m     10.08 m     1.44 m
99.95% (“Three and a half nines”)   4.38 h     21.92 m     5.04 m      43.2 s
99.99% (“Four nines”)               52.6 m     4.38 m      1.01 m      8.64 s
99.999% (“Five nines”)              5.26 m     26.3 s      6.05 s      864 ms
99.9999% (“Six nines”)              31.56 s    2.63 s      604.8 ms    86.4 ms
99.99999% (“Seven nines”)           3.16 s     262.98 ms   60.48 ms    8.64 ms
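The figures in the table follow from simple arithmetic: allowed downtime is the period length times the unavailable fraction. A minimal sketch, assuming an average year of 365.25 days and a month of one twelfth of that (the function and constant names are illustrative, not from the talk):

```python
# Allowed downtime for a given availability target (illustrative sketch).
SECONDS_PER_DAY = 24 * 3600

PERIOD_SECONDS = {
    "year": 365.25 * SECONDS_PER_DAY,        # average year (assumption)
    "month": 365.25 * SECONDS_PER_DAY / 12,  # one twelfth of an average year
    "week": 7 * SECONDS_PER_DAY,
    "day": SECONDS_PER_DAY,
}


def allowed_downtime(availability_percent: float, period: str) -> float:
    """Return the downtime budget in seconds for the given period."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return PERIOD_SECONDS[period] * unavailable_fraction


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime(target, "year") / 60
    print(f"{target}% -> {minutes:.2f} minutes of downtime per year")
```

Each additional nine divides the budget by ten, which is why the higher targets leave no room for a human to react.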


What is HA anyway?

● No official definition appears to exist!
● Wikipedia:
○ High availability (HA) is a characteristic of a system, which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.


SLA, SLI, and SLO

● A Service-Level Agreement (SLA) is an agreement between a service provider and a client.
○ Type of service to be provided
○ Desired performance level (especially availability, reliability and responsiveness)
○ Monitoring process and service level reporting
○ Steps for reporting issues
○ Response and issue resolution time-frame
● A Service-Level Indicator (SLI) is a measure of the service level provided by a service provider to a customer
○ Availability
○ Latency
○ Throughput
● A Service-Level Objective (SLO) is a key element of an SLA; a goal that the service provider wants to reach
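To make the relation between an SLI and an SLO concrete, here is a small sketch that measures availability as the share of successful requests and compares it with a target. The request counts and function names are hypothetical, and request-based availability is only one of several possible SLI definitions:

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Availability SLI as the percentage of requests served successfully."""
    if total_requests == 0:
        return 100.0  # no traffic: treated as fully available (an assumption)
    return 100.0 * successful_requests / total_requests


def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """True when the measured indicator reaches the objective."""
    return sli_percent >= slo_percent


# Hypothetical measurement window: one million requests, 588 of them failed.
sli = availability_sli(successful_requests=999_412, total_requests=1_000_000)
print(f"SLI = {sli:.4f}%")
print("meets 99.9% SLO: ", meets_slo(sli, 99.9))
print("meets 99.95% SLO:", meets_slo(sli, 99.95))
```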


Causes of Unscheduled Downtime

● Hardware failure
● Network splits
● Datacenter failure
● Software failure/Data corruption
● User error

Automatic failover

Disaster recovery




Disaster recovery

● Involves a set of policies, tools and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster
● Recovery point objective (RPO) and recovery time objective (RTO) are two important measurements in disaster recovery and downtime
○ A recovery point objective (RPO) is defined by business continuity planning. It is the maximum targeted period in which data (transactions) might be lost from an IT service due to a major incident
○ The recovery time objective (RTO) is the targeted duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity


Disaster recovery

[Figure: RPO/RTO timeline, https://en.wikipedia.org/wiki/File:RPO_RTO_example_converted.png, annotated with cost ($) labels: data loss price, service downtime price, data recovery price, High Availability price]


RPO, RTO & PostgreSQL

● Automatic failover won’t help to back up and restore data
○ Enable backups and log archiving
■ archive_timeout - forces a WAL segment switch, so WALs are archived at least this often
■ pg_receivewal
○ Recovery from the backup might take hours
■ Consider having a delayed replica (recovery_min_apply_delay)
● If RTO is higher than 15 minutes, you don’t need automatic failover!
○ Unless you are running hundreds of clusters
● Synchronous replication - to prevent data loss during failover
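A rough way to reason about the RPO that a backup setup can deliver: without WAL archiving, everything written since the last backup may be lost; with archiving, the window shrinks to roughly the last unarchived segment. The sketch below encodes that simplification. It assumes archiving keeps up and never fails, and the function name is illustrative:

```python
from datetime import timedelta
from typing import Optional


def worst_case_rpo(backup_interval: timedelta,
                   archive_timeout: Optional[timedelta] = None) -> timedelta:
    """Rough upper bound on the data-loss window when restoring from backups.

    archive_timeout=None means WAL archiving is not configured at all.
    Assumes every archive attempt succeeds promptly (a simplification).
    """
    if archive_timeout is None:
        return backup_interval
    return min(backup_interval, archive_timeout)


daily_only = worst_case_rpo(timedelta(days=1))
with_archiving = worst_case_rpo(timedelta(days=1), timedelta(minutes=1))
print("daily backups only:      ", daily_only)
print("daily backups + archiving:", with_archiving)
```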


Sub-second Automatic Failover

● In general it is possible, but VERY expensive
● This is the price of the complexity of such a system
○ Complexity often decreases availability
○ The more elements a system has, the more reliable each element has to be
● Trade-off between the speed of failure detection and false positives
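The point about the number of elements can be checked with a little arithmetic. If a system works only when all of its components work, and failures are independent (both are simplifying assumptions), its availability is the product of the component availabilities. The function names below are illustrative:

```python
import math
from typing import Iterable


def series_availability(component_availabilities: Iterable[float]) -> float:
    """Availability of a system that needs every component to be up."""
    return math.prod(component_availabilities)


def required_component_availability(target: float, n_components: int) -> float:
    """Availability each of n identical components needs to reach the target."""
    return target ** (1.0 / n_components)


# Five components at "three nines" each give less than three nines overall.
print(f"system of 5 x 99.9%: {series_availability([0.999] * 5):.5f}")
# To keep the whole system at three nines, each part must be better than that.
print(f"needed per component: {required_component_availability(0.999, 5):.5f}")
```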


High Availability and Disaster Recovery Need Each Other!




Multimaster?

● PostgreSQL XC/XL
○ Data nodes + Coordinators + 2PC + GTM (SPOF)
● BDR
○ logical replication + conflict resolution
■ eventual consistency
● Postgres Pro Enterprise (proprietary)
○ logical replication + E3PC


A good HA system

● Quorum
○ Helps to deal with network splits
○ Requires at least 3 nodes
● Fencing
○ Make sure the old primary is inaccessible. STONITH!
● Watchdog
○ Primary should not run if the supervising HA process has failed
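Why a quorum needs at least three nodes follows from majority arithmetic: with two voters the majority is two, so losing either one (or the link between them) leaves nobody able to decide. A minimal sketch with illustrative function names:

```python
def quorum_size(n_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return n_nodes // 2 + 1


def tolerated_failures(n_nodes: int) -> int:
    """How many nodes may be lost while a majority can still be formed."""
    return n_nodes - quorum_size(n_nodes)


def has_quorum(reachable_nodes: int, n_nodes: int) -> bool:
    """True if this side of a network split still sees a majority."""
    return reachable_nodes >= quorum_size(n_nodes)


for n in (2, 3, 5):
    print(f"{n} nodes: quorum={quorum_size(n)}, tolerated failures={tolerated_failures(n)}")
```

In a split of a three-node cluster, the side with two nodes keeps the quorum and the isolated node does not, so at most one side may run a primary.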


No Quorum and no Fencing

[Diagram: Primary and Standby connected by a WAL stream and a health check]


No Quorum and no Fencing

https://github.com/MasahikoSawada/pg_keeper

[Diagram: two nodes, both labelled Primary]


Witness node is making decisions

[Diagram: Primary and Standby connected by a WAL stream; a witness node with two health-check links]


Witness node dies

[Diagram: Primary, Standby and witness; the only labelled link is the WAL stream between Primary and Standby]


Witness and no Fencing

[Diagram: Primary and Standby connected by a WAL stream; a witness node with two health-check links]


Witness and no Fencing

[Diagram: a witness node with a single health-check link; both database nodes labelled Primary]


Automatic failover done right

[Diagram: Primary, Standby and a Quorum; labels "I am the leader" and "Leader changed?" on the links to the Quorum]
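One common way to implement the "I am the leader" / "Leader changed?" exchange is a leader key with a time-to-live kept in a quorum-based store: the primary keeps renewing it, and a standby may take over only after it expires. The following is an in-memory simulation of that idea, not the code of any particular HA tool; the class, function and node names are hypothetical, and time is passed in explicitly so the behaviour is deterministic:

```python
from typing import Optional


class LeaderLease:
    """In-memory stand-in for a leader key with a TTL in a quorum-based store."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.holder: Optional[str] = None
        self.expires_at = 0.0

    def try_acquire_or_renew(self, node: str, now: float) -> bool:
        """Take the lease if it is free or expired, or renew it if `node` holds it."""
        free = self.holder is None or now >= self.expires_at
        if free or self.holder == node:
            self.holder = node
            self.expires_at = now + self.ttl
            return True
        return False


def role_for(node: str, lease: LeaderLease, now: float) -> str:
    """A node may run as primary only while it holds a valid lease."""
    return "primary" if lease.try_acquire_or_renew(node, now) else "standby"


lease = LeaderLease(ttl=30.0)
print("t=0  node-a:", role_for("node-a", lease, now=0.0))   # lease is free
print("t=10 node-b:", role_for("node-b", lease, now=10.0))  # held by node-a
# node-a stops renewing here (crash or network split) ...
print("t=45 node-b:", role_for("node-b", lease, now=45.0))  # lease has expired
```

The sketch covers only the promotion side. A real setup also has to make sure the old primary stops accepting writes once it can no longer renew the lease, which is where fencing and the watchdog come in.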




GoCardless Incident

Learn your HA system

[Diagram: three Pacemaker-managed nodes labelled Async, Primary and Sync, plus a VIP]


GoCardless Incident

● Failed RAID controller on the primary
● Primary was manually terminated (hoping for automatic failover)
● Automatic failover didn’t work due to a coincident crash of postgres on the sync replica
● Spent 1h30m trying to trigger a failover using Pacemaker!
● Manually promoted the sync replica
● Total outage 1h50m


GoCardless Lessons

● HA systems are usually quite complex
● Running them is similar to flying a modern airplane
○ Mostly autopilot
○ But sometimes it fails
○ You need to know how to “fly” manually
● Learn your HA system
○ Try to break it and fix it afterwards


GitHub Incident

Resource Planning

[Diagram: two datacenters with 60 ms latency between them.
US East: primary and two replicas; applications: github.com, pages, jobs, notifications.
US West: two replicas; applications: pages, notifications.]


GitHub Incident

● Network split due to network maintenance
● Automatic failover from East to West Coast datacenter
● Applications from East are slow due to latency between East and West
● Switchback to East wasn’t possible due to a few seconds of writes which were not replicated
● Rebuild of all replicas in the East took nearly 16 hours
● Total time of incident 24h11m


GitHub Lessons

● Avoid doing cross-region failover if you don’t have 100% resource symmetry
● MySQL can’t do pg_rewind :)


GitLab Incident

Broken Disaster Recovery procedures

● Primary-Replica setup (no automatic failover)
● Increased database load on the primary resulted in replica falling behind
○ WAL segment needed for replica was recycled by primary
● A few attempts to rebuild replica with pg_basebackup (--checkpoint=spread)
● rm -fr $PGDATA on the primary! (human error)
● Three different backups were done only once a day (no WAL archiving)
○ pg_dump was always failing due to major version mismatch!
○ Azure disk snapshots were disabled for database servers!
○ LVM snapshots were working and periodically tested by restoring them to staging
■ Incident happened nearly 24 hours after the last snapshot was taken!
■ “Luckily”, someone manually created the snapshot 6 hours before the incident
● Recovery from LVM snapshot took longer than 18 hours
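The incident started with a replica silently falling behind until the WAL it needed was gone. One way to notice this earlier is to track replication lag in bytes. The sketch below converts PostgreSQL LSNs (two hexadecimal numbers separated by a slash) into byte positions and subtracts them. The LSN values are made up, and in practice they could come from functions such as pg_current_wal_lsn() on the primary and pg_last_wal_replay_lsn() on the replica:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' into a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """How many bytes of WAL the replica still has to replay."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_lsn)


# Hypothetical positions sampled from a primary and its replica.
lag = replication_lag_bytes("16/B374D848", "16/B2000000")
print(f"replica is {lag / (1024 * 1024):.1f} MiB behind")
```

Alerting on such a number well before it approaches the amount of WAL the primary retains leaves time to react before a rebuild becomes necessary.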


GitLab Lessons

● RPO and RTO were not set or not adequate for their business needs
○ Daily snapshots only and no WAL archiving (RPO = 24 hours)
○ Streaming replication can’t be used for Disaster Recovery
■ Unless it is a “delayed” replica
● Runbooks can’t replace fire-drills
● Backups must be monitored and tested
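Monitoring backups can start with something as simple as comparing the age of the last successful backup with the RPO. A minimal sketch with a hypothetical function name and made-up timestamps; finding out when the last backup actually succeeded is the harder part and is not shown:

```python
from datetime import datetime, timedelta


def backup_violates_rpo(last_successful_backup: datetime,
                        now: datetime,
                        rpo: timedelta) -> bool:
    """True if restoring right now would lose more data than the RPO allows."""
    return now - last_successful_backup > rpo


now = datetime(2019, 2, 1, 12, 0)
last_backup = datetime(2019, 1, 31, 13, 0)  # 23 hours earlier
print("violates 24h RPO:", backup_violates_rpo(last_backup, now, timedelta(hours=24)))
print("violates 1h RPO: ", backup_violates_rpo(last_backup, now, timedelta(hours=1)))
```

A check like this says nothing about whether the backup can be restored, which is why the slide asks for testing as well as monitoring.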




Monitoring

● HA doesn’t solve all problems with postgres; it won’t cover:
○ Hardware errors, CPU load, memory, etc.
○ Disk space for $PGDATA, tablespaces and pg_wal
○ Autovacuums, checkpoints
○ Table and index bloat
○ Query performance
○ etc.
● Depending on RTO you may not need HA at all, but monitoring is a must
○ Don’t forget to monitor your HA system!
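As an example of a check that no HA system performs for you, here is a small disk-space probe built on the standard library. The 80% threshold and the function names are arbitrary choices for illustration, and the path would be wherever $PGDATA, tablespaces or pg_wal live on a given host:

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # pseudo-filesystems may report no capacity
    return 100.0 * usage.used / usage.total


def needs_alert(path: str, threshold_percent: float = 80.0) -> bool:
    """True when usage has reached the alerting threshold."""
    return disk_usage_percent(path) >= threshold_percent


# "." stands in for a real data directory in this sketch.
print(f"usage: {disk_usage_percent('.'):.1f}%, alert: {needs_alert('.')}")
```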


Everything must be monitored

[Diagram: three building blocks labelled Monitoring, High Availability and Disaster Recovery]


Configuration tuning

● OS configuration tuning
○ Huge pages, shared memory, semaphores, overcommit, etc.
● PostgreSQL configuration tuning
○ shared_buffers, max_wal_size, checkpoint_completion_target, random_page_cost, etc.
● HA won’t do it for you!




Wrap it up

● Always start with Disaster Recovery planning
○ Define RPO and RTO
■ Depending on RTO you may not need HA
○ Build the Availability you need, not the Availability you want
● Test everything
○ High Availability system
○ Backups!
● Do regular fire-drills


Thank you! Questions?

Feedback: https://2019.fosdempgday.org/f