PostgreSQL + Kafka: The Delight of Change Data Capture
TRANSCRIPT
Jeff Klukas - Data Engineer at Simple
Overview
Commit logs: what are they?
Write-ahead logging (WAL)
Commit logs as a data store
Demo: change data capture
Use cases
https://www.confluent.io/blog/hands-free-kafka-replication-a-lesson-in-operational-simplicity/
Commit Logs

Ordered. Immutable. Durable.

In practice, old logs can be deleted or archived.
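These properties can be mimicked with an append-only table in SQL. A minimal sketch (illustrative only; the table and column names are not from the talk):

```sql
-- An append-only table acting as a commit log.
CREATE TABLE commit_log (
    log_offset BIGSERIAL PRIMARY KEY,  -- ordered: monotonically increasing
    payload    JSONB NOT NULL          -- immutable: rows are only inserted, never updated
);

-- Producers only ever append:
INSERT INTO commit_log (payload) VALUES ('{"event": "account_created"}');
```

Durability comes from the database flushing writes to disk; deleting or archiving old rows corresponds to log retention.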
Write-Ahead Logging (WAL)
“WAL's central concept is that changes to data files (where tables and indexes reside) must be written only after those changes have been logged, that is, after log records describing the changes have been flushed to permanent storage.”
– https://www.postgresql.org/docs/current/static/wal-intro.html
“Logical decoding is the process of extracting all persistent changes to a database's tables into a coherent, easy to understand format which can be interpreted without detailed knowledge of the database's internal state.”
– https://www.postgresql.org/docs/9.4/static/logicaldecoding-explanation.html
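Logical decoding can be explored directly from psql. A hedged sketch, assuming `wal_level = logical` is configured and using the built-in `test_decoding` output plugin (the slot name is arbitrary):

```sql
-- Create a logical replication slot using the test_decoding plugin.
SELECT * FROM pg_create_logical_replication_slot('demo_slot', 'test_decoding');

-- Make a change...
INSERT INTO transactions VALUES (56789, 20.00);

-- ...then inspect the decoded change stream without consuming it.
SELECT * FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL);

-- Clean up: an abandoned slot retains WAL indefinitely.
SELECT pg_drop_replication_slot('demo_slot');
```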
Topic Partitions

Topics

Compacted Topics
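Compaction keeps only the most recent record for each key. The retained set is analogous to this query (an illustrative analogy; a `commit_log` table with `key`, `value`, and `log_offset` columns is hypothetical):

```sql
-- Latest value per key, as log compaction would retain it (PostgreSQL syntax).
SELECT DISTINCT ON (key) key, value
FROM commit_log
ORDER BY key, log_offset DESC;
```

A key whose latest value is null (a tombstone) is eventually removed from the topic entirely.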
https://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/
INSERT INTO transactions VALUES (56789, 20.00);

Bottled Water - Message Key
{ "transaction_id": { "int": 56789 } }

Bottled Water - Message Value
{ "transaction_id": {"int": 56789}, "amount": {"double": 20.00} }
UPDATE transactions SET amount = 25.00 WHERE transaction_id = 56789;

Bottled Water - Message Key
{ "transaction_id": { "int": 56789 } }

Bottled Water - Message Value
{ "transaction_id": {"int": 56789}, "amount": {"double": 25.00} }
DELETE FROM transactions WHERE transaction_id = 56789;

Bottled Water - Message Key
{ "transaction_id": { "int": 56789 } }

Bottled Water - Message Value
null

A null value is a tombstone: once the topic is compacted, the record for this key disappears entirely.
Use Cases

[Diagram, built up across the following slides:]

tx-service writes to tx-postgres; tx-pgkafka streams its changes to Kafka topic: tx-pgkafka.
demux-service reads Kafka topic: tx-pgkafka and splits it into Kafka topic: customers-table and Kafka topic: transactions-table.
activity-service writes to activity-postgres; activity-pgkafka streams its changes to Kafka topic: activity-pgkafka.
analytics-service consumes these topics and loads Amazon Redshift (Data Warehouse) and Amazon S3 (Data Lake).

The slides revisit this same diagram three times, labeled Change Data Capture, Messaging, and Analytics.
Recap
Commit logs: what are they?
Write-ahead logging (WAL)
Commit logs as a data store
Demo: change data capture
Use cases
Also See…
• Blog post on Simple’s CDC pipeline: https://www.simple.com/engineering
• Bottled Water: https://github.com/confluentinc/bottledwater-pg
• Debezium (CDC to Kafka from Postgres, MySQL, or MongoDB): http://debezium.io/
• https://wecode.wepay.com/posts/streaming-databases-in-realtime-with-mysql-debezium-kafka
• https://www.confluent.io/kafka-summit-sf17/
• Martin Kleppmann, Making Sense of Stream Processing eBook
Thank You
Extras
The Dual Write Problem
If an application writes to PostgreSQL and to Kafka as two separate writes, one can succeed while the other fails, leaving the two stores inconsistent; change data capture avoids this by making the database's own log the single source of truth.
https://www.confluent.io/blog/bottled-water-real-time-integration-of-postgresql-and-kafka/
Replicating to Redshift (Amazon Redshift)

Redshift Architecture
Table Schema
CREATE TABLE pgkafka_txservice_transactions (
    pg_lsn              NUMERIC(20,0) ENCODE raw,
    pg_txn_id           BIGINT        ENCODE lzo,
    pg_operation        CHAR(6)       ENCODE bytedict,
    pg_txn_timestamp    TIMESTAMP     ENCODE lzo,
    ingestion_timestamp TIMESTAMP     ENCODE lzo,
    transaction_id      INT           ENCODE lzo,
    amount              NUMERIC(18,2) ENCODE lzo
)
DISTKEY (transaction_id)
SORTKEY (transaction_id, pg_lsn, pg_operation);
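Change records from the Kafka topic can be staged to S3 and loaded into this table with COPY. A hedged sketch; the bucket path and IAM role are hypothetical, and the records are assumed to be staged as JSON:

```sql
COPY pgkafka_txservice_transactions
FROM 's3://example-bucket/tx-pgkafka/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
FORMAT AS JSON 'auto'
TIMEFORMAT 'auto';
```

Because loads may retry, the same change can land more than once, which is why the deduplication step below is needed.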
Deduplication

CREATE TABLE deduped (LIKE pgkafka_txservice_transactions);

INSERT INTO deduped
SELECT pg_lsn, pg_txn_id, pg_operation, pg_txn_timestamp,
       ingestion_timestamp, transaction_id, amount
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY pg_lsn
                              ORDER BY ingestion_timestamp DESC) AS row_number
    FROM pgkafka_txservice_transactions
) t
WHERE row_number = 1;

DROP TABLE pgkafka_txservice_transactions;
ALTER TABLE deduped RENAME TO pgkafka_txservice_transactions;
View of Current State

CREATE VIEW current_txservice_transactions AS
SELECT transaction_id, amount
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY transaction_id
                              ORDER BY pg_lsn, pg_operation) AS n,
           COUNT(*) OVER (PARTITION BY transaction_id) AS c
    FROM pgkafka_txservice_transactions
) t
WHERE n = c AND pg_operation <> 'delete';
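With the view in place, current state can be read with an ordinary query (hypothetical usage):

```sql
SELECT transaction_id, amount
FROM current_txservice_transactions
WHERE transaction_id = 56789;
```

After the earlier DELETE has been replicated, this transaction no longer appears in the results.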