resilient predictive data pipelines v4 - qcon london 2020 · resilient predictive data pipelines...

60
Resilient Predictive Data Pipelines Sid Anand ( @r39132 ) QCon London 2016 1

Upload: others

Post on 24-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Resilient Predictive Data PipelinesSid Anand (@r39132) QCon London 2016

1

Page 2: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

About Me

2

Work [ed | s] @

Maintainer on

Report to

Co-Chair for

airbnb/airflow

Page 3: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

MotivationWhy is a Data Pipeline talk in this High Availability Track?

3

Page 4: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Different Types of Data Pipelines

4

ETL

• used for : loading data related to business health into a Data Warehouse • user-engagement stats (e.g.

social networking) • product success stats (e.g.

e-commerce)

• audience : Business, BizOps

• downtime? : 1-3 days

Predictive

• used for : •building recommendation

products (e.g. social networking, shopping)

•updating fraud prevention endpoints (e.g. security, payments, e-commerce)

• audience : Customers

• downtime? : < 1 hour

VS

Page 5: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

5

ETL

• used for : loading data into a Data Warehouse —> Reports • user-engagement stats (e.g.

social networking) • product success stats (e.g.

e-commerce)

• audience : Business

• downtime? : 1-3 days

Predictive

• used for : •building recommendation

products (e.g. social networking, shopping)

•updating fraud prevention endpoints (e.g. security, payments, e-commerce)

• audience : Customers

• downtime? : < 1 hour

VS

Different Types of Data Pipelines

Page 6: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Why Do We Care About Resilience?

6

d4d5d6d7d8

DB

Search Engines Recommenders

d1 d2 d3

Page 7: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

DB

Why Do We Care About Resilience?

7

d4

d6d7d8

Search Engines Recommenders

d1 d2 d3d5

Page 8: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Why Do We Care About Resilience?

8

d7d8

DB

Search Engines Recommenders

d1 d2 d3 d4

d5

d6

Page 9: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Why Do We Care About Resilience?

9

d7d8

DB

Search Engines Recommenders

d1 d2 d3 d4

d5

d6

Page 10: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Any Take-aways?

10

d7d8

DB

Search Engines Recommenders

d1 d2 d3 d4

d5

d6• Bugs happen!

• Bugs in Predictive Data Pipelines have a large blast radius • The bugs can affect customers and a company’s

profits & reputation!

Page 11: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Design GoalsDesirable Qualities of a Resilient Data Pipeline

11

Page 12: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

12

• Scalable

• Available

• Instrumented, Monitored, & Alert-enabled

• Quickly Recoverable

Desirable Qualities of a Resilient Data Pipeline

Page 13: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

13

• Scalable • Build your pipelines using [infinitely] scalable components

• The scalability of your system is determined by its least-scalable component

• Available

• Instrumented, Monitored, & Alert-enabled

• Quickly Recoverable

Desirable Qualities of a Resilient Data Pipeline

Page 14: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Desirable Qualities of a Resilient Data Pipeline

14

• Scalable • Build your pipelines using [infinitely] scalable components

• The scalability of your system is determined by its least-scalable component

• Available • Ditto

• Instrumented, Monitored, & Alert-enabled

• Quickly Recoverable

Page 15: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Instrumented

15

Instrumentation must reveal SLA metrics at each stage of the pipeline!

What SLA metrics do we care about? Correctness & Timeliness

• Correctness • No Data Loss • No Data Corruption • No Data Duplication • A Defined Acceptable Staleness of Intermediate Data

• Timeliness • A late result == a useless result • Delayed processing of now()’s data may delay the processing of future

data

Page 16: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Instrumented, Monitored, & Alert-enabled

16

• Instrument • Instrument Correctness & Timeliness SLA metrics at each stage of the

pipeline

• Monitor • Continuously monitor that SLA metrics fall within acceptable bounds (i.e.

pre-defined SLAs)

• Alert • Alert when we miss SLAs

Page 17: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Desirable Qualities of a Resilient Data Pipeline

17

• Scalable • Build your pipelines using [infinitely] scalable components

• The scalability of your system is determined by its least-scalable component

• Available • Ditto

• Instrumented, Monitored, & Alert-enabled

• Quickly Recoverable

Page 18: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Quickly Recoverable

18

• Bugs happen!

• Bugs in Predictive Data Pipelines have a large blast radius

• Optimize for MTTR

Page 19: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

ImplementationUsing AWS to meet Design Goals

19

Page 20: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

SQSSimple Queue Service

20

Page 21: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

SQS - Overview

21

AWS’s low-latency, highly scalable, highly available message queue

Infinitely Scalable Queue (though not FIFO)

Low End-to-end latency (generally sub-second)

Pull-based

How it Works!

Producers publish messages, which can be batched, to an SQS queue

Consumers

consume messages, which can be batched, from the queue

commit message contents to a data store

ACK the messages as a batch

Page 22: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

visibility timer

SQS - Typical Operation Flow

22

Producer

Producer

Producer

m1m2m3m4m5

Consumer

Consumer

Consumer

DB

m1SQS

Step 1: A consumer reads a message from SQS. This starts a visibility timer!

Page 23: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

visibility timer

SQS - Typical Operation Flow

23

Producer

Producer

Producer

m1m2m3m4m5

Consumer

Consumer

Consumer

DB

m1SQS

Step 2: Consumer persists message contents to DB

Page 24: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

visibility timer

SQS - Typical Operation Flow

24

Producer

Producer

Producer

m1m2m3m4m5

Consumer

Consumer

Consumer

DB

m1SQS

Step 3: Consumer ACKs message in SQS

Page 25: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

visibility timer

SQS - Time Out Example

25

Producer

Producer

Producer

m1m2m3m4m5

Consumer

Consumer

Consumer

DB

m1SQS

Step 1: A consumer reads a message from SQS

Page 26: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

visibility timer

SQS - Time Out Example

26

Producer

Producer

Producer

m1m2m3m4m5

Consumer

Consumer

Consumer

DB

m1SQS

Step 2: Consumer attempts persists message contents to DB

Page 27: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

visibility time out

SQS - Time Out Example

27

Producer

Producer

Producer

m1m2m3m4m5

Consumer

Consumer

Consumer

DB

m1SQS

Step 3: A Visibility Timeout occurs & the message becomes visible again.

Page 28: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

visibility timer

SQS - Time Out Example

28

Producer

Producer

Producer

m1m2m3m4m5

Consumer

Consumer

Consumer

DB

m1

m1

SQS

Step 4: Another consumer reads and persists the same message

Page 29: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

visibility timer

SQS - Time Out Example

29

Producer

Producer

Producer

m1m2m3m4m5

Consumer

Consumer

Consumer

DB

m1

SQS

Step 5: Consumer ACKs message in SQS

Page 30: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

SQS - Dead Letter Queue

30

SQS - DLQ

visibility timer

Producer

Producer

Producer

m2m3m4m5

Consumer

Consumer

Consumer

DB

m1

SQS

Redrive rule : 2x

m1

Page 31: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

SNSSimple Notification Service

31

Page 32: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

SNS - Overview

32

Highly Scalable, Highly Available, Push-based Topic Service

Whereas SQS ensures each message is seen by at least 1 consumer

SNS ensures that each message is seen by every consumer

Reliable Multi-Push

Whereas SQS is pull-based, SNS is push-based

There is no message retention & there is a finite retry count

No Reliable Message Delivery

Can we work around this limitation?

Page 33: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

SNS + SQS Design Pattern

33

m1m2

m1m2

m1m2

SQS Q1

SQS Q2

SNS T1

Reliable Multi Push

Reliable Message Delivery

Page 34: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

SNS + SQS

34

Producer

Producer

Producer

m1m2Consumer

Consumer

Consumer

DB

m1

m1m2

m1m2

SQS Q1

SQS Q2

SNS T1

Consumer

Consumer

Consumer

ES

m1

Page 35: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

S3 + SNS + SQS Design Pattern

35

m1m2

m1m2

m1m2

SQS Q1

SQS Q2

SNS T1

Reliable Multi Push

Transactions

S3d1

d2

Page 36: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Batch Pipeline ArchitecturePutting the Pieces Together

36

Page 37: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Architecture

37

Page 38: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Architectural Elements

38

A Schema-aware Data format for all data (Avro)

The entire pipeline is built from Highly-Available/Highly-Scalable components

S3, SNS, SQS, ASG, EMR Spark (exception DB)

The pipeline is never blocked because we use a DLQ for messages we cannot process

We use queue-based auto-scaling to get high on-demand ingest rates

We manage everything with Airflow

Every stage in the pipeline is idempotent

Every stage in the pipeline is instrumented

Page 39: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

ASGAuto Scaling Group

39

Page 40: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

ASG - Overview

40

What is it?

A means to automatically scale out/in clusters to handle variable load/traffic

A means to keep a cluster/service always up

Fulfills AWS’s pay-per-use promise!

When to use it?

Feed-processing, web traffic load balancing, zone outage, etc…

Page 41: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

ASG - Data Pipeline

41

importer

importer

importer

importer

Importer ASG

scale out / inSQS

DB

Page 42: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

ASG : CPU-based

42

Sent

CPU

ACKd/Recvd

CPU-based auto-scaling is good at scaling in/out to keep the average CPU constant

Page 43: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

ASG : CPU-based

43

Sent

CPU

Recv

Premature Scale-in

Premature Scale-in: The CPU drops to noise-levels before all messages are consumed. This causes scale in to occur while the last few messages are still being committed resulting in a long time-to-drain for the queue!

Page 44: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

ASG - Queue-based

44

Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0)

Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d)

This causes the ASG to grow

This causes the ASG to shrink

Page 45: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Architecture

45

Page 46: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Reliable Hourly Job SchedulingWorkflow Automation & Scheduling

46

Page 47: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

47

Historical Context

Our first cut at the pipeline used cron to schedule hourly runs of Spark

Problem

We only knew if Spark succeeded. What if a downstream task failed?

We needed something smarter than cron that

Reliably managed a graph of tasks (DAG - Directed Acyclic Graph)

Orchestrated hourly runs of that DAG

Retried failed tasks

Tracked the performance of each run and its tasks

Reported on failure/success of runs

Our Needs

Page 48: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

AirflowWorkflow Automation & Scheduling

48

Page 49: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Airflow - DAG Dashboard

49

Airflow: It’s easy to manage multiple DAGs

Page 50: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Airflow - Authoring DAGs

50

Airflow: Visualizing a DAG

Page 51: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

51

Airflow: Author DAGs in Python! No need to bundle many config files!

Airflow - Authoring DAGs

Page 52: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Airflow - Performance Insights

52

Airflow: Gantt chart view reveals the slowest tasks for a run!

Page 53: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

53

Airflow: …And we can easily see performance trends over time

Airflow - Performance Insights

Page 54: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Airflow - Alerting

54

Airflow: …And easy to integrate with Ops tools!

Page 55: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Airflow - Monitoring

55

Page 56: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

56

Airflow - Join the Community

With >30 Companies, >100 Contributors , and >2500 Commits, Airflow is growing rapidly!

We are looking for more contributors to help support the community!

Disclaimer : I’m a maintainer on the project

Page 57: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Design Goal ScorecardAre We Meeting Our Design Goals?

57

Page 58: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

58

• Scalable

• Available

• Instrumented, Monitored, & Alert-enabled

• Quickly Recoverable

Desirable Qualities of a Resilient Data Pipeline

Page 59: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

59

• Scalable • Build using scalable components from AWS

• SQS, SNS, S3, ASG, EMR Spark • Exception = DB (WIP)

• Available • Build using available components from AWS • Airflow for reliable job scheduling

• Instrumented, Monitored, & Alert-enabled • Airflow

• Quickly Recoverable • Airflow, DLQs, ASGs, Spark & DB

Desirable Qualities of a Resilient Data Pipeline

Page 60: Resilient Predictive Data Pipelines v4 - QCon London 2020 · Resilient Predictive Data Pipelines Sid Anand (@r39132) QCon London 2016 1. About Me 2 Work [ed | s] @ Maintainer on Report

Questions? (@r39132)

60