airflow @ agari

58
Apache Airflow @ Agari Sid Anand ( @r39132 ) Apache Airflow Meet-up June 2016 1

Upload: sid-anand

Post on 16-Apr-2017

1.043 views

Category:

Software


0 download

TRANSCRIPT

Page 1: Airflow @ Agari

Apache Airflow @ Agari

Sid Anand (@r39132) Apache Airflow Meet-up June 2016

1

Page 2: Airflow @ Agari

Agari

2

What We Do!

Page 3: Airflow @ Agari

Agari : What We Do

3

Page 4: Airflow @ Agari

4

Agari : What We Do

Page 5: Airflow @ Agari

5

Agari : What We Do

Page 6: Airflow @ Agari

6

Agari : What We Do

Page 7: Airflow @ Agari

7

Agari : What We Do

Page 8: Airflow @ Agari

8

Enterprise Customers

email metadata

apply trust

models

email md + trust score

Agari’s Current Product

Agari : What We Do

Page 9: Airflow @ Agari

9

email metadata

apply trust

models

email md+ trust score

Agari’s Future ProductEnterprise Customers

Agari : What We Do

Page 10: Airflow @ Agari

Apache Airflow @ AgariHow Do We Use It?

10

Page 11: Airflow @ Agari

2 Classes of Orchestration

11

apply trust models

(message scoring)

build trust models

cron++ (general

job scheduler)

New Product (Enterprise Protect)

Operational Automation

Page 12: Airflow @ Agari

2 Classes of Orchestration

12

apply trust models

(message scoring)

build trust models

cron++ (general

job scheduler)

New Product (Enterprise Protect)

Operational Automation

This Talk

Page 13: Airflow @ Agari

Use-Case : Message ScoringBatch Pipeline Architecture

13

Page 14: Airflow @ Agari

Use-Case : Message Scoring

14

enterprise Aenterprise Benterprise C

S3

S3 uploads every 15 minutes

Page 15: Airflow @ Agari

Use-Case : Message Scoring

15

enterprise Aenterprise Benterprise C

S3

Airflow kicks of a Spark message scoring job

every hour

Page 16: Airflow @ Agari

Use-Case : Message Scoring

16

enterprise Aenterprise Benterprise C

S3

Spark job writes scored messages and stats to

another S3 bucket

S3

Page 17: Airflow @ Agari

Use-Case : Message Scoring

17

enterprise Aenterprise Benterprise C

S3

This triggers SNS/SQS messages events

S3

SNS

SQS

Page 18: Airflow @ Agari

Use-Case : Message Scoring

18

enterprise Aenterprise Benterprise C

S3

An Autoscale Group (ASG) of Importers spins up when it detects SQS

messages

S3

SNS

SQS

Importers

ASG

Page 19: Airflow @ Agari

19

enterprise Aenterprise Benterprise C

S3

The importers rapidly ingest scored messages and aggregate statistics into

the DB

S3

SNS

SQS

Importers

ASGDB

Use-Case : Message Scoring

Page 20: Airflow @ Agari

20

enterprise Aenterprise Benterprise C

S3

Users can review untrusted emails in

the web app

S3

SNS

SQS

Importers

ASGDB

Use-Case : Message Scoring

Page 21: Airflow @ Agari

21

enterprise Aenterprise Benterprise C

S3 S3

SNS

SQS

Importers

ASGDB

Airflow manages the entire process

Use-Case : Message Scoring

Page 22: Airflow @ Agari

Use-Case : Message ScoringAirflow DAG

22

Page 23: Airflow @ Agari

23

Airflow DAG

Page 24: Airflow @ Agari

24

5 minute wait for S3 eventual consistency

Airflow DAG

Page 25: Airflow @ Agari

25

1 hour a day, we also build new models

Airflow DAG

Page 26: Airflow @ Agari

26

build models

Airflow DAG

Page 27: Airflow @ Agari

27

dummy needed for branch operator

Airflow DAG

Page 28: Airflow @ Agari

28

• trigger_rule: one_success

• ExternalTaskSensor : on the previous hour of the same DAG

Airflow DAG

Page 29: Airflow @ Agari

29

Prep Spark Run

Airflow DAG

Page 30: Airflow @ Agari

30

• Run Spark • Verify a record written

to the DB • Wait for the SQS

queue to empty

Airflow DAG

Page 31: Airflow @ Agari

31

• Compute discrepancies

• Send email report

• Update monitoring graphs

• Raise SLA (correctness) alerts

Airflow DAG

Page 32: Airflow @ Agari

SLAs & InsightsAirflow

32

Page 33: Airflow @ Agari

33

Desirable Qualities of a Resilient Data Pipeline

OperabilityCorrectness

Timeliness Cost

• Data Integrity (no loss, etc…) • Expected data distributions

• All output within time-bound SLAs (e.g. 1 hour)

• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs

• Quick Recoverability

• Pay-as-you-go

Page 34: Airflow @ Agari

34

Desirable Qualities of a Resilient Data Pipeline

OperabilityCorrectness

Timeliness Cost

• Data Integrity (no loss, etc…) • Expected data distributions

• All output within time-bound SLAs (e.g. 1 hour)

• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs

• Quick Recoverability

• Pay-as-you-go

SLA

SLA

Page 35: Airflow @ Agari

35

Correctness : Email Reporting

For each org, we check for duplicate or missing data as a count & percentage

orgs

Page 36: Airflow @ Agari

36

Correctness : Email Reporting

These are the 3 stages of the pipeline. We can detect where a discrepancy is coming from - often related to a code push!

orgs

Page 37: Airflow @ Agari

37

Correctness : Monitoring

Page 38: Airflow @ Agari

38

Airflow: …And easy to integrate with Ops tools!

Correctness & Timeliness : Alerting

Page 39: Airflow @ Agari

39

Airflow: …And easy to integrate with Ops tools!

Correctness & Timeliness : Alerting

Timeliness SLA miss

Page 40: Airflow @ Agari

40

Airflow: …And easy to integrate with Ops tools!

Correctness & Timeliness : Alerting

Timeliness SLA miss

dag = DAG(DAG_NAME, schedule_interval='@hourly', default_args=default_args, sla_miss_callback=sla_alert_func)

Page 41: Airflow @ Agari

41

Airflow: …And easy to integrate with Ops tools!

Correctness & Timeliness : Alerting

Timeliness SLA miss

Correctness SLA miss

Page 42: Airflow @ Agari

42

Airflow: …And easy to integrate with Ops tools!

Correctness & Timeliness : Alerting

Timeliness & Correctness SLA misses sent to PagerDuty/VictorOps

Page 43: Airflow @ Agari

Operations

43

Managing Airflow

Page 44: Airflow @ Agari

44

• Agari runs in the AWS US-West-2 Region & leverages multiple Availability Zones (AZs) for Fault Tolerance

•We have 3 environments (Dev, Stage, Prod), each with its own AWS account

Agari in AWS

Dev • A Developer’s playground • developers prototype new services here

Page 45: Airflow @ Agari

45

• Agari runs in the AWS US-West-2 Region & leverages multiple Availability Zones (AZs) for Fault Tolerance

•We have 3 environments (Dev, Stage, Prod), each with its own AWS account

Agari in AWS

Dev • A Developer’s playground • developers prototype new services here

Stage• A Pre-Production Env

• All new code & system upgrades are baked here prior to Prod deployment

• It gets a copy of prod data & stage data : helps us test for scale!

Page 46: Airflow @ Agari

46

• Agari runs in the AWS US-West-2 Region & leverages multiple Availability Zones (AZs) for Fault Tolerance

•We have 3 environments (Dev, Stage, Prod), each with its own AWS account

Agari in AWS

Dev • A Developer’s playground • developers prototype new services here

Stage

Prod • Prod Env

• A Pre-Production Env • All new code & system upgrades are baked here prior to Prod

deployment • It gets a copy of prod data & stage data : helps us test for scale!

Page 47: Airflow @ Agari

47

Security

•Each environment has its own AWS account for isolation

•We use Security Groups & VPCs to lock every EC2 instance down

• In other words, EC2 nodes only accept inbound traffic from other hosts in the same or peered VPCs

•Exception : Each env has one bastion host •The bastion host provides indirect access to the nodes in the VPC

Agari in AWS

Page 48: Airflow @ Agari

Agari’s Airflow in AWS

48

Airflow DB

Apache

bastion host• An Apache Reverse-proxy, running on the bastion host, does:

• SSH termination & • Authentication via mod_auth

sched

workflow host

web server

SSL

• The workflow machine hosts • airflow webserver • airflow scheduler (running LocalExecutor)

• A Postgresql DB instance hosts the Airflow database

Page 49: Airflow @ Agari

Fault Tolerance

49

Airflow DB

Apache

bastion host

sched

workflow-00

web server

SSL

sched

workflow-01

• At Agari, we use Monit to keep processes running on any EC2 instance, e.g. : • Apache webserver on the bastion host • Airflow webserver & scheduler on the workflow host • Postgresql on the DB host

• In Prod, we run 2 schedulers for increased scalability and fault-tolerance

Page 50: Airflow @ Agari

50

Airflow Github Repos

Apache/ Airflow

Agari/ Airflow (fork)

r39132/ Airflow (fork)

Page 51: Airflow @ Agari

51

Airflow Github Repos

Apache/ Airflow

Agari/ Airflow (fork)

r39132/ Airflow (fork)

Deployed to stage & prod

Page 52: Airflow @ Agari

52

Airflow Github Repos

Apache/ Airflow

Agari/ Airflow (fork)

r39132/ Airflow (fork)

Agari / Airflow pinned to a release version (e.g. 1.6.2)

Page 53: Airflow @ Agari

53

Airflow Github Repos

Apache/ Airflow

Agari/ Airflow (fork)

r39132/ Airflow (fork)

r39132 / Airflow kept in sync upstream master

Page 54: Airflow @ Agari

54

Airflow Github Repos

Apache/ Airflow

Agari/ Airflow (fork)

r39132/ Airflow (fork)

r39132 / Airflow sends PR 191

Page 55: Airflow @ Agari

55

Airflow Github Repos

Apache/ Airflow

Agari/ Airflow (fork)

r39132/ Airflow (fork)

PR 191 is cherry-picked into Agari / Airflow

Page 56: Airflow @ Agari

56

Airflow Github Repos

Apache/ Airflow

Agari/ Airflow (fork)

r39132/ Airflow (fork)

Deployed to stage & prod

What is deployed is 1.6.2 + cherry-picked (e.g. PR 191) changes

Page 57: Airflow @ Agari

Acknowledgments

57

• Vidur Apparao • Stephen Cattaneo • Jon Chase • Andrew Flury • William Forrester • Chris Haag • Mike Jones

• Scot Kennedy • Thede Loder • Paul Lorence • Kevin Mandich • Gabriel Ortiz • Jacob Rideout • Josh Yang • Julian Mehnle

None of this work would be possible without the contributions of the strong team below

Page 58: Airflow @ Agari

Questions? (@r39132)

58