airflow @ agari

Post on 16-Apr-2017


Apache Airflow @ Agari

Sid Anand (@r39132)

Apache Airflow Meet-up, June 2016

1

Agari

2

What We Do!

Agari : What We Do

3


8

Agari’s Current Product

[Diagram labels: Enterprise Customers; email metadata; apply trust models; email metadata + trust score]

Agari : What We Do

9

Agari’s Future Product

[Diagram labels: Enterprise Customers; email metadata; apply trust models; email metadata + trust score]

Agari : What We Do

Apache Airflow @ Agari

How Do We Use It?

10

2 Classes of Orchestration

11

New Product (Enterprise Protect)
• apply trust models (message scoring)
• build trust models

Operational Automation
• cron++ (general job scheduler)


This Talk

Use-Case : Message Scoring

Batch Pipeline Architecture

13

Use-Case : Message Scoring

14

[Diagram labels: enterprise A, enterprise B, enterprise C; S3]

S3 uploads every 15 minutes

Use-Case : Message Scoring

15

[Diagram labels: enterprise A, enterprise B, enterprise C; S3]

Airflow kicks off a Spark message scoring job every hour

Use-Case : Message Scoring

16

[Diagram labels: enterprise A, enterprise B, enterprise C; S3 (input); S3 (output)]

Spark job writes scored messages and stats to another S3 bucket

Use-Case : Message Scoring

17

[Diagram labels: enterprise A, enterprise B, enterprise C; S3; S3; SNS; SQS]

This triggers SNS/SQS message events

Use-Case : Message Scoring

18

[Diagram labels: enterprise A, enterprise B, enterprise C; S3; S3; SNS; SQS; Importers (ASG)]

An Autoscale Group (ASG) of Importers spins up when it detects SQS messages

19

[Diagram labels: enterprise A, enterprise B, enterprise C; S3; S3; SNS; SQS; Importers (ASG); DB]

The importers rapidly ingest scored messages and aggregate statistics into the DB

Use-Case : Message Scoring

20

[Diagram labels: enterprise A, enterprise B, enterprise C; S3; S3; SNS; SQS; Importers (ASG); DB]

Users can review untrusted emails in the web app

Use-Case : Message Scoring

21

[Diagram labels: enterprise A, enterprise B, enterprise C; S3; S3; SNS; SQS; Importers (ASG); DB]

Airflow manages the entire process

Use-Case : Message Scoring

Use-Case : Message Scoring

Airflow DAG

22

23

Airflow DAG

24

5 minute wait for S3 eventual consistency

Airflow DAG

25

1 hour a day, we also build new models

Airflow DAG

26

build models

Airflow DAG

27

dummy needed for branch operator
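A minimal sketch of the branching decision described on the last few slides, assuming the hourly DAG builds models during one fixed hour of the day. In Airflow, a branch operator takes a Python callable that returns the task_id of the branch to follow, so the "do nothing" path needs a real (dummy) task to point at. The hour and both task ids below are hypothetical; the slides do not give them.

```python
from datetime import datetime

# Hypothetical values: the slides do not say which hour builds models
# or what the tasks are called.
MODEL_BUILD_HOUR = 0
BUILD_TASK_ID = "build_models"
SKIP_TASK_ID = "skip_build_models"  # the dummy branch


def choose_branch(execution_date: datetime) -> str:
    """Return the task_id of the branch to follow for this hourly run."""
    if execution_date.hour == MODEL_BUILD_HOUR:
        return BUILD_TASK_ID
    return SKIP_TASK_ID
```

A callable like this would be handed to the branch operator; the other 23 runs of the day follow the dummy task and go straight on to scoring.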

Airflow DAG

28

• trigger_rule: one_success

• ExternalTaskSensor : on the previous hour of the same DAG
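The two bullets above can be sketched in plain Python. After a branch, one of the two upstream paths is skipped, so the task that joins them is set to run once at least one upstream task has succeeded rather than waiting for all of them. The sensor on the previous hour amounts to looking one schedule interval back from the current execution date. Both helpers are illustrations of the semantics, not Airflow's own code, and the state strings are assumed for the example.

```python
from datetime import datetime, timedelta
from typing import Iterable


def one_success(upstream_states: Iterable[str]) -> bool:
    """True once at least one upstream task has succeeded."""
    return any(state == "success" for state in upstream_states)


def previous_hour(execution_date: datetime) -> datetime:
    """Execution date of the prior run of an hourly DAG."""
    return execution_date - timedelta(hours=1)
```

With `["skipped", "success"]` as upstream states the join is allowed to run, which is the situation the branch creates on every run.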

Airflow DAG

29

Prep Spark Run

Airflow DAG

30

• Run Spark

• Verify a record written to the DB

• Wait for the SQS queue to empty
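The "wait for the SQS queue to empty" step can be sketched as a polling loop. This is an assumption about the shape of the check, not the code Agari runs: `get_depth` is a hypothetical callable that returns the current queue depth (in practice it might read the queue's approximate message count), and the poll interval and timeout are made-up defaults.

```python
import time
from typing import Callable


def wait_for_empty_queue(
    get_depth: Callable[[], int],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the queue reports zero messages.

    Returns True when the queue is empty, False if the timeout elapses first.
    """
    waited = 0.0
    while True:
        if get_depth() == 0:
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
```

Passing `sleep` in as a parameter keeps the loop easy to exercise without real waiting.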

Airflow DAG

31

• Compute discrepancies

• Send email report

• Update monitoring graphs

• Raise SLA (correctness) alerts

Airflow DAG

SLAs & Insights

Airflow

32

33

Desirable Qualities of a Resilient Data Pipeline

Correctness
• Data Integrity (no loss, etc…)
• Expected data distributions

Timeliness
• All output within time-bound SLAs (e.g. 1 hour)

Operability
• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs
• Quick Recoverability

Cost
• Pay-as-you-go

34

Desirable Qualities of a Resilient Data Pipeline

[Same four qualities as the previous slide, with "SLA" labels marking Correctness and Timeliness]

35

Correctness : Email Reporting

For each org, we check for duplicate or missing data as a count & percentage
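A sketch of the per-org check, assuming each pipeline stage can list the message ids it saw for an org. The function and field names are hypothetical; the report on the slide only tells us that duplicates and missing data are shown as a count and a percentage.

```python
from collections import Counter
from typing import Dict, Iterable


def discrepancy_report(expected_ids: Iterable[str], actual_ids: Iterable[str]) -> Dict[str, float]:
    """Count missing and duplicated ids, also as a percentage of expected."""
    expected = set(expected_ids)
    actual = Counter(actual_ids)
    missing = len(expected - set(actual))
    # Each extra copy of an id counts as one duplicate.
    duplicates = sum(n - 1 for n in actual.values() if n > 1)
    total = len(expected)

    def pct(count: int) -> float:
        return 100.0 * count / total if total else 0.0

    return {
        "missing": missing,
        "missing_pct": pct(missing),
        "duplicates": duplicates,
        "duplicate_pct": pct(duplicates),
    }


def report_by_org(expected_by_org: Dict[str, list], actual_by_org: Dict[str, list]) -> Dict[str, Dict[str, float]]:
    """Run the check for every org that was expected to have data."""
    return {
        org: discrepancy_report(ids, actual_by_org.get(org, []))
        for org, ids in expected_by_org.items()
    }
```

Running the same comparison between each pair of adjacent stages is one way to see where a discrepancy first appears.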

orgs

36

Correctness : Email Reporting

These are the 3 stages of the pipeline. We can detect where a discrepancy is coming from - often related to a code push!

orgs

37

Correctness : Monitoring

38

Airflow: …And easy to integrate with Ops tools!

Correctness & Timeliness : Alerting

39

Airflow: …And easy to integrate with Ops tools!

Correctness & Timeliness : Alerting

Timeliness SLA miss

40

Airflow: …And easy to integrate with Ops tools!

Correctness & Timeliness : Alerting

Timeliness SLA miss

dag = DAG(DAG_NAME, schedule_interval='@hourly', default_args=default_args, sla_miss_callback=sla_alert_func)
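The slide names `sla_alert_func` but does not show it. A minimal sketch of what such a callback could look like, assuming Airflow invokes it with the five arguments `dag, task_list, blocking_task_list, slas, blocking_tis`. `send_page` and `sent_alerts` are hypothetical stand-ins for a call to a paging service; the message format is also made up.

```python
from typing import Any, List

# Stand-in for a paging service; a real callback would call out to one.
sent_alerts: List[str] = []


def send_page(message: str) -> None:
    """Hypothetical notifier: records the alert instead of paging anyone."""
    sent_alerts.append(message)


def format_sla_alert(dag_id: str, task_list: str, blocking_task_list: str) -> str:
    """Build a short, human-readable alert body."""
    lines = [
        "Timeliness SLA miss in DAG: " + dag_id,
        "Tasks that missed their SLA:",
        task_list or "(none listed)",
        "Blocking tasks:",
        blocking_task_list or "(none listed)",
    ]
    return "\n".join(lines)


def sla_alert_func(dag: Any, task_list: str, blocking_task_list: str,
                   slas: list, blocking_tis: list) -> None:
    """Callback to pass as sla_miss_callback when constructing the DAG."""
    send_page(format_sla_alert(dag.dag_id, task_list, blocking_task_list))
```

Keeping the formatting separate from the sending makes it simple to swap the notifier without touching the callback signature.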

41

Airflow: …And easy to integrate with Ops tools!

Correctness & Timeliness : Alerting

Timeliness SLA miss

Correctness SLA miss

42

Airflow: …And easy to integrate with Ops tools!

Correctness & Timeliness : Alerting

Timeliness & Correctness SLA misses sent to PagerDuty/VictorOps

Operations

43

Managing Airflow

44

Agari in AWS

• Agari runs in the AWS US-West-2 Region & leverages multiple Availability Zones (AZs) for Fault Tolerance
• We have 3 environments (Dev, Stage, Prod), each with its own AWS account

Dev
• A Developer’s playground
• Developers prototype new services here

45


Stage
• A Pre-Production Env
• All new code & system upgrades are baked here prior to Prod deployment
• It gets a copy of prod data & stage data : helps us test for scale!

46


Stage
• A Pre-Production Env
• All new code & system upgrades are baked here prior to Prod deployment
• It gets a copy of prod data & stage data : helps us test for scale!

Prod
• Prod Env

47

Security

• Each environment has its own AWS account for isolation
• We use Security Groups & VPCs to lock every EC2 instance down
  • In other words, EC2 nodes only accept inbound traffic from other hosts in the same or peered VPCs
• Exception : Each env has one bastion host
  • The bastion host provides indirect access to the nodes in the VPC

Agari in AWS

Agari’s Airflow in AWS

48

[Diagram labels: bastion host (Apache); SSL; workflow host (web server, sched); Airflow DB]

• An Apache Reverse-proxy, running on the bastion host, does:
  • SSL termination
  • Authentication via mod_auth
• The workflow machine hosts
  • airflow webserver
  • airflow scheduler (running LocalExecutor)
• A Postgresql DB instance hosts the Airflow database

Fault Tolerance

49

[Diagram labels: bastion host (Apache); SSL; workflow-00; workflow-01; web server; sched (x2); Airflow DB]

• At Agari, we use Monit to keep processes running on any EC2 instance, e.g. :
  • Apache webserver on the bastion host
  • Airflow webserver & scheduler on the workflow host
  • Postgresql on the DB host
• In Prod, we run 2 schedulers for increased scalability and fault-tolerance

50

Airflow Github Repos

Apache / Airflow

Agari / Airflow (fork)

r39132 / Airflow (fork)

51


Deployed to stage & prod

52


Agari / Airflow pinned to a release version (e.g. 1.6.2)

53

r39132 / Airflow is kept in sync with upstream master

54


r39132 / Airflow sends PR 191

55


PR 191 is cherry-picked into Agari / Airflow

56


Deployed to stage & prod

What is deployed is 1.6.2 + cherry-picked (e.g. PR 191) changes

Acknowledgments

57

None of this work would be possible without the contributions of the strong team below

• Vidur Apparao • Stephen Cattaneo • Jon Chase • Andrew Flury • William Forrester • Chris Haag • Mike Jones • Scot Kennedy • Thede Loder • Paul Lorence • Kevin Mandich • Gabriel Ortiz • Jacob Rideout • Josh Yang • Julian Mehnle

Questions? (@r39132)

58
