airflow @ agari
TRANSCRIPT
Apache Airflow @ Agari
Sid Anand (@r39132) Apache Airflow Meet-up June 2016
1
Agari
2
What We Do!
Agari : What We Do
3
4
Agari : What We Do
5
Agari : What We Do
6
Agari : What We Do
7
Agari : What We Do
8
Enterprise Customers
email metadata
apply trust
models
email md + trust score
Agari’s Current Product
Agari : What We Do
9
email metadata
apply trust
models
email md+ trust score
Agari’s Future ProductEnterprise Customers
Agari : What We Do
Apache Airflow @ AgariHow Do We Use It?
10
2 Classes of Orchestration
11
apply trust models
(message scoring)
build trust models
cron++ (general
job scheduler)
New Product (Enterprise Protect)
Operational Automation
2 Classes of Orchestration
12
apply trust models
(message scoring)
build trust models
cron++ (general
job scheduler)
New Product (Enterprise Protect)
Operational Automation
This Talk
Use-Case : Message ScoringBatch Pipeline Architecture
13
Use-Case : Message Scoring
14
enterprise Aenterprise Benterprise C
S3
S3 uploads every 15 minutes
Use-Case : Message Scoring
15
enterprise Aenterprise Benterprise C
S3
Airflow kicks of a Spark message scoring job
every hour
Use-Case : Message Scoring
16
enterprise Aenterprise Benterprise C
S3
Spark job writes scored messages and stats to
another S3 bucket
S3
Use-Case : Message Scoring
17
enterprise Aenterprise Benterprise C
S3
This triggers SNS/SQS messages events
S3
SNS
SQS
Use-Case : Message Scoring
18
enterprise Aenterprise Benterprise C
S3
An Autoscale Group (ASG) of Importers spins up when it detects SQS
messages
S3
SNS
SQS
Importers
ASG
19
enterprise Aenterprise Benterprise C
S3
The importers rapidly ingest scored messages and aggregate statistics into
the DB
S3
SNS
SQS
Importers
ASGDB
Use-Case : Message Scoring
20
enterprise Aenterprise Benterprise C
S3
Users can review untrusted emails in
the web app
S3
SNS
SQS
Importers
ASGDB
Use-Case : Message Scoring
21
enterprise Aenterprise Benterprise C
S3 S3
SNS
SQS
Importers
ASGDB
Airflow manages the entire process
Use-Case : Message Scoring
Use-Case : Message ScoringAirflow DAG
22
23
Airflow DAG
24
5 minute wait for S3 eventual consistency
Airflow DAG
25
1 hour a day, we also build new models
Airflow DAG
26
build models
Airflow DAG
27
dummy needed for branch operator
Airflow DAG
28
• trigger_rule: one_success
• ExternalTaskSensor : on the previous hour of the same DAG
Airflow DAG
29
Prep Spark Run
Airflow DAG
30
• Run Spark • Verify a record written
to the DB • Wait for the SQS
queue to empty
Airflow DAG
31
• Compute discrepancies
• Send email report
• Update monitoring graphs
• Raise SLA (correctness) alerts
Airflow DAG
SLAs & InsightsAirflow
32
33
Desirable Qualities of a Resilient Data Pipeline
OperabilityCorrectness
Timeliness Cost
• Data Integrity (no loss, etc…) • Expected data distributions
• All output within time-bound SLAs (e.g. 1 hour)
• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs
• Quick Recoverability
• Pay-as-you-go
34
Desirable Qualities of a Resilient Data Pipeline
OperabilityCorrectness
Timeliness Cost
• Data Integrity (no loss, etc…) • Expected data distributions
• All output within time-bound SLAs (e.g. 1 hour)
• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs
• Quick Recoverability
• Pay-as-you-go
SLA
SLA
35
Correctness : Email Reporting
For each org, we check for duplicate or missing data as a count & percentage
orgs
36
Correctness : Email Reporting
These are the 3 stages of the pipeline. We can detect where a discrepancy is coming from - often related to a code push!
orgs
37
Correctness : Monitoring
38
Airflow: …And easy to integrate with Ops tools!
Correctness & Timeliness : Alerting
39
Airflow: …And easy to integrate with Ops tools!
Correctness & Timeliness : Alerting
Timeliness SLA miss
40
Airflow: …And easy to integrate with Ops tools!
Correctness & Timeliness : Alerting
Timeliness SLA miss
dag = DAG(DAG_NAME, schedule_interval='@hourly', default_args=default_args, sla_miss_callback=sla_alert_func)
41
Airflow: …And easy to integrate with Ops tools!
Correctness & Timeliness : Alerting
Timeliness SLA miss
Correctness SLA miss
42
Airflow: …And easy to integrate with Ops tools!
Correctness & Timeliness : Alerting
Timeliness & Correctness SLA misses sent to PagerDuty/VictorOps
Operations
43
Managing Airflow
44
• Agari runs in the AWS US-West-2 Region & leverages multiple Availability Zones (AZs) for Fault Tolerance
•We have 3 environments (Dev, Stage, Prod), each with its own AWS account
Agari in AWS
Dev • A Developer’s playground • developers prototype new services here
45
• Agari runs in the AWS US-West-2 Region & leverages multiple Availability Zones (AZs) for Fault Tolerance
•We have 3 environments (Dev, Stage, Prod), each with its own AWS account
Agari in AWS
Dev • A Developer’s playground • developers prototype new services here
Stage• A Pre-Production Env
• All new code & system upgrades are baked here prior to Prod deployment
• It gets a copy of prod data & stage data : helps us test for scale!
46
• Agari runs in the AWS US-West-2 Region & leverages multiple Availability Zones (AZs) for Fault Tolerance
•We have 3 environments (Dev, Stage, Prod), each with its own AWS account
Agari in AWS
Dev • A Developer’s playground • developers prototype new services here
Stage
Prod • Prod Env
• A Pre-Production Env • All new code & system upgrades are baked here prior to Prod
deployment • It gets a copy of prod data & stage data : helps us test for scale!
47
Security
•Each environment has its own AWS account for isolation
•We use Security Groups & VPCs to lock every EC2 instance down
• In other words, EC2 nodes only accept inbound traffic from other hosts in the same or peered VPCs
•Exception : Each env has one bastion host •The bastion host provides indirect access to the nodes in the VPC
Agari in AWS
Agari’s Airflow in AWS
48
Airflow DB
Apache
bastion host• An Apache Reverse-proxy, running on the bastion host, does:
• SSH termination & • Authentication via mod_auth
sched
workflow host
web server
SSL
• The workflow machine hosts • airflow webserver • airflow scheduler (running LocalExecutor)
• A Postgresql DB instance hosts the Airflow database
Fault Tolerance
49
Airflow DB
Apache
bastion host
sched
workflow-00
web server
SSL
sched
workflow-01
• At Agari, we use Monit to keep processes running on any EC2 instance, e.g. : • Apache webserver on the bastion host • Airflow webserver & scheduler on the workflow host • Postgresql on the DB host
• In Prod, we run 2 schedulers for increased scalability and fault-tolerance
50
Airflow Github Repos
Apache/ Airflow
Agari/ Airflow (fork)
r39132/ Airflow (fork)
51
Airflow Github Repos
Apache/ Airflow
Agari/ Airflow (fork)
r39132/ Airflow (fork)
Deployed to stage & prod
52
Airflow Github Repos
Apache/ Airflow
Agari/ Airflow (fork)
r39132/ Airflow (fork)
Agari / Airflow pinned to a release version (e.g. 1.6.2)
53
Airflow Github Repos
Apache/ Airflow
Agari/ Airflow (fork)
r39132/ Airflow (fork)
r39132 / Airflow kept in sync upstream master
54
Airflow Github Repos
Apache/ Airflow
Agari/ Airflow (fork)
r39132/ Airflow (fork)
r39132 / Airflow sends PR 191
55
Airflow Github Repos
Apache/ Airflow
Agari/ Airflow (fork)
r39132/ Airflow (fork)
PR 191 is cherry-picked into Agari / Airflow
56
Airflow Github Repos
Apache/ Airflow
Agari/ Airflow (fork)
r39132/ Airflow (fork)
Deployed to stage & prod
What is deployed is 1.6.2 + cherry-picked (e.g. PR 191) changes
Acknowledgments
57
• Vidur Apparao • Stephen Cattaneo • Jon Chase • Andrew Flury • William Forrester • Chris Haag • Mike Jones
• Scot Kennedy • Thede Loder • Paul Lorence • Kevin Mandich • Gabriel Ortiz • Jacob Rideout • Josh Yang • Julian Mehnle
None of this work would be possible without the contributions of the strong team below
Questions? (@r39132)
58