cloud native data pipelines anand.pdf · 2016-10-20 · data pipeline correctness operability...

73
Cloud Native Data Pipelines Sid Anand QCon Shanghai & Tokyo 2016 1


Page 1: Cloud Native Data Pipelines Anand.pdf · 2016-10-20 · Data Pipeline Correctness Operability Timeliness Cost • Data Integrity (no loss, etc…) • Expected data distributions

Cloud Native Data Pipelines

Sid Anand QCon Shanghai & Tokyo 2016


About Me

Work [ed | s] @ (company logos not captured)

Committer & PPMC on Apache Airflow

Father of 2

Co-Chair for (logo not captured)

Agari

What We Do!

Agari : What We Do

(Slides 4-8: images only, not captured)

Agari : What We Do

Agari’s Previous EP Version: Batch

(Diagram labels: Enterprise Customers; email metadata; apply trust models; email md + trust score)

Agari : What We Do

Agari’s Current EP Version: Near-real time

(Diagram labels: Enterprise Customers; email metadata; apply trust models; email md + trust score; Quarantine)

Data Pipelines

BI vs Predictive

Data Pipelines (BI)

Diagram labels:
Web Servers
OLTP DB: MySQL, Oracle, Cassandra
ETL (batch)
Data Warehouse: Teradata, Redshift, BigQuery
Reporting Tools
Query Browsers

Data Pipelines (Predictive)

Diagram labels:
Data Source
ETL (batch or streaming): Spark, Flink, Beam, Storm
OLTP DB or cache: MySQL, Oracle, Cassandra, Redis
Web Servers
Data Products: Ranking (Search, News Feed), Recommender Products, Fraud Detection/Prevention

Data Products

Data Pipelines

BI | Predictive

Common Focus of this talk

(Diagram: the BI and Predictive pipeline diagrams from the two preceding slides, shown together)

Motivation

Cloud Native Data Pipelines

Cloud Native Data Pipelines

Big Data companies like LinkedIn, Facebook, Twitter, & Google build custom, large-scale data pipelines that run in their own data centers

Most start-ups run in the public cloud. Can they leverage aspects of the public cloud to build comparable pipelines?

Cloud Native Data Pipelines

Cloud Native Techniques

Open Source Technologies

~

Custom Data Pipeline Stacks seen in Big Data companies

Design Goals

Desirable Qualities of a Resilient Data Pipeline

Desirable Qualities of a Resilient Data Pipeline

Correctness
• Data Integrity (no loss, etc…)
• Expected data distributions

Timeliness
• All output within time-bound SLAs

Operability
• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs
• Quick Recoverability

Cost
• Pay-as-you-go
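The Correctness and Timeliness qualities can be expressed as simple per-run checks. The sketch below is only an illustration, not the pipeline's actual monitoring code: the function name, the use of a mean score as a proxy for the "expected distribution", and every threshold are assumptions.

```python
from statistics import fmean


def check_run(records_in, records_out, scores, elapsed_s,
              expected_mean, tolerance, sla_s):
    """Return a list of SLA violations for one pipeline run (empty if healthy)."""
    problems = []
    # Data integrity: every input record should produce one output record.
    if records_out != records_in:
        problems.append(f"integrity: {records_in} in, {records_out} out")
    # Expected distribution: the mean score should stay near its usual value.
    if scores and abs(fmean(scores) - expected_mean) > tolerance:
        problems.append("distribution: mean score outside expected range")
    # Timeliness: all output must land within the time bound.
    if elapsed_s > sla_s:
        problems.append(f"timeliness: {elapsed_s}s exceeds {sla_s}s")
    return problems


# Hypothetical runs: one healthy, one that loses records, drifts, and is late.
healthy = check_run(1000, 1000, [0.9, 0.8, 0.85], 1200,
                    expected_mean=0.85, tolerance=0.1, sla_s=3600)
lossy = check_run(1000, 990, [0.2, 0.3], 4000,
                  expected_mean=0.85, tolerance=0.1, sla_s=3600)
```

A non-empty result is what an alerting hook would act on, which is where Operability (monitoring and alerting of these SLAs) comes in.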

Quickly Recoverable

• Bugs happen!
• Bugs in Predictive Data Pipelines have a large blast radius
• Optimize for MTTR (mean time to recovery)

Predictive Analytics @ Agari

Use Cases

Use Cases

• Apply trust models (message scoring): batch + near real time
• Build trust models: batch

(Enterprise Protect)

Use-Case : Message Scoring (batch)

Batch Pipeline Architecture

Use-Case : Message Scoring

(Diagram labels, built up across slides 27-34: enterprise A, enterprise B, enterprise C; S3; S3; SNS; SQS; Importers (ASG); DB)

• An Avro file is uploaded to S3 every 15 minutes
• Airflow kicks off a Spark message-scoring job on EMR every hour
• The Spark job writes scored messages and stats to another S3 bucket
• This triggers SNS/SQS message events
• An Auto Scaling Group (ASG) of Importers spins up when it detects SQS messages
• The importers rapidly ingest scored messages and aggregate statistics into the DB
• Users receive alerts of untrusted emails & can review them in the web app
• Airflow manages the entire process
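To make the S3 to SNS to SQS hand-off concrete, here is a minimal sketch of how an importer might find out which file to ingest. It assumes the standard S3 event notification JSON, published to an SNS topic and delivered to SQS inside the usual SNS envelope (raw message delivery off). The function name, bucket, and key are hypothetical, and the slides do not show Agari's actual importer code.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_sqs_body(body: str) -> list:
    """Extract (bucket, key) pairs for newly created objects from one SQS message body."""
    envelope = json.loads(body)               # SNS envelope delivered to SQS
    event = json.loads(envelope["Message"])   # S3 event notification, as a JSON string
    return [
        # Object keys arrive URL-encoded in S3 event notifications.
        (rec["s3"]["bucket"]["name"], unquote_plus(rec["s3"]["object"]["key"]))
        for rec in event.get("Records", [])
        if rec.get("eventName", "").startswith("ObjectCreated")
    ]


# Hypothetical message for a scored-messages file written by the Spark job.
s3_event = {"Records": [{
    "eventName": "ObjectCreated:Put",
    "s3": {"bucket": {"name": "scored-messages"},
           "object": {"key": "enterprise-a/part-0000.avro"}},
}]}
sqs_body = json.dumps({"Type": "Notification", "Message": json.dumps(s3_event)})
objects = s3_objects_from_sqs_body(sqs_body)
```

Each returned pair tells the importer which object to download and load into the DB before it acknowledges (deletes) the SQS message.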

Tackling Cost & Timeliness

Leveraging the AWS Cloud

Tackling Cost

(Screenshots: Between Daily Runs | During Daily Runs)

When running daily, for 23 hours of a day, we didn’t pay for instances in the ASG or EMR

(Screenshots: Between Hourly Runs | During Hourly Runs)

This does not help when runs are hourly, since AWS charges at an hourly rate for EC2 instances!
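The arithmetic behind this point can be sketched as follows. The only billing rule used is the one stated on the slide (EC2 charged per hour, so a partial hour costs a full hour). The run durations and the function name are assumptions chosen for illustration.

```python
import math


def billed_instance_hours(runs_per_day: int, run_minutes: float) -> int:
    """Instance-hours billed per day for one instance that is started for each
    run and terminated afterwards, with partial hours rounded up."""
    return runs_per_day * math.ceil(run_minutes / 60)


ALWAYS_ON = 24  # instance-hours per day if the instance never shuts down

# Assumed durations: a daily run of up to an hour, hourly runs of 20 minutes.
daily = billed_instance_hours(runs_per_day=1, run_minutes=60)
hourly = billed_instance_hours(runs_per_day=24, run_minutes=20)
```

With one run a day, 1 of 24 instance-hours is billed. With 24 short runs, every hour is billed, which matches the always-on cost, so shutting instances down between hourly runs saves nothing.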

Tackling Timeliness

Auto Scaling Group (ASG)

ASG - Overview

What is it?

• A means to automatically scale out/in clusters to handle variable load/traffic
• A means to keep a cluster/service of a fixed size always up

ASG - Data Pipeline

(Diagram labels: SQS; Importer ASG with four importers, scale out / in; DB)

ASG : CPU-based

(Chart series: Sent, CPU, ACKd/Recvd)

CPU-based auto-scaling is good at scaling in/out to keep the average CPU constant

Premature Scale-in:

• The CPU drops to noise-levels before all messages are consumed
• This causes scale-in to occur while the last few messages are still being committed

ASG : Queue-based

Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0). This causes the ASG to grow.

Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d). This causes the ASG to shrink.
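The two queue-based rules can be written as a small decision function. This is a sketch of the rule as stated on the slide, not Agari's scaling tool. Giving scale-out priority when both conditions hold, and returning "hold" otherwise, are assumptions. In SQS, the two inputs correspond to the CloudWatch metrics ApproximateNumberOfMessagesVisible and ApproximateNumberOfMessagesNotVisible.

```python
def scaling_action(visible: int, in_flight: int) -> str:
    """Decide what the importer ASG should do, given the queue's state.

    visible:   messages waiting in the queue (queue depth)
    in_flight: messages received by a consumer but not yet ACK'd (invisible)
    """
    if visible > 0:
        return "scale_out"   # work is waiting: grow the ASG
    if in_flight == 0:
        return "scale_in"    # last in-flight message was ACK'd: shrink the ASG
    return "hold"            # queue is empty, but commits are still in progress


backlog = scaling_action(visible=5, in_flight=2)
draining = scaling_action(visible=0, in_flight=3)
idle = scaling_action(visible=0, in_flight=0)
```

The "hold" case is the one that CPU-based scaling gets wrong: CPU has already dropped to noise levels, but the last messages are still being committed, so the group must not shrink yet.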

Desirable Qualities of a Resilient Data Pipeline

Operability | Correctness

Timeliness
• ASG
• EMR Spark

Cost
Daily: • ASG • EMR Spark
Hourly: ASG • No Cost Savings

Tackling Operability & Correctness

Leveraging Tooling

Tackling Operability : Requirements

• A simple way to author and manage workflows
• Provides visual insight into the state & performance of workflow runs
• Integrates with our alerting and monitoring tools

Apache Airflow

Workflow Automation & Scheduling

Apache Airflow - Authoring DAGs

Airflow: Author DAGs in Python! No need to bundle many config files!
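The slide's code screenshot was not captured, so the sketch below only illustrates the idea of declaring a workflow as a DAG in plain Python. It deliberately uses the standard library's graphlib rather than the Airflow API, and the task names are hypothetical labels for the batch steps described earlier.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
# Task names are hypothetical, not taken from Agari's DAG definitions.
dependencies = {
    "run_spark_scoring_job": set(),
    "import_scored_messages": {"run_spark_scoring_job"},
    "alert_users": {"import_scored_messages"},
}

# A scheduler runs tasks in an order that respects every dependency.
run_order = list(TopologicalSorter(dependencies).static_order())
```

Because the dependencies live in ordinary Python data, they can be generated, looped over, and unit-tested, which is the advantage the slide contrasts with bundling many config files.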

Airflow: Visualizing a DAG

Apache Airflow - Managing DAGs

Airflow: It’s easy to manage multiple DAGs

Apache Airflow - Perf. Insights

Airflow: Gantt chart view reveals the slowest tasks for a run!

Airflow: Task Duration chart view shows task completion time trends!

Apache Airflow - Alerting

Airflow: …And easy to integrate with Ops tools!

Apache Airflow - Correctness

Desirable Qualities of a Resilient Data Pipeline

Operability | Correctness

Timeliness | Cost

Use-Case : Message Scoring (near-real time)

NRT Pipeline Architecture

Use-Case : Message Scoring

(Diagram labels, built up across slides 57-62: enterprise A, enterprise B, enterprise C; Kinesis (K); Scorers (ASG); Kinesis; Importers (ASG); DB; Kinesis (K); Alerters (ASG))

• Kinesis batch put every second
• An ASG of scorers is scaled up to one process per core per Kinesis shard
• Scorers apply the trust model and send scored messages downstream
• An ASG of importers is scaled up to rapidly import messages
• Imported messages are also consumed by the alerter

Innovations

NRT Pipeline Architecture

Innovation 1 : Repeatable Units

The architecture is composed of repeated patterns of:

• ASG-based compute consumer
• Kinesis transport streams (i.e. AWS’ managed “Kafka”)
• A Lambda-based Avro Schema Registry

(Diagram: one unit, labeled Kinesis_i, Compute_i, ASG_i, SR)

You can chain these repeatable units together to make arbitrary DAGs (Directed Acyclic Graphs)

(Diagram: three such units chained in sequence)

The example above is a simple linear DAG with 3 units

Innovation 2 : Avro Schema Registry

The message body is Avro-encoded, with one detail:

• The schema is not included in the Kinesis message!
• The schema would be 99% overhead for the message
• Instead, a schema_id is sent in the message header

When the Compute 2 consumer receives the message, it:

• First reads the schema_id out of the message header
• Contacts the Schema Registry for the schema (and caches it)
• Deserializes the Avro body using the newly acquired schema

(Diagram labels: Compute1 (ASG1); Kinesis2; Compute2 (ASG2); SR; SR.getSchemaById()…)
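The consumer-side lookup described above can be sketched as a small caching client. Everything here is illustrative: the class, the message layout, and the injected fetch function are assumptions, and the body is decoded as a JSON array of positional values as a stand-in for Avro binary decoding, so that the example needs no third-party library. The real registry is described as Lambda-based, and its interface is not shown beyond the name getSchemaById.

```python
import json
from typing import Callable, Dict


class CachingSchemaClient:
    """Fetch schemas by id from a registry, remembering each one after first use."""

    def __init__(self, fetch: Callable[[int], dict]):
        self._fetch = fetch                  # e.g. a call to the schema registry
        self._cache: Dict[int, dict] = {}

    def get_schema_by_id(self, schema_id: int) -> dict:
        if schema_id not in self._cache:     # only contact the registry on a miss
            self._cache[schema_id] = self._fetch(schema_id)
        return self._cache[schema_id]


def decode(message: dict, client: CachingSchemaClient) -> dict:
    """Read schema_id from the header, look up the schema, decode the body."""
    schema = client.get_schema_by_id(message["header"]["schema_id"])
    names = [field["name"] for field in schema["fields"]]
    values = json.loads(message["body"])     # stand-in for Avro deserialization
    return dict(zip(names, values))


# Hypothetical registry and message, used to show the cache at work.
registry_calls = []

def fake_registry(schema_id: int) -> dict:
    registry_calls.append(schema_id)
    return {"type": "record", "name": "ScoredMessage",
            "fields": [{"name": "message_id"}, {"name": "trust_score"}]}

client = CachingSchemaClient(fake_registry)
msg = {"header": {"schema_id": 7}, "body": json.dumps(["m-1", 0.92])}
first = decode(msg, client)
second = decode(msg, client)   # served from the cache, no second registry call
```

Sending only a small id per message is what removes the schema overhead, and the cache keeps the registry off the hot path after the first message of each schema.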

(Diagram: the full near-real-time pipeline with three Schema Registry (SR) labels. Other labels: enterprise A, enterprise B, enterprise C; Kinesis (K); Scorers (ASG); Kinesis; Importers (ASG); DB; Kinesis (K); Alerters (ASG))

Innovation 3 : Reactive-Scaling (WIP)

Airflow Job Reactively Scales

(Diagram labels: enterprise A, enterprise B, enterprise C; Kinesis (K); Scorers (ASG); Kinesis; Importers (ASG); DB; Kinesis (K); Alerters (ASG); SR, SR, SR)

Innovation 4 : Anomaly-based Rollback (WIP)

If the Anomaly-detector & Reverter (ADR) is triggered and a model build or code push was recently done to Compute 1, the ADR will revert the last code or model push to ASG Compute 1

(Diagram labels: Compute1 (ASG); Kinesis; Compute2 (ASG); SR; Anomaly-detector & Reverter)
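The revert rule can be sketched as a single predicate. The feature is marked WIP and the slides give no implementation, so this is an assumption-laden illustration: the function name, the one-hour meaning of "recently", and the way the anomaly signal arrives are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

RECENT_PUSH_WINDOW = timedelta(hours=1)  # assumed meaning of "recently"


def should_revert(anomaly_detected: bool,
                  last_push_at: Optional[datetime],
                  now: datetime,
                  window: timedelta = RECENT_PUSH_WINDOW) -> bool:
    """Revert only if an anomaly fired and a code or model push is recent enough
    to be the likely cause."""
    if not anomaly_detected or last_push_at is None:
        return False
    return timedelta(0) <= now - last_push_at <= window


now = datetime(2016, 10, 20, 12, 0)
after_fresh_push = should_revert(True, datetime(2016, 10, 20, 11, 30), now)
after_old_push = should_revert(True, datetime(2016, 10, 19, 9, 0), now)
no_anomaly = should_revert(False, datetime(2016, 10, 20, 11, 30), now)
```

Tying the revert to a recent push keeps the rollback from firing on anomalies that a deployment could not have caused, and automating it serves the earlier goal of optimizing for MTTR.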

Open Source Plans

Follow @AgariEng & @r39132 to be notified when the following are open-sourced:

• Avro Schema Registry
• Agari (Kinesis+ASG) scaling tool (Airflow Job)
• Anomaly-detector & Reverter

Acknowledgments

None of this work would be possible without the contributions of the strong team below

• Vidur Apparao • Stephen Cattaneo • Jon Chase • Andrew Flury • William Forrester • Chris Haag • Mike Jones

• Scot Kennedy • Thede Loder • Paul Lorence • Kevin Mandich • Gabriel Ortiz • Jacob Rideout • Josh Yang • Julian Mehnle

Questions? (@r39132)