Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo

Upload: sid-anand

Post on 06-Jan-2017

TRANSCRIPT

Page 1: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Cloud Native Data Pipelines

Sid Anand QCon Shanghai & Tokyo 2016

1

Page 2: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Sid Anand QCon Shanghai & Tokyo 2016

2

Japanese Translation: Kiro Harada (@haradakiro)

Page 3: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

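About Me

3

• Work[ed | s] @ (company logos not preserved in the transcript)
• Committer & PPMC on Apache Airflow
• Co-Chair for (conference logo not preserved in the transcript)
• Father of 2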

Page 5: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

5

Live Stream

Page 7: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Agari

7

What We Do!

Page 19: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

19

Agari : What We Do

Agari’s Previous EP Version (Batch)

• Enterprise Customers send email metadata
• Agari applies trust models
• Enterprise Customers receive email metadata + trust score

Page 21: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

21

Agari : What We Do

Agari’s Current EP Version (Near-real time)

• Enterprise Customers send email metadata
• Agari applies trust models
• Enterprise Customers receive email metadata + trust score
• Quarantine

Page 23: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Data Pipelines - BI vs Predictive

23

Page 25: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Data Pipelines (BI)

25

• Web Servers
• OLTP DB: MySQL, Oracle, Cassandra
• ETL (batch)
• Data Warehouse: Teradata, Redshift, BigQuery
• Reporting Tools
• Query Browsers

Page 27: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Data Pipelines (Predictive)

27

• Data Source
• ETL (batch or streaming): Spark, Flink, Beam, Storm
• OLTP DB or cache: MySQL, Oracle, Cassandra, Redis
• Web Servers
• Data Products: Ranking (Search, News Feed), Recommender Products, Fraud Detection/Prevention

Page 29: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Data Products

29

Page 31: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Data Pipelines

31

• BI: Common
• Predictive: Focus of this talk

Page 33: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Motivation - Cloud Native Data Pipelines

33

Page 34: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Cloud Native Data Pipelines

34

Big Data Companies like LinkedIn, Facebook, Twitter, & Google build custom, large scale data pipelines that run in their own Data Centers

Page 36: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Cloud Native Data Pipelines

36

Big Data Companies like LinkedIn, Facebook, Twitter, & Google build custom, large scale data pipelines that run in their own Data Centers

Most start-ups run in the public cloud. Can they leverage aspects of the public cloud to build comparable pipelines?

Page 38: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Cloud Native Data Pipelines

38

Cloud Native Techniques + Open Source Technologies ~ Custom Data Pipeline Stacks seen in Big Data companies

Page 40: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Design Goals - Desirable Qualities of a Resilient Data Pipeline

40

Page 44: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

44

Desirable Qualities of a Resilient Data Pipeline

Correctness
• Data Integrity (no loss, etc…)
• Expected data distributions

Timeliness
• All output within time-bound SLAs

Operability
• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs
• Quick Recoverability

Cost
• Pay-as-you-go

Page 46: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Quickly Recoverable

46

• Bugs happen!

• Bugs in Predictive Data Pipelines have a large blast radius

• Optimize for MTTR

Page 48: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Predictive Analytics @ Agari - Use Cases

48

Page 50: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Use Cases

50

(Enterprise Protect)

• Apply trust models (message scoring): batch + near real time
• Build trust models: batch

Page 52: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Use-Case : Message Scoring (batch) - Batch Pipeline Architecture

52

Page 54: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Use-Case : Message Scoring

Diagram components: enterprise A, enterprise B, enterprise C, S3 (input), S3 (output), SNS, SQS, Importers (ASG), DB, Web App, Airflow

• Enterprises A, B, and C each upload an Avro file to S3 every 15 minutes
• Airflow kicks off a Spark message scoring job every hour (EMR)
• The Spark job writes scored messages and stats to another S3 bucket
• This triggers SNS/SQS message events
• An Auto Scaling Group (ASG) of Importers spins up when it detects SQS messages
• The importers rapidly ingest scored messages and aggregate statistics into the DB
• Users receive alerts of untrusted emails & can review them in the web app
• Airflow manages the entire process

Page 70: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Tackling Cost & Timeliness - Leveraging the AWS Cloud

70

Page 72: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Tackling Cost

72

Between Daily Runs During Daily Runs

When running daily, for 23 hours of a day, we didn’t pay for instances in the ASG or EMR

Page 74: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Tackling Cost

74

Between Hourly Runs During Hourly Runs

When running daily, for 23 hours of a day, we didn’t pay for instances in the ASG or EMR

This does not help when runs are hourly since AWS charges at an hourly rate for EC2 instances!
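The arithmetic behind this slide can be sketched in a few lines. This is a minimal illustration, assuming the hourly billing described above (each run's usage is rounded up to whole instance-hours); the function name and the run lengths are hypothetical and are not taken from the talk.

```python
import math

def billed_hours_per_day(runs_per_day: int, minutes_per_run: float) -> int:
    """Instance-hours billed per day when each run is rounded up to whole hours."""
    return runs_per_day * math.ceil(minutes_per_run / 60)

# Hypothetical run lengths, chosen only for illustration.
daily = billed_hours_per_day(runs_per_day=1, minutes_per_run=60)    # one 60-minute run
hourly = billed_hours_per_day(runs_per_day=24, minutes_per_run=10)  # 24 ten-minute runs

print(daily, hourly)
```

Under this model, 24 short hourly runs are billed like an instance that never shuts down, which is why scaling to zero between runs saves money for daily runs but not for hourly ones.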

Page 76: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Tackling Timeliness - Auto Scaling Group (ASG)

76

Page 78: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

ASG - Overview

78

What is it?

A means to automatically scale out/in clusters to handle variable load/traffic

A means to keep a cluster/service of a fixed size always up

Page 80: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

ASG - Data Pipeline

80

Diagram: SQS feeds the Importer ASG (several importer instances, scale out / in), which writes to the DB

Page 82: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

82

ASG : CPU-based

CPU-based auto-scaling is good at scaling in/out to keep the average CPU constant

Chart series: Sent, ACKd/Recvd, CPU

Page 84: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

ASG : CPU-based

84

Chart series: Sent, Recv, CPU (annotation: Premature Scale-in)

Premature Scale-in:

• The CPU drops to noise-levels before all messages are consumed

• This causes scale-in to occur while the last few messages are still being committed

Page 86: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

86

ASG : Queue-based

Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0). This causes the ASG to grow

Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d). This causes the ASG to shrink
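The two queue-based rules can be sketched as a small decision function. This is a minimal illustration, not Agari's implementation: the function name, the "hold" state, and the choice to let scale-out take precedence are assumptions. In SQS terms, the two inputs would roughly correspond to the CloudWatch metrics ApproximateNumberOfMessagesVisible and ApproximateNumberOfMessagesNotVisible.

```python
def desired_action(visible: int, in_flight: int) -> str:
    """Pick a scaling action from two queue metrics.

    visible   -- messages waiting in the queue (queue depth)
    in_flight -- messages received by a consumer but not yet ACK'd
    """
    if visible > 0:
        return "scale_out"   # work is waiting: grow the ASG
    if in_flight == 0:
        return "scale_in"    # last in-flight message was ACK'd: shrink the ASG
    return "hold"            # queue is drained, but commits are still in progress

print(desired_action(visible=5, in_flight=0))
print(desired_action(visible=0, in_flight=3))
print(desired_action(visible=0, in_flight=0))
```

The "hold" branch is the case that CPU-based scaling gets wrong: CPU has dropped to noise levels, yet the last messages are still being committed.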

Page 88: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

88

ASG : Queue-based

Shoyu Koto Da!!!!

Page 90: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

90

Desirable Qualities of a Resilient Data Pipeline

Correctness

Operability

Timeliness
• ASG
• EMR Spark

Cost
• Daily: ASG, EMR Spark
• Hourly: ASG, No Cost Savings

Page 92: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Tackling Operability & Correctness - Leveraging Tooling

92

Page 94: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

94

Tackling Operability : Requirements

• A simple way to author and manage workflows
• Provides visual insight into the state & performance of workflow runs
• Integrates with our alerting and monitoring tools

Page 96: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Apache Airflow - Workflow Automation & Scheduling

96

Page 98: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

98

Airflow: Author DAGs in Python! No need to bundle many config files!

Apache Airflow - Authoring DAGs

Page 100: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

100

Airflow: Visualizing a DAG

Apache Airflow - Authoring DAGs

Page 102: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

102

Airflow: It’s easy to manage multiple DAGs

Apache Airflow - Managing DAGs

Page 104: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Apache Airflow - Perf. Insights

104

Airflow: Gantt chart view reveals the slowest tasks for a run!

Page 106: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

106

Apache Airflow - Perf. Insights

Airflow: Task Duration chart view shows task completion time trends!

Page 108: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

108

Apache Airflow - Alerting

Airflow: …And easy to integrate with Ops tools!

Page 110: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

110

Apache Airflow - Correctness

Page 112: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

112

Desirable Qualities of a Resilient Data Pipeline

• Operability • Correctness • Timeliness • Cost

Page 114: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Use-Case : Message Scoring (near-real time) - NRT Pipeline Architecture

114

Page 116: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Use-Case : Message Scoring

Diagram: enterprise A, B, C -> Kinesis (K) -> Scorers (ASG) -> Kinesis -> Importers (ASG) -> DB; imported messages -> Kinesis (K) -> Alerters (ASG) -> Quarantine Email

• Enterprises A, B, and C do a Kinesis batch put every second
• An ASG of scorers is scaled up to one process per core per Kinesis shard
• Scorers apply the trust model and send scored messages downstream
• An ASG of importers is scaled up to rapidly import messages
• Imported messages are also consumed by the alerter
• Alerters: Quarantine Email

Page 128: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Innovations - NRT Pipeline Architecture

128

Page 130: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Apache Avro - What is Avro?

130

Page 134: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

134

What is Avro?

Avro is a self-describing serialization format that supports

primitive data types : int, long, boolean, float, string, bytes, etc…

complex data types : records, arrays, unions, maps, enums, etc…

many language bindings : Java, Scala, Python, Ruby, etc…

The most common format for storing structured Big Data at rest in HDFS, S3, Google Cloud Storage, etc…

Supports Schema Evolution!

Page 136: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

136

Avro Schema Example

{"namespace": "agari",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}

• complex type (record)
• Schema name : User
• 3 fields in the record: 1 required, 2 optional
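The "1 required, 2 optional" reading follows from the union types: a field whose type is a union containing "null" may be absent. A minimal sketch of that check, using only the standard json module rather than an Avro library; the helper name split_fields is hypothetical.

```python
import json

SCHEMA_JSON = """
{"namespace": "agari", "type": "record", "name": "User",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "favorite_number", "type": ["int", "null"]},
   {"name": "favorite_color", "type": ["string", "null"]}
 ]}
"""

def split_fields(schema: dict) -> tuple[list[str], list[str]]:
    """Return (required, optional) field names; a union with "null" counts as optional."""
    required, optional = [], []
    for field in schema["fields"]:
        ftype = field["type"]
        is_nullable = isinstance(ftype, list) and "null" in ftype
        (optional if is_nullable else required).append(field["name"])
    return required, optional

schema = json.loads(SCHEMA_JSON)
required, optional = split_fields(schema)
print(required, optional)
```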

Page 144: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

144

Avro Schema Data File Example

A data file stores the schema once, followed by the data records (x 1,000,000,000)

• Schema: 0.0001 % of the file
• Data: 99.999 % of the file

Page 146: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

146

Avro Schema Streaming Example

In a stream, each message carries the schema together with one binary data block

• Schema: 99 % of the message
• Data: 1 % of the message

OVERHEAD!!
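The contrast between the data file and the streaming case comes down to how many records share one copy of the schema. A minimal sketch of that ratio follows; the byte sizes are hypothetical placeholders, not measurements from the talk, so the resulting percentages will differ from the rounded figures on the slides.

```python
def schema_share(schema_bytes: int, record_bytes: int, num_records: int) -> float:
    """Fraction of the total payload taken up by the schema."""
    total = schema_bytes + record_bytes * num_records
    return schema_bytes / total

SCHEMA_BYTES = 250   # hypothetical size of the JSON schema
RECORD_BYTES = 20    # hypothetical size of one binary-encoded record

# Data file: one schema amortized over a billion records.
file_share = schema_share(SCHEMA_BYTES, RECORD_BYTES, num_records=1_000_000_000)
# Stream: every message ships its own copy of the schema.
stream_share = schema_share(SCHEMA_BYTES, RECORD_BYTES, num_records=1)

print(f"file: {file_share:.2e}  stream: {stream_share:.2f}")
```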

Page 150: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

150

Innovation 1 : Avro Schema Registry

Components: Message Producer (P), Message Consumer (C), Schema Registry (Lambda)

• The Message Producer registers its schema with the Schema Registry (register_schema)
• register_schema returns a UUID
• The Message Producer sends UUID + Data to the Message Consumer
• The Message Consumer calls getSchemaById (UUID) and receives the schema
• Message Consumers download & cache the schema, then decode the data
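The registry flow can be sketched with an in-memory stand-in. This is an illustration under stated assumptions, not the Lambda-based service from the talk: the class names, the message layout (a dict with "schema_id" and "data"), the idempotent registration, and the use of a plain list in place of Avro binary encoding are all hypothetical.

```python
import json
import uuid

class InMemorySchemaRegistry:
    """Stand-in for the registry service: stores schemas keyed by UUID."""

    def __init__(self):
        self._by_id = {}
        self._id_by_schema = {}
        self.lookups = 0  # counts get_schema_by_id calls, to show the effect of caching

    def register_schema(self, schema: dict) -> str:
        # Assumption: registering the same schema twice returns the same UUID.
        key = json.dumps(schema, sort_keys=True)
        if key not in self._id_by_schema:
            schema_id = str(uuid.uuid4())
            self._id_by_schema[key] = schema_id
            self._by_id[schema_id] = schema
        return self._id_by_schema[key]

    def get_schema_by_id(self, schema_id: str) -> dict:
        self.lookups += 1
        return self._by_id[schema_id]

class Consumer:
    """Downloads a schema once per UUID, caches it, then decodes messages."""

    def __init__(self, registry: InMemorySchemaRegistry):
        self._registry = registry
        self._cache = {}

    def decode(self, message: dict) -> dict:
        schema_id = message["schema_id"]
        if schema_id not in self._cache:
            self._cache[schema_id] = self._registry.get_schema_by_id(schema_id)
        schema = self._cache[schema_id]
        # Stand-in for Avro decoding: pair field names with positional values.
        names = [field["name"] for field in schema["fields"]]
        return dict(zip(names, message["data"]))

schema = {
    "namespace": "agari", "type": "record", "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]},
    ],
}

registry = InMemorySchemaRegistry()
schema_id = registry.register_schema(schema)   # producer side
consumer = Consumer(registry)
first = consumer.decode({"schema_id": schema_id, "data": ["Ann", 7, None]})
second = consumer.decode({"schema_id": schema_id, "data": ["Bob", None, "red"]})
print(first, second, registry.lookups)
```

Each message now carries a short UUID instead of the full schema, and the consumer contacts the registry only once per schema.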

Page 162: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

162

Innovation 1 : Avro Schema Registry

Diagram: the NRT pipeline (enterprise A, B, C; Kinesis; Scorers ASG; Importers ASG; DB; Alerters ASG) with a Schema Registry (SR) shown at three points

Imported messages are also consumed by the alerter

Page 164: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

164

Innovation 2 : Repeatable Units

The Architecture is composed of repeated patterns of :

• ASG-based compute consumer
• Kinesis transport streams (i.e. AWS’ managed “Kafka”)
• A Lambda-based Avro Schema Registry

Diagram: unit i = Kinesis i + Compute i (ASG i) + SR

Page 165: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

165

Innovation 2 : Repeatable Units

• You can chain these repeatable units together to make arbitrary DAGs (Directed Acyclic Graphs)
• Use HashiCorp’s Terraform to compose your DAG through automation
• The example diagram is a simple linear DAG with 3 units (each unit: Kinesis i, Compute i in ASG i, SR)
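Chaining units into a DAG can be modelled in a few lines. The talk suggests Terraform for the actual composition; the sketch below is only a Python illustration of the data structure, with hypothetical unit names and naming scheme. It uses the standard-library graphlib module (Python 3.9+), whose TopologicalSorter raises CycleError if the graph is not acyclic.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter

@dataclass(frozen=True)
class Unit:
    """One repeatable unit: an input stream plus the ASG-backed compute that consumes it."""
    name: str
    stream: str
    asg: str

def make_unit(name: str) -> Unit:
    # Hypothetical naming scheme, for illustration only.
    return Unit(name=name, stream=f"{name}-stream", asg=f"{name}-asg")

def linear_dag(names: list[str]) -> dict[str, set[str]]:
    """Map each unit name to the set of units directly upstream of it."""
    graph = {names[0]: set()} if names else {}
    for upstream, downstream in zip(names, names[1:]):
        graph[downstream] = {upstream}
    return graph

names = ["scorer", "importer", "alerter"]        # a linear DAG with 3 units
units = {name: make_unit(name) for name in names}
order = list(TopologicalSorter(linear_dag(names)).static_order())
print(order)
```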

Page 167: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Innovation 3 : Reactive-Scaling (WIP)

167

An Airflow job reactively scales the pipeline

Diagram: the NRT pipeline (enterprise A, B, C; Kinesis; Scorers ASG; Importers ASG; DB; Alerters ASG; SR)

Page 169: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

169

Innovation 4 : Anomaly-based Rollback (WIP)

If the Anomaly-detector & Reverter (ADR) is triggered and a model build or code push was recently done to Compute 1, the ADR will revert the last code or model push to ASG Compute 1

Diagram: Compute 1 (ASG), Kinesis, Compute 2 (ASG), SR, Anomaly-detector & Reverter

Page 171: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Open Source Plans

171

Follow us to be notified when the following is open-sourced

• Avro Schema Registry

• Agari (Kinesis+ASG) scaling tool (Airflow Job)

• Anomaly-detector & Reverter

To be notified, follow @AgariEng & @r39132

Page 173: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Acknowledgments

173

None of this work would be possible without the contributions of the strong team below

• Vidur Apparao • Stephen Cattaneo • Jon Chase • Andrew Flury • William Forrester • Chris Haag • Mike Jones

• Scot Kennedy • Thede Loder • Paul Lorence • Kevin Mandich • Gabriel Ortiz • Jacob Rideout • Josh Yang • Julian Mehnle

Page 175: Cloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo

Questions? (@r39132)

175
