a data streaming architecture with apache flink

31
A Data Streaming Architecture with Apache Flink Robert Metzger @rmetzger_ [email protected] Berlin Buzzwords, June 7, 2016

Upload: duongkhanh

Post on 13-Feb-2017

242 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Data Streaming Architecture with Apache Flink

A Data Streaming Architecture with Apache Flink

Robert Metzger@[email protected]

Berlin Buzzwords,June 7, 2016

Page 2: A Data Streaming Architecture with Apache Flink

Talk overview

My take on the stream processing space, and how it changes the way we think about data

Transforming an existing data analysis pattern into the streaming world (“Streaming ETL”)

Demo

2

Page 3: A Data Streaming Architecture with Apache Flink

Apache Flink

Apache Flink is an open source stream processing framework• Low latency• High throughput• Stateful• Distributed

Developed at the Apache Software Foundation, 1.0.0 released in March 2016,used in production

3

Page 4: A Data Streaming Architecture with Apache Flink

Entering the streaming era

4

Page 5: A Data Streaming Architecture with Apache Flink

5

Streaming is the biggest change in data infrastructure

since Hadoop

Page 6: A Data Streaming Architecture with Apache Flink

6

1. Radically simplified infrastructure2. Do more with your data, faster3. Can completely subsume batch

Page 7: A Data Streaming Architecture with Apache Flink

7

Real-world data is produced in a continuous fashion.

New systems like Flink and Kafka embrace streaming

nature of data.

Web serverWeb server Kafka topic Stream

processorStream

processor

Page 8: A Data Streaming Architecture with Apache Flink

Apache Flink stack

8

Gelly

Gelly

Table

/ S

QL

Table

/ S

QL

ML

ML

SA

MO

AS

AM

OA

DataSet (Java/Scala)DataSet (Java/Scala)DataStream (Java / Scala)

DataStream (Java / Scala)

Hadoop M

/RH

adoop M

/RLocalLocalClusterClusterYARNYARN

Apach

e B

eam

Apach

e B

eam

Apach

e B

eam

Apach

e B

eam

Table

/

Str

eam

SQ

LTa

ble

/

Str

eam

SQ

L

Casc

adin

gC

asc

adin

g

Streaming dataflow runtimeSto

rm A

PI

Sto

rm A

PI

Zeppelin

Zeppelin

CEP

CEP

Page 9: A Data Streaming Architecture with Apache Flink

What makes Flink flink?

9

Low latency

High Throughput

Well-behavedflow control

(back pressure)

Make more sense of data

Works on real-timeand historic data

TrueStreaming

Event Time

APIsLibraries

StatefulStreaming

Globally consistentsavepoints

Exactly-once semanticsfor fault tolerance

Windows &user-defined state

Flexible windows(time, count, session, roll-your own)

Complex Event Processing

Page 10: A Data Streaming Architecture with Apache Flink

Moving existing (batch) data analysis into streaming

10

Page 11: A Data Streaming Architecture with Apache Flink

Extract, Transform, Load (ETL)

ETL: Move data from A to B and transform it on the way

Old approach:

Server Logs

Server LogsServer

LogsServer Logs

Server Logs

Server Logs

Mobile

IoT

Page 12: A Data Streaming Architecture with Apache Flink

Extract, Transform, Load (ETL)

ETL: Move data from A to B and transform it on the way

Old approach:

Server Logs

Server Logs

HDFS / S3

HDFS / S3

“Data Lake”

Server Logs

Server Logs

Server Logs

Server Logs

Mobile

IoT

Tier 0: Raw data

Tier 0: Raw data

Page 13: A Data Streaming Architecture with Apache Flink

Extract, Transform, Load (ETL)

ETL: Move data from A to B and transform it on the way

Old approach:

Server Logs

Server Logs

HDFS / S3

HDFS / S3

“Data Lake”

Server Logs

Server Logs

Server Logs

Server Logs

Mobile

IoT

Tier 0: Raw data

Tier 0: Raw data

Tier 1: Normalized, cleansed data

Tier 1: Normalized, cleansed data

Periodic jobsPeriodic jobs Parquet /

ORC in HDFS

Parquet /ORC in HDFS

User

Page 14: A Data Streaming Architecture with Apache Flink

Extract, Transform, Load (ETL)

ETL: Move data from A to B and transform it on the way

Old approach:

Server Logs

Server Logs

HDFS / S3

HDFS / S3

“Data Lake”

Server Logs

Server Logs

Server Logs

Server Logs

Mobile

IoT

Tier 0: Raw data

Tier 0: Raw data

Tier 1: Normalized, cleansed data

Tier 1: Normalized, cleansed data

Periodic jobsPeriodic jobs Parquet /

ORC in HDFS

Parquet /ORC in HDFS

Tier 2: Aggregated data

Tier 2: Aggregated data

Periodic jobsPeriodic jobs

User

User

“Data Warehouse”

Page 15: A Data Streaming Architecture with Apache Flink

Extract, Transform, Load (Streaming ETL) ETL: Move data from A to B and transform it on the

way Streaming approach:

Server Logs

Server Logs

“Data Lake”

Server Logs

Server Logs

Server Logs

Server Logs

Mobile

IoT

Tier 0: Raw data

Tier 0: Raw data

Page 16: A Data Streaming Architecture with Apache Flink

Stream Processor

Extract, Transform, Load (Streaming ETL) ETL: Move data from A to B and transform it on the

way Streaming approach:

Server Logs

Server Logs

“Data Lake”

Server Logs

Server Logs

Server Logs

Server Logs

Mobile

IoT

Kafka Connect

or

Kafka Connect

or

Tier 0: Raw data

Tier 0: Raw data

Cleansing

Cleansing

Transformation

Transformation

Time-WindowTime-

Window

AlertsAlerts

Time-WindowTime-

Window

Page 17: A Data Streaming Architecture with Apache Flink

Stream Processor

Extract, Transform, Load (Streaming ETL) ETL: Move data from A to B and transform it on the

way Streaming approach:

Server Logs

Server Logs

“Data Lake”

Server Logs

Server Logs

Server Logs

Server Logs

Mobile

IoT

Tier 1: Normalized, cleansed data

Tier 1: Normalized, cleansed data

Parquet /ORC in HDFS

Parquet /ORC in HDFS

Kafka Connect

or

Kafka Connect

or

ES Connect

or

ES Connect

or

Rolling file sinkRolling file sink

Tier 0: Raw data

Tier 0: Raw data

Cleansing

Cleansing

Transformation

Transformation

Time-WindowTime-

Window

AlertsAlerts

Time-WindowTime-

Window

User

Batch Processing

Page 18: A Data Streaming Architecture with Apache Flink

Stream Processor

Extract, Transform, Load (Streaming ETL) ETL: Move data from A to B and transform it on the

way Streaming approach:

Server Logs

Server Logs

“Data Lake”

Server Logs

Server Logs

Server Logs

Server Logs

Mobile

IoT

Tier 1: Normalized, cleansed data

Tier 1: Normalized, cleansed data

Parquet /ORC in HDFS

Parquet /ORC in HDFS

Tier 2: Aggregated dataTier 2: Aggregated data

User

Kafka Connect

or

Kafka Connect

or

ES Connect

or

ES Connect

or

Rolling file sinkRolling file sink

JDBC sinkJDBC sink

Cassandrasink

Cassandrasink

Tier 0: Raw data

Tier 0: Raw data

Cleansing

Cleansing

Transformation

Transformation

Time-WindowTime-

Window

AlertsAlerts

Time-WindowTime-

Window

User

Batch Processing

Page 19: A Data Streaming Architecture with Apache Flink

Streaming ETL: Low Latency

19

Less than 500 ms*

Less than 250 ms** Your mileage may vary. These are rule of thumb estimates.

Events are processed immediately No need to wait until the next “load” batch job is running

hours minutes milliseconds

Periodic batch job

Batch processor with micro-batchesLatenc

y

Approach

seconds

Stream processor

Page 20: A Data Streaming Architecture with Apache Flink

Streaming ETL: Event-time aware

20

Events derived from the same real-world activity might arrive out of order in the system

Flink is event-time aware

11:28

11:29

11:28

11:29

11:28

11:29

Same real-world activity Out of sync

clocksOut of sync clocks

Network delaysNetwork delays Machine failuresMachine failures

Page 21: A Data Streaming Architecture with Apache Flink

Demo

21

Page 22: A Data Streaming Architecture with Apache Flink

Job Overview

22

Flink Twitter Source

Flink Twitter Source

Data Ingestion Job

“Streaming ETL” Job

Page 23: A Data Streaming Architecture with Apache Flink

Job Overview

23

(Rolling) file sink(Rolling) file sinkFilter operationFilter operationFilter operationFilter operation

Aggregation to ElasticSearchAggregation to ElasticSearch

Streaming WordCountStreaming WordCount

TopN operatorTopN operator

Page 25: A Data Streaming Architecture with Apache Flink

Closing

25

Page 26: A Data Streaming Architecture with Apache Flink

26

https://www.eventbrite.com/e/apache-flink-hackathon-by-berlin-buzzwords-tickets-25580481910

Page 27: A Data Streaming Architecture with Apache Flink

Flink Forward 2016, Berlin

Submission deadline: June 30, 2016Early bird deadline: July 15, 2016

www.flink-forward.org

Page 28: A Data Streaming Architecture with Apache Flink

We are hiring!data-artisans.com/careers

Page 29: A Data Streaming Architecture with Apache Flink

Questions?

Ask now! eMail: [email protected] Twitter: @rmetzger_

Follow: @ApacheFlink Read: flink.apache.org/blog, data-artisans.com/blog/ Mailinglists: (news | user | dev)@flink.apache.org

29

Page 30: A Data Streaming Architecture with Apache Flink

Appendix

30