Post on 26-Aug-2020

Kostas Kloudas, Vasia Kalavri, Jonas Traub

Introduction to Stream Processing with Apache Flink®

Who are we?

• Kostas: software engineer @ data Artisans
• Vasia: PhD student @ KTH Stockholm
• Jonas: research associate @ TU Berlin

Overview

• What is Stream Processing?
• What is Apache Flink?
• Windowed computations over streams
• Handling time
• Handling node failures
• Handling planned downtime
• Handling code upgrades

Demo instructions…

Robust Stream Processing with Apache Flink®: A Simple Walkthrough http://data-artisans.com/robust-stream-processing-flink-walkthrough/#more-1181

Make sure you download: Apache Flink 1.0.3

Stateless stream processing

Stateful stream processing
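
The difference between the two slide titles above can be sketched in plain Python (this is not Flink's API; the sensor readings, `to_fahrenheit`, and `RunningAverage` are hypothetical names chosen for illustration). A stateless operation looks only at the current element, while a stateful one keeps information across elements.

```python
from collections import defaultdict

def to_fahrenheit(reading):
    """Stateless: the output depends only on the current element."""
    sensor_id, celsius = reading
    return sensor_id, celsius * 9 / 5 + 32

class RunningAverage:
    """Stateful: the output depends on all elements seen so far for a key."""

    def __init__(self):
        self.count = defaultdict(int)    # elements seen per sensor
        self.total = defaultdict(float)  # sum of values per sensor

    def process(self, reading):
        sensor_id, value = reading
        self.count[sensor_id] += 1
        self.total[sensor_id] += value
        return sensor_id, self.total[sensor_id] / self.count[sensor_id]

# Hypothetical sensor readings: (sensor id, temperature in Celsius).
readings = [("s1", 20.0), ("s2", 35.0), ("s1", 30.0)]

stateless_out = [to_fahrenheit(r) for r in readings]

averager = RunningAverage()
stateful_out = [averager.process(r) for r in readings]
```

The state held in `RunningAverage` is what a failure would lose, which is why the later slides on checkpoints matter for stateful jobs.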

Why should you care?

Data production is and has always been a continuous process.

Stream processing enables the obvious: continuous processing on data that is continuously produced.

What is Apache Flink?

A data processing engine

Apache Flink is an open source platform for distributed stream and batch processing

The Apache Flink Ecosystem

What does Flink provide?

High Throughput and Low Latency
• Yahoo! Benchmark: https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
• Extended by Data Artisans: http://data-artisans.com/extending-the-yahoo-streaming-benchmark/

What does Flink provide?

• High Throughput and Low Latency
• Event-time (out-of-order) processing
• Exactly-once semantics
• Flexible windowing
• Fault-Tolerance

Time for demo…

Setup:

Sensor Data

Windowed computations

Handling time

The system has to respect the same clock as the data.

Event Time vs Processing Time

Processing Time    Event Time
1977               Episode IV
1980               Episode V
1983               Episode VI
1999               Episode I
2002               Episode II
2005               Episode III
2015               Episode VII
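
The Star Wars timeline on this slide can be sketched in plain Python (not Flink's API). The release year plays the role of processing time and the episode number plays the role of event time; the variable names are illustrative assumptions.

```python
# (processing time = release year, event time = episode number)
releases = [
    (1977, 4), (1980, 5), (1983, 6),
    (1999, 1), (2002, 2), (2005, 3),
    (2015, 7),
]

# Order in which the system sees the elements (by processing time).
by_processing_time = [episode for _, episode in sorted(releases)]

# Order in which the events happened in the story (by event time).
by_event_time = [episode for _, episode in sorted(releases, key=lambda r: r[1])]

# Elements that arrive after an element with a higher event time was seen.
out_of_order = []
max_seen = 0
for _, episode in sorted(releases):
    if episode < max_seen:
        out_of_order.append(episode)
    max_seen = max(max_seen, episode)
```

A system that only respects its own clock would present the episodes as IV, V, VI, I, II, III, VII; one that respects the clock of the data recovers I through VII.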

Handling time: Watermarks

• Special events generated by the sources.
• A watermark for time T states that event time has progressed to T in that particular stream (or partition).
• No events with a timestamp smaller than T can arrive any more.
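
The watermark definition above can be sketched in plain Python (not Flink's API). The sketch assumes a source whose watermark trails the highest timestamp seen so far by a fixed bound, and a tumbling event-time window that emits its sum once a watermark reaches the end of the window. The function names, the bound, and the sample events are all hypothetical.

```python
def with_watermarks(events, max_out_of_orderness):
    """Yield ("event", ts, value) items, interleaved with ("watermark", T)."""
    max_ts = None
    current = None
    for ts, value in events:
        yield ("event", ts, value)
        max_ts = ts if max_ts is None else max(max_ts, ts)
        watermark = max_ts - max_out_of_orderness
        # Watermarks only ever move forward.
        if current is None or watermark > current:
            current = watermark
            yield ("watermark", watermark)

def tumbling_window_sums(stream, size):
    """Sum values per window [start, start + size) in event time."""
    open_windows = {}
    results = []
    late = []
    watermark = float("-inf")
    for item in stream:
        if item[0] == "event":
            _, ts, value = item
            start = ts - ts % size
            if start + size <= watermark:
                # The window was already emitted: the element is too late.
                late.append((ts, value))
                continue
            open_windows[start] = open_windows.get(start, 0) + value
        else:
            watermark = item[1]
            # No element with timestamp < watermark can still arrive, so
            # every window ending at or before the watermark is complete.
            for start in sorted(s for s in open_windows if s + size <= watermark):
                results.append((start, open_windows.pop(start)))
    return results, late

# Hypothetical (timestamp, value) pairs, partly out of order.
events = [(1, 10), (4, 20), (3, 30), (11, 40), (2, 50), (15, 60), (5, 80), (25, 70)]
results, late = tumbling_window_sums(with_watermarks(events, 3), 10)
```

In this run the element with timestamp 2 arrives out of order but before the watermark passes 10, so it is still counted; the element with timestamp 5 arrives after its window was emitted and is reported as late. The window starting at 20 stays open because no later watermark closes it.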

Sources emit elements and watermarks…

…operators always emit the lowest watermark received across their inputs.
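
A minimal sketch of the "lowest watermark" rule for an operator with several inputs, in plain Python (not Flink's API; `MinWatermarkTracker` is a hypothetical name). The operator remembers the latest watermark per input and forwards the minimum whenever that minimum advances.

```python
class MinWatermarkTracker:
    """Tracks one watermark per input and forwards the lowest one."""

    def __init__(self, num_inputs):
        self.input_watermarks = [float("-inf")] * num_inputs
        self.emitted = float("-inf")

    def on_watermark(self, input_index, watermark):
        """Return the new output watermark if it advanced, else None."""
        # A watermark on a single input never moves backwards.
        self.input_watermarks[input_index] = max(
            self.input_watermarks[input_index], watermark
        )
        lowest = min(self.input_watermarks)
        if lowest > self.emitted:
            self.emitted = lowest
            return lowest
        return None

tracker = MinWatermarkTracker(num_inputs=2)
outputs = [
    tracker.on_watermark(0, 10),  # input 1 has not reported yet
    tracker.on_watermark(1, 5),   # lowest is now 5
    tracker.on_watermark(1, 20),  # lowest is now 10 (input 0)
    tracker.on_watermark(0, 12),  # lowest is now 12
]
```

Taking the minimum is the safe choice: the slowest input may still deliver elements with timestamps below the faster input's watermark.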

Handling node failures

Checkpoints

Sources emit elements and checkpoint barriers…
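
The idea can be sketched in plain Python with a single counting operator (not Flink's API, and much simpler than the distributed, asynchronous mechanism described in the papers under Further Reading). The sketch assumes a replayable source that injects a barrier after every few elements; when a barrier reaches the operator, the operator snapshots its state together with the source offset. All names and the sample data are hypothetical.

```python
def source_with_barriers(elements, start, every):
    """Replay elements from offset `start`, injecting periodic barriers."""
    for offset in range(start, len(elements)):
        yield ("element", elements[offset])
        next_offset = offset + 1
        if next_offset % every == 0:
            # The barrier carries the offset to resume from after a failure.
            yield ("barrier", next_offset)

def count_per_key(stream, state, fail_after=None):
    """Count elements per key; snapshot the state at every barrier.

    Returns (state, checkpoint, failed), where checkpoint is
    (source offset, copy of the state) or None.
    """
    checkpoint = None
    seen = 0
    for kind, payload in stream:
        if kind == "barrier":
            checkpoint = (payload, dict(state))
            continue
        if fail_after is not None and seen == fail_after:
            return state, checkpoint, True  # simulated node failure
        state[payload] = state.get(payload, 0) + 1
        seen += 1
    return state, checkpoint, False

elements = ["a", "b", "a", "c", "b", "a", "a"]

# Run without failures, for reference.
expected, _, _ = count_per_key(source_with_barriers(elements, 0, 3), {})

# Run that fails after five elements, then recovers from the last checkpoint.
_, checkpoint, failed = count_per_key(
    source_with_barriers(elements, 0, 3), {}, fail_after=5
)
offset, snapshot = checkpoint
recovered, _, _ = count_per_key(
    source_with_barriers(elements, offset, 3), dict(snapshot)
)
```

After recovery the counts match the failure-free run: the two elements processed after the checkpoint are replayed, but their earlier effect was discarded with the lost state, so each element is reflected in the state exactly once.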

Handling planned downtime

Handling code upgrades

Is Apache Flink only that?

Apache Flink is an open source platform for distributed stream and batch processing

Its lively community

You can join:
• Follow: @ApacheFlink, @dataArtisans
• Read: flink.apache.org/blog, data-artisans.com/blog
• Subscribe: (news | user | dev) @ flink.apache.org

[Figure: Apache Flink Community Growth, Feb. 15 to Aug. 16. Panels: Contributors; Stars on GitHub; Forks on GitHub]

Its Users

…https://flink.apache.org/poweredby.html

All of them will meet at... http://flink-forward.org/

Further Reading

Event-time processing:
• The Dataflow Model: http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
• http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/

Checkpointing and State:
• Distributed Snapshots: Determining Global States of Distributed Systems: http://research.microsoft.com/en-us/um/people/lamport/pubs/chandy.pdf
• Lightweight Asynchronous Snapshots for Distributed Dataflows: https://arxiv.org/abs/1506.08603
• Working with State in Flink: https://ci.apache.org/projects/flink/flink-docs-master/dev/state.html

Savepoints:
• https://ci.apache.org/projects/flink/flink-docs-master/setup/savepoints.html
