
Page 1

Stream Processing Beyond Streaming Data with Apache Flink

Stephan Ewen

Co-creator and PMC member of Apache Flink

Ververica (formerly dataArtisans, now part of Alibaba Group)

© 2019 Ververica

Page 2

About Ververica

Original creators of Apache Flink®

Enterprise Stream Processing

Page 3

Apache Flink at the "Singles Day" (11/11/2018)

• Machines: 10K
• Queries: 10K
• Throughput: 1.7B events / sec
• Latency: sub-second
• State size: 100 TB

Page 4

Some Apache Flink Users

Sources: Powered by Flink, Speakers – Flink Forward San Francisco 2019, Speakers – Flink Forward Europe 2019

Page 5

The Spectrum of Streaming Data Use Cases

[Figure: a spectrum running from "more lag time" to "more real time": Batch Processing, Continuous Processing, Event-driven Applications, Transactional Applications. Also labeled: Data Pipelines, Streaming Analytics.]

Page 6

Apache Flink: Analytics and Applications on Streaming Data

• Flink Runtime: Stateful Computations over Data Streams
• Stateful Stream Processing: Streams, State, Time
• Event-driven Applications: Stateful Functions
• Streaming Analytics: SQL and Tables

Page 7

Apache Flink: Analytics and Applications on Streaming Data

Page 8

Everything is a Stream

Streams of Records in a Log or MQ

[e.g., Apache Kafka or AWS Kinesis …]

Page 9

Everything is a Stream

Stream of Requests/Responses to/from Services

[Figure: a Service backed by a DB, with a stream of requests (GET /a/b, POST /b/c, PUT /e/f) and a stream of responses (200, 404, 200, 200, 403).]

→ event sourcing architecture

Page 10

Everything is a Stream

Stream of Rows in a Table or in Files

[Figure: rows ordered by timestamp: 2016-3-1 12:00 am, 2016-3-1 1:00 am, 2016-3-1 2:00 am, 2016-3-11 10:00 pm, 2016-3-11 11:00 pm, 2016-3-12 12:00 am, 2016-3-12 1:00 am, 2016-3-12 2:00 am, 2016-3-12 3:00 am, …]

Page 11

Everything is a Stream

Stream of Rows in a Table or in Files

[Figure: the same timestamped rows as on the previous slide, with a group of rows labeled "a batch".]

Page 12

Bounded and Unbounded Streams

Page 13

Programs are DAGs of Stateful Computation Steps

[Figure: a dataflow graph in which Sources feed Computation (Transformation) steps, some of which hold State, ending in Sinks.]

Page 14

DataStream API

// Source
val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer011(…))

// Transformation
val events: DataStream[Event] = lines.map((line) => parse(line))

// Windowed Transformation (keyed on the parsed events, not on an undefined `stream`)
val stats: DataStream[Statistic] = events
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .aggregate(new MyAggregationFunction())

// Sink
stats.addSink(new RollingSink(path))

Streaming Dataflow: Source → Transform → Window (state read/write) → Sink
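The keyBy / window / aggregate step on this slide can be illustrated without a cluster. The following is only a plain-Scala sketch of the idea, not Flink code: the Event and Statistic fields, the millisecond timestamps, and the choice of a sum as the aggregate are all assumptions made for illustration.

```scala
// Plain-Scala sketch of "keyBy + tumbling window + aggregate"; no Flink dependency.
object WindowSketch {
  // Hypothetical event type: a sensor reading with an event-time timestamp.
  final case class Event(sensor: String, timestampMs: Long, value: Double)
  // Hypothetical result type: one aggregate per sensor and window.
  final case class Statistic(sensor: String, windowStartMs: Long, sum: Double)

  // Assigns each event to the tumbling window containing its (non-negative) timestamp,
  // groups by (sensor, window start), and sums the values of each group.
  def tumblingSum(events: Seq[Event], windowSizeMs: Long): Seq[Statistic] =
    events
      .groupBy(e => (e.sensor, e.timestampMs - e.timestampMs % windowSizeMs))
      .map { case ((sensor, start), group) => Statistic(sensor, start, group.map(_.value).sum) }
      .toSeq
      .sortBy(s => (s.windowStartMs, s.sensor))

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event("a", 1000L, 1.0),
      Event("a", 4000L, 2.0),
      Event("b", 4500L, 5.0),
      Event("a", 6000L, 4.0)
    )
    tumblingSum(events, 5000L).foreach(println)
  }
}
```

Unlike this sketch, which sees the whole input at once, a streaming runtime keeps the per-key, per-window aggregate as state and updates it as events arrive.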

Page 15

Event Sourcing + Memory Image

• An event stream persists events / commands (temporarily)
• A process consumes each event / command and updates local variables/structures in main memory
• The memory is periodically snapshotted
• Also works with RocksDB (LSM trees)

Page 16

Event Sourcing + Memory Image

Recovery: Restore snapshot and replay events since snapshot

[Figure: the process re-reads the event stream, which persists events (temporarily).]
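The recovery rule on this slide (restore the snapshot, then replay the events since the snapshot) can be sketched in a few lines of plain Scala. This is an illustrative model only: the Event and Snapshot types, the counter-style state, and the use of a log index as the replay position are assumptions, not part of the talk.

```scala
// Sketch of event sourcing with a memory image: state lives in memory,
// a snapshot records the state plus the log position it reflects.
object MemoryImageSketch {
  // Hypothetical command: add `delta` to the counter named `key`.
  final case class Event(key: String, delta: Long)
  final case class Snapshot(state: Map[String, Long], offset: Int)

  // Applies one event to the in-memory state.
  def applyEvent(state: Map[String, Long], e: Event): Map[String, Long] =
    state.updated(e.key, state.getOrElse(e.key, 0L) + e.delta)

  // Processes log entries from `from.offset` up to (not including) `until`
  // and returns the resulting state together with the new position.
  def run(log: Vector[Event], from: Snapshot, until: Int): Snapshot =
    Snapshot(log.slice(from.offset, until).foldLeft(from.state)(applyEvent), until)

  def main(args: Array[String]): Unit = {
    val log = Vector(Event("x", 1L), Event("y", 5L), Event("x", 2L), Event("y", -1L))
    val empty = Snapshot(Map.empty, 0)
    val snapshot = run(log, empty, 2)             // periodic snapshot after two events
    val recovered = run(log, snapshot, log.size)  // after a failure: restore, then replay the rest
    println(recovered.state)
    println(recovered == run(log, empty, log.size))
  }
}
```

Because the snapshot carries the log position, replay starts exactly where the snapshot left off, so no event is applied twice.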

Page 17

Consistent Distributed Snapshots

Page 18

Checkpoints for Recovery / Rollback / Evolution / Cloning / …

Rolling back computation, re-processing:
• Re-load state
• Reset positions in input streams

Page 19

Versioning the state of applications

[Figure: savepoints taken along a Time axis, connecting application versions App. A, App. B, and App. C.]

Page 20

Apache Flink: Analytics and Applications on Streaming Data

Page 21

SQL / Table API – Batch Queries

SELECT room, TUMBLE_END(rowtime, INTERVAL '1' HOUR), AVG(temperature)
FROM sensors
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), room

SQL Query → Batch Query Execution

• Full TPC-H support in Flink 1.9 with Blink query engine
• Full TPC-DS support targeted for Flink 1.10

Page 22

Interpreting Streams as Tables

Page 23

SQL / Table API – Batch Queries

SELECT room, TUMBLE_END(rowtime, INTERVAL '1' HOUR), AVG(temperature)
FROM sensors
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), room

SQL Query → Batch Query Execution

Page 24

SQL / Table API – Streaming Data Case

SELECT room, TUMBLE_END(rowtime, INTERVAL '1' HOUR), AVG(temperature)
FROM sensors
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), room

SQL Query → Interpret Stream as Table → Incremental Query Execution → output result changes as stream → update database with changes
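To make "incremental query execution" and "output result changes as stream" concrete, here is a small plain-Scala model of the hourly average query. It is a sketch under assumptions, not how Flink executes SQL: the Reading and Upsert types are hypothetical, timestamps are non-negative milliseconds, and a change is emitted for every incoming row rather than once per closed window.

```scala
// Sketch: maintain AVG(temperature) per (room, hourly window) incrementally
// and emit each result change as an upsert that a database could apply.
object IncrementalAvgSketch {
  // Hypothetical row of the `sensors` stream.
  final case class Reading(room: String, rowtimeMs: Long, temperature: Double)
  // One change to the result table, keyed by (room, window end).
  final case class Upsert(room: String, windowEndMs: Long, avg: Double)

  val HourMs: Long = 60L * 60L * 1000L

  // Folds over the readings in arrival order, keeping (sum, count) per key
  // and emitting the updated average for the key of every incoming reading.
  def changes(readings: Seq[Reading]): Seq[Upsert] = {
    val init = (Map.empty[(String, Long), (Double, Long)], Vector.empty[Upsert])
    val (_, out) = readings.foldLeft(init) { case ((acc, emitted), r) =>
      val key = (r.room, r.rowtimeMs - r.rowtimeMs % HourMs + HourMs)
      val (sum, count) = acc.getOrElse(key, (0.0, 0L))
      val next = (sum + r.temperature, count + 1L)
      (acc.updated(key, next), emitted :+ Upsert(key._1, key._2, next._1 / next._2))
    }
    out
  }

  // Applying the changes in order to a key-value store yields the current result table.
  def materialize(ups: Seq[Upsert]): Map[(String, Long), Double] =
    ups.foldLeft(Map.empty[(String, Long), Double])((db, u) => db.updated((u.room, u.windowEndMs), u.avg))

  def main(args: Array[String]): Unit = {
    val readings = Seq(Reading("kitchen", 0L, 20.0), Reading("kitchen", 1000L, 22.0), Reading("hall", 2000L, 18.0))
    changes(readings).foreach(println)
    println(materialize(changes(readings)))
  }
}
```

The same query text describes both cases; only the execution differs, with the streaming case keeping a small amount of state per key instead of rescanning the input.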

Page 25

Many handy SQL features: Temporal Joins, Pattern Matching, …

SELECT tf.time, tf.price * rh.rate AS conv_fare
FROM taxiFare AS tf,
  LATERAL TABLE (Rates(tf.time)) AS rh
WHERE tf.currency = rh.currency;
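The temporal join above pairs each fare with the exchange rate that was valid at the fare's own time. A plain-Scala sketch of that lookup follows; the Fare and Rate types and the validFrom field are assumptions chosen for illustration, and ratesAsOf merely plays the role that Rates(tf.time) plays in the query.

```scala
// Sketch of a temporal join: each fare is converted with the rate valid at its timestamp.
object TemporalJoinSketch {
  final case class Fare(time: Long, price: Double, currency: String)
  // Hypothetical versioned rate: valid from `validFrom` until a newer row for the same currency.
  final case class Rate(validFrom: Long, currency: String, rate: Double)

  // The latest rate per currency as of time `t`.
  def ratesAsOf(history: Seq[Rate], t: Long): Map[String, Double] =
    history
      .filter(_.validFrom <= t)
      .groupBy(_.currency)
      .map { case (currency, versions) => currency -> versions.maxBy(_.validFrom).rate }

  // Joins each fare with the matching rate; fares without a known rate are dropped.
  def convert(fares: Seq[Fare], history: Seq[Rate]): Seq[(Long, Double)] =
    for {
      tf   <- fares
      rate <- ratesAsOf(history, tf.time).get(tf.currency).toSeq
    } yield (tf.time, tf.price * rate)

  def main(args: Array[String]): Unit = {
    val history = Seq(Rate(0L, "EUR", 1.5), Rate(10L, "EUR", 1.25))
    val fares = Seq(Fare(5L, 10.0, "EUR"), Fare(12L, 10.0, "EUR"), Fare(3L, 7.0, "JPY"))
    convert(fares, history).foreach(println)
  }
}
```

A regular join would use whatever rate is current when the query runs; the temporal version stays correct when old fares are re-processed.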

Page 26

Apache Flink: Analytics and Applications on Streaming Data

Page 27

Disclaimer

Stateful Functions is currently a standalone project

https://statefun.io/

https://github.com/ververica/stateful-functions

We are contributing it to Apache Flink during these weeks

The project is still new and dynamic.

A good time to get involved to get traction ;-)

Page 28

Stream Processing and F-a-a-S (λ)

• simplicity / generality
• state management
• composability
• lightweight resources
• performance
• event-driven

Can we combine some of these properties?

Page 29

Stateful Functions

[Figure: an event ingress feeds a set of functions f(a,b) that exchange messages and write to an event egress; function state is snapshotted to mass storage (S3, GCF, ECS, HDFS, …).]

Page 30

Stateful Functions

Event ingresses supply events that trigger functions

Page 31

Stateful Functions

Multiple functions send events to each other

Arbitrary addressing, no restriction to a DAG
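Functions that message each other by address, without a DAG restriction, can be modeled with a simple message loop. The sketch below is plain Scala and does not use the Stateful Functions API: the Address and Message types, the integer state, the counter and echo functions, and the convention that an unregistered function type acts as the egress are all assumptions for illustration.

```scala
import scala.collection.mutable

// Sketch: addressed functions with per-address state, delivered by a message loop.
object StatefulFunctionsSketch {
  // An address names one function instance: a function type plus an id.
  final case class Address(functionType: String, id: String)
  final case class Message(to: Address, payload: String)

  // A function sees a message and its own state, and returns
  // the new state plus any messages it wants to send.
  type Fn = (Message, Int) => (Int, List[Message])

  // Delivers messages one at a time until none are in flight.
  // Messages to a function type that is not registered are collected as egress output.
  def run(functions: Map[String, Fn], ingress: List[Message]): (Map[Address, Int], List[Message]) = {
    val queue  = mutable.Queue.empty[Message]
    val state  = mutable.Map.empty[Address, Int]
    val egress = mutable.ListBuffer.empty[Message]
    queue ++= ingress
    while (queue.nonEmpty) {
      val msg = queue.dequeue()
      functions.get(msg.to.functionType) match {
        case Some(fn) =>
          val (next, outgoing) = fn(msg, state.getOrElse(msg.to, 0))
          state(msg.to) = next
          queue ++= outgoing // any function may address any other, including its sender
        case None => egress += msg
      }
    }
    (state.toMap, egress.toList)
  }

  // Two hypothetical functions that form a cycle: counter -> echo -> counter -> echo -> egress.
  val counter: Fn = (m, seen) => (seen + 1, List(Message(Address("echo", m.to.id), m.payload)))
  val echo: Fn = (m, seen) =>
    if (seen + 1 < 2) (seen + 1, List(Message(Address("counter", m.to.id), m.payload)))
    else (seen + 1, List(Message(Address("egress", m.to.id), m.payload)))

  def main(args: Array[String]): Unit = {
    val fns = Map("counter" -> counter, "echo" -> echo)
    println(run(fns, List(Message(Address("counter", "u1"), "hi"))))
  }
}
```

The cycle between counter and echo is exactly what a DAG-shaped dataflow cannot express directly.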

Page 32

Stateful Functions

Functions have locally embedded state

Page 33

Stateful Functions

State and messaging are consistent with exactly-once semantics

Page 34

Stateful Functions

No database required

All persistence goes directly to blob storage

Page 35

Stateful Functions

Event egresses to respond via event streams

Page 36

Logical/Virtual Instances

[Figure: function virtual instances (labeled A, B, C, D, E, F, G, H, I, K, L, M, N) spread over Shard 1 and Shard 2; some instances are held in memory, the others in secondary storage.]

Page 37

Logical/Virtual Instances

• message to "K"
• load "K" (possibly evict other)
• K.invoke(message)
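The load / evict / invoke sequence above can be sketched as a small cache in front of secondary storage. This plain-Scala model is an assumption-laden illustration, not the project's implementation: the Shard class, the integer state, the "add the message to the state" function body, and least-recently-used eviction are all choices made here for the example.

```scala
import scala.collection.mutable

// Sketch: a shard keeps a bounded number of virtual instances in memory
// and moves the least recently used one to secondary storage when full.
object VirtualInstanceSketch {
  // `capacity` is assumed to be at least 1.
  final class Shard(capacity: Int) {
    private val memory = mutable.LinkedHashMap.empty[String, Int] // oldest entry first
    val secondary: mutable.Map[String, Int] = mutable.Map.empty

    def inMemory: List[String] = memory.keys.toList

    // Delivers `message` to instance `id`: load it, possibly evict another, then invoke.
    def invoke(id: String, message: Int): Int = {
      val loaded = memory.remove(id).orElse(secondary.remove(id)).getOrElse(0)
      if (memory.size >= capacity) {
        val (victim, victimState) = memory.head // least recently used instance
        memory.remove(victim)
        secondary(victim) = victimState
      }
      val updated = loaded + message // stand-in for the function body
      memory(id) = updated           // re-inserting marks the instance as most recently used
      updated
    }
  }

  def main(args: Array[String]): Unit = {
    val shard = new Shard(2)
    shard.invoke("A", 1)
    shard.invoke("B", 1)
    shard.invoke("K", 5) // memory is full, so "A" is evicted before "K" is loaded
    println(shard.inMemory)
    println(shard.secondary)
  }
}
```

Callers only ever name the instance; whether it is currently in memory is invisible to them, which is what makes the instances "virtual".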

Page 38

Apache Flink is the State and Event Streaming Fabric

Conceptual Dataflow: Ingress/Router → Functions → Egress

Apache Flink Dataflow Graph: Ingress & Router → (keyBy) → Function Dispatcher → (side output) → Egress, with a Feedback Operator that closes the (loop) back into the Function Dispatcher via (keyBy)

Page 39

Running Stateful Functions on Apache Flink

Exactly-once checkpointing for streaming loops

[Figure: the Function Dispatcher and the Feedback Operator connected by a loop feedback edge.]

Page 40

Example: Ride Sharing App

[Figure: inputs are driver status updates, passenger ride requests, and ride status updates. Functions: Driver, Ride, Passenger, Geo-index. Message labels: update, create, bill, inform / book, bid, lookup, update cell. State labels: seeking, confirmed, riding, free, bidding, booked.]
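Each function in this example behaves like a small state machine driven by messages. As a hedged illustration, the sketch below models a driver that moves between the states free, bidding, and booked. The slide only shows the state labels; which function owns which state, the event names, and the transitions are assumptions made here.

```scala
// Sketch of a per-driver state machine; the transition table is assumed, not taken from the slide.
object DriverStateSketch {
  sealed trait DriverState
  case object Free extends DriverState
  case object Bidding extends DriverState
  case object Booked extends DriverState

  // Hypothetical messages a driver function might receive.
  sealed trait DriverEvent
  case object BidRequested extends DriverEvent
  case object BidAccepted extends DriverEvent
  case object BidRejected extends DriverEvent
  case object RideEnded extends DriverEvent

  // Events that do not apply in the current state leave it unchanged.
  def next(state: DriverState, event: DriverEvent): DriverState = (state, event) match {
    case (Free, BidRequested)   => Bidding
    case (Bidding, BidAccepted) => Booked
    case (Bidding, BidRejected) => Free
    case (Booked, RideEnded)    => Free
    case _                      => state
  }

  // Replays a sequence of events starting from a free driver.
  def replay(events: Seq[DriverEvent]): DriverState =
    events.foldLeft(Free: DriverState)(next)

  def main(args: Array[String]): Unit =
    println(replay(Seq(BidRequested, BidAccepted)))
}
```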

Page 41

• Stream Processing / Streaming SQL (event/stream-centric): data preparation; combining knowledge/information; filtering, enriching, aggregating, joining events
• Stateful Functions (state-centric): coordination; (interacting) state machines; complex event/state interactions
• F-a-a-S (stateless / compute-centric): "occasional" actions or spiky loads; compute-intensive or blocking

Page 42

Putting it all together: Ridesharing again

• Inputs: Passenger updates, Driver position updates, Driver status updates
• Stream Processing: traffic models; demand forecast & pricing
• Stateful Functions: ride life-cycle; driver-to-ride matching
• FaaS: render map/route image; create a receipt PDF; send email
• Billing

Page 43

Thank you!

If you liked this, engage with the

Apache Flink® community

• Try Flink and help us improve it

• Contribute docs, code, tutorials

• Share your use cases and ideas

• Join a Flink Meetup

• Join the Flink Forward conference

@StephanEwen

@ApacheFlink https://flink.apache.org/