Post on 20-May-2020

© 2019 Ververica

Stephan Ewen

Co-creator and PMC of Apache Flink

Ververica (formerly dataArtisans, now part of Alibaba Group)

Stream Processing Beyond Streaming Data with Apache Flink

About Ververica

Original creators of Apache Flink®

Enterprise Stream Processing

Apache Flink at

The "Singles Day" (11/11/2018)

Throughput: 1.7B events / sec
Machines: 10K
Queries: 10K
Latency: sub-second
State size: 100TB

Some Apache Flink Users

Sources: Powered by Flink, Speakers – Flink Forward San Francisco 2019, Speakers – Flink Forward Europe 2019

The Spectrum of Streaming Data Use Cases

more lag time
Batch Processing
Continuous Processing
Event-driven Applications
Transactional Applications
more real time

Data Pipelines
Streaming Analytics

Apache Flink: Analytics and Applications on Streaming Data

Stateful Stream Processing: Streams, State, Time
Event-driven Applications: Stateful Functions
Streaming Analytics: SQL and Tables
Flink Runtime: Stateful Computations over Data Streams

Everything is a Stream

Streams of Records in a Log or MQ

[e.g., Apache Kafka or AWS Kinesis …]

Everything is a Stream

Stream of Requests/Responses to/from Services

Service

DB

→ event sourcing architecture

GET /a/b POST /b/c PUT /e/f 200 404 200 200 403

Everything is a Stream

Stream of Rows in a Table or in Files

2016-3-1 12:00 am
2016-3-1 1:00 am
2016-3-1 2:00 am
2016-3-11 10:00 pm
2016-3-11 11:00 pm
2016-3-12 12:00 am
2016-3-12 1:00 am
2016-3-12 2:00 am
2016-3-12 3:00 am …

A bounded section of this stream is a batch

Bounded and Unbounded Streams
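
As a rough illustration of the idea, the same per-record logic can run over an input that ends (a batch) and over one that never ends. This is a plain Scala sketch, not Flink code; `runningSum`, `bounded`, and `unbounded` are hypothetical names chosen for the example.

```scala
object BoundedUnboundedSketch {
  // Running sum: emits one output per input record, so it never has to
  // wait for the end of the input.
  def runningSum(in: Iterator[Int]): Iterator[Int] = {
    var total = 0
    in.map { x => total += x; total }
  }

  // A bounded stream (a batch): it ends by itself.
  def bounded: Iterator[Int] = Iterator(1, 2, 3)

  // An unbounded stream: 1, 2, 3, ... without end.
  def unbounded: Iterator[Int] = Iterator.from(1)
}
```

On the bounded input the result can be collected completely; on the unbounded input a consumer can only ever take a prefix, for example `runningSum(unbounded).take(4)`.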

Programs are DAGs of Stateful Computation Steps

[Diagram: Sources feed Computations (Transformations) that hold State, which feed Sinks]

DataStream API

// Source
val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer011(…))

// Transformation
val events: DataStream[Event] = lines.map((line) => parse(line))

// Windowed Transformation (keyed by sensor, 5-second windows)
val stats: DataStream[Statistic] = events.keyBy("sensor").timeWindow(Time.seconds(5)).aggregate(new MyAggregationFunction())

// Sink
stats.addSink(new RollingSink(path))

Streaming Dataflow: Source → Transform → Window (state read/write) → Sink

Event Sourcing + Memory Image

Event stream: persists events (temporarily)

Process: for each event / command, update local variables/structures in main memory

Periodically snapshot the memory

Also works with RocksDB (LSM trees)

Event Sourcing + Memory Image

Recovery: restore the snapshot and replay the events since the snapshot
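
The snapshot-and-replay idea can be sketched in a few lines of plain Scala. This is only an illustration under simple assumptions (a single process, an in-memory log, offsets as positions in the event stream); `Event`, `Snapshot`, and `process` are hypothetical names, not Flink APIs.

```scala
object MemoryImageSketch {
  // An event adds `amount` to the value stored under `key`.
  final case class Event(offset: Long, key: String, amount: Long)

  // A snapshot is the in-memory state plus the next log offset to read.
  final case class Snapshot(state: Map[String, Long], nextOffset: Long)

  val empty: Snapshot = Snapshot(Map.empty, 0L)

  // Apply one event to the in-memory state.
  def applyEvent(state: Map[String, Long], e: Event): Map[String, Long] =
    state.updated(e.key, state.getOrElse(e.key, 0L) + e.amount)

  // Start from a snapshot and replay only the events it does not cover yet.
  def process(from: Snapshot, log: Seq[Event]): Snapshot =
    log.filter(_.offset >= from.nextOffset).foldLeft(from) { (snap, e) =>
      Snapshot(applyEvent(snap.state, e), e.offset + 1)
    }
}
```

Recovering from a snapshot taken after the first two events and replaying the rest of the log gives the same state as processing the whole log without interruption, which is the property the slide relies on.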

Consistent Distributed Snapshots

Checkpoints for Recovery / Rollback / Evolution / Cloning / …

Rolling back computation: re-load state, reset positions in input streams

Re-processing

Versioning the state of applications

[Diagram: timeline (Time) with Savepoints connecting application versions App. A, App. B, App. C]

SQL / Table API – Batch Queries

SQL Query → Batch Query Execution

SELECT room, TUMBLE_END(rowtime, INTERVAL '1' HOUR), AVG(temperature)
FROM sensors
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), room

Full TPC-H support in Flink 1.9 with Blink query engine

Full TPC-DS support targeted for Flink 1.10

Interpreting Streams as Tables

SQL / Table API – Streaming Data Case

SELECT room, TUMBLE_END(rowtime, INTERVAL '1' HOUR), AVG(temperature)
FROM sensors
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), room

SQL Query → Interpret Stream as Table → Incremental Query Execution → output result changes as stream → update database with changes
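
To make "incremental query execution" more concrete, here is a minimal plain Scala sketch of the hourly average above, computed one row at a time. It is an assumption-laden illustration, not how Flink executes the query: it emits one change per input row, whereas an engine may emit a window result only once the window is complete. `Reading`, `Change`, `tumbleEnd`, and `run` are hypothetical names.

```scala
object IncrementalAvgSketch {
  final case class Reading(room: String, rowtimeMillis: Long, temperature: Double)

  // One result change: the new average for the group (room, window end).
  final case class Change(room: String, windowEndMillis: Long, avg: Double)

  val HourMillis: Long = 60L * 60L * 1000L

  // End of the one-hour tumbling window that contains `t` (for t >= 0).
  def tumbleEnd(t: Long): Long = (t / HourMillis + 1) * HourMillis

  // Keep (sum, count) per group and emit the updated average for every row.
  def run(rows: Iterator[Reading]): Iterator[Change] = {
    var acc = Map.empty[(String, Long), (Double, Long)]
    rows.map { r =>
      val key = (r.room, tumbleEnd(r.rowtimeMillis))
      val (sum, count) = acc.getOrElse(key, (0.0, 0L))
      acc = acc.updated(key, (sum + r.temperature, count + 1))
      Change(r.room, key._2, (sum + r.temperature) / (count + 1))
    }
  }
}
```

A downstream database would apply each `Change` as an upsert keyed by `(room, windowEndMillis)`, so the table always holds the latest result per group.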

Many handy SQL features: Temporal Joins, Pattern Matching, …

SELECT tf.time, tf.price * rh.rate AS conv_fare
FROM taxiFare AS tf,
LATERAL TABLE (Rates(tf.time)) AS rh
WHERE tf.currency = rh.currency;
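
The temporal join can be read as "convert each fare with the rate that was valid at the fare's own time". The following plain Scala sketch shows that semantics under simple assumptions (rate history held in memory, times as plain numbers); `RateUpdate`, `TaxiFare`, `ratesAsOf`, and `convert` are hypothetical names, not Flink APIs.

```scala
object TemporalJoinSketch {
  final case class RateUpdate(time: Long, currency: String, rate: Double)
  final case class TaxiFare(time: Long, currency: String, price: Double)

  // Rates(t): for each currency, the latest rate whose update time is <= t.
  def ratesAsOf(history: Seq[RateUpdate], t: Long): Map[String, Double] =
    history.filter(_.time <= t).sortBy(_.time)
      .foldLeft(Map.empty[String, Double])((m, u) => m.updated(u.currency, u.rate))

  // Join each fare with the rate valid at the fare's time; fares without a
  // known rate at that time produce no output row (inner join).
  def convert(fares: Seq[TaxiFare], history: Seq[RateUpdate]): Seq[(Long, Double)] =
    for {
      tf <- fares
      rate <- ratesAsOf(history, tf.time).get(tf.currency).toSeq
    } yield (tf.time, tf.price * rate)
}
```

Two fares in the same currency can therefore be converted with different rates if a rate update arrived between them.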

Disclaimer

Stateful Functions is currently a standalone project

https://statefun.io/

https://github.com/ververica/stateful-functions

We are contributing it to Apache Flink during these weeks

The project is still new and dynamic.

A good time to get involved to get traction ;-)

Stream Processing F-a-a-S

simplicity / generality

state management

composability

lightweight resources

performance

event-driven

Can we combine some of these properties?

Stateful Functions

[Diagram: functions f(a,b) between an event ingress and an event egress; state is snapshotted to mass storage (S3, GCF, ECS, HDFS, …)]

Event ingresses supply events that trigger functions

Multiple functions send events to each other

Arbitrary addressing, no restriction to DAG
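
A toy model may help to picture functions that address each other freely, including in cycles. This is a plain Scala sketch and not the Stateful Functions API; `Address`, `Message`, `Fn`, `run`, `ping`, and `pong` are hypothetical names, and the per-address state is just an `Int` counter.

```scala
import scala.annotation.tailrec
import scala.collection.immutable.Queue

object AddressedFunctionsSketch {
  final case class Address(functionType: String, id: String)
  final case class Message(to: Address, payload: Int)

  // A function sees its own address, its local state, and one message;
  // it returns the new state and any messages it wants to send.
  type Fn = (Address, Int, Message) => (Int, List[Message])

  // Deliver messages until none are left; state is kept per address.
  def run(functions: Map[String, Fn], ingress: List[Message]): Map[Address, Int] = {
    @tailrec
    def loop(pending: Queue[Message], state: Map[Address, Int]): Map[Address, Int] =
      pending.dequeueOption match {
        case None => state
        case Some((msg, rest)) =>
          val fn = functions(msg.to.functionType)
          val (next, out) = fn(msg.to, state.getOrElse(msg.to, 0), msg)
          loop(rest ++ out, state.updated(msg.to, next))
      }
    loop(Queue.empty[Message] ++ ingress, Map.empty)
  }

  // Two function types that message each other in a cycle (not a DAG)
  // until the payload reaches 0. State counts the messages seen.
  val ping: Fn = (self, seen, m) =>
    (seen + 1, if (m.payload > 0) List(Message(Address("pong", self.id), m.payload - 1)) else Nil)
  val pong: Fn = (self, seen, m) =>
    (seen + 1, if (m.payload > 0) List(Message(Address("ping", self.id), m.payload - 1)) else Nil)
}
```

A single ingress message to `ping` with payload 3 bounces between the two functions four times in total before the loop drains.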

Functions have locally embedded state

State and messaging are consistent with exactly-once semantics

No database required

All persistence goes directly to blob storage

Event egresses to respond via event streams

Logical/Virtual Instances

[Diagram: Shard 1 and Shard 2 each hold function virtual instances (A, B, C, … N), some in memory and the rest in secondary storage]

message to "K"
load "K" (possibly evict another instance)
K.invoke(message)
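
The load, evict, invoke sequence can be sketched as a small cache in front of a backing store. This is a plain Scala illustration under stated assumptions: least-recently-used eviction is my choice for the example (the slide does not name a policy), the "function" just adds the message to its state, and `Shard` and `send` are hypothetical names.

```scala
import scala.collection.mutable

object VirtualInstancesSketch {
  // One shard: at most `capacity` instances in memory, the rest in `storage`.
  final class Shard(capacity: Int) {
    require(capacity >= 1, "need room for at least one instance")

    // Insertion order doubles as least-recently-used order.
    private val memory = mutable.LinkedHashMap.empty[String, Int]
    // Stands in for secondary storage.
    private val storage = mutable.Map.empty[String, Int]

    def inMemory: List[String] = memory.keys.toList

    // Deliver a message: load the target, possibly evict another, then invoke.
    def send(id: String, message: Int): Int = {
      val state = memory.remove(id).orElse(storage.remove(id)).getOrElse(0)
      if (memory.size >= capacity) {
        val (victim, victimState) = memory.head // least recently used
        memory.remove(victim)
        storage.update(victim, victimState)
      }
      val updated = state + message // the "function": add message to state
      memory.update(id, updated)    // re-inserted last, so most recently used
      updated
    }
  }
}
```

An instance that was evicted keeps its state in `storage`, so a later message to it continues from where it left off.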

Apache Flink is the State and Event Streaming Fabric

Conceptual Dataflow: Ingress/Router → Functions → Egress

Apache Flink Dataflow Graph: Ingress & Router → (keyBy) → Function Dispatcher → (side output) → Egress; Function Dispatcher → Feedback Operator → (keyBy) → Function Dispatcher (loop)

Running Stateful Functions on Apache Flink

Exactly-once checkpointing for streaming loops

[Diagram: Function Dispatcher and Feedback Operator connected by the loop feedback]

Example: Ride Sharing App

Ingress events: Driver status updates, Passenger ride requests, Ride status update

Functions: Driver, Ride, Passenger, Geo-index

Messages between functions: update, create, bill, inform / book, bid, lookup, update cell

States: seeking, confirmed, riding, free, bidding, booked

Stream Processing / Streaming SQL (event/stream-centric): data preparation, combining knowledge/information; filtering, enriching, aggregating, joining events

Stateful Functions (state-centric): coordination, (interacting) state machines; complex event/state interactions

F-a-a-S (stateless / compute-centric): "occasional" actions or spiky loads; compute-intensive or blocking

Putting it all together: Ridesharing again

FaaS: render map/route image, create a receipt PDF, send email

Stateful Functions: ride life-cycle, driver-to-ride matching

Stream Processing: traffic models, demand forecast & pricing

Billing

Passenger updates

Driver position updates

Driver status updates

Thank you!

If you liked this, engage with the Apache Flink® community

• Try Flink and help us improve it

• Contribute docs, code, tutorials

• Share your use cases and ideas

• Join a Flink Meetup

• Join the Flink Forward conference

@StephanEwen

@ApacheFlink https://flink.apache.org/
