© 2019 Ververica
Stephan Ewen
Co-creator and PMC of Apache Flink
Ververica (formerly data Artisans, now part of Alibaba Group)
Stream Processing Beyond Streaming Data
with Apache Flink
About Ververica
Original creators of Apache Flink®
Enterprise Stream Processing
Apache Flink at the "Singles Day" (11/11/2018)
machines: 10K
queries: 10K
throughput: 1.7B events / sec
latency: sub-second
state size: 100TB
Some Apache Flink Users
Sources: Powered by Flink, Speakers – Flink Forward San Francisco 2019, Speakers – Flink Forward Europe 2019
The Spectrum of Streaming Data Use Cases
From more lag time to more real time: Batch Processing, Continuous Processing, Event-driven Applications, Transactional Applications
Use cases along the spectrum: Data Pipelines, Streaming Analytics
Apache Flink: Analytics and Applications on Streaming Data
Stateful Stream Processing: Streams, State, Time
Event-driven Applications: Stateful Functions
Streaming Analytics: SQL and Tables
Flink Runtime: Stateful Computations over Data Streams
Everything is a Stream
Streams of Records in a Log or MQ
[e.g., Apache Kafka or AWS Kinesis …]
Everything is a Stream
Stream of Requests/Responses to/from Services
Requests: GET /a/b, POST /b/c, PUT /e/f
Responses: 200, 404, 200, 200, 403
Service + DB → event sourcing architecture
Everything is a Stream
Stream of Rows in a Table or in Files
[Figure: rows ordered by timestamp, from 2016-3-1 12:00 am through 2016-3-12 3:00am and onward]
Everything is a Stream
A bounded range of rows from this stream is a batch
Bounded and Unbounded Streams
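A minimal sketch of the idea in plain Scala (no Flink dependency): the same lazy computation runs over a bounded input (a batch) and over an unbounded one. The object name `BoundedVsUnbounded` and the function `runningSum` are made up for this illustration.

```scala
object BoundedVsUnbounded {
  // A running sum that consumes records lazily, one at a time.
  // It never needs to know whether the input ends.
  def runningSum(records: Iterator[Int]): Iterator[Int] =
    records.scanLeft(0)(_ + _).drop(1)

  def main(args: Array[String]): Unit = {
    // Bounded stream ("a batch"): the input ends, so a final result exists.
    val bounded = Iterator(1, 2, 3, 4)
    println(runningSum(bounded).toList)

    // Unbounded stream: it never ends, so we can only observe a prefix.
    val unbounded = Iterator.from(1)
    println(runningSum(unbounded).take(4).toList)
  }
}
```

The only difference between the two cases is whether the caller can ever ask for "the final result"; the processing logic is identical.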
Programs are DAGs of Stateful Computation Steps
[Figure: sources feed transformations (computation steps, each with its own state), which feed sinks]
DataStream API
// Source
val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer011(…))
// Transformation
val events: DataStream[Event] = lines.map((line) => parse(line))
// Windowed Transformation
val stats: DataStream[Statistic] = events.keyBy("sensor").timeWindow(Time.seconds(5)).aggregate(new MyAggregationFunction())
// Sink
stats.addSink(new RollingSink(path))
Streaming Dataflow: Source → Transform → Window (state read/write) → Sink
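To make the windowed transformation concrete, here is a sketch in plain Scala of what "key by sensor, 5-second tumbling window, aggregate" computes. It is not Flink code: it works on a finite collection all at once, whereas Flink updates window state incrementally per record. The `Event` fields and the choice of an average as the aggregate are assumptions, since the slide does not show `Event` or `MyAggregationFunction`.

```scala
object WindowedAverage {
  // Hypothetical event type; the slide's Event class is not shown.
  final case class Event(sensor: String, timestampMs: Long, value: Double)

  // Start of the tumbling window that contains the timestamp
  // (assumes non-negative timestamps).
  def windowStart(timestampMs: Long, sizeMs: Long): Long =
    timestampMs - (timestampMs % sizeMs)

  // Group by (sensor, window), then average the values of each group.
  def averages(events: Seq[Event], sizeMs: Long): Map[(String, Long), Double] =
    events
      .groupBy(e => (e.sensor, windowStart(e.timestampMs, sizeMs)))
      .map { case (key, es) => key -> (es.map(_.value).sum / es.size) }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event("a", 1000L, 10.0), Event("a", 4000L, 20.0),
      Event("b", 2000L, 5.0), Event("a", 6000L, 30.0))
    averages(events, 5000L).toSeq.sortBy(_._1).foreach(println)
  }
}
```

Each `(sensor, windowStart)` pair corresponds to one piece of keyed window state in the dataflow above.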
Event Sourcing + Memory Image
event / command → event stream (persists events, temporarily) → Process
Process: main memory, update local variables/structures
periodically snapshot the memory
Also works with RocksDB (LSM trees)
Event Sourcing + Memory Image
Recovery: Restore snapshot and replay events since snapshot
event stream (persists events, temporarily) → Process
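The recovery rule can be sketched in a few lines of plain Scala. This is an illustration under simplifying assumptions (a single process, an in-memory log, counters as state), not Flink's checkpointing implementation; `Event`, `Snapshot`, and the function names are hypothetical.

```scala
object MemoryImage {
  // Hypothetical event: add `amount` to the counter stored under `key`.
  final case class Event(key: String, amount: Long)

  // A snapshot holds a copy of the state plus the log position it covers.
  final case class Snapshot(state: Map[String, Long], offset: Int)

  // Applying one event updates the in-memory state.
  def applyEvent(state: Map[String, Long], e: Event): Map[String, Long] =
    state.updated(e.key, state.getOrElse(e.key, 0L) + e.amount)

  // Process the events in log[from, until), starting from `state`.
  def process(state: Map[String, Long], log: Vector[Event], from: Int, until: Int): Map[String, Long] =
    log.slice(from, until).foldLeft(state)(applyEvent)

  // Recovery: restore the snapshot, then replay the events logged after it.
  def recover(snapshot: Snapshot, log: Vector[Event]): Map[String, Long] =
    process(snapshot.state, log, snapshot.offset, log.length)

  def main(args: Array[String]): Unit = {
    val log = Vector(Event("a", 1), Event("b", 2), Event("a", 3), Event("b", 4))
    val snapshot = Snapshot(process(Map.empty, log, 0, 2), offset = 2)
    // The recovered state equals the state of a run that never failed.
    println(recover(snapshot, log) == process(Map.empty, log, 0, log.length))
  }
}
```

Because the snapshot records the log position it covers, replay starts exactly where the snapshot ends, so no event is applied twice or skipped.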
Consistent Distributed Snapshots
Checkpoints for Recovery / Rollback / Evolution / Cloning / …
Re-load state
Reset positions in input streams
Rolling back computation
Re-processing
Versioning the state of applications
[Figure: timeline with savepoints and application versions App. A, App. B, App. C]
SQL / Table API – Batch Queries
SQL Query → Batch Query Execution
SELECT room, TUMBLE_END(rowtime, INTERVAL '1' HOUR), AVG(temperature)
FROM sensors
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), room
Full TPC-H support in Flink 1.9 with Blink query engine
Full TPC-DS support targeted for Flink 1.10
Interpreting Streams as Tables
SQL / Table API – Streaming Data Case
SELECT room, TUMBLE_END(rowtime, INTERVAL '1' HOUR), AVG(temperature)
FROM sensors
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), room
SQL Query → Interpret Stream as Table → Incremental Query Execution → output result changes as stream → update database with changes
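A rough sketch of incremental query execution for this hourly average, in plain Scala. The only state kept is a running (sum, count) per (room, window), and every input row produces one result change. This is a simplification: emitting an update per row is chosen here to show the changelog idea, and a real engine may instead emit a window's result once when the window closes. All type and function names are hypothetical.

```scala
object IncrementalAvg {
  final case class Reading(room: String, rowtimeMs: Long, temperature: Double)
  // One output change: the new average for (room, window end).
  final case class Update(room: String, windowEndMs: Long, avg: Double)

  val HourMs: Long = 60L * 60L * 1000L

  // Running (sum, count) per (room, window end) is all the state the query keeps.
  type State = Map[(String, Long), (Double, Long)]

  // Process one row: update the group's state and emit the changed result.
  def step(state: State, r: Reading): (State, Update) = {
    val windowEnd = r.rowtimeMs - (r.rowtimeMs % HourMs) + HourMs
    val key = (r.room, windowEnd)
    val (sum, count) = state.getOrElse(key, (0.0, 0L))
    val next = (sum + r.temperature, count + 1L)
    (state.updated(key, next), Update(r.room, windowEnd, next._1 / next._2))
  }

  // Fold over the stream, collecting the stream of result changes.
  def run(readings: Seq[Reading]): (State, Vector[Update]) = {
    val initial: (State, Vector[Update]) = (Map.empty, Vector.empty)
    readings.foldLeft(initial) { case ((state, out), r) =>
      val (nextState, update) = step(state, r)
      (nextState, out :+ update)
    }
  }

  def main(args: Array[String]): Unit = {
    val minute = 60L * 1000L
    val readings = Seq(
      Reading("kitchen", 10 * minute, 20.0),
      Reading("kitchen", 20 * minute, 22.0),
      Reading("hall", 70 * minute, 18.0))
    run(readings)._2.foreach(println)
  }
}
```

A downstream database would apply each `Update` as an upsert keyed by `(room, windowEndMs)`.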
Many handy SQL features: Temporal Joins, Pattern Matching, …
SELECT tf.time, tf.price * rh.rate AS conv_fare
FROM taxiFare AS tf,
  LATERAL TABLE (Rates(tf.time)) AS rh
WHERE tf.currency = rh.currency;
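The semantics of this temporal join can be sketched in plain Scala: each fare is joined with the rate that was valid at the fare's own time, not with the latest rate. The `Fare` and `RateChange` types and the numeric values are invented for the illustration; only the column names follow the query.

```scala
object TemporalJoin {
  final case class Fare(time: Long, currency: String, price: Double)
  final case class RateChange(time: Long, currency: String, rate: Double)

  // Rates(t): for each currency, the latest rate whose change time is <= t.
  def ratesAt(history: Seq[RateChange], t: Long): Map[String, Double] =
    history
      .filter(_.time <= t)
      .sortBy(_.time)
      .map(rc => rc.currency -> rc.rate)
      .toMap // later entries overwrite earlier ones per currency

  // Inner join: fares without a known rate at their time are dropped.
  def convert(fares: Seq[Fare], history: Seq[RateChange]): Seq[(Long, Double)] =
    for {
      tf <- fares
      rate <- ratesAt(history, tf.time).get(tf.currency).toList
    } yield (tf.time, tf.price * rate)

  def main(args: Array[String]): Unit = {
    val history = Seq(RateChange(0L, "EUR", 2.0), RateChange(10L, "EUR", 4.0))
    val fares = Seq(Fare(5L, "EUR", 10.0), Fare(15L, "EUR", 10.0), Fare(5L, "GBP", 1.0))
    convert(fares, history).foreach(println)
  }
}
```

Recomputing `ratesAt` per fare keeps the sketch short; a streaming engine would instead keep the rate history as versioned state.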
Disclaimer
Stateful Functions is currently a standalone project
https://statefun.io/
https://github.com/ververica/stateful-functions
We are contributing it to Apache Flink during these weeks
The project is still new and dynamic.
A good time to get involved to get traction ;-)
Stream Processing and F-a-a-S
Properties: simplicity / generality, state management, composability, lightweight resources, performance, event-driven
Can we combine some of these properties?
Stateful Functions
[Figure: event ingress, a set of functions f(a,b), event egress; state snapshots go to mass storage (S3, GCF, ECS, HDFS, …)]
Event ingresses supply events that trigger functions
Multiple functions send events to each other
Arbitrary addressing, no restriction to DAG
Functions have locally embedded state
State and messaging are consistent with exactly-once semantics
No database required
All persistence goes directly to blob storage
Event egresses to respond via event streams
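The programming model described on these slides can be imitated with a toy single-threaded runtime in plain Scala. This is not the Stateful Functions API: `Address`, `Message`, `Context`, `Fn`, `greeter`, and `printer` are invented for this sketch, and it has no snapshots or exactly-once guarantees. It only shows ingress messages triggering functions, functions messaging each other by address, per-instance state, and an egress.

```scala
import scala.collection.mutable

object MiniStatefulFunctions {
  // Logical address of a function instance: function type plus instance id.
  final case class Address(functionType: String, id: String)
  final case class Message(to: Address, payload: String)

  // What a function may do when invoked: message other functions or emit to the egress.
  final class Context(outbox: mutable.Queue[Message], egress: mutable.Buffer[String]) {
    def send(to: Address, payload: String): Unit = outbox.enqueue(Message(to, payload))
    def emit(result: String): Unit = egress += result
  }

  // A function receives its own state (here an Int) and returns the updated state.
  type Fn = (Context, Int, String) => Int

  // Counts how often each user was seen and messages a "printer" function.
  val greeter: Fn = (ctx, seen, name) => {
    ctx.send(Address("printer", "default"), s"$name seen ${seen + 1} time(s)")
    seen + 1
  }

  // Forwards every text to the egress and counts how many it forwarded.
  val printer: Fn = (ctx, count, text) => { ctx.emit(text); count + 1 }

  def run(ingress: Seq[Message]): (Map[Address, Int], List[String]) = {
    val functions = Map("greeter" -> greeter, "printer" -> printer)
    val state = mutable.Map.empty[Address, Int]
    val queue = mutable.Queue(ingress: _*)
    val egress = mutable.ListBuffer.empty[String]
    while (queue.nonEmpty) {
      val msg = queue.dequeue()
      val ctx = new Context(queue, egress)
      // Invoke the addressed instance with its own state, then store the result.
      state(msg.to) = functions(msg.to.functionType)(ctx, state.getOrElse(msg.to, 0), msg.payload)
    }
    (state.toMap, egress.toList)
  }

  def main(args: Array[String]): Unit = {
    val alice = Address("greeter", "alice")
    val (state, out) = run(Seq(Message(alice, "Alice"), Message(alice, "Alice")))
    out.foreach(println)
    println(state)
  }
}
```

Because any function can send to any address, the message flow is not restricted to a DAG: a `printer` could just as well send a message back to a `greeter`.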
Logical/Virtual Instances
[Figure: function virtual instances A to N spread over Shard 1 and Shard 2; some are held in memory, the rest in secondary storage]
Logical/Virtual Instances
message to "K" → load "K" (possibly evict other) → K.invoke(message)
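The load / evict / invoke cycle can be sketched as a small cache in plain Scala. Everything here is an assumption made for illustration: the `Shard` class, a message counter as instance state, a fixed in-memory capacity, and least-recently-used eviction (the slide only says "possibly evict other").

```scala
import scala.collection.mutable

object VirtualInstances {
  // One shard keeps a bounded number of instances in memory and the rest in secondary storage.
  final class Shard(capacity: Int) {
    require(capacity > 0, "capacity must be positive")

    // Loaded instances in insertion order: the head is the least recently used.
    private val memory = mutable.LinkedHashMap.empty[String, Int]
    private val secondaryStorage = mutable.Map.empty[String, Int]

    // Take the instance state out of memory or secondary storage (or create it),
    // evicting the least recently used instance if memory is full.
    private def load(id: String): Int = {
      val state = memory.remove(id)
        .orElse(secondaryStorage.remove(id))
        .getOrElse(0)
      if (memory.size >= capacity) {
        val (victim, victimState) = memory.head
        memory.remove(victim)
        secondaryStorage(victim) = victimState
      }
      state
    }

    // Deliver a message: here the instance just counts the messages it received.
    def invoke(id: String, message: String): Int = {
      val updated = load(id) + 1
      memory(id) = updated // re-inserted last, so it is now the most recently used
      updated
    }

    def inMemory: List[String] = memory.keys.toList
    def onDisk: Set[String] = secondaryStorage.keySet.toSet
  }

  def main(args: Array[String]): Unit = {
    val shard = new Shard(capacity = 2)
    Seq("K", "L", "K", "M", "L").foreach(id => shard.invoke(id, "hello"))
    println(s"in memory: ${shard.inMemory}, in secondary storage: ${shard.onDisk}")
  }
}
```

Callers only ever address an instance by id; whether it is currently in memory or in secondary storage is invisible to them, which is what makes the instances "virtual".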
Apache Flink is the State and Event Streaming Fabric
Conceptual Dataflow: Ingress/Router → Functions → Egress
Apache Flink Dataflow Graph: Ingress & Router → (keyBy) → Function Dispatcher → (side output) → Egress; Function Dispatcher → Feedback Operator → (keyBy, loop) → Function Dispatcher
Running Stateful Functions on Apache Flink
Exactly-once checkpointing for streaming loops
Function Dispatcher → Feedback Operator → (loop feedback) → Function Dispatcher
Example: Ride Sharing App
Inputs: Driver status updates, Passenger ride requests, Ride status update
Functions: Driver, Ride, Passenger, Geo-index
Interactions: update, create, bill, inform / book, bid, lookup, update cell
States: seeking, confirmed, riding, free, bidding, booked
Stream Processing / Streaming SQL (event/stream-centric): data preparation; combining knowledge/information; filtering, enriching, aggregating, joining events
Stateful Functions (state-centric): coordination, (interacting) state machines; complex event/state interactions
F-a-a-S (stateless / compute-centric): "occasional" actions or spiky loads; compute-intensive or blocking
Putting it all together: Ridesharing again
FaaS: render map/route image, create a receipt PDF, send email
Stateful Functions: ride life-cycle, driver-to-ride matching
Stream Processing: traffic models, demand forecast & pricing
Event streams: Billing, Passenger updates, Driver position updates, Driver status updates
Thank you!
If you liked this, engage with the Apache Flink® community
• Try Flink and help us improve it
• Contribute docs, code, tutorials
• Share your use cases and ideas
• Join a Flink Meetup
• Join the Flink Forward conference
@StephanEwen
@ApacheFlink https://flink.apache.org/