K. Tzoumas & S. Ewen – Flink Forward keynote
TRANSCRIPT
Welcome to the first conference on Apache Flink
Sponsored by
Some practical info
§ Registration, cloakroom, and meals are in Palais
§ Information point always staffed
§ WiFi is FlinkForward
§ Twitter hashtag is #ff15
§ Follow @FlinkForward
Some practical info
§ Need help? Look for a volunteer (pink badges)
§ All sessions are recorded and will be made available online
§ This includes the training sessions
Getting around
Please go around while talks are in progress
Our speaker organizations
Kostas Tzoumas and Stephan Ewen @kostas_tzoumas | @StephanEwen
Apache Flink™: From Incubation to Flink 1.0
1. A bit of history
2. The streaming era and Flink
3. Inside Flink 0.10
4. Towards Flink 1.0 and beyond
A bit of history
From incubation until now
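[Figure: project timeline from Apr 2014 to Oct 2015 with marks at Apr 2014, Dec 2014, Jun 2015, and Oct 2015, the releases 0.5, 0.6, 0.7, 0.8, 0.9-m1, 0.9, and 0.10, and the "Top level" milestone.]
[Figure: the early stack. DataSet API (Java/Scala) on Flink core, running Local, Remote, or on Yarn.]
[Figure: the current stack. Libraries and compatibility layers: Gelly, Table, FlinkML, SAMOA, Hadoop M/R, Dataflow, Cascading, Storm. APIs: DataSet (Java/Scala/Python) and DataStream (Java/Scala). Runtime: Flink dataflow engine. Deployment: Local, Remote, Yarn, Tez, Embedded.]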
Community growth
Flink is one of the largest and most active Apache big data projects with well over 120 contributors
Flink meetups around the globe
Featured in
Welcome to the streaming era
[Figure: batch workloads are well served; event-based workloads need new systems.]
Streaming is the biggest change in data infrastructure since Hadoop
1. Radically simplified infrastructure
2. Internet of Things, on-demand services
3. Can completely subsume batch
In a world of events and isolated apps, the stream processor is the backbone of the data infrastructure
[Figure: isolated apps, each with only a local view, next to apps connected by a stream processor that provides consistent movement and analytics, a global view, and a consistent store.]
§ Until now, stream processors were less mature than batch processors
§ This led to
  • in-house solutions
  • abuse of batch processors
  • Lambda architectures
§ This is no longer the case
Flink 0.10
With the upcoming 0.10 release, Flink significantly surpasses the state of the art in open source stream processing systems.
And we are heading to Flink 1.0 after that.
§ Streaming technology has matured
  • e.g., Flink, Kafka, Dataflow
§ Flink and Dataflow duality
  • a Google technology
  • an open source Apache project
§ Streaming is happening
§ Better adapt now
§ Flink 0.10: a ready-to-use open source stream processor
Flink 0.10
Flink for the streaming era
Improved DataStream API
§ Stream data analysis differs from batch data analysis by introducing time
§ Streams are unbounded and produce data over time
§ As simple as the batch API if you handle time in a simple way
§ Powerful if you want to handle time in an advanced way (out-of-order records, preliminary results, etc.)
Improved DataStream API
case class Event(location: Location, numVehicles: Long)
val stream: DataStream[Event] = …;
stream
  .filter { evt => isIntersection(evt.location) }
Improved DataStream API
case class Event(location: Location, numVehicles: Long)
val stream: DataStream[Event] = …;
stream
  .filter { evt => isIntersection(evt.location) }
  .keyBy("location")
  .timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
  .sum("numVehicles")
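What a 15-minute window sliding every 5 minutes computes can be pictured without Flink. The following is a minimal plain-Scala sketch, not Flink's implementation. It assumes a hypothetical `Reading` type with a timestamp in whole minutes and windows aligned to multiples of the slide; `windowStarts` and `sumPerWindow` are illustrative names.

```scala
object SlidingWindowSketch {
  // Hypothetical, simplified event: a location name, a vehicle count,
  // and the minute at which the reading was taken (assumed >= 0).
  final case class Reading(location: String, numVehicles: Long, minute: Long)

  /** Start minutes of every sliding window [start, start + size) that contains `minute`. */
  def windowStarts(minute: Long, size: Long, slide: Long): List[Long] = {
    val lastStart = minute - (minute % slide) // latest aligned start <= minute
    (lastStart to (minute - size + 1) by -slide).toList
  }

  /** Sum of numVehicles per (location, window start). */
  def sumPerWindow(readings: Seq[Reading], size: Long, slide: Long): Map[(String, Long), Long] =
    readings
      // Each reading is assigned to every window that overlaps its minute.
      .flatMap(r => windowStarts(r.minute, size, slide).map(s => ((r.location, s), r.numVehicles)))
      .groupBy(_._1)
      .map { case (key, values) => key -> values.map(_._2).sum }
}
```

With a size of 15 and a slide of 5, every reading lands in three overlapping windows, which is why the same vehicle count contributes to three sums.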
Improved DataStream API
case class Event(location: Location, numVehicles: Long)
val stream: DataStream[Event] = …;
stream
  .filter { evt => isIntersection(evt.location) }
  .keyBy("location")
  .timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
  .trigger(new Threshold(200))
  .sum("numVehicles")
Improved DataStream API
case class Event(location: Location, numVehicles: Long)
val stream: DataStream[Event] = …;
stream
  .filter { evt => isIntersection(evt.location) }
  .keyBy("location")
  .timeWindow(Time.of(15, MINUTES), Time.of(5, MINUTES))
  .trigger(new Threshold(200))
  .sum("numVehicles")
  .keyBy( evt => evt.location.grid )
  .mapWithState { (evt, state: Option[Model]) => {
    // reuse the model stored for this key, or start with a fresh one
    val model = state.getOrElse(new Model())
    (model.classify(evt), Some(model.update(evt)))
  }}
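The `mapWithState` step pairs each record with the state kept for its key and returns a result together with the new state. A minimal plain-Scala sketch of that contract follows. It is an illustration under assumptions, not Flink's operator: state lives in a local `Map`, the input is a strict, finite `Seq`, and `runningTotals` is a hypothetical usage example.

```scala
object KeyedStateSketch {
  /** Applies `f` to each element and the state currently stored for its key.
    * `f` returns the result to emit plus the state to keep (None clears it). */
  def mapWithState[T, K, S, R](events: Seq[T])(key: T => K)(
      f: (T, Option[S]) => (R, Option[S])): List[R] = {
    var states = Map.empty[K, S] // one state slot per key
    events.toList.map { evt =>
      val k = key(evt)
      val (result, newState) = f(evt, states.get(k))
      newState match {
        case Some(s) => states = states.updated(k, s)
        case None    => states = states - k
      }
      result
    }
  }

  /** Hypothetical usage: a running total per key. */
  def runningTotals(events: Seq[(String, Long)]): List[Long] =
    mapWithState[(String, Long), String, Long, Long](events)(_._1) { (evt, state) =>
      val total = state.getOrElse(0L) + evt._2
      (total, Some(total))
    }
}
```

The first record of a key sees `None`, which is why the slide code needs a default model before it can classify.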
IoT / Mobile Applications
Events occur on devices
Events stored in a log (Queue / Log)
Events analyzed in a data streaming system (Stream Analysis)
IoT / Mobile Applications
Out of order !!!
First burst of events
Second burst of events
IoT / Mobile Applications
Event time windows
Arrival time windows
Instant event-at-a-time
Flink supports out-of-order (event time) windows, arrival time windows (and mixtures of both), plus low-latency processing.
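The difference between event time windows and arrival time windows shows up as soon as one event is delayed. The sketch below is plain Scala with made-up data, not Flink code. It assumes a hypothetical `Event` carrying both timestamps in seconds (non-negative) and uses tumbling windows for brevity.

```scala
object EventTimeSketch {
  // Hypothetical event: when it happened on the device and when it
  // reached the stream processor, both in seconds.
  final case class Event(id: String, eventTime: Long, arrivalTime: Long)

  /** Counts events per tumbling window of `size` seconds; `timeOf` picks the timestamp. */
  def countPerWindow(events: Seq[Event], size: Long)(timeOf: Event => Long): Map[Long, Int] =
    events
      .groupBy(e => timeOf(e) / size * size) // window start for this event
      .map { case (start, es) => start -> es.size }

  // First burst happens at seconds 1 to 3, second burst at seconds 11 to 12.
  // Event "c" belongs to the first burst but is delayed and arrives last.
  val sample: Seq[Event] = Seq(
    Event("a", 1, 2),
    Event("b", 2, 3),
    Event("d", 11, 12),
    Event("e", 12, 13),
    Event("c", 3, 14)
  )
}
```

Grouping `sample` by `_.eventTime` with 10-second windows puts three events in the first window and two in the second. Grouping by `_.arrivalTime` counts the delayed event in the wrong burst.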
High Availability and Consistency
No Single-Point-Of-Failure any more
Exactly-once processing semantics across pipeline
Checkpoints/Fault Tolerance is decoupled from windows → allows for highly flexible window implementations
ZooKeeper ensemble
Multiple Masters
failover
Performance
Continuous streaming
Latency-bound buffering
Distributed Snapshots
High Throughput & Low Latency
With configurable throughput/latency tradeoff
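One way to picture latency-bound buffering: records are collected into a buffer that is shipped when it is full or when its oldest record has waited too long. The sketch below is a simplified plain-Scala model of that idea, not Flink's network stack. `capacity` and `timeoutMillis` are assumed parameter names, timestamps are supplied explicitly, and the timeout is only checked when the next record arrives.

```scala
import scala.collection.mutable.ListBuffer

object BufferSketch {
  /** Groups (timestampMillis, record) pairs into shipped batches.
    * A batch is shipped when it holds `capacity` records, or when a record
    * arrives at least `timeoutMillis` after the oldest buffered record. */
  def batches[A](records: Seq[(Long, A)], capacity: Int, timeoutMillis: Long): List[List[A]] = {
    val shipped = ListBuffer.empty[List[A]]
    var buffer = Vector.empty[A]
    var oldest = 0L // arrival time of the first record in the current buffer
    for ((time, record) <- records) {
      if (buffer.nonEmpty && time - oldest >= timeoutMillis) {
        shipped += buffer.toList // latency bound reached
        buffer = Vector.empty
      }
      if (buffer.isEmpty) oldest = time
      buffer :+= record
      if (buffer.size == capacity) {
        shipped += buffer.toList // buffer full
        buffer = Vector.empty
      }
    }
    if (buffer.nonEmpty) shipped += buffer.toList // flush the remainder at end of input
    shipped.toList
  }
}
```

A large capacity with a long timeout favors throughput (fewer, bigger batches), while a short timeout favors latency.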
Batch and Streaming
Batch Word Count in the DataStream API
case class WordCount(word: String, count: Int)
val text: DataStream[String] = …;
text
  .flatMap { line => line.split(" ") }
  .map { word => new WordCount(word, 1) }
  .keyBy("word")
  .window(GlobalWindows.create())
  .trigger(new EOFTrigger())
  .sum("count")
Batch and Streaming
Batch Word Count in the DataSet API
case class WordCount(word: String, count: Int)
val text: DataSet[String] = …;
text
  .flatMap { line => line.split(" ") }
  .map { word => new WordCount(word, 1) }
  .groupBy("word")
  .sum("count")
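Why the two word counts agree can be shown in plain Scala: a batch job is a streaming job over a finite input whose single global window fires once, at end of input. The sketch below is an illustration without Flink; `batchCount` and `streamingCount` are hypothetical names.

```scala
object WordCountSketch {
  /** "Batch" style: group the complete data set, then count each group. */
  def batchCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  /** "Streaming" style: update per-word counts one record at a time and
    * emit the result only when the input ends. */
  def streamingCount(lines: Iterator[String]): Map[String, Int] = {
    var counts = Map.empty[String, Int] // state of the single global window
    for (line <- lines; word <- line.split(" "))
      counts = counts.updated(word, counts.getOrElse(word, 0) + 1)
    counts // the "end of input" trigger fires here
  }
}
```

On any finite input both functions return the same map; the streaming version never needs the whole data set in one place before it starts.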
Batch and Streaming
[Figure: DataSet and DataStream both run on the Streaming Dataflow Runtime.
DataSet path: Relational Optimizer, pipelined and blocking operators, Batch Parameters (schedule lazily, recompute whole operators).
DataStream path: Window Optimization, pipelined and windowed operators, Streaming Parameters (schedule eagerly, periodic checkpoints).
Further labels: streaming data movement, stateful operations, DAG recovery, fully buffered streams, DAG resource management.]
Batch and Streaming
A full-fledged batch processor as well
[Figure: the Flink stack. Libraries and compatibility layers: Gelly, Table, FlinkML, SAMOA, Hadoop M/R, Dataflow, Cascading, Storm. APIs: DataSet (Java/Scala/Python) and DataStream (Java/Scala). Runtime: Flink dataflow engine. Deployment: Local, Remote, Yarn, Tez, Embedded.]
More details at Dongwon Kim's Talk "A comparative performance evaluation of Flink"
Integration (picture not complete)
[Figure: logos of integrated systems; the surviving text labels are POSIX and Java/Scala Collections.]
Monitoring
Live system metrics and user-defined accumulators/statistics
Monitoring REST API for custom monitoring tools
GET http://flink-m:8081/jobs/7684be6004e4e955c2a558a9bc463f65/accumulators
{
  "id": "dceafe2df1f57a1206fcb907cb38ad97",
  "user-accumulators": [
    { "name": "avglen", "type": "DoubleCounter", "value": "123.03259440000001" },
    { "name": "genwords", "type": "LongCounter", "value": "75000000" }
  ]
}
Flink 0.10 Summary
§ Focus on operational readiness
  • high availability
  • monitoring
  • integration with other systems
§ First-class support for event time
§ Refined DataStream API: easy and powerful
Towards Flink 1.0 and beyond
Where we see the project going
Towards Flink 1.0
§ Flink 1.0 is around the corner
§ Focus on defining public APIs and automatic API compatibility checks
§ Guarantee backwards compatibility in all Flink 1.X versions
Beyond Flink 1.0
§ Flink engine has most features in place
§ Focus on usability features on top of DataStream API
  • e.g., SQL, ML, more connectors
§ Continue work on elasticity and memory management
Enjoy the rest of the first conference on Apache Flink