Apache Flink Deep Dive
Vasia Kalavri, Flink Committer & KTH PhD student
1st Apache Flink Meetup Stockholm, May 11, 2015
Flink Internals
● Job Life-Cycle ○ what happens after you submit a Flink job?
● The Batch Optimizer ○ how are execution plans chosen?
● Delta Iterations ○ how are Flink iterations special for Graph and ML apps?
what happens after you submit a Flink job?
The Flink Stack
● Libraries: Python, Gelly, Table, Flink ML, SAMOA, Dataflow
● APIs: DataSet (Java/Scala), DataStream (Java/Scala), Hadoop M/R
● Batch Optimizer / Streaming Optimizer
● Flink Runtime
● Deployment: Local, Remote, YARN, Tez, Embedded
*current Flink master + a few PRs
Program Life-Cycle

DataSet<String> text = env.readTextFile(input);

DataSet<Tuple2<String, Integer>> result = text
    .flatMap((str, out) -> {
        for (String token : str.split("\\W")) {
            out.collect(new Tuple2<>(token, 1));
        }
    })
    .groupBy(0).aggregate(SUM, 1);
The WordCount program above is executed as follows:
● Flink Client & Optimizer: creates and submits the job graph
● Job Manager: creates the execution graph and deploys the tasks
● Task Managers: execute the tasks and send status updates

Sample input and resulting counts:
"O Romeo, Romeo, wherefore art thou Romeo?" → O, 1; Romeo, 3; wherefore, 1; art, 1; thou, 1
"Nor arm, nor face, nor any other part" → nor, 3; arm, 1; face, 1; any, 1; other, 1; part, 1
Series of Transformations

Input → Operator X → First → Operator Y → Second

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> input = env.readTextFile(inputPath);

DataSet<String> first = input.filter(str -> str.contains("Apache Flink"));
DataSet<String> second = first.filter(str -> str.length() > 40);

second.print();
env.execute();
DataSet Abstraction
Think of it as a collection of data elements that can be produced/recovered in several ways:
● … like a Java collection
● … like an RDD
● … perhaps it is never fully materialized (because the program does not need it to be)
● … implicitly updated in an iteration
→ this is transparent to the user
Example: grep
Input: Romeo, Romeo, where art thou Romeo?
Load Log feeds three parallel operators:
● Search for str1 → Grep 1
● Search for str2 → Grep 2
● Search for str3 → Grep 3
Staged (batch) execution
The same grep dataflow, executed in stages:
● Stage 1: create/cache the Log
● Subsequent stages: grep the log for matches
● Caching in memory, spilling to disk if needed
Pipelined execution
The same grep dataflow, executed as one pipeline:
● Stage 1: deploy and start all operators
● Data transfer in memory, spilling to disk if needed
● Note: the Log DataSet is never "created"!
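The staged/pipelined distinction can be sketched in Python: a staged engine materializes the log before grepping, while a pipelined engine streams records through the greps as they are produced. Generators stand in for operators; all names here are illustrative:

```python
def load_log():
    # Pipelined source: yields one record at a time; the full log
    # is never materialized as a collection.
    for line in ["Romeo, Romeo, where art thou Romeo?",
                 "Nor arm, nor face, nor any other part"]:
        yield line

def grep(records, needle):
    # Downstream operator consuming records as they arrive.
    return [r for r in records if needle in r]

# Staged: create/cache the log first, then run each grep on the cache.
cached_log = list(load_log())
staged = [grep(cached_log, s) for s in ("Romeo", "arm", "face")]

# Pipelined: each grep pulls straight from the source; nothing is cached.
pipelined = [grep(load_log(), s) for s in ("Romeo", "arm", "face")]

assert staged == pipelined  # same results, different execution strategy
```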
how are execution plans chosen?
Flink Batch Optimizer
Inspired by database optimizers, it creates and selects the execution plan for a user program.
A Simple Program

DataSet<Tuple5<Integer, String, String, String, Integer>> orders = ...;
DataSet<Tuple2<Integer, Double>> lineitems = ...;

DataSet<Tuple2<Integer, Integer>> filteredOrders = orders
    .filter(...)
    .project(0, 4).types(Integer.class, Integer.class);

DataSet<Tuple3<Integer, Integer, Double>> lineitemsOfOrders = filteredOrders
    .join(lineitems)
    .where(0).equalTo(0)
    .projectFirst(0, 1).projectSecond(1)
    .types(Integer.class, Integer.class, Double.class);

DataSet<Tuple3<Integer, Integer, Double>> priceSums = lineitemsOfOrders
    .groupBy(0, 1).aggregate(Aggregations.SUM, 2);

priceSums.writeAsCsv(outputPath);
Alternative Execution Plans
● Plan 1 (broadcast): DataSource orders.tbl → FilterMap, broadcast to the Join (Hybrid Hash, build HT); DataSource lineitem.tbl forwarded to the Join (probe); then Combine → GroupReduce (sort).
● Plan 2 (partition): both inputs hash-partitioned on [0] into the Join (Hybrid Hash, build HT / probe); the join result hash-partitioned on [0,1] into GroupReduce (sort).
The best plan depends on the relative sizes of the input files.
Optimization Examples
● Evaluates physical execution strategies, e.g. hash join vs. sort-merge join
● Chooses data shipping strategies, e.g. broadcast vs. partition
● Reuses partitioning and sort orders
● Decides to cache loop-invariant data in iterations
Example: Distributed Joins

The join operator needs to create all pairs of elements from the two inputs for which the join condition evaluates to true.

case class PageVisit(url: String, ip: String, userId: Long)
case class User(id: Long, name: String, email: String, country: String)

// get your data from somewhere
val visits: DataSet[PageVisit] = ...
val users: DataSet[User] = ...

// filter the users data set
val germanUsers = users.filter((u) => u.country.equals("de"))

// join data sets: equi-join condition (PageVisit.userId = User.id)
val germanVisits: DataSet[(PageVisit, User)] =
  visits.join(germanUsers).where("userId").equalTo("id")
Example: Distributed Joins
● Ship strategy: the input data is distributed across all parallel instances that participate in the join.
● Local strategy: each parallel instance runs a join algorithm on its local partition.
For both steps there are multiple valid strategies, each favorable in different situations.
Repartition-Repartition Strategy
Partitions both inputs using the same partitioning function. All elements that share the same join key are shipped to the same parallel instance and can be joined locally.
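Outside Flink, the strategy can be sketched in a few lines of Python: hash-partition both inputs with the same function, then join each pair of co-located partitions. This is a toy model of the idea, not Flink's implementation; all names are illustrative:

```python
def hash_partition(records, key, n):
    # Ship phase: route every record to a partition by hashing its join key.
    parts = [[] for _ in range(n)]
    for r in records:
        parts[hash(key(r)) % n].append(r)
    return parts

def local_join(left, right, lkey, rkey):
    # Local phase: a simple hash join on one partition.
    table = {}
    for l in left:
        table.setdefault(lkey(l), []).append(l)
    return [(l, r) for r in right for l in table.get(rkey(r), [])]

def repartition_join(left, right, lkey, rkey, n=4):
    lparts = hash_partition(left, lkey, n)
    rparts = hash_partition(right, rkey, n)
    # Records with equal keys land in the same partition index on both sides.
    out = []
    for lp, rp in zip(lparts, rparts):
        out.extend(local_join(lp, rp, lkey, rkey))
    return out

visits = [("a.html", 1), ("b.html", 2), ("c.html", 1)]
users = [(1, "de"), (2, "en")]
pairs = repartition_join(visits, users, lambda v: v[1], lambda u: u[0])
```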
Broadcast-Forward Strategy
Sends one complete data set to each parallel instance that holds a partition of the other data set. The other data set remains local and is not shipped at all.
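A similarly simplified sketch: the small side is replicated to every instance, while the large side stays in whatever partitions it already occupies. Again a toy model with made-up names, not Flink code:

```python
def broadcast_forward_join(big_partitions, small, bkey, skey):
    # Each entry of big_partitions models one parallel instance's local data.
    out = []
    for part in big_partitions:
        # The full small side is shipped to (and hashed on) every instance.
        table = {}
        for s in small:
            table.setdefault(skey(s), []).append(s)
        # The big side is never shipped; probe it locally.
        for b in part:
            out.extend((b, s) for s in table.get(bkey(b), []))
    return out

# big side already split across two instances; small side is broadcast
visits = [[("a.html", 1), ("b.html", 2)], [("c.html", 1)]]
users = [(1, "de")]
pairs = broadcast_forward_join(visits, users, lambda v: v[1], lambda u: u[0])
```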
How does the Optimizer choose?
It computes cost estimates for the execution plans and picks the "cheapest" one, considering:
● the amount of data shipped over the network
● whether the data of one input is already partitioned

R-R cost: full shuffle of both data sets over the network.
B-F cost: depends on the size of the broadcast data set and the number of parallel instances.
Read more: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
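Under the simplified cost model stated above (R-R ships both inputs once; B-F ships the broadcast input once per instance), a toy plan chooser might look like the following. The sizes, names, and the model itself are illustrative, not Flink's actual cost functions:

```python
def rr_cost(size_a, size_b):
    # Repartition-Repartition: full shuffle of both data sets.
    return size_a + size_b

def bf_cost(broadcast_size, parallelism):
    # Broadcast-Forward: the broadcast set is sent to every instance.
    return broadcast_size * parallelism

def pick_plan(size_a, size_b, parallelism):
    candidates = {
        "repartition-repartition": rr_cost(size_a, size_b),
        "broadcast-forward": bf_cost(min(size_a, size_b), parallelism),
    }
    return min(candidates, key=candidates.get)

# A tiny input beside a huge one favors broadcasting the tiny side ...
print(pick_plan(1, 1000, parallelism=10))    # broadcast-forward
# ... while two similar-sized inputs favor repartitioning.
print(pick_plan(800, 1000, parallelism=10))  # repartition-repartition
```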
how are Flink iterations special?
Iterate by unrolling
● a for/while loop in the client submits one job per iteration step
● data reuse by caching in memory and/or on disk

Client → Step → Step → Step → Step → Step
Native Iterations
● the runtime is aware of the iterative execution
● no scheduling overhead between iterations
● caching and state maintenance are handled automatically: caching loop-invariant data, pushing work "out of the loop", maintaining state as an index
Flink Iteration Operators
● Iterate: the input is fed to the iterative update function, whose output replaces the previous partial solution; after the last iteration it becomes the result.
● IterateDelta: the update function consumes a workset and reads/updates a solution set kept as state; it emits the next workset, and the final solution set is the result.
Delta Iteration
● Not all elements of the state are updated in each iteration.
● The elements that require an update are stored in the workset.
● The step function is applied only to the workset elements.
Example: Connected Components
Partition a graph into components by iteratively propagating the minimum vertex ID among neighbors.
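A plain-Python sketch of the algorithm using the workset pattern described above: the solution set maps vertex → component ID, and only vertices whose ID just changed are in the workset for the next step. This models the semantics; it is not Flink's delta-iteration API:

```python
def connected_components(vertices, edges):
    # Solution set: current minimum component ID per vertex (starts as own ID).
    solution = {v: v for v in vertices}
    # Undirected adjacency lists.
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Workset: vertices whose component ID changed in the last step.
    workset = set(vertices)
    while workset:
        next_workset = set()
        # The step function is applied only to workset elements.
        for v in workset:
            for n in neighbors[v]:
                if solution[v] < solution[n]:
                    solution[n] = solution[v]   # propagate the minimum ID
                    next_workset.add(n)         # n must be revisited
        workset = next_workset
    return solution

comp = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
# comp == {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}  -> two components
```

Once no vertex changes its ID, the workset is empty and the iteration terminates, which is exactly why delta iterations converge quickly on graphs where most vertices stabilize early.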
Delta-Connected Components
Performance
Want to learn more?
Read the documentation and our blog posts!
● Memory Management
● Serialization and Type Extraction
● Streaming Optimizations
● Fault-Tolerance