Apache Flink Deep Dive
Vasia Kalavri, Flink Committer & KTH PhD student
1st Apache Flink Meetup Stockholm, May 11, 2015
Flink Internals
● Job Life-Cycle ○ what happens after you submit a Flink job?
● The Batch Optimizer ○ how are execution plans chosen?
● Delta Iterations ○ how are Flink iterations special for Graph and ML apps?
what happens after you submit a Flink job?
The Flink Stack
● Libraries: Python, Gelly, Table, Flink ML, SAMOA, Dataflow
● APIs: DataSet (Java/Scala), DataStream (Java/Scala), Hadoop M/R
● Batch Optimizer / Streaming Optimizer
● Flink Runtime
● Deployment: Local, Remote, YARN, Tez, Embedded
*current Flink master + a few PRs
Program Life-Cycle

DataSet<String> text = env.readTextFile(input);

DataSet<Tuple2<String, Integer>> result = text
    .flatMap((str, out) -> {
        for (String token : str.split("\\W")) {
            out.collect(new Tuple2<>(token, 1));
        }
    })
    .groupBy(0).aggregate(SUM, 1);
The WordCount program above is executed as follows:
● Flink Client & Optimizer: creates and submits the job graph
● Job Manager: creates the execution graph and deploys the tasks
● Task Managers: execute the tasks and send status updates

Sample input and resulting counts:
"O Romeo, Romeo, wherefore art thou Romeo?" → O, 1; Romeo, 3; wherefore, 1; art, 1; thou, 1
"Nor arm, nor face, nor any other part" → nor, 3; arm, 1; face, 1; any, 1; other, 1; part, 1
Series of Transformations

Input → Operator X → First → Operator Y → Second

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> input = env.readTextFile(inputPath);

DataSet<String> first = input.filter(str -> str.contains("Apache Flink"));
DataSet<String> second = first.filter(str -> str.length() > 40);

second.print();
env.execute();
DataSet Abstraction
Think of it as a collection of data elements that can be produced/recovered in several ways:
● … like a Java collection
● … like an RDD
● … perhaps it is never fully materialized (because the program does not need it to be)
● … implicitly updated in an iteration
→ this is transparent to the user
Example: grep
Input: Romeo, Romeo, where art thou Romeo?
Load Log feeds three parallel operators:
● Search for str1 → Grep 1
● Search for str2 → Grep 2
● Search for str3 → Grep 3
Staged (batch) execution
The same grep dataflow, executed in stages:
● Stage 1: create/cache the Log
● Subsequent stages: grep the log for matches
● Caching in memory, spilling to disk if needed
Pipelined execution
The same grep dataflow, executed as one pipeline:
● Stage 1: deploy and start all operators
● Data transfer in memory, spilling to disk if needed
● Note: the Log DataSet is never "created"!
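The staged/pipelined distinction can be sketched in Python: a staged engine materializes the log before grepping, while a pipelined engine streams records through the greps as they are produced. Generators stand in for operators; all names here are illustrative:

```python
def load_log():
    # Pipelined source: yields one record at a time; the full log
    # is never materialized as a collection.
    for line in ["Romeo, Romeo, where art thou Romeo?",
                 "Nor arm, nor face, nor any other part"]:
        yield line

def grep(records, needle):
    # Downstream operator consuming records as they arrive.
    return [r for r in records if needle in r]

# Staged: create/cache the log first, then run each grep on the cache.
cached_log = list(load_log())
staged = [grep(cached_log, s) for s in ("Romeo", "arm", "face")]

# Pipelined: each grep pulls straight from the source; nothing is cached.
pipelined = [grep(load_log(), s) for s in ("Romeo", "arm", "face")]

assert staged == pipelined  # same results, different execution strategy
```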
how are execution plans chosen?
Flink Batch Optimizer
Inspired by database optimizers, it creates and selects the execution plan for a user program.
A Simple Program

DataSet<Tuple5<Integer, String, String, String, Integer>> orders = ...;
DataSet<Tuple2<Integer, Double>> lineitems = ...;

DataSet<Tuple2<Integer, Integer>> filteredOrders = orders
    .filter(...)
    .project(0, 4).types(Integer.class, Integer.class);

DataSet<Tuple3<Integer, Integer, Double>> lineitemsOfOrders = filteredOrders
    .join(lineitems)
    .where(0).equalTo(0)
    .projectFirst(0, 1).projectSecond(1)
    .types(Integer.class, Integer.class, Double.class);

DataSet<Tuple3<Integer, Integer, Double>> priceSums = lineitemsOfOrders
    .groupBy(0, 1).aggregate(Aggregations.SUM, 2);

priceSums.writeAsCsv(outputPath);
Alternative Execution Plans
● Plan 1 (broadcast): DataSource orders.tbl → FilterMap, broadcast to the Join (Hybrid Hash, build HT); DataSource lineitem.tbl forwarded to the Join (probe); then Combine → GroupReduce (sort).
● Plan 2 (partition): both inputs hash-partitioned on [0] into the Join (Hybrid Hash, build HT / probe); the join result hash-partitioned on [0,1] into GroupReduce (sort).
The best plan depends on the relative sizes of the input files.
Optimization Examples
● Evaluates physical execution strategies, e.g. hash join vs. sort-merge join
● Chooses data shipping strategies, e.g. broadcast vs. partition
● Reuses partitioning and sort orders
● Decides to cache loop-invariant data in iterations
Example: Distributed Joins

The join operator needs to create all pairs of elements from the two inputs for which the join condition evaluates to true.

case class PageVisit(url: String, ip: String, userId: Long)
case class User(id: Long, name: String, email: String, country: String)

// get your data from somewhere
val visits: DataSet[PageVisit] = ...
val users: DataSet[User] = ...

// filter the users data set
val germanUsers = users.filter((u) => u.country.equals("de"))

// join data sets: equi-join condition (PageVisit.userId = User.id)
val germanVisits: DataSet[(PageVisit, User)] =
  visits.join(germanUsers).where("userId").equalTo("id")
Example: Distributed Joins
● Ship strategy: the input data is distributed across all parallel instances that participate in the join.
● Local strategy: each parallel instance runs a join algorithm on its local partition.
For both steps there are multiple valid strategies, each favorable in different situations.
Repartition-Repartition Strategy
Partitions both inputs using the same partitioning function. All elements that share the same join key are shipped to the same parallel instance and can be joined locally.
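Outside Flink, the strategy can be sketched in a few lines of Python: hash-partition both inputs with the same function, then join each pair of co-located partitions. This is a toy model of the idea, not Flink's implementation; all names are illustrative:

```python
def hash_partition(records, key, n):
    # Ship phase: route every record to a partition by hashing its join key.
    parts = [[] for _ in range(n)]
    for r in records:
        parts[hash(key(r)) % n].append(r)
    return parts

def local_join(left, right, lkey, rkey):
    # Local phase: a simple hash join on one partition.
    table = {}
    for l in left:
        table.setdefault(lkey(l), []).append(l)
    return [(l, r) for r in right for l in table.get(rkey(r), [])]

def repartition_join(left, right, lkey, rkey, n=4):
    lparts = hash_partition(left, lkey, n)
    rparts = hash_partition(right, rkey, n)
    # Records with equal keys land in the same partition index on both sides.
    out = []
    for lp, rp in zip(lparts, rparts):
        out.extend(local_join(lp, rp, lkey, rkey))
    return out

visits = [("a.html", 1), ("b.html", 2), ("c.html", 1)]
users = [(1, "de"), (2, "en")]
pairs = repartition_join(visits, users, lambda v: v[1], lambda u: u[0])
```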
Broadcast-Forward Strategy
Sends one complete data set to each parallel instance that holds a partition of the other data set. The other data set remains local and is not shipped at all.
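A similarly simplified sketch: the small side is replicated to every instance, while the large side stays in whatever partitions it already occupies. Again a toy model with made-up names, not Flink code:

```python
def broadcast_forward_join(big_partitions, small, bkey, skey):
    # Each entry of big_partitions models one parallel instance's local data.
    out = []
    for part in big_partitions:
        # The full small side is shipped to (and hashed on) every instance.
        table = {}
        for s in small:
            table.setdefault(skey(s), []).append(s)
        # The big side is never shipped; probe it locally.
        for b in part:
            out.extend((b, s) for s in table.get(bkey(b), []))
    return out

# big side already split across two instances; small side is broadcast
visits = [[("a.html", 1), ("b.html", 2)], [("c.html", 1)]]
users = [(1, "de")]
pairs = broadcast_forward_join(visits, users, lambda v: v[1], lambda u: u[0])
```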
How does the Optimizer choose?
It computes cost estimates for the execution plans and picks the "cheapest" one, considering:
● the amount of data shipped over the network
● whether the data of one input is already partitioned

R-R cost: full shuffle of both data sets over the network.
B-F cost: depends on the size of the broadcast data set and the number of parallel instances.
Read more: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
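Under the simplified cost model stated above (R-R ships both inputs once; B-F ships the broadcast input once per instance), a toy plan chooser might look like the following. The sizes, names, and the model itself are illustrative, not Flink's actual cost functions:

```python
def rr_cost(size_a, size_b):
    # Repartition-Repartition: full shuffle of both data sets.
    return size_a + size_b

def bf_cost(broadcast_size, parallelism):
    # Broadcast-Forward: the broadcast set is sent to every instance.
    return broadcast_size * parallelism

def pick_plan(size_a, size_b, parallelism):
    candidates = {
        "repartition-repartition": rr_cost(size_a, size_b),
        "broadcast-forward": bf_cost(min(size_a, size_b), parallelism),
    }
    return min(candidates, key=candidates.get)

# A tiny input beside a huge one favors broadcasting the tiny side ...
print(pick_plan(1, 1000, parallelism=10))    # broadcast-forward
# ... while two similar-sized inputs favor repartitioning.
print(pick_plan(800, 1000, parallelism=10))  # repartition-repartition
```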
how are Flink iterations special?
Iterate by unrolling
● a for/while loop in the client submits one job per iteration step
● data reuse by caching in memory and/or on disk

Client → Step → Step → Step → Step → Step
Native Iterations
● the runtime is aware of the iterative execution
● no scheduling overhead between iterations
● caching and state maintenance are handled automatically: caching loop-invariant data, pushing work "out of the loop", maintaining state as an index
Flink Iteration Operators
● Iterate: the input is fed to the iterative update function, whose output replaces the previous partial solution; after the last iteration it becomes the result.
● IterateDelta: the update function consumes a workset and reads/updates a solution set kept as state; it emits the next workset, and the final solution set is the result.
Delta Iteration
● Not all elements of the state are updated in each iteration.
● The elements that require an update are stored in the workset.
● The step function is applied only to the workset elements.
Example: Connected Components
Partition a graph into components by iteratively propagating the minimum vertex ID among neighbors.
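A plain-Python sketch of the algorithm using the workset pattern described above: the solution set maps vertex → component ID, and only vertices whose ID just changed are in the workset for the next step. This models the semantics; it is not Flink's delta-iteration API:

```python
def connected_components(vertices, edges):
    # Solution set: current minimum component ID per vertex (starts as own ID).
    solution = {v: v for v in vertices}
    # Undirected adjacency lists.
    neighbors = {v: set() for v in vertices}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    # Workset: vertices whose component ID changed in the last step.
    workset = set(vertices)
    while workset:
        next_workset = set()
        # The step function is applied only to workset elements.
        for v in workset:
            for n in neighbors[v]:
                if solution[v] < solution[n]:
                    solution[n] = solution[v]   # propagate the minimum ID
                    next_workset.add(n)         # n must be revisited
        workset = next_workset
    return solution

comp = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
# comp == {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}  -> two components
```

Once no vertex changes its ID, the workset is empty and the iteration terminates, which is exactly why delta iterations converge quickly on graphs where most vertices stabilize early.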
Delta-Connected Components
Performance
Want to learn more?
Read the documentation and our blog posts!
● Memory Management
● Serialization and Type Extraction
● Streaming Optimizations
● Fault-Tolerance