unified batch and stream processing with flink @ big data beers berlin may 2015

43
Robert Metzger Flink committer @rmetzger_ Apache Flink

Upload: robert-metzger

Post on 06-Aug-2015

183 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Robert MetzgerFlink committer

@rmetzger_

ApacheFlink

Page 2: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

1 year of Flink - code

April 2014 April 2015

Page 3: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Community growth

3

Aug-10 Feb-11 Sep-11 Apr-12 Oct-12 May-13 Nov-13 Jun-14 Dec-14 Jul-150

20

40

60

80

100

120

#unique contributors by git commits

Page 4: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

What is Flink?

4

Gelly

Table

ML

SA

MO

A

DataSet (Java/Scala/Python)DataStream (Java/Scala)

Hadoop M

/R

Local Remote Yarn Tez Embedded

Data

flow

Data

flow

(W

iP)

MR

QL

Table

Casc

adin

g

(WiP

)

Streaming dataflow runtime

Page 5: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Program compilation

5

case class Path (from: Long, to: Long)val tc = edges.iterate(10) { paths: DataSet[Path] => val next = paths .join(edges) .where("to") .equalTo("from") { (path, edge) => Path(path.from, edge.to) } .union(paths) .distinct() next }

Optimizer

Type extraction

stack

Task schedulin

g

Dataflow metadata

Pre-flight (Client)

MasterWorkers

DataSource

orders.tbl

Filter

MapDataSour

celineitem.tbl

JoinHybrid Hash

buildHT

probe

hash-part [0] hash-part [0]

GroupRed

sort

forward

Program

DataflowGraph

deployoperators

trackintermediate

results

Page 6: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Native workload support

6

Flink

Streaming topologies

Long batchpipelines

Machine Learning at scale

How can an engine natively support all these workloads?

And what does "native" mean?

Graph Analysis

Page 7: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

E.g.: Non-native iterations

7

Step Step Step Step Step

Client

for (int i = 0; i < maxIterations; i++) {// Execute MapReduce job

}

Page 8: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

E.g.: Non-native streaming

8

streamdiscretizer

Job Job Job Jobwhile (true) { // get next few records // issue batch job}

Page 9: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Native workload support

9

Flink

Streaming topologies

Heavy batch jobs

Machine Learning at scale

How can an engine natively support all these workloads?

And what does native mean?

Page 10: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Flink Engine

1. Execute everything as streams

2. Allow some iterative (cyclic) dataflows

3. Allow some mutable state

4. Operate on managed memory

10

Page 11: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Flink by Use Case

11

Page 12: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Data Streaming Analysisstreaming dataflows

12

Page 13: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

3 Parts of a Streaming Infrastructure

13

Gathering

Broker Analysis

Sensors

Transactionlogs …

Server Logs

Page 14: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

3 Parts of a Streaming Infrastructure

14

Gathering

Broker Analysis

Sensors

Transactionlogs …

Server Logs

Result may be fed back to the broker

Page 15: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Cornerstones of Flink Streaming

Pipelined stream processor (low latency)

Expressive APIs

Flexible operator state, streaming windows

Efficient fault tolerance for streams and state.

15

Page 16: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Pipelined stream processor

16

StreamingShuffle!

Page 17: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

17

Expressive APIs

case class Word (word: String, frequency: Int)

val lines: DataStream[String] = env.fromSocketStream(...)

lines.flatMap {line => line.split(" ") .map(word => Word(word,1))} .window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS)) .groupBy("word").sum("frequency") .print()

val lines: DataSet[String] = env.readTextFile(...)

lines.flatMap {line => line.split(" ") .map(word => Word(word,1))} .groupBy("word").sum("frequency") .print()

DataSet API (batch):

DataStream API (streaming):

Page 18: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Checkpointing / Recovery

18

Chandy-Lamport Algorithm for consistent asynchronous distributed snapshots

Pushes checkpoint barriersthrough the data flow

Operator checkpointstarting

Checkpoint done

Data Stream

barrier

Before barrier =part of the snapshot

After barrier =Not in snapshot

Checkpoint done

checkpoint in progress

(backup till next snapshot)

Page 19: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Long batch pipelinesBatch on Streaming

19

Page 20: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

20

Batch Pipelines

Page 21: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Batch on Streaming

Batch programs are a special kind of streaming program

21

Infinite Streams Finite Streams

Stream Windows Global View

PipelinedData Exchange

Pipelined or Blocking Exchange

Streaming Programs Batch Programs

Page 22: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

22

Batch Pipelines

Data exchange (shuffle / broadcast)is mostly streamed

Some operators block (e.g. sorts / hash tables)

Page 23: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

23

Operators Execution Overlaps

Page 24: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

24

Memory Management

Page 25: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Memory Management

25

Page 26: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Smooth out-of-core performance

26More at: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html

Blue bars are in-memory, orange bars (partially) out-of-core

Page 27: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Table API

27

val customers = envreadCsvFile(…).as('id, 'mktSegment) .filter("mktSegment = AUTOMOBILE")

val orders = env.readCsvFile(…) .filter( o => dateFormat.parse(o.orderDate).before(date) ) .as("orderId, custId, orderDate, shipPrio")

val items = orders .join(customers).where("custId = id") .join(lineitems).where("orderId = id") .select("orderId, orderDate, shipPrio, extdPrice * (Literal(1.0f) – discount) as revenue")

val result = items .groupBy("orderId, orderDate, shipPrio") .select('orderId, revenue.sum, orderDate, shipPrio")

Page 28: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Machine Learning Algorithms

Iterative data flows

28

Page 29: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

29

Iterate by looping

for/while loop in client submits one job per iteration step

Data reuse by caching in memory and/or disk

Step Step Step Step Step

Client

Page 30: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

30

Iterate in the Dataflow

Page 31: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Example: Matrix Factorization

31

Factorizing a matrix with28 billion ratings forrecommendations

More at: http://data-artisans.com/computing-recommendations-with-flink.html

Page 32: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Graph AnalysisStateful Iterations

32

Page 33: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

33

Iterate natively with state/deltas

Page 34: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Effect of delta iterations…#

of

ele

men

ts u

pd

ate

d

iteration

Page 35: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

35

… fast graph analysis

More at: http://data-artisans.com/data-analysis-with-flink.html

Page 36: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Closing

36

Page 37: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

37

Flink Roadmap for 2015

Some examples: More flexible state and state backends in

streaming

Master Failover

Improved monitoring

Integration with other Apache projects• SAMOA, Zeppelin, Ignite

More additions to the libraries

Page 38: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

Flink Forward registration & call for abstracts is open now

flink.apache.org 38

• 12. and 13. October 2015• Kulturbrauerei Berlin• With Flink Workshops/Training!

Page 39: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

39

Page 40: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

flink.apache.org@ApacheFlink

Page 41: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

41

Page 42: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

42

Page 43: Unified batch and stream processing with Flink @ Big Data Beers Berlin May 2015

43

Examples of optimization

Task chaining• Coalesce map/filter/etc tasks

Join optimizations• Broadcast/partition, build/probe side, hash or sort-

merge

Interesting properties• Re-use partitioning and sorting for later operations

Automatic caching• E.g., for iterations