The Stream Processor as a Database: Apache Flink

Stephan Ewen (@stephanewen)

Upload: dataworks-summithadoop-summit

Post on 14-Apr-2017

TRANSCRIPT

Page 1: The Stream Processor as a Database Apache Flink

Stephan Ewen (@stephanewen)

The Stream Processor as a Database

Apache Flink

Page 2

Streaming technology is enabling the obvious: continuous processing on data that is continuously produced

Page 3: Apache Flink Stack

DataStream API: Stream Processing

DataSet API: Batch Processing

Runtime: Distributed Streaming Data Flow

Libraries

Streaming and batch as first class citizens.

Page 4: Programs and Dataflows

Source:

val lines: DataStream[String] = env.addSource(new FlinkKafkaConsumer09(…))

Transformation:

val events: DataStream[Event] = lines.map((line) => parse(line))

Transformation:

val stats: DataStream[Statistic] = events
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .apply(new MyAggregationFunction())

Sink:

stats.addSink(new RollingSink(path))

Diagram (Streaming Dataflow): parallel operator instances Source[1], map()[1], keyBy()/window()/apply()[1], Sink[1] and Source[2], map()[2], keyBy()/window()/apply()[2]

Page 5: What makes Flink flink?

True Streaming
• Low latency
• High throughput
• Well-behaved flow control (back pressure)

Event Time
• Make more sense of data
• Works on real-time and historic data

APIs / Libraries
• Flexible windows (time, count, session, roll-your-own)
• Complex Event Processing

Stateful Streaming
• Globally consistent savepoints
• Exactly-once semantics for fault tolerance
• Windows & user-defined state

Page 6: The (Classic) Use Case: Realtime Counts and Aggregates

Page 7: (Real)Time Series Statistics

Diagram: stream of events -> realtime statistics

Page 8: The Architecture

Diagram stages: collect, log, analyze, serve & store

Page 9: The Flink Job

case class Impressions(id: String, impressions: Long)

val events: DataStream[Event] = env.addSource(new FlinkKafkaConsumer09(…))

val impressions: DataStream[Impressions] = events
  .filter(evt => evt.isImpression)
  .map(evt => Impressions(evt.id, evt.numImpressions))

val counts: DataStream[Impressions] = impressions
  .keyBy("id")
  .timeWindow(Time.hours(1))
  .sum("impressions")
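To make concrete what the job on this slide computes, here is a minimal sketch in plain Scala collections, with no Flink dependency. It evaluates the same filter, map, key-by-id, one-hour tumbling window, and sum over a finite batch. The `Event` fields (in particular `timestampMs`), `windowStart`, and `hourlyCounts` are assumptions made for illustration; the talk does not show the `Event` class, and a real Flink job processes an unbounded stream incrementally rather than a finished `Seq`.

```scala
object HourlyImpressions {
  // Hypothetical event type; the slides do not show the real Event class.
  final case class Event(id: String, timestampMs: Long, isImpression: Boolean, numImpressions: Long)
  final case class Impressions(id: String, impressions: Long)

  val HourMs: Long = 60L * 60L * 1000L

  // Start of the one-hour tumbling window containing the timestamp
  // (assumes non-negative timestamps).
  def windowStart(timestampMs: Long): Long = timestampMs - (timestampMs % HourMs)

  // filter -> map -> keyBy(id) -> timeWindow(1 hour) -> sum(impressions),
  // evaluated eagerly over a finite batch instead of an unbounded stream.
  def hourlyCounts(events: Seq[Event]): Map[(String, Long), Impressions] =
    events
      .filter(_.isImpression)
      .groupBy(e => (e.id, windowStart(e.timestampMs)))
      .map { case (key @ (id, _), es) =>
        key -> Impressions(id, es.map(_.numImpressions).sum)
      }
}
```

Each result is keyed by (id, window start), which corresponds to one aggregate per key and hour in the streaming job.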

Page 10: The Flink Job

Diagram: two parallel pipelines, each KafkaSource -> filter() -> map() -> keyBy() -> window()/sum() -> Sink

Page 11: Putting it all together

Periodically (every second) flush new aggregates to Redis

Page 12: How does it perform?

Latency, Throughput, Number of Keys

Page 13: Yahoo! Streaming Benchmark

Latency (lower is better)

Chart: 99th percentile latency (sec) versus throughput (1000 events/sec, axis from 60 to 180) for Storm 0.10, Flink 0.10, and Spark Streaming 1.5

Page 14: Extended Benchmark: Throughput

• 10 Kafka brokers with 2 partitions each
• 10 compute machines (Flink / Storm)
• Xeon [email protected] CPU (4 cores HT)
• 32 GB RAM (only 8 GB allocated to JVMs)
• 10 GigE Ethernet between compute nodes
• 1 GigE Ethernet between Kafka cluster and Flink nodes

Page 15: Scaling Number of Users

Number of Keys

Yahoo! Streaming Benchmark has 100 keys only
• Every second, only 100 keys are written to the key/value store
• Quite few, compared to many real world use cases

Tweet impressions: millions of keys/hour
• Up to millions of keys updated per second

Page 16: Performance

Number of Keys

Page 17: The Bottleneck

Writes to the key/value store take too long

Page 18: Queryable State

Page 19: Queryable State

Page 20: Queryable State

Optional, and only at the end of windows

Page 21: Queryable State Enablers

Flink has state as a first class citizen

State is fault tolerant (exactly once semantics)

State is partitioned (sharded) together with the operators that create/update it

State is continuous (not mini batched)

State is scalable (e.g., embedded RocksDB state backend)

Page 22: Queryable State Status

[FLINK-3779] / Pull Request #2051: Queryable State Prototype

Design and implementation under evolution

Some experiments were using earlier versions of the implementation

Exact numbers may differ in final implementation, but order of magnitude is comparable

Page 23: Queryable State Performance

Page 24: Queryable State: Application View

Application only interested in latest realtime results

Diagram: Application

Page 25: Queryable State: Application View

Application requires both the latest realtime results and older results

Diagram: Application, Query Service, Database; realtime results (current time windows) and older results (past time windows)

Page 26: Apache Flink Architecture Review

Page 27: Queryable State: Implementation

Query: /job/operation/state-name/key

(1) Get location of "key-partition" for "operator" of "job"
(2) Look up location
(3) Respond location
(4) Query state-name and key

Diagram components: Query Client; Job Manager (ExecutionGraph, State Location Server); two Task Managers, each with a State Registry and a window()/sum() operator holding local state. Edge labels: deploy, status, register.
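The four numbered steps on this slide can be sketched as a two-hop lookup: ask a location service which task manager owns the key's partition, then query that task manager's local state directly. The following plain-Scala sketch is illustrative only; `TaskManager`, `LocationServer`, `QueryClient`, and the hash-modulo partitioning are hypothetical stand-ins, not the API or partitioning scheme of the Flink prototype.

```scala
object QueryableStateSketch {
  // Hypothetical stand-in for a task manager: holds the local state of the
  // key partitions assigned to it, indexed by state name and then by key.
  final class TaskManager(val name: String) {
    private var registry = Map.empty[String, Map[String, Long]]

    // Called by the operator running on this task manager.
    def update(stateName: String, key: String, value: Long): Unit = {
      val state = registry.getOrElse(stateName, Map.empty[String, Long])
      registry = registry.updated(stateName, state.updated(key, value))
    }

    // Step (4): answer a query from local state.
    def query(stateName: String, key: String): Option[Long] =
      registry.get(stateName).flatMap(_.get(key))
  }

  // Hypothetical stand-in for the state location server.
  final class LocationServer(taskManagers: Vector[TaskManager]) {
    // Steps (1)-(3): map the key to its partition and return the owner.
    // Hash-modulo partitioning is an assumption made for this sketch.
    def locate(key: String): TaskManager =
      taskManagers(Math.floorMod(key.hashCode, taskManagers.size))
  }

  final class QueryClient(server: LocationServer) {
    def get(stateName: String, key: String): Option[Long] =
      server.locate(key).query(stateName, key)
  }
}
```

The point of the design is that the value itself never passes through the location service; only the location does, so reads go straight to the node that already holds the state.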

Page 28: Contrasting with key/value stores

Page 29: Turning the Database Inside Out

Cf. Martin Kleppmann's talks on re-designing data warehousing based on log-centric processing

This view picks up some of these concepts

Queryable State in Apache Flink = (Turning DB inside out)++

Page 30: Write Path in Cassandra (simplified)

From the Apache Cassandra docs

Page 31: Write Path in Cassandra (simplified)

From the Apache Cassandra docs

First step is durable write to the commit log (in all databases that offer strong durability)

Memtable is a re-computable view of the commit log actions and the persistent SSTables

Page 32: Write Path in Cassandra (simplified)

From the Apache Cassandra docs

First step is durable write to the commit log (in all databases that offer strong durability)

Memtable is a re-computable view of the commit log actions and the persistent SSTables

Replication to quorum before write is acknowledged

Page 33: Durability of Queryable State

Diagram: snapshot state

Page 34: Durability of Queryable State

Event sequence is the ground truth and is durably stored in the log already

Queryable state is re-computable from checkpoint and log

Snapshot replication can happen in the background

Diagram: snapshot state
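The claim that queryable state is re-computable from checkpoint and log can be illustrated with a small plain-Scala sketch: start from the last snapshot and replay only the log records that came after it. `Record`, `Checkpoint`, `applyRecord`, and `recover` are hypothetical names for this illustration; the state here is a simple per-key sum, and Flink's actual recovery mechanics are more involved.

```scala
object RecoverySketch {
  // Hypothetical log record: its offset in the durable log, a key, and an amount to add.
  final case class Record(offset: Long, key: String, amount: Long)

  // A checkpoint pairs a state snapshot with the log position it reflects:
  // every record with offset < nextOffset is already included in `state`.
  final case class Checkpoint(nextOffset: Long, state: Map[String, Long])

  // The state transition: add the record's amount to the running sum for its key.
  def applyRecord(state: Map[String, Long], r: Record): Map[String, Long] =
    state.updated(r.key, state.getOrElse(r.key, 0L) + r.amount)

  // Rebuild the current state from the snapshot plus the log suffix.
  def recover(cp: Checkpoint, log: Seq[Record]): Map[String, Long] =
    log.filter(_.offset >= cp.nextOffset).foldLeft(cp.state)(applyRecord)
}
```

Because the log is the ground truth, recovering from any checkpoint yields the same state as replaying the whole log from offset 0; the checkpoint only shortens the replay.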

Page 35: Performance of Flink's State

Diagram: Source / filter() / map() -> window()/sum(), with a state index (e.g., RocksDB)

Events are persistent and ordered (per partition / key) in the log (e.g., Apache Kafka)

Events flow without replication or synchronous writes

Page 36: Performance of Flink's State

Diagram: Source / filter() / map() -> window()/sum()

Trigger checkpoint: inject checkpoint barrier

Page 37: Performance of Flink's State

Diagram: Source / filter() / map() -> window()/sum()

Take state snapshot: with RocksDB, trigger state copy-on-write

Page 38: Performance of Flink's State

Diagram: Source / filter() / map() -> window()/sum()

Persist state snapshots: durably persist snapshots asynchronously

Processing pipeline continues
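The idea of taking a cheap snapshot and persisting it in the background while processing continues can be sketched in plain Scala. This is an analogy, not Flink's or RocksDB's mechanism: the sketch uses an immutable `Map`, so a snapshot is just a retained reference and later updates create new versions. `CountingOperator` and `SnapshotStore` are hypothetical names, and the durable store is simulated in memory.

```scala
import scala.concurrent.{ExecutionContext, Future}

object SnapshotSketch {
  // Operator state held as an immutable map: a snapshot is a retained
  // reference, and later updates build new versions without touching it.
  // Assumes a single processing thread, as for one operator instance.
  final class CountingOperator {
    private var counts = Map.empty[String, Long]

    def process(key: String): Unit =
      counts = counts.updated(key, counts.getOrElse(key, 0L) + 1L)

    def current: Map[String, Long] = counts

    // Called when a checkpoint barrier arrives; constant time.
    def snapshot(): Map[String, Long] = counts
  }

  // Hypothetical durable store, simulated with an in-memory vector.
  final class SnapshotStore {
    @volatile private var persisted = Vector.empty[Map[String, Long]]

    // Persist in the background; the caller does not wait for completion.
    def persist(s: Map[String, Long])(implicit ec: ExecutionContext): Future[Unit] =
      Future { synchronized { persisted = persisted :+ s } }

    def all: Vector[Map[String, Long]] = persisted
  }
}
```

The operator can keep calling `process` immediately after `snapshot()`; the persisted copy still reflects the state exactly as it was at the barrier.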

Page 39: Conclusion

Page 40: Takeaways

Streaming applications are often not bound by the stream processor itself. Cross-system interaction is frequently the biggest bottleneck

Queryable state mitigates a big bottleneck: communication with external key/value stores to publish realtime results

Apache Flink's sophisticated support for state makes this possible

Page 41: Takeaways: Performance of Queryable State

Data persistence is fast with logs (Apache Kafka)
• Append only, and streaming replication

Computed state is fast with local data structures and no synchronous replication (Apache Flink)

Flink's checkpoint method makes computed state persistent with low overhead

Page 42: Go Flink!

True Streaming
• Low latency
• High throughput
• Well-behaved flow control (back pressure)

Event Time
• Make more sense of data
• Works on real-time and historic data

APIs / Libraries
• Flexible windows (time, count, session, roll-your-own)
• Complex Event Processing

Stateful Streaming
• Globally consistent savepoints
• Exactly-once semantics for fault tolerance
• Windows & user-defined state

Page 43: Flink Forward 2016, Berlin

Submission deadline: June 30, 2016
Early bird deadline: July 15, 2016

www.flink-forward.org

Page 44: We are hiring!

data-artisans.com/careers