Ufuk Celebi @iamuce The Stream Processor as a Database

Post on 12-Jul-2020


Page 1: The Stream Processor as a Database - Francisco...§Streaming applications are often not bound by the stream processor itself. Cross system interaction is frequently the biggest bottleneck

Ufuk Celebi @iamuce

The Stream Processor as a Database

The (Classic) Use Case: Realtime Counts and Aggregates

(Real-)Time Series Statistics

Stream of Events, Real-time Statistics

The Architecture

collect, message queue, analyze, serve & store

The Flink Job

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class Impressions(id: String, impressions: Long)

// Read events from Kafka (consumer arguments are elided on the slide)
val events: DataStream[Event] = env.addSource(new FlinkKafkaConsumer09(…))

// Keep impression events and project them to (id, impressions)
val impressions: DataStream[Impressions] = events.filter(evt => evt.isImpression).map(evt => Impressions(evt.id, evt.numImpressions))

// Sum impressions per id over one-hour windows
val counts: DataStream[Impressions] = impressions.keyBy("id").timeWindow(Time.hours(1)).sum("impressions")
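To make the windowed aggregation concrete, the following is a minimal plain-Scala sketch (no Flink dependency) of the result the job is meant to produce: impressions summed per id and per one-hour tumbling window. `TimedImpression`, its `timestampMs` field, and `HourlyCounts` are hypothetical names introduced for illustration; the slides do not show the `Event` type or how timestamps are assigned, and non-negative epoch-millisecond timestamps are assumed.

```scala
// Hypothetical input record: the slide does not show Event, so the
// timestamp field (epoch milliseconds, non-negative) is an assumption.
final case class TimedImpression(id: String, timestampMs: Long, impressions: Long)

object HourlyCounts {
  val HourMs: Long = 60L * 60L * 1000L

  // Start of the one-hour tumbling window that contains the timestamp.
  def windowStart(timestampMs: Long): Long = timestampMs - (timestampMs % HourMs)

  // Group by (id, window start) and sum impressions, which is roughly what
  // keyBy("id").timeWindow(Time.hours(1)).sum("impressions") computes.
  def hourlyCounts(events: Seq[TimedImpression]): Map[(String, Long), Long] =
    events
      .groupBy(e => (e.id, windowStart(e.timestampMs)))
      .map { case (key, group) => key -> group.map(_.impressions).sum }
}
```

Unlike this batch-style sketch, the streaming job keeps the running sum of each open window as operator state and updates it as events arrive.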


The Flink Job

[Diagram: two parallel pipelines, each Kafka Source -> filter() -> map() -> keyBy() -> window()/sum() with State -> Sink]

Putting it all together

Periodically (every second) flush new aggregates to Redis
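The periodic flush described above can be sketched as follows. This is an illustration under stated assumptions, not the code from the talk: `KeyValueStore` is a hypothetical stand-in for a Redis client, and `AggregateFlusher` is a hypothetical helper that remembers which aggregates changed since the last flush.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the external key/value store (Redis on the slide).
trait KeyValueStore {
  def put(key: String, value: Long): Unit
}

// Tracks the latest aggregates and which keys changed since the last flush.
final class AggregateFlusher(store: KeyValueStore) {
  private val latest = mutable.Map.empty[String, Long]
  private val dirty = mutable.Set.empty[String]

  // Called by the stream job whenever an aggregate is updated.
  def update(key: String, value: Long): Unit = synchronized {
    latest(key) = value
    dirty += key
  }

  // Called periodically (every second on the slide). Writes only the keys
  // that changed and returns how many writes were issued.
  def flush(): Int = synchronized {
    val keys = dirty.toList
    keys.foreach(k => store.put(k, latest(k)))
    dirty.clear()
    keys.size
  }
}
```

In a real deployment `flush()` could be scheduled with `java.util.concurrent.Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(...)`. Every `put` is then a round trip to an external system, so the cost of a flush grows with the number of keys that changed in the interval.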

The Bottleneck

Writes to the key/value store take too long

Queryable State

Queryable State

Optional, and only at the end of windows

Queryable State: Application View

[Diagram labels: Application, Query Service, Database; realtime results (current time windows), older results (past time windows)]

Queryable State Enablers

§ Flink has state as a first-class citizen

§ State is fault tolerant (exactly-once semantics)

§ State is partitioned (sharded) together with the operators that create/update it

§ State is continuous (not mini-batched)

§ State is scalable

State in Flink

[Diagram: Source / filter() / map() feeding window()/sum()]

State index (e.g., RocksDB)

Events are persistent and ordered (per partition/key) in the message queue (e.g., Apache Kafka)

Events flow without replication or synchronous writes

State in Flink

Trigger checkpoint: inject checkpoint barrier

State in Flink

Take state snapshot: trigger state copy-on-write

State in Flink

Persist state snapshots: durably persist snapshots asynchronously

Processing pipeline continues
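The snapshot-then-persist idea from the checkpoint slides can be illustrated with a small plain-Scala sketch. It is not Flink's implementation: it uses an immutable `Map` as a simple form of copy-on-write, and `CountingOperator` and the `persist` callback are hypothetical names standing in for an operator and a write to durable storage.

```scala
import scala.concurrent.{ExecutionContext, Future}

// Sketch of an operator whose state is an immutable map. Taking a snapshot
// only captures a reference; later updates build new maps and never modify
// the captured one (a simple form of copy-on-write).
final class CountingOperator {
  @volatile private var state: Map[String, Long] = Map.empty

  // Normal processing: add to the running count for a key.
  def process(key: String, delta: Long): Unit = synchronized {
    state = state.updated(key, state.getOrElse(key, 0L) + delta)
  }

  // Invoked when a checkpoint barrier arrives: constant-time snapshot.
  def snapshot(): Map[String, Long] = state

  // Persist the snapshot in the background. `persist` is a hypothetical
  // callback standing in for a write to durable storage.
  def checkpoint(persist: Map[String, Long] => Unit)(implicit ec: ExecutionContext): Future[Unit] = {
    val snap = snapshot()
    Future(persist(snap))
  }
}
```

Because the snapshot is never mutated, `process` can keep running while the `Future` writes it out, which mirrors the "processing pipeline continues" step.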

Queryable State: Implementation

[Diagram components: Query Client; JobManager (ExecutionGraph, State Location Server); two TaskManagers, each with a State Registry and window()/sum() holding local state. Edge labels: deploy, status, register]

Query: /job/state-name/key

(1) Get location of "key-partition" of "job"

(2) Look up location

(3) Respond location

(4) Query state-name and key
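The four query steps above can be sketched in plain Scala. The class names below echo the labels on the slide, but they are hypothetical and do not correspond to Flink's actual classes; the hash-based partitioning, the `Long` value type, and the in-process calls (instead of network requests) are all simplifying assumptions.

```scala
// Hypothetical address of a TaskManager.
final case class Location(taskManager: String)

// Runs on the JobManager: knows which TaskManager owns each key partition.
final class StateLocationServer(numPartitions: Int, owners: Map[Int, Location]) {
  def partitionOf(key: String): Int = Math.floorMod(key.hashCode, numPartitions)

  // Steps (1)-(3): receive a location request, look it up, respond.
  // `job` and `stateName` are unused because this sketch models one job.
  def locate(job: String, stateName: String, key: String): Option[Location] =
    owners.get(partitionOf(key))
}

// Runs on each TaskManager: operators register their local state here.
final class StateRegistry {
  private var states = Map.empty[String, Map[String, Long]]

  def register(stateName: String, localState: Map[String, Long]): Unit =
    states = states.updated(stateName, localState)

  // Step (4): answer a query for state-name and key from local state.
  def query(stateName: String, key: String): Option[Long] =
    states.get(stateName).flatMap(_.get(key))
}

final class QueryClient(server: StateLocationServer, registries: Map[Location, StateRegistry]) {
  // Query: /job/state-name/key
  def query(job: String, stateName: String, key: String): Option[Long] =
    for {
      location <- server.locate(job, stateName, key)
      registry <- registries.get(location)
      value    <- registry.query(stateName, key)
    } yield value
}
```

The point of the split is that the location lookup is separate from the state read: the client asks the JobManager side only where a key lives, then reads the value from the TaskManager that holds it.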

Queryable State Performance

Conclusion

Takeaways

§ Streaming applications are often not bound by the stream processor itself. Cross-system interaction is frequently the biggest bottleneck

§ Queryable state mitigates a big bottleneck: communication with external key/value stores to publish realtime results

§ Apache Flink's sophisticated support for state makes this possible

Takeaways: Performance of Queryable State

§ Data persistence is fast with logs

• Append only, and streaming replication

§ Computed state is fast with local data structures and no synchronous replication

§ Flink's checkpoint method makes computed state persistent with low overhead

Questions?

§ eMail: [email protected]

§ Twitter: @iamuce

§ Code/Demo: https://github.com/dataArtisans/flink-queryable_state_demo

Appendix

Flink Runtime + APIs

DataStream API

Runtime: Distributed Streaming Data Flow

Table API & Stream SQL

ProcessFunction API

Building Blocks: Streams, Time, State

Apache Flink Architecture Review