
Budapest Data Forum, 2018

Structured Streaming in Spark

Spark / Big Data / Cloud Computing Trainings Building Data Infrastructures for Industry 4.0 & Online

Why Real-time?

Why Spark Streaming?

Why Real-time?

How to choose a streaming tool?

The Apache landscape

streams

Sometimes you just want to keep it simple

+

Remember this from 1 hour ago?

So, our fancy tools

streams

How to choose a fancy streaming tool?

Popularity

See the bigger picture

Throughput

source: https://databricks.com/blog/2017/10/11/benchmarking-structured-streaming-on-databricks-runtime-against-state-of-the-art-streaming-systems.html

*as the Spark folks measured it

Throughput

source:https://data-artisans.com/blog/curious-case-broken-benchmark-revisiting-apache-flink-vs-databricks-runtime

*as the Flink folks measured it

Developers!

Latency: Native Streaming (event-based processing) vs. Microbatching

streams

trident

source: https://www.theguardian.com/technology/2014/feb/05/why-google-engineers-designers
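The latency difference is easiest to see in code. Below is a minimal plain-Python sketch (not Spark or Storm code; all names are illustrative) contrasting per-event processing with microbatching:

```python
def run_event_based(events, handle):
    # Native streaming: each record is handled the moment it arrives,
    # so per-record latency is minimal.
    for e in events:
        handle(e)

def run_microbatch(events, handle_batch, batch_size=3):
    # Microbatching: records are buffered and handled in small groups,
    # trading per-record latency for throughput.
    batch = []
    for e in events:
        batch.append(e)
        if len(batch) == batch_size:
            handle_batch(batch)
            batch = []
    if batch:               # flush the final, possibly partial, batch
        handle_batch(batch)

batches = []
run_microbatch(range(7), batches.append, batch_size=3)
# batches == [[0, 1, 2], [3, 4, 5], [6]]
```

A record arriving just after a batch closes waits a full batch interval before it is processed; that wait is the latency cost of microbatching.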

Structured Streaming

Pain points to solve

• Interoperability: batch, interactive and real-time analytics

• Event time based processing: event time instead of processing time

• End-to-end guarantees: consistent data throughout the whole pipeline, exactly-once processing
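To make the event-time point concrete: windows must be built from the timestamp embedded in each event, not from when the event happened to arrive. A minimal plain-Python sketch of tumbling-window counting (illustrative names, not the Spark API):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_sec=60):
    # Bucket each event by when it *happened* (event time),
    # regardless of the order in which the events arrive.
    counts = defaultdict(int)
    for event_time, _payload in events:
        window_start = (event_time // window_sec) * window_sec
        counts[window_start] += 1
    return dict(counts)

# The event at t=65 arrives out of order but still lands
# in the correct [60, 120) window.
events = [(10, "a"), (70, "b"), (65, "c"), (130, "d")]
assert tumbling_window_counts(events) == {0: 1, 60: 2, 120: 1}
```

Grouping by arrival (processing) time would have put the t=65 event in whatever window was open when it showed up, skewing the counts.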


Unbounded Table

image credit: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
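The unbounded-table idea: treat the stream as an input table that only ever grows, and the streaming query as an ordinary query whose result is kept up to date as rows are appended. A plain-Python sketch of the model (in Spark itself this would be a `readStream` source with a `groupBy(...).count()` query; names here are illustrative):

```python
from collections import Counter

input_table = []                     # the "unbounded" input table

def result():
    # The same query you would run on a static table; Structured
    # Streaming's job is to keep this result up to date incrementally.
    return Counter(row["word"] for row in input_table)

input_table.append({"word": "spark"})
input_table.append({"word": "kafka"})
assert result() == Counter({"spark": 1, "kafka": 1})

input_table.append({"word": "spark"})    # a new micro-batch arrives
assert result() == Counter({"spark": 2, "kafka": 1})
```

This is what buys the interoperability: the query is the same whether the table is static (batch) or still growing (streaming).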


Late data

Handling late data with Watermarking
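A watermark bounds how late an event may be and still be accepted: it trails the maximum event time seen so far by the allowed delay, and anything older is dropped so window state can eventually be finalized. In Spark this is `withWatermark("eventTime", "10 minutes")` on a streaming DataFrame; below is a minimal plain-Python simulation of the rule (illustrative names):

```python
def watermark_filter(event_times, delay=10):
    # Watermark = max event time seen so far, minus the allowed lateness.
    # Events whose timestamp falls behind the watermark are dropped.
    max_seen = float("-inf")
    kept, dropped = [], []
    for t in event_times:
        max_seen = max(max_seen, t)
        if t >= max_seen - delay:
            kept.append(t)
        else:
            dropped.append(t)
    return kept, dropped

kept, dropped = watermark_filter([100, 105, 96, 112, 99], delay=10)
assert kept == [100, 105, 96, 112]   # 96 is late but within the delay
assert dropped == [99]               # too late: watermark is 112 - 10
```

Choosing the delay is a trade-off: a larger value tolerates later data but forces Spark to keep window state around longer.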


The drama of Exactly-once processing (Act I)

Spark: give me data

Kafka: you were at the 10th line, there you go with the 11th.

Spark: Hey Postgres, store the results please.

Postgres: OK!

Spark: got it, thanks! Consider line 11 done.

Spark: give me data

Kafka: you were at the 11th line, there you go with the 12th.

...

The drama of Exactly-once processing (Act II)

Spark: give me data

Kafka: you were at the 12th line, there you go with the 13th.

Spark: got it, thanks! Consider line 13 done.

Spark: Hey Postgres, store the re.....

Claudius: Hey Spark, got thirsty? ;)
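The way out of the drama is to commit offsets only *after* the sink write succeeds, and to make the write idempotent, so a crash between write and commit causes a harmless replay rather than a lost line. A plain-Python sketch of that recovery logic (illustrative names, not Spark's actual checkpointing code):

```python
class Sink:
    def __init__(self):
        self.rows = {}

    def upsert(self, key, value):
        # Idempotent write: replaying the same key after a crash
        # overwrites with the same value, so duplicates are harmless.
        self.rows[key] = value

def process(source, sink, state):
    # Offsets are committed only after the write succeeds; a crash
    # between write and commit means a replay, never data loss.
    offset = state.get("offset", 0)
    for i in range(offset, len(source)):
        sink.upsert(i, source[i])
        state["offset"] = i + 1

source = ["line 11", "line 12", "line 13"]
sink, state = Sink(), {}
process(source, sink, state)

# Simulate a crash that lost the last offset commit: replay from 2.
state["offset"] = 2
process(source, sink, state)
assert sink.rows == {0: "line 11", 1: "line 12", 2: "line 13"}
```

This is essentially what Structured Streaming's checkpointing plus an idempotent (or transactional) sink gives you: at-least-once delivery that the sink turns into exactly-once results.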

Demo

Summary

• Only use fancy tools if you need them ;)

• Structured Streaming

• Great concept

• Access to core Spark functionalities

• Will probably take 1-2 years to become feature-rich

Questions?

Zoltan Toth zoltan@datapao.com

+36 30 291 3599
