
Budapest Data Forum, 2018

Structured Streaming in Spark

Spark / Big Data / Cloud Computing Trainings Building Data Infrastructures for Industry 4.0 & Online

Why Real-time?

Why Spark Streaming?

Why Real-time?

How to choose a streaming tool?

The Apache landscape

streams

Sometimes you just want to keep it simple

+

Remember this from 1 hour ago?

So, our fancy tools

streams

How to choose a fancy streaming tool?

Popularity

See the bigger picture

Throughput

source: https://databricks.com/blog/2017/10/11/benchmarking-structured-streaming-on-databricks-runtime-against-state-of-the-art-streaming-systems.html

*as the Spark folks measured it

Throughput

source:https://data-artisans.com/blog/curious-case-broken-benchmark-revisiting-apache-flink-vs-databricks-runtime

*as the Flink folks measured it

Developers!

Latency: Native Streaming (event-based processing) vs. Microbatching

streams

trident

source: https://www.theguardian.com/technology/2014/feb/05/why-google-engineers-designers
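The latency difference is easiest to see in code. Below is a minimal plain-Python sketch (not Spark or Storm code; all names are illustrative) contrasting per-event processing with microbatching:

```python
def run_event_based(events, handle):
    # Native streaming: each record is handled the moment it arrives,
    # so per-record latency is minimal.
    for e in events:
        handle(e)

def run_microbatch(events, handle_batch, batch_size=3):
    # Microbatching: records are buffered and handled in small groups,
    # trading per-record latency for throughput.
    batch = []
    for e in events:
        batch.append(e)
        if len(batch) == batch_size:
            handle_batch(batch)
            batch = []
    if batch:               # flush the final, possibly partial, batch
        handle_batch(batch)

batches = []
run_microbatch(range(7), batches.append, batch_size=3)
# batches == [[0, 1, 2], [3, 4, 5], [6]]
```

A record arriving just after a batch closes waits a full batch interval before it is processed; that wait is the latency cost of microbatching.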

Structured Streaming

Pain points to solve

• Interoperability: batch, interactive and real-time analytics

• Event time based processing: event time instead of processing time

• End-to-end guarantees: consistent data throughout the whole pipeline, exactly-once processing
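To make the event-time point concrete: windows must be built from the timestamp embedded in each event, not from when the event happened to arrive. A minimal plain-Python sketch of tumbling-window counting (illustrative names, not the Spark API):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_sec=60):
    # Bucket each event by when it *happened* (event time),
    # regardless of the order in which the events arrive.
    counts = defaultdict(int)
    for event_time, _payload in events:
        window_start = (event_time // window_sec) * window_sec
        counts[window_start] += 1
    return dict(counts)

# The event at t=65 arrives out of order but still lands
# in the correct [60, 120) window.
events = [(10, "a"), (70, "b"), (65, "c"), (130, "d")]
assert tumbling_window_counts(events) == {0: 1, 60: 2, 120: 1}
```

Grouping by arrival (processing) time would have put the t=65 event in whatever window was open when it showed up, skewing the counts.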


Unbounded Table

image credit: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
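The unbounded-table idea: treat the stream as an input table that only ever grows, and the streaming query as an ordinary query whose result is kept up to date as rows are appended. A plain-Python sketch of the model (in Spark itself this would be a `readStream` source with a `groupBy(...).count()` query; names here are illustrative):

```python
from collections import Counter

input_table = []                     # the "unbounded" input table

def result():
    # The same query you would run on a static table; Structured
    # Streaming's job is to keep this result up to date incrementally.
    return Counter(row["word"] for row in input_table)

input_table.append({"word": "spark"})
input_table.append({"word": "kafka"})
assert result() == Counter({"spark": 1, "kafka": 1})

input_table.append({"word": "spark"})    # a new micro-batch arrives
assert result() == Counter({"spark": 2, "kafka": 1})
```

This is what buys the interoperability: the query is the same whether the table is static (batch) or still growing (streaming).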


Late data

Handling late data with Watermarking
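A watermark bounds how late an event may be and still be accepted: it trails the maximum event time seen so far by the allowed delay, and anything older is dropped so window state can eventually be finalized. In Spark this is `withWatermark("eventTime", "10 minutes")` on a streaming DataFrame; below is a minimal plain-Python simulation of the rule (illustrative names):

```python
def watermark_filter(event_times, delay=10):
    # Watermark = max event time seen so far, minus the allowed lateness.
    # Events whose timestamp falls behind the watermark are dropped.
    max_seen = float("-inf")
    kept, dropped = [], []
    for t in event_times:
        max_seen = max(max_seen, t)
        if t >= max_seen - delay:
            kept.append(t)
        else:
            dropped.append(t)
    return kept, dropped

kept, dropped = watermark_filter([100, 105, 96, 112, 99], delay=10)
assert kept == [100, 105, 96, 112]   # 96 is late but within the delay
assert dropped == [99]               # too late: watermark is 112 - 10
```

Choosing the delay is a trade-off: a larger value tolerates later data but forces Spark to keep window state around longer.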


The drama of Exactly-once processing (Act I)

Spark: give me data

Kafka: you were at the 10th line, there you go with the 11th.

Spark: Hey Postgres, store the results please.

Postgres: OK!

Spark: got it, thanks! Consider line 11 done.

Spark: give me data

Kafka: you were at the 11th line, there you go with the 12th.

...

The drama of Exactly-once processing (Act II)

Spark: give me data

Kafka: you were at the 12th line, there you go with the 13th.

Spark: got it, thanks! Consider line 13 done.

Spark: Hey Postgres, store the re.....

Claudius: Hey Spark, got thirsty? ;)
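The way out of the drama is to commit offsets only *after* the sink write succeeds, and to make the write idempotent, so a crash between write and commit causes a harmless replay rather than a lost line. A plain-Python sketch of that recovery logic (illustrative names, not Spark's actual checkpointing code):

```python
class Sink:
    def __init__(self):
        self.rows = {}

    def upsert(self, key, value):
        # Idempotent write: replaying the same key after a crash
        # overwrites with the same value, so duplicates are harmless.
        self.rows[key] = value

def process(source, sink, state):
    # Offsets are committed only after the write succeeds; a crash
    # between write and commit means a replay, never data loss.
    offset = state.get("offset", 0)
    for i in range(offset, len(source)):
        sink.upsert(i, source[i])
        state["offset"] = i + 1

source = ["line 11", "line 12", "line 13"]
sink, state = Sink(), {}
process(source, sink, state)

# Simulate a crash that lost the last offset commit: replay from 2.
state["offset"] = 2
process(source, sink, state)
assert sink.rows == {0: "line 11", 1: "line 12", 2: "line 13"}
```

This is essentially what Structured Streaming's checkpointing plus an idempotent (or transactional) sink gives you: at-least-once delivery that the sink turns into exactly-once results.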

Demo

Summary

• Only use fancy tools if you need them ;)

• Structured Streaming

• Great concept

• Access to core Spark functionalities

• Will probably take 1-2 years to become feature-rich

Questions?

Zoltan Toth zoltan@datapao.com

+36 30 291 3599
