Post on 21-Jan-2018

WINDOWING DATA IN BIG DATA STREAMS

ADAM WARSKI, WOLVESSUMMIT

ADAM WARSKI, SOFTWAREMILL, @ADAMWARSKI

BIG DATA? FAST DATA?

▸ What is big data?

▸ Shift of focus

▸ Processing speed

▸ Fast data -> streaming

WHAT IS STREAMING?

“A TYPE OF DATA PROCESSING ENGINE THAT IS DESIGNED WITH INFINITE DATA SETS IN MIND”

Tyler Akidau, Google

WINDOWING

▸ Time becomes the focus point

▸ How many invalid password errors were there in the last 5 minutes?

▸ During which 30-minute window did we get most traffic?

▸ What’s the average 5-minute speed on a section of a highway throughout the day?
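
The first question above (errors in the last 5 minutes) can be sketched in plain Python. This is only an illustration, not tied to any streaming engine: `RecentEventCounter` and its methods are hypothetical names, timestamps are plain seconds, and events are assumed to arrive in time order.

```python
from collections import deque


class RecentEventCounter:
    """Counts events seen in the last `span_seconds` (hypothetical helper)."""

    def __init__(self, span_seconds):
        self.span_seconds = span_seconds
        self.timestamps = deque()

    def add(self, timestamp):
        # Assumes timestamps arrive in non-decreasing order.
        self.timestamps.append(timestamp)

    def count(self, now):
        # Evict everything at or before the start of the window (now - span).
        cutoff = now - self.span_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        return len(self.timestamps)


counter = RecentEventCounter(span_seconds=300)
for error_time in [10, 100, 290, 320, 400]:
    counter.add(error_time)

# The window is (100, 400], so the errors at 10 and 100 no longer count.
errors_last_5_min = counter.count(now=400)
print(errors_last_5_min)
```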

HOW TO DO STREAMING? WITH WINDOWS?

▸ Many possibilities:

▸ Spark Streaming

▸ Spark Structured Streaming

▸ Kafka Streams

▸ Flink

▸ Akka Streams

▸ …

WHICH ONE TO CHOOSE?

LET’S FIND OUT

/ME

▸ coder @

▸ Lightbend, Confluent, Datastax consulting partner

▸ mainly Scala

▸ open-source: MacWire, ElasticMQ, Quicklens, …

▸ http://www.warski.org / @adamwarski

WHAT’S THE TIME?

▸ How to associate time with an event:

▸ event time: “logical”, data-dependent

▸ ingestion time: when the event entered the system

▸ processing time: when the event is being processed
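
To make the three notions of time concrete, here is a small sketch in plain Python. The `Event` fields, `timestamp_of`, `window_start` and the fixed `clock` are all hypothetical; the point is only that the same event can land in a different 60-second window depending on which time is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    payload: str
    event_time: int      # "logical" time carried in the data itself
    ingestion_time: int  # stamped when the event entered the system


def timestamp_of(event, notion, processing_clock):
    """Pick the timestamp for an event according to the chosen time notion."""
    if notion == "event":
        return event.event_time
    if notion == "ingestion":
        return event.ingestion_time
    if notion == "processing":
        # Processing time is not stored on the event; it is read when we get to it.
        return processing_clock()
    raise ValueError(f"unknown time notion: {notion}")


def window_start(timestamp, size):
    """Start of the fixed window of length `size` containing `timestamp`."""
    return timestamp - timestamp % size


login_failed = Event("invalid password", event_time=100, ingestion_time=130)
clock = lambda: 190  # stand-in for a wall clock at processing time

starts = {
    notion: window_start(timestamp_of(login_failed, notion, clock), 60)
    for notion in ("event", "ingestion", "processing")
}
print(starts)
```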

TYPES OF WINDOWS

▸ Time-based

▸ fixed/tumbling

▸ sliding

▸ Session-based

[Figure: window types laid out along a time axis]
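
The window types listed above can be sketched as three small assignment functions. This is a plain-Python illustration with hypothetical function names; windows are half-open `[start, end)` intervals over integer timestamps, and a session is closed by a gap of at least `gap` without events.

```python
def tumbling_window(timestamp, size):
    """The single fixed window containing the timestamp."""
    start = timestamp - timestamp % size
    return (start, start + size)


def sliding_windows(timestamp, size, slide):
    """All windows of length `size`, started every `slide`, containing the timestamp."""
    windows = []
    start = timestamp - timestamp % slide  # latest window start <= timestamp
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def session_windows(timestamps, gap):
    """Group timestamps into sessions separated by at least `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] < gap:
            sessions[-1][1] = ts        # extend the current session
        else:
            sessions.append([ts, ts])   # start a new session
    return [(first, last) for first, last in sessions]


print(tumbling_window(125, size=60))
print(sliding_windows(125, size=60, slide=30))
print(session_windows([1, 2, 3, 20, 21, 50], gap=10))
```

An element belongs to exactly one tumbling window, to `size / slide` sliding windows, and to a session whose bounds depend on the data itself.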

OUT-OF-ORDER: WATERMARKS, LATENESS

▸ Windows GC

▸ At some point, enough is enough

▸ Watermark:

▸ all events before X have been observed

▸ heuristics
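
One common heuristic is to assume that events arrive at most some fixed delay out of order, so the watermark trails the largest event time seen so far. The sketch below assumes exactly that; the class and function names are hypothetical. It counts events per tumbling window, closes (garbage-collects) a window once the watermark passes its end, and drops events that arrive for an already closed window.

```python
class BoundedDelayWatermark:
    """Watermark = largest event time seen so far minus a fixed allowed delay."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_seen = float("-inf")

    def observe(self, event_time):
        self.max_seen = max(self.max_seen, event_time)

    def current(self):
        return self.max_seen - self.max_delay


def process_with_watermark(event_times, size, max_delay):
    watermark = BoundedDelayWatermark(max_delay)
    open_windows = {}  # window start -> running count
    closed = {}        # window start -> final count
    dropped = []       # events that arrived after their window was closed
    for t in event_times:
        start = t - t % size
        if start + size <= watermark.current():
            dropped.append(t)  # too late: "enough is enough"
            continue
        open_windows[start] = open_windows.get(start, 0) + 1
        watermark.observe(t)
        # Close every window whose end is at or before the watermark.
        for s in sorted(open_windows):
            if s + size <= watermark.current():
                closed[s] = open_windows.pop(s)
    return closed, open_windows, dropped


closed, still_open, dropped = process_with_watermark(
    [1, 4, 12, 8, 17, 3, 25], size=10, max_delay=5
)
print(closed, still_open, dropped)
```

In this run the event at time 8 is out of order but still accepted, while the event at time 3 arrives after window `[0, 10)` has been closed and is dropped.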

TRIGGERS

▸ When to emit window results

▸ Watermark progress

▸ Event time progress

▸ Processing time progress

▸ Punctuations
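
A trigger can be thought of as a predicate that decides, after each element, whether the window's current result should be emitted. The sketch below is a simplified plain-Python model for a single window, with hypothetical names: one trigger fires once the watermark reaches the window end, the other fires after every `n` elements.

```python
def watermark_trigger(window_end):
    """Fire whenever the watermark is at or past the end of the window."""
    def should_fire(count, watermark):
        return watermark >= window_end
    return should_fire


def count_trigger(n):
    """Fire after every n-th element."""
    def should_fire(count, watermark):
        return count % n == 0
    return should_fire


def emissions(values_with_watermarks, trigger):
    """Feed (value, watermark after that value) pairs into one window.

    Returns the running sums emitted each time the trigger fires.
    """
    total, emitted = 0, []
    for count, (value, watermark) in enumerate(values_with_watermarks, start=1):
        total += value
        if trigger(count, watermark):
            emitted.append(total)
    return emitted


stream = [(1, 2), (2, 5), (3, 8), (4, 10)]
on_watermark = emissions(stream, watermark_trigger(window_end=10))
every_second = emissions(stream, count_trigger(2))
print(on_watermark, every_second)
```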

ACCUMULATION OF RESULTS

▸ If we trigger many times …

▸ discard

▸ accumulate

▸ retract & accumulate
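
The three modes differ in what is emitted each time the same window fires. A minimal sketch, assuming a sum aggregation and a hypothetical `emit` function that receives the elements that arrived between consecutive firings (called panes here):

```python
def emit(panes, mode):
    """Return the values emitted for one window across several trigger firings."""
    emitted, total, previous = [], 0, None
    for pane in panes:
        pane_sum = sum(pane)
        if mode == "discard":
            # Only what arrived since the last firing; downstream must add it up.
            emitted.append(pane_sum)
        elif mode == "accumulate":
            # The full result so far; each emission overrides the previous one.
            total += pane_sum
            emitted.append(total)
        elif mode == "retract_and_accumulate":
            # Withdraw the previous result, then emit the new full result.
            total += pane_sum
            if previous is not None:
                emitted.append(-previous)
            emitted.append(total)
            previous = total
        else:
            raise ValueError(f"unknown mode: {mode}")
    return emitted


panes = [[7], [3, 4], [8]]
discarded = emit(panes, "discard")
accumulated = emit(panes, "accumulate")
retracted = emit(panes, "retract_and_accumulate")
print(discarded, accumulated, retracted)
```

Summing the emissions gives the correct total (22) for the discarding and retracting modes, whereas in accumulating mode only the last emission is the correct total.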

FINALLY … HOW TO MANIPULATE THE DATA

▸ map, flatMap, filter …

▸ stateful computation

▸ fold, reduce

▸ past-dependent operations

▸ where to store the state
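
The difference between stateless and stateful operations can be sketched in plain Python. `fold_by_key` is a hypothetical helper, and the dict it takes stands in for whatever state store an engine would use (local memory, a key-value store, a topic).

```python
events = [("alice", 3), ("bob", 5), ("alice", 4)]

# Stateless: each element is transformed on its own (map, filter).
doubled = [(key, value * 2) for key, value in events]
only_alice = [(key, value) for key, value in events if key == "alice"]


def fold_by_key(events, zero, combine, state=None):
    """Stateful: the result for each element depends on what came before it."""
    state = {} if state is None else state
    for key, value in events:
        state[key] = combine(state.get(key, zero), value)
    return state


add = lambda acc, value: acc + value
totals = fold_by_key(events, 0, add)

# Passing the same state back in continues the computation with later events.
totals = fold_by_key([("bob", 1)], 0, add, state=totals)
print(totals)
```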

SUMMING UP

▸ Event/ingestion/processing time

▸ Tumbling/sliding/session windows

▸ Watermarks

▸ Triggers

▸ Accumulation of results

▸ State management

SPARK STREAMING

▸ Micro-batches (DStream)

▸ .window() API:

▸ tumbling/sliding windows

▸ only processing time

▸ no watermarks

▸ triggers at the end of the window

▸ state persisted in cluster (e.g. updateStateByKey())
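
The micro-batch window model can be pictured with a plain-Python sketch. This is not the Spark API: `windowed_batches` is a hypothetical function, each inner list stands for the data received during one batch interval (processing time), and the window length and slide are expressed as a number of batches.

```python
def windowed_batches(batches, window_len, slide):
    """Emit, every `slide` batches, the contents of the last `window_len` batches.

    Results are produced only at the end of each window. The first windows
    are partial because fewer than `window_len` batches exist yet.
    """
    results = []
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window_len)
        results.append([x for batch in batches[start:end] for x in batch])
    return results


batches = [[1], [2, 3], [4], [5, 6]]

sliding = windowed_batches(batches, window_len=2, slide=1)
tumbling = windowed_batches(batches, window_len=2, slide=2)
print(sliding)
print(tumbling)
```

Because windows are built from whole batches as they arrive, there is no place in this model for event time or late data.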

SPARK STREAMING - WHY BOTHER?

▸ Popular

▸ Not only streaming

▸ ML

▸ SQL

▸ GraphX

▸ but …

SPARK STRUCTURED STREAMING

▸ Alpha in Spark 2.0

▸ Micro-batches not exposed

▸ groupBy(window(…))

▸ Event-time support

▸ No watermarks or session windows (2.1?)

▸ Trigger: processing time; outputs changed windows

▸ Exactly-once processing*

FLINK

▸ Mostly with keyed streams (parallelism)

▸ TimeCharacteristic: event/ingestion/processing

▸ TimestampAssigner: also generates watermarks

▸ WindowAssigner: arbitrary, built-in tumbling, sliding, session

▸ Trigger: event/processing time, count, single/continuous

▸ Window function: fold/reduce/with-kv-state

▸ Exactly-once* / at-least-once

KAFKA STREAMS

▸ State: Kafka topics/local key-value backed by a topic for resiliency

▸ Watermarks: no, but windows are retained for 1 day

▸ Time: event/ingestion/processing; TimestampExtractor

▸ Tumbling/sliding windows

▸ Trigger: after every element

▸ aggregate by key&window into an ever-updating KTable

▸ At-least-once
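
The "ever-updating table" model with a trigger after every element can be sketched in plain Python. This is not the Kafka Streams API: `windowed_count_changelog` is a hypothetical function that counts records per key and tumbling window, and records every update to the table as a changelog entry.

```python
def windowed_count_changelog(records, size):
    """Count records per (key, window start); emit an update after every element."""
    table = {}      # (key, window start) -> current count
    changelog = []  # every update to the table, in order
    for key, timestamp in records:
        window_start = timestamp - timestamp % size
        table_key = (key, window_start)
        table[table_key] = table.get(table_key, 0) + 1
        changelog.append((table_key, table[table_key]))
    return table, changelog


records = [("a", 1), ("b", 2), ("a", 7), ("a", 12)]
table, changelog = windowed_count_changelog(records, size=10)
print(table)
print(changelog)
```

A consumer of the changelog sees each window's count several times and must treat a later entry for the same key and window as replacing the earlier one.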

AKKA STREAMS

▸ Single-node, no clustering

▸ No OOTB support, but quite easy to implement:

▸ Windows: arbitrary, assign windows to each element

▸ Trigger: only window-close

▸ State: local

▸ Watermarks: can be implemented

SUMMING UP

▸ Spark: widely used, some features missing

▸ Flink: versatile

▸ Kafka: simple model

▸ Akka: single-node

SUMMING UP

▸ Windowing is just one of the aspects

▸ Other:

▸ State management

▸ Work distribution

▸ Processing guarantees

SUMMING UP

▸ Other stream processing systems out there!

▸ Apache Storm

▸ Google Cloud Dataflow

▸ Amazon Kinesis

▸ Apache Beam

▸ …

LINKS

▸ Streaming 101 & 102: 

▸ https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

▸ https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

▸ https://softwaremill.com/windowing-data-in-akka-streams/

THANKS!

ADAM WARSKI

@ADAMWARSKI