streaming analytics better than batch when and why - (big data tech 2017)

Post on 12-Apr-2017

308 Views

Category:

Technology

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Streaming analytics better than batch - when and why ?

_A. Kawa - D. Wysakowicz - K. Zarzycki_

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Have you ever built cool Big Data

pipelines?

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Example Use-Case

■ Can be done in batch and real-time■ User session analytics at Spotify

● Simple stats■ Duration, number of songs, skips,

searches etc.

● Advanced analytics■ Mood, physical activity, real-time content,

ads

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Example Output

How long do users listen to a new edition of Discover Weekly?

_1. Dashboards_

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Example Output

How long do users listen to a new edition of Discover Weekly?

Australian users are listening to Discover Weekly too short !!!

_1. Dashboards_ _2. Alerts_

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Example Output

How long do users listen to a new edition of Discover Weekly?

Australian users are listening to Discover Weekly too short !!!

Recommend songs and ads based on

current activity.

_1. Dashboards_ _2. Alerts_ _3. Content_

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

1st - Batch Architecture

1h1h

1h

1h - 1d

1hUser

EventsUser

Sessions

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

1st - Batch Architecture

1h1h

1h

1d

1hUser

EventsUser

Sessions

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

The More Moving Parts …

⬇ The higher learning curve⬇ The more gluing code⬇ The larger administrative effort⬇ The more error-prone solution

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Long Waiting Time

Image source: “Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkley, Dec 2009 and http://www.slideshare.net/JoshBaer/shortening-the-feedback-loop-big-data-spain-external

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

2nd - Micro-Batch Architecture

1m - 1h

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

♪ ♪

No Built-In Session Windows

♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪

♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪

♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪

[10:00 - 11:00) [11:00 - 12:00)

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

♪ ♪

No Built-In Session Windows

♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪

♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪

♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪

[10:00 - 11:00) [11:00 - 12:00)

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Late Data …

♪ ♪ ♪ ♪ ♪ ♪ Event Time

14:55 - 16:35

Processing Time

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

... Included in Current Batch

♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪

♪ ♪ ♪ ♪ ♪

14:55 - 16:35 16:50 - …

Event Time

Processing Time

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Out-Of-Order Data …

♪ ♫ ♪ Event Time

Processing Time

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Out-Of-Order Data …

♪ ♫ ♪ ♪ ♪ ♫

♪ ♪ ♫

Event Time

Processing Time

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Out-Of-Order Data …

♪ ♫ ♪ ♪ ♪ ♫

♪ ♪ ♫

Event Time

Processing Time

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

... Breaks Correctness

♪ ♫ ♪ ♪ ♪ ♫ ♪ ♫ ♪ ♬ ♫

♪ ♪ ♫ ♪ ♫ ♪ ♬ ♫♪

Event Time

Processing Time

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Problems

FILES, BATCHES, DATA LAKES

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Solving Streaming Problem With Batch?

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

3rd - Streaming-First Architecture

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

User Session Windows

♪User 1 ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪

User 3 ♪ ♪ ♪ ♪ ♪ ♪

Session gap eg. 15 minutes

♪♪

5

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

User Session Windows

♪User 1 ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪

User 3 ♪ ♪ ♪ ♪ ♪ ♪

Session gap eg. 15 minutes

♪♪

5

[3,2]

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Reading From Kafkaval sessionStream : DataStream[SessionStats] = sEnv

.addSource(new KafkaConsumer(...))

♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪♪ ♪ ♪ ♪ ♪ ♪ ♪

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Session Windows With Gapval sessionStream : DataStream[SessionStats] = sEnv

.addSource(new KafkaConsumer(...))

.keyBy(_.userId)

♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪

♪ ♪ ♪ ♪ ♪ ♪ ♪

User 1

User 2

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Session Windows With Gapval sessionStream : DataStream[SessionStats] = sEnv

.addSource(new KafkaConsumer(...))

.keyBy(_.userId)

.window(EventTimeSessionWindows.withGap(Time.minutes(15)))

User 1 ♪ ♪ ♪ ♪ ♪ ♪

Session gap - 15 minutes

♪♪

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Analyzing User Sessionval sessionStream : DataStream[SessionStats] = sEnv

.addSource(new KafkaConsumer(...))

.keyBy(_.userId)

.window(EventTimeSessionWindows.withGap(Time.minutes(15)))

.apply(new CountSessionStats())

User 1 ♪ ♪ ♪ ♪ ♪ ♪ ♪♪

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Handling Late Eventsval sessionStream : DataStream[SessionStats] = sEnv

.addSource(new KafkaConsumer(...))

.keyBy(_.userId)

.window(EventTimeSessionWindows.withGap(Time.minutes(15)))

.allowedLateness(Time.minutes(60))

.apply(new CountSessionStats())

User 1 ♪ ♪ ♪ ♪ ♪ ♪ ♪♪ ♪

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Triggering Early Resultsval sessionStream : DataStream[SessionStats] = sEnv

.addSource(new KafkaConsumer(...))

.keyBy(_.userId)

.window(EventTimeSessionWindows.withGap(Time.minutes(15)))

.trigger(EarlyTriggeringTrigger.every(Time.minutes(10)))

.allowedLateness(Time.minutes(60))

.apply(new CountSessionStats())

User 1 ♪ ♪ ♪ ♪ ♪ ♪ ♪♪

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Sessionization Exampleval sessionStream : DataStream[SessionStats] = sEnv

.addSource(new KafkaConsumer(...))

.keyBy(_.userId)

.window(EventTimeSessionWindows.withGap(Time.minutes(15)))

.trigger(EarlyTriggeringTrigger.every(Time.minutes(10)))

.allowedLateness(Time.minutes(60))

.apply(new CountSessionStats())

Working example:https://github.com/getindata/flink-use-case

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Modern Stream Processing Engines

■ Rich stream processing semantic● Built-in support for event-time windows● Accurate results for late / out-of-order events and replays● Early triggers

■ Low latency and high-throughput■ Exactly-once stateful processing

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Modern Stream Processing Engines

■ Rich stream processing semantic● Built-in support for event-time windows● Accurate results for late / out-of-order events and replays● Early triggers

■ Low latency and high-throughput■ Exactly-once stateful processing

User survey:http://data-artisans.com/flink-user-survey-2016-part-1http://data-artisans.com/flink-user-survey-2016-part-2

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

How can I reprocess data?

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Reprocessing Events In Flink

1. Take periodic snapshots of a job● It stores Kafka offsets, on-flight sessions, application state

2. Restart a job from a savepoint rather than from a beginning

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

What if data is no longer in Kafka?

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Consuming Data From HDFS

■ Run your streaming code on HDFS (bounded data)● You need to read data in event-time based order

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

How to join with other data sets/streams?

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Join With Other Datasets / Streams■ Flink can join windowed streams easily■ Join of data stream with data set is WIP

● Even with slowly changing data set!● Even keyed data

Stream 2

Stream 1

Joined Stream Input Stream Joined Stream

+

Id Name

1 John Doe

2 Jane Doe

Dataset

+

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

When is batch processing good?

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Batch Processing Use-Cases

■ Ad-hoc analytics and data exploration● Notebooks, Spark/Flink/Hive, Parquet, complete data sets

■ Technical advantages● A large swaths of historical data in HDFS● High-level libraries in mature batch technologies

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Batch Processing Use-Cases

■ Ad-hoc analytics and data exploration● Notebooks, Spark/Flink/Hive, Parquet, complete data sets

■ Implementation advantages● Offline experiments over large historical data

■ Historical events are usually stored in HDFS, not Kafka

● High-level libraries in batch processing technologies■ Spark MLlib, H2O

(when data arrives continuously)

don’t solve streaming problem

with batch jobs

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

I like this streaming API.Can I use it for batch?

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Unified batch and streaming API

■ Not with raw Flink API■ But with Flink Table API■ Apache Beam

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Who Are You, actually?

■ At GetInData, we build custom Big Data solutions● Hadoop, Flink, Spark, Kafka and more

■ Our team is today represented by

Krzysztof Zarzycki

Dawid Wysakowicz

Adam Kawa

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

■ Stream often the natural representation of your data

■ Stream processing is not only about low latency

Summary

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Q&A

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Thanks ! Big Data Tech Warsaw !

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Log Abstraction

11:00 - 12:00

12:00 - 13:00

… …

10:00 - …

10:00 - …

10:00 - 11:00

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Spark Structured Streaming

⬇ It’s still ALPHA and the APIs are still experimental⬇ Operates on top of micro-batches (Spark SQL engine)

⬆ Easy-to-learn API (Dataset/DataFrame)⬆ Rich ecosystem of tools and libraries e.g. MLlib⬆ Supports event-time

⬇ Sessionization not yet supported - SPARK-10816⬇ Queryable state not yet supported - SPARK-16738

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Kafka Streams

⬇ No exactly-once (just at-least-once)⬇ Kafka as the only data source⬇ No bounded streams (batch) optimizations

⬆ Simplicity⬆ Embedded into application⬆ Supports event-time

⬇ Lack of session windows

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Apache Beam

⬆ Unified API for batch and streaming⬆ Rich streaming processing semantics

⬆ Complex TriggerDSL⬆ Multiple runtime environments

⬆ Spark, Flink, Apex, Dataflow⬆ Side inputs and outputs

⬇ Verbose Java API⬇ New project - Top level since 01/2017

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

Google Dataflow

■ Runtime environment for Apache Beam in Google Cloud

⬇ No support for Iterative Computations⬆ Supports Side Outputs⬆ Works with every Google Cloud Service (Pub/Sub, BigTable

etc.)

top related