Streaming analytics better than batch - when and why? (Big Data Tech 2017)

© Copyright. All rights reserved. Not to be reproduced without prior written consent.

A. Kawa, D. Wysakowicz, K. Zarzycki

Upload: getindata

Post on 12-Apr-2017


Have you ever built cool Big Data pipelines?

Example Use-Case

■ Can be done in batch and real-time
■ User session analytics at Spotify
  ● Simple stats
    ■ Duration, number of songs, skips, searches etc.
  ● Advanced analytics
    ■ Mood, physical activity, real-time content, ads

Example Output

1. Dashboards: How long do users listen to a new edition of Discover Weekly?
2. Alerts: Australian users are listening to Discover Weekly for too short a time!
3. Content: Recommend songs and ads based on current activity.

1st - Batch Architecture

[Diagram: pipeline from User Events to User Sessions; the stages are labeled with delays of 1h each, and one stage with 1h - 1d.]

The More Moving Parts …

⬇ The higher the learning curve
⬇ The more glue code
⬇ The larger the administrative effort
⬇ The more error-prone the solution

Long Waiting Time

Image source: “Continuous Analytics: Stream Query Processing in Practice”, Michael J. Franklin, Professor, UC Berkeley, Dec 2009, and http://www.slideshare.net/JoshBaer/shortening-the-feedback-loop-big-data-spain-external

2nd - Micro-Batch Architecture

[Diagram: micro-batch pipeline, labeled 1m - 1h.]

No Built-In Session Windows

[Diagram: listening events (♪) of several users on a timeline divided into the fixed hourly windows [10:00 - 11:00) and [11:00 - 12:00). The fixed windows do not follow session boundaries.]
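
A minimal sketch of why fixed hourly windows are a poor fit for sessions. It is plain Scala, not code from the talk; the `Event` type and the timestamps (minutes since midnight) are made up for illustration. A single session that runs from 10:50 to 11:10 is counted as two partial sessions, one per hourly window.

```scala
object HourlyBuckets {
  // Hypothetical event type: user id and event time in minutes since midnight.
  final case class Event(userId: Int, minute: Int)

  // Index of the fixed hourly window [h:00, h+1:00) that contains the event.
  def hourOf(e: Event): Int = e.minute / 60

  // Count events per (user, hourly window), the way an hourly batch job would.
  def countPerHour(events: Seq[Event]): Map[(Int, Int), Int] =
    events.groupBy(e => (e.userId, hourOf(e))).map { case (key, es) => key -> es.size }

  def main(args: Array[String]): Unit = {
    // One continuous session of user 1: a song every 5 minutes from 10:50 to 11:10.
    val session = (650 to 670 by 5).map(m => Event(1, m))
    // The session is split: 2 events fall into the 10:00 window, 3 into the 11:00 window.
    println(countPerHour(session))
  }
}
```

Stitching the two halves back together needs extra code that looks across batch boundaries, which is the glue the streaming approach avoids.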

Late Data …

[Diagram: events (♪) plotted by event time against processing time; a session spans 14:55 - 16:35.]

... Included in Current Batch

[Diagram: same axes; late events from the 14:55 - 16:35 session are processed together with the session starting at 16:50 - ….]

Out-Of-Order Data …

[Diagram: events (♪ ♫) plotted by event time against processing time; the order in which events are processed differs from the order in which they occurred.]

... Breaks Correctness

[Diagram: same axes with more events (♪ ♫ ♬); the processing-time order no longer matches the event-time order.]
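
To make the correctness problem concrete, here is a small plain-Scala sketch (not from the talk). The function `sessionize`, the 15-minute gap and the timestamps are illustrative assumptions. It splits a sequence of event times into sessions by comparing each event with the one seen just before it. Fed in arrival order, the out-of-order events produce three sessions; fed in event-time order, the same events form one.

```scala
object OutOfOrder {
  // Split event times (in minutes) into sessions: an event joins the current
  // session if it is at most `gap` minutes away from the previously seen event,
  // otherwise it starts a new session. The input order is taken as given.
  def sessionize(times: Seq[Int], gap: Int): List[List[Int]] =
    times.foldLeft(List.empty[List[Int]]) {
      case (current :: done, t) if math.abs(t - current.head) <= gap =>
        (t :: current) :: done
      case (sessions, t) =>
        List(t) :: sessions
    }.map(_.reverse).reverse

  def main(args: Array[String]): Unit = {
    // Event times listed in the order they arrived (processing-time order).
    val arrivalOrder = Seq(0, 10, 40, 20, 30)
    println(sessionize(arrivalOrder, gap = 15))        // three sessions
    println(sessionize(arrivalOrder.sorted, gap = 15)) // one session
  }
}
```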

Problems

FILES, BATCHES, DATA LAKES

Solving Streaming Problem With Batch?

3rd - Streaming-First Architecture

User Session Windows

[Diagram: per-user event timelines (User 1, User 3). Events of one user that are closer together than the session gap (e.g. 15 minutes) belong to the same session. Labels on the slide: 5 and [3,2].]
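
The idea of a session window can be sketched in plain Scala, independent of any streaming engine. This is an illustration under assumptions, not the talk's code: the `Event` type and the timestamps are made up, chosen so that the result matches the labels 5 and [3,2] on the slide, read here as events per session.

```scala
object SessionWindows {
  // Hypothetical event type: user id and event time in minutes.
  final case class Event(userId: Int, minute: Int)

  // For each user, the number of events in each session, in time order.
  // A new session starts when two consecutive events are more than `gap` minutes apart.
  def sessionSizes(events: Seq[Event], gap: Int): Map[Int, List[Int]] =
    events.groupBy(_.userId).map { case (user, es) =>
      val times = es.map(_.minute).sorted
      // Pair every event time with its predecessor and grow or close the current session.
      val sizes = times.tail.zip(times).foldLeft(List(1)) {
        case (n :: rest, (t, prev)) if t - prev <= gap => (n + 1) :: rest
        case (acc, _)                                  => 1 :: acc
      }.reverse
      user -> sizes
    }

  def main(args: Array[String]): Unit = {
    val events =
      Seq(0, 5, 10, 15, 20).map(Event(1, _)) ++ // user 1: one session of 5 events
      Seq(0, 5, 10, 40, 45).map(Event(3, _))    // user 3: 30-minute pause after minute 10
    println(sessionSizes(events, gap = 15))
  }
}
```

Unlike the hourly buckets of the batch design, the window boundaries here come from the data itself.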

Reading From Kafka

val sessionStream: DataStream[SessionStats] = sEnv
  .addSource(new KafkaConsumer(...))  // read user events from Kafka

Session Windows With Gap

val sessionStream: DataStream[SessionStats] = sEnv
  .addSource(new KafkaConsumer(...))
  .keyBy(_.userId)  // partition the stream by user (User 1, User 2, ...)

Session Windows With Gap

val sessionStream: DataStream[SessionStats] = sEnv
  .addSource(new KafkaConsumer(...))
  .keyBy(_.userId)
  .window(EventTimeSessionWindows.withGap(Time.minutes(15)))  // session gap: 15 minutes

Analyzing User Session

val sessionStream: DataStream[SessionStats] = sEnv
  .addSource(new KafkaConsumer(...))
  .keyBy(_.userId)
  .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
  .apply(new CountSessionStats())  // compute statistics for each session

Handling Late Events

val sessionStream: DataStream[SessionStats] = sEnv
  .addSource(new KafkaConsumer(...))
  .keyBy(_.userId)
  .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
  .allowedLateness(Time.minutes(60))  // still accept events that arrive up to 60 minutes late
  .apply(new CountSessionStats())

Triggering Early Results

val sessionStream: DataStream[SessionStats] = sEnv
  .addSource(new KafkaConsumer(...))
  .keyBy(_.userId)
  .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
  .trigger(EarlyTriggeringTrigger.every(Time.minutes(10)))  // emit early results every 10 minutes
  .allowedLateness(Time.minutes(60))
  .apply(new CountSessionStats())

Sessionization Example

val sessionStream: DataStream[SessionStats] = sEnv
  .addSource(new KafkaConsumer(...))
  .keyBy(_.userId)
  .window(EventTimeSessionWindows.withGap(Time.minutes(15)))
  .trigger(EarlyTriggeringTrigger.every(Time.minutes(10)))
  .allowedLateness(Time.minutes(60))
  .apply(new CountSessionStats())

Working example: https://github.com/getindata/flink-use-case

Modern Stream Processing Engines

■ Rich stream processing semantics
  ● Built-in support for event-time windows
  ● Accurate results for late / out-of-order events and replays
  ● Early triggers
■ Low latency and high throughput
■ Exactly-once stateful processing

User survey:
http://data-artisans.com/flink-user-survey-2016-part-1
http://data-artisans.com/flink-user-survey-2016-part-2

How can I reprocess data?

Reprocessing Events In Flink

1. Take periodic snapshots (savepoints) of a job
  ● A snapshot stores Kafka offsets, in-flight sessions, and application state
2. Restart the job from a savepoint rather than from the beginning
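
The two steps above can be illustrated with a toy model in plain Scala. This is not Flink's API, where snapshots are taken and restored by the runtime; the `Snapshot` type, the `run` function and the log contents are assumptions made for the sketch. The point is that a snapshot holding both the read offset and the state lets a restarted job reach the same result as an uninterrupted one.

```scala
object Savepoints {
  // Toy job state: the next log offset to read and a running event count per user.
  final case class Snapshot(offset: Int, counts: Map[Int, Int])

  // Process log entries (user ids) from the snapshot's offset up to `until` (exclusive)
  // and return the resulting state, which can itself be kept as a snapshot.
  def run(log: Vector[Int], from: Snapshot, until: Int): Snapshot =
    (from.offset until until).foldLeft(from) { (state, i) =>
      val user = log(i)
      Snapshot(i + 1, state.counts.updated(user, state.counts.getOrElse(user, 0) + 1))
    }

  def main(args: Array[String]): Unit = {
    val log   = Vector(1, 2, 1, 3, 1, 2)
    val empty = Snapshot(0, Map.empty)

    val uninterrupted = run(log, empty, log.length)
    val savepoint     = run(log, empty, 3)              // snapshot after three events
    val restarted     = run(log, savepoint, log.length) // resume from the snapshot

    println(uninterrupted == restarted)
  }
}
```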

What if data is no longer in Kafka?

Consuming Data From HDFS

■ Run your streaming code on HDFS (bounded data)
  ● You need to read the data in event-time order

How to join with other data sets/streams?

Join With Other Datasets / Streams

■ Flink can join windowed streams easily
■ Joining a data stream with a data set is WIP
  ● Even with a slowly changing data set!
  ● Even keyed data

[Diagram: Stream 1 + Stream 2 → Joined Stream; Input Stream + Dataset → Joined Stream]

Dataset:

Id  Name
1   John Doe
2   Jane Doe
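
A stream-to-data-set join can be pictured as a per-event lookup. The sketch below is plain Scala and only illustrates the idea, not Flink's join API; the `Event` and `Enriched` types and the `song` field are hypothetical, while the lookup table reuses the two rows from the slide.

```scala
object EnrichmentJoin {
  // Hypothetical stream element and join result.
  final case class Event(userId: Int, song: String)
  final case class Enriched(userId: Int, name: String, song: String)

  // Inner join: each event is matched with the data set by user id;
  // events without a matching row are dropped.
  def enrich(stream: Iterator[Event], users: Map[Int, String]): Iterator[Enriched] =
    stream.collect {
      case e if users.contains(e.userId) => Enriched(e.userId, users(e.userId), e.song)
    }

  def main(args: Array[String]): Unit = {
    val users  = Map(1 -> "John Doe", 2 -> "Jane Doe")
    val stream = Iterator(Event(1, "a"), Event(3, "b"), Event(2, "c"))
    enrich(stream, users).foreach(println)
  }
}
```

With a slowly changing data set the lookup table itself has to be updated while the job runs, which is the harder part the slide refers to.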

When is batch processing good?

Batch Processing Use-Cases

■ Ad-hoc analytics and data exploration
  ● Notebooks, Spark/Flink/Hive, Parquet, complete data sets
■ Implementation advantages
  ● Offline experiments over large historical data
    ■ Historical events are usually stored in HDFS, not Kafka
  ● High-level libraries in mature batch processing technologies
    ■ Spark MLlib, H2O

Don’t solve a streaming problem with batch jobs (when data arrives continuously).

I like this streaming API. Can I use it for batch?

Unified batch and streaming API

■ Not with raw Flink API
■ But with Flink Table API
■ Apache Beam

Who Are You, actually?

■ At GetInData, we build custom Big Data solutions
  ● Hadoop, Flink, Spark, Kafka and more

■ Our team is today represented by

Krzysztof Zarzycki

Dawid Wysakowicz

Adam Kawa

Summary

■ A stream is often the natural representation of your data
■ Stream processing is not only about low latency

Q&A

Thanks! Big Data Tech Warsaw!

Log Abstraction

[Diagram: a log with hourly ranges 10:00 - 11:00, 11:00 - 12:00, 12:00 - 13:00, … and open-ended ranges 10:00 - ….]

Spark Structured Streaming

⬇ It’s still ALPHA and the APIs are still experimental
⬇ Operates on top of micro-batches (Spark SQL engine)

⬆ Easy-to-learn API (Dataset/DataFrame)
⬆ Rich ecosystem of tools and libraries, e.g. MLlib
⬆ Supports event-time

⬇ Sessionization not yet supported - SPARK-10816
⬇ Queryable state not yet supported - SPARK-16738

Kafka Streams

⬇ No exactly-once (just at-least-once)
⬇ Kafka as the only data source
⬇ No bounded streams (batch) optimizations

⬆ Simplicity
⬆ Embedded into application
⬆ Supports event-time

⬇ Lack of session windows

Apache Beam

⬆ Unified API for batch and streaming
⬆ Rich streaming processing semantics
⬆ Complex TriggerDSL
⬆ Multiple runtime environments
  ⬆ Spark, Flink, Apex, Dataflow
⬆ Side inputs and outputs

⬇ Verbose Java API
⬇ New project - Top level since 01/2017

Google Dataflow

■ Runtime environment for Apache Beam in Google Cloud

⬇ No support for Iterative Computations
⬆ Supports Side Outputs
⬆ Works with every Google Cloud Service (Pub/Sub, BigTable etc.)