distributed systems for stream processing · distributed systems for stream processing apache kafka...

45
Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid

Upload: others

Post on 22-May-2020

14 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Distributed systems for stream processing

Apache Kafka and Spark Structured Streaming

Alena Hall - @lenadroid

Page 2: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

ü HighScale&BigDataü DistributedSystemsü FunctionalProgrammingü DataScience&MachineLearning

Alena Hall - lenadroid

Page 3: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Ever-increasing

Data

lenadroid

Page 4: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Direct result of some action

lenadroid

Page 5: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Produced as a side effect

lenadroid

Page 6: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Continuous metrics

lenadroid

Page 7: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Reaction

urgent not-so-urgent flexible

lenadroid

Page 8: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Reaction

urgent not-so-urgent flexible

near-real-time~ seconds

real-time~ sub milliseconds

batch~minutes, hours, days, weeks

lenadroid

Page 9: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Event Ingestion Processing & Reaction

real-time micro-batch

batch

lenadroid

Page 10: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Are data workflows flexible enough?

lenadroid

Page 11: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Challenges

Simplicity. Scalability. Reliability

lenadroid

Page 12: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Meet Apache Kafka

lenadroid

Page 13: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Apache Kafka is an open-source stream-

processing software platform developed by the

Apache Software Foundation written in Scala

and Java.

lenadroid

Page 14: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Kafka Brokers

lenadroid

Page 15: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Inside of a Kafka Topic

0 1 2 3 4

0 1 2 3

80 1 2 3 4 5 6 7

lenadroid

Page 16: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Kafka Topic Partition

80 1 2 3 4 5 6 7

lenadroid

Page 17: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Create a Kafka topic

lenadroid

Page 18: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data
Page 19: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Kafka Producers and Consumers

lenadroid

Page 20: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Systems for stream processing

Kafka

Storm

Spark

Flink

lenadroid

Page 21: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Meet Apache Spark

lenadroid

Page 22: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Apache Spark is a unified analytics engine for large-scale data

processing: batch, streaming, machine learning, graph

computation with access to data in hundreds of sources.

lenadroid

Page 23: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

ü Spark SQL and batch processing

ü Stream processing with Spark Streaming and Structured

Streaming

ü * Continuous processing

ü Machine Learning with Mllib

ü Graph computations with GraphX

* Experimental lenadroid

Page 24: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Spark program

lenadroid

Page 25: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

How does Spark work?

lenadroid

Page 26: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Sparkapplication(Driver)

ClusterManager

task

task

task

task

task

task

task

task

task

lenadroid

Page 27: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Apache Kafka + Apache Spark

lenadroid

Page 28: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Existing infrastructure and resources

ü Kafka cluster (HDInsight or other)

ü Spark cluster (Azure Databricks workspace, or other)

ü Peered Kafka and Spark Virtual Networks

ü Sources of data: Twitter & Slack

lenadroid

Page 29: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Databricks: Interactive Environment

lenadroid

Page 30: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data
Page 31: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Example

Processing streams of events from multiple sources with Apache Kafka and Spark

lenadroid

Page 32: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Data sources: external, internal, ...

• Big number of data sources

• Most of the data sources are independent

• Sources of data used for many processing tasks & end-goals

lenadroid

Page 33: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Feedback from Slack

ü Sending messages to Slack

lenadroid

Page 34: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Listener for new Slack messages

ü Messagesunderspecificchannelsü Focusedonaparticulartopicü SenttoaspecificKafkatopic

lenadroid

Page 35: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Receiving events in Kafka topic

ü SparkconsumerforKafkatopicsü SendingonlytopicrelatedmessagestoKafka

lenadroid

Page 36: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Sending Twitter feedback to Kafka

ü GettinglatesttweetsaboutspecifictopictoKafkaü ReceivingthoseeventsfromKafkainSpark

lenadroid

Page 37: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Analyzing feedback in real-time

ü Kafkaisreceivingeventsfrommanysourcesü SentimentanalysisonincomingKafkaeventsü Sentiment<=0.3à #negative-feedback forreviewü Sentiment>=0.9à #positive-feedback channel

lenadroid

Page 38: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Kafka + Spark = Reliable, scalable, durable event ingestion

and efficient stream processing

lenadroid

Page 39: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

trigger(Trigger.Continuous("1 second"))

Low (~1 ms) end-to-end latency

At-least-once fault-tolerance guarantees

Not nearly all operations are supported yet

No automatic retries of failed tasks

Needs enough cluster power to operate

lenadroid

Page 40: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

EveryXseconds EveryXseconds EveryXseconds

WheneventIsatsource

Wheneventisprocessedtosink

Check-pointingepoch

lenadroid

Page 41: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

EveryXseconds EveryXseconds EveryXseconds

WheneventIsatsource

Wheneventisprocessedtosink

Check-pointingepoch

~1ms

lenadroid

Page 42: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

aka.ms/eventhubs-kafkalenadroid

Page 43: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Operator

Operators

lenadroid

Page 44: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Thank you!

Apache Kafka: aka.ms/apache-kafka

Apache Spark: aka.ms/apache-sparkEvent stream processing architecture on Azure with Apache Kafka and Spark: aka.ms/kafka-spark-azure

Create HDInsight Kafka cluster using ARM: aka.ms/hdi-kafka-arm

Create Kafka topics in HDInsight: aka.ms/hdi-kafka-topic

lenadroid

Page 45: Distributed systems for stream processing · Distributed systems for stream processing Apache Kafka and Spark Structured Streaming Alena Hall - @lenadroid. ü High Scale & Big Data

Alena Hall - lenadroid

ü Works on Azure at ü Lives in Seattleü F# Software Foundation Board of Trusteesü Organizes @ML4ALLü Program Committee for Lambda World ü Has a channel: /c/AlenaHall