distributed systems for stream processing 2018.pdf · apache kafka and spark structured streaming...
TRANSCRIPT
Distributed systems for stream processing
Apache Kafka and Spark Structured Streaming
Alena Hall lenadroid
✓ Large-scale data processing
✓ Distributed Systems
✓ Functional Programming
✓ Data Science & Machine Learning
Natallia Dzenisenka
nata_dzen bit.ly/oscon-17
Ever-increasing
Data
Direct result of some action
Produced as a side effect
Continuous indicators
Reaction
urgent: real-time (~ sub-milliseconds)
not-so-urgent: near-real-time (~ seconds)
flexible: batch (~ minutes, hours, days, weeks)
Event Ingestion Processing & Reaction
real-time micro-batch
batch
Are data workflows flexible enough?
Challenges
Simplicity. Scalability. Reliability
Meet Apache Kafka
Apache Kafka is an open-source stream-processing software platform developed by the Apache Software Foundation, written in Scala and Java.
Kafka Brokers
Inside of a Kafka Topic
[Figure: a topic split into three partitions, each an ordered sequence of offsets: 0 to 4, 0 to 3, and 0 to 8]
Kafka Topic Partition
[Figure: a single partition as an ordered sequence of offsets 0 to 8]
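The offset numbering on the two slides above can be illustrated with a small in-memory model. This is only a sketch, not Kafka's implementation: the `Partition` and `Topic` classes are hypothetical, and the CRC32-based partitioner is an assumption chosen for brevity (Kafka's default partitioner hashes keys with murmur2).

```python
import zlib
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Partition:
    """An append-only sequence of records; a record's offset is its position."""
    records: List[bytes] = field(default_factory=list)

    def append(self, value: bytes) -> int:
        self.records.append(value)
        return len(self.records) - 1  # offset assigned to the new record

    def read(self, offset: int, max_records: int = 10) -> List[Tuple[int, bytes]]:
        # Consumers read forward from an offset; records are never modified.
        chunk = self.records[offset:offset + max_records]
        return [(offset + i, value) for i, value in enumerate(chunk)]


@dataclass
class Topic:
    num_partitions: int
    partitions: List[Partition] = field(init=False)

    def __post_init__(self) -> None:
        self.partitions = [Partition() for _ in range(self.num_partitions)]

    def send(self, key: bytes, value: bytes) -> Tuple[int, int]:
        # Hypothetical partitioner (assumption): CRC32 of the key modulo
        # the partition count, so equal keys land in the same partition.
        p = zlib.crc32(key) % self.num_partitions
        return p, self.partitions[p].append(value)
```

Records sent with the same key end up in one partition with consecutive offsets, which is why ordering holds per partition rather than across the whole topic.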
Kafka Producers and Consumers
Systems for stream processing
Kafka Streams
Storm
Spark
Flink
Meet Apache Spark
Apache Spark is a unified analytics engine for large-scale data processing: batch, streaming, machine learning, and graph computation, with access to data in hundreds of sources.
✓ Spark SQL and batch processing
✓ Stream processing with Spark Streaming and Structured Streaming
✓ Continuous processing*
✓ Machine learning with MLlib
✓ Graph computations with GraphX
* Experimental
How does Spark work?
[Figure: Spark application (driver), cluster manager, and the tasks distributed across the cluster]
Apache Kafka + Apache Spark
Existing infrastructure and resources
✓ Kafka cluster (HDInsight or other)
✓ Spark cluster (Azure Databricks workspace, or other)
✓ Peered Kafka and Spark Virtual Networks
✓ Sources of data: Twitter & Slack & Nomics APIs
Databricks: Interactive Environment
Example: processing cryptocurrency trading data
markets (base / quote): ETH / BTC, BTC / USDT, …
exchanges: Bitfinex, Binance, …
trades: …
markets exchanges trades
{"volume":"5","price":"3.0871","id":"123456","timestamp":"2018-07-17T17:00:00.00Z"}
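In the sample trade above every field arrives as a JSON string, so a consumer would typically convert the values before computing anything. The following is a minimal sketch of that step; the `Trade` class and `parse_trade` function are hypothetical names, and the timestamp format is an assumption inferred from the single sample record.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal


@dataclass(frozen=True)
class Trade:
    id: str
    price: Decimal
    volume: Decimal
    timestamp: datetime


def parse_trade(raw: str) -> Trade:
    """Convert one JSON trade record into typed fields."""
    doc = json.loads(raw)
    # Assumption: timestamps always look like the sample, i.e. UTC with a
    # fractional-seconds part and a trailing "Z".
    ts = datetime.strptime(doc["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return Trade(
        id=doc["id"],
        price=Decimal(doc["price"]),    # Decimal avoids float rounding on prices
        volume=Decimal(doc["volume"]),
        timestamp=ts.replace(tzinfo=timezone.utc),
    )
```

`Decimal` is used here instead of `float` so that a price such as `3.0871` is kept exactly as it was sent.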
Indicators to watch and act on
✓ Price spikes (all-time high, all-time low)
✓ Significant changes in price or volume of trades
✓ Profitability of potential trade at current moment
✓ Price or volume of trades crossing given threshold during the past X minutes
✓ More
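One of the indicators above, the volume of trades crossing a given threshold during the past X minutes, can be sketched as a sliding window over event timestamps. This is an illustration in plain Python, not the Spark job from the talk; the `WindowedThreshold` class is hypothetical and assumes trades arrive in timestamp order.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Deque, Tuple


class WindowedThreshold:
    """Tracks the traded volume seen during the last `window` of event time."""

    def __init__(self, window: timedelta, threshold: float) -> None:
        self.window = window
        self.threshold = threshold
        self._events: Deque[Tuple[datetime, float]] = deque()
        self._total = 0.0

    def observe(self, ts: datetime, volume: float) -> bool:
        """Add one trade; return True if the windowed volume meets the threshold."""
        self._events.append((ts, volume))
        self._total += volume
        cutoff = ts - self.window
        # Evict trades that have fallen out of the window (in-order input assumed).
        while self._events and self._events[0][0] <= cutoff:
            _, old_volume = self._events.popleft()
            self._total -= old_volume
        return self._total >= self.threshold
```

A call such as `WindowedThreshold(timedelta(minutes=5), 10.0).observe(ts, volume)` returns `True` only while the last five minutes of trades add up to at least 10 units.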
Getting trades data from API
✓ Market and exchange data
✓ Trades data for given market and base/quote currencies
✓ Sending data to Kafka
Processing trades
✓ Consuming data coming from Kafka topics
✓ Watching relevant indicators
More examples?
Processing streams of events from multiple sources with Apache Kafka and Spark
Data sources: external, internal, ...
• Large number of data sources
• Most of the data sources are independent
• Sources of data used for many processing tasks & end-goals
Feedback from Slack
✓ Sending messages to Slack
Listener for new Slack messages
✓ Messages under specific channels
✓ Focused on a particular topic
✓ Sent to a specific Kafka topic
Receiving events in Kafka topic
✓ Spark consumer for Kafka topics
✓ Sending only topic-related messages to Kafka
Sending Twitter feedback to Kafka
✓ Getting latest tweets about specific topic to Kafka
✓ Receiving those events from Kafka in Spark
Analyzing feedback in real-time
✓ Kafka is receiving events from many sources
✓ Sentiment analysis on incoming Kafka events
✓ Sentiment <= 0.3 → #negative-feedback for review
✓ Sentiment >= 0.9 → #positive-feedback channel
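The two routing rules above amount to a small decision function. The sketch below uses the thresholds and channel names from the slide; the `route` function itself is hypothetical, and returning `None` for scores in between is an assumption, since the slide does not say what happens to them.

```python
from typing import Optional

NEGATIVE_MAX = 0.3  # from the slide: sentiment <= 0.3 goes to review
POSITIVE_MIN = 0.9  # from the slide: sentiment >= 0.9 goes to the positive channel


def route(sentiment: float) -> Optional[str]:
    """Return the Slack channel for a sentiment score, or None to skip it."""
    if sentiment <= NEGATIVE_MAX:
        return "#negative-feedback"
    if sentiment >= POSITIVE_MIN:
        return "#positive-feedback"
    return None  # assumption: mid-range feedback is not forwarded
```

Keeping the thresholds as named constants makes it easy to tune them without touching the routing logic.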
Kafka + Spark = reliable, scalable, durable event ingestion and efficient stream processing
trigger(Trigger.Continuous("1 second"))
Low (~1 ms) end-to-end latency
At-least-once fault-tolerance guarantees
Not nearly all operations are supported yet
No automatic retries of failed tasks
Needs enough cluster power to operate
[Figure: timeline with a check-pointing epoch every X seconds; about 1 ms elapses between when an event is at the source and when it is processed to the sink]
aka.ms/eventhubs-kafka
Operator
Operators
Thank you!
Apache Kafka: aka.ms/apache-kafka
Apache Spark: aka.ms/apache-spark
Event stream processing architecture on Azure with Apache Kafka and Spark: aka.ms/kafka-spark-azure and aka.ms/oscon-18
Create HDInsight Kafka cluster using ARM: aka.ms/hdi-kafka-arm
Create Kafka topics in HDInsight: aka.ms/hdi-kafka-topic
✓ Works on Azure at
✓ Lives in Seattle
✓ F# Software Foundation Board of Trustees
✓ Organizes @ML4ALL
✓ Program Committee for Lambda World
✓ Has a channel: /c/AlenaHall
Alena Hall - lenadroid