Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
![Page 1: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/1.jpg)
Streaming Analytics on AWS
Dmitri Tchikatilov
AdTech BD, [email protected]
![Page 2: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/2.jpg)
Agenda
1. Streaming principles
2. Streaming analytics on AWS
3. Kinesis and Apache Spark on EMR
4. Querying and Scaling
5. Best Practices
![Page 3: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/3.jpg)
Batch vs. Stream
| | Batch Processing | Stream Processing |
|---|---|---|
| Data scope | Queries or processing over all or most of the data | Queries or processing over data in a rolling window or the most recent data record |
| Data size | Large batches of data | Individual records or micro-batches of a few records |
| Performance | Latencies in minutes to hours | Requires latency on the order of seconds or milliseconds |
| Analytics | Complex analytics | Simple response functions, aggregates, and rolling metrics |
![Page 4: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/4.jpg)
Streaming App Challenges
• Simple & Flexible Analytics
• Elastic - adapt to input surges and back pressure
• Fast ~ 1s to 100ms for the majority of apps
• Scalable ~ 1M records/sec
• Available - low tolerance for record losses
Usability Performance
![Page 5: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/5.jpg)
“We are our choices...”
J.P. Sartre
![Page 6: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/6.jpg)
Stream Processing Choices on AWS

| | Operations | Analytics |
|---|---|---|
| Storm | Zookeeper/Nimbus for HA | SQL - 3rd party, roll your own |
| Kafka | Zookeeper (failure detection, partitioning, replication) | SQL - 3rd party, roll your own |
| Druid | Zookeeper, multiple node roles scale independently | OLAP engine (JSON) on denormalized data, real-time indexing |
| Kinesis | AWS Service | SQL - Kinesis Analytics (in development) |
| Spark Streaming | EMR bootstraps latest 1.6, Yarn, Monitoring | SparkSQL on DataFrames, Joins, Zeppelin notebooks |
![Page 7: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/7.jpg)
Components
Storage layer: Ingest (record storing, ordering, strong consistency, and replayable reads)
Processing layer: Analytics (consume data from the storage layer, run computations, remove data from storage)
![Page 8: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/8.jpg)
Storage - Amazon Kinesis Streams
Real-Time Streaming Data Ingestion
Custom-built Streaming Applications (KCL)
Inexpensive: $0.014 per 1,000,000 PUT Payload Units
Kinesis Stream
• 1 shard: up to 1 MB/s in, 2 MB/s out
• Each record: up to 1 MB
• PutRecords(): up to 500 records (5 MB) per call
• Increased retention: 7 days
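The PutRecords limits on this slide (500 records and 5 MB per call, 1 MB per record) mean a producer has to group its records before sending. A minimal sketch of that grouping in plain Python follows. The function name and constants are hypothetical, 1 MB is assumed to mean 1024 * 1024 bytes, and the size check ignores the partition key, which a real producer would also have to count.

```python
MAX_RECORDS_PER_CALL = 500              # PutRecords record-count limit from the slide
MAX_BYTES_PER_CALL = 5 * 1024 * 1024    # 5 MB per call (assumes 1 MB = 1024 * 1024 bytes)
MAX_BYTES_PER_RECORD = 1024 * 1024      # 1 MB per record


def batch_for_put_records(payloads):
    """Group byte payloads into batches that respect the PutRecords limits."""
    batches, current, current_bytes = [], [], 0
    for payload in payloads:
        size = len(payload)
        if size > MAX_BYTES_PER_RECORD:
            raise ValueError("record of %d bytes exceeds the 1 MB limit" % size)
        # Start a new batch when adding this record would break either limit.
        if current and (len(current) == MAX_RECORDS_PER_CALL
                        or current_bytes + size > MAX_BYTES_PER_CALL):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(payload)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

Each returned batch could then be sent with one PutRecords call. Batching many small messages this way is also what keeps the per-PUT cost down.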
![Page 9: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/9.jpg)
Processing - Spark Streaming
[Diagram: input data streams → Receivers → Spark job (DStream) → results published to destinations]
RDD = Resilient Distributed Dataset
DStream = Collection of RDDs
![Page 10: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/10.jpg)
Spark Streaming – Long Running Spark App
[Diagram]
Driver Program: StreamingContext and SparkContext; launches Spark jobs to process received data
Worker Node (Executor): long-running task (Receiver) reads the input stream
Worker Node (Executor): tasks process the data
Output: batch
![Page 11: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/11.jpg)
Analytics - DataFrames on Streaming Data
• KCL: Kinesis Client Library (helps take data off Kinesis)
• Spark Streaming uses KCL: reads data from Kinesis and forms a DStream (pull mechanism)
• Creates DataFrame in Spark Streaming
Kinesis KCL
![Page 12: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/12.jpg)
Kinesis and Spark Streaming
[Diagram labels: EMR, Kinesis]
![Page 13: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/13.jpg)
Full Kinesis + Spark Pipeline
![Page 14: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/14.jpg)
What About Analytics?
What operations are possible?
Filter, GroupBy, Join, Window Operations
Not all queries make sense to run on the stream.
Large joins on RDDs in DStreams can be expensive
![Page 15: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/15.jpg)
Spark Streaming – Operations on DStreams
Window Operations
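To show what a window operation computes, here is a small plain-Python sketch that counts keys over a sliding window of micro-batches. It is not Spark code. It only mimics the idea behind DStream windowing, where a window length and a slide interval are both expressed in batch intervals. The function name and the list-of-lists input are assumptions made for illustration.

```python
from collections import deque


def windowed_counts(batches, window_length, slide_interval):
    """Count records per key over a sliding window of micro-batches.

    batches: list of micro-batches, each a list of keys.
    window_length and slide_interval are measured in batches.
    """
    window = deque(maxlen=window_length)  # keeps only the most recent batches
    results = []
    for i, batch in enumerate(batches, start=1):
        window.append(batch)
        if i % slide_interval == 0:  # emit a result once per slide
            counts = {}
            for b in window:
                for key in b:
                    counts[key] = counts.get(key, 0) + 1
            results.append(counts)
    return results
```

With a window of 3 batches and a slide of 2, every second batch triggers a count over the last three batches, so consecutive results overlap by one batch.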
![Page 16: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/16.jpg)
Query the Data in DStreams?
This is all great, but I’d like to query my data!
StreamingContext > DStream (RDDs) > DataFrame
The DataFrame is registered as a temporary table and queried with SQL through HiveContext
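The DStream > DataFrame > SQL chain can be sketched in PySpark roughly as below. This is an illustration under assumptions, not the demo code: records are assumed to be JSON objects, the table name `events` and the function names are hypothetical, and `SQLContext` stands in for the `HiveContext` named on the slide to keep the sketch small. The calls use Spark 1.6-era names, and later Spark versions rename `registerTempTable` to `createOrReplaceTempView`.

```python
import json

TABLE_NAME = "events"  # hypothetical temp table name


def parse_record(raw):
    """Decode one record payload (assumed to be a JSON object) into a dict."""
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return json.loads(raw)


def register_batch(time, rdd):
    """Turn one micro-batch RDD into a DataFrame and expose it to SQL."""
    # Imported lazily so parse_record can be used without a Spark install.
    from pyspark.sql import Row, SQLContext

    if rdd.isEmpty():
        return
    sql_context = SQLContext.getOrCreate(rdd.context)
    rows = rdd.map(parse_record).map(lambda d: Row(**d))
    df = sql_context.createDataFrame(rows)
    df.registerTempTable(TABLE_NAME)  # Spark 1.6 name for a temp table
    sql_context.sql("SELECT COUNT(*) AS n FROM " + TABLE_NAME).show()


# Wiring, given an existing DStream (the variable name is hypothetical):
# kinesis_stream.foreachRDD(register_batch)
```

The temp table is replaced on every micro-batch, so each query sees only the most recent batch unless the stream is windowed first.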
![Page 17: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/17.jpg)
Example: Querying DStreams with SQL
Courtesy: Amo Abeyarante, AWS Big Data Blog
![Page 18: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/18.jpg)
Setup
1. Kinesis Stream with data provided by Python script
2. KCL Scala app launched as spark-job
   • Checks the number of Shards and instantiates the same number of Streams
   • Receives data from Kinesis in small batches
   • Creates DataFrame, registers as temp table
   • Creates HiveContext
3. Use Hive app to query the data
![Page 19: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/19.jpg)
Demo – Querying Streams
![Page 20: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/20.jpg)
Analytics – Choosing Where to Join Data
Storage: join the data in a custom KCL app, then denormalize and publish to another Kinesis Stream
Processing: join the streaming data using DStreams
![Page 21: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/21.jpg)
Amazon Kinesis + Spark on EMR
[Diagram: Producers 1 to N write to Kinesis (Shard1, Shard2). On EMR, Yarn Executor 1 hosts Receiver 1 (KCL Worker 1) with RecordProcessor 1 and RecordProcessor 2; Yarn Executor 2 hosts no receiver.]
![Page 22: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/22.jpg)
Create DStream to Scale Out
from pyspark import StorageLevel
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream
# Bracketed arguments are placeholders to fill in for your own application.
kinesisStream = KinesisUtils.createStream(
    streamingContext, [Kinesis app name], [Kinesis stream name], [endpoint URL],
    [region name], [initial position], [checkpoint interval], StorageLevel.MEMORY_AND_DISK_2)
![Page 23: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/23.jpg)
Amazon Kinesis + Spark on EMR
[Diagram: Producers 1 to N write to Kinesis (Shard1, Shard2). On EMR, Yarn Executor 1 hosts Receiver 1 (KCL Worker 1) with RecordProcessor 1, and Yarn Executor 2 hosts Receiver 2 (KCL Worker 2) with RecordProcessor 2.]
![Page 24: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/24.jpg)
Scaling Kinesis
• Can accumulate data at any rate, but needs input batching for high rates of small messages to optimize cost
• Scales inputs by splitting shards
• Never “pressures” Spark: Spark and KCL pull the data
![Page 25: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/25.jpg)
Scaling EMR/Spark
• Scales by adding task nodes, which can be EC2 Spot instances
• Yarn can be configured for “dynamic resource allocation” with a variable number of executors per app. New default for the upcoming EMR 4.4 release. Works well for batch, but not always for Streaming
• Automatic: same number of Receivers (in case of shard split/merge operations)
• Manual (app restart): if you need to change the number of Receivers
![Page 26: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/26.jpg)
Stability in Spark Streaming
Stable: Tb (batch) = 4s, Tp (process) = 2s; batches start at 0s, 4s, 8s and each is processed in 2s
Unstable: Tb (batch) = 4s, Tp (process) = 5s; each batch takes 5s to process, so it overruns the 4s interval
Stable: Tp <= Tb
Unstable: Tp > Tb
Unstable state: increase in scheduling delay
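The two cases on this slide can be checked with a few lines of arithmetic. The sketch below is a simplified model, not Spark's scheduler: it assumes a fixed processing time per batch and one batch processed at a time, and the function name is hypothetical.

```python
def scheduling_delays(batch_interval, processing_time, num_batches):
    """Return how long each micro-batch waits before its processing starts."""
    delays = []
    free_at = 0.0  # time at which the previous batch finishes processing
    for i in range(num_batches):
        ready_at = (i + 1) * batch_interval  # batch i is fully received at this time
        start = max(ready_at, free_at)       # wait if the previous batch is still running
        delays.append(start - ready_at)
        free_at = start + processing_time
    return delays
```

With Tb = 4s and Tp = 2s, every delay is zero. With Tb = 4s and Tp = 5s, the delay grows by one second per batch and never recovers, which is the unstable state.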
![Page 27: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/27.jpg)
Spark Backpressure Feature
After every micro-batch finishes, its statistics are used to estimate the processing rate
A PID (proportional-integral-derivative) controller estimates the maximum rate of ingest for the system (rows/sec)
The PID controller limits the ingest
SparkConf: spark.streaming.backpressure.enabled = true
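A PID rate estimate of this kind can be sketched in plain Python as below. This is a simplified illustration of the control idea, not Spark's implementation. The function name, parameter names, gain values, and minimum rate are all assumptions chosen for the example.

```python
def next_ingest_rate(latest_rate, num_records, processing_delay_s,
                     scheduling_delay_s, batch_interval_s,
                     latest_error=0.0, seconds_since_update=1.0,
                     kp=1.0, ki=0.2, kd=0.0, min_rate=100.0):
    """Estimate the next maximum ingest rate (records/sec) from one finished batch.

    Returns (new_rate, error) so the caller can feed error back in next time.
    """
    processing_rate = num_records / processing_delay_s  # rate the job actually sustained
    error = latest_rate - processing_rate                # proportional term
    # Backlog implied by the scheduling delay, expressed as a rate (integral term).
    historical_error = scheduling_delay_s * processing_rate / batch_interval_s
    d_error = (error - latest_error) / seconds_since_update  # derivative term
    new_rate = latest_rate - kp * error - ki * historical_error - kd * d_error
    return max(new_rate, min_rate), error
```

For example, if the current limit is 1000 records/sec but a batch of 2000 records took 4s to process, the sustained rate is 500 records/sec and the estimate drops to that value. Any scheduling delay lowers it further so the backlog can drain.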
![Page 28: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/28.jpg)
Analytics on Streaming Data
Is here today, but requires some work. Major advancements soon in Kinesis Analytics, Spark 2.0.
A lot of analytics can be done simply in a custom KCL app (moving averages, joins, filters, etc).
FLEXIBILITY PERFORMANCE
![Page 29: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/29.jpg)
Streaming Best Practices Summary
1. Total processing time is less than the batch interval (Tp < Tb)
2. Load is well balanced: # of Receivers is a multiple of # of Executors
3. Spark Streaming reading from Kinesis defaults to 1 sec.
4. Enable Spark Checkpoints for reliable (at-least-once) semantics. Use Spark 1.6 with EMRFS for S3.
5. Give streaming apps different names so they do not use the same DynamoDB table
![Page 30: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/30.jpg)
Dmitri Tchikatilov
Digital Advertising [email protected]
![Page 31: Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv](https://reader034.vdocument.in/reader034/viewer/2022042907/58d1642f1a28aba3468b5921/html5/thumbnails/31.jpg)