![Page 1: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/1.jpg)
Essential Ingredients of Stream Processing @ Scale
Kartik Paramasivam
![Page 2: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/2.jpg)
About Me
• ‘Streams Infrastructure’ at LinkedIn
  – Pub-sub messaging: Apache Kafka
  – Change capture from various data systems: Databus
  – Stream processing platform: Apache Samza
• Previous
  – Microsoft Cloud/IoT Messaging (EventHub) and Enterprise Messaging (Queues/Topics)
  – .NET WebServices and Workflow stack
  – BizTalk Server
![Page 3: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/3.jpg)
Agenda
• What is Stream Processing?
• Scenarios
• Canonical Architecture
• Essential Ingredients of Stream Processing
• Close
![Page 4: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/4.jpg)
Response latency
• RPC: synchronous (0 ms)
• Stream processing: milliseconds to minutes
• Batch processing: later, possibly much later
![Page 5: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/5.jpg)
Agenda
• Stream processing Intro
• Scenarios
• Canonical Architecture
• Essential Ingredients of Stream Processing
• Close
![Page 6: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/6.jpg)
Newsfeed
![Page 7: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/7.jpg)
Cyber-security
![Page 8: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/8.jpg)
Internet of Things
![Page 9: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/9.jpg)
Agenda
• Stream processing Intro
• Scenarios
• Canonical Architecture
• Essential Ingredients of Stream Processing
• Close
![Page 10: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/10.jpg)
CANONICAL ARCHITECTURE
[Diagram labels]
• Clients (browser, devices, sensors, …)
• Services Tier
• Ingestion: Kafka, Databus
• Processing: Real Time Processing (Samza), Batch Processing (Hadoop/Spark)
• Serving: e.g. Espresso, Voldemort R/O (bulk upload)
![Page 11: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/11.jpg)
Agenda
• Stream processing Intro
• Scenarios
• Canonical Architecture
• Essential Ingredients of Stream Processing
• Close
![Page 12: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/12.jpg)
Essential Ingredients to Stream Processing
1. Scale
2. Reprocessing
3. Accuracy of results
4. Easy to program
![Page 13: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/13.jpg)
SCALE.. but not at any cost
![Page 14: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/14.jpg)
Basics: Scaling Ingestion
- Streams are partitioned
- Messages are sent to partitions based on PartitionKey
- Time-based message retention
[Diagram: producers send messages with Pkey=10, Pkey=25 and Pkey=45 to the partitions of Stream A; consumerA (machine1) and consumerA (machine2) read from the partitions]
e.g. Kafka, AWS Kinesis, Azure EventHub
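The partitioning rule on this slide can be sketched as follows. This is a minimal illustration, assuming a stable hash of the partition key taken modulo the partition count; the function name `partition_for` and the choice of CRC32 are assumptions made for the example, and Kafka, Kinesis and EventHub each use their own hashing schemes.

```python
import zlib


def partition_for(partition_key: str, num_partitions: int) -> int:
    """Map a partition key to a partition index using a stable hash."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # CRC32 gives the same result in every process, unlike Python's
    # built-in hash(), which is randomized per interpreter run.
    return zlib.crc32(partition_key.encode("utf-8")) % num_partitions


if __name__ == "__main__":
    # Hypothetical 4-partition stream; keys taken from the slide.
    for pkey in ("10", "25", "45"):
        print(f"Pkey={pkey} -> partition {partition_for(pkey, 4)}")
```

Because the mapping depends only on the key, all messages with the same PartitionKey land in the same partition and are seen by the same consumer instance.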
![Page 15: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/15.jpg)
Scaling Processing.. e.g. Samza
[Diagram: a Samza Job made up of Task 1, Task 2 and Task 3, connected to Stream A and Stream B]
![Page 16: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/16.jpg)
Samza – Streaming Dataflow
[Diagram: Job 1 and Job 2 connected through Stream A, Stream B, Stream C and Stream D]
![Page 17: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/17.jpg)
Horizontal Scaling is great ! But..
• But more machines means more $$
• Need to do more with less.
• So what’s the key bottleneck during Event/Stream Processing?
![Page 18: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/18.jpg)
Key Bottleneck: “Accessing Data”
• Big impact on CPU, Network, Disk
• Types of Data Access
  1. Adjunct data – Read-only data
  2. Scratchpad/derived data – Read-write data
![Page 19: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/19.jpg)
Adjunct Data – typical access
[Diagram: Kafka (AdClicks) → Processing Job → Kafka (AdQuality update); the Processing Job reads member info from the Member Database]
Concerns
1. Latency
2. CPU
3. Network
4. DDOS
![Page 20: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/20.jpg)
Scratch pad/Derived Data – typical access
[Diagram: Kafka (Sensor Data) → Processing Job → Kafka (Alerts); the Processing Job reads and updates per-device info in the DeviceState Database]
Concerns
1. Latency
2. CPU
3. Network
4. DDOS
![Page 21: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/21.jpg)
Adjunct Data – with Samza
[Diagram: Kafka (AdClicks) → Processing Job (Task1, Task2, Task3, each with a local RocksDB store) → output Kafka; Member Updates flow from the Member Database (Espresso) through Databus into the job]
Kafka, Databus, Database, Samza Job are all partitioned by MemberId
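A minimal sketch of the idea on this slide: each task keeps a local copy of the member rows for its partition, fed by change-capture events, and joins ad clicks against that copy instead of calling the remote database. The class and method names (`AdClickTask`, `on_member_update`, `on_ad_click`) and the `industry` field are hypothetical, and a plain dict stands in for the RocksDB store; this is not the Samza API.

```python
from typing import Dict, Optional


class AdClickTask:
    """One task instance; it owns the local store for its partition."""

    def __init__(self) -> None:
        # Stand-in for the task-local RocksDB store (assumption: in-memory dict).
        self.member_store: Dict[int, dict] = {}

    def on_member_update(self, member_id: int, profile: dict) -> None:
        # Change-capture event: overwrite the local copy of the member row.
        self.member_store[member_id] = profile

    def on_ad_click(self, member_id: int, ad_id: str) -> dict:
        # Local lookup replaces a remote database read per message.
        profile = self.member_store.get(member_id, {})
        industry: Optional[str] = profile.get("industry")
        return {"member_id": member_id, "ad_id": ad_id, "industry": industry}


if __name__ == "__main__":
    task = AdClickTask()
    task.on_member_update(7, {"industry": "software"})
    print(task.on_ad_click(7, "ad-1"))
```

The lookup only works because the click stream and the member update stream are partitioned by the same key, so the task that receives a click for a member also holds that member's row.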
![Page 22: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/22.jpg)
Fault Tolerance in a stateful Samza job
[Diagram: Task-0 to Task-3 run on Host-A, Host-B and Host-C, each with local state for its partition (P0 to P3); the Changelog Stream has matching partitions P0 to P3]
Stable State
![Page 23: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/23.jpg)
Fault Tolerance in a stateful Samza job
[Diagram: Task-0 to Task-3 on Host-A, Host-B and Host-C, with local state P0 to P3 and Changelog Stream partitions P0 to P3]
Host A dies/fails
![Page 24: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/24.jpg)
Fault Tolerance in a stateful Samza job
[Diagram: Task-0 to Task-3 on Host-E, Host-B and Host-C, with local state P0 to P3 and Changelog Stream partitions P0 to P3]
YARN allocates the tasks to a container on a different host!
![Page 25: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/25.jpg)
Fault Tolerance in a stateful Samza job
[Diagram: Task-0 to Task-3 on Host-E, Host-B and Host-C, with local state P0 to P3 and Changelog Stream partitions P0 to P3]
Restore local state by reading from the ChangeLog
![Page 26: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/26.jpg)
Fault Tolerance in a stateful Samza job
[Diagram: Task-0 to Task-3 on Host-E, Host-B and Host-C, with local state P0 to P3 and Changelog Stream partitions P0 to P3]
Back to Stable State
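The recovery sequence in the slides above can be sketched as follows: every write to local state is also appended to a changelog, and a task that restarts on a new host rebuilds its state by replaying that changelog. The names (`StatefulTask`, `put`, `delete`, `restore`) are hypothetical, and a Python list and dict stand in for the changelog partition and the local store; this is an illustration of the pattern, not Samza's implementation.

```python
from typing import Dict, List, Optional, Tuple

# (key, value); a value of None marks a delete.
ChangelogEntry = Tuple[str, Optional[str]]


class StatefulTask:
    def __init__(self, changelog: List[ChangelogEntry]) -> None:
        self.changelog = changelog        # stand-in for one changelog partition
        self.store: Dict[str, str] = {}   # stand-in for the local store

    def put(self, key: str, value: str) -> None:
        self.store[key] = value
        self.changelog.append((key, value))  # every write is also logged

    def delete(self, key: str) -> None:
        self.store.pop(key, None)
        self.changelog.append((key, None))

    @classmethod
    def restore(cls, changelog: List[ChangelogEntry]) -> "StatefulTask":
        """Rebuild local state on a new host by replaying the changelog."""
        task = cls(changelog)
        for key, value in list(changelog):
            if value is None:
                task.store.pop(key, None)
            else:
                task.store[key] = value
        return task


if __name__ == "__main__":
    log: List[ChangelogEntry] = []
    task = StatefulTask(log)
    task.put("device-1", "ok")
    task.put("device-1", "hot")
    recovered = StatefulTask.restore(log)  # e.g. after the original host fails
    print(recovered.store)
```

Restore time grows with the length of the changelog, which is one reason to keep the changelog compacted.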
![Page 27: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/27.jpg)
Performance Numbers with Samza
Hardware Spec: 24 cores, 1Gig NIC, SSD
• (Baseline) Simple pass-through job with no local state – 1.2 Million msg/sec
• Samza job with local state – 400k msg/sec
• Samza job with local state with Kafka backup – 300k msg/sec
![Page 28: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/28.jpg)
Local State - Summary
• Great for both read-only data and read-write data
• Secret sauce to make local state work
  1. Change Capture System: Databus/DynamoDB streams
  2. Durable backup with Kafka Log Compacted topics
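To illustrate why log-compacted topics suit durable backup of local state, here is a simplified sketch of the compaction idea: only the latest record per key is retained, so the backup stays proportional to the number of keys rather than the number of writes. The function name `compact` is hypothetical, and real Kafka compaction runs in the background on the broker and is more involved than this.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (key, value)


def compact(log: List[Record]) -> List[Record]:
    """Keep only the last record for each key, preserving offset order."""
    latest: Dict[str, Tuple[int, str]] = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_offset, value) in survivors]


if __name__ == "__main__":
    log = [("a", "1"), ("b", "2"), ("a", "3")]
    print(compact(log))  # [('b', '2'), ('a', '3')]
```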
![Page 29: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/29.jpg)
Essential Ingredients to Stream Processing
1. Scale
2. Reprocessing
3. Accuracy of results
4. Easy to program
![Page 30: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/30.jpg)
REPROCESSING
![Page 31: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/31.jpg)
Why do we need it ?
• Software upgrades.. Yes, bugs are a reality
• Business logic changes
• First-time job deployment
![Page 32: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/32.jpg)
Reprocessing Data – with Samza
[Diagram labels: Member Database (Espresso), Databus, bootstrap, Member Updates, Company/Title/Location Standardization Job, Machine Learning model, output Kafka]
![Page 33: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/33.jpg)
Reprocessing- Caveats
• Stream processors are fast.. They can DOS the system if you reprocess
  – Control max-concurrency of your job
  – Quotas for Kafka, Databases
  – Async load into databases (Project Venice)
• Capacity
  – Reprocessing a 100 TB source?
• Doesn’t reprocessing mean you are no longer being real-time?
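A rough back-of-the-envelope calculation for the capacity question above. It assumes decimal terabytes, that each machine is limited only by a fully saturated 1 Gbit/s NIC (the hardware spec quoted earlier in the deck), and that the work parallelizes perfectly; real jobs are also bound by CPU, local state and downstream quotas, so treat the result as a lower bound. The function name `reprocess_hours` is hypothetical.

```python
def reprocess_hours(source_bytes: float,
                    bytes_per_sec_per_machine: float,
                    machines: int) -> float:
    """Lower-bound time in hours to read a source at a given per-machine rate."""
    if machines <= 0 or bytes_per_sec_per_machine <= 0:
        raise ValueError("machines and rate must be positive")
    seconds = source_bytes / (bytes_per_sec_per_machine * machines)
    return seconds / 3600.0


if __name__ == "__main__":
    source = 100e12       # 100 TB (decimal)
    nic_rate = 1e9 / 8    # 1 Gbit/s expressed in bytes per second
    for machines in (1, 10, 100):
        hours = reprocess_hours(source, nic_rate, machines)
        print(f"{machines:>3} machines: {hours:.1f} hours")
```

Under these assumptions a single machine needs about 222 hours (over nine days), ten machines about 22 hours, and a hundred machines a little over 2 hours.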
![Page 34: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/34.jpg)
Essential Ingredients to Stream Processing
1. Scale, but not at any cost
2. Reprocessing
3. Accuracy of results
4. Easy to Program
![Page 35: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/35.jpg)
ACCURACY OF RESULTS
![Page 36: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/36.jpg)
Querying over an infinite stream
[Diagram: User1 generates an Ad View Event at 1:00 pm and an Ad Click Event at 1:01 pm; both reach the AdQuality Processor]
Did the user click the Ad within 2 minutes of seeing the Ad?
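The question on this slide reduces to a time-bounded join between a view event and a click event for the same user and ad. A minimal sketch of the predicate, assuming both events carry an event timestamp; the function name `clicked_within_window` and the dates used are illustrative only.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=2)


def clicked_within_window(view_time: datetime,
                          click_time: datetime,
                          window: timedelta = WINDOW) -> bool:
    """True if the click happened at or after the view and within the window."""
    return timedelta(0) <= click_time - view_time <= window


if __name__ == "__main__":
    view = datetime(2016, 1, 1, 13, 0)   # 1:00 pm Ad View Event
    click = datetime(2016, 1, 1, 13, 1)  # 1:01 pm Ad Click Event
    print(clicked_within_window(view, click))  # True
```

The predicate itself is simple; the hard part on an infinite stream is that the two events may arrive late or out of order, so the processor has to hold events until their match can be evaluated.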
![Page 37: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/37.jpg)
DELAYS – AN EXAMPLE
[Diagram: DATACENTER 1 and DATACENTER 2 each have a Services Tier, Kafka and an Ad Quality Processor (Samza), with Kafka mirrored between the datacenters; an AdViewEvent from user kartik is routed by the load balancer (LB) to one of the datacenters]
![Page 38: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/38.jpg)
DELAYS – AN EXAMPLE
[Diagram: DATACENTER 1 and DATACENTER 2 each have a Services Tier, Kafka and Real Time Processing (Samza), with Kafka mirrored between the datacenters; an AdClick Event from user kartik is routed by the load balancer (LB) to one of the datacenters]
![Page 39: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/39.jpg)
What do we need to do to get accurate results?
Deal with
• Late arrivals
  – E.g. AdClick event showed up 5 minutes late.
• Out-of-order arrival
  – E.g. AdClick event showed up before AdView event
• Influenced by “Google MillWheel”
![Page 40: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/40.jpg)
Solution
[Diagram: Kafka (AdClicks) and Kafka (AdView) → Processing Job (Task1, Task2, Task3, each with a local Message Store) → output Kafka]
1. All events are stored locally
2. Find impacted ‘window/s’ for late arrivals
3. Recompute result
4. Choose strategy for emitting results (absolute or relative value)
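The four steps above can be sketched as a small windowed counter. This is an illustration under stated assumptions: fixed 60-second tumbling windows, event times given as integer seconds, and a per-window event count as the result. The class name `WindowedClickCounter` and the `emit_relative` flag are hypothetical, and an in-memory dict stands in for the local message store.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

WINDOW_SECONDS = 60  # assumed tumbling window size


class WindowedClickCounter:
    def __init__(self, emit_relative: bool = False) -> None:
        self.emit_relative = emit_relative
        # Step 1: message store, keyed by window start time.
        self.events: Dict[int, List[int]] = defaultdict(list)
        self.last_emitted: Dict[int, int] = {}

    def on_event(self, event_time: int) -> Tuple[int, int]:
        """Store the event, recompute its window and emit (window_start, value)."""
        # Step 2: find the window the event belongs to, even if it arrived late.
        window_start = event_time - event_time % WINDOW_SECONDS
        self.events[window_start].append(event_time)
        # Step 3: recompute the result for the impacted window.
        count = len(self.events[window_start])
        previous = self.last_emitted.get(window_start, 0)
        self.last_emitted[window_start] = count
        # Step 4: emit the new absolute value, or the delta since the last emit.
        value = count - previous if self.emit_relative else count
        return window_start, value


if __name__ == "__main__":
    counter = WindowedClickCounter()
    print(counter.on_event(5))   # (0, 1)
    print(counter.on_event(65))  # (60, 1)
    print(counter.on_event(30))  # late arrival for window 0: (0, 2)
```

Emitting absolute values lets a downstream consumer overwrite the earlier result, while relative values suit consumers that accumulate deltas.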
![Page 41: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/41.jpg)
Myth: This isn’t a problem with Lambda Architecture..
• Theory: Since the processing happens 1 hour or several hours later, delays are not a problem.
• Ok.. But what about the “edges”?
  – Some “sessions” start before the cut-off time for processing.. and end after the cut-off time.
  – Delays and out-of-order processing make things worse on the edges
![Page 42: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/42.jpg)
Essential Ingredients to Stream Processing
1. Scale, but not at any cost
2. Reprocessing
3. Accuracy of results
4. Easy Programmability
![Page 43: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/43.jpg)
Easy Programmability
• Support for “accurate” Windowing/Joins (Google Cloud Dataflow)
• Ability to express workflows/DAGs in config and DSL (e.g. Storm)
• SQL support for querying over streams
  – Azure Stream Insight
• Apache Samza – working on the above
![Page 44: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/44.jpg)
Agenda
• Stream processing Intro
• Scenarios
• Canonical Architecture
• Essential Ingredients of Stream Processing
• Close
![Page 45: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/45.jpg)
Some scale numbers at LinkedIn
• 1.3 Trillion Messages get ingested into Kafka per day
  – Each message gets consumed 4-5 times
• Database change capture:
  – More than 2 Trillion Messages get consumed per week
• Samza jobs in production which process more than 1 Million messages/sec
Note: These numbers are not reflective of LinkedIn Site traffic
![Page 46: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/46.jpg)
References
• http://samza.apache.org/
• http://kafka.apache.org/
• https://github.com/linkedin/databus
• http://cs.brown.edu/~ugur/8rulesSigRec.pdf
• http://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p734-akidau.pdf
![Page 47: Essential Ingredients of Realtime Stream Processing @ Scale](https://reader034.vdocument.in/reader034/viewer/2022051707/58ed2bd81a28abc97f8b45ed/html5/thumbnails/47.jpg)
Thank You!