Every ad. Every sales channel. Every screen. One platform.
Building a Distributed Data Streaming System
Ashish Tadose
Lead Software Engineer, Big Data Analytics - PubMatic
Agenda
• What is stream processing
• Streaming architecture
• Scalable Data Ingestion
• Real-time stream processing system
What is Stream Processing?
In simple words, Streaming is…
Batch & Streaming processing
• Batch processing: Data Generator -> Ingestion -> Distributed File system -> Processing -> Data Store
• Stream Data processing: Data Generator -> Ingestion -> Message Queue -> Processing -> Data Store
Batch & Streaming processing (combined diagram)
• Stream Data processing: Data Generator -> Ingestion -> Message Queue -> Processing -> Data Store
• Batch processing: Distributed File system -> Processing -> Data Store
Lambda Architecture: Velocity & Volume
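To illustrate how the batch and stream paths can be combined, here is a minimal Python sketch of the Lambda idea: a batch view recomputed from stored history, a speed layer updated per event, and a serving step that merges the two. The names `batch_view`, `SpeedLayer`, `serve`, and the `ad_id` field are hypothetical and not taken from the talk.

```python
from collections import Counter


def batch_view(events):
    """Batch layer: recompute counts from the full stored history (volume)."""
    return Counter(event["ad_id"] for event in events)


class SpeedLayer:
    """Speed layer: update counts incrementally as each event arrives (velocity)."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["ad_id"]] += 1


def serve(batch_counts, realtime_counts, key):
    """Serving step: merge the batch view with the real-time delta."""
    return batch_counts[key] + realtime_counts[key]


# Hypothetical data: events already stored on the distributed file system...
history = [{"ad_id": "a"}, {"ad_id": "b"}, {"ad_id": "a"}]
# ...and events that arrived through the message queue since the last batch run.
recent = [{"ad_id": "a"}, {"ad_id": "c"}]

batch = batch_view(history)
speed = SpeedLayer()
for event in recent:
    speed.on_event(event)

total_a = serve(batch, speed.counts, "a")
```

In this sketch the real-time counts would be discarded each time the batch layer catches up, so an error in the speed layer is eventually corrected by the next batch recomputation.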
Streaming Ingestion Technologies
Ingestion Ecosystem
• Sources
  • Machine data
  • External stream & syslogs
• Data Collection
  • Flume
  • Kafka
  • Kinesis
  • Confluent
Flume
• Easier to set up
• Rich set of built-in tools
• No inherent support for data replication
• Nodes work in isolation
• Memory channel vs File channel
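The memory channel vs file channel trade-off can be sketched in a few lines: a memory channel buffers events in RAM and loses them if the agent process dies, while a file channel writes them to disk so they survive a restart. The toy classes below are hypothetical illustrations in Python, not Flume's actual API or configuration.

```python
import os
import tempfile
from collections import deque


class MemoryChannel:
    """Toy stand-in for a memory channel: fast, but events live only in RAM."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def pending(self):
        return list(self._events)


class FileChannel:
    """Toy stand-in for a file channel: each event is appended to a file on disk."""

    def __init__(self, path):
        self._path = path

    def put(self, event):
        with open(self._path, "a", encoding="utf-8") as log:
            log.write(event + "\n")
            log.flush()
            os.fsync(log.fileno())  # force the write to disk before returning

    def pending(self):
        if not os.path.exists(self._path):
            return []
        with open(self._path, encoding="utf-8") as log:
            return [line.rstrip("\n") for line in log]


def simulate_restart():
    """Buffer two events in each channel, then 'restart' by creating new objects."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "channel.log")
        memory, disk = MemoryChannel(), FileChannel(path)
        for event in ("e1", "e2"):
            memory.put(event)
            disk.put(event)
        # After a crash, in-memory state is gone; the file is still there.
        return MemoryChannel().pending(), FileChannel(path).pending()


lost, recovered = simulate_restart()
```

The per-event `fsync` is what makes the file-backed variant slower, which is the usual reason to accept the memory channel's risk of data loss when throughput matters more than durability.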
Kinesis
Kafka
http://kafka.apache.org/
Originated at LinkedIn, open sourced in early 2011
Implemented in Scala, some Java
9 core committers, plus ~ 20 contributors
Why is Kafka so fast?
• Fast writes:
  • Kafka persists all data to disk, but essentially all writes go first to the OS page cache, i.e. RAM.
• Fast reads:
  • Data is transferred very efficiently from the page cache to a network socket.
  • Linux: sendfile() system call
• Combination of the two = fast Kafka!
• Example (Operations): On a Kafka cluster where the consumers are mostly caught up, you will see no read activity on the disks, as the brokers serve data entirely from cache.
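As a rough illustration of the read path, the Python sketch below sends a file to a socket with `socket.sendfile()`, which uses the operating system's `sendfile()` call where it is available and falls back to ordinary sends otherwise. This is not Kafka code: the function name `send_log_segment`, the payload, and the local socket pair standing in for a broker and a consumer are all assumptions made for the example.

```python
import os
import socket
import tempfile


def send_log_segment(path, conn):
    """Send a whole file to a connected socket; returns the number of bytes sent."""
    with open(path, "rb") as segment:
        # The kernel copies file data to the socket without passing it
        # through a user-space buffer when os.sendfile is available.
        return conn.sendfile(segment)


# Hypothetical "log segment": a small file of 4096 bytes.
payload = b"message-" * 512
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    segment_path = tmp.name

# A connected socket pair stands in for the broker and consumer ends.
broker_side, consumer_side = socket.socketpair()
try:
    sent = send_log_segment(segment_path, broker_side)
    broker_side.close()  # signal end of stream to the reader

    received = b""
    while chunk := consumer_side.recv(65536):
        received += chunk
finally:
    broker_side.close()
    consumer_side.close()
    os.remove(segment_path)
```

The payload is kept small so it fits in the socket buffer without a separate reader thread; a real transfer would read concurrently on the consumer side.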
http://kafka.apache.org/documentation.html#persistence
Flafka – Flume meets Kafka
Confluent - Centralized Ingestion with Kafka Pipeline
Stream Processing
RealTime Stream Processing
• Processing system
  • Apache Storm
  • Apache Samza
  • Apache Spark (Streaming)
  • Project Apex - DataTorrent
• Storage
  • Hive / HDFS
  • HBase
  • MySQL
  • Custom
• Access
  • Depends on the data store
  • Scalable query interface - Kafka
Streaming Design Patterns
• Micro batching
• Unpredictable incoming data
• Creating multiple streams
• Out of sequence events
• Stream joins
• Top N metrics
• External Lookup
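Two of the patterns above, micro batching and Top N metrics, can be sketched together in Python. The example groups an event stream into small batches by count and keeps a running top-N after each batch. Real systems often batch by time interval rather than by count; the function names and sample events here are hypothetical.

```python
from collections import Counter
from itertools import islice


def micro_batches(stream, batch_size):
    """Group an event iterator into lists of at most batch_size events."""
    iterator = iter(stream)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def top_n_per_batch(stream, batch_size, n):
    """Update running counts once per micro batch and snapshot the top n keys."""
    running = Counter()
    snapshots = []
    for batch in micro_batches(stream, batch_size):
        running.update(batch)  # one state update per batch, not per event
        snapshots.append(running.most_common(n))
    return snapshots


# Hypothetical stream of event keys (e.g. ad or page identifiers).
events = ["a", "b", "a", "c", "a", "b", "d"]
snapshots = top_n_per_batch(events, batch_size=3, n=2)
```

Updating state once per batch trades a little latency for fewer writes to the downstream store, which is the usual motivation for micro batching.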
Thank You