data streaming-systems

Every ad. Every sales channel. Every screen. One platform.

Building a Distributed Data Streaming System

Ashish Tadose

Lead Software Engineer, Big Data Analytics - PubMatic

Agenda

• What is stream processing

• Streaming architecture

• Scalable Data Ingestion

• Real-time stream processing system

What is Stream Processing?

In simple words, Streaming is…

Batch & Streaming processing

Batch processing:

Data Generator → Ingestion → Distributed File System → Processing → Data Store

Stream data processing:

Data Generator → Ingestion → Message Queue → Processing → Data Store
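
The difference between the two pipelines can be sketched in a few lines of Python. This is a toy illustration, not a real framework: the function names and the running-average metric are assumptions chosen for the example. The batch function needs the whole stored dataset before it answers; the streaming function updates its answer as each record arrives.

```python
from typing import Iterable, Iterator, List


def batch_average(stored_values: List[float]) -> float:
    """Batch style: the full dataset is already stored, compute once over all of it."""
    return sum(stored_values) / len(stored_values)


def streaming_average(values: Iterable[float]) -> Iterator[float]:
    """Stream style: keep small running state and emit an updated result per record."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
        yield total / count


if __name__ == "__main__":
    data = [2.0, 4.0, 6.0]
    print(batch_average(data))            # one answer after all data is available
    print(list(streaming_average(data)))  # one answer per incoming record
```

The streaming version never holds more than two numbers in memory, which is why it suits unbounded input.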


Lambda Architecture: Velocity & Volume

[Diagram: Data Generator, Ingestion, Message Queue, Distributed File System, Batch processing]
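
A minimal sketch of the Lambda idea, assuming a simple per-key event count as the metric. The class and method names are hypothetical and everything runs in one process; in a real deployment the master log would live on a distributed file system and the speed view in a stream processor. A query merges the batch view (volume) with the speed view (velocity).

```python
from collections import Counter


class LambdaCounts:
    """Toy Lambda architecture: batch view + speed view, merged at query time."""

    def __init__(self) -> None:
        self.master_log = []         # append-only raw events (stands in for the file system)
        self.batch_view = Counter()  # recomputed from the full log by the batch job
        self.speed_view = Counter()  # incremental counts for events not yet in a batch run

    def ingest(self, key: str) -> None:
        # Every event goes to both paths: durable log and real-time counter.
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self) -> None:
        # Recompute the batch view from scratch, then drop the real-time
        # state that the new batch view now covers.
        self.batch_view = Counter(self.master_log)
        self.speed_view = Counter()

    def query(self, key: str) -> int:
        return self.batch_view[key] + self.speed_view[key]


if __name__ == "__main__":
    counts = LambdaCounts()
    for ad in ["ad1", "ad1", "ad2"]:
        counts.ingest(ad)
    counts.run_batch()
    counts.ingest("ad1")         # arrives after the batch run
    print(counts.query("ad1"))   # batch part plus real-time part
```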

Streaming Ingestion Technologies

Ingestion Ecosystem

• Sources

  • Machine data

  • External stream & syslogs

• Data Collection

  • Flume

  • Kafka

  • Kinesis

  • Confluent

Flume

• Easier to set up

• Rich set of built-in tools

• No inherent support for data replication

• Nodes work in isolation

• Memory channel vs. file channel
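
The memory channel vs. file channel trade-off is speed against durability: a memory channel buffers events in RAM and loses them if the agent dies, while a file channel persists them to disk. The sketch below is a toy model of that trade-off only. The classes are hypothetical and are not Flume's API or its on-disk format.

```python
import json
import os
import tempfile
from collections import deque


class MemoryChannel:
    """Buffers events in RAM: fast, but contents vanish with the process."""

    def __init__(self) -> None:
        self._queue = deque()

    def put(self, event: dict) -> None:
        self._queue.append(event)

    def pending(self) -> list:
        return list(self._queue)


class FileChannel:
    """Appends events to a file: slower, but they survive a restart."""

    def __init__(self, path: str) -> None:
        self._path = path

    def put(self, event: dict) -> None:
        with open(self._path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the write to disk before returning

    def pending(self) -> list:
        if not os.path.exists(self._path):
            return []
        with open(self._path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def simulate_restart():
    """Write two events to each channel, 'crash', and see what is recoverable."""
    path = os.path.join(tempfile.mkdtemp(), "channel.log")
    memory, disk = MemoryChannel(), FileChannel(path)
    for event in ({"id": 1}, {"id": 2}):
        memory.put(event)
        disk.put(event)
    # Simulated agent crash: the old objects are gone, new ones start up.
    memory, disk = MemoryChannel(), FileChannel(path)
    return memory.pending(), disk.pending()


if __name__ == "__main__":
    print(simulate_restart())
```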

Kinesis

Kafka

http://kafka.apache.org/

Originated at LinkedIn, open sourced in early 2011

Implemented in Scala, some Java

9 core committers, plus ~20 contributors

Why is Kafka so fast?

• Fast writes:

• While Kafka persists all data to disk, essentially all writes go to the OS page cache, i.e. RAM.

• Fast reads:

• Very efficient to transfer data from page cache to a network socket

• Linux: sendfile() system call

• Combination of the two = fast Kafka!

• Example (Operations): On a Kafka cluster where the consumers are mostly caught up, you will see no read activity on the disks, because the brokers serve data entirely from the page cache.
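
The read path above can be illustrated with Python's `socket.sendfile()`, which uses the `sendfile()` system call where the platform provides it and otherwise falls back to ordinary `send()` calls. This is only a sketch of the technique, not Kafka's code: the function names, the temporary "log segment" file, and the local socket pair standing in for a broker and a consumer are all assumptions made for the example.

```python
import os
import socket
import tempfile


def send_log_segment(path: str, sock: socket.socket) -> int:
    """Send a whole file over a connected stream socket; return bytes sent."""
    with open(path, "rb") as segment:
        # With os.sendfile available, the kernel copies from the page cache
        # to the socket without the data passing through user space.
        return sock.sendfile(segment)


def demo():
    payload = b"offset=0 value=click\n" * 100
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(payload)
        path = f.name

    # A connected pair of local sockets stands in for broker and consumer.
    broker_side, consumer_side = socket.socketpair()
    try:
        sent = send_log_segment(path, broker_side)
        broker_side.close()  # signals end of stream to the reader
        chunks = []
        while True:
            chunk = consumer_side.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
        return sent, b"".join(chunks)
    finally:
        broker_side.close()
        consumer_side.close()
        os.unlink(path)


if __name__ == "__main__":
    sent, received = demo()
    print(sent, len(received))
```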

http://kafka.apache.org/documentation.html#persistence

Flafka – Flume meets Kafka

Confluent - Centralized Ingestion with Kafka Pipeline

Stream Processing

Real-Time Stream Processing

• Processing system

  • Apache Storm

  • Apache Samza

  • Apache Spark (Streaming)

  • Project Apex - DataTorrent

• Storage

  • Hive / HDFS

  • HBase

  • MySQL

  • Custom

• Access

  • Depends on the data storage

  • Scalable query interface - Kafka

Streaming Design Patterns

• Micro batching

• Unpredictable incoming data

• Creating multiple streams

• Out of sequence events

• Stream joins

• Top N metrics

• External Lookup
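
Two of the patterns above, micro batching and Top N metrics, can be combined in a short sketch. It assumes events are plain string keys (for example ad IDs) and uses hypothetical function names; a real system would usually cut batches by time interval rather than by a fixed count.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def micro_batches(events: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group an unbounded event stream into small fixed-size batches."""
    batch: List[str] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch


def running_top_n(
    events: Iterable[str], batch_size: int, n: int
) -> Iterator[List[Tuple[str, int]]]:
    """After each micro batch, emit the N most frequent keys seen so far."""
    totals: Counter = Counter()
    for batch in micro_batches(events, batch_size):
        totals.update(batch)  # one state update per batch, not per event
        yield totals.most_common(n)


if __name__ == "__main__":
    stream = ["a", "b", "a", "c", "a", "b"]
    for top in running_top_n(stream, batch_size=3, n=2):
        print(top)
```

Updating state once per batch trades a little latency for fewer, larger writes, which is the usual motivation for micro batching.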

Thank You
