Development of a Distributed Stream Processing System


DESCRIPTION

Development of a Distributed Stream Processing System (DSPS) in node.js and ZeroMQ, with a demonstration of a trending-topics application on a dataset from Twitter.

TRANSCRIPT

Development of a Distributed Stream Processing System

Maycon Viana Bordin, Final Assignment

Instituto de Informática

Universidade Federal do Rio Grande do Sul

CMP157 – PDP 2013/2, Claudio Geyer

What’s Stream Processing?

Stream source: emits data continuously and sequentially.

Operators: count, join, filter, map.

[Diagram: data streams flow from the stream source through operators to a sink.]

Data Stream

Tuple -> (“word”, 55)

Tuples are ordered by a timestamp or other attribute.

Data from the stream source may or may not be structured

The amount of data is usually unbounded in size

The input rate is variable and typically unpredictable
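To make these definitions concrete, here is a minimal sketch (not from the original slides) of a stream source in node.js that emits timestamped tuples at a variable rate; the names `StreamSource` and the tuple fields are illustrative, not the system's actual API.

```js
// Minimal sketch of a stream source: emits ("word", count) tuples
// continuously and sequentially. All names here are illustrative.
const { EventEmitter } = require("events");

class StreamSource extends EventEmitter {
  start() {
    const words = ["foo", "bar", "baz"];
    let seq = 0;
    // Variable, unpredictable input rate: random delay between emissions.
    const tick = () => {
      const word = words[Math.floor(Math.random() * words.length)];
      // A tuple, ordered by timestamp, e.g. ("word", 55).
      this.emit("tuple", { ts: Date.now(), seq: seq++, data: [word, 1] });
      setTimeout(tick, Math.random() * 100);
    };
    tick();
  }
}

const source = new StreamSource();
source.on("tuple", (t) => console.log(t));
source.start();
```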

Operators

An operator receives one or more data streams and sends one or more data streams.

Operators Classification

Stateless (map, filter)

Stateful
  Non-Blocking (count, sum)
  Blocking (join, frequent itemset)
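As an illustration (not from the slides), a stateless operator like filter can be a pure function of each tuple, while a stateful non-blocking operator like count must keep state across tuples; a minimal sketch:

```js
// Stateless operator: output depends only on the current tuple.
const filter = (pred) => (tuple) => (pred(tuple) ? [tuple] : []);

// Stateful non-blocking operator: emits a result for every input,
// but keeps a running count per key across tuples.
const count = () => {
  const counts = new Map();
  return ([key, n]) => {
    counts.set(key, (counts.get(key) || 0) + n);
    return [[key, counts.get(key)]];
  };
};

const onlyHashtags = filter(([word]) => word.startsWith("#"));
const counter = count();
console.log(onlyHashtags(["#node", 1])); // [["#node", 1]]
console.log(counter(["#node", 1]));      // [["#node", 1]]
console.log(counter(["#node", 1]));      // [["#node", 2]]
```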

Blocking operators need all input in order to generate a result, but that's not possible since data streams are unbounded. To solve this issue, tuples are grouped into windows.

A window is defined by a window start (ws) and a window end (we), with a range in time units or number of tuples. As the stream advances, the window slides from the old (ws, we) to a new (ws, we).
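A minimal sketch of a count-based sliding window, assuming a simple array-backed implementation (illustrative only): the buffer spans ws to we, and each advance drops the oldest tuples and admits new ones.

```js
// Count-based sliding window: range = max tuples held,
// advance = how many tuples the window slides by.
class SlidingWindow {
  constructor(range, advance, onWindow) {
    this.range = range;
    this.advance = advance;
    this.onWindow = onWindow;
    this.buffer = [];
  }
  push(tuple) {
    this.buffer.push(tuple);
    if (this.buffer.length === this.range) {
      this.onWindow(this.buffer.slice());           // emit the full window [ws..we]
      this.buffer = this.buffer.slice(this.advance); // slide: new ws and we
    }
  }
}

const w = new SlidingWindow(4, 2, (win) => console.log("window:", win));
[1, 2, 3, 4, 5, 6].forEach((t) => w.push(t));
// window: [1, 2, 3, 4]   then   window: [3, 4, 5, 6]
```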

Implementation: Architecture

[Diagram: a master node coordinates four slave nodes, each running worker threads; a client submits, starts, and stops applications through the master, and each slave sends heartbeats to the master.]

The heartbeat carries the status of each worker in the slave: tuples processed, throughput, and latency.
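The slides don't show the heartbeat format; as an assumption, it could be a small JSON message per slave, for example:

```js
// Hypothetical heartbeat payload sent from a slave to the master every
// few seconds (field names and values are assumptions, not the actual format).
const heartbeat = {
  slave: "slave-2",
  workers: [
    { id: "countmin-1", tuplesProcessed: 48210, throughput: 1607, latencyMs: 25.4 },
    { id: "extract-0",  tuplesProcessed: 90355, throughput: 3012, latencyMs: 12.1 },
  ],
  ts: Date.now(),
};
console.log(JSON.stringify(heartbeat));
```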

Implementation: Application

Applications are composed as a DAG (Directed Acyclic Graph). To illustrate, let's look at the graph of a Trending Topics application:

stream -> extract hashtags -> countmin sketch -> File Sink

stream: the stream source emits tweets in JSON format.

extract hashtags: extracts the text from the tweet, adds a timestamp to each tuple, and emits each #hashtag in the tweet.

countmin sketch: constant time and space approximate frequent itemset [Cormode and Muthukrishnan, 2005]. Without a window, it emits all top-k items each time a hashtag is received. With a window, the number of tuples emitted is reduced, but the latency is increased.

File Sink: writes the results.
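For reference, a compact count-min sketch in JavaScript, following [Cormode and Muthukrishnan, 2005]; the hash function and default sizes here are simplified choices for illustration, not the system's actual implementation.

```js
// Count-min sketch: d hash rows of w counters; update and query are O(d),
// and the estimate overcounts by at most a small bounded error.
class CountMinSketch {
  constructor(w = 1024, d = 4) {
    this.w = w;
    this.d = d;
    this.rows = Array.from({ length: d }, () => new Uint32Array(w));
  }
  // Simple seeded string hash (illustrative; not the paper's hash family).
  hash(key, seed) {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < key.length; i++) {
      h ^= key.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return (h >>> 0) % this.w;
  }
  add(key, n = 1) {
    for (let i = 0; i < this.d; i++) this.rows[i][this.hash(key, i)] += n;
  }
  estimate(key) {
    let min = Infinity;
    for (let i = 0; i < this.d; i++)
      min = Math.min(min, this.rows[i][this.hash(key, i)]);
    return min; // taking the minimum over rows bounds the overcount
  }
}

const cms = new CountMinSketch();
cms.add("#node"); cms.add("#node"); cms.add("#zmq");
console.log(cms.estimate("#node")); // 2 (possibly overcounted)
```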

The second step in building an application is to set the number of instances of each operator:

[Diagram: the same graph with five instances of extract and five instances of countmin.]

But the user has to choose the way tuples are going to be partitioned among the operator instances.

All-to-All Partitioning

[Diagram: every extract instance is connected to every countmin instance.]

Round-Robin Partitioning

[Diagram animation: successive tuples from each extract instance are sent to the countmin instances in turn, cycling through them.]

Field Partitioning

[Diagram animation: a tuple (“foo”, 1) is always routed to the same countmin instance, chosen by the value of its key field.]
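A sketch of the three partitioning choices as routing functions (the names and interfaces are illustrative, and all-to-all is shown here as broadcast, one common interpretation; the system's real interface isn't shown in the slides):

```js
// Each strategy maps a tuple to the downstream instance(s) that receive it.
const allToAll = (instances) => (tuple) => instances; // every instance receives it

const roundRobin = (instances) => {
  let next = 0;
  return (tuple) => [instances[next++ % instances.length]]; // cycle through
};

// Field partitioning: hash a key field so equal keys always land on the
// same instance, which is what stateful operators like countmin need.
const byField = (instances, field) => (tuple) => {
  const key = String(tuple[field]);
  let h = 0;
  for (let i = 0; i < key.length; i++) h = (h * 31 + key.charCodeAt(i)) | 0;
  return [instances[Math.abs(h) % instances.length]];
};

const countmins = ["countmin-0", "countmin-1", "countmin-2"];
const route = byField(countmins, 0);
console.log(route(["foo", 1])); // always the same instance for "foo"
```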

The communication between operators is done with the pub/sub pattern. Each operator subscribes to all of its upstream operators, using its own ID as the filter, so it only receives tuples that carry its ID as a prefix.
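Since the system is built on node.js and ZeroMQ, this prefix-filtered pub/sub maps directly onto ZeroMQ's PUB/SUB sockets. A minimal sketch using the modern `zeromq` npm package; the slides don't show the actual wiring, so the addresses, IDs, and framing here are assumptions:

```js
// Prefix-filtered pub/sub between operators with ZeroMQ (zeromq v6 API).
const zmq = require("zeromq");

async function run() {
  const pub = new zmq.Publisher();
  await pub.bind("tcp://127.0.0.1:5555"); // upstream operator

  const sub = new zmq.Subscriber();
  sub.connect("tcp://127.0.0.1:5555");    // downstream operator
  sub.subscribe("countmin-1");            // its own ID as the filter

  // ZeroMQ filters on the first frame's prefix, so we send [id, payload].
  setTimeout(() => {
    pub.send(["countmin-1", JSON.stringify(["#node", 1])]); // delivered
    pub.send(["countmin-2", JSON.stringify(["#zmq", 1])]);  // filtered out
  }, 100);

  for await (const [topic, msg] of sub) {
    console.log(topic.toString(), msg.toString());
  }
}
run();
```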

The last step is to take each operator instance from the graph and assign it to a node.

[Diagram: the operator instances distributed across node-0, node-1, and node-2.]

Currently the scheduler is static and only balances the number of operators per node.
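The static balancing described above could be as simple as dealing instances out round-robin across nodes; an illustrative sketch (not the actual scheduler):

```js
// Static scheduler sketch: balance only the count of operator instances
// per node, with no awareness of network communication between them.
function schedule(instances, nodes) {
  const assignment = new Map(nodes.map((n) => [n, []]));
  instances.forEach((inst, i) => {
    assignment.get(nodes[i % nodes.length]).push(inst);
  });
  return assignment;
}

const instances = ["stream", "extract-0", "extract-1", "extract-2",
                   "countmin-0", "countmin-1", "filesink"];
console.log(schedule(instances, ["node-0", "node-1", "node-2"]));
```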

Implementation: Framework

trending-topics.js
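The original slide shows the trending-topics.js source, which did not survive extraction. What follows is a hypothetical reconstruction of how such an application definition might look; the structure, field names, and option values are invented for illustration (the real code is in the repository at github.com/mayconbordin/tempest).

```js
// Hypothetical sketch of trending-topics.js: declaring the DAG,
// instance counts, and partitioning. All names are illustrative only.
const app = {
  operators: [
    { id: "stream",   type: "twitter-json-source", instances: 1 },
    { id: "extract",  type: "extract-hashtags",    instances: 5 },
    { id: "countmin", type: "countmin-sketch",     instances: 5,
      window: { size: 20, advance: 10 } },
    { id: "filesink", type: "file-sink",           instances: 1,
      path: "out/trending.txt" },
  ],
  streams: [
    { from: "stream",   to: "extract",  partition: "round-robin" },
    { from: "extract",  to: "countmin", partition: "field:0" },
    { from: "countmin", to: "filesink", partition: "all-to-all" },
  ],
};
console.log(JSON.stringify(app, null, 2));
```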

Tests: Specification

Application: Trending Topics, with a 40 GB dataset from Twitter.

Test Environment: GridRS - PUCRS
  3 nodes
  4 x 3.52 GHz (Intel Xeon)
  2 GB RAM
  Linux 2.6.32-5-amd64
  Gigabit Ethernet

Variables
  Number of nodes
  Number of operator instances
  Window size

Metrics
  Runtime
  Latency: time for a tuple to traverse the graph
  Throughput: number of tuples processed per second
  Loss of tuples

Methodology
  5 runs per test.
  Every 3 s each operator sends its status with the number of tuples processed.
  The PerfMon sink collects a tuple every 100 ms and sends the average latency every 3 s (then clears the collected tuples).

Tests: Number of Nodes

[Chart: Runtime vs Latency; x axis: number of nodes (1, 2, 3); series: runtime (min), latency (ms).]

[Chart: Runtime vs Stream Rate; x axis: number of nodes (1, 2, 3); series: runtime (min), stream rate (MB/s).]

[Chart: Throughput in tuples per second; x axis: number of nodes (1, 2, 3); series: stream, extractor, countmin, filesink, perfmon.]

[Chart: Loss of Tuples; x axis: number of nodes (1, 2, 3); series: stream, extractor, countmin, filesink, perfmon.]

[Chart: Throughput and Latency Over Time (nodes=3, instances=5, window=20); x axis: time (seconds); series: throughput (tuples/s) of stream, extractor, countmin, and filesink, plus latency (ms).]

Tests: Window Size

[Chart: Runtime vs Latency; x axis: window size (20, 80, 120, 200); series: runtime (min), latency (ms).]

Tests: No. of Instances

[Chart: Runtime vs Stream Rate; x axis: number of instances (1, 5); series: runtime (min), stream rate (MB/s).]

Conclusions

The system was able to process more data as more nodes were added. On the other hand, distributing the load increased the latency.

The scheduler needs to reduce network communication, and communication between workers on the same node should happen through main memory.

References

Chakravarthy, Sharma. Stream Data Processing: A Quality of Service Perspective: Modeling, Scheduling, Load Shedding, and Complex Event Processing. Vol. 36. Springer, 2009.

Cormode, Graham, and S. Muthukrishnan. "An improved data stream summary: the count-min sketch and its applications." Journal of Algorithms 55.1 (2005): 58-75.

Gulisano, Vincenzo Massimiliano, Ricardo Jiménez Peris, and Patrick Valduriez. StreamCloud: An Elastic Parallel-Distributed Stream Processing Engine. PhD dissertation, 2012.

Source code @ github.com/mayconbordin/tempest
