dancing with stream processing

Post on 14-Apr-2017


Dancing with Stream Processing

Y.S. Horawalavithana
sameera1@mail.usf.edu

07/10/16 MSc. Distributed Systems 2

Lightning Talk

● Motivation
● Event Stream Processing
– Pub/sub, CEP, “Buzzwords”
– Stream processing engines
  ● Spark Streaming, Storm, etc.
● Graph Stream Processing
– Theory... {“sketching”, “spanners”, “sparsifiers”}
– Challenges
● Discussion !!


Motivation


The Streaming Era

● Today, most data is continuously produced
- user activity logs, web logs, sensors, database transactions, …
● The common approach to analyzing such data so far
- Record the data stream to stable storage (DBMS, HDFS, …)
- Periodically analyze the data with a batch processing engine (DBMS, MapReduce, …)
● Stream processing engines analyze data while it arrives


The Streaming Era (Contd.)

● Decreases the overall latency to obtain results
- No need to persist data in stable storage
- No periodic batch analysis jobs
● Simplifies the data infrastructure
- Fewer moving parts to be maintained and coordinated
● Makes time dimension of data explicit
- Each event has a timestamp
- Data can be processed based on timestamps
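Processing "based on timestamps" can be sketched as follows. This is only an illustration, not taken from the slides: the `Event` type, the `tumbling_window_counts` helper and the 60-second window size are all assumed names and values. Each event is assigned to a fixed window by its own timestamp, so arrival order does not matter.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, NamedTuple


class Event(NamedTuple):
    timestamp: float  # event time in seconds (when the event happened)
    key: str          # e.g. a page name or a sensor ID (hypothetical)


def tumbling_window_counts(events: Iterable[Event], size: float = 60.0) -> Dict[float, Dict[str, int]]:
    """Count events per key in fixed, non-overlapping event-time windows."""
    windows = defaultdict(Counter)
    for ev in events:
        # Assign the event by its own timestamp, not by its arrival order.
        window_start = (ev.timestamp // size) * size
        windows[window_start][ev.key] += 1
    return {start: dict(counts) for start, counts in windows.items()}


# Events arrive out of order; the result only depends on their timestamps.
events = [Event(5, "home"), Event(61, "cart"), Event(12, "home"), Event(59, "cart")]
result = tumbling_window_counts(events)
```

Here the window starting at 0 holds two "home" events and one "cart" event, and the window starting at 60 holds one "cart" event.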


Event Streams [Immutable]

Web Page Event

Wikipedia Page Update Event

LinkedIn User Update Event


Middleware


Point-to-point communication
● Direct coupling
● Strict identity
● Time coupling
– Not good for volatile environments
– Not a good way to communicate with several participants

Indirect communication
● Space uncoupling
● Anonymity
● Time uncoupling
– Independent lifetimes between parties
– Through a persistent communication channel


Taxonomy

Indirect Communication
● Communication based
– Group communication
– Message Queues
– Publish/subscribe
● State based
– Tuple spaces
– Distributed Shared Memory


Pub/Sub Messaging Pattern

Topic-based
- Each event belongs to a number of topics (e.g. “music”, “sport”)
- Users subscribe to topics and receive all relevant events

Content-based
- Users subscribe to the actual content of the events, or to a structured summary of it
- More expressive


Pub/Sub Activities

Subscription processing
- Indexing and storing subscriptions.

Event Stream Processing (ESP)
- Pub/sub approach: upon arrival of events, access the subscription index and identify all matched subscriptions.

Event delivery
- Deliver events to clients with matched subscriptions.
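The three activities can be sketched with a toy in-memory broker. This is a rough illustration under assumed names (`Broker`, `subscribe_topic`, `subscribe_content`, `publish` and the event layout are all hypothetical), not the API of any real pub/sub system. Topic subscriptions go into an index, content-based subscriptions are stored as predicates, and `publish` does the matching and the delivery.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple


class Broker:
    def __init__(self) -> None:
        # Subscription processing: index and store subscriptions.
        self.topic_index: Dict[str, Set[str]] = defaultdict(set)
        self.content_subs: List[Tuple[str, Callable[[dict], bool]]] = []
        self.inboxes: Dict[str, List[dict]] = defaultdict(list)

    def subscribe_topic(self, client: str, topic: str) -> None:
        self.topic_index[topic].add(client)

    def subscribe_content(self, client: str, predicate: Callable[[dict], bool]) -> None:
        self.content_subs.append((client, predicate))

    def publish(self, event: dict) -> Set[str]:
        # Event processing: look up the index and find all matched subscriptions.
        matched: Set[str] = set()
        for topic in event.get("topics", []):
            matched |= self.topic_index.get(topic, set())
        for client, predicate in self.content_subs:
            if predicate(event):
                matched.add(client)
        # Event delivery: hand the event to every matched client.
        for client in matched:
            self.inboxes[client].append(event)
        return matched


broker = Broker()
broker.subscribe_topic("alice", "music")                            # topic-based
broker.subscribe_content("bob", lambda e: e.get("price", 0) > 100)  # content-based
first = broker.publish({"topics": ["music"], "price": 150})
second = broker.publish({"topics": ["sport"], "price": 20})
```

The first event reaches both clients; the second matches no subscription. A real broker would index content predicates too, rather than scanning them linearly as done here.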


Event Stream Processing (ESP)

Wikipedia


Today's world...

Pub/sub ≈ ESP ≈


“Buzzwords”

https://martin.kleppmann.com/2015/01/29/stream-processing-event-sourcing-reactive-cep.html


Complex Event Processing (CEP)

● A set of event processing principles

● Match patterns of events
– Comparable to SQL queries
– High-level query language

● Cloud of causally related events
– POSET (Partially Ordered Set of Events)


Complex Event Processing (CEP)

● Some CEP Examples:
– When 2 transactions happen on an account from radically different geographic locations within a certain time window, then report as potential fraud.
– When a gold customer's trouble ticket is not resolved within 1 hour, then escalate.
– When a team meeting request overlaps with my lunch break, then deny the team meeting and demote the meeting organizer.
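The first example (the fraud pattern) can be sketched in plain Python. Everything here is an assumption made for illustration: the `Txn` type, the one-hour window, and the simplification that "radically different locations" means two different location codes. A CEP engine would express the same pattern in its query language instead.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, List, NamedTuple, Tuple


class Txn(NamedTuple):
    account: str
    timestamp: float  # seconds
    location: str     # coarse location code (assumed granularity)


def detect_fraud(txns: Iterable[Txn], window: float = 3600.0) -> List[Tuple[Txn, Txn]]:
    """Return (earlier, later) pairs on the same account, from different
    locations, at most `window` seconds apart. Assumes timestamp order."""
    recent: Dict[str, Deque[Txn]] = defaultdict(deque)
    alerts: List[Tuple[Txn, Txn]] = []
    for txn in txns:
        history = recent[txn.account]
        # Expire transactions that fell out of the time window.
        while history and txn.timestamp - history[0].timestamp > window:
            history.popleft()
        for earlier in history:
            if earlier.location != txn.location:
                alerts.append((earlier, txn))
        history.append(txn)
    return alerts


txns = [
    Txn("acc1", 0, "LK"),
    Txn("acc1", 600, "US"),    # 10 minutes later, different location
    Txn("acc2", 700, "LK"),
    Txn("acc1", 10000, "LK"),  # outside the window of the earlier ones
]
alerts = detect_fraud(txns)
```

Only the first two `acc1` transactions form an alert; the last one arrives after both earlier transactions have expired from the window.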


ESP and CEP [Timeline]

[Figure: timeline of ESP and CEP systems, from before 2000 to 2016]
- Before 2000 (1989 - 1995): Rapide
- 2002: Aurora
- 2003: Medusa
- 2005: Borealis
- Also shown: STREAM, TelegraphCQ, Esper, Apama, StreamBase, SQLStream, WSO2 CEP


ESP vs. CEP

http://www.slideshare.net/TimBassCEP/mythbusters-event-stream-processing-v-complex-event-processing-presentation


Today's world...

ESP ≈ CEP ≈


Laundry of “Buzzwords”

● Actor Frameworks
– Better mechanism to handle concurrency
– E.g. Akka, Orleans and Erlang OTP

● “Reactive”
– Language semantics for bringing event streams to the user interface
– Responsive, Resilient, Elastic and Message Driven
– E.g. Data flow languages, Functional reactive programming

● Event Sourcing
● Change Data Capture (CDC)


Analytics ≈ Stream Transformations

https://martin.kleppmann.com/2015/01/29/stream-processing-event-sourcing-reactive-cep.html


https://martin.kleppmann.com/2015/01/29/stream-processing-event-sourcing-reactive-cep.html


Target

● Better Scalability
● High Throughput
● Low latency
● Powerful semantics
● Easy integration

via Low Level Stream Processing Frameworks !!


Spark Streaming

● General purpose computing engine to run batch, interactive and streaming jobs

● Based on Resilient Distributed Datasets (RDD)
– Restricted form of distributed shared memory
– Immutable
– Can only be built through deterministic transformations

● Efficient fault recovery using lineage graph
– Recompute lost partitions on failure
– No cost if nothing fails
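The lineage idea can be sketched with a toy class. This is not Spark's implementation or API: `ToyRDD`, `from_partitions`, `lose_partition` and the in-memory cache are assumptions made for illustration. Each dataset is immutable and only remembers how to derive its partitions from its parent, so a lost partition is rebuilt by re-running the same deterministic transformation.

```python
from typing import Callable, Dict, List, Optional, Sequence


class ToyRDD:
    """Immutable dataset that remembers how it was derived (its lineage)."""

    def __init__(self, num_partitions: int, compute: Callable[[int], list],
                 parent: Optional["ToyRDD"] = None) -> None:
        self.num_partitions = num_partitions
        self._compute = compute              # partition index -> records
        self.parent = parent                 # one link of the lineage graph
        self._cache: Dict[int, list] = {}    # materialized partitions

    @staticmethod
    def from_partitions(partitions: Sequence[Sequence]) -> "ToyRDD":
        data = [list(p) for p in partitions]  # stands in for stable storage
        return ToyRDD(len(data), lambda i: list(data[i]))

    def map(self, fn: Callable) -> "ToyRDD":
        # A deterministic transformation yields a new dataset; nothing is mutated.
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in self.partition(i)],
                      parent=self)

    def partition(self, i: int) -> list:
        if i not in self._cache:
            self._cache[i] = self._compute(i)  # (re)compute from lineage
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        self._cache.pop(i, None)               # simulate a failed worker

    def collect(self) -> List:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


base = ToyRDD.from_partitions([[1, 2], [3, 4]])
squared = base.map(lambda x: x * x)
before = squared.collect()
squared.lose_partition(1)    # only this partition is recomputed
after = squared.collect()
```

Nothing extra is stored or replicated while the job runs normally, which is the "no cost if nothing fails" point: recovery work happens only for the partition that was lost.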


Spark Streaming (Contd.) [Key concepts]

● DStream – sequence of RDDs representing a stream of data
– HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets

● Transformations – modify data from one DStream to another
– Standard RDD operations – map, countByValue, reduce, join, …
– Stateful operations – window, countByValueAndWindow, …

● Output Operations – send data to external entity
– saveAsHadoopFiles – saves to HDFS
– foreachRDD (named foreach in early releases) – do anything with each batch of results


Spark Streaming (Contd.)

● Run a streaming computation as a series of very small, deterministic batch jobs
– Chop up the live stream into batches of X seconds
– Spark treats each batch of data as RDDs and processes them using RDD operations
– Finally, the processed results of the RDD operations are returned in batches
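The micro-batch model can be sketched without Spark. The helpers below (`micro_batches`, `count_by_value`) are hypothetical names, and the sketch works on a finite, already timestamped list rather than a live stream: it chops records into batches of `batch_interval` seconds and runs the same small batch job on each one.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def micro_batches(records: Iterable[Tuple[float, str]],
                  batch_interval: float) -> List[List[str]]:
    """Group (arrival_time, value) records into consecutive batches,
    each covering `batch_interval` seconds, starting at time 0."""
    batches: List[List[str]] = []
    for arrival_time, value in records:
        index = int(arrival_time // batch_interval)
        while len(batches) <= index:
            batches.append([])   # intervals without data yield empty batches
        batches[index].append(value)
    return batches


def count_by_value(batch: List[str]) -> Dict[str, int]:
    """The small deterministic batch job applied to every batch."""
    return dict(Counter(batch))


stream = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (1.5, "a"), (3.0, "c")]
results = [count_by_value(b) for b in micro_batches(stream, batch_interval=1.0)]
```

With a one-second interval this produces four result batches, the third of which is empty because nothing arrived between seconds 2 and 3.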


Berkeley Data Stack


Spark 2.0 is coming !!


Apache Storm [Key concepts]

● Tuple
– Core Unit of Data
– Immutable Set of Key/Value Pairs

● Spouts
– Source of Streams
– Wraps a streaming data source and emits Tuples

● Bolts
– Core functions of a streaming computation
– Receive tuples and do stuff
– Optionally emit additional tuples


Apache Storm [Key concepts]

● Topology
– DAG of Spouts and Bolts
– Data Flow Representation
– Streaming Computation
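The spout/bolt/topology vocabulary can be illustrated with Python generators. This is not Storm's API (Storm topologies are normally built in Java); the function names and the word-count example are assumptions, and the "DAG" here is the simplest possible one, a linear chain. Tuples are modelled as small dicts of key/value pairs.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Iterator

StormTuple = Dict[str, Any]  # a tuple: a set of key/value pairs


def sentence_spout(sentences: Iterable[str]) -> Iterator[StormTuple]:
    """Spout: wraps a data source and emits tuples."""
    for sentence in sentences:
        yield {"sentence": sentence}


def split_bolt(tuples: Iterable[StormTuple]) -> Iterator[StormTuple]:
    """Bolt: receives tuples and emits one new tuple per word."""
    for t in tuples:
        for word in t["sentence"].split():
            yield {"word": word}


def count_bolt(tuples: Iterable[StormTuple]) -> Iterator[StormTuple]:
    """Bolt: keeps running counts and emits the updated count per word."""
    counts: Counter = Counter()
    for t in tuples:
        counts[t["word"]] += 1
        yield {"word": t["word"], "count": counts[t["word"]]}


# Topology: spout -> split bolt -> count bolt
topology = count_bolt(split_bolt(sentence_spout(["the cow", "the moon"])))
output = list(topology)
```

Each stage pulls tuples from the previous one as they become available, so the chain processes the input one tuple at a time instead of materializing intermediate results.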


Apache Storm [Physical View]


Twitter introduces Heron !!

[Storm's successor]


Stream Processing Engines

Many More !!!


Hidden computation paradigm

via pipelining !!


Pipelining ≈ Task Execution

https://martin.kleppmann.com/unix


Let's build the concept again...


Linux pipelining in modern middleware...

https://martin.kleppmann.com/unix


Spark, Storm, Samza, Flink Etc.

https://martin.kleppmann.com/unix



Pub/sub pitch

https://martin.kleppmann.com/unix


Streaming Machine Learning

● By using a programming abstraction for distributed streaming
– Apache SAMOA


Graph Stream Processing

Referred Author: Vasia Kalavri, KTH
https://berlinbuzzwords.de/sites/berlinbuzzwords.de/files/media/documents/buzzwords-kalavri.pdf


Static Graph Processing

● Load: read the graph from disk and partition it in memory

● Compute: read and mutate the graph state

● Store: write the final graph state back to disk


Static Graph Processing [Drawbacks]

● It is slow
– wait until the computation is over before you see any result
– pre-processing and partitioning

● It is expensive
– lots of memory and CPU required in order to scale

● It requires re-computation for graph changes
– no efficient way to deal with updates


Streaming Graph Processing

We consume events in real-time

● Get results faster
– No need to wait for the job to finish
– Sometimes, early approximations are better than late exact answers

● Get results continuously
– Process unbounded number of events


Real-world scenarios

● Targeted Advertisement
– Finding Strongly Connected Components in a social network graph
– Targeted chain of advertisement on detected communities

[Figure: example social graph with nodes Jane, Joe, #Tesla, Peter, Taphouse and John; edges labelled knows, posts, likes, checks-in and subscribes; targeted ads: “Self driving cars” and “Dinner Offer”]


Streaming Graph Processing [Challenges]

● Maintain the graph structure
– How to apply state updates efficiently?

● Result updates
– Re-run the analysis for each event?
– Design an incremental algorithm?
– Run separate instances on multiple snapshots?

● How to preserve graph properties?
– Natural behavior?


Streaming Graph Processing [Current Research]

Each event is an edge addition

- Jane knows Joe
- Jane likes #Tesla
- Joe posts #Tesla
- Peter checks-in TapHouse


Dynamic Graph Processing

● Instead of analyzing the whole graph
– Analyze its properties by preserving them continuously
  ● Connectivity or Distance (spanners)
  ● Graph cut estimation (sparsifiers)
  ● Neighborhood or homomorphic properties (sketches)
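As one concrete case of preserving a property instead of keeping the whole graph, the sketch below uses a well-known greedy spanner construction for unweighted edge streams. It is swapped in here for illustration and is not taken from the slides; `stream_spanner`, `within_distance` and the `stretch` parameter are assumed names. An incoming edge is kept only if its endpoints are currently more than `stretch` hops apart in the kept subgraph, so every dropped edge is covered by a path of at most `stretch` hops.

```python
from collections import defaultdict, deque
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def within_distance(adj: Dict[Hashable, Set[Hashable]],
                    src: Hashable, dst: Hashable, limit: int) -> bool:
    """True if dst can be reached from src in at most `limit` hops (bounded BFS)."""
    if src == dst:
        return True
    seen = {src}
    frontier = deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == limit:
            continue  # neighbours of this node would exceed the hop limit
        for nxt in adj.get(node, ()):
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return False


def stream_spanner(edges: Iterable[Edge], stretch: int) -> List[Edge]:
    """Keep an edge only if its endpoints are not already within
    `stretch` hops of each other in the edges kept so far."""
    adj: Dict[Hashable, Set[Hashable]] = defaultdict(set)
    kept: List[Edge] = []
    for u, v in edges:
        if not within_distance(adj, u, v, stretch):
            adj[u].add(v)
            adj[v].add(u)
            kept.append((u, v))
    return kept


# The edge (1, 3) closes a triangle and is dropped: 1 and 3 are already 2 hops apart.
kept = stream_spanner([(1, 2), (2, 3), (1, 3), (3, 4)], stretch=2)
```

Distances in the kept subgraph are at most `stretch` times the true distances, while the stored state can be much smaller than the full graph.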


Dynamic Graph Processing (Contd.)

[Figure: the example social graph (Jane, Joe, #Tesla, Peter, Taphouse, John; “Self driving cars” and “Dinner Offer” ads) extended with “loves” edges, including Peter loves Jane, and a further “Self driving cars” ad]


Stream Connected Components

● State: a disjoint set data structure for the components

● Computation: For each edge
– if seen for the 1st time, create a component with ID the min of the vertex IDs
– if in different components, merge them and update the component ID to the min of the component IDs
– if only one of the endpoints belongs to a component, add the other one to the same component
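The steps above can be sketched with a small union-find structure. This is a single-machine illustration with assumed names (`StreamConnectedComponents`, `add_edge`, `components`), not the Gelly-Streams implementation. Treating every unseen vertex as its own one-vertex component lets the three cases collapse into one merge rule in which the smaller root ID always wins, so a component's ID is always its minimum vertex ID.

```python
from typing import Dict, Hashable, Set


class StreamConnectedComponents:
    def __init__(self) -> None:
        # State: disjoint-set forest; each root is the component ID.
        self.parent: Dict[Hashable, Hashable] = {}

    def find(self, v: Hashable) -> Hashable:
        """Return the component ID of an already seen vertex."""
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every vertex on the path at the root.
        while self.parent[v] != root:
            nxt = self.parent[v]
            self.parent[v] = root
            v = nxt
        return root

    def add_edge(self, u: Hashable, v: Hashable) -> None:
        """Process one edge-addition event."""
        for x in (u, v):
            self.parent.setdefault(x, x)  # unseen vertex: its own component
        root_u, root_v = self.find(u), self.find(v)
        if root_u != root_v:
            # Merge, keeping the smaller ID as the component ID.
            low, high = min(root_u, root_v), max(root_u, root_v)
            self.parent[high] = low

    def components(self) -> Dict[Hashable, Set[Hashable]]:
        groups: Dict[Hashable, Set[Hashable]] = {}
        for v in self.parent:
            groups.setdefault(self.find(v), set()).add(v)
        return groups


cc = StreamConnectedComponents()
for edge in [(1, 2), (3, 4), (2, 3), (5, 6)]:
    cc.add_edge(*edge)
result = cc.components()
```

After the four events, vertices 1 to 4 form the component with ID 1 and vertices 5 and 6 form the component with ID 5. The state grows with the number of vertices, not the number of edges, which is what makes the approach suitable for an unbounded edge stream.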


Distributed Stream Connected Components


Streaming Graph Processing [Current Work]

● We're working with Gelly-Streams on
– Preserving natural properties in large scale real-world evolving graphs
– Joining multiple streams to detect graph causality / bipartiteness
– Efficient graph partitioning mechanisms to integrate with popular data stores like Cassandra and HDFS
– Producing a platform to benchmark NP-complete (NPC) problems in real-world graphs


Discussion !!


Thank you !!

sameera1@mail.usf.edu
