Reactconf 2014 - Event Stream Processing

Upload: andy-piper

Post on 02-Jul-2015

DESCRIPTION

Presentation from reactconf 2014 in San Francisco. Covers Event Stream Processing, some of the theory behind it, and some implementation details in both local and distributed contexts. Also covers some Big Data technologies.

TRANSCRIPT

Page 1: Reactconf 2014 - Event Stream Processing

Dr Andy Piper

Push Technology

Reappt, a Push Technology product, offers the enterprise-grade Diffusion technology as a service.

{MAKING SENSE OF THE FIRE-HOSE IN REAL-TIME}

{EVENT PROCESSING}

Page 2: Reactconf 2014 - Event Stream Processing

Copyright Push Technology 2014

Time @cobbscomedyclub

The past, the present and the future walked into a bar

It was tense

Page 3: Reactconf 2014 - Event Stream Processing

Will the real Andy Piper please stand up?

– CTO at Push Technology

– Ex-BEA/Oracle

– Spring contributor (Spring DM) and Author

– Standards contributor – OMG, JCP etc

– PhD, Cambridge, Distributed Systems

Page 4: Reactconf 2014 - Event Stream Processing

Agenda in 140 characters

What is it - What not? Why? History. Measure infinity. Windows. Queries. Going fast – reliably, distributed, distributed and fast and big

Page 5: Reactconf 2014 - Event Stream Processing

What is Event Stream Processing?

Page 6: Reactconf 2014 - Event Stream Processing

What is Event Stream Processing?

• It’s not stream processing

– Typically focused on local parallelism

– I have opinions but they get me in trouble

Page 7: Reactconf 2014 - Event Stream Processing

What is Event Stream Processing?

• Not event passing

– Event exchange not processing, e.g. JMS

– Stateless

Page 8: Reactconf 2014 - Event Stream Processing

What is Event Stream Processing?

• Not event mediation (brokering)

– Filtering, routing, and enrichment, e.g. ESB

– Stateless

Page 9: Reactconf 2014 - Event Stream Processing

What is Event Stream Processing?

“Event Stream Processing deals with the task of processing streams of event data with the goal of identifying the meaningful pattern within those streams” – Wikipedia

Page 10: Reactconf 2014 - Event Stream Processing

What is Event Stream Processing?

• ESP is about querying data streams

– Looking for something

– Haystack won’t stay still!

– Answers depend on multiple events

– Extremely stateful

Where the interesting questions are!

Page 11: Reactconf 2014 - Event Stream Processing

Meta-analogy

“Producing thrust with a scramjet has been compared to lighting a match in a hurricane and keeping it burning” - NASA

“Event stream processing is like looking for a needle in a haystack in a hurricane”

Page 12: Reactconf 2014 - Event Stream Processing

It’s like an inverted database

RDBMS: a query is run against the data

• Data is ‘static’

• Queries are ‘dynamic’

CEP: an event is run against the queries

• Data is ‘dynamic’

• Queries are ‘static’

Page 13: Reactconf 2014 - Event Stream Processing

Why bother?

• Too much data

• Time is integral to the questions

• Data is moving too fast

• Databases assume static datasets

Page 14: Reactconf 2014 - Event Stream Processing

History – Two schools of thought

• Take a database and make it time-driven

• Take a logic approach and add time constraints

Page 15: Reactconf 2014 - Event Stream Processing

Stream Processing History

• Tapestry – ’92

– Early inverted database (not Apache!)

• Materialized views – ’95

– [A. Gupta and I. S. Mumick. “Maintenance of materialized views: Problems, techniques, and applications.” 1995]

• David Luckham coined the term CEP – “The Power of Events”. 2001

– Logic-based CEP

– Company acquired by Avaya

• Michael Franklin

– Dataflow processing in PostgreSQL

– [“TelegraphCQ: continuous dataflow processing.” 2003]

• Aurora – ’03

– [Cherniack et al – “Scalable distributed stream processing.” 2003]

• STREAM – ’03

– [Arasu et al – “STREAM: The Stanford Stream Data Manager.” 2003]

• Borealis – ’05

– [Abadi et al – “The design of the Borealis stream processing engine.” 2005]

Page 16: Reactconf 2014 - Event Stream Processing

Some definitions

• Tuple – a multi-set of elements (e1, e2, …, en)

– A single tuple is a monad!

• Event or Data Stream S – any ordered pair (s, Δ), where

– s is an unbounded sequence of tuples and

– Δ is an unbounded sequence of positive real time intervals

– s and Δ are of equal length

• Event stream processing transforms event streams into new event streams through queries

• Outputs and inputs are continuous

– Operators are continuous queries
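
The definitions above can be sketched in Python. This is a minimal illustration only, assuming a stream is modelled as a lazy sequence of (tuple, interval) pairs; the names `Event`, `ticks` and `above` are hypothetical and do not come from the talk.

```python
from itertools import count, islice
from typing import Iterator, NamedTuple


class Event(NamedTuple):
    data: tuple      # one tuple from the sequence s
    delta: float     # positive interval since the previous event (from Δ)


def ticks(start_price: float = 100.0) -> Iterator[Event]:
    """Unbounded stream: a price that rises by 1 every 0.5 time units."""
    for i in count():
        yield Event(("ACME", start_price + i), 0.5)


def above(stream: Iterator[Event], threshold: float) -> Iterator[Event]:
    """Continuous query: an event stream in, a new event stream out.

    Intervals of dropped events are carried over to the next emitted event,
    so the output stream keeps the timeline of the input stream.
    """
    carried = 0.0
    for event in stream:
        if event.data[1] > threshold:
            yield Event(event.data, event.delta + carried)
            carried = 0.0
        else:
            carried += event.delta


# The stream never ends, so we can only ever look at a finite prefix of it.
first_three = list(islice(above(ticks(), 102.0), 3))
```

Because both input and output are generators, operators written this way compose into longer pipelines without ever materialising the whole stream.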

Page 17: Reactconf 2014 - Event Stream Processing

How do you measure an event stream if it’s unbounded?

How do you measure infinity?

Page 18: Reactconf 2014 - Event Stream Processing

Measuring infinity

• Don’t do it

– But that is just event passing – where is the fun in that?!

• Synopses – store summary information

– Continuous average needs only a running total and an item count

• Windows – define working set

– Continuous average over last N items
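
A small Python sketch of the two approaches above, under the assumption that events are plain numbers; the class names `RunningAverage` and `WindowAverage` are illustrative only.

```python
from collections import deque


class RunningAverage:
    """Synopsis: constant state (total and count), whatever the stream length."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def update(self, value: float) -> float:
        self.total += value
        self.count += 1
        return self.total / self.count


class WindowAverage:
    """Window: the working set is the last n items only."""

    def __init__(self, n: int) -> None:
        self.items = deque(maxlen=n)   # the oldest item is evicted automatically

    def update(self, value: float) -> float:
        self.items.append(value)
        return sum(self.items) / len(self.items)


prices = [10.0, 20.0, 30.0, 40.0]
running, windowed = RunningAverage(), WindowAverage(n=2)
running_out = [running.update(p) for p in prices]     # average over everything seen
windowed_out = [windowed.update(p) for p in prices]   # average over the last 2 items
```

The synopsis never forgets old data, while the window deliberately does; which one is right depends on the question being asked.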

Page 19: Reactconf 2014 - Event Stream Processing

Measuring infinity

Page 20: Reactconf 2014 - Event Stream Processing

Types of window

• Sliding

• Jumping (batching)

• Partitioned

• Time-based

• Others
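
As a rough illustration of the first two window types, the sketch below uses one hypothetical helper, `count_windows`, whose `size` and `slide` parameters are assumptions made for this example rather than any particular engine's API.

```python
from collections import deque
from typing import Iterable, Iterator, List


def count_windows(stream: Iterable, size: int, slide: int) -> Iterator[List]:
    """Emit the last `size` items every `slide` arrivals, once the window is full."""
    window = deque(maxlen=size)
    for seen, item in enumerate(stream, start=1):
        window.append(item)
        if seen >= size and (seen - size) % slide == 0:
            yield list(window)


events = [1, 2, 3, 4, 5, 6]
sliding = list(count_windows(events, size=3, slide=1))   # overlapping windows
jumping = list(count_windows(events, size=3, slide=3))   # back-to-back batches
```

When `slide` equals `size` the windows do not overlap, which is the jumping (batching) case; a smaller `slide` gives a sliding window.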

Page 21: Reactconf 2014 - Event Stream Processing

What to do with a working set?

• Windows define the scope of interest

• Run queries against working set as it changes

– Continuous Queries

Page 22: Reactconf 2014 - Event Stream Processing

When should you run queries?

• Run queries when output is not idempotent

• When is that?

– Contents of the window change – maybe?

– Time advances – possibly?

– Depends on window and query

Linking cause and effect in an efficient manner lies at the heart of CEP and is why the answer is not simply programming
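
To make the "depends on window and query" point concrete, here is a hedged Python sketch. It naively re-evaluates on every window change and only emits when the answer differs; the `ContinuousQuery` class is hypothetical, and a real engine would try to avoid the wasted evaluations altogether.

```python
from collections import deque
from typing import Callable, List, Optional


class ContinuousQuery:
    """Re-evaluate on every window change, but emit only when the answer differs."""

    def __init__(self, size: int, query: Callable[[List[float]], float]) -> None:
        self.window = deque(maxlen=size)
        self.query = query
        self.last: Optional[float] = None
        self.evaluations = 0

    def on_event(self, value: float) -> Optional[float]:
        self.window.append(value)          # window contents changed
        self.evaluations += 1
        result = self.query(list(self.window))
        if result == self.last:
            return None                    # same answer: nothing to tell downstream
        self.last = result
        return result


highest = ContinuousQuery(size=3, query=max)
outputs = [highest.on_event(v) for v in [5.0, 3.0, 4.0, 1.0, 9.0]]
```

For a `max` query over three rows, five window changes produce only three new answers, so two of the five evaluations were unnecessary work.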

Page 23: Reactconf 2014 - Event Stream Processing

How can we define queries on windows?

• Describe queries on windows using a SQL-like syntax

• [Arasu et al. – “The CQL Continuous Query Language: Semantic Foundations and Query Execution” 2003]

SELECT AVG(price) FROM stream [ROWS N]

Page 24: Reactconf 2014 - Event Stream Processing

Querying windows

• Sliding

SELECT * FROM s [ROWS 4 SLIDE 4]

• Partitioned

SELECT a, b FROM s [PARTITION BY b ROWS 3]

• Time-based

SELECT * FROM s [RANGE 30 SECONDS]
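
The partitioned and time-based windows can be approximated in a few lines of Python. This is a sketch under assumptions, not CQL semantics: `PartitionedRows` and `TimeRange` are hypothetical names, timestamps are plain floats in seconds, and the exact boundary behaviour of a real engine may differ.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple


class PartitionedRows:
    """Roughly [PARTITION BY key ROWS n]: keep the last n rows per key."""

    def __init__(self, n: int) -> None:
        self.parts: Dict[str, Deque] = defaultdict(lambda: deque(maxlen=n))

    def insert(self, key: str, row) -> List:
        self.parts[key].append(row)
        return [r for part in self.parts.values() for r in part]


class TimeRange:
    """Roughly [RANGE t SECONDS]: keep rows newer than now - t."""

    def __init__(self, seconds: float) -> None:
        self.seconds = seconds
        self.rows: Deque[Tuple[float, object]] = deque()

    def insert(self, timestamp: float, row) -> List:
        self.rows.append((timestamp, row))
        return self.advance(timestamp)

    def advance(self, now: float) -> List:
        # Time moving forward can shrink the window without any new event.
        while self.rows and self.rows[0][0] <= now - self.seconds:
            self.rows.popleft()
        return [row for _, row in self.rows]


by_symbol = PartitionedRows(n=2)
for sym, price in [("A", 1), ("B", 10), ("A", 2), ("A", 3)]:
    partitioned = by_symbol.insert(sym, price)

recent = TimeRange(seconds=30)
recent.insert(0.0, "x")
recent.insert(20.0, "y")
at_25 = recent.advance(25.0)
at_40 = recent.advance(40.0)
at_60 = recent.advance(60.0)
```

Note that the time-based window changes when the clock advances, even if no event arrives, which is one reason the question of when to re-run a query has no single answer.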

Page 25: Reactconf 2014 - Event Stream Processing

How do you make it fast?

• In-memory is generally the only way

• Operate as a gigantic state machine and optimize like crazy

– Go reactive!

– Talk to Martin

Page 26: Reactconf 2014 - Event Stream Processing

Why must it be fast?

• Not reactive streams!

• Flow control causes causal paradox

• Stream processing must keep up

Page 27: Reactconf 2014 - Event Stream Processing

How do you make it resilient?

• Making stateful systems resilient has challenges

• State generally changing extremely quickly

Page 28: Reactconf 2014 - Event Stream Processing

Resiliency approaches

• Save all the things and replay

– But infinite data?!

– Sometimes possible because append-only

• Save all the state

– Assumes there is less of it

– State is changing rapidly

– Too rapid to be effective

Page 29: Reactconf 2014 - Event Stream Processing

Resiliency approaches

• Checkpoint and record changes elsewhere

– Maybe we can record both state and things (events)

– Many commercial systems do

• No recording - identical parallel systems

– Synchronization an issue

– Catch-up an issue
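
The checkpoint-and-record-changes approach can be sketched as follows. This is a simplified Python illustration: `CheckpointedCounter` is a hypothetical class, and the in-memory `checkpoint` and `log` attributes stand in for storage that would have to be durable in a real system.

```python
import copy
from typing import Dict, List


class CheckpointedCounter:
    """Per-key event counts, recoverable from a checkpoint plus logged events."""

    def __init__(self, checkpoint_every: int) -> None:
        self.state: Dict[str, int] = {}
        self.checkpoint_every = checkpoint_every
        self.checkpoint: Dict[str, int] = {}   # stands in for durable storage
        self.log: List[str] = []               # events seen since the checkpoint

    def on_event(self, key: str) -> None:
        self.log.append(key)                   # record the change first
        self.state[key] = self.state.get(key, 0) + 1
        if len(self.log) >= self.checkpoint_every:
            self.checkpoint = copy.deepcopy(self.state)
            self.log.clear()                   # older events are no longer needed

    def recover(self) -> Dict[str, int]:
        """Rebuild state after a crash: last checkpoint, then replay the log."""
        state = copy.deepcopy(self.checkpoint)
        for key in self.log:
            state[key] = state.get(key, 0) + 1
        return state


counter = CheckpointedCounter(checkpoint_every=3)
for k in ["a", "b", "a", "c", "a"]:
    counter.on_event(k)
live_state = dict(counter.state)
counter.state = {}                             # simulate losing in-memory state
recovered = counter.recover()
```

The checkpoint interval trades recovery time (a longer log to replay) against the cost of copying fast-changing state.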

Page 30: Reactconf 2014 - Event Stream Processing

How do you scale stream processing?

• Follow the crowd

• Distribute processing

• Multiple input sources

– If independent

– Flume

– Kafka

Page 31: Reactconf 2014 - Event Stream Processing

How do you distribute stream processing?

• DAG of event streams

– Inputs and outputs are event streams

– Nodes are operators or groups of operators

– Nodes can be distributed
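
A toy, single-process Python sketch of such a graph is shown below. The `Node` class and its `to` and `push` methods are hypothetical; in a distributed deployment `push` would be a network send rather than a method call.

```python
from typing import Callable, List, Optional


class Node:
    """One operator in the graph; its output stream feeds downstream nodes."""

    def __init__(self, fn: Callable[[object], Optional[object]]) -> None:
        self.fn = fn
        self.downstream: List["Node"] = []

    def to(self, other: "Node") -> "Node":
        self.downstream.append(other)      # add an edge self -> other
        return other

    def push(self, event) -> None:
        result = self.fn(event)
        if result is not None:             # None means the operator dropped it
            for node in self.downstream:
                node.push(result)


collected: List[int] = []

source = Node(lambda e: e)
evens = Node(lambda e: e if e % 2 == 0 else None)
doubled = Node(lambda e: e * 2)
sink = Node(lambda e: collected.append(e))   # append returns None, so the flow ends here

source.to(evens).to(sink)                  # source -> evens -> sink
source.to(doubled).to(sink)                # source -> doubled -> sink (fan-out, fan-in)

for event in [1, 2, 3]:
    source.push(event)
```

Because each node only knows its downstream edges, any node or group of nodes could in principle be moved to another machine without changing the others.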

Page 32: Reactconf 2014 - Event Stream Processing

Apache Storm

• Toolkit for creating distributed event flows

• Bolts (operators) and spouts (sources)

• Composed using a Clojure DSL

• Storm runs topologies

– Map-Reduce jobs finish – batch

– Topologies process forever – continuous

Page 33: Reactconf 2014 - Event Stream Processing

Apache Storm – a toolkit for distribution

(topology

{"1" (spout-spec twitter-feed-spout)}

{"2" (bolt-spec {"1"} filter :p "status" )}

{"3" (spout-spec database :p "retail" )}

{"4" (bolt-spec {"2"} top-n)}

{"5" (bolt-spec {"3" "4"} join :p "item" )}

...

)

Page 34: Reactconf 2014 - Event Stream Processing

How do you reliably distribute?

• State is now distributed

– Synchronization all but impossible

– Deterministic if relative order is preserved

• Depends on operators and their effect

• [L. Lamport - “Time, Clocks, and the Ordering of Events in a Distributed System.” 1978]

– In theory a replay of things through the network will recover the state

– Alternative of storing the state for all the operators is harder

Page 35: Reactconf 2014 - Event Stream Processing

How do you reliably distribute?

• Different classes of recovery

– [Hwang et al. – “High-Availability Algorithms for Distributed Stream Processing”. 2005]

• Precise recovery – failure effects hidden perfectly

• Rollback recovery – no data loss, but outputs may be duplicated

• Gap recovery – data lost during recovery

• Gap recovery – data lost during recovery

• Reliable distribution overlaps distribution

– Upstream backup, reactive streams?
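
The following Python sketch illustrates upstream backup leading to rollback recovery. It is a simplified illustration and not the algorithm from the cited paper; the `Upstream` class, the sequence numbers and the upper-casing "operator" are all assumptions made for the example.

```python
from typing import List, Tuple


class Upstream:
    """Keeps every output until the downstream node acknowledges it."""

    def __init__(self) -> None:
        self.buffer: List[Tuple[int, str]] = []
        self.next_seq = 0

    def send(self, payload: str) -> Tuple[int, str]:
        message = (self.next_seq, payload)
        self.buffer.append(message)
        self.next_seq += 1
        return message

    def ack(self, seq: int) -> None:
        # Downstream has safely processed everything up to seq; trim the backup.
        self.buffer = [m for m in self.buffer if m[0] > seq]

    def replay(self) -> List[Tuple[int, str]]:
        return list(self.buffer)


upstream = Upstream()
delivered = [upstream.send(p) for p in ["a", "b", "c"]]
upstream.ack(0)                                            # checkpoint after message 0
outputs = [payload.upper() for _, payload in delivered]    # emitted before the crash

# Downstream fails and restarts from its checkpoint: unacknowledged input is replayed.
replayed = upstream.replay()
outputs += [payload.upper() for _, payload in replayed]
```

No input is lost, but `B` and `C` are emitted twice, which is the rollback-recovery trade-off. Filtering replayed outputs by sequence number would be one way to move closer to precise recovery.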

Page 36: Reactconf 2014 - Event Stream Processing

Reactive stream processing

• Message/event driven

• Discussed resiliency

• Continuous queries == responsive

– Push towards on-line queries

• Elasticity – harder

Page 37: Reactconf 2014 - Event Stream Processing

Stream Processing with Data

• Time dimension to data problems

• Data dimension to stream problems

• JOIN streams to tables

• Easy when small

• Large datasets harder

– Cache join data in memory?

– Push query into datastore?
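
As an illustration of the "cache join data in memory" option, here is a minimal Python sketch. The `PRODUCTS` table, the `product_name` lookup and the order events are all hypothetical; a real lookup would query a datastore.

```python
from functools import lru_cache
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical reference table; in practice this would live in a datastore.
PRODUCTS: Dict[str, str] = {"p1": "kettle", "p2": "toaster"}


@lru_cache(maxsize=1024)
def product_name(product_id: str) -> str:
    """Table lookup, cached in memory so hot keys skip the datastore."""
    return PRODUCTS.get(product_id, "unknown")


def join(orders: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, str, int]]:
    """Enrich each (product_id, quantity) event with the product name."""
    for product_id, quantity in orders:
        yield product_id, product_name(product_id), quantity


joined = list(join([("p1", 2), ("p2", 1), ("p1", 5)]))
cache_stats = product_name.cache_info()    # hits and misses of the in-memory cache
```

The cache keeps the join fast, at the price of possibly serving stale rows if the table changes; a bounded `maxsize` only helps when the set of hot keys is small relative to the table.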

Page 38: Reactconf 2014 - Event Stream Processing

Stream Processing with Big Data

• Time dimension to Big Data problems

– Velocity (vvv) implies stream processing

• Large dataset problem domain

• But now the data is distributed!

Page 39: Reactconf 2014 - Event Stream Processing

Shortcomings of Big Data

Page 40: Reactconf 2014 - Event Stream Processing

Fast Data Architecture

Page 41: Reactconf 2014 - Event Stream Processing

Fast Data Architecture

• Similar to recoverable architectures:

– Snapshot (queries) + incremental updates

– Current state = known state + changes

– Requires static queries - cached results

• Spark does this quite well
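
The "current state = known state + changes" idea can be sketched in Python as below. This is an assumption-laden illustration rather than how Spark works: `FastView` is a hypothetical class and the static query is fixed to a count per key.

```python
from collections import Counter


class FastView:
    """Answer a fixed query (count per key) from a snapshot plus recent changes."""

    def __init__(self) -> None:
        self.snapshot: Counter = Counter()   # cached result of the last batch run
        self.delta: Counter = Counter()      # events since that run

    def on_event(self, key: str) -> None:
        self.delta[key] += 1

    def load_snapshot(self, batch_result: Counter) -> None:
        # Assumes the batch run covered every event seen so far.
        self.snapshot = batch_result
        self.delta = Counter()

    def query(self, key: str) -> int:
        return self.snapshot[key] + self.delta[key]


view = FastView()
history = ["a", "b", "a"]
for k in history:
    view.on_event(k)
before = view.query("a")
view.load_snapshot(Counter(history))         # batch job recomputes from history
view.on_event("a")
after = view.query("a")
```

The query has to be known in advance so that its result can be cached and then adjusted incrementally, which is why the slide calls for static queries.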

Page 42: Reactconf 2014 - Event Stream Processing

Fast data technology

• Storm – topology deployment

• Spark – logic queries on RDDs

• Spark streaming

– Repeating snapshots / micro-batch

– Fast data-ish

• Flume – fast ingest of log data

• Kafka – pub-sub messaging as distributed commit log

• Hadoop streaming

– Create M-R jobs using executable scripts

• Hive

• Cloudera Impala – MPP SQL query engine on top of Hadoop

Page 43: Reactconf 2014 - Event Stream Processing

Summary

Page 44: Reactconf 2014 - Event Stream Processing

Future Stream Processing

• Ease-of-use

– CQL or Graphical - both have drawbacks

– Queries get really complicated really quickly

• Ease-of-use + distribution

– Real systems challenge

• Fast data architectures

• Real-time machine learning

– Spark ML Library

– Hadoop Mahout

• Interactive streaming queries – declarative and caching

– Hive and Spark

Page 45: Reactconf 2014 - Event Stream Processing
