Apache Storm vs. Spark Streaming – Two Stream Processing Platforms Compared
Post on 12-Jul-2015
2014 © Trivadis
BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH STUTTGART VIENNA
Apache Storm vs. Spark Streaming – Two Stream Processing Platforms compared DBTA Workshop on Stream Processing Berne, 3.12.2014
Guido Schmutz
3rd December 2014 Apache Storm vs. Spark Streaming – Two Stream Processing Platforms compared
Guido Schmutz
§ Working for Trivadis for more than 18 years
§ Oracle ACE Director for Fusion Middleware and SOA
§ Co-author of different books
§ Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
§ Member of Trivadis Architecture Board
§ Technology Manager @ Trivadis
§ More than 25 years of software development experience
§ Contact: guido.schmutz@trivadis.com
§ Blog: http://guidoschmutz.wordpress.com
§ Twitter: gschmutz
Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on and technologies in Switzerland, Germany and Austria.
We offer our services in the following strategic business fields: Trivadis Services takes over the interacting operation of your IT systems.
Our company
Agenda
1. Introduction
2. Apache Storm
3. Apache Spark (Streaming)
4. Unified Log
5. Stream Processing Architectures
What is Stream Processing?
Infrastructure for continuous data processing
Computational model can be as general as MapReduce but with the ability to produce low-latency results
Data collected continuously is naturally processed continuously
aka. Event Processing / Complex Event Processing (CEP)
August 2014 Einheitlicher Umgang mit Ereignisströmen - Unified Log Processing Architecture
Why Stream Processing?
Response latency
• RPC: synchronous
• Stream Processing: milliseconds to minutes
• Batch: later. Possibly much later.
How to design a Stream Processing System?
[Figure: three design options. (1) Event stream → collecting/processing → result. (2) Event stream → collecting → processing → result. (3) Event stream → collecting → queue (persist) → processing → result.]
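The third design option (collecting and processing decoupled by a queue) can be sketched in plain Python. This is a minimal illustration, not taken from the slides: an in-memory `queue.Queue` stands in for the persistent queue, and the names `collect`, `process`, `run_pipeline` and `SENTINEL` are hypothetical.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-stream marker, not part of any framework


def collect(events, q):
    """Collector: push every incoming event onto the queue."""
    for event in events:
        q.put(event)
    q.put(SENTINEL)


def process(q, results):
    """Processor: take events off the queue and emit one result per event."""
    while True:
        event = q.get()
        if event is SENTINEL:
            break
        results.append(event.upper())


def run_pipeline(events):
    # In-memory queue as a stand-in for "Queue (Persist)" on the slide.
    q = queue.Queue()
    results = []
    worker = threading.Thread(target=process, args=(q, results))
    worker.start()
    collect(events, q)
    worker.join()
    return results


pipeline_results = run_pipeline(["login", "click", "logout"])
```

Because the collector only talks to the queue, it can keep accepting events even when the processor is temporarily slower, which is the point of the decoupled design.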
How to scale a Stream Processing System?
[Figure: scaling with threads. Collecting threads 1..n write events from the event stream to one persistent queue; processing threads 1..n read events from the queue and emit results.]
How to scale a Stream Processing System?
[Figure: scaling with processes. Several collecting processes write events to persistent queues 1..n; each processing process reads from its own queue and emits results.]
How to scale a Stream Processing System?
[Figure: scaling with processes and threads. Collecting processes 1 and 2 feed processing stages A and B, which run as threads 1..n inside several processes; each thread has its own queue (Q1..Qn) of events.]
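One common way to spread events over several queues, as in the figures above, is to hash a key of each event. The slides do not say how events are assigned to queues, so the key-based routing below is an assumption, and `partition_for` and `route` are hypothetical helper names.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_queues):
    """Stable hash, so the same key always lands in the same queue."""
    return zlib.crc32(key.encode("utf-8")) % num_queues


def route(events, num_queues):
    """Distribute (key, payload) events over queues, keeping per-key order."""
    queues = defaultdict(list)
    for key, payload in events:
        queues[partition_for(key, num_queues)].append((key, payload))
    return queues


events = [("sensor-a", 1), ("sensor-b", 2), ("sensor-a", 3), ("sensor-c", 4)]
queues = route(events, num_queues=3)
```

Each queue can then be consumed by its own processing thread or process without coordination, since all events for one key stay together.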
How to make a (stateful) Stream Processing System reliable?
Faults and stragglers are inevitable in large clusters running big data applications
Streaming applications must recover from them quickly
[Figure: the event stream flowing through collecting processes and processing stages A and B, which run as processes and threads with per-thread queues.]
How to make a (stateful) Stream Processing System reliable?
Solution 1: using active/passive system (hot replication)
• Both systems process the full load
• In case of a failure, automatically switch and use the “passive” system
• Stragglers slow down both the active and the passive system
State = state held in-memory and/or on-disk
[Figure: an active and a passive copy of the collecting/processing topology, each with its own state.]
How to make a (stateful) Stream Processing System reliable?
Solution 2: Upstream backup
• Nodes buffer sent messages and replay them to a new node in case of failure
• Stragglers are treated as failures
State = state held in-memory and/or on-disk
Buffer = buffer for replay, held in-memory and/or on-disk
[Figure: a single collecting/processing topology with state; nodes keep buffers of sent messages for replay.]
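Upstream backup can be illustrated with a small simulation. This is a sketch under assumptions, not the mechanism of any particular framework: the downstream node is assumed to checkpoint its state to durable storage and acknowledge the messages covered by that checkpoint; the classes `UpstreamNode` and `CountingNode` are hypothetical.

```python
class UpstreamNode:
    """Keeps every sent message until the downstream node acknowledges it."""

    def __init__(self):
        self.buffer = []  # sent but not yet acknowledged messages

    def send(self, msg, downstream):
        self.buffer.append(msg)
        downstream.receive(msg)

    def ack_through(self, count):
        # The first `count` buffered messages are covered by a checkpoint.
        del self.buffer[:count]

    def replay(self, downstream):
        for msg in self.buffer:
            downstream.receive(msg)


class CountingNode:
    """Stateful downstream node: keeps a running total in memory."""

    def __init__(self):
        self.total = 0

    def receive(self, msg):
        self.total += msg


upstream = UpstreamNode()
node = CountingNode()
for msg in [1, 2]:
    upstream.send(msg, node)
checkpoint = node.total      # assumed to be written to durable storage
upstream.ack_through(2)      # messages 1 and 2 no longer need to be replayed
for msg in [3, 4]:
    upstream.send(msg, node)

# The node fails here and its in-memory state is lost.
replacement = CountingNode()
replacement.total = checkpoint   # restore the last checkpoint
upstream.replay(replacement)     # replay everything sent after the checkpoint
recovered_total = replacement.total
```

The replacement node ends up with the same total the failed node had, at the price of keeping a replay buffer upstream.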
Processing Models
Batch Processing
• Familiar concept of processing data en masse
• Generally incurs high latency
(Event-) Stream Processing
• A one-at-a-time processing model
• A datum is processed as it arrives
• Sub-second latency
• Difficult to process state data efficiently
Micro-Batching
• A special case of batch processing with very small (tiny) batch sizes
• A nice mix between batching and streaming
• Comes at the cost of latency
• Gives stateful computation, making windowing an easy task
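The difference between one-at-a-time processing and micro-batching can be shown with a few lines of plain Python. The sketch assumes events carry a timestamp in seconds; `micro_batches` is a hypothetical helper, and empty intervals are simply skipped here.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = {}
    for timestamp, value in events:
        batch_id = int(timestamp // batch_interval)
        batches.setdefault(batch_id, []).append(value)
    return [batches[b] for b in sorted(batches)]


events = [(0.1, "a"), (0.4, "b"), (0.6, "c"), (1.2, "d")]

# Event-stream processing: each datum is handled on its own as it arrives.
one_at_a_time = [[value] for _, value in events]

# Micro-batching: events are collected for half a second, then handled together.
batched = micro_batches(events, batch_interval=0.5)
```

An event arriving at the start of an interval waits until the interval closes, which is where the additional latency of micro-batching comes from.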
Message Delivery Semantics
At most once [0,1]
• Messages may be lost
• Messages are never redelivered
At least once [1 .. n]
• Messages are never lost
• Messages may be redelivered (might be OK if the consumer can handle it)
Exactly once [1]
• Messages are never lost
• Messages are never redelivered
• Perfect message delivery
• Incurs higher latency for transactional semantics
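What "the consumer can handle it" means for at-least-once delivery can be sketched as follows. The example assumes every message carries a unique id, which the slides do not state; `deliver_at_least_once` and `IdempotentCounter` are hypothetical names.

```python
def deliver_at_least_once(messages, lost_acks):
    """Yield every message; redeliver those whose acknowledgement was lost."""
    for msg_id, payload in messages:
        yield msg_id, payload
        if msg_id in lost_acks:
            yield msg_id, payload  # redelivery after a missed acknowledgement


class IdempotentCounter:
    """Consumer that ignores message ids it has already processed."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def handle(self, msg_id, payload):
        if msg_id in self.seen:
            return  # duplicate: already counted
        self.seen.add(msg_id)
        self.total += payload


messages = [(1, 10), (2, 20), (3, 30)]

# A naive consumer counts the redelivered message twice.
naive_total = sum(payload for _, payload in deliver_at_least_once(messages, {2}))

# An idempotent consumer gets the same result as with exactly-once delivery.
counter = IdempotentCounter()
for msg_id, payload in deliver_at_least_once(messages, {2}):
    counter.handle(msg_id, payload)
```

The deduplication state (`seen`) is itself state that has to survive failures, which is one reason exactly-once behaviour costs extra latency.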
Requirements dictate the choice
Latency
• Is the performance of the streaming application paramount?
Development Cost
• Is it desired to have similar code bases for batch and stream processing (=> lambda architecture)?
Message Delivery Guarantees
• Is it highly important to process every single record, or is some amount of data loss acceptable?
Process Fault Tolerance
• Is high availability of primary concern?
Agenda
1. Introduction
2. Apache Storm
3. Apache Spark (Streaming)
4. Unified Log
5. Stream Processing Architectures
Apache Storm
A platform for doing analysis on streams of data as they come in, so you can react to data as it happens.
• A highly distributed real-time computation system
• Provides general primitives to do real-time computation
• To simplify working with queues & workers
• scalable and fault-tolerant
• complementary to Hadoop
• Written in Clojure, supports Java, Clojure
• Originated at Backtype, acquired by Twitter in 2011
• Open Sourced late 2011
• Part of Apache Incubator since September 2013
Apache Storm – Core concepts
Tuple
• Core data structure in Storm
• Immutable set of key/value pairs
• You can think of Storm tuples as events
• Values must be serializable
Stream
• Key abstraction of Storm
• An unbounded sequence of tuples that can be processed in parallel by Storm
• Each stream is given an ID, and bolts can produce and consume tuples from these streams on the basis of their ID
• Each stream also has an associated schema for the tuples that will flow through it
[Figure: a stream drawn as a sequence of tuples (T).]
Apache Storm – Core concepts
Topology
• Wires data and functions via a DAG (directed acyclic graph)
• Executes on many machines, similar to an MR job in Hadoop
Spout
• Source of data streams (tuples)
• Can be run in “reliable” and “unreliable” mode
Bolt
• Consumes 1+ streams and potentially produces new streams
• Complex operations often require multiple steps and thus multiple bolts
• Calculate, filter, aggregate, join, talk to database
[Figure: example topology with two spouts (one labelled as the source of stream B) and four bolts: one subscribes to A and emits C, one subscribes to A and emits D, one subscribes to A & B and emits nothing, one subscribes to C & D and emits nothing.]
Storm – How does it work?
August 2014 CAS Big Data - FH Bern | Stream- and Event-Processing | Processing Event Streams - Apache Storm
[Figure, built up over three slides: the TwitterSpout emits tweets such as “NFL: Peyton Manning and Denver’s elite offense fall flat in #Superbowl XLVIII ow.ly/tdQZn #seahawks #broncos” and “… #Superbowl”. Two Split Sentence bolts split each tweet into words (NFL, Peyton, Manning, Superbowl, ...). Two WordCount bolts increment a counter per word (INCR NFL, INCR Peyton, INCR Manning, INCR Superbowl), giving NFL = 1, Peyton = 1, Manning = 1, Superbowl = 2. A Report bolt receives the final counts.]
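The word-count flow from the figure can be imitated in plain Python. This is not the Storm API, only a single-process sketch of the data flow; the function and class names mirror the labels on the slide, and the hash used to pick a WordCount instance is a stand-in for a fields grouping.

```python
import re
from collections import Counter


def twitter_spout():
    """Stand-in for the TwitterSpout: emits the two example tweets."""
    yield "NFL: Peyton Manning and Denver's elite offense fall flat in #Superbowl XLVIII"
    yield "... #Superbowl"


def split_sentence(tweet):
    """Split Sentence bolt: one output tuple per word."""
    return re.findall(r"[A-Za-z']+", tweet)


class WordCount:
    """WordCount bolt: keeps a counter per word it receives."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, word):
        self.counts[word] += 1


def run_topology():
    counters = [WordCount(), WordCount()]
    for tweet in twitter_spout():
        for word in split_sentence(tweet):
            # Equal words always go to the same WordCount instance.
            task = sum(word.encode("utf-8")) % len(counters)
            counters[task].execute(word)
    # Report bolt: merge the partial counts of all WordCount instances.
    report = Counter()
    for counter in counters:
        report.update(counter.counts)
    return report


report = run_topology()
```

Because every occurrence of a word reaches the same instance, the per-instance counts never overlap and can simply be merged for the report.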
Storm - Topology
Each spout or bolt runs as N instances in parallel
[Figure: the TwitterSpout feeds two Split Sentence bolts (shuffle grouping), which feed two WordCount bolts (fields grouping), which feed a Report bolt (global grouping).]
• Shuffle grouping: tuples are distributed randomly across the tasks
• Fields grouping: tuples are grouped by value, so that equal values go to the same task
• All grouping: every tuple is replicated to all tasks
• Global grouping: all tuples go to one task
• None grouping: makes the bolt run in the same thread as the bolt/spout it subscribes to
• Direct grouping: the producer (the task that emits) controls which consumer will receive the tuple
• Local or shuffle grouping: similar to shuffle grouping, but shuffles tuples among bolt tasks running in the same worker process, if any; otherwise falls back to shuffle grouping behavior
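The effect of the main groupings can be expressed as functions that choose the target task(s) for a tuple. These are illustrative stand-ins, not Storm code: tuples are modelled as dictionaries, the hash is an arbitrary choice, and all function names are hypothetical.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng):
    """Pick a random task."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, num_tasks, field):
    """Pick the task from a hash of one field: equal value, equal task."""
    digest = zlib.crc32(str(tup[field]).encode("utf-8"))
    return [digest % num_tasks]


def all_grouping(tup, num_tasks):
    """Replicate the tuple to every task."""
    return list(range(num_tasks))


def global_grouping(tup, num_tasks):
    """Send everything to a single task (here: the one with the lowest id)."""
    return [0]


rng = random.Random(42)
word = {"word": "Superbowl"}
shuffle_target = shuffle_grouping(word, 4, rng)
fields_target_1 = fields_grouping(word, 4, "word")
fields_target_2 = fields_grouping({"word": "Superbowl"}, 4, "word")
```

A word count needs the fields grouping in front of the counting bolt: with a shuffle grouping, the occurrences of one word would be spread over several counters.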
Storm - Creating Topology
Using a NoSQL database for storing results (keeping state with counter type columns)
[Figure: the Twitter stream is read by the TwitterSpout; two HashtagSplitter bolts extract the hashtags (seahawks, broncos, superbowl) from tweets such as “NFL: Peyton Manning and Denver’s elite offense fall flat in #SuperBowl XLVIII ow.ly/tdQZn #seahawks #broncos” and “… #Superbowl”; two HashtagCounter bolts issue INCR operations against counter columns, giving seahawks = 1, broncos = 1, superbowl = 2.]
Storm Trident
High-Level abstraction on top of storm
Simplifies building topologies
Core data model is the stream
• Processed as a series of batches (micro-batches)
• Stream is partitioned among nodes in cluster
5 kinds of operations in Trident
• Operations that apply locally to each partition and cause no network transfer
• Repartitioning operations that don't change the contents
• Aggregation operations that do network transfer
• Operations on grouped streams
• Merges and joins
Storm Trident - Creating Topology
[Figure: Trident topology. Twitter Stream → TwitterSpout → tweet → HashtagSplitter → hashtag → HashtagNormalizer → hashtag → groupBy → Persistent Aggregate. Further labels in the figure: local, Bolt, Bolt.]
Trident Concepts - Function
• Takes in a set of input fields and emits zero or more tuples as output
• Fields of the output tuple are appended to the original input tuple in the stream
• If a function emits no tuples, the original input tuple is filtered out
• Otherwise the input tuple is duplicated for each output tuple
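The function semantics listed above can be mirrored in plain Python. This is not the Trident API: tuples are modelled as dictionaries, `apply_function` is a hypothetical helper, and `hashtag_splitter` only borrows its name from the HashtagSplitter shown in the topology figure.

```python
def apply_function(stream, input_field, function, output_field):
    """Apply `function` to one input field of every tuple in the stream."""
    out = []
    for tup in stream:
        for value in function(tup[input_field]):
            new_tup = dict(tup)            # keep the original fields
            new_tup[output_field] = value  # append the emitted field
            out.append(new_tup)
        # No emitted values: the input tuple does not appear in the output.
    return out


def hashtag_splitter(text):
    """Emit one value per hashtag found in the text (possibly none)."""
    return [w[1:] for w in text.split() if w.startswith("#") and len(w) > 1]


tweets = [
    {"tweet": "fall flat in #SuperBowl XLVIII #seahawks"},
    {"tweet": "no tags here"},
]
hashtags = apply_function(tweets, "tweet", hashtag_splitter, "hashtag")
```

The first tweet is duplicated once per hashtag, and the second tweet is filtered out because the function emits nothing for it.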
Storm Core vs. Storm Trident
Criterion | Core Storm | Storm Trident
Community | > 100 contributors | > 100 contributors
Adoption | *** | *
Language Options | Java, Clojure, Scala, Python, Ruby, … | Java, Clojure, Scala
Processing Models | Event-Streaming | Micro-Batching
Processing DSL | No | Yes
Stateful Ops | No | Yes
Distributed RPC | Yes | Yes
Delivery Guarantees | At most once / At least once | Exactly Once
Latency | sub-second | seconds
Platform | Storm Cluster, YARN | Storm Cluster, YARN
Agenda
1. Introduction
2. Apache Storm
3. Apache Spark (Streaming)
4. Unified Log
5. Stream Processing Architectures
Apache Spark
Apache Spark is a fast and general engine for large-scale data processing
• The hot trend in Big Data!
• Based on 2007 Microsoft Dryad paper
• Written in Scala, supports Java, Python, SQL and R
• Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• Runs everywhere – runs on Hadoop, Mesos, standalone or in the cloud
• One of the largest OSS communities in big data with over 200 contributors in 50+ organizations
• Originally developed in 2009 at UC Berkeley’s AMPLab
• Open sourced in 2010 – since 2014 part of the Apache Software Foundation
Apache Spark
Spark Core
• General execution engine for the Spark platform
• In-memory computing capabilities deliver speed
• General execution model supports wide variety of use cases
• DAG-based
• Ease of development – native APIs in Java, Scala and Python
Spark Streaming
• Run a streaming computation as a series of very small, deterministic batch jobs
• Batch size as low as ½ sec, latency of about 1 sec
• Exactly-once semantics
• Potential for combining batch and streaming processing in same system
• Started in 2012, first alpha release in 2013
Apache Spark - Generality
[Figure: the Spark stack]
• Libraries: Spark SQL (Batch Processing), BlinkDB (Approximate Querying), Spark Streaming (Real-Time), MLlib and SparkR (Machine Learning), GraphX (Graph Processing)
• Core Runtime: Spark Core API and Execution Model
• Cluster Resource Managers: Spark Standalone, Mesos, YARN
• Data Stores: HDFS, Elasticsearch, Cassandra, S3 / DynamoDB
Adapted from C. Fregly: http://slidesha.re/11PP7FV
Apache Spark – Core concepts
Resilient Distributed Dataset (RDD)
• Core Spark abstraction
• Collections of objects (partitions) spread across cluster
• Partitions can be stored in-memory or on-disk (local)
• Enables parallel processing on data sets
• Built through parallel transformations
• Immutable, recomputable, fault tolerant
• Contains transformation history (“lineage”) for whole data set
Operations
• Stateless transformations (map, filter, groupBy)
• Actions (count, collect, save)
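The split between lazy transformations and actions, and the role of the lineage, can be illustrated with a toy class. `MiniRDD` is a hypothetical, single-machine stand-in and not Spark's RDD: it has no partitions and no cluster, it only records transformations and replays them when an action is called.

```python
class MiniRDD:
    """Immutable dataset that records its transformations instead of running them."""

    def __init__(self, source, lineage=()):
        self._source = source          # data of the root dataset
        self.lineage = tuple(lineage)  # recorded (name, function) steps

    def map(self, fn):
        return MiniRDD(self._source, self.lineage + (("map", fn),))

    def filter(self, fn):
        return MiniRDD(self._source, self.lineage + (("filter", fn),))

    def collect(self):
        """Action: replay the whole lineage over the source data."""
        data = list(self._source)
        for name, fn in self.lineage:
            if name == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    def count(self):
        """Action: number of elements after applying the lineage."""
        return len(self.collect())


calls = []


def square(x):
    calls.append(x)  # record when the function actually runs
    return x * x


evens_squared = MiniRDD([1, 2, 3, 4]).filter(lambda x: x % 2 == 0).map(square)
calls_before_action = list(calls)   # still empty: nothing has been computed
result = evens_squared.collect()    # the action triggers the computation
```

Since the result can always be recomputed from the source and the recorded steps, a lost result does not need to be replicated, only recomputed, which is the idea behind lineage-based fault tolerance.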
RDD Lineage Example
[Figure: RDD lineage. HDFS File Input 1 → SparkContext.hadoopFile() → HadoopRDD → filter() → FilteredRDD → map() → MappedRDD. HDFS File Input 2 → SparkContext.hadoopFile() → HadoopRDD → map() → MappedRDD. Both MappedRDDs → join() → ShuffledRDD → saveAsHadoopFile() → HDFS File Output. The transformations are lazy; the action executes the transformations.]
Adapted from Chris Fregly: http://slidesha.re/11PP7FV
RDD Execution Example
[Figure: RDD execution. The partitions (1, 2, ..., 5) of several FileRDDs, a MappedRDD and several ShuffledRDDs are connected by filter(), map(), groupByKey() and join() operations.]
Apache Spark Streaming – Core concepts
Discretized Stream (DStream)
• Core Spark Streaming abstraction
• Micro-batches of RDDs
• Operations similar to RDD
Input DStreams
• Represent the stream of raw data received from streaming sources
• Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP Socket, Akka actors, etc.
• Custom sources can easily be written for custom data sources
Operations
• Same as Spark Core
• Additional stateful transformations (window, reduceByWindow)
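A windowed operation over micro-batches can be sketched without Spark. The example assumes the window slides by one batch at a time and is measured in batches rather than seconds; `windowed_counts` is a hypothetical helper, not a Spark Streaming call.

```python
from collections import Counter, deque


def windowed_counts(batches, window_length):
    """For each micro-batch, count words over the last `window_length` batches."""
    window = deque(maxlen=window_length)  # oldest batch drops out automatically
    results = []
    for batch in batches:
        window.append(Counter(batch))
        total = Counter()
        for counts in window:
            total.update(counts)
        results.append(dict(total))
    return results


# Each inner list is the content of one micro-batch (one RDD of the DStream).
batches = [["a", "b"], ["a"], ["c"], []]
windowed = windowed_counts(batches, window_length=2)
```

The computation has to remember earlier batches, which is why windowing is listed as a stateful transformation.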
Discretized Stream (DStream)
[Figure: a DStream over increasing time (time 1, time 2, time 3, ..., time n). The messages received from the input stream in each interval form one RDD of the DStream (message 1 ... message n). map() produces a MappedDStream whose RDDs hold f(message 1) ... f(message n). saveAsHadoopFiles() writes result 1 ... result n for each interval. The DStream transformations form the lineage; actions trigger Spark jobs.]
Adapted from Chris Fregly: http://slidesha.re/11PP7FV
Spark Streaming Example
Storm Core vs. Storm Trident vs. Spark Streaming
Criterion | Core Storm | Storm Trident | Spark Streaming
Community | > 100 contributors | > 100 contributors | > 280 contributors
Adoption | *** | * | *
Language Options | Java, Clojure, Scala, Python, Ruby, … | Java, Clojure, Scala | Java, Scala, Python (coming)
Processing Models | Event-Streaming | Micro-Batching | Micro-Batching, Batch (Spark Core)
Processing DSL | No | Yes | Yes
Stateful Ops | No | Yes | Yes
Distributed RPC | Yes | Yes | No
Delivery Guarantees | At most once / At least once | Exactly Once | Exactly Once
Latency | sub-second | seconds | seconds
Platform | Storm Cluster, YARN | Storm Cluster, YARN | YARN, Mesos, Standalone, DataStax EE
Agenda
1. Introduction
2. Apache Storm
3. Apache Spark (Streaming)
4. Unified Log
5. Stream Processing Architectures
Unified Log
That’s what most people think about logs
137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-admin/images/date-button.gif HTTP/1.1" 200 111
137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver=349-20805 HTTP/1.1" 200 13593
137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-includes/js/tinymce/wp-tinymce.php?c=1&ver=349-20805 HTTP/1.1" 200 101114
137.229.78.245 - - [02/Jul/2012:13:22:28 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30747
137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "POST /wp-admin/post.php HTTP/1.1" 302 -
137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "GET /wp-admin/post.php?post=387&action=edit&message=1 HTTP/1.1" 200 73160
137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/css/editor.css?ver=3.4.1 HTTP/1.1" 304 -
137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver=349-20805 HTTP/1.1" 304 -
137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30809
But this is what we mean here by Log
• a structured log (records are numbered beginning with 0 based on order they are written)
• aka. commit log or journal
[Figure: a log drawn as a sequence of records numbered 0 to 11; record 0 is the 1st record, and the next record is written at the end.]
Central Unified Log for (real-time) subscription
Take all the organization’s data and put it into a central log for subscription
Properties of the Unified Log:
• Unified: “Enterprise”, single deployment
• Append-Only: events are appended, no update in place => immutable
• Ordered: each event has an offset, which is unique within a shard
• Fast: should be able to handle thousands of messages / sec
• Distributed: lives on a cluster of machines
[Figure: a collector writes to the end of the log (records 0 to 11); Consumer System A reads at time = 6 and Consumer System B reads at time = 10.]
Apache Kafka - Overview
• A distributed publish-subscribe messaging system
• Designed for processing of real time activity stream data (logs, metrics collections, social media streams, …)
• Initially developed at LinkedIn, now part of Apache
• Does not follow JMS Standards and does not use JMS API
• Kafka maintains feeds of messages in topics
[Figure: producers write to the Kafka cluster and consumers read from it. Anatomy of a topic: partitions 0, 1 and 2 are each an ordered sequence of messages (numbered 0 to 12, 0 to 9 and 0 to 12, from old to new); writes are appended at the new end.]
Apache Kafka - Motivation
LinkedIn’s motivation for Kafka was:
§ “A unified platform for handling all the real-time data feeds a large company might have.”
Must haves
§ High throughput to support high volume event feeds.
§ Support real-time processing of these feeds to create new, derived feeds.
§ Support large data backlogs to handle periodic ingestion from offline systems.
§ Support low-latency delivery to handle more traditional messaging use cases.
§ Guarantee fault-tolerance in the presence of machine failures.
Apache Kafka - Performance
Kafka at LinkedIn
• 10+ billion writes per day
• 172k messages per second (average)
• 55+ billion messages per day to real-time consumers
Up to 2 million writes/sec on 3 cheap machines
§ Using 3 producers on 3 different machines
http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
Apache Kafka - Partition offsets
Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset
• Consumers track their pointers via (offset, partition, topic) tuples
[Figure: partition offsets as seen by consumer group C1.]
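How a consumer tracks its position per partition can be sketched in plain Python. This does not use a Kafka client: `Topic` and `Consumer` are hypothetical classes, and choosing the partition with a CRC32 hash of the key is an assumption made for the example, not Kafka's own partitioner.

```python
import zlib


class Topic:
    """A topic as a fixed number of append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        """Append to the partition chosen by the key; return (partition, offset)."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(value)
        return partition, len(self.partitions[partition]) - 1


class Consumer:
    """Tracks the next offset to read for every (topic, partition) pair."""

    def __init__(self):
        self.positions = {}

    def poll(self, topic, partition):
        key = (topic.name, partition)
        start = self.positions.get(key, 0)
        records = topic.partitions[partition][start:]
        self.positions[key] = start + len(records)
        return records


topic = Topic("tweets", num_partitions=3)
first = topic.send("user-1", "a")
second = topic.send("user-1", "b")

consumer = Consumer()
batch1 = consumer.poll(topic, first[0])
topic.send("user-1", "c")
batch2 = consumer.poll(topic, first[0])
```

Offsets are sequential only within one partition, so the consumer's pointer has to include the partition (and the topic) as well as the offset.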
Apache Kafka – two Options for Log Cleanup
Retaining a window of data
• Ideal for event data
• Window can be defined in time (days) or space (GBs); defaults to 1 week
Retain a complete log (log compaction)
• Ideal for keyed data
• Keeps a space-efficient complete log of changes
• Log compaction runs in the background
• Ensures that at least the last known value for each message key within the log is always retained
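Both cleanup options can be expressed as small functions over a list of records. This is a simplified sketch, not Kafka's implementation: `retain_window` and `compact` are hypothetical helpers, timestamps are plain numbers, and compaction is done in one pass rather than in the background.

```python
def retain_window(log, now, max_age):
    """Option 1: keep only the (timestamp, value) records younger than max_age."""
    return [(ts, value) for ts, value in log if now - ts <= max_age]


def compact(log):
    """Option 2: keep the latest (key, value) record per key, with its offset."""
    last_offset = {}
    for offset, (key, _) in enumerate(log):
        last_offset[key] = offset
    return [
        (offset, key, value)
        for offset, (key, value) in enumerate(log)
        if last_offset[key] == offset
    ]


event_log = [(1, "click"), (5, "view"), (9, "click")]
recent = retain_window(event_log, now=10, max_age=5)

keyed_log = [("k1", "a"), ("k2", "b"), ("k1", "c"), ("k3", "d"), ("k2", "e")]
compacted = compact(keyed_log)
```

After compaction, the surviving records keep their original offsets and order, so the log still holds the last known value for every key.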
Data Flow Graphs using Unified Log
Stream processing allows for computing feeds off of other feeds
Derived feeds are no different than original feeds they are computed off
Single deployment of “Unified Log” but logically different feeds
[Figure: data flow graph for meter readings. Processing steps: Meter Readings Collector, Enrich / Transform, Aggregate by Minute, Customer Aggregate by Minute, Persist. Feeds: Raw Meter Readings, Meter with Customer, Meter by Minute, Meter by Customer by Minute. The Raw Meter Readings and Meter by Minute feeds are persisted.]
Agenda
1. Introduction
2. Apache Storm
3. Apache Spark (Streaming)
4. Unified Log
5. Stream Processing Architectures
6. Summary
Architectural Pattern: Standalone Event Stream Processing
[Figure: standalone event stream processing. Events from the Internet of Things and social media streams (event cloud) enter through an enterprise event bus (ingress) into event processing (ESP / CEP), which uses a state store / event store, a business rule management system (rules) and a result store. Analytical applications and a DB are connected through an enterprise event bus and an enterprise service bus.]
Architectural Pattern: Event Stream Processing as part of Lambda Architecture
[Figure: event stream processing as part of a Lambda Architecture. The standalone pattern (event cloud, enterprise event bus for ingress, event processing with state store / event store and result store, enterprise event bus, enterprise service bus, analytical applications, DB) is extended by a Hadoop big data infrastructure with HDFS, Map/Reduce and its own result store.]
Architectural Pattern: Event Stream Processing as part of Kappa Architecture
[Figure: event stream processing as part of a Kappa Architecture. Events from the Internet of Things and social media streams (event cloud) enter through an enterprise event bus (ingress) into event processing (ESP / CEP) with a state store / event store and a result store; a Hadoop big data infrastructure with HDFS is used for replay. Analytical applications and a DB are connected through an enterprise service bus.]
Questions and answers ...
Guido Schmutz Technology Manager