introduction to streaming analytics

BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH

Introduction to Streaming Analytics

Guido Schmutz

Guido Schmutz

• Working for Trivadis for more than 19 years
• Oracle ACE Director for Fusion Middleware and SOA
• Co-author of different books
• Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
• Member of Trivadis Architecture Board
• Technology Manager @ Trivadis

More than 25 years of software development experience

Contact: [email protected]
Blog: http://guidoschmutz.wordpress.com
Twitter: gschmutz

Our company.

© Trivadis – The Company, 03.06.16

Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on and and Open Source technologies in Switzerland, Germany, Austria and Denmark. We offer our services in the following strategic business fields:

Trivadis Services takes over the interacting operation of your IT systems.

O P E R A T I O N


With over 600 specialists and IT experts in your region.


14 Trivadis branches and more than 600 employees

200 Service Level Agreements

Over 4,000 training participants

Research and development budget: CHF 5.0 million

Financially self-supporting and sustainably profitable

Experience from more than 1,900 projects per year at over 800 customers

Agenda

1. Introduction & Foundation
2. Designing Streaming Analytics Solutions
3. Implementing Event Hub
4. Implementing Data Ingestion
5. Implementing Streaming Analytics
6. Scalability & Reliability
7. Streaming Analytics in Architecture
8. Summary

Introduction & Foundation

Big Data Definition (4 Vs)

+ Time to action? – Big Data + Real-Time = Stream Processing

Characteristics of Big Data: its Volume, Velocity and Variety in combination

The world is changing …

The model of Generating/Consuming Data has changed ….

Old Model: few companies are generating data, all others are consuming data

New Model: all of us are generating data, and all of us are consuming data

Who is generating Big Data?

Progress and innovation are no longer hindered by the ability to collect data

But by the ability to manage, analyze, summarize, visualize and discover knowledge from the collected data in a timely manner and in a scalable fashion

Social media and networks (all of us are generating data)

Scientific instruments (collecting all sorts of data)

Mobile devices (tracking all objects all the time)

Sensor technology and networks (measuring all kinds of data)

Traditional Data Processing - Challenges

• Introduces too much “decision latency”

• Responses are delivered “after the fact”

• Maximum value of the identified situation is lost

• Decisions are made on old and stale data

• “Data at Rest”

The New Era: Streaming Data Analytics / Fast Data

• Events are analyzed and processed in real-time as they arrive

• Decisions are timely, contextual and based on fresh data

• Decision latency is eliminated

• “Data in motion”

Real Time Analytics Use Cases

• Algorithmic Trading

• Online Fraud Detection

• Geo Fencing

• Proximity/Location Tracking

• Intrusion detection systems

• Traffic Management

• Recommendations

• Churn detection

• Internet of Things (IoT) / Intelligent Sensors

• Social Media/Data Analytics

• Gaming Data Feed

• …

What happens in an internet minute

Internet of Things – Sensors are/will be everywhere

There are more devices tapping into the internet than people on earth

How do we prepare our systems/architecture for the future?

Source: Cisco; Source: The Economist

Different Types of Stream/Event Processing

Simple Event Processing (SEP)

Event Stream Processing (ESP)

Different Types of Stream/Event Processing

Complex Event Processing (CEP)

Native Streaming vs. Micro-Batching

Native Streaming
• Events processed as they arrive
• + low latency
• - lower throughput
• - fault tolerance is expensive

Micro-Batching
• Splits incoming stream into small batches
• + high(er) throughput
• + easier fault tolerance
• - higher latency

Source: Distributed Real-Time Stream Processing: Why and How, by Petr Zapletal

How to design a Streaming Analytics Solution?

Diagram: three variants of a streaming pipeline.
• Event Stream → Data Ingestion → Persist (Queue)
• Event Stream → Data Ingestion → Analytics → result
• Event Stream → combined Data Ingestion/Analytics → result

Demo Use Case – Truck Sensors

Truck DataIngestion Geo-Fencing

2016-06-02 14:39:56.605|98|27|Mark Lochbihler|803014426|Wichita to Little Rock Route 2|Normal|38.65|-90.21|5187297736652502631

{"timestamp": "2016-06-02 14:39:56.991", "truckId": 99, "driverId": 31, "driverName": "Rommel Garcia", "routeId": 1565885487, "routeName": "Springfield to KC Via Hanibal", "eventType": "Normal", "latitude": 37.16, "longitude": "-94.46", "correlationId": 5187297736652502631}

RecklessDrivingDetector

NEAR

ENTER

TruckDriver

DashboardMovement MovementJSON

RecklessDriver
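
The two sample truck events on this slide arrive in different formats (pipe-delimited and JSON). A minimal Python sketch of how an ingestion step might bring both into one shape is shown here. The column order in FIELDS is an assumption read off the sample record (the slide does not name the columns), and the helper names are hypothetical.

```python
import json

# Assumption: column order inferred from the sample record on the slide.
FIELDS = ["timestamp", "truckId", "driverId", "driverName", "routeId",
          "routeName", "eventType", "latitude", "longitude", "correlationId"]
INT_FIELDS = {"truckId", "driverId", "routeId", "correlationId"}
FLOAT_FIELDS = {"latitude", "longitude"}


def normalize(event):
    """Coerce numeric fields, which may arrive as strings, to int/float."""
    out = dict(event)
    for name in INT_FIELDS & out.keys():
        out[name] = int(out[name])
    for name in FLOAT_FIELDS & out.keys():
        out[name] = float(out[name])
    return out


def parse_delimited(line):
    """Parse one pipe-delimited truck event into a dict."""
    return normalize(dict(zip(FIELDS, line.strip().split("|"))))


def parse_json(text):
    """Parse one JSON truck event; the sample carries longitude as a string."""
    return normalize(json.loads(text))


csv_event = parse_delimited(
    "2016-06-02 14:39:56.605|98|27|Mark Lochbihler|803014426|"
    "Wichita to Little Rock Route 2|Normal|38.65|-90.21|5187297736652502631")
json_event = parse_json(
    '{"timestamp": "2016-06-02 14:39:56.991", "truckId": 99, "driverId": 31, '
    '"eventType": "Normal", "latitude": 37.16, "longitude": "-94.46"}')
```

After normalization both events expose the same keys with the same types, so downstream analytics does not need to know which source format was used.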

Designing Streaming Analytics Solutions

How to design a Streaming Analytics System?

It usually starts very simple … just one data pipeline

Diagram: Event Stream → Data Ingestion → Analytics

New Event Stream sources are added …

Diagram: n event streams (1st to nth), each with its own Data Ingestion (1st to nth), all delivering events to one Analytics component.

New Processors are interested in the events …

Diagram: n event streams (1st to nth), each with its own Data Ingestion (1st to nth), now delivering events to two Analytics components.

… and the solution becomes the problem

Diagram: n event streams (1st to nth), each with its own Data Ingestion (1st to nth), delivering events to n Analytics components (1st to nth).

… and the solution becomes the problem

Diagram: the sources New Customers, Operational Logs, Click Stream and Meter Readings, each with its own ingestion (CDC Ingestion, Log Ingestion, Click Stream Ingestion, Sensor Ingestion), deliver events directly to the consumers Hadoop / Data Warehouse, Recommendation System, Log Search and Fraud Detection.

Decouple event streams from consumers

“Unified Log”

Remember the Enterprise Service Bus (ESB)?

Diagram: Event Stream Ingestion (CDC Ingestion, Log Ingestion, Click Stream Ingestion, Sensor Ingestion) → Enterprise Event Bus → Event Stream Analytics (Hadoop / Data Warehouse, Recommendation System, Log Search, Fraud Detection)

What is the idea of a Unified Log?

Sources shown: New Customers, Operational Logs, Click Stream, Meter Readings

Unified Log – What is it?

By Unified Log, we do not mean this …

137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-includes/js/tinymce/wp-tinymce.php?c=1&ver=349-20805 HTTP/1.1" 200 101114

137.229.78.245 - - [02/Jul/2012:13:22:28 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30747

137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "POST /wp-admin/post.php HTTP/1.1" 302 -

137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "GET /wp-admin/post.php?post=387&action=edit&message=1 HTTP/1.1" 200 73160

137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/css/editor.css?ver=3.4.1 HTTP/1.1" 304 -

137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver=349-20805 HTTP/1.1" 304 -

137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30809

… but this
• a structured log (records are numbered beginning with 0, based on the order they are written)
• aka. commit log or journal

Diagram: records 0 1 2 3 4 5 6 7 8 9 10 11; record 0 is the 1st record, the next record is written after 11

Central Unified Log for (real-time) subscription

Take all the organization’s data (events) and put it into a central log for subscription

Properties of the Unified Log:

• Unified: “Enterprise”, single deployment

• Append-Only: events are appended, no update in place => immutable

• Ordered: each event has an offset, which is unique within a shard

• Fast: should be able to handle thousands of messages / sec

• Distributed: lives on a cluster of machines

Diagram: a log with records 0 to 11. The Collector writes at the end of the log; Consumer System A reads at time=6 and Consumer System B reads at time=10.
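
The properties above (append-only, ordered by offset, consumers reading at their own position) can be illustrated with a small in-memory sketch. This is a toy model under stated assumptions, not how a distributed log is implemented: the class names UnifiedLog and Consumer are hypothetical, and a single Python list stands in for one shard.

```python
class UnifiedLog:
    """Toy append-only log: records are never updated in place."""

    def __init__(self):
        self._records = []

    def append(self, event):
        """Append an event and return its offset (0-based, sequential)."""
        self._records.append(event)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return a copy of up to max_records events starting at offset."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer tracks its own position; the log keeps no per-reader state."""

    def __init__(self, log, name):
        self.log = log
        self.name = name
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = UnifiedLog()
for i in range(12):                # records 0..11, as in the diagram
    log.append({"seq": i})

system_a = Consumer(log, "A")
system_b = Consumer(log, "B")
system_a.poll(6)                   # A has read records 0..5, next offset is 6
system_b.poll(10)                  # B has read records 0..9, next offset is 10
```

Because reading does not remove anything, a slow consumer (A) and a fast consumer (B) can share the same log without affecting each other.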

Implementing Event Bus

Apache Kafka - Overview

Distributed publish-subscribe messaging system

Designed for processing of real time activity stream data (logs, metrics collections, social media streams, …)

Initially developed at LinkedIn, now part of Apache

Does not use JMS API and standards

Kafka maintains feeds of messages in topics

Kafka Cluster

Consumer Consumer Consumer

Producer Producer Producer

Apache Kafka - Motivation

LinkedIn’s motivation for Kafka was:

• “A unified platform for handling all the real-time data feeds a large company might have.”

Must haves

• High throughput to support high volume event feeds.

• Support real-time processing of these feeds to create new, derived feeds.

• Support large data backlogs to handle periodic ingestion from offline systems.

• Support low-latency delivery to handle more traditional messaging use cases.

• Guarantee fault-tolerance in the presence of machine failures.

Apache Kafka - Architecture

Diagram: a Truck produces to a Kafka Broker that holds the Movement Topic and the Engine-Metrics Topic (messages 1 to 6 each); a Movement Processor and an Engine Processor consume from them.

Apache Kafka - Architecture

Diagram: the same Kafka Broker, with the Movement Topic now split into Partition 0 and Partition 1 (messages 1 to 6 each) and read by two Movement Processor instances; the Engine-Metrics Topic has Partition 0 and is read by the Engine Processor.

Apache Kafka

Diagram: two Kafka Brokers, each showing the Movement Topic and the Engine-Metrics Topic with partitions (P0, P1; messages 1 to 5). The Truck produces; two Movement Processors and the Engine Processor consume.

Apache Kafka - Partition offsets

Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset

• Consumers track their pointers via (offset, partition, topic) tuples

Consumer groupC1
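
A rough sketch of per-partition offsets and consumer-side position tracking follows. It is a toy in-memory model, not the Kafka client API: Topic and OffsetTracker are hypothetical names, and choosing the partition by CRC32 of the key is an assumption made for illustration (Kafka's own partitioner works differently).

```python
import zlib
from collections import defaultdict


class Topic:
    """Toy topic: a list of partitions, each an append-only list of messages."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        """Append to the partition chosen by key; return (partition, offset)."""
        # Assumption: CRC32 of the key stands in for a real partitioner.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(value)
        return partition, len(self.partitions[partition]) - 1


class OffsetTracker:
    """Consumer-side pointers, one per (topic, partition)."""

    def __init__(self):
        self.positions = defaultdict(int)

    def poll(self, topic, partition):
        key = (topic.name, partition)
        records = topic.partitions[partition][self.positions[key]:]
        self.positions[key] += len(records)
        return records


movement = Topic("movement", num_partitions=2)
sent = [movement.send("truck-98", {"eventType": "Normal", "n": n})
        for n in range(3)]

tracker = OffsetTracker()
partition_of_truck_98 = sent[0][0]
first_poll = tracker.poll(movement, partition_of_truck_98)
second_poll = tracker.poll(movement, partition_of_truck_98)
```

Offsets are unique and sequential only within a partition, which is why the consumer position has to include the partition (and topic) and not just a number.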

Apache Kafka - Performance

Kafka at LinkedIn => over 1100 brokers / 60 clusters

• 800 billion messages/day
• 175 TB produced/day
• 650 TB consumed/day
• 13 million messages/second and 2.75 GB/second at busiest time of day

Kafka performance at own setup => 6 brokers (VM) / 1 cluster

• 445’622 messages/second
• 31 MB/second
• 3.0405 ms average latency between producer / consumer

http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

https://engineering.linkedin.com/kafka/running-kafka-scale







Demo: Consuming Kafka Topic

Demo: Monitoring Kafka Cluster with Kafka Manager

Implementing Data Ingestion

StreamSets Data Collector

• Founded by ex-Cloudera, Informatica employees

• Continuous open source, intent-driven, big data ingest

• Visible, record-oriented approach fixes combinatorial explosion

• Batch or stream processing

• Standalone, Spark cluster, MapReduce cluster

• IDE for pipeline development by ‘civilians’

• Relatively new - first public release September 2015

• So far, vast majority of commits are from StreamSets staff

Apache NiFi

• Originated at NSA as Niagarafiles

• Open sourced December 2014, Apache TLP July 2015

• Opaque, file-oriented payload

• Distributed system of processors with centralized control

• Based on flow-based programming concepts

• Data Provenance

• Web-based user interface







Demo: Using Apache NiFi for Collection

Implementing Streaming Analytics

Streaming Analytics

Diagram: streaming analytics offerings arranged by Product vs. Framework/Infrastructure and by Open Source vs. Closed Source.

Implementing Streaming Analytics: Oracle Stream Analytics

History of Oracle Stream Analytics

Diagram: product timeline with the years 2007, 2008, 2012, 2013, 2015 and 2016. Products shown:

• BEA WebLogic Event Server, Oracle CQL
• Oracle Complex Event Processing (OCEP)
• Oracle Event Processing (OEP)
• Oracle Event Processing for Java Embedded
• Oracle Stream Explorer (SX)
• Oracle Stream Analytics (OSA)
• Oracle Edge Analytics (OEA)
• Oracle IoT Cloud Service

OEA

• Filtering
• Correlation
• Aggregation
• Pattern matching

Devices / Gateways

Services

Computing Edge Enterprise

“Sea of data”

Macro-event: high-value, actionable, in-context

Edge Analytics

Stream Analytics

FOG

• High Volume
• Continuous Streaming
• Extreme Low Latency
• Disparate Sources
• Temporal Processing
• Pattern Matching
• Machine Learning

Oracle Stream Analytics: From Noise to Value

• High Volume
• Continuous Streaming
• Sub-Millisecond Latency
• Disparate Sources
• Time-Window Processing
• Pattern Matching

• High Availability / Scalability
• Coherence Integration
• Geospatial, Geofencing
• Big Data Integration

• Business Event Visualization

• Action!

Oracle Stream Analytics Platform

What it does
• Compelling, friendly and visually stunning real-time streaming analytics user experience for business users to dynamically create and implement Instant Insight solutions

Key Features
• Analyze simulated or live data feeds to determine event patterns, correlation, aggregation & filtering
• Pattern library for industry-specific solutions
• Streams, References, Maps & Explorations

Benefits
• Accelerated delivery time
• Hides all challenges & complexities of the underlying real-time event-driven infrastructure

Oracle Stream Analytics - Connecting Everything & Anything of Interest to the Business

Understanding of CQL Filtering, Correlation, Pattern: NOT NEEDED

Understanding of IT Deployment and Management: NOT NEEDED

Understanding of Development, Java, Best Practices: NOT NEEDED

Understanding of the Event Driven Platform: NOT NEEDED

Business accessibility to Geo-Streaming Analytics

Real Time Streaming Solutions face an increasing need to track "assets of interest" and initiate actions based on encroachment of boundary proximity to fixed and moving objects and other geographic, temporal, or event conditions.

Geo-Fence, Fence, Polygon

Geo-Streaming

“Add value to your real time streaming data discovery and analytics by applying and including mathematical, statistical analysis to the live output stream”

“These streaming “Excel spreadsheets” really do come to life”

Expression Builder enabling calculation for the Business User

Concept of Connections & Connection Reuse in Streams

Decision Table for Nested IF-THEN-ELSE Rules

Topology View and Navigation

Stream Analytics – Terminology for Business Users

Explorer: The application user interface

Catalog: The repository for browsing resources


Stream: incoming flow of events that you want to analyze (CSV, Kafka, JMS, Rest, MQTT, …)

Exploration: application that correlates events from streams and data sources, using filters, groupings, summaries, ranges, and more


Shape: A blueprint of an event in a stream or data in a data source. How the business data is represented in the selected stream

Map: collection of geo-fences

Reference: A connection to static data that is joined to a stream to enrich it and/or to be used in business logic and output


Pattern: A pre-built Exploration that addresses a particular business scenario in a focused and simplified User Interface

Connection: collection of metadata required to connect to an external system

Targets: defines an interface with a downstream system







Demo: Oracle Stream Analytics

Implementing Streaming Analytics: Spark Streaming

Apache Spark

Apache Spark is a fast and general engine for large-scale data processing
• The hot trend in Big Data!
• Originally developed 2009 in UC Berkeley’s AMPLab
• Based on 2007 Microsoft Dryad paper
• Written in Scala, supports Java, Python, SQL and R
• Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• One of the largest OSS communities in big data with over 200 contributors in 50+ organizations
• Open sourced in 2010 – since 2014 part of the Apache Software Foundation

Apache Spark

Diagram: the Spark stack.

• Libraries: Spark SQL (Batch Processing), BlinkDB (Approximate Querying), Spark Streaming (Real-Time), MLlib, SparkR (Machine Learning), GraphX (Graph Processing)
• Core Runtime: Spark Core API and Execution Model
• Cluster Resource Managers: Spark Standalone, Mesos, YARN
• Data Stores: HDFS, Elasticsearch, NoSQL, S3

Resilient Distributed Dataset (RDD)

Are
• Immutable
• Re-computable
• Fault tolerant
• Reusable

Have Transformations
• Produce new RDD
• Rich set of transformations available
• filter(), flatMap(), map(), distinct(), groupBy(), union(), join(), sortByKey(), reduceByKey(), subtract(), ...

Have Actions
• Start cluster computing operations
• Rich set of actions available
• collect(), count(), fold(), reduce(), …

RDD RDD

Input Source

• File• Database• Stream• Collection

.count() ->100
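
The split between transformations (which only describe a new RDD) and actions (which start the computation) can be illustrated without a cluster. The sketch below is a plain-Python stand-in, not the Spark API: ToyRDD is a hypothetical class that records a lineage of functions and only walks the data when an action such as count() is called.

```python
class ToyRDD:
    """Toy, single-machine stand-in for an RDD: immutable, lazy, re-computable."""

    def __init__(self, source, lineage=()):
        self._source = source      # re-iterable input, e.g. a range or list
        self._lineage = lineage    # tuple of ("map" | "filter", function)

    # Transformations: return a new ToyRDD, nothing is computed yet.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    def _compute(self):
        """Replay the lineage over the source; used by every action."""
        for item in self._source:
            keep = True
            for kind, fn in self._lineage:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the computation.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []


def traced_double(x):
    calls.append(x)                # lets us observe when work really happens
    return x * 2


numbers = ToyRDD(range(200))
doubled_evens = numbers.filter(lambda x: x % 2 == 0).map(traced_double)
calls_before_action = len(calls)   # transformations alone did no work
total = doubled_evens.count()      # the action executes the lineage
```

Since each ToyRDD keeps only its source and lineage, any result can be recomputed from scratch, which is the idea behind the "re-computable" and "fault tolerant" properties listed above.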

Data

Partitions RDD

Diagram: the data is split into Partition 0 to Partition 9, which are distributed across Server 1 to Server 5.

Partitions RDD

Diagram: the same data, Partition 0 to Partition 9, now shown on Server 2 to Server 5 only.

Spark Workflow

Diagram: Input HDFS File → sc.hadoopFile() → HadoopRDD → flatMap() → MappedRDD → map() → MappedRDD → reduceByKey() → ShuffledRDD → sc.saveAsTextFile() → Text File Output

• Stage 1 – flatMap() + map(): MappedRDD with partitions P0, P1, P3
• Stage 2 – reduceByKey(): ShuffledRDD with partition P0
• Transformations are lazy; the action executes the transformations
• Master: DAG Scheduler

Spark Execution Model

Diagram: a Master and a Worker; the Worker's server has Data Storage and runs several Executors.

Stage 1 – flatMap() + map()


Diagram: a Master and several Workers, each with Data Storage and an Executor, processing the RDD partitions P0, P1 and P3.

Narrow transformations: filter(), map(), sample(), flatMap()

Stage 2 – reduceByKey()


Diagram: a Master and several Workers, each with Data Storage and an Executor; the resulting RDD has partition P0.

Wide transformations: join(), reduceByKey(), union(), groupByKey()

Shuffle!

Batch vs. Real-Time Processing

Petabytes of Data

Gigabytes Per Second

Discretized Stream (DStream)

Diagram: Trucks send individual events to Kafka. The stream is made discrete by time: the events of every X seconds form one RDD of the DStream. Transformations such as .countByValue(), .reduceByKey(), .join and .map turn one DStream into another DStream.

Discretized Stream (DStream)

Diagram: messages arrive over time (time 1, time 2, time 3, … time n). For each interval, message 1 … message n form one RDD (RDD @ time 1, RDD @ time 2, RDD @ time 3, … RDD @ time n). Applying f to every message of an interval gives f(message 1) … f(message n), which in turn produce result 1 … result n.

Diagram: Input Stream → Event DStream → map() → Mapped DStream → saveAsHadoopFiles(). The DStream carries the transformation lineage; as time increases, actions trigger Spark jobs.

Adapted from Chris Fregly: http://slidesha.re/11PP7FV
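
The micro-batching idea behind a DStream (cut the stream into X-second intervals, then run the same function on every interval) can be sketched in plain Python. This is not Spark Streaming code: discretize and count_by_value are hypothetical helpers, the event types other than "Normal" are made up for the example, and intervals without events are simply skipped here.

```python
from collections import Counter, defaultdict


def discretize(events, batch_seconds):
    """Group (timestamp_seconds, payload) pairs into batches by time interval."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        batches[int(timestamp // batch_seconds)].append(payload)
    # One list per interval, in time order (a stand-in for one RDD per interval).
    return [batches[index] for index in sorted(batches)]


def count_by_value(batch):
    """The per-batch function f: count how often each value occurs."""
    return dict(Counter(batch))


# (seconds since start, eventType); values are illustrative only.
events = [
    (0.2, "Normal"), (1.7, "Normal"), (2.1, "Overspeed"), (4.9, "Normal"),
    (5.0, "Lane Departure"), (9.8, "Normal"),
]

results = [count_by_value(batch)
           for batch in discretize(events, batch_seconds=5)]
```

Each entry of results corresponds to one interval, so a result can only be produced once the interval is complete; that is the latency cost of micro-batching.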







Implementing Streaming Analytics: Apache Storm

Apache Storm

A platform for doing analysis on streams of data as they come in, so you can react to data as it happens.

• Highly distributed real-time computation system

• Provides general primitives to do real-time computation

• To simplify working with queues & workers

• scalable and fault-tolerant

Originated at Backtype, acquired by Twitter in 2011

Open sourced late 2011

Part of Apache since September 2013

Apache Storm – Core concepts

Tuple
• Immutable set of key/value pairs

Stream
• An unbounded sequence of tuples that can be processed in parallel by Storm

Topology
• Wires data and functions via a DAG (directed acyclic graph)
• Executes on many machines, similar to a MR job in Hadoop

Spout
• Source of data streams (tuples)
• Can be run in “reliable” and “unreliable” mode

Bolt
• Consumes 1+ streams and produces new streams
• Complex operations often require multiple steps and thus multiple bolts

Diagram: two spouts (one labelled "Source of Stream B") and four bolts. One bolt subscribes to A and emits C, one subscribes to A and emits D, one subscribes to A & B and emits nothing, and one subscribes to C & D and emits nothing. Streams are drawn as sequences of tuples (T).







Apache Storm – How does it work ?

Diagram, built up in three steps:

1. The TrucksMovement spout emits TruckMovement tuples to several GeoHashing bolt instances (Shuffle Grouping). The GeoHashing bolt adds a "geohash" field to each event.
2. The GeoHashing bolts send their tuples to several GeoFencer bolt instances (Fields Grouping).
3. The GeoFencer bolts send their tuples to the Alerter bolt (Global Grouping).

TruckMovement tuple emitted by the spout:

{"timestamp": "2016-06-02 12:56:02.362", "truckId": 35, "driverId": 26, "driverName": "Michael Aube", "routeId": 1090292248, "eventType": "Normal", "latitude": 40.86, "longitude": "-89.91"}

Tuple after GeoHashing:

{"geohash": "dp206n3d", "timestamp": "2016-06-02 12:56:02.362", "truckId": 35, "driverId": 26, "driverName": "Michael Aube", "routeId": 1090292248, "eventType": "Normal", "latitude": 40.86, "longitude": "-89.91"}

Tuple shown after the GeoFencer:

{"timestamp": "2016-06-02 12:56:02.362", "truckId": 35, "driverId": 26, "latitude": 40.86, "longitude": "-89.91"}

Other tuples carry other geohash values, e.g. {"geohash": "f00hfh99", ..
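
What a geo-hashing step does to each truck event can be sketched with the standard geohash encoding (interleave longitude and latitude bisection bits, five bits per base-32 character). This is a plain-Python illustration, not the bolt from the demo: geohash_encode and add_geohash are hypothetical names, and the precision of 8 characters is an assumption based on the length of the sample value.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(latitude, longitude, precision=8):
    """Encode a coordinate as a geohash string of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_longitude = True               # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_longitude:
            mid = (lon_lo + lon_hi) / 2
            if longitude >= mid:
                bits = bits * 2 + 1
                lon_lo = mid
            else:
                bits = bits * 2
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if latitude >= mid:
                bits = bits * 2 + 1
                lat_lo = mid
            else:
                bits = bits * 2
                lat_hi = mid
        use_longitude = not use_longitude
        bit_count += 1
        if bit_count == 5:             # five bits make one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def add_geohash(event, precision=8):
    """Return a copy of the event enriched with a 'geohash' field."""
    enriched = dict(event)
    # The sample events carry longitude as a string, so convert explicitly.
    enriched["geohash"] = geohash_encode(
        float(event["latitude"]), float(event["longitude"]), precision)
    return enriched


event = {"truckId": 35, "driverId": 26, "latitude": 40.86, "longitude": "-89.91"}
enriched = add_geohash(event)
```

Nearby positions share a common geohash prefix, so a later step can group tuples by this field to bring events from the same area to the same task.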

Apache Storm – Core concepts

Each spout or bolt runs N instances in parallel

Diagram: TrucksMovement spout → GeoHashing bolt (instances 1st to nth, Shuffle) → GeoFencing bolt (instances 1st to nth, Fields) → Report (Global)

• Shuffle grouping: random grouping
• Fields grouping: grouped by value, such that equal values result in the same task
• All grouping: replicates to all tasks
• Global grouping: makes all tuples go to one task
• None grouping: makes the bolt run in the same thread as the bolt/spout it subscribes to
• Direct grouping: the producer (task that emits) controls which consumer will receive
• Local or Shuffle grouping: similar to the shuffle grouping, but will shuffle tuples among bolt tasks running in the same worker process, if any. Falls back to shuffle grouping behavior.
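
How a grouping decides which task receives a tuple can be shown with a few small functions. This is a sketch of the routing idea only, not Storm's implementation: the function names are hypothetical, and hashing the field value with CRC32 is an assumption chosen to keep the example deterministic.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng=random):
    """Random task: spreads load evenly, no affinity between tuples."""
    return rng.randrange(num_tasks)


def fields_grouping(tup, num_tasks, field):
    """Equal field values always map to the same task."""
    # Assumption: CRC32 stands in for whatever hash the runtime uses.
    return zlib.crc32(str(tup[field]).encode("utf-8")) % num_tasks


def global_grouping(tup, num_tasks):
    """Every tuple goes to one single task."""
    return 0


def all_grouping(tup, num_tasks):
    """The tuple is replicated to every task."""
    return list(range(num_tasks))


tuples = [
    {"truckId": 35, "geohash": "dp206n3d"},
    {"truckId": 98, "geohash": "f00hfh99"},
    {"truckId": 35, "geohash": "dp206n3d"},
]

by_geohash = [fields_grouping(t, 4, "geohash") for t in tuples]
to_alerter = [global_grouping(t, 4) for t in tuples]
```

Fields grouping is what makes per-key state possible: a task that keeps state for one geohash (or one truck) is guaranteed to see all tuples with that value.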

Scalability & Reliability

How to scale a Streaming Analytics System?

Diagram sequence:

• The Event Stream is handled by several Collecting Threads (1 to n) and Processing Threads (1 to n) around one persisted Queue; the processing threads produce the results.
• The threads become separate Collecting Processes and Processing Processes, with several persisted queues (Queue 1 to Queue n) between them.
• The Collecting Processes feed several processing stages (Processing A, Processing B), each running in multiple threads or processes, through the queues Q1 to Qn.

How to make a Streaming Analytics System reliable?

Faults and stragglers are inevitable in large clusters running big data applications.

Streaming applications must recover from them quickly.


How to deal with “Stragglers”

Consumer goes slow. Options:

• Backpressure: other jobs grind to a halt
• Queue up: run out of memory, or spill to disk
• Drop data: no thanks
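
The three reactions to a slow consumer can be compared with a small deterministic simulation. It is a sketch under simple assumptions, not a model of any particular engine: the producer wants to emit one event per tick, the consumer handles one event every consume_every ticks, and the function name simulate and the policy names are made up for this example.

```python
from collections import deque


def simulate(policy, events=12, capacity=3, consume_every=3):
    """Run producer and slow consumer until every event is emitted or dropped.

    policy: "queue_up" (unbounded buffer), "drop" (discard when full) or
            "backpressure" (producer waits when the buffer is full).
    """
    buffer = deque()
    emitted = dropped = stalled_ticks = peak_queue = tick = 0
    while emitted + dropped < events:
        tick += 1
        if tick % consume_every == 0 and buffer:
            buffer.popleft()                   # the slow consumer takes one
        full = len(buffer) >= capacity
        if policy == "backpressure" and full:
            stalled_ticks += 1                 # producer is held back
        elif policy == "drop" and full:
            dropped += 1                       # event is lost
        else:
            buffer.append(tick)                # queue_up ignores the capacity
            emitted += 1
        peak_queue = max(peak_queue, len(buffer))
    return {"ticks": tick, "emitted": emitted, "dropped": dropped,
            "stalled_ticks": stalled_ticks, "peak_queue": peak_queue}


outcomes = {policy: simulate(policy)
            for policy in ("queue_up", "drop", "backpressure")}
```

Comparing the three entries of outcomes shows the trade-off named on the slide: queueing grows the buffer, dropping loses events, and backpressure keeps everything but slows the producer (and whatever depends on it).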

How to make Streaming Analytics System reliable?

Solution 1: using an active/passive system (hot replication)
• Both systems process the full load
• In case of a failure, automatically switch and use the “passive” system
• Stragglers slow down both the active and the passive system

Diagram: the same pipeline deployed twice, once as Active and once as Passive, both fed by the Event Stream and each with its own State (State = state in-memory and/or on-disk).

How to make Streaming Analytics System reliable?

Solution 2: Upstream backup
• Nodes buffer sent messages and replay them to a new node in case of failure
• Stragglers are treated as failures

Diagram: the pipeline with a buffer at each sending node and State at the processing nodes (State = state in-memory and/or on-disk; buffer = buffer for replay, in-memory and/or on-disk).
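
Upstream backup can be sketched as follows: the sending node keeps every message until the receiver confirms it is safely reflected in saved state, and replays the rest to a replacement node after a failure. This is a toy, single-process illustration with hypothetical class names; the checkpoint-then-acknowledge step is an assumption about how "State" and "buffer" interact, not something the slide spells out.

```python
class UpstreamNode:
    """Sender that buffers messages until they are acknowledged."""

    def __init__(self):
        self.buffer = {}           # sequence number -> message
        self.next_seq = 0

    def send(self, downstream, message):
        seq = self.next_seq
        self.next_seq += 1
        self.buffer[seq] = message
        downstream.receive(seq, message)

    def ack_up_to(self, seq):
        """Downstream state up to seq is saved, so the buffer can be trimmed."""
        for done in [s for s in self.buffer if s <= seq]:
            del self.buffer[done]

    def replay(self, new_downstream):
        """Resend everything not yet acknowledged, in order."""
        for seq in sorted(self.buffer):
            new_downstream.receive(seq, self.buffer[seq])


class SummingNode:
    """Downstream node with state: a running total of the messages."""

    def __init__(self, total=0, last_seq=-1):
        self.total = total
        self.last_seq = last_seq

    def receive(self, seq, message):
        self.total += message
        self.last_seq = seq

    def checkpoint(self):
        return {"total": self.total, "last_seq": self.last_seq}


upstream = UpstreamNode()
node = SummingNode()
upstream.send(node, 1)
upstream.send(node, 2)
saved = node.checkpoint()                  # state after seq 0 and 1
upstream.ack_up_to(saved["last_seq"])      # buffer no longer needs them
upstream.send(node, 3)
upstream.send(node, 4)
total_before_failure = node.total

# The node fails; a replacement starts from the checkpoint and gets a replay.
replacement = SummingNode(**saved)
upstream.replay(replacement)
```

The replacement ends with the same total as the failed node had, at the cost of buffering on the sender and of reprocessing the replayed messages.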

Message Delivery Semantics

At most once [0,1]
• Messages may be lost
• Messages are never redelivered

At least once [1 .. n]
• Messages will never be lost
• but messages may be redelivered (might be ok if the consumer can handle it)

Exactly once [1]
• Messages are never lost
• Messages are never redelivered
• Perfect message delivery
• Incurs higher latency for transactional semantics
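
The remark that at-least-once "might be ok if the consumer can handle it" usually means making the consumer idempotent. The sketch below shows one common way, de-duplicating by message id; the names and the idea of a lost acknowledgement triggering a resend are assumptions chosen for illustration, not taken from a specific product.

```python
def send_at_least_once(messages, ack_lost_for):
    """Yield deliveries; a message whose ack got lost is sent a second time."""
    for msg_id, payload in messages:
        yield msg_id, payload
        if msg_id in ack_lost_for:
            yield msg_id, payload          # sender retries after missing ack


class NaiveConsumer:
    """Applies every delivery, so duplicates are counted twice."""

    def __init__(self):
        self.total = 0

    def handle(self, msg_id, payload):
        self.total += payload


class IdempotentConsumer:
    """Remembers processed ids and ignores redeliveries."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def handle(self, msg_id, payload):
        if msg_id in self.seen:
            return False                   # duplicate, already applied
        self.seen.add(msg_id)
        self.total += payload
        return True


messages = [(1, 10), (2, 20), (3, 30)]
deliveries = list(send_at_least_once(messages, ack_lost_for={2}))

naive = NaiveConsumer()
idempotent = IdempotentConsumer()
for msg_id, payload in deliveries:
    naive.handle(msg_id, payload)
    idempotent.handle(msg_id, payload)
```

With de-duplication the visible effect matches exactly-once processing even though the transport only guarantees at-least-once, provided the set of seen ids is stored as reliably as the state itself.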

Streaming Analytics in Architecture

“Traditional Architecture” for Big Data

Diagram components:

• Data Sources: Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine
• Data Collection: Channel, Stage
• (Analytical) Data Processing: Batch compute, Raw Data (Reservoir)
• Result Store: Computed Information, Query Engine
• Data Consumer: Reports, Service, Analytic Tools, Alerting Tools
• Legend: Data in Motion, Data at Rest

Streaming Analytics Architecture for Big Data, aka. (Complex) Event Processing

Diagram components:

• Data Sources: Social, Logfiles, Sensor, RDBMS, ERP, Mobile, Machine
• Data Collection: Channel
• (Analytical) Real-Time Data Processing: Stream/Event Processing, Messaging
• Batch compute
• Result Store
• Data Consumer: Reports, Service, Analytic Tools, Alerting Tools


Keep raw event data

Diagram components:

• Data Sources: Social, Logfiles, Sensor, RDBMS, ERP, Mobile, Machine
• Data Collection: Channel
• Messaging, Result Store
• (Analytical) Batch Data Processing: Batch compute, Raw Data (Reservoir)
• Data Consumer: Reports, Service, Analytic Tools, Alerting Tools

“Lambda Architecture” for Big Data

Diagram components:

• Data Sources: Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine
• Data Collection: Channel
• (Analytical) Batch Data Processing: Batch compute, Raw Data (Reservoir), Result Store (Computed Information)
• Messaging, Batch compute, Result Store
• Query Engine
• Data Consumer: Reports, Service, Analytic Tools, Alerting Tools


“Kappa Architecture” for Big Data

Diagram components:

• Data Sources: Social, Logfiles, Sensor, RDBMS, ERP, Mobile, Machine
• Data Collection: Messaging
• “Raw Data Reservoir”: Raw Data (Reservoir)
• Batch compute, Messaging
• Result Store: Computed Information
• Data Consumer: Reports, Service, Analytic Tools, Alerting Tools

“Unified Architecture” for Big Data

Diagram components:

• Data Sources: Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine
• Data Collection: Channel
• (Analytical) Batch Data Processing (calculate models of incoming data): Batch compute, Raw Data (Reservoir), Result Store (Computed Information)
• Messaging, Batch compute, Result Store
• Prediction Models
• Query Engine
• Data Consumer: Reports, Service, Analytic Tools, Alerting Tools

Summary

Summary

More and more use cases (such as IoT) make Streaming Analytics necessary

Treat events as events! Infrastructures for handling lots of events are available!

Platforms such as Oracle Stream Analytics enable the business to work directly on streaming data (empower the business analyst) => User Experience of an Excel Sheet on streaming data

Platforms such as Apache Storm and Apache Spark Streaming provide a highly scalable and fault-tolerant infrastructure for streaming analytics => Oracle Stream Analytics can use Spark Streaming as the runtime infrastructure

Platforms such as Kafka provide a high volume event broker infrastructure, a.k.a. Event Hub

Comparison

                            Oracle Stream Analytics                     Spark Streaming                        Apache Storm
Community                   n.a.                                        > 280 contributors                     > 100 contributors
Language Options            Java, CQL                                   Java, Scala, Python                    Java, Clojure, Scala, …
Processing Models           Event-Streaming                             Micro-Batching                         Event-Streaming
Processing DSL              Yes                                         Yes                                    No
Stateful Ops                Yes                                         Yes                                    No
Pattern detection           Yes                                         No                                     No
Scalability & Reliability   limited                                     yes                                    yes
Distributed RPC             No                                          No                                     Yes
Delivery Guarantees         At least once                               Exactly once                           At most once / At least once
Latency                     sub-second                                  seconds                                sub-second
"Self-service" for Biz      Yes                                         No                                     No
Platform                    OEP server, Spark Streaming (YARN, Mesos)   YARN, Mesos, Standalone, DataStax EE   Storm Cluster, YARN

Guido Schmutz
Technology Manager

[email protected]


Data & Analytics