Introduction to Streaming Analytics

Guido Schmutz

Upload: guido-schmutz

Post on 21-Jan-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Introduction to Streaming Analytics

BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH

Introduction to Streaming Analytics

Guido Schmutz

Page 2: Introduction to Streaming Analytics

Guido Schmutz

• Working for Trivadis for more than 19 years
• Oracle ACE Director for Fusion Middleware and SOA
• Co-author of different books
• Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
• Member of Trivadis Architecture Board
• Technology Manager @ Trivadis

More than 25 years of software development experience

Contact: [email protected]
http://guidoschmutz.wordpress.com
Twitter: gschmutz

Page 3: Introduction to Streaming Analytics

Our company.

© Trivadis – The Company, 03.06.16

Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on and and Open Source technologies in Switzerland, Germany, Austria and Denmark. We offer our services in the following strategic business fields:

Trivadis Services takes over the interacting operation of your IT systems.

OPERATION

Page 4: Introduction to Streaming Analytics

[Map: Trivadis locations in Copenhagen, Munich, Lausanne, Bern, Zurich, Brugg, Geneva, Hamburg, Düsseldorf, Frankfurt, Stuttgart, Freiburg, Basel and Vienna]

With over 600 specialists and IT experts in your region.


14 Trivadis branches and more than 600 employees

200 Service Level Agreements

Over 4,000 training participants

Research and development budget: CHF 5.0 million

Financially self-supporting and sustainably profitable

Experience from more than 1,900 projects per year at over 800 customers

Page 5: Introduction to Streaming Analytics

Agenda

1. Introduction & Foundation
2. Designing Streaming Analytics Solutions
3. Implementing Event Hub
4. Implementing Data Ingestion
5. Implementing Streaming Analytics
6. Scalability & Reliability
7. Streaming Analytics in Architecture
8. Summary

Page 6: Introduction to Streaming Analytics

Introduction & Foundation

Page 7: Introduction to Streaming Analytics

Big Data Definition (4 Vs)

+ Time to action? – Big Data + Real-Time = Stream Processing

Characteristics of Big Data: its Volume, Velocity and Variety in combination

Page 8: Introduction to Streaming Analytics

The world is changing …

The model of Generating/Consuming Data has changed ….

Old Model: few companies are generating data, all others are consuming data

New Model: all of us are generating data, and all of us are consuming data

Page 9: Introduction to Streaming Analytics

Who is generating Big Data?

Progress and innovation are no longer hindered by the ability to collect data

But by the ability to manage, analyze, summarize, visualize and discover knowledge from the collected data in a timely manner and in a scalable fashion

Social media and networks (all of us are generating data)

Scientific instruments (collecting all sorts of data)

Mobile devices (tracking all objects all the time)

Sensor technology and networks (measuring all kinds of data)

Page 10: Introduction to Streaming Analytics

Traditional Data Processing - Challenges

• Introduces too much “decision latency”

• Responses are delivered “after the fact”

• Maximum value of the identified situation is lost

• Decisions are made on old and stale data

• “Data at Rest”

Page 11: Introduction to Streaming Analytics

The New Era: Streaming Data Analytics / Fast Data

• Events are analyzed and processed in real-time as they arrive

• Decisions are timely, contextual and based on fresh data

• Decision latency is eliminated

• “Data in motion”

Page 12: Introduction to Streaming Analytics

Real Time Analytics Use Cases

• Algorithmic Trading

• Online Fraud Detection

• Geo Fencing

• Proximity/Location Tracking

• Intrusion detection systems

• Traffic Management

• Recommendations

• Churn detection

• Internet of Things (IoT) / Intelligence Sensors

• Social Media/Data Analytics

• Gaming Data Feed

• …

Page 13: Introduction to Streaming Analytics

What happens in an internet minute

Page 14: Introduction to Streaming Analytics

Internet of Things – Sensors are/will be everywhere

There are more devices tapping into the internet than people on earth

How do we prepare our systems/architecture for the future?

Source: Cisco, Source: The Economist

Page 15: Introduction to Streaming Analytics

Different Types of Stream/Event Processing

Simple Event Processing (SEP)

Event Stream Processing (ESP)

Page 16: Introduction to Streaming Analytics

Different Types of Stream/Event Processing

Complex Event Processing (CEP)

Page 17: Introduction to Streaming Analytics

Native Streaming vs. Micro-Batching

Native Streaming
• Events processed as they arrive
• + low latency
• - lower throughput
• - fault tolerance is expensive

Micro-Batching
• Splits incoming stream into small batches
• + high(er) throughput
• + easier fault tolerance
• - higher latency

Source: Distributed Real-Time Stream Processing: Why and How by Petr Zapletal

Page 18: Introduction to Streaming Analytics

How to design a Streaming Analytics Solution?

[Figure: three ways to arrange the building blocks. (1) Event Stream → Data Ingestion → Persist (Queue) → Analytics → result; (2) Event Stream → Data Ingestion → Analytics → result; (3) Event Stream → combined Data Ingestion/Analytics → result]

Page 19: Introduction to Streaming Analytics

Demo Use Case – Truck Sensors

Truck → Data Ingestion → Geo-Fencing

Raw sensor record (pipe-delimited):
2016-06-02 14:39:56.605|98|27|Mark Lochbihler|803014426|Wichita to Little Rock Route 2|Normal|38.65|-90.21|5187297736652502631

Movement event as JSON:
{"timestamp": "2016-06-02 14:39:56.991", "truckId": 99, "driverId": 31, "driverName": "Rommel Garcia", "routeId": 1565885487, "routeName": "Springfield to KC Via Hanibal", "eventType": "Normal", "latitude": 37.16, "longitude": "-94.46", "correlationId": 5187297736652502631}

[Figure: demo pipeline with the elements Truck, Movement, Movement JSON, Geo-Fencing (NEAR / ENTER), Reckless Driving Detector, Reckless Driver, Truck Driver and Dashboard]
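The step from the pipe-delimited record to the JSON event can be sketched as follows. This is a minimal illustration, not the code used in the demo: the field order is inferred from the JSON sample on this slide, and the function name `parse_truck_event` is hypothetical. Unlike the slide's JSON, the sketch emits the longitude as a number rather than a string.

```python
import json

# Field order is an assumption, inferred from the JSON sample on the slide.
FIELDS = ["timestamp", "truckId", "driverId", "driverName", "routeId",
          "routeName", "eventType", "latitude", "longitude", "correlationId"]
INT_FIELDS = {"truckId", "driverId", "routeId", "correlationId"}
FLOAT_FIELDS = {"latitude", "longitude"}


def parse_truck_event(line):
    """Turn one pipe-delimited sensor record into a dict with typed values."""
    values = line.strip().split("|")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    event = {}
    for name, raw in zip(FIELDS, values):
        if name in INT_FIELDS:
            event[name] = int(raw)
        elif name in FLOAT_FIELDS:
            event[name] = float(raw)
        else:
            event[name] = raw
    return event


raw = ("2016-06-02 14:39:56.605|98|27|Mark Lochbihler|803014426|"
       "Wichita to Little Rock Route 2|Normal|38.65|-90.21|5187297736652502631")
event = parse_truck_event(raw)
print(json.dumps(event))
```

Rejecting records with the wrong number of fields early keeps malformed sensor lines out of the downstream topics.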

Page 20: Introduction to Streaming Analytics

Designing Streaming Analytics Solutions

Page 21: Introduction to Streaming Analytics

How to design a Streaming Analytics System?

It usually starts very simple … just one data pipeline

[Figure: Event Stream → Data Ingestion → Analytics]

Page 22: Introduction to Streaming Analytics

New Event Stream sources are added …

[Figure: the 1st, 2nd, 3rd … nth Event Stream each get their own Data Ingestion, and all of them deliver events to the single Analytics component]

Page 23: Introduction to Streaming Analytics

New Processors are interested in the events …

[Figure: n Event Streams, each with its own Data Ingestion, now deliver events to two consumers: Analytics and 2nd Analytics]

Page 24: Introduction to Streaming Analytics

… and the solution becomes the problem

[Figure: n Event Streams with n Data Ingestion components deliver events to the 1st, 2nd, 3rd … nth Analytics component, so every ingestion is wired to every consumer]


Page 26: Introduction to Streaming Analytics

… and the solution becomes the problem

[Figure: the same point-to-point wiring with concrete systems. Sources: New Customers, Operational Logs, Click Stream, Meter Readings. Ingestion: CDC Ingestion, Log Ingestion, Click Stream Ingestion, Sensor Ingestion. Consumers: Hadoop / Data Warehouse, Recommendation System, Log Search, Fraud Detection]

Page 27: Introduction to Streaming Analytics

Decouple event streams from consumers

„Unified Log“

Remember Enterprise Service Bus (ESB)?

What is the idea of a Unified Log?

[Figure: Event Stream Ingestion (CDC Ingestion, Log Ingestion, Click Stream Ingestion, Sensor Ingestion) → Enterprise Event Bus → Event Stream Analytics (Hadoop / Data Warehouse, Recommendation System, Log Search, Fraud Detection). Sources: New Customers, Operational Logs, Click Stream, Meter Readings]

Page 28: Introduction to Streaming Analytics

Unified Log – What is it?

By Unified Log, we do not mean this …

137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-includes/js/tinymce/wp-tinymce.php?c=1&ver=349-20805 HTTP/1.1" 200 101114

137.229.78.245 - - [02/Jul/2012:13:22:28 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30747

137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "POST /wp-admin/post.php HTTP/1.1" 302 -

137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "GET /wp-admin/post.php?post=387&action=edit&message=1 HTTP/1.1" 200 73160

137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/css/editor.css?ver=3.4.1 HTTP/1.1" 304 -

137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver=349-20805 HTTP/1.1" 304 -

137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30809

… but this
• a structured log (records are numbered beginning with 0, based on the order they are written)
• aka commit log or journal

[Figure: log cells numbered 0 to 11; cell 0 is the 1st record, and the next record is written after cell 11]

Page 29: Introduction to Streaming Analytics

Central Unified Log for (real-time) subscription

Take all the organization’s data (events) and put it into a central log for subscription

Properties of the Unified Log:

• Unified: “Enterprise”, single deployment

• Append-Only: events are appended, no update in place => immutable

• Ordered: each event has an offset, which is unique within a shard

• Fast: should be able to handle thousands of messages / sec

• Distributed: lives on a cluster of machines

[Figure: log cells 0 to 11. The Collector writes at the end of the log; Consumer System A reads at position 6 (time=6), Consumer System B reads at position 10 (time=10)]
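The properties listed above (append-only, ordered by offset, independent readers) can be illustrated with a toy in-memory log. This is a sketch under simplifying assumptions: one shard, no persistence, no distribution. The class names `UnifiedLog` and `Consumer` are made up for this example.

```python
class UnifiedLog:
    """Toy append-only log for a single shard: records are never updated in place."""

    def __init__(self):
        self._records = []

    def append(self, event):
        """Append an event and return its offset (position in the log)."""
        self._records.append(event)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records events starting at the given offset."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer keeps its own read position, independent of the others."""

    def __init__(self, log, name, position=0):
        self.log = log
        self.name = name
        self.position = position

    def poll(self, max_records=10):
        batch = self.log.read(self.position, max_records)
        self.position += len(batch)
        return batch


log = UnifiedLog()
for i in range(12):                      # the collector writes records 0..11
    log.append(f"event-{i}")

system_a = Consumer(log, "A", position=6)    # as in the figure: A is at 6
system_b = Consumer(log, "B", position=10)   # B is at 10
batch_a = system_a.poll(3)
batch_b = system_b.poll(3)
print(batch_a, batch_b)
```

Because the log itself never changes, a slow consumer such as System A does not hold back System B; each one only advances its own position.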

Page 30: Introduction to Streaming Analytics

Implementing Event Bus

Page 31: Introduction to Streaming Analytics

Apache Kafka - Overview

Distributed publish-subscribe messaging system

Designed for processing of real time activity stream data (logs, metrics collections, social media streams, …)

Initially developed at LinkedIn, now part of Apache

Does not use JMS API and standards

Kafka maintains feeds of messages in topics

Kafka Cluster

Consumer Consumer Consumer

Producer Producer Producer

Page 32: Introduction to Streaming Analytics

Apache Kafka - Motivation

LinkedIn’s motivation for Kafka was:

• “A unified platform for handling all the real-time data feeds a large company might have.”

Must haves

• High throughput to support high volume event feeds.

• Support real-time processing of these feeds to create new, derived feeds.

• Support large data backlogs to handle periodic ingestion from offline systems.

• Support low-latency delivery to handle more traditional messaging use cases.

• Guarantee fault-tolerance in the presence of machine failures.

Page 33: Introduction to Streaming Analytics

Apache Kafka - Architecture

[Figure: a Truck produces to a Kafka Broker holding the Movement Topic and the Engine-Metrics Topic (messages 1 to 6 each); the Movement Processor and the Engine Processor consume them]

Page 34: Introduction to Streaming Analytics

Apache Kafka - Architecture

[Figure: a Truck produces to a Kafka Broker. The Movement Topic is split into Partition 0 and Partition 1 (messages 1 to 6 each) and is read by two Movement Processor instances; the Engine-Metrics Topic has a single Partition 0 read by the Engine Processor]

Page 35: Introduction to Streaming Analytics

Apache Kafka

[Figure: two Kafka Brokers, each holding partitions (P0, P1, messages 1 to 5) of the Movement Topic and the Engine-Metrics Topic; a Truck produces, Movement Processor instances and the Engine Processor consume]

Page 36: Introduction to Streaming Analytics

Apache Kafka - Partition offsets

Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset

• Consumers track their pointers via (offset, partition, topic) tuples

Consumer groupC1
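The idea of per-partition offsets and of consumers tracking (offset, partition, topic) tuples can be sketched without a broker. The code below is a pure-Python simulation, not the Kafka client API: the classes `Topic` and `GroupConsumer` are hypothetical, and CRC32 is used as a stand-in for the hash Kafka's default partitioner applies to keys.

```python
import zlib
from collections import defaultdict


class Topic:
    """Toy topic: a fixed number of partitions, each an append-only list."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        """Route by key hash; return (partition, offset) of the stored message."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(value)
        return partition, len(self.partitions[partition]) - 1


class GroupConsumer:
    """Tracks the next offset to read per (topic, partition)."""

    def __init__(self):
        self.committed = defaultdict(int)

    def poll(self, topic, partition):
        start = self.committed[(topic.name, partition)]
        records = topic.partitions[partition][start:]
        self.committed[(topic.name, partition)] = start + len(records)
        return records


movement = Topic("movement", num_partitions=2)
truck_ids = [98, 99, 98, 99, 98]
placements = [movement.send(f"truck-{t}", f"position-{i}")
              for i, t in enumerate(truck_ids)]
print(placements)   # same key -> same partition, offsets grow within a partition
```

Offsets are only unique and ordered within one partition, which is why the consumer's bookkeeping is keyed by topic and partition rather than by topic alone.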

Page 37: Introduction to Streaming Analytics

Apache Kafka - Performance

Kafka at LinkedIn => over 1100 brokers / 60 clusters
• 800 billion messages / day
• 175 TB produced / day
• 650 TB consumed / day
• 13 million messages / second and 2.75 GB / second at busiest time of day

Kafka performance at own setup => 6 brokers (VM) / 1 cluster
• 445’622 messages / second
• 31 MB / second
• 3.0405 ms average latency between producer / consumer

http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

https://engineering.linkedin.com/kafka/running-kafka-scale

Page 38: Introduction to Streaming Analytics

Demo Use Case – Truck Sensors


Page 39: Introduction to Streaming Analytics

Demo: Consuming Kafka Topic

Page 40: Introduction to Streaming Analytics

Demo: Monitoring Kafka Cluster with Kafka Manager

Page 41: Introduction to Streaming Analytics

Implementing Data Ingestion

Page 42: Introduction to Streaming Analytics

StreamSets Data Collector

• Founded by ex-Cloudera, Informatica employees
• Continuous open source, intent-driven, big data ingest
• Visible, record-oriented approach fixes combinatorial explosion
• Batch or stream processing
• Standalone, Spark cluster, MapReduce cluster
• IDE for pipeline development by ‘civilians’
• Relatively new - first public release September 2015
• So far, vast majority of commits are from StreamSets staff

Page 43: Introduction to Streaming Analytics

Apache NiFi

• Originated at NSA as NiagaraFiles

• Open sourced December 2014, Apache TLP July 2015

• Opaque, file-oriented payload

• Distributed system of processors with centralized control

• Based on flow-based programming concepts

• Data Provenance

• Web-based user interface

Page 44: Introduction to Streaming Analytics

Demo Use Case – Truck Sensors


Page 45: Introduction to Streaming Analytics

Demo: Using Apache NiFi for Collection

Page 46: Introduction to Streaming Analytics

Implementing Streaming Analytics

Page 47: Introduction to Streaming Analytics

Streaming Analytics

[Figure: streaming analytics offerings arranged along two axes: Product vs. Framework/Infrastructure and Open Source vs. Closed Source]

Page 48: Introduction to Streaming Analytics

Implementing Streaming Analytics: Oracle Stream Analytics

Page 49: Introduction to Streaming Analytics

History of Oracle Stream Analytics

[Timeline figure with the year markers 2007, 2008, 2012, 2013, 2015 and 2016. Products shown: Oracle Complex Event Processing (OCEP), Oracle Event Processing (OEP), Oracle Stream Explorer (SX), Oracle Event Processing for Java Embedded, Oracle Stream Analytics (OSA), Oracle Edge Analytics (OEA), BEA WebLogic Event Server, Oracle CQL, Oracle IoT Cloud Service]

Page 50: Introduction to Streaming Analytics

Oracle Stream Analytics: From Noise to Value

[Figure: from the Computing Edge (Devices / Gateways, “Sea of data”, FOG) to the Enterprise (Services), with EDGE Analytics (OEA) and Stream Analytics in between. Output: macro-event, high-value, actionable, in-context]

OEA
• Filtering
• Correlation
• Aggregation
• Pattern matching

• High Volume
• Continuous Streaming
• Extreme Low Latency
• Disparate Sources
• Temporal Processing
• Pattern Matching
• Machine Learning

• High Volume
• Continuous Streaming
• Sub-Millisecond Latency
• Disparate Sources
• Time-Window Processing
• Pattern Matching
• High Availability / Scalability
• Coherence Integration
• Geospatial, Geofencing
• Big Data Integration
• Business Event Visualization
• Action!

Page 51: Introduction to Streaming Analytics

Oracle Stream Analytics Platform

What it does
• Compelling, friendly and visually stunning real time streaming analytics user experience for Business users to dynamically create and implement Instant Insight solutions

Key Features
• Analyze simulated or live data feeds to determine event patterns, correlation, aggregation & filtering
• Pattern library for industry specific solutions
• Streams, References, Maps & Explorations

Benefits
• Accelerated delivery time
• Hides all challenges & complexities of underlying real-time event-driven infrastructure

Page 52: Introduction to Streaming Analytics

Oracle Stream Analytics - Connecting Everything & Anything of Interest to the Business

Understanding of CQL Filtering, Correlation, Pattern: NOT NEEDED

Understanding of IT Deployment and Management: NOT NEEDED

Understanding of Development, Java, Best Practices: NOT NEEDED

Understanding of the Event Driven Platform: NOT NEEDED

Page 53: Introduction to Streaming Analytics

Business accessibility to Geo-Streaming Analytics

Real Time Streaming Solutions face an increasing need to track "assets of interest" and initiate actions based on encroachment of boundary proximity to fixed and moving objects and other geographic, temporal, or event conditions.

Geo-Fence,Fence,Polygon

Geo-Streaming

Page 54: Introduction to Streaming Analytics

“Add value to your real time streaming data discovery and analytics by applying and including mathematical, statistical analysis to the live output stream”

“These streaming “Excel spreadsheets” really do come to life”

Expression Builder enabling calculation for the Business User

Page 55: Introduction to Streaming Analytics

Concept of Connections & Connection Reuse in Streams

Page 56: Introduction to Streaming Analytics

Decision Table for Nested IF-THEN-ELSE Rules

Page 57: Introduction to Streaming Analytics

Topology View and Navigation

Page 58: Introduction to Streaming Analytics

Stream Analytics – Terminology for Business Users

Explorer: The Application User Interface

Catalog: The repository for browsing resources

Page 59: Introduction to Streaming Analytics

Stream Analytics – Terminology for Business Users

Stream: incoming flow of events that you want to analyze (CSV, Kafka, JMS, Rest, MQTT, …)

Exploration: application that correlates events from streams and data sources, using filters, groupings, summaries, ranges, and more

Page 60: Introduction to Streaming Analytics

Stream Analytics – Terminology for Business Users

Shape: A blueprint of an event in a stream or data in a data source. How the business data is represented in the selected stream

Map: collection of geo-fences

Reference: A connection to static data that is joined to a stream to enrich it and/or to be used in business logic and output

Page 61: Introduction to Streaming Analytics

Stream Analytics – Terminology for Business Users

Pattern: A pre-built Exploration that addresses a particular business scenario in a focused and simplified User Interface

Connection: collection of metadata required to connect to an external system

Targets: defines an interface with a downstream system

Page 62: Introduction to Streaming Analytics

Demo Use Case – Truck Sensors


Page 63: Introduction to Streaming Analytics

Demo: Oracle Stream Analytics

Page 64: Introduction to Streaming Analytics

Demo: Oracle Stream Analytics

Page 65: Introduction to Streaming Analytics

Demo: Oracle Stream Analytics

Page 66: Introduction to Streaming Analytics

Demo: Oracle Stream Analytics

Page 67: Introduction to Streaming Analytics

Implementing Streaming Analytics: Spark Streaming

Page 68: Introduction to Streaming Analytics

Apache Spark

Apache Spark is a fast and general engine for large-scale data processing
• The hot trend in Big Data!
• Originally developed 2009 in UC Berkeley’s AMPLab
• Based on 2007 Microsoft Dryad paper
• Written in Scala, supports Java, Python, SQL and R
• Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• One of the largest OSS communities in big data with over 200 contributors in 50+ organizations
• Open sourced in 2010 – since 2014 part of Apache Software Foundation

Page 69: Introduction to Streaming Analytics

Apache Spark

Libraries: Spark SQL (Batch Processing), BlinkDB (Approximate Querying), Spark Streaming (Real-Time), MLlib, SparkR (Machine Learning), GraphX (Graph Processing)

Core Runtime: Spark Core API and Execution Model

Cluster Resource Managers: Spark Standalone, MESOS, YARN

Data Stores: HDFS, Elastic Search, NoSQL, S3

Page 70: Introduction to Streaming Analytics

Resilient Distributed Dataset (RDD)

Are
• Immutable
• Re-computable
• Fault tolerant
• Reusable

Have Transformations
• Produce new RDD
• Rich set of transformations available
• filter(), flatMap(), map(), distinct(), groupBy(), union(), join(), sortByKey(), reduceByKey(), subtract(), ...

Have Actions
• Start cluster computing operations
• Rich set of actions available
• collect(), count(), fold(), reduce(), …
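The split between lazy transformations and actions that trigger computation can be shown with a toy class. This is not the Spark API: `ToyRDD` is a hypothetical single-machine stand-in that only records a lineage of functions and replays it when an action is called, which is also what makes the result re-computable.

```python
class ToyRDD:
    """Immutable toy dataset: transformations return a new ToyRDD with a longer lineage."""

    def __init__(self, source, lineage=()):
        self._source = source      # re-iterable input, e.g. a list
        self._lineage = lineage    # tuple of ("map" | "filter", function)

    # Transformations: lazy, nothing is computed here.
    def map(self, f):
        return ToyRDD(self._source, self._lineage + (("map", f),))

    def filter(self, f):
        return ToyRDD(self._source, self._lineage + (("filter", f),))

    def _compute(self):
        """Replay the lineage over the source, one element at a time."""
        for item in self._source:
            keep = True
            for kind, f in self._lineage:
                if kind == "map":
                    item = f(item)
                elif not f(item):
                    keep = False
                    break
            if keep:
                yield item

    # Actions: these start the computation.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []


def traced_double(x):
    calls.append(x)            # records when the function really runs
    return 2 * x


numbers = ToyRDD(list(range(10)))
evens_doubled = numbers.filter(lambda x: x % 2 == 0).map(traced_double)
calls_before_action = len(calls)     # still 0: only the lineage was built
result = evens_doubled.collect()     # the action runs filter + map
total = evens_doubled.count()        # recomputed from the lineage
print(calls_before_action, result, total)
```

Keeping the lineage instead of the intermediate data is the design choice that lets a lost partition be rebuilt by re-running the same functions on the source.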

Page 71: Introduction to Streaming Analytics

[Figure: an RDD is created from the data of an input source (file, database, stream, collection); an action such as .count() returns a value, e.g. 100]

Page 72: Introduction to Streaming Analytics

Partitions RDD

[Figure: the data of an RDD is split into Partition 0 to Partition 9, distributed across Server 1 to Server 5]


Page 74: Introduction to Streaming Analytics

Partitions RDD

[Figure: Partition 0 to Partition 9 spread over Server 2 to Server 5 only; Server 1 is no longer shown]

Page 75: Introduction to Streaming Analytics

Spark Workflow

[Figure: Input HDFS File → sc.hadoopFile() → HadoopRDD → flatMap() → MappedRDD → map() → MappedRDD → reduceByKey() → ShuffledRDD → saveAsTextFile() → Text File Output. Stage 1 covers flatMap() + map() on the MappedRDD partitions (P0, P1, P3); Stage 2 covers reduceByKey() on the ShuffledRDD (P0). Transformations are lazy; the action executes the transformations. Also shown: Master and DAG Scheduler]

Page 76: Introduction to Streaming Analytics

Spark Execution Model

[Figure: a Master and a Server running a Worker with several Executors next to the Data Storage]

Page 77: Introduction to Streaming Analytics

Stage 1 – flatMap() + map()

Spark Execution Model

[Figure: the Master schedules a narrow transformation (filter(), map(), sample(), flatMap()); on each Worker an Executor processes RDD partitions (P0, P1, P3) from its local Data Storage]

Page 78: Introduction to Streaming Analytics

Stage 2 – reduceByKey()

Spark Execution Model

[Figure: the Master schedules a wide transformation (join(), reduceByKey(), union(), groupByKey()); data is shuffled between the Executors on the Workers into the resulting RDD partition P0. Shuffle!]
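The difference between the two stages can be sketched on a single machine. The example below is a simulation, not Spark code: lists stand in for partitions, the function names are hypothetical, and CRC32 is an assumed stand-in for the partitioner. Stage 1 works on each partition in isolation, while the shuffle has to move every key to one reducer before Stage 2 can sum the values.

```python
import zlib
from collections import defaultdict


def narrow_stage(partition):
    """Stage 1 (flatMap + map): needs only the data of its own partition."""
    return [(word, 1) for line in partition for word in line.split()]


def shuffle(mapped_partitions, num_reducers):
    """Redistribute pairs so that all values of a key land in the same bucket."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for part in mapped_partitions:
        for key, value in part:
            target = zlib.crc32(key.encode("utf-8")) % num_reducers
            buckets[target][key].append(value)
    return buckets


def reduce_stage(bucket):
    """Stage 2 (reduceByKey): sum the values collected for each key."""
    return {key: sum(values) for key, values in bucket.items()}


partitions = [["truck a truck"], ["b truck", "a a"]]   # two input partitions
mapped = [narrow_stage(p) for p in partitions]         # no data movement
reduced = [reduce_stage(b) for b in shuffle(mapped, num_reducers=2)]
counts = {key: value for part in reduced for key, value in part.items()}
print(counts)
```

The shuffle is the expensive part: in a cluster it means network transfer between executors, which is why a stage boundary is placed in front of it.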

Page 79: Introduction to Streaming Analytics

Batch vs. Real-Time Processing

Petabytes of Data

Gigabytes Per Second

Page 80: Introduction to Streaming Analytics

Discretized Stream (DStream)

[Figure: several Trucks send events to Kafka]


Page 83: Introduction to Streaming Analytics

Discretized Stream (DStream)

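[Figure: several Trucks send individual events to Kafka; the stream is cut into pieces that are discrete by time. Labels: “Discrete by time”, “Individual Event”, “DStream = RDD”]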

Page 84: Introduction to Streaming Analytics

Discretized Stream (DStream)

[Figure: every X seconds a DStream is transformed into a new DStream; example operations: .countByValue(), .reduceByKey(), .join, .map]

Page 85: Introduction to Streaming Analytics

Discretized Stream (DStream)

[Figure: time increases from time 1 to time n. At each time step the Input Stream delivers messages that form one RDD of the Event DStream (message 1 … message n). map() turns it into the RDD of the Mapped DStream (f(message 1) … f(message n)), and saveAsHadoopFiles() writes result 1 … result n. The chain of DStreams is the transformation lineage; actions trigger Spark jobs]

Adapted from Chris Fregly: http://slidesha.re/11PP7FV
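The discretization shown in the figure can be sketched in plain Python. This is an illustration of the micro-batching idea only, not Spark Streaming code: the functions `discretize` and `map_batches` are hypothetical, events carry an assumed timestamp in seconds, and intervals without events are simply skipped.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp_seconds, payload) events into fixed-length time batches."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        batches[int(timestamp // batch_interval)].append(payload)
    return [batches[index] for index in sorted(batches)]


def map_batches(batches, f):
    """Apply f to every message of every batch, like map() on a DStream."""
    return [[f(message) for message in batch] for batch in batches]


events = [(0.2, "normal"), (0.9, "speeding"), (1.1, "normal"),
          (2.5, "speeding"), (2.7, "speeding")]

batches = discretize(events, batch_interval=1.0)   # one batch per second
mapped = map_batches(batches, str.upper)
speeding_per_batch = [sum(1 for m in batch if m == "SPEEDING") for batch in mapped]
print(batches)
print(speeding_per_batch)
```

The batch interval is the knob that trades latency against throughput: a result can never arrive faster than one interval after its event.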

Page 86: Introduction to Streaming Analytics

Demo Use Case – Truck Sensors


Page 87: Introduction to Streaming Analytics

Implementing Streaming Analytics: Apache Storm

Page 88: Introduction to Streaming Analytics

Apache Storm

A platform for doing analysis on streams of data as they come in, so you can react to data as it happens.
• Highly distributed real-time computation system
• Provides general primitives to do real-time computation
• To simplify working with queues & workers
• Scalable and fault-tolerant

Originated at Backtype, acquired by Twitter in 2011
Open sourced late 2011
Part of Apache since September 2013

Page 89: Introduction to Streaming Analytics

Apache Storm – Core concepts

Tuple
• Immutable set of key/value pairs

Stream
• An unbounded sequence of tuples that can be processed in parallel by Storm

Topology
• Wires data and functions via a DAG (directed acyclic graph)
• Executes on many machines, similar to a MR job in Hadoop

Spout
• Source of data streams (tuples)
• Can be run in “reliable” and “unreliable” mode

Bolt
• Consumes 1+ streams and produces new streams
• Complex operations often require multiple steps and thus multiple bolts

[Figure: topology with two Spouts and four Bolts exchanging tuples (T). One Spout is labelled “Source of Stream B”; the Bolts are labelled “Subscribes: A, Emits: C”, “Subscribes: A, Emits: D”, “Subscribes: A & B, Emits: -” and “Subscribes: C & D, Emits: -”]

Page 90: Introduction to Streaming Analytics

Demo Use Case – Truck Sensors


Page 91: Introduction to Streaming Analytics

Apache Storm – How does it work ?

[Figure: the Trucks Movement spout sends Truck Movement tuples via Shuffle Grouping to three GeoHashing bolt instances]

Incoming tuple:
{"timestamp": "2016-06-02 12:56:02.362", "truckId": 35, "driverId": 26, "driverName": "Michael Aube", "routeId": 1090292248, "eventType": "Normal", "latitude": 40.86, "longitude": "-89.91"}

The GeoHashing bolt adds a field such as "geohash": "dp206n3d".

Page 92: Introduction to Streaming Analytics

Apache Storm – How does it work ?

[Figure: Trucks Movement spout → Shuffle Grouping → three GeoHashing bolt instances → Fields Grouping → two GeoFencer bolt instances]

Tuple entering GeoHashing:
{"timestamp": "2016-06-02 12:56:02.362", "truckId": 35, "driverId": 26, "driverName": "Michael Aube", "routeId": 1090292248, "eventType": "Normal", "latitude": 40.86, "longitude": "-89.91"}

Tuple leaving GeoHashing:
{"geohash": "dp206n3d", "timestamp": "2016-06-02 12:56:02.362", "truckId": 35, "driverId": 26, "driverName": "Michael Aube", "routeId": 1090292248, "eventType": "Normal", "latitude": 40.86, "longitude": "-89.91"}

Another tuple carries a different hash: {"geohash": "f00hfh99", ..

Page 93: Introduction to Streaming Analytics

Apache Storm – How does it work ?

[Figure: Trucks Movement spout → Shuffle Grouping → three GeoHashing bolt instances → Fields Grouping → two GeoFencer bolt instances → Global Grouping → Alerter bolt]

Tuple entering GeoHashing:
{"timestamp": "2016-06-02 12:56:02.362", "truckId": 35, "driverId": 26, "driverName": "Michael Aube", "routeId": 1090292248, "eventType": "Normal", "latitude": 40.86, "longitude": "-89.91"}

Tuple leaving GeoHashing:
{"geohash": "dp206n3d", "timestamp": "2016-06-02 12:56:02.362", "truckId": 35, "driverId": 26, "driverName": "Michael Aube", "routeId": 1090292248, "eventType": "Normal", "latitude": 40.86, "longitude": "-89.91"}

Tuple sent to the Alerter:
{"timestamp": "2016-06-02 12:56:02.362", "truckId": 35, "driverId": 26, "latitude": 40.86, "longitude": "-89.91"}

Another tuple carries a different hash: {"geohash": "f00hfh99", ..

Page 94: Introduction to Streaming Analytics

Apache Storm – Core concepts

Each Spout or Bolt runs N instances in parallel

[Figure: Trucks Movement → (Shuffle) → GeoHashing 1st … nth → (Fields) → GeoFencing 1st … nth → (Global) → Report]

• Shuffle grouping is random grouping
• Fields grouping is grouped by value, such that equal value results in equal task
• All grouping replicates to all tasks
• Global grouping makes all tuples go to one task
• None grouping makes bolt run in the same thread as the bolt/spout it subscribes to
• Direct grouping: the producer (task that emits) controls which consumer will receive
• Local or Shuffle grouping is similar to the shuffle grouping but will shuffle tuples among bolt tasks running in the same worker process, if any. Falls back to shuffle grouping behavior.
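Four of the groupings listed above can be expressed as small routing functions that map a tuple to the task indexes that should receive it. This is a sketch of the routing idea only, not the Storm API: the function names are hypothetical, tuples are modelled as dicts, and CRC32 is an assumed stand-in for the hash used by a fields grouping.

```python
import random
import zlib


def shuffle_grouping(tup, num_tasks, rng):
    """Random task: spreads load evenly, no guarantee about equal values."""
    return [rng.randrange(num_tasks)]


def fields_grouping(tup, num_tasks, field):
    """Equal field value always results in the same task."""
    digest = zlib.crc32(str(tup[field]).encode("utf-8"))
    return [digest % num_tasks]


def all_grouping(tup, num_tasks):
    """Replicate the tuple to every task."""
    return list(range(num_tasks))


def global_grouping(tup, num_tasks):
    """Send every tuple to one single task (here: the lowest index)."""
    return [0]


rng = random.Random(42)
tuples = [
    {"truckId": 35, "geohash": "dp206n3d"},
    {"truckId": 35, "geohash": "f00hfh99"},
    {"truckId": 99, "geohash": "dp206n3d"},
]

by_geohash = [fields_grouping(t, 4, "geohash") for t in tuples]
shuffled = [shuffle_grouping(t, 4, rng) for t in tuples]
print(by_geohash, shuffled)
```

In the truck demo, a fields grouping on the geohash would send all movements within the same cell to the same GeoFencer task, so that task could keep the state of that cell locally.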

Page 95: Introduction to Streaming Analytics

Scalability & Reliability

Page 96: Introduction to Streaming Analytics

How to scale a Streaming Analytics System?

[Figure: scaling with threads. Event Stream → Collecting Thread 1 … n → Queue (Persist) → Processing Thread 1 … n → result]

Page 97: Introduction to Streaming Analytics

How to scale a Streaming Analytics System?

[Figure: scaling with processes. Event Stream → several Collecting Processes → Queue 1 … Queue n (Persist) → several Processing Processes → result]

Page 98: Introduction to Streaming Analytics

How to scale a Streaming Analytics System?

[Figure: scaling with processes and threads. Event Stream → Collecting Process 1 and 2 → queues Q1 … Qn → Processing A and Processing B, each running as Process 1 and 2 with Threads 1 … n]

Page 99: Introduction to Streaming Analytics

How to make Streaminig Analytics System reliable?

Faults and stragglers inevitable in large clusters running big data applicationsStreaming applications must recover from them quickly

[Diagram: two copies of the same pipeline, each with collecting processes on the event stream and the processing stages A and B running as threads in processes, connected by queues.]

Page 100: Introduction to Streaming Analytics

How to deal with “Stragglers”

Consumer goes slow. Options:

• Backpressure: other jobs grind to a halt :(
• Queue up: run out of memory :( or spill to disk
• Drop data: no thanks :(
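The options for a slow consumer can be made concrete with a bounded buffer between producer and consumer. The class below is a hypothetical sketch, not taken from any of the platforms discussed. It models backpressure and dropping; unbounded queueing would correspond to removing the capacity check.

```python
from collections import deque


class BoundedBuffer:
    """A fixed-capacity buffer between a producer and a slow consumer."""

    def __init__(self, capacity, policy):
        if policy not in ("backpressure", "drop"):
            raise ValueError("unknown policy: " + policy)
        self.capacity = capacity
        self.policy = policy
        self.items = deque()
        self.dropped = 0

    def offer(self, event):
        """Try to enqueue an event and report what happened."""
        if len(self.items) < self.capacity:
            self.items.append(event)
            return "accepted"
        if self.policy == "drop":
            self.dropped += 1   # the event is lost
            return "dropped"
        return "blocked"        # the producer has to wait and retry

    def poll(self):
        """Consumer side: take the oldest event, or None if empty."""
        return self.items.popleft() if self.items else None


drop_buffer = BoundedBuffer(capacity=2, policy="drop")
drop_outcomes = [drop_buffer.offer(n) for n in range(5)]

bp_buffer = BoundedBuffer(capacity=2, policy="backpressure")
bp_outcomes = [bp_buffer.offer(n) for n in range(3)]
bp_buffer.poll()                    # the consumer catches up by one event
retry_outcome = bp_buffer.offer(2)  # the blocked event is now accepted
```

With backpressure nothing is lost, but the producer (and everything upstream of it) is slowed down to the speed of the consumer.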

Page 101: Introduction to Streaming Analytics

How to make a Streaming Analytics System reliable?

Solution 1: using an active/passive system (hot replication)
• Both systems process the full load
• In case of a failure, automatically switch and use the “passive” system
• Stragglers slow down both active and passive system

[Diagram: an active and a passive copy of the same pipeline (collecting processes, processing stages A and B, queues), both fed from the event stream, each holding its own state. Legend: State = state in-memory and/or on-disk.]

Page 102: Introduction to Streaming Analytics

How to make a Streaming Analytics System reliable?

Solution 2: Upstream backup
• Nodes buffer sent messages and replay them to a new node in case of failure
• Stragglers are treated as failures

[Diagram: a single pipeline (collecting processes, processing stages A and B, queues) fed from the event stream, with state and replay buffers attached to the nodes. Legend: State = state in-memory and/or on-disk; buffer = buffer for replay, in-memory and/or on-disk.]
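Upstream backup can be sketched in a few lines. The classes and the acknowledgement step below are assumptions for illustration: the slide only says that nodes buffer sent messages and replay them to a new node after a failure.

```python
class DownstreamNode:
    """A processing node that records what it has received."""

    def __init__(self):
        self.received = []

    def receive(self, msg_id, message):
        self.received.append((msg_id, message))


class UpstreamNode:
    """A sending node that keeps every message until it is acknowledged."""

    def __init__(self):
        self.buffer = {}   # message id -> message, kept for replay
        self.next_id = 0

    def send(self, downstream, message):
        msg_id = self.next_id
        self.next_id += 1
        self.buffer[msg_id] = message
        downstream.receive(msg_id, message)
        return msg_id

    def ack(self, msg_id):
        """The downstream node has fully processed the message."""
        self.buffer.pop(msg_id, None)

    def replay(self, new_downstream):
        """After a failure, resend all unacknowledged messages in order."""
        for msg_id in sorted(self.buffer):
            new_downstream.receive(msg_id, self.buffer[msg_id])


upstream = UpstreamNode()
node1 = DownstreamNode()
for event in ["a", "b", "c"]:
    upstream.send(node1, event)
upstream.ack(0)          # only "a" was completely processed by node1

node2 = DownstreamNode() # node1 fails (or straggles) and is replaced
upstream.replay(node2)
```

After the replay, the new node holds exactly the messages that the failed node had not acknowledged. If the failed node had already processed some of them without acknowledging, the replay delivers them a second time, which leads to the delivery semantics discussed next.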

Page 103: Introduction to Streaming Analytics

Message Delivery Semantics

At most once [0,1]
• Messages may be lost
• Messages are never redelivered

At least once [1 .. n]
• Messages will never be lost
• but messages may be redelivered (might be ok if consumer can handle it)

Exactly once [1]
• Messages are never lost
• Messages are never redelivered
• Perfect message delivery
• Incurs higher latency for transactional semantics
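What "the consumer can handle it" means for at-least-once delivery can be shown with an idempotent consumer. The sketch below assumes that every message carries a unique id; the class name and the sample data are made up for illustration.

```python
class IdempotentConsumer:
    """Applies each message id only once, even if it is delivered again."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def handle(self, msg_id, amount):
        """Return True if the message was applied, False for a duplicate."""
        if msg_id in self.seen:
            return False
        self.seen.add(msg_id)
        self.total += amount
        return True


# At-least-once delivery: messages 1 and 2 arrive twice.
deliveries = [(1, 10), (2, 5), (2, 5), (3, 1), (1, 10)]

consumer = IdempotentConsumer()
applied = [consumer.handle(msg_id, amount) for msg_id, amount in deliveries]

# A naive consumer that adds up every delivery counts the duplicates too.
naive_total = sum(amount for _, amount in deliveries)
```

The idempotent consumer ends up with the same result as exactly-once delivery would give, at the price of remembering which ids it has already seen.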

Page 104: Introduction to Streaming Analytics

Streaming Analytics in Architecture

Page 105: Introduction to Streaming Analytics

“Traditional Architecture” for Big Data

[Diagram: data flows from the Data Sources (Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine) through Data Collection (Channel) into (Analytical) Data Processing (Stage, Raw Data (Reservoir), Batch compute, Computed Information), on to the Result Store with its Query Engine, and finally to the Data Consumers (Reports, Service, Analytic Tools, Alerting Tools). Legend: Data in Motion, Data at Rest.]

Page 106: Introduction to Streaming Analytics

Streaming Analytics Architecture for Big Data, a.k.a. (Complex) Event Processing

[Diagram: data flows from the Data Sources (Social, Logfiles, Sensor, RDBMS, ERP, Mobile, Machine) through Data Collection (Channel) and Messaging into (Analytical) Real-Time Data Processing (Stream/Event Processing), on to the Result Store, and finally to the Data Consumers (Reports, Service, Analytic Tools, Alerting Tools). Legend: Data in Motion, Data at Rest.]

Page 107: Introduction to Streaming Analytics

Keep raw event data

[Diagram: the streaming analytics architecture as on the previous slide (Data Sources, Data Collection with Channel, Messaging, (Analytical) Real-Time Data Processing with Stream/Event Processing, Result Store, Data Consumers), extended by (Analytical) Batch Data Processing, which keeps the raw events in Raw Data (Reservoir). Legend: Data in Motion, Data at Rest.]

Page 108: Introduction to Streaming Analytics

“Lambda Architecture” for Big Data

[Diagram: the Data Sources (Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine) feed Data Collection (Channel), which serves two paths. (Analytical) Batch Data Processing stores Raw Data (Reservoir) and uses Batch compute to produce Computed Information. (Analytical) Real-Time Data Processing uses Messaging and Stream/Event Processing. Both paths write to the Result Store, which the Data Consumers (Reports, Service, Analytic Tools, Alerting Tools) access through the Query Engine. Legend: Data in Motion, Data at Rest.]
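How a query combines the two paths of a Lambda Architecture can be sketched as follows. The event layout (key, timestamp), the batch cut-off and the counting logic are assumptions chosen to keep the example small.

```python
from collections import Counter


def count_by_key(events):
    """Count the events per key; used by both the batch and real-time path."""
    return Counter(key for key, _timestamp in events)


# Raw events as (key, timestamp); in practice these sit in the reservoir.
raw_events = [
    ("truck-1", 1), ("truck-2", 2), ("truck-1", 3),
    ("truck-1", 4), ("truck-2", 5),
]
BATCH_CUTOFF = 3  # assumed: the last batch run covered timestamps <= 3

# Batch path: recomputed from all raw data up to the cut-off.
batch_view = count_by_key(e for e in raw_events if e[1] <= BATCH_CUTOFF)
# Real-time path: covers only what arrived after the cut-off.
realtime_view = count_by_key(e for e in raw_events if e[1] > BATCH_CUTOFF)


def query(key):
    """Serving step: merge the batch result with the real-time result."""
    return batch_view[key] + realtime_view[key]
```

The batch view is complete but lags behind; the real-time view fills the gap until the next batch run replaces it.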

Page 109: Introduction to Streaming Analytics

“Kappa Architecture” for Big Data

[Diagram: the Data Sources (Social, Logfiles, Sensor, RDBMS, ERP, Mobile, Machine) feed Data Collection and Messaging, where the messaging layer acts as the “Raw Data Reservoir”. (Analytical) Real-Time Data Processing (Stream/Event Processing) reads from Messaging and writes Computed Information to the Result Store for the Data Consumers (Reports, Service, Analytic Tools, Alerting Tools). Legend: Data in Motion, Data at Rest.]
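In a Kappa Architecture the retained event log takes the role of the raw data reservoir, so a changed computation is applied by replaying the log through a new version of the stream job. The sketch below uses a plain list as the log and two made-up job versions; all names and data are assumptions.

```python
def job_v1(events):
    """Version 1 of the stream job: count the events per truck."""
    counts = {}
    for truck, _speed in events:
        counts[truck] = counts.get(truck, 0) + 1
    return counts


def job_v2(events):
    """Version 2 of the stream job: track the maximum speed per truck."""
    max_speed = {}
    for truck, speed in events:
        max_speed[truck] = max(speed, max_speed.get(truck, speed))
    return max_speed


# The retained log of raw events as (truck, speed).
log = [("truck-1", 80), ("truck-2", 95), ("truck-1", 102)]

view_v1 = job_v1(log)  # result store filled by the original job
view_v2 = job_v2(log)  # new result store, built by replaying the same log
```

Once the replay has caught up with the end of the log, consumers can be switched from the old result store to the new one.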

Page 110: Introduction to Streaming Analytics

“Unified Architecture” for Big Data

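[Diagram: as in the Lambda Architecture, the Data Sources (Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine) feed Data Collection (Channel), which serves a batch path and a real-time path. (Analytical) Batch Data Processing stores Raw Data (Reservoir) and uses Batch compute to calculate models of the incoming data. The resulting Prediction Models are used by (Analytical) Real-Time Data Processing (Messaging, Stream/Event Processing). Both paths write Computed Information to the Result Store, which the Data Consumers (Reports, Service, Analytic Tools, Alerting Tools) access through the Query Engine. Legend: Data in Motion, Data at Rest.]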

Page 111: Introduction to Streaming Analytics

Summary

Page 112: Introduction to Streaming Analytics

Summary

More and more use cases (such as IoT) make Streaming Analytics necessary

Treat events as events! Infrastructures for handling lots of events are available!

Platforms such as Oracle Stream Analytics enable the business to work directly on streaming data (empower the business analyst) => User Experience of an Excel Sheet on streaming data

Platforms such as Apache Storm and Apache Spark Streaming provide a highly scalable and fault-tolerant infrastructure for streaming analytics => Oracle Stream Analytics can use Spark Streaming as the runtime infrastructure

Platforms such as Kafka provide a high volume event broker infrastructure, a.k.a. Event Hub

Page 113: Introduction to Streaming Analytics

Comparison

                          | Oracle Stream Analytics                      | Spark Streaming                        | Storm
Community                 | n.a.                                         | > 280 contributors                     | > 100 contributors
Language Options          | Java, CQL                                    | Java, Scala, Python                    | Java, Clojure, Scala, …
Processing Models         | Event-Streaming                              | Micro-Batching                         | Event-Streaming
Processing DSL            | Yes                                          | Yes                                    | No
Stateful Ops              | Yes                                          | Yes                                    | No
Pattern detection         | Yes                                          | No                                     | No
Scalability & Reliability | limited                                      | yes                                    | yes
Distributed RPC           | No                                           | No                                     | Yes
Delivery Guarantees       | At Least Once                                | Exactly Once                           | At most once / At least once
Latency                   | sub-second                                   | seconds                                | sub-second
“self-service” for Biz    | Yes                                          | No                                     | No
Platform                  | OEP server, Spark Streaming (YARN, Mesos)    | YARN, Mesos, Standalone, DataStax EE   | Storm Cluster, YARN

Page 114: Introduction to Streaming Analytics
Page 115: Introduction to Streaming Analytics

Guido Schmutz
Technology Manager

[email protected]