Building Big Data Operational Intelligence Platform with Apache Spark



Page 1: Building Big Data Operational Intelligence platform with Apache Spark

Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved.

Eric Carr (VP Core Systems Group), Spark Summit 2014

BUILDING BIG DATA OPERATIONAL INTELLIGENCE PLATFORM WITH APACHE SPARK


Page 2: Building Big Data Operational Intelligence platform with Apache Spark


Market & Technology Imperatives

Communication Service Providers & Big Data Analytics

Page 3: Building Big Data Operational Intelligence platform with Apache Spark


Industry Context for Communication Service Providers (CSPs)

Big Data is at the heart of two core strategies for CSPs:
• Improve current revenue sources through greater operational efficiencies
• Create new revenue sources with the 4th wave

Source: Chetan Consulting

Page 4: Building Big Data Operational Intelligence platform with Apache Spark


CSPs - Industry Value Chain Shift

Source: asymco

Page 5: Building Big Data Operational Intelligence platform with Apache Spark


CSPs - A High Bar for Operational Intelligence

CSPs require solutions engineered to meet very stringent requirements:
• Distributed Network: dozens of locations for capturing data, scattered around a vast territory
• High Availability: no data loss; no downtime
• Timely Insights: automated reactions triggered in seconds
• Data Diversity: hundreds of sources from different equipment types and vendors
• Exponential Data Growth: petabytes of data per day; billions of records per day

Page 6: Building Big Data Operational Intelligence platform with Apache Spark


Different Platforms target Different Questions

[Figure: quadrant chart. Horizontal axis: Type of Data (Data Lake, Data Streams). Vertical axis: Type of Analysis (Discovery, Decisioning). Platforms shown: Data Warehouses + Business Intelligence; Search Centric; Stream Analytics + Operational Intelligence.]

Page 7: Building Big Data Operational Intelligence platform with Apache Spark


Streaming Analytics & Machine Learning to Action

Page 8: Building Big Data Operational Intelligence platform with Apache Spark


Driving Streaming Analytics to Action

Stages: Network Flow Analytics, Usage Awareness, Operational Interactions, Real-Time Actions

• NetFlow, Routing Planning
• Layer 7 Visibility
• Content & CDN Analytics
• Policy Profile Triggers
• Care & Experience Mgmt
• New Service Creation & Monetization
• Small Cell / RAN / Backhaul Differentiation
• SON / SDN / Virtualization
• Content Optimization

Operational Intelligence

Page 9: Building Big Data Operational Intelligence platform with Apache Spark


Reflex 1.0 Pipeline – Timely Cube Reporting
Components: Collector, Hadoop Compute, Analytics Store, RAM Cache, UI

Reflex 1.5 Pipeline – Spark / YARN
Components: Collector, Spark / YARN, Analytics Store, RAM Cache, UI

Reflex 2.0 Pipeline – Spark Streaming core
Components: Collector, Spark / YARN / HDFS2, UI
Supporting blocks: ML, Store, SQL, Streams, Cache, Msg Queue

Page 10: Building Big Data Operational Intelligence platform with Apache Spark


Stream Engine - Operational Intelligence Analytics

Inputs: Data Streams, Contextual Data Stores

Stream Engine
• Feature Engine: data fusion, metrics creation, variables selection, causality inference
• Anomaly Engine: multivariate analysis, outlier detection
• RCA Engine: statistical learning, clustering, pattern identification, item set mining

Output: Targeted Actions

Optimized algorithmic support for common stream data processing & fusions
• Detect unusual events occurring in stream(s) of data. Once detected, isolate root cause
• Anomaly / outlier detection, commonalities / root cause forensics, prediction / forecasting, actions / alerts / notifications
• Record enrichment capabilities, e.g. URL categorization, device id, etc.
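As a rough illustration of the "detect unusual events in a stream" step, the sketch below flags outliers with a trailing-window z-score. This is a generic, swapped-in technique chosen for brevity, not the method used by the Anomaly Engine; the function name, window size, and threshold are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev

def rolling_zscore_outliers(values, window=5, threshold=3.0):
    """Return indices whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` values."""
    history = deque(maxlen=window)   # trailing window of recent values
    outliers = []
    for i, v in enumerate(values):
        if len(history) == window:
            mu, sigma = mean(history), pstdev(history)
            # Skip flat windows (sigma == 0) to avoid dividing by zero.
            if sigma > 0 and abs(v - mu) / sigma > threshold:
                outliers.append(i)
        history.append(v)
    return outliers

# Hypothetical metric stream with one obvious spike at index 6.
metric = [10, 11, 9, 10, 10, 11, 50, 10, 9, 11]
outliers = rolling_zscore_outliers(metric)
```

Once an index is flagged, a root-cause step would look at which dimensions of the underlying records are over-represented in that interval; that part is not sketched here.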


Page 11: Building Big Data Operational Intelligence platform with Apache Spark


Example - State of the art causality analysis

Techniques by relationship type and time dependency:

Random variables (time independent)
• Linear relationships: Correlation
• Linear + non-linear relationships: Maximal Information Coefficient

Stochastic processes (time dependent)
• Linear relationships: Granger Causality
• Linear + non-linear relationships: Transfer Entropy

Ranking of metrics: 1) Transfer Entropy 2) Maximal Information Coefficient, Granger Causality 3) Correlation

Page 12: Building Big Data Operational Intelligence platform with Apache Spark


Example - Causality Techniques

Transfer entropy from a process X to another process Y is the amount of uncertainty reduced in future values of Y by knowing the past values of X, given the past values of Y.

PROS
• Model free, information theory based approach
• Most generic estimation of causality between two random processes

CONS
• Challenging joint probability estimation
• Large amount of data needed for calculation
• Choice of time lags

Maximal Information Coefficient is a measure of the strength of the relationship between two random variables, based on mutual information. It is estimated empirically by maximizing the mutual information over a set of grids.

PROS
• Model free, information theory based approach
• Can find linear and non-linear relationships
• Estimation possible with smaller dataset

CONS
• No time information
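To make the transfer entropy definition on this slide concrete, here is a minimal plug-in (frequency count) estimator for discrete-valued series with a history length of one step. The function name, the single-step lag, and the synthetic data are assumptions made for illustration; production estimators deal with continuous values, longer lags, and estimation bias, which is where the listed cons come from.

```python
import random
from collections import Counter
from math import log2

def transfer_entropy(x, y):
    """Plug-in estimate (in bits) of transfer entropy from x to y,
    using one past value of each series (an assumed lag of 1)."""
    # Each triple is (next value of y, previous y, previous x).
    triples = list(zip(y[1:], y[:-1], x[:-1]))
    n = len(triples)
    c_full = Counter(triples)
    c_prev = Counter((yp, xp) for _, yp, xp in triples)
    c_y_pair = Counter((yn, yp) for yn, yp, _ in triples)
    c_y_prev = Counter(yp for _, yp, _ in triples)
    te = 0.0
    for (yn, yp, xp), count in c_full.items():
        p_joint = count / n
        p_given_both = count / c_prev[(yp, xp)]         # p(y_next | y_prev, x_prev)
        p_given_y = c_y_pair[(yn, yp)] / c_y_prev[yp]   # p(y_next | y_prev)
        te += p_joint * log2(p_given_both / p_given_y)
    return te

# Synthetic example: y simply repeats x with a one-step delay.
rng = random.Random(0)
x = [rng.randint(0, 1) for _ in range(2000)]
y = [0] + x[:-1]

te_x_to_y = transfer_entropy(x, y)   # close to 1 bit: x fully predicts y
te_y_to_x = transfer_entropy(y, x)   # close to 0: y tells nothing new about x
```

The asymmetry between the two directions is what makes the measure usable as a causality indicator, unlike correlation or the Maximal Information Coefficient, which carry no time information.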

Page 13: Building Big Data Operational Intelligence platform with Apache Spark


Network Operations / Care Example: Identifying Commonalities, Anomalies, RCA

Anomaly Detection

Root-Cause Analysis

Event Drivers

Event Chaining

Page 14: Building Big Data Operational Intelligence platform with Apache Spark


BinStream Details

Page 15: Building Big Data Operational Intelligence platform with Apache Spark


Use Case

• Use IP traffic records to calculate bandwidth usage as a time series (continuously …). This cannot be done based on the time at which Spark receives the records.
– In general, for any record that has a timestamp, it is important to analyze based on the time of the event rather than the time the event record was received.
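A minimal sketch of the idea in plain Python (no Spark dependency): each record is assigned to a time slot by its own timestamp, so the order of arrival does not matter. The record layout `(event_time, bytes)`, the 300-second slot size, and the function names are assumptions for illustration.

```python
from collections import defaultdict

BIN_SECONDS = 300  # assumed slot size (5 minutes)

def bin_start(event_time, bin_seconds=BIN_SECONDS):
    """Start of the time slot that contains event_time (epoch seconds)."""
    return event_time - event_time % bin_seconds

def bandwidth_by_event_time(records, bin_seconds=BIN_SECONDS):
    """Sum bytes per slot using each record's own timestamp,
    then convert to average bits per second for the slot."""
    totals = defaultdict(int)
    for event_time, nbytes in records:
        totals[bin_start(event_time, bin_seconds)] += nbytes
    return {slot: total * 8 / bin_seconds for slot, total in sorted(totals.items())}

# Hypothetical records, listed in the order they were received.
# The second and third records arrive after a record from a later slot.
records = [(310, 3000), (20, 1500), (290, 4500), (605, 750)]
series = bandwidth_by_event_time(records)
```

Binning by reception time instead would have credited the late records to whichever slot was current when they arrived, distorting the series.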


Page 16: Building Big Data Operational Intelligence platform with Apache Spark


Challenges With

• For a single dataset: make the timestamp part of the key.

• For continuously streamed datasets:

– You do not know whether you have received all the data for a particular time slot. This is caused by event delay or event duration.

– An event could span multiple time slots. This is caused by event duration.

Page 17: Building Big Data Operational Intelligence platform with Apache Spark


Data Processing – Timing (Map-Reduce world)

[Figure: timing diagram for the Map-Reduce pipeline.]

Stages: Collector/Adapter, Collector/Writer, MR Jobs/Cube Generator, MR Job/Cube Exporter, STORE (Columnar Storage)

Data: Events, BINS, HDFS Data Files, HDFS Cubes

Timing annotations:
• Bin interval: 5 mins
• Write on bin interval or size
• Delayed 2x bin interval (10 mins) to allow last bin to be closed
• Job duration: 45-55 mins
• Timeline: kth hour, (k+1)th hour, (k+2)th hour
• Available for visualization
• Bin labels: Past, Current, Future, Closed

Page 18: Building Big Data Operational Intelligence platform with Apache Spark


Proposed Binning & Proration Solution


Page 19: Building Big Data Operational Intelligence platform with Apache Spark


Solution (contd.)

• The typical solutions are:
– For event delay: wait (buffer).
– For event duration: prorate events across time slots.

• Introduce the concept of a BinStream: an abstraction over the DStream, which needs a
– function to extract the time fields from the records.
– function to prorate the records.

Note that this can be trivially achieved by using the ‘window’ functionality, with the batch interval equal to the time series interval and the window size equal to the maximum possible delay.
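The two user-supplied functions and the wait-and-buffer behaviour can be sketched in plain Python as follows. This is a toy stand-in, not the actual BinStream code: the class name, method names, record layout, and the `max_delay_bins` parameter are all hypothetical, and the proration helper assumes every event has `end > start`.

```python
from collections import defaultdict

class BinStreamSketch:
    """Toy model of the BinStream idea: bins stay open for late data
    and are emitted once the assumed maximum delay has passed."""

    def __init__(self, extract_time, prorate, bin_seconds, max_delay_bins):
        self.extract_time = extract_time      # record -> (start, end)
        self.prorate = prorate                # record, start, end, bin_seconds -> {slot: value}
        self.bin_seconds = bin_seconds
        self.max_delay_bins = max_delay_bins  # number of slots to wait for late records
        self.open_bins = defaultdict(float)   # buffered (not yet final) slots

    def process_batch(self, batch_end, records):
        """Fold one micro-batch into the open bins and return the bins
        that are now considered closed."""
        for record in records:
            start, end = self.extract_time(record)
            shares = self.prorate(record, start, end, self.bin_seconds)
            for slot, value in shares.items():
                self.open_bins[slot] += value
        cutoff = batch_end - self.max_delay_bins * self.bin_seconds
        closed = {s: v for s, v in self.open_bins.items()
                  if s + self.bin_seconds <= cutoff}
        for slot in closed:
            del self.open_bins[slot]
        return dict(sorted(closed.items()))

def extract_time(record):
    return record["start"], record["end"]

def prorate_bytes(record, start, end, bin_seconds):
    """Share record['bytes'] across slots in proportion to overlap (assumes end > start)."""
    shares, slot = {}, start - start % bin_seconds
    while slot < end:
        overlap = min(end, slot + bin_seconds) - max(start, slot)
        shares[slot] = record["bytes"] * overlap / (end - start)
        slot += bin_seconds
    return shares

stream = BinStreamSketch(extract_time, prorate_bytes, bin_seconds=300, max_delay_bins=1)
closed_1 = stream.process_batch(300, [{"start": 0, "end": 300, "bytes": 600}])
closed_2 = stream.process_batch(600, [{"start": 240, "end": 360, "bytes": 1200},
                                      {"start": 400, "end": 500, "bytes": 100}])
closed_3 = stream.process_batch(900, [])
```

In this run the slot starting at 0 is held back for one extra batch, so the late event that began at 240 still contributes its prorated share before the slot is emitted. A record arriving after its slot has closed is not handled by this sketch.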


Page 20: Building Big Data Operational Intelligence platform with Apache Spark


Problems / Solutions

• window == wait & buffer.
– This has two issues:
A. Memory is needed for buffering.
B. Downstream needs to wait for the result (or any part of it).

• BinStream provides two additional options:
– Get rid of the delay in getting partial results by regularly sending the latest snapshots for the old time slots. This does not solve the memory issue and increases the processing load.

– If the client can handle partial results, i.e. if it can aggregate partial results, it can get updates to the old bins. This reduces the memory needed by the Spark Streaming application.
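The second option can be sketched as below: each micro-batch is aggregated on its own and emits per-slot deltas, and a client that is able to sum them rebuilds the totals, including updates to old bins. The class and function names, the record layout, and the 300-second slot size are assumptions for illustration.

```python
from collections import defaultdict

def batch_deltas(records, bin_seconds=300):
    """Aggregate a single micro-batch by event time.
    Nothing is carried over between batches, so no buffering is needed."""
    deltas = defaultdict(float)
    for event_time, nbytes in records:
        deltas[event_time - event_time % bin_seconds] += nbytes
    return dict(deltas)

class PartialResultClient:
    """Hypothetical downstream consumer that can aggregate partial results."""

    def __init__(self):
        self.totals = defaultdict(float)

    def apply(self, updates):
        # Add each delta to the running total of its slot, old or new.
        for slot, delta in updates.items():
            self.totals[slot] += delta

client = PartialResultClient()
batches = [
    [(10, 100), (200, 50)],
    [(320, 70), (250, 30)],   # (250, 30) is a late record for the slot starting at 0
    [(610, 20), (590, 40)],   # (590, 40) is a late record for the slot starting at 300
]
for batch in batches:
    client.apply(batch_deltas(batch))
totals = dict(client.totals)
```

The trade-off is that the state moves to the client: the streaming side holds only one batch at a time, while the consumer must be able to accept updates to slots it has already seen.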


Page 21: Building Big Data Operational Intelligence platform with Apache Spark


Limitations.

• The number of time series slots for which the updates can be generated is fixed, basically governed by the event delay characteristics.


Page 22: Building Big Data Operational Intelligence platform with Apache Spark


THANK YOU!