The Art and Craft of Big Data Analytics (files.meetup.com/1624468/The Art and Craft of Big Data Analytics v Share.pdf)


BUILT FOR THE SPEED OF BUSINESS


3 © Copyright 2013 Pivotal. All rights reserved.

Talk Abstract

"Big Data" can be used to prevent fraud of fine art. It can be used to detect anomalies in aircraft engines. That is all super cool, but how does that relate to your business? Your opportunities may not feel as world-changing as the genome project. However, innovation is critical to your business, and you don't want to miss out on something that could have significant impact on revenue only because the practical application of this new technology is not entirely obvious.

Consider this interesting bit of trivia… There are five and only five categories of analytics. Therefore, the similarities between art, aircraft engines, and your genetic sequence are closer than you think to customer loyalty, price optimization, and traffic monitoring.

In this meet-up we will define and illustrate the five major sections of analytics and big data. We’ll describe use cases (examples) that demonstrate how this new analytics functionality can be applied to derive insights. We’ll then take a deep dive into an interesting use case that resonates with the audience, and expose what really happens at the developer level, and what is required to make these new “models” work. We’ll cover technologies such as R, SQL, Mahout, etc., and will detail how to operationalize a big-data analytics model.


Agenda

Introduction

Business Use Cases

Deep Dive: Incorporate Analytics in Business Strategy

Simplification Framework

Operationalization of Analytics

Pivotal Perspective

Questions and Answers


What is Analytics?

Analytics is the discovery and communication of meaningful patterns in data
– Data analysis: models to gain valuable knowledge (insights) from data
– Communication: insights to recommend action or to guide decision making

Business wants to know - What is around the corner

 | Descriptive | Predictive | Prescriptive
Questions | What happened? What is happening? | What will happen? Why will it happen? | What should I do? Why should I do it? (What If)
Enablers | Business Reporting; Dashboard and Scorecards; Data Warehousing | Data & Text mining; Web/Media mining; Forecasting | Optimization & Simulation; Decision Modeling; Expert System
Outcomes | Well defined business problems and opportunities | Accurate projections of the future states and conditions | Best possible business decisions and transactions


Is Analytics New?

•Grog uses two sticks and four rocks to graph the upward trend in sales of his new invention, the wheel. 5000 B.C.

•Sumerian analysts predict the world's use of letters will be greater than Mesopotamia's supply of clay tablets by 3000 B.C. Analysts suggest something called "papyrus" may solve the problem. 3200 B.C.

•Roman leader Caesar receives analysts' predictions that March will be a "down month," but disregards the data. 44 B.C.

•Michelangelo uses an advanced abacus to estimate the amount of paint needed to cover the Sistine Chapel. 1508 A.D.

•The Globe Theatre of London text mines peasants' comments after a play by a fellow named Shakespeare and decides to ask him to write more plays like the last one. 1590 A.D.

•Henry Ford conducts what-if analysis that clearly shows that limiting the Model-T to one color, black, is the best way to maximize profits. 1908 A.D.

•The Beatles' manager uses early marketing automation software to reveal that Ringo should not sing lead on "I Want to Hold Your Hand." John and Paul take over on the microphones. 1962 A.D.

http://www.sas.com/news/sascom/2008q4/column_bestblogs.html


What do you believe?

• Data: Garbage In – Garbage Out (data is NO good)
• Technology: Not big/fast enough (technology cannot scale)
• Data Science Skills: Maslow's Hammer (algorithms not well understood)
• Domain Experts: No business understanding



Analytics Use Cases

Online Advertising – Advertisement Targeting – Spend Optimization – Page View Guarantees – Ad selection – Fraud Prediction – Traffic Quality

Telco – Churn Prediction – Expansion/Growth Planning – Bundle Selection – Advertisement Targeting – Network Analysis – Fault Prediction

Manufacturing – Production Planning – Processing Optimization – Early Event Detection – Risk Reduction – Inventory Optimization – Capital Minimization – Price Projections

Corporate Finance – Optimize Cash Flow – Investment Optimization

HR – Workforce Analytics – Talent Selection – Retention and Churn Prediction


Analytics Use Cases

Sports – Moneyball

Social - Flavors – Sentiment Analysis – Customer Segmentation

Oil – Reservoir location estimation – Reservoir size estimation – Demand Prediction

Energy – Demand Prediction – Production Estimation – Fault Prediction

Finance

– Portfolio Planning

– Risk Minimization

– Next Best Offer

– Capital Minimization

– Price Projections

Banking


Complex: My Favorite Analytics

Image Analytics – Seismic Analysis – Painting Analysis – Medical Image Analysis

Text Analytics – Natural Language Processing

▪ Topic Modeling ▪ Search Intent

– Sentiment Prediction

Video Analytics – Product Identification – Face recognition – Surveillance


Image Analytics

http://www.gigapan.com/

2 x 2 mm high resolution pictures of Painting/Artifact in great detail

Cool Visualizations

Analytics to get the paint stroke

Pattern Analysis for Fraud Analysis


Seismic Data

Sonar Readings

Quick visualization

Data Cleanup

Mix with Prior Knowledge

Modelled Visualization

www.norsar.no



Retail Brick & Mortar Deep Dive


Retail Value Chain in a Nutshell

Maximizing (profitable) inventory turns
– Virtuous cycle
  ▪ Stop needing working capital in the Buy step if the turns are fast enough.
– What to buy and where to place to satisfy/generate demand
  ▪ Balance needed between product availability and required markdown.

Scaling by opening more stores
– Reduce average store operating cost.
– Increase brand awareness and generate demand.

Cycle: 1. Buy ($$) → 2. Place ($$) → 3. Sell ($$) → 4. Markdown ($$)


Analytics in Retail : Reduce Costs

Supply Chain Mgmt
• Distribution Route Optimization
• Inventory Optimization

Theft Prevention
• Image Analytics
• Transaction Anomaly Detection

Procurement Optimization
• Vendor Scorecard
• Lead Time Estimation

General and Administrative
• Workforce Analytics & Employee Churn
• IT Security Analytics


Analytics in Retail: Increase Revenue: Customer

Customer Targeting
• Segmentation and Targeting
• Store Clustering

Customer Satisfaction
• Customer Satisfaction
• Customer Care Analytics

Customer Loyalty
• Loyalty Program Analytics
• Customer Lifetime Value

Customer Churn
• Churn Prediction
• Churn Prevention


Analytics in Retail: Increase Revenue: Demand

Increase Ad Spend Lift
• Ad Effectiveness Analytics
• Market Mix Modeling

Increase Reach
• Site Selection Analytics
• Digital Marketing and Social Media. Communication Optimizations

Increase Basket Size
• Affinity Analysis, Next Best Offer
• Cross Sell/Up Sell, Store Experimentation


Marketing Mix Modeling

“Half the money I spend in marketing is wasted; the trouble is I don’t know which half” – John Wanamaker

Estimate the impact of various marketing tactics (marketing mix) on sales
– Base and incremental volume
  ▪ Base (volume that would be generated in absence of any marketing activity) and incremental (volume generated by marketing activities in the short run)
– Media and advertising
  ▪ Effectiveness of 15-second vis-à-vis 30-second executions
  ▪ Comparisons in ad performance when run during prime-time vis-à-vis off-prime-time dayparts
– Trade promotions
– Pricing
– Distribution
  ▪ Incremental volume through 1% more presence in a neighborhood Kirana store is 180% greater than that through 1% more presence in a supermarket
– Launches
– Competition


MMM: Key Insights

Contribution by marketing activity

ROI by marketing activity

Effectiveness of marketing activity

Optimal distribution of spends
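One common way to get at base volume, incremental volume, and contribution by activity is a linear regression of sales on spend per channel. The sketch below is only an illustration under that assumption: the channel names, the synthetic weekly data, and the helper names `fit_marketing_mix` and `contribution_by_channel` are all invented here and do not come from the talk. Real marketing mix models usually add carry-over and saturation effects that this sketch leaves out.

```python
import numpy as np

def fit_marketing_mix(spend, sales):
    """Least-squares fit of sales = base + sum(lift_i * spend_i)."""
    spend = np.asarray(spend, dtype=float)
    # The column of ones captures base volume (sales with no marketing activity).
    design = np.column_stack([np.ones(len(spend)), spend])
    coefficients, *_ = np.linalg.lstsq(design, np.asarray(sales, dtype=float), rcond=None)
    return coefficients[0], coefficients[1:]

def contribution_by_channel(spend, lift):
    """Incremental volume attributed to each channel over the whole period."""
    return np.asarray(spend, dtype=float).sum(axis=0) * lift

# Synthetic weekly data, invented purely for illustration.
rng = np.random.default_rng(0)
weekly_spend = rng.uniform(0, 100, size=(104, 3))  # hypothetical columns: tv, trade_promo, digital
true_base, true_lift = 500.0, np.array([2.0, 0.5, 1.2])
weekly_sales = true_base + weekly_spend @ true_lift + rng.normal(0, 5, size=104)

base, lift = fit_marketing_mix(weekly_spend, weekly_sales)
incremental = contribution_by_channel(weekly_spend, lift)
print("base volume per week:", round(float(base), 1))
print("incremental volume per unit of spend:", np.round(lift, 2))
print("share of total volume by channel:", np.round(incremental / weekly_sales.sum(), 3))
```

In this linear form the fitted coefficient per channel doubles as a simple ROI proxy (volume per unit of spend), which is what makes comparing channels and redistributing spend possible.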



Simplification Framework


Simplified View of Analytics

Group
• Clustering
• Group similar items and find structure or commonality within data, e.g. people with similar tastes

Categorize
• Classification
• Good or Bad, Male or Female, Buyer or Browser …

Estimate
• Regression
• Find the relationship between the outcome and the input variables

Recommend
• Association Rules & Collaborative Filtering (what do others with tastes similar to mine prefer?)
• E.g. a customer who buys onions and potatoes is likely to buy burgers too

Optimize
• Minimize or maximize factors in a business process
• E.g. risk reduction, inventory reduction, profit maximization, spend optimization


Ok – It kinda works! Now What?

Input Data + Favorite Analytic Tool = Insights


Secret to Success – Transformation

Input Data + Favorite Analytic Tool = Insights
Secret to success: Data Transformation and Business Transformation

 | Internet Companies | Traditional Enterprises
Data Transformation | Easy to get events | Event generation integration is complex
Data Transformation | Easy to transform events | Partial events need to be combined
Data Transformation | Easy to connect events at source | All connections done at the back end
Data Transformation | Easy to add payload from source | Complex payload additions
Business Transformation | Easy to integrate insights with automated actions | Education to influence manual actions
Business Transformation | Quick feedback from experiments | Business process changes to take insights into account
Business Transformation | Many experiments in parallel | Selective experiments in distinct regions
Business Transformation | Fast fail paradigm applicable | Fail safe mechanisms – cannot make mistakes



Operationalization of Analytics

Business Data → Input Data
• Domain Knowledge and Business KPIs
• f_KPIs(Business Data) = Input Data for Modeling

Model Execution
• Where to run the Model
• How to monitor the model
• Re-tuning or Model Refresh

Insights → Business Actions
• How do Insights influence Business Decisions
• Business Transformation to include Insights
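The first two stages above can be sketched in a few lines. Everything named here is hypothetical and chosen only for illustration: `to_features` stands in for the KPI function that turns business data into model input, and `ModelMonitor` is one simple way to decide when a model refresh is due, assuming a rolling mean of absolute prediction error is an acceptable health signal.

```python
from collections import deque

def to_features(record):
    """Hypothetical KPI function: turn a raw business record into model inputs."""
    conversion = record["units_sold"] / max(record["store_visits"], 1)
    return [conversion, record["discount_pct"]]

class ModelMonitor:
    """Tracks recent prediction error and flags when a refresh looks due."""

    def __init__(self, window=50, max_mean_error=0.2):
        self.errors = deque(maxlen=window)  # only the most recent errors are kept
        self.max_mean_error = max_mean_error

    def record(self, predicted, actual):
        self.errors.append(abs(predicted - actual))

    def needs_refresh(self):
        # Wait for a full window so a few early outliers do not trigger a refresh.
        if len(self.errors) < self.errors.maxlen:
            return False
        return sum(self.errors) / len(self.errors) > self.max_mean_error

monitor = ModelMonitor(window=3, max_mean_error=0.2)
for predicted, actual in [(1.0, 1.1), (1.0, 0.9), (1.0, 2.0)]:
    monitor.record(predicted, actual)
print("refresh needed:", monitor.needs_refresh())
```

The window size and error threshold are business decisions rather than technical ones, which is one reason the slide pairs model execution with domain knowledge.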


Input Data + Favorite Analytic Tool = Insights
Business Transformation, Data Transformation

Data Fabric: Integrated Stack
Sqoop, Flume, Distcp, HDFS Put, Data Loader, Talend, Informatica, Cross Platform Workflow, MADlib


Input Data + Favorite Analytic Tool = Insights
Business Transformation, Data Transformation

Big Data Integration, Big Data Analytics Platform, Big Data Applications

Data Fabric: Integrated Stack
Sqoop, Flume, Distcp, HDFS Put, Data Loader, Talend, Informatica, Cross Platform Workflow, MADlib


Spring XD

Spring XD | Description
Streams | Define how event driven data is collected, processed, and stored or forwarded. For example, a stream might collect syslog data, filter, and store it in HDFS.
Jobs | Define how coarse grained and time consuming batch processing steps are orchestrated. For example, a job could be defined to coordinate performing HDFS operations and the subsequent execution of multiple MapReduce processing tasks.
Taps | Used to process data in a non-invasive way as data is being processed by a Stream or a Job. Much like wiretaps used on telephones, a Tap on a Stream lets you consume data at any point along the Stream's processing pipeline. The behavior of the original stream is unaffected by the presence of the Tap.

$bin>./xd-shell

Welcome to the Spring XD shell. For assistance hit TAB or type "help".

xd:>stream create --name httpStream --definition "http | file"

xd:>tap create --name httpTap --definition "tap httpStream | counter"

xd:>http post --target http://localhost:9000 --data "helloworld"
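The relationship between a stream and a tap can be illustrated outside Spring XD. The following is a plain Python analogy, not the Spring XD API: the `Stream` and `Counter` classes are invented for this sketch and only mimic the idea that a tap observes payloads while the stream's own source-to-sink path behaves exactly as before.

```python
class Stream:
    """Toy pipeline: every posted payload goes to one sink, and taps may observe it."""

    def __init__(self, sink):
        self.sink = sink
        self.taps = []

    def add_tap(self, consumer):
        self.taps.append(consumer)

    def post(self, payload):
        for consumer in self.taps:
            try:
                consumer(payload)
            except Exception:
                pass  # a failing tap must not disturb the stream itself
        self.sink(payload)

class Counter:
    """Toy stand-in for a counter sink: counts the payloads it sees."""

    def __init__(self):
        self.count = 0

    def __call__(self, payload):
        self.count += 1

written = []                                # stands in for the file sink
http_stream = Stream(sink=written.append)   # analogous to "http | file"
http_tap = Counter()
http_stream.add_tap(http_tap)               # analogous to "tap httpStream | counter"
http_stream.post("helloworld")
print(written, http_tap.count)
```

The design point carried over from the slide is that the sink receives the payload whether or not any tap is attached or working.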


Analytics Tools History

[Timeline figure, 1960's to 2010's. Tools and vendors shown: Microsoft Excel, MATLAB, FORTRAN, STATISTICA, SAS, Oracle, SAP, SPSS, IBM, S-PLUS, TIBCO, R, WEKA, Mahout, MADlib]


Group

Input Data Format
Item | Description
ID | Unique Object Id
F1 | Feature 1
F2 | Feature 2
F3 | Feature 3
F4 | Feature 4
# of Clusters (optional)

Insights
Item | Description
ID | Unique Object Id
F1 | Feature 1
F2 | Feature 2
F3 | Feature 3
F4 | Feature 4
CID | Cluster Id

Experiment to come up with an appropriate representation of the cluster

Big Data | Description
Data Size | Number of Data sets to cluster (1M, 100B)
Features | # of features to account for in clustering
Distance Function | Function used as a distance between two features

Same Model Code: Experiment and Production

Icon: www.cedarchestdesigns.com


Under the Hood Clustering

Type | Description | Examples
Connectivity Models | Models based on distance connectivity | Hierarchical Clustering
Centroid Models | Represents each cluster by a single mean vector | K-Means Clustering
Distribution Models | Clusters modeled using statistical distributions | Expectation-maximization algorithm
Density Models | Defines clusters as connected dense regions in the data space | DBSCAN and OPTICS
Subspace Models | Clusters are modeled with both cluster members and relevant attributes | Biclustering or Co-Clustering
Graph-based Models | Some algorithms do not provide a refined model for their results and just provide the grouping information |
… | … | …

100+ Clustering Algorithms

Clustering Paradigms
Strict Partitioning: Each object belongs to only one cluster
Strict Partitioning w/ Outliers: Some objects belong to no cluster (outliers)
Overlapping: Objects belong to one or more clusters
Hierarchical: Objects of child clusters also belong to parent cluster
…


SQL - kmeans

SELECT madlib.kmeanspp(
    input_data_table,
    feature_array_column,
    number_of_clusters,
    distance_function,
    max_num_iterations
);

MADlib


Mahout - kmeans

bin/mahout kmeans \
  -i <input vectors directory> \
  -c <input clusters directory> \
  -o <output working directory> \
  -dm <DistanceMeasure> \
  -x <maximum number of iterations>


R example - kmeans

kmeans(
  x = data_matrix,
  centers = number_of_clusters,
  iter.max = max_iterations,
  nstart = number_of_random_starts
)
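All three k-means calls above take the same inputs: the data, the number of clusters, and an iteration limit. To show what happens behind those parameters, here is a minimal sketch of Lloyd's algorithm with squared Euclidean distance in NumPy. It is an illustration only, not how MADlib, Mahout, or R implement it; the function name, the random initialisation, and the toy points are assumptions made for this example.

```python
import numpy as np

def kmeans(points, number_of_clusters, max_iterations=100, seed=0):
    """Minimal Lloyd's algorithm; returns (cluster id per point, centroids)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from randomly chosen, distinct data points.
    start = rng.choice(len(points), size=number_of_clusters, replace=False)
    centroids = points[start]

    def assign(current):
        # Squared Euclidean distance from every point to every centroid.
        distances = ((points[:, None, :] - current[None, :, :]) ** 2).sum(axis=2)
        return distances.argmin(axis=1)

    for _ in range(max_iterations):
        labels = assign(centroids)
        # Move each centroid to the mean of its members; keep it if it has none.
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(number_of_clusters)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return assign(centroids), centroids

toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
cluster_ids, centers = kmeans(toy_points, number_of_clusters=2)
print(cluster_ids)
print(centers)
```

The returned cluster ids correspond to the CID column of the insights table, and swapping the distance computation is where a different distance function would plug in.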


Categorize

Input Data Format
Item | Description
Category | The target variable whose category we want to predict for new incoming data.
F1 | Feature 1
F2 | Feature 2
F3 | Feature 3
F4 | Feature 4

Insights
Item | Description
Predict_Category | Equation to calculate probability of category membership

Experiment to come up with an appropriate equation to model category as a function of the features

Big Data | Description
Data Size | Number of Data sets (1M, 100B)
Features | # of features to account for in classification

Experiment and Training: Modeling to get the equation
Production Environment: Production – Run the equation


Under the Hood Categorize

Type | Description | Examples
Linear classifiers | Models based on linear combination of characteristics | Logistic regression
Kernel Density | Size of kernels used in estimate varies depending on location | SVM

100+ Categorize Algorithms


SQL – logistic regression

SELECT madlib.logregr_train(
    Input_data_table,
    Model_output_table,
    Target_category_to_predict,
    Feature_variables
);

MADlib


R – logistic regression

fit <- glm(F ~ x1 + x2 + x3, data = mydata, family = binomial())
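The "equation to calculate probability of category membership" that logistic regression produces is a sigmoid applied to a weighted sum of the features. The sketch below fits those weights with plain batch gradient descent in NumPy. It is a teaching illustration under that assumption, not what `glm` or MADlib do internally (they use different solvers); the function names, learning rate, and toy data are invented here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(features, labels, learning_rate=0.1, iterations=2000):
    """Fit weights (intercept first) by batch gradient descent on the log loss."""
    labels = np.asarray(labels, dtype=float)
    design = np.column_stack([np.ones(len(features)), np.asarray(features, dtype=float)])
    weights = np.zeros(design.shape[1])
    for _ in range(iterations):
        predicted = sigmoid(design @ weights)
        gradient = design.T @ (predicted - labels) / len(labels)
        weights -= learning_rate * gradient
    return weights

def predict_probability(weights, features):
    """Run the fitted equation: probability that each row belongs to category 1."""
    design = np.column_stack([np.ones(len(features)), np.asarray(features, dtype=float)])
    return sigmoid(design @ weights)

# Toy data: one feature, category flips from 0 to 1 as the feature grows.
toy_features = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
toy_labels = [0, 0, 0, 1, 1, 1]
weights = train_logistic(toy_features, toy_labels)
print(np.round(predict_probability(weights, toy_features), 3))
```

Training corresponds to the experiment step, while `predict_probability` is the part that would run in production once the weights are known.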


Forecast

Input Data Format
Item | Description
Value | The target variable whose value we want to predict for new incoming data.
F1 | Feature 1
F2 | Feature 2
F3 | Feature 3
F4 | Feature 4

Insights
Item | Description
Predict_Value | Equation to calculate value of target variable

Experiment to come up with an appropriate equation to model continuous value as a function of the features

Big Data | Description
Data Size | Number of Data sets (1M, 100B)
Features | # of features to account for in value estimation

Experiment and Training: Modeling to get the equation
Production Environment: Production – Run the equation

Icon: www.cedarchestdesigns.com http://en.wikipedia.org/


SQL – multiple regression

SELECT madlib.linregr_train(
    Input_data_table,
    Model_output_table,
    Target_value_to_predict,
    Feature_variables
);

MADlib


R – multiple regression

# Multiple Linear Regression Example
fit <- lm(y ~ x1 + x2 + x3, data = mydata)
summary(fit)  # show results
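The split between "modeling to get the equation" and "run the equation" can be made concrete: training yields a handful of coefficients, and production scoring needs nothing but those numbers. The sketch below shows that split with NumPy least squares; the function names and the toy data are assumptions for illustration and are not taken from the talk.

```python
import numpy as np

def train(features, target):
    """Experiment step: least-squares coefficients, intercept first."""
    features = np.asarray(features, dtype=float)
    design = np.column_stack([np.ones(len(features)), features])
    coefficients, *_ = np.linalg.lstsq(design, np.asarray(target, dtype=float), rcond=None)
    return coefficients

def predict_value(coefficients, feature_row):
    """Production step: evaluate the fitted equation for one new record."""
    return float(coefficients[0] + np.dot(coefficients[1:], feature_row))

# Toy history generated from a known equation so the fit is easy to check by eye.
history = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
observed = 1.0 + 2.0 * history[:, 0] + 3.0 * history[:, 1]

equation = train(history, observed)
print(np.round(equation, 3))
print(predict_value(equation, [6.0, 2.0]))
```

Because `predict_value` depends only on the coefficient vector, the same equation could be evaluated in SQL, in an application, or anywhere else the production environment requires.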


Recommend

Input Data Format
Item | Description
TID | Transaction ID
P1 | Product 1
P2 | Product 2
P3 | Product 3
P4 | Product 4

Insights
Item | Description
Item Set Rule | If Product 1, then Product 3 + confidence, etc…

Experiment to come up with a list of association of items based on their frequency appearing as a set

Big Data | Description
Data Size | Number of Data sets (1M, 100B)
Transactions | # of transactions to analyze

Experiment and Training: Modeling to get the equation
Production Environment: Production – Run the equation

Icon: www.cedarchestdesigns.com http://en.wikipedia.org/


Recommend – MADlib example

SELECT * FROM madlib.assoc_rules(
    set_support,
    set_confidence,
    transaction_id_column,
    item_name_column,
    input_data_table,
    output_schema_name,
    set_verbose_level
);

MADlib


Recommend – R example

library(arules)  # provides read.transactions, apriori, and inspect

txn <- read.transactions(file = "Transactions_sample.csv", rm.duplicates = FALSE, format = "single", sep = ",", cols = c(1, 2))

# Run the apriori algorithm
basket_rules <- apriori(txn, parameter = list(support = 0.5, confidence = 0.9, target = "rules"))

# Check the generated rules using inspect
inspect(basket_rules)

# If a huge number of rules is generated, specific rules can be read using an index
inspect(basket_rules[1])
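The support and confidence thresholds passed to these tools have simple definitions: support is the fraction of transactions that contain an item set, and confidence of "if A then B" is support(A and B) divided by support(A). The sketch below computes both for pairs of items only. It is deliberately not a full Apriori implementation, and the function name and toy baskets are invented for illustration.

```python
from itertools import combinations

def mine_pair_rules(transactions, min_support=0.5, min_confidence=0.9):
    """Return (lhs, rhs, support, confidence) for single-item rules that pass both thresholds."""
    baskets = [set(t) for t in transactions]
    total = len(baskets)
    items = sorted(set().union(*baskets))

    def support(itemset):
        # Fraction of baskets that contain every item in the set.
        return sum(itemset <= basket for basket in baskets) / total

    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue  # infrequent pairs cannot yield rules that meet min_support
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules

toy_baskets = [
    ["onions", "burgers"],
    ["onions", "burgers"],
    ["onions", "burgers", "milk"],
    ["milk"],
]
for lhs, rhs, sup, conf in mine_pair_rules(toy_baskets):
    print(f"if {lhs} then {rhs} (support={sup}, confidence={conf})")
```

Raising `min_support` trims the candidate list quickly, which is the same pruning idea that lets Apriori-style algorithms scale to large transaction sets.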


Recommend - Mahout

# -i: input data
# -k: find the top-k patterns
# -s: minimum number of times a pattern occurs
bin/mahout fpg \
  -i core/src/test/resources/retail.dat \
  -o patterns \
  -k 50 \
  -method mapreduce \
  -s 2


Optimize: Work in Progress


Observed Market Trends

• HDFS: the HDFS interface is becoming the common storage environment for the future
• SQL: enterprises are looking for SQL capabilities to leverage their existing investments
• Cloud vs Baremetal: flexibility and elasticity for the data infrastructure


PIVOTAL HD

Enabling the Data Driven Enterprise


HAWQ: The Crown Jewels of Greenplum

SQL compliant

World-class query optimizer

Interactive query

Horizontal scalability

Robust data management

Common Hadoop formats

Deep analytics


Spring XD

• Unified Platform
  • Ingestion and stream processing
  • Workflow and data export
• Developer Productivity
• Modular Extensibility
• Distributed Architecture
• Portable Runtime
• Hadoop Distribution Agnostic
• Proven Foundation


Pivotal HD Architecture

[Architecture diagram. Apache components: HDFS, HBase, Pig, Hive, Mahout, Map Reduce, Sqoop, Flume, Yarn, Zookeeper (Resource Management & Workflow). Pivotal HD Enterprise components: Command Center (Configure, Deploy, Monitor, Manage), Data Loader, Unified Storage Service, Hadoop Virtualization Extension, Spring. HAWQ – Advanced Database Services: Xtension Framework, Catalog Services, Query Optimizer, Dynamic Pipelining, ANSI SQL + Analytics.]


Committed to Open Source

Pivotal is a major contributor to multiple open source projects

• Pivotal has signed Apache CCLA (July 17, 2013)
• Contributing to Apache Hadoop (Pig patch, Hadoop Virtualization Extensions)
• Integrating with other Open Source projects

Chorus

MADlib

Apache Web Server


Questions

“We want our customers running North/South”

Ward Maddux – [email protected]

