BUILT FOR THE SPEED OF BUSINESS
© Copyright 2013 Pivotal. All rights reserved.
The Art and Craft of Big Data Analytics
Susheel Kaushik
October 16, 2013
http://www.meetup.com/Boulder-Denver-Big-Data/events/141972012/
Talk Abstract
"Big Data" can be used to prevent fine-art fraud. It can be used to detect anomalies in aircraft engines. That is all super cool, but how does that relate to your business? Your opportunities may not feel as world-changing as the genome project. However, innovation is critical to your business, and you don't want to miss out on something that could have a significant impact on revenue only because the practical application of this new technology is not entirely obvious.
Consider this interesting bit of trivia: there are five and only five categories of analytics. Therefore, art, aircraft engines, and your genetic sequence are closer than you think to customer loyalty, price optimization, and traffic monitoring.
In this meet-up we will define and illustrate the five major categories of analytics and big data. We'll describe use cases (examples) that demonstrate how this new analytics functionality can be applied to derive insights. We'll then take a deep dive into an interesting use case that resonates with the audience, and expose what really happens at the developer level and what is required to make these new "models" work. We'll cover technologies such as R, SQL, and Mahout, and will detail how to operationalize a big-data analytics model.
Agenda
Introduction
Business Use Cases
Deep Dive: Incorporate Analytics in Business Strategy
Simplification Framework
Operationalization of Analytics
Pivotal Perspective
Questions and Answers
What is Analytics?
Analytics is the discovery and communication of meaningful patterns in data
– Data analysis: models to gain valuable knowledge (insights) from data
– Communication: insights to recommend action or to guide decision making
Business wants to know: what is around the corner?
Descriptive
– Questions: What happened? What is happening?
– Enablers: Business Reporting; Dashboards and Scorecards; Data Warehousing
– Outcomes: Well-defined business problems and opportunities
Predictive
– Questions: What will happen? Why will it happen?
– Enablers: Data & Text Mining; Web/Media Mining; Forecasting
– Outcomes: Accurate projections of future states and conditions
Prescriptive
– Questions: What should I do? Why should I do it? (What if)
– Enablers: Optimization & Simulation; Decision Modeling; Expert Systems
– Outcomes: Best possible business decisions and transactions
Is Analytics New?
•Grog uses two sticks and four rocks to graph the upward trend in sales of his new invention, the wheel. 5000 B.C.
•Sumerian analysts predict the world's use of letters will be greater than Mesopotamia's supply of clay tablets by 3000 B.C. Analysts suggest something called "papyrus" may solve the problem. 3200 B.C.
•Roman leader Caesar receives analysts' predictions that March will be a "down month," but disregards the data. 44 B.C.
•Michelangelo uses an advanced abacus to estimate the amount of paint needed to cover the Sistine Chapel. 1508 A.D.
•The Globe Theatre of London text mines peasants' comments after a play by a fellow named Shakespeare and decides to ask him to write more plays like the last one. 1590 A.D.
•Henry Ford conducts what-if analysis that clearly shows that limiting the Model-T to one color, black, is the best way to maximize profits. 1908 A.D.
•The Beatles' manager uses early marketing automation software to reveal that Ringo should not sing lead on "I Want to Hold Your Hand." John and Paul take over on the microphones. 1962 A.D.
http://www.sas.com/news/sascom/2008q4/column_bestblogs.html
Elementary, my dear Watson
What actionable business insights require:
• Data: access to the right data
• Technology: scalable technology to deal with the data
• Data Science Skills: solid understanding of the analytical algorithms
• Domain Experts: excellent understanding of the business environment
What do you believe?
• Data: garbage in, garbage out (the data is no good)
• Technology: not big/fast enough (the technology cannot scale)
• Data Science Skills: Maslow's hammer (algorithms not well understood)
• Domain Experts: no business understanding
Analytics Use Cases
Online Advertising
– Advertisement Targeting
– Spend Optimization
– Page View Guarantees
– Ad Selection
– Fraud Prediction
– Traffic Quality
Telco
– Churn Prediction
– Expansion/Growth Planning
– Bundle Selection
– Advertisement Targeting
– Network Analysis
– Fault Prediction
Manufacturing
– Production Planning
– Processing Optimization
– Early Event Detection
– Risk Reduction
– Inventory Optimization
– Capital Minimization
– Price Projections
Corporate Finance
– Optimize Cash Flow
– Investment Optimization
HR
– Workforce Analytics
– Talent Selection
– Retention and Churn Prediction
Analytics Use Cases (continued)
Sports
– Moneyball
Social
– Flavors
– Sentiment Analysis
– Customer Segmentation
Oil
– Reservoir Location Estimation
– Reservoir Size Estimation
– Demand Prediction
Energy
– Demand Prediction
– Production Estimation
– Fault Prediction
Finance
– Portfolio Planning
– Risk Minimization
– Next Best Offer
– Capital Minimization
– Price Projections
Banking
…
Complex: My Favorite Analytics
Image Analytics
– Seismic Analysis
– Painting Analysis
– Medical Image Analysis
Text Analytics
– Natural Language Processing
▪ Topic Modeling
▪ Search Intent
– Sentiment Prediction
Video Analytics
– Product Identification
– Face Recognition
– Surveillance
Image Analytics
http://www.gigapan.com/
2 x 2 mm high resolution pictures of Painting/Artifact in great detail
Cool Visualizations
Analytics to get the paint stroke
Pattern Analysis for Fraud Analysis
Seismic Data
Sonar Readings
Quick visualization
Data Cleanup
Mix with Prior Knowledge
Modelled Visualization
www.norsar.no
Retail Brick & Mortar Deep Dive
Retail Value Chain in a Nutshell
Maximizing (profitable) inventory turns
– Virtuous cycle
▪ Stop needing working capital in the Buy step if the turns are fast enough.
– What to buy and where to place it to satisfy/generate demand
▪ Balance needed between product availability and required markdown.
Scaling by opening more stores
– Reduce average store operating cost.
– Increase brand awareness and generate demand.
1. Buy ($$)
2. Place ($$)
3. Sell ($$)
4. Markdown ($$)
Analytics in Retail: Reduce Costs
Supply Chain Management
• Distribution Route Optimization
• Inventory Optimization
Theft Prevention
• Image Analytics
• Transaction Anomaly Detection
Procurement Optimization
• Vendor Scorecard
• Lead Time Estimation
General and Administrative
• Workforce Analytics & Employee Churn
• IT Security Analytics
Analytics in Retail: Increase Revenue: Customer
Customer Targeting
• Segmentation and Targeting
• Store Clustering
Customer Satisfaction
• Customer Satisfaction
• Customer Care Analytics
Customer Loyalty
• Loyalty Program Analytics
• Customer Lifetime Value
Customer Churn
• Churn Prediction
• Churn Prevention
Analytics in Retail: Increase Revenue: Demand
Increase Ad Spend Lift
• Ad Effectiveness Analytics
• Market Mix Modeling
Increase Reach
• Site Selection Analytics
• Digital Marketing and Social Media, Communication Optimizations
Increase Basket Size
• Affinity Analysis, Next Best Offer
• Cross Sell/Up Sell, Store Experimentation
Marketing Mix Modeling
“Half the money I spend in marketing is wasted; the trouble is I don’t know which half” – John Wanamaker
Estimate the impact of various marketing tactics (marketing mix) on sales
– Base and incremental volume
▪ Base: volume that would be generated in the absence of any marketing activity
▪ Incremental: volume generated by marketing activities in the short run
– Media and advertising
▪ Effectiveness of 15-second vis-à-vis 30-second executions
▪ Comparisons in ad performance when run during prime-time vis-à-vis off-prime-time dayparts
– Trade promotions
– Pricing
– Distribution
▪ E.g. incremental volume through 1% more presence in a neighborhood Kirana store is 180% greater than that through 1% more presence in a supermarket
– Launches
– Competition
MMM: Key Insights
Contribution by marketing activity
ROI by marketing activity
Effectiveness of marketing activity
Optimal distribution of spends
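The insights listed above can be read off a fitted marketing mix model. The following is a minimal Python sketch, assuming a simple linear model in which volume = base volume + sum of (coefficient × spend) per activity. The function name, channel names, coefficients, spends, and margin are all invented for illustration and are not figures from the talk.

```python
# Hypothetical fitted linear marketing mix model:
#   volume = base_volume + sum(coefs[channel] * spends[channel])
# All numbers below are invented for illustration.

def mmm_insights(base_volume, coefs, spends, margin_per_unit):
    """Return total volume plus contribution share and ROI per channel."""
    incremental = {ch: coefs[ch] * spends[ch] for ch in spends}
    total_volume = base_volume + sum(incremental.values())
    report = {}
    for ch, units in incremental.items():
        report[ch] = {
            "incremental_units": units,
            # Share of total volume driven by this activity.
            "contribution": units / total_volume,
            # Margin generated per dollar spent on this activity.
            "roi": units * margin_per_unit / spends[ch],
        }
    return total_volume, report


total, report = mmm_insights(
    base_volume=600.0,
    coefs={"tv": 0.02, "trade_promo": 0.05},       # units per dollar (assumed)
    spends={"tv": 10000.0, "trade_promo": 4000.0},
    margin_per_unit=30.0,
)
for channel, row in report.items():
    print(channel, row)
```

In this toy setup both activities add the same incremental volume, but the trade promotion does so with less spend, which is the kind of comparison that feeds the "optimal distribution of spends" question.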
Simplification Framework
Simplified View of Analytics
Group: Clustering
• Group similar items and find structure or commonality within data, e.g. people with similar tastes
Categorize: Classification
• Good or bad, male or female, buyer or browser, …
Estimate: Regression
• Find the relationship between the outcome and the input variables
Recommend: Association Rules & Collaborative Filtering
• What do others with tastes similar to mine prefer?
• E.g. a customer who buys onions and potatoes is likely to buy burgers too
Optimize
• Minimize or maximize factors in a business process
• E.g. risk reduction, inventory reduction, profit maximization, spend optimization
Ok – It kinda works! Now What?
Input Data + Favorite Analytic Tool = Insights
Secret to Success – Transformation
Input Data + Favorite Analytic Tool = Insights
Data Transformation
Internet Companies | Traditional Enterprises
Easy to get events | Event generation integration is complex
Easy to transform events | Partial events need to be combined
Easy to connect events at source | All connections done at the back end
Easy to add payload from source | Complex payload additions
Business Transformation
Internet Companies | Traditional Enterprises
Easy to integrate insights with automated actions | Education to influence manual actions
Quick feedback from experiments | Business process changes to take insights into account
Many experiments in parallel | Selective experiments in distinct regions
Fast fail paradigm applicable | Fail safe mechanisms – cannot make mistakes
Operationalization of Analytics
Business Data → Input Data
• Domain knowledge and business KPIs
• f_KPIs(Business Data) = Input Data for modeling
Model Execution
• Where to run the model
• How to monitor the model
• Re-tuning or model refresh
Insights → Business Actions
• How do insights influence business decisions
• Business transformation to include insights
Data Fabric: Integrated Stack
Input Data + Favorite Analytic Tool = Insights
Transformations: Data Transformation, Business Transformation
Layers: Big Data Integration, Big Data Analytics Platform, Big Data Applications
Components: Sqoop, Flume, Distcp, HDFS Put, Data Loader, Talend, Informatica, Cross Platform Workflow, MADlib
Spring XD
Streams: Define how event-driven data is collected, processed, and stored or forwarded. For example, a stream might collect syslog data, filter it, and store it in HDFS.
Jobs: Define how coarse-grained and time-consuming batch processing steps are orchestrated. For example, a job could be defined to coordinate performing HDFS operations and the subsequent execution of multiple MapReduce processing tasks.
Taps: Used to process data in a non-invasive way as data is being processed by a Stream or a Job. Much like wiretaps used on telephones, a Tap on a Stream lets you consume data at any point along the Stream's processing pipeline. The behavior of the original stream is unaffected by the presence of the Tap.
$bin> ./xd-shell
Welcome to the Spring XD shell. For assistance hit TAB or type "help".
xd:>stream create --name httpStream --definition "http | file"
xd:>tap create --name httpTap --definition "tap httpStream | counter"
xd:>http post --target http://localhost:9000 --data "helloworld"
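To make the stream and tap idea concrete, here is a small conceptual model in Python. It is only a sketch of the behavior described above and is not the Spring XD API: the `Stream` and `Counter` classes, the step functions, and the list standing in for the file sink are all assumptions made for illustration.

```python
# Conceptual model only: this is NOT the Spring XD API.

class Stream:
    """A linear pipeline of processing steps with optional taps."""

    def __init__(self, *steps):
        self.steps = list(steps)
        self.taps = {}  # step index -> list of tap callbacks

    def tap(self, after_step, callback):
        """Observe the data flowing out of step number `after_step`."""
        self.taps.setdefault(after_step, []).append(callback)

    def send(self, message):
        for index, step in enumerate(self.steps):
            message = step(message)
            for callback in self.taps.get(index, []):
                # The tap's return value is ignored, so the stream
                # behaves the same with or without the tap.
                callback(message)
        return message


class Counter:
    """Sink for a tap: counts the messages it sees."""

    def __init__(self):
        self.count = 0

    def __call__(self, message):
        self.count += 1


written = []                       # stands in for the "file" sink
http_stream = Stream(str.strip, written.append)
http_tap = Counter()
http_stream.tap(0, http_tap)       # tap after the first step

http_stream.send("helloworld\n")
http_stream.send("second message\n")
print(written, http_tap.count)
```

The tap counts every message that passes the first step while the sink still receives exactly what it would have received without the tap.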
Analytics Tools History
Timeline figure spanning the 1960s to the 2010s. Tools shown: FORTRAN, SAS, SPSS (IBM), MATLAB, Microsoft Excel, STATISTICA, S-PLUS (TIBCO), Oracle, SAP, R, WEKA, Mahout, MADlib.
Group: Insights and Input Data Format
Input data
Item | Description
ID | Unique Object Id
F1 | Feature 1
F2 | Feature 2
F3 | Feature 3
F4 | Feature 4
…
# of Clusters (optional)
Insights
Item | Description
ID | Unique Object Id
F1 | Feature 1
F2 | Feature 2
F3 | Feature 3
F4 | Feature 4
…
CID | Cluster Id
Experiment to come up with an appropriate representation of the cluster
Big Data | Description
Data Size | Number of data sets to cluster (1M, 100B)
Features | # of features to account for in clustering
Distance Function | Function used as a distance between two features
Same model code for experiment and production
Icon: www.cedarchestdesigns.com
Under the Hood: Clustering
Type | Description | Examples
Connectivity Models | Models based on distance connectivity | Hierarchical Clustering
Centroid Models | Represents each cluster by a single mean vector | K-Means Clustering
Distribution Models | Clusters modeled using statistical distributions | Expectation-maximization algorithm
Density Models | Defines clusters as connected dense regions in the data space | DBSCAN and OPTICS
Subspace Models | Clusters are modeled with both cluster members and relevant attributes | Biclustering or Co-Clustering
Group Models | Some algorithms do not provide a refined model for their results and just provide the grouping information |
… | … | …
100+ Clustering Algorithms
Clustering Paradigms
Strict Partitioning | Each object belongs to only one cluster
Strict Partitioning w/ Outliers | Some objects belong to no cluster (outliers)
Overlapping | Objects belong to one or more clusters
Hierarchical | Objects of child clusters also belong to the parent cluster
… | …
SQL (MADlib) - kmeans
SELECT madlib.kmeanspp(
    input_data_table,
    feature_array_column,
    number_of_clusters,
    distance_function,
    max_num_iterations
);
Mahout - kmeans
bin/mahout kmeans \
  -i <input vectors directory> \
  -c <input clusters directory> \
  -o <output working directory> \
  -dm <DistanceMeasure> \
  -x <maximum number of iterations>
R example - kmeans
kmeans(
  data_matrix,
  centers = number_of_clusters,
  iter.max = max_iterations,
  nstart = number_of_random_starts
)
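All three k-means calls above take the same inputs: the data, the number of clusters, a distance function, and an iteration limit. As a rough illustration of what happens under the hood, here is a plain-Python sketch of Lloyd's algorithm. It is not the MADlib, Mahout, or R implementation; the naive seeding from the first k points and the toy data are assumptions chosen to keep the example short.

```python
# Plain-Python Lloyd's algorithm; not the MADlib, Mahout, or R implementation.

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=20, distance=squared_distance):
    """Return (centroids, cluster_ids) for a list of equal-length tuples."""
    centroids = [points[i] for i in range(k)]  # naive seeding: first k points
    cluster_ids = [None] * len(points)
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_ids = [
            min(range(k), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        if new_ids == cluster_ids:
            break  # converged: no point changed cluster
        cluster_ids = new_ids
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, cid in zip(points, cluster_ids) if cid == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, cluster_ids


# Toy data (made up): two visibly separate groups.
data = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (0.5, 1.5), (8.5, 9.0)]
centroids, cluster_ids = kmeans(data, k=2)
print(cluster_ids)
```

The returned cluster ids correspond to the CID column of the output format, one id per input object.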
Categorize: Insights and Input Data Format
Input data
Item | Description
Category | The target variable whose category we want to predict for new incoming data
F1 | Feature 1
F2 | Feature 2
F3 | Feature 3
F4 | Feature 4
…
Insights
Item | Description
Predict_Category | Equation to calculate probability of category membership
Experiment to come up with an appropriate equation to model category as a function of the features
Big Data | Description
Data Size | Number of data sets (1M, 100B)
Features | # of features to account for in classification
Experiment and Training: modeling to get the equation
Production Environment: run the equation
Under the Hood: Categorize
Type | Description | Examples
Linear classifiers | Models based on linear combination of characteristics | Logistic regression
Kernel Density | Size of kernels used in estimate varies depending on location | SVM
100+ Categorize Algorithms
SQL (MADlib) – logistic regression
SELECT madlib.logregr_train(
    Input_data_table,
    Model_output_table,
    Target_category_to_predict,
    Feature_variables
);
R – logistic regression
fit <- glm(
  F ~ x1 + x2 + x3,
  data = mydata,
  family = binomial()
)
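The split between "modeling to get the equation" and "run the equation in production" can be illustrated with a small Python sketch. It fits a logistic regression by batch gradient descent, which is chosen here for brevity and is not necessarily how MADlib or R's glm estimate the model. The function names, learning rate, epoch count, and toy data are assumptions.

```python
import math

# Illustrative only: batch gradient descent on the log-loss.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Fit weights [bias, w1, ..., wn] by gradient descent."""
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for features, label in zip(rows, labels):
            x = [1.0] + list(features)           # prepend the bias input
            error = sigmoid(sum(w * v for w, v in zip(weights, x))) - label
            for j, v in enumerate(x):
                gradient[j] += error * v
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, gradient)]
    return weights


def predict_category(weights, features):
    """The 'equation' run in production: probability of category 1."""
    x = [1.0] + list(features)
    return sigmoid(sum(w * v for w, v in zip(weights, x)))


# Toy data (made up): one feature, category 1 when the feature is large.
rows = [(0.5,), (1.0,), (1.5,), (3.0,), (3.5,), (4.0,)]
labels = [0, 0, 0, 1, 1, 1]
weights = train_logistic(rows, labels)
print(predict_category(weights, (1.0,)), predict_category(weights, (3.5,)))
```

Training is the expensive, experimental part; once the weights exist, scoring a new record is a single cheap function call, which is what makes the production step easy to deploy.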
Forecast: Insights and Input Data Format
Input data
Item | Description
Value | The target variable whose value we want to predict for new incoming data
F1 | Feature 1
F2 | Feature 2
F3 | Feature 3
F4 | Feature 4
…
Insights
Item | Description
Predict_Value | Equation to calculate value of target variable
Experiment to come up with an appropriate equation to model continuous value as a function of the features
Big Data | Description
Data Size | Number of data sets (1M, 100B)
Features | # of features to account for in value estimation
Experiment and Training: modeling to get the equation
Production Environment: run the equation
Icon: www.cedarchestdesigns.com http://en.wikipedia.org/
SQL (MADlib) – multiple regression
SELECT madlib.linregr_train(
    Input_data_table,
    Model_output_table,
    Target_value_to_predict,
    Feature_variables
);
R – multiple regression
# Multiple Linear Regression Example
fit <- lm(y ~ x1 + x2 + x3, data = mydata)
summary(fit) # show results
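For readers who want to see what a multiple regression fit computes, here is a hedged Python sketch that solves the ordinary least squares normal equations with Gauss-Jordan elimination. It is an illustration only, not the method MADlib or R's lm use internally; the helper names and the noise-free toy data are assumptions.

```python
# Ordinary least squares via the normal equations (X^T X) b = X^T y,
# solved with Gauss-Jordan elimination. Illustrative only.

def fit_linear(rows, targets):
    """Return coefficients [intercept, b1, ..., bn]."""
    X = [[1.0] + list(r) for r in rows]
    n = len(X[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict_value(coefficients, features):
    """Run the fitted equation on a new record."""
    return coefficients[0] + sum(b * v for b, v in zip(coefficients[1:], features))


# Toy data generated from y = 1 + 2*x1 + 3*x2 (no noise).
rows = [(1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
targets = [3, 4, 6, 8, 9]
coefficients = fit_linear(rows, targets)
print([round(b, 6) for b in coefficients])
```

Because the toy data contain no noise, the fit recovers the generating coefficients up to floating-point error; real data would also call for the diagnostics that `summary(fit)` reports in R.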
Recommend: Insights and Input Data Format
Input data
Item | Description
TID | Transaction ID
P1 | Product 1
P2 | Product 2
P3 | Product 3
P4 | Product 4
…
Insights
Item | Description
Item Set Rule | If Product 1, then Product 3 + confidence, etc.
Experiment to come up with a list of associations of items based on how frequently they appear as a set
Big Data | Description
Data Size | Number of data sets (1M, 100B)
Transactions | # of transactions to analyze
Experiment and Training: modeling to get the equation
Production Environment: run the equation
Icon: www.cedarchestdesigns.com http://en.wikipedia.org/
Recommend – MADlib example
SELECT * FROM madlib.assoc_rules(
    set_support,
    set_confidence,
    transaction_id_column,
    item_name_column,
    input_data_table,
    output_schema_name,
    set_verbose_level
);
Recommend – R example
library(arules)
txn <- read.transactions(file = "Transactions_sample.csv", rm.duplicates = FALSE, format = "single", sep = ",", cols = c(1, 2))
# Run the apriori algorithm
basket_rules <- apriori(txn, parameter = list(support = 0.5, confidence = 0.9, target = "rules"))
# Check the generated rules using inspect
inspect(basket_rules)
# If a huge number of rules is generated, specific rules can be read by index
inspect(basket_rules[1])
Recommend - Mahout
# -i: input data; -k: find the top-k patterns; -s: minimum number of times a pattern occurs
bin/mahout fpg \
  -i core/src/test/resources/retail.dat \
  -o patterns \
  -k 50 \
  -method mapreduce \
  -s 2
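The support and confidence thresholds used in the MADlib and R examples can be illustrated with a brute-force Python sketch over item pairs. This is not Apriori or FP-Growth, which prune the search space rather than enumerating every pair; the function name and the toy transactions are assumptions that echo the onions, potatoes, and burgers example.

```python
from itertools import combinations

# Brute-force rule mining for tiny data; real tools (Apriori, FP-Growth)
# prune the search instead of enumerating every pair.

def pair_rules(transactions, min_support, min_confidence):
    """Return rules (antecedent, consequent, support, confidence) over item pairs."""
    n = len(transactions)
    baskets = [set(t) for t in transactions]
    items = sorted(set().union(*baskets))

    def support(itemset):
        # Fraction of transactions that contain every item in the set.
        return sum(1 for b in baskets if itemset <= b) / n

    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # Confidence of "if lhs then rhs".
            confidence = pair_support / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


# Made-up transactions for illustration.
transactions = [
    ["onions", "potatoes", "burgers"],
    ["onions", "potatoes", "burgers"],
    ["onions", "potatoes"],
    ["milk", "burgers"],
]
rules = pair_rules(transactions, min_support=0.5, min_confidence=0.9)
for rule in rules:
    print(rule)
```

With these thresholds only the onions and potatoes rules survive: burgers appear often enough with onions to pass the support test, but not reliably enough to pass the confidence test.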
Optimize: Work in Progress
Observed Market Trends
• HDFS: the HDFS interface is becoming the common storage environment for the future
• SQL: enterprises are looking for SQL capabilities to leverage their existing investments
• Cloud vs Bare Metal: flexibility and elasticity for the data infrastructure
PIVOTAL HD
Enabling the Data Driven Enterprise
HAWQ: The Crown Jewels of Greenplum
SQL compliant
World-class query optimizer
Interactive query
Horizontal scalability
Robust data management
Common Hadoop formats
Deep analytics
Spring XD
• Unified Platform
• Ingestion and stream processing
• Workflow and data export
• Developer Productivity
• Modular Extensibility
• Distributed Architecture
• Portable Runtime
• Hadoop Distribution Agnostic
• Proven Foundation
Pivotal HD Architecture
Apache: HDFS, HBase, Pig, Hive, Mahout, MapReduce, Sqoop, Flume, Resource Management & Workflow (YARN, ZooKeeper)
Pivotal HD Enterprise: Command Center (Configure, Deploy, Monitor, Manage), Data Loader, Hadoop Virtualization Extension, Spring, Unified Storage Service
HAWQ – Advanced Database Services: Xtension Framework, Catalog Services, Query Optimizer, Dynamic Pipelining, ANSI SQL + Analytics
Committed to Open Source
Pivotal is a major contributor to multiple open source projects
• Pivotal has signed the Apache CCLA (July 17, 2013)
• Contributing to Apache Hadoop (Pig patch, Hadoop Virtualization Extensions)
• Integrating with other open source projects
Chorus
MADlib
Apache Web Server
Questions
“We want our customers running North/South”
Ward Maddux – [email protected]