Machine Learning in Big Data – Look forward or be left behind
V. William Porto, Hadoop Summit Dublin 2016
Overview of RedPoint Global
2 RedPoint Global Inc. 2016 Confidential
Launched in 2006
Founded and staffed by industry veterans
Headquarters: Wellesley, Massachusetts
Offices in US, UK, Australia, Philippines
Global customer base
Serves most major industries
Overview of RedPoint Global
MAGIC QUADRANT: Data Quality
MAGIC QUADRANT: Integrated Marketing Management
MAGIC QUADRANT: Multichannel Campaign Management
MAGIC QUADRANT: Digital Marketing Hubs
FORRESTER WAVE™: Cross-channel Campaign Management
FORRESTER WAVE™: Data Quality Solutions
With apologies to Gary Larson
Hadoop
Machine Learning – why bother?
“If you have always done it that way, it is probably wrong” - Charles Kettering
Machine Learning – keeping ahead of the curve
• Three basic tenets for success in today’s world
• Prediction - you need to learn and use what you’ve learned
• Optimization - the world is a dynamic place
• Automation - because people don’t scale well
Machine Learning – what really is it all about?
• Learning vs. instruction
• Humans learn instinctively – computers not so much
• Intelligent Systems
• Memory
• Prediction (modeling)
• Assessment
• Feedback
• Adaptation
Data Modeling – what, why, how
• Regression – what happened in the past
• Prediction – what will happen in the future
“Prediction is very difficult – especially if it’s about the future”
- Niels Bohr
Data Modeling – what, why, how
The wide world of data modeling
• Supervised models
  • you have historical data and known correlated outputs (truth)
• Unsupervised models
  • historical data, but may not have (or trust) associated outputs
Decision Trees
Major Assumption: the world is discrete
• fast, easy to understand, no linearity assumptions
• ‘human time’ required, unbalanced and/or large trees
Standard Linear Models
Assumption: the world is linear
• the real world really isn’t linear
• not all errors are equal
• easy to get misleading results
Which line is best?
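One way to see why "which line is best?" has no single answer is to fit the same points twice with different error weights. The sketch below uses weighted least squares in closed form; the data points and the weights are hypothetical and chosen only for illustration, they are not from the talk.

```python
def fit_line(xs, ys, weights=None):
    """Weighted least-squares fit of y = slope * x + intercept."""
    if weights is None:
        weights = [1.0] * len(xs)  # ordinary least squares: every error counts equally
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: four points on y = x plus one outlier at x = 4.
xs = [0, 1, 2, 3, 4]
ys = [0, 1, 2, 3, 14]

equal_slope, equal_intercept = fit_line(xs, ys)
# Assumed weighting: the outlier is judged unreliable and given zero weight.
robust_slope, robust_intercept = fit_line(xs, ys, weights=[1, 1, 1, 1, 0])

print("equal weights:", equal_slope, equal_intercept)
print("outlier down-weighted:", robust_slope, robust_intercept)
```

Both lines are "best" under their own error definition; which one is useful depends on how much each error actually matters.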
Generalized ‘Non-Linear’ Models
Assumptions
• underlying functional mapping is known
• all errors are equal
• data is ‘well-conditioned’
• ‘standard’ error distribution
• Polynomials
• Exponentials (e.g., Gaussian, Poisson)
• Piece-wise linear
Non-Linear Models
Assumption: data is representative
• ‘universal’ modeling tools
• fast execution
• no linearity assumptions
• lots of parameters, many techniques
• difficult to explain
Artificial Neural Network
User Story: Predict Retention / Attrition
Historical Behavioral Data
Columns: Customer Rating | Retention | Customer Name | Loyalty Member | Days Since Last Purchase | Immediate Relatives | Household Children | Customer ID | Latest Purchase Price | Latest Purchase Item ID | Region Code | Customer Capture Method | Customer Contact Code | Domicile
1 1 Allen, Geraldine yes 29 0 2 24160 211.39 B5 MW 2 6 St Louis, MO
1 1 Anderson, Harry no 48 0 3 19952 26.55 E12 NE 3 New York, NY
1 1 Andrews, Cynthia yes 63 1 0 13502 77.95 D7 NE 10 6 Hudson, NY
1 0 Andrews, Thomas Jr no 39 0 0 112050 0 A36 SW Los Angeles, CA
1 1 Appleton, Mary yes 53 2 3 11769 51.49 C101 NE D Bayside, Queens, NY
1 0 Ashbury, Jeffrey no 47 1 0 PC 17757 29.99 C62 C64 NE 124 New York, NY
1 1 Aston, Mrs. yes 18 1 0 PC 17757 29.99 C62 C64 NE 4 New York, NY
1 1 Barber, Ellen yes 26 0 2 19877 78.85 S 6
1 1 Barkley, Henry no 80 0 0 27042 30 A23 NE B Yorktown, PA
1 0 Baumann, David no 0 0 PC 17318 25.99 NE New York, NY
1 1 Bazzeno, Alice yes 32 0 1 11813 76.95 D15 C 8 34
1 0 Beattie, Mr. Samuel no 36 0 0 13050 75.29 C6 C A 11 Winnipeg, MN
1 1 Beckworth, June yes 47 1 1 11751 52.49 D35 NE 5 New York, NY
1 1 Behr, John no 26 0 0 111369 30 C148 NE 5 New York, NY
1 1 Biden, Roseanne yes 42 0 0 PC 17757 127.99 C 4
1 1 Bird, Ellen yes 29 0 0 PC 17483 18.95 C97 S 8
1 0 Birnbaum, Jason no 25 0 0 13905 26 C 148 San Francisco, CA
User Story: Predict Customer Retention / Attrition
Machine Learning Processing Chain - Training
User Story: Predict Retention / Attrition
Machine Learning Processing Chain - Prediction
Reward predicted ‘retainees’ with
targeted product offerings
Give potential attrition customers special
incentives to stay with the business
User Story: Accurate vs. Useful Prediction
Sparse data + Least-Squares (Linear) Classifier
• Task: predict chance of purchasing a sundry item
• Result: ‘best’ model always predicts “none”
• Analysis: LS algorithm assumes all errors are equal
Columns: Bread | Cake & Pie | Chocolate | Coffee | Cookie | Diesel | Juice & Smoothies | Lubricants | Milk | Other Bakery | Premium | Sandwich | Snack | Tea | Total Transaction | Total Revenue
0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 3000
0 0 0 0 0 3 0 0 0 0 0 0 0 0 3 2000
0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 1800
0 0 0 0 0 5 0 0 0 0 0 0 0 0 6 4800
0 0 0 2 0 0 0 0 0 0 0 0 0 0 2 100
0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1828
0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 16460
0 0 0 0 0 2 0 0 0 0 0 0 0 0 2 1000
0 0 0 0 0 2 0 0 0 0 0 0 0 0 2 1500
0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 4600
0 0 0 0 0 11 0 0 0 0 0 0 0 0 11 19381.5
0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1860
0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 3000
0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 9838.82
0 0 0 0 0 0 0 0 0 0 0 0 0 0 22 11000
0 0 0 0 0 5 0 0 0 0 0 0 0 0 19 18225
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 500
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 800
0 0 0 0 0 0 0 0 0 0 0 1 0 0 7 7990
0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 3820
0 0 0 0 0 1 0 0 0 0 0 0 0 0 55 43230
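The gap between an accurate model and a useful one can be sketched with a constant predictor standing in for the degenerate least-squares result described above. This is not the classifier from the talk: the labels, the alternative predictor, and the error costs (`miss_cost`, `false_alarm_cost`) are all hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_cost(y_true, y_pred, miss_cost=10.0, false_alarm_cost=1.0):
    """Total cost when a missed buyer is assumed to hurt more than a false alarm."""
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            cost += miss_cost          # buyer predicted as "none"
        elif t == 0 and p == 1:
            cost += false_alarm_cost   # non-buyer predicted as buyer
    return cost

# Hypothetical sparse labels: 5 buyers (1) among 100 customers.
y_true = [0] * 95 + [1] * 5

always_none = [0] * 100          # the "best" equal-error model: never predicts a purchase
flagging = [0] * 85 + [1] * 15   # flags 15 customers: all 5 buyers plus 10 false alarms

acc_none, acc_flag = accuracy(y_true, always_none), accuracy(y_true, flagging)
cost_none, cost_flag = weighted_cost(y_true, always_none), weighted_cost(y_true, flagging)

print("always none:", acc_none, cost_none)
print("flagging:   ", acc_flag, cost_flag)
```

Under these assumed costs the less accurate predictor is the more useful one, which is the point of treating errors unequally.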
Clustering/Segmentation – group think
Collaborative Filtering
Relationship Matrix
Personalization – not really
!=
Clustering/Segmentation
Similarity?
Customer | Browser | Gender | Age Sector | Income Sector | Married | Children | Homeowner | Recent Baby Clothes Purchase
George | IE9 | M | 0 | A | N | 0 | 1 | N
Carol | Chrome | F | 1 | B | Y | 1 | 0 | Y
Mary | IE9 | F | 0 | A | N | 1 | 0 | Y
Dist(George,Carol) = 8
Dist(George,Mary) = 4
Dist(Carol,Mary) = 4
Can you afford to target (George,Mary) the same way as (Carol,Mary) ?
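The distances on this slide are consistent with a simple count of mismatched attributes (a Hamming distance); the slide does not name the measure, so that reading is an assumption. A minimal sketch that reproduces the three values and shows which attributes drive each one:

```python
# Attribute order follows the slide's columns.
FIELDS = ("Browser", "Gender", "Age Sector", "Income Sector",
          "Married", "Children", "Homeowner", "Recent Baby Clothes Purchase")

customers = {
    "George": ("IE9", "M", 0, "A", "N", 0, 1, "N"),
    "Carol": ("Chrome", "F", 1, "B", "Y", 1, 0, "Y"),
    "Mary": ("IE9", "F", 0, "A", "N", 1, 0, "Y"),
}

def differing_fields(a, b):
    """Names of the attributes on which two customers disagree."""
    return [f for f, x, y in zip(FIELDS, customers[a], customers[b]) if x != y]

def dist(a, b):
    """Assumed measure: number of mismatched attributes (Hamming distance)."""
    return len(differing_fields(a, b))

for pair in (("George", "Carol"), ("George", "Mary"), ("Carol", "Mary")):
    print(pair, dist(*pair), differing_fields(*pair))
```

George and Mary sit at the same distance as Carol and Mary, but for entirely different reasons: the first pair differs on gender, children, home ownership, and the baby-clothes purchase, the second on browser, age, income, and marital status. A single distance number hides that.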
Clustering/Segmentation
Basic Question – which one describes the data the best?
Raw data
How many clusters are there ?
Two Clusters
Four Clusters
Six Clusters
Clustering/Segmentation with Statistics
• relatively simple
• data distribution assumptions
• initialization dependencies
[Figure: three scatter-plot panels, each with both axes running from 0 to 100: Raw Data, Ellipsoidal Clustering, K-Means Clustering]
Clustering/Segmentation – data driven
• let the data speak for itself
• multiple data projection ‘views’
• important boundary relationships
(“swing voters”)
Customer Demographics
User Story: Clustering / Segmentation
ML Clustering – Training
ML Clustering – Processing New Data
Model Selection – how to choose?
• Basic Model Type (prediction or segmentation)
  • inputs + correlated outputs
  • inputs only?
• Basic Questions:
  • what to use for my problem?
  • parameters?
  • is this the best choice?
  • could I do better, and how?
Optimization – Evolving better solutions
• Simulated Evolution
  • fast, efficient search
  • always have a solution
  • arbitrary ‘evaluation’ functions
  • can start with existing solution(s)
• Variation – alter model type, parameters
• Assessment – how well does the model work?
• Selection – survival of the fittest
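The variation, assessment, and selection cycle can be sketched as a small elitist evolutionary loop. This is a generic illustration, not RedPoint's implementation: the function names, the Gaussian mutation, the population sizes, and the toy evaluation function are all assumptions.

```python
import random

def evolve(evaluate, parents, generations=50, offspring_per_parent=4, sigma=0.5, seed=0):
    """Minimize `evaluate` starting from existing candidate solutions."""
    rng = random.Random(seed)
    population = [list(p) for p in parents]
    mu = len(population)
    history = []
    for _ in range(generations):
        # Variation: perturb each parent's parameters with Gaussian noise.
        children = [[g + rng.gauss(0.0, sigma) for g in parent]
                    for parent in population
                    for _ in range(offspring_per_parent)]
        # Assessment: score parents and children with the evaluation function.
        ranked = sorted(population + children, key=evaluate)
        # Selection: keep the best `mu`; parents survive unless a child beats them.
        population = ranked[:mu]
        history.append(evaluate(population[0]))
    return population[0], history

def evaluate(params):
    """Hypothetical score: non-smooth, with a rounding step, so no gradient is available."""
    x, y = params
    return abs(x - 3.0) + abs(round(y) + 1)

best, history = evolve(evaluate, parents=[[0.0, 0.0], [5.0, 5.0]])
print(best, history[-1])
```

Because parents compete with their own offspring, the best score never gets worse from one generation to the next, so a usable solution is available at any point the search is stopped.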
Evolutionary Optimization – Evaluation Function
• can use any measurable data
• no continuity assumptions
• no differentiability assumptions
• no symmetry assumptions
Evaluation matrix (axes: Prediction, Reality (Truth))
            Sunshine   Hurricane
Sunshine       20        -1000
Hurricane       5           50
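An asymmetric evaluation function of this kind can be scored directly. The sketch below reads the slide's values as 20, -1000, 5, and 50 and assumes that -1000 is the payoff for predicting sunshine when a hurricane arrives; both the split of the numbers and the axis orientation are assumptions, and the forecasts are hypothetical.

```python
# Keys are (prediction, reality). Orientation is assumed, see the note above.
PAYOFF = {
    ("sunshine", "sunshine"): 20,
    ("sunshine", "hurricane"): -1000,
    ("hurricane", "sunshine"): 5,
    ("hurricane", "hurricane"): 50,
}

def total_payoff(predictions, truths):
    """Sum the payoff of each forecast against what actually happened."""
    return sum(PAYOFF[(p, t)] for p, t in zip(predictions, truths))

def accuracy(predictions, truths):
    """Fraction of forecasts that were correct."""
    return sum(p == t for p, t in zip(predictions, truths)) / len(truths)

# Hypothetical ten-day record: nine sunny days, then a hurricane.
truth = ["sunshine"] * 9 + ["hurricane"]

always_sunshine = ["sunshine"] * 10
cautious = ["sunshine"] * 7 + ["hurricane"] * 3  # two false alarms, one hit

for name, forecast in (("always sunshine", always_sunshine), ("cautious", cautious)):
    print(name, accuracy(forecast, truth), total_payoff(forecast, truth))
```

The cautious forecaster is less accurate yet scores far higher, because the evaluation function is neither symmetric nor tied to a plain error count.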
User Story: Optimizing Classification Models
Task: Predict Retention/Attrition
[Figure: Model Performance Optimization. X-axis: Generation (0 to 6); y-axis: Performance (0.00 to 100.00)]
Classification Accuracy: 62, 70.2, 72.3, 73.4, 75.2
Test Set Error (RMS): 34.8, 28.8, 24.5, 22.1, 20.9
17 Potential input features (customer demographics)
2 outputs (retention/attrition)
1300 Training Samples (50 – 50, A / B Split)
1300 Test Samples (naïve test data)
Use Case – Fully Adaptive Feedback (Next Best Offer)
DB
Historical User Behavior
(stimulus/response)
Train / Update Model
Non-Adaptive (Fixed) Mode
Randomized A/B/C Offer Selection
Adaptive ML Mode
ML Prediction Offer Selection
Operation (Trigger)
Ad / Offer (stimulus)
Feedback Cycle
Five Keys to Successful Machine Learning
• Let the data speak for itself – don’t force fit your models
• Remember, not all errors are equal – use this to your advantage
• True learning requires continual adaptation!
• Automate the process with feedback – remove the “man-in-the-loop”
• Trust the optimization process – it really works!
Q&A
Contact Info
Visit: www.redpoint.net
Bill Porto
Sr. Engineering Analyst
RedPoint Global
[email protected]
Want More Information about this topic?
Fill out your card or go to redpoint.net/hadoopeurope