

Page 1: Getting Started with Big Data - ASQ Section 1414..."From the book 'Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die' by Eric Siegel," Predictors “Big Data”

©2017 Firefly Consulting, All Rights Reserved1

A Lean Six Sigma Practitioner’s Guide

Getting Started with Big Data

Kristine Bradley, Principal, Firefly Consulting


WHY CARE ABOUT BIG DATA?

“Data Scientist: The Sexiest Job of the 21st Century.” – Harvard Business Review article

GE has bet big on the Internet of Things – committing $1B to put sensors on gas turbines, jet engines, and other machines, connecting them to the cloud and analyzing the resulting flow of data to identify ways to improve machine productivity and reliability. – MIT Sloan Case Study

CNN recently stated that “the amount of data captured globally is estimated to reach 40 zettabytes by 2020.” That’s 40 with 21 zeros!


BIG DATA APPLICATIONS

Financial Services
▪ Reduced credit card fraud
▪ Decreased loan default rate
▪ Increased response rate with significantly reduced mailing costs

Manufacturing
▪ Supply chain analytics
▪ Improved process monitoring and control
▪ Reduced equipment downtime

Healthcare
▪ Cancer detection
▪ Hospital readmission
▪ Nonadherence to medication prescriptions
▪ Billing errors

Daily Life
▪ Airfare pricing optimization
▪ Personalized product recommendations
▪ Tax returns
▪ Casinos


FUN FACT

Airline customers who pre-order a vegetarian meal are more likely to make their flight on time.

– An airline study


BIG DATA AND LEAN SIX SIGMA

[Figure: diagram relating Lean Six Sigma Skills, Business Skills, Data Science, and IT Skills]

Expertise needed for Big Data solutions:

− Data Science

− IT Skills

− Business Skills

Expertise needed for Lean Six Sigma solutions:

− Business Skills

− Data Analysis Skills

− IT Partnership


SIMILARITIES AND DIFFERENCES

What’s New?

▪ More data – linkage to external data
▪ More powerful analytics
▪ Stronger systems linkage
▪ More real time visualization
▪ Links with Artificial Intelligence and the Internet of Things

What’s the Same?

▪ Up to 80% of the work can be in the data preparation
▪ To get the value, you still have to do something with it!
▪ Analysis using statistical tools
▪ Understand the relationship between your inputs (x’s) and outputs (y’s)
▪ Correlation still does not equal causation


FUN FACT

Hungry judges rule negatively on parole decisions. Your chance of a favorable parole ruling is about 65% right after a food break and drops to nearly 0 just before the next break.

– Columbia and Ben-Gurion Universities


DEFINING THE “BIG DATA” TERMS

Terms:
▪ “Big Data”
▪ “Big Data Analytics”
▪ “Business Analytics”
▪ “Predictive Analytics”
▪ “Business Intelligence”
▪ “Artificial Intelligence”
▪ “Internet of Things” (“IoT”)
▪ “Data Mining”

Inputs: Individual Characteristics, Attributes, Factors, Variables, Predictors, X’s

Output: Target, Prediction, Outcome, Response, Y

[Figure: flow diagram with boxes labeled Business Need, Internal Data, External Data, Coding, Modeling (“Machine Learning”), Predictive Model, Prediction, and Business Insights]


“BIG DATA” ANALYTICS PROCESS

Business Question
• Cast the business problem or goal into one or more modeling problems
• Determine use scenario

Acquire Data
• Identify data sources
• Understand data
• Evaluate cost/benefit of sources
• Extract data

Prepare the Data
• Clean up the data – structure, missing values
• Visualize the data
• Dimensionality reduction and/or feature selection
• Validate the data

Choose Algorithm
• Select the modeling technique(s) that will best solve your modeling problem and suit your use scenario

Build Model
• Utilize statistical software to build model

Test and Evaluate Model
• Use test data set to assess model accuracy and reliability
• Assess if model satisfies original business goal
• Beware of overfitting

Deploy Model
• Code model into production systems
• Make near or real time decisions

Extract Insights
• Use model to solve business problem

[Figure: the process steps are annotated with the DMAIC phase labels DEFINE, MEASURE, ANALYZE, IMPROVE/CONTROL, and IMPROVE]


FUN FACT

Online loan applicants who complete the form using correct capitalization are the most likely to pay on time; those who use all lowercase are next most likely, and those who use all caps are least likely.

– Financial services startup

Examples: “My Name”, “my name”, “MY NAME”


DIFFERENT BUSINESS QUESTIONS REQUIRE DIFFERENT TOOLS

Questions are ordered by specificity, from general (1) to specific (5):

1 Who are the most profitable customers? – Data Retrieval and Visualization
2 Is there a difference between profitable and average customers? – Statistical Hypothesis Testing
3 What are common characteristics of profitable customers? – Similarity and Clustering
4 Will a particular customer be profitable? – Classification
5 How much potential revenue can I generate from this particular customer? – Prediction


WHAT TOOLS DO YOU HAVE NOW AS AN LSS PROFESSIONAL?

▪ Data Retrieval and Visualization
− Basic statistics
− Graphical tools
− Measurement System Analysis
− Control Charts

▪ Statistical Hypothesis Testing
− T-Tests
− ANOVA

▪ Prediction
− Multiple Linear Regression
− General Linear Model


FAMILIAR PREDICTION TOOLS

Business Question: Prediction

Statistical Tools and Descriptions
▪ Linear Regression: Models a straight-line relationship between continuous predictors and a single response variable
▪ Nonlinear Regression: Models a nonlinear curve – concave, convex, exponential, S-shaped, asymptotic, etc.
▪ General Linear Model: Uses ANOVA and regression to model the relationship between continuous or attribute predictors and a continuous response

Example Applications
▪ Financial Services: Premium table development in property insurance
▪ Healthcare: Predict future healthcare costs using prior costs, demographics and diagnoses
▪ Manufacturing: Develop acceptable ranges for input materials to optimize pharmaceutical particle size


EXAMPLE: LINEAR REGRESSION IN FINANCIAL SERVICES

Business Need: Develop a premium table for property insurance

Predictors (“Big Data”): driver age, credit score, gender, auto attributes…

Modeling (“Machine Learning”): Linear Regression

Prediction: Target = Predicted Claims

Business Insights: Use predicted claims to set better premiums and reduce risk

$$\text{Predicted Claims} = \beta_0 + \beta_1\,\text{Age} + \beta_2\,\text{Credit Score} + \beta_3\,\text{Gender} + \cdots + \beta_x$$
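As a rough illustration of the premium-table idea, the sketch below fits a least-squares line with a single predictor (driver age) instead of the several predictors named on the slide. The data are synthetic and the function name `fit_simple_ols` is hypothetical; a real model would be built in statistical software on actual claims history.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope is the co-variation of x and y divided by the variation of x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Synthetic, illustrative data only: driver age vs. annual claims in dollars.
ages = [20, 30, 40, 50, 60]
claims = [800, 700, 600, 500, 400]

b0, b1 = fit_simple_ols(ages, claims)

# Predicted claims for a 35-year-old driver under this toy model.
predicted_claims_35 = b0 + b1 * 35
print(f"Predicted Claims = {b0:.1f} + ({b1:.1f}) * Age")
print(f"Age 35 -> {predicted_claims_35:.1f}")
```

With more predictors (credit score, gender, auto attributes) the same least-squares idea applies, but the fit is normally left to a statistics package rather than hand-written formulas.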


KEEP IN MIND

Bigger Picture Things

▪ Many tools can solve the same types of problems in different ways
− Tool selection is sometimes an art vs. a science
▪ Model validation is required
− Set aside 20-50% of your data points to assess model accuracy
▪ Overfitting
− Given enough data and variables, something will correlate
− Consider diminishing returns

Nerdy Things

▪ Multicollinearity
− Lots of data and lots of variables bring risk of double dipping
▪ Nonlinear responses
− Responses aren’t always straight lines
▪ Standardization
− You may need to standardize your data to eliminate differences in variable scale
▪ Homoskedasticity
− Important to linear regression models
− It’s also fun to say
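Two of the points on this slide, setting data aside for validation and standardizing variables, can be sketched in a few lines. This is a minimal illustration using hypothetical credit scores; the helper names `holdout_split` and `standardize` are made up for this example, and the 30% holdout is simply one value inside the 20-50% range mentioned on the slide.

```python
import random
from statistics import mean, pstdev


def holdout_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of rows and return (train, test)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def standardize(values, center, scale):
    """Convert values to z-scores using a supplied mean and standard deviation."""
    return [(v - center) / scale for v in values]


# Hypothetical credit scores, for illustration only.
scores = [580, 610, 640, 655, 670, 690, 705, 720, 760, 800]
train, test = holdout_split(scores, test_fraction=0.3)

# Scaling parameters come from the training data only,
# so the held-out data does not leak into model building.
mu, sigma = mean(train), pstdev(train)
train_z = standardize(train, mu, sigma)
test_z = standardize(test, mu, sigma)

print(len(train), "training rows,", len(test), "held-out rows")
```

The held-out rows are used only after the model is built, to check whether accuracy on unseen data matches accuracy on the training data; a large gap is a common sign of overfitting.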


FUN FACT

Liking “curly fries” on Facebook is a predictor of high intelligence.

– University of Cambridge and Microsoft Research


A SELECTION OF NEXT LEVEL TOOLS

Supervised Tools (you have an output value you are trying to predict)

▪ Regression
− Linear Regression
− Nonlinear Regression
− General Linear Model
− Regression Tree/Forest
− Gaussian Process
− Support Vector Machines (SVM)
− Neural Network Regression
− Nearest Neighbor Methods

▪ Classification
− Decision Trees, Forests
− Neural Network
− Naïve Bayes
− k-Nearest Neighbor
− Discriminant Analysis
− Logistic Regression
− Support Vector Machines

Unsupervised Tools (you do not have a specific output value)

▪ Clustering
− Hard Clustering: Hierarchical, k-Means
− Soft Clustering: Fuzzy c-Means, Gaussian Mixture Model

▪ Anomaly Identification
− One Class SVM
− k-Nearest Neighbor
− Principal Component Analysis

▪ Data Reduction Methods
− Principal Component Analysis (PCA)
− Factor Analysis

Other

▪ Natural Language Processing
▪ Image Processing / Pattern Recognition

Examples follow


LOGISTIC REGRESSION – THE NEXT TOOL TO ADD TO YOUR KIT

Business Question: Classification

Statistical Tool: Logistic Regression

Description: Regression where the dependent (target) variable is binary or categorical

Applications
▪ Financial Services: Predict likelihood that a consumer will accept or reject credit card offer
▪ Healthcare: Quantify odds of developing post surgical site infection
▪ Manufacturing: Predict product pass or fail based on upstream sensor data


EXAMPLE: LOGISTIC REGRESSION IN HEALTHCARE

Business Need: Predict which patients are high risk for readmission within 30 days

Predictors (“Big Data”): underlying diagnosis, age, discharge day, days to follow up visit post discharge, nurse call follow up…

Modeling (“Machine Learning”): Logistic Regression

Prediction: Target = Probability of Post Discharge Readmission (< 30 Days)

Business Insights: Improve patient outcomes and reduce costs by identifying and addressing readmission risk factors

$$\ln\left(\frac{p_{\text{readmission}}}{1 - p_{\text{readmission}}}\right) = \beta_0 + \beta_1\,\text{Diagnosis} + \beta_2\,\text{DC Day} + \cdots + \beta_6\,\text{Follow Up}$$
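The logistic regression model for readmission works on the log-odds scale, so a fitted model's output has to be converted back into a probability. The sketch below shows that conversion. The coefficient values and the 0/1 coding of the predictors are hypothetical placeholders chosen for illustration, not results from any study.

```python
import math

# Hypothetical coefficients on the log-odds scale (illustration only).
coefficients = {
    "intercept": -2.0,
    "diagnosis": 0.8,   # 1 if the patient has the higher-risk diagnosis
    "dc_day": 0.4,      # 1 if discharged on a higher-risk day
    "follow_up": -0.6,  # 1 if a nurse follow-up call was made
}


def log_odds_for(patient):
    """Linear predictor: intercept plus coefficient times value for each predictor."""
    z = coefficients["intercept"]
    for name, value in patient.items():
        z += coefficients[name] * value
    return z


def readmission_probability(log_odds):
    """Invert the logit: p = 1 / (1 + exp(-log_odds))."""
    return 1.0 / (1.0 + math.exp(-log_odds))


patient = {"diagnosis": 1, "dc_day": 1, "follow_up": 0}
p = readmission_probability(log_odds_for(patient))
print(f"Estimated readmission probability: {p:.2f}")
```

A log-odds of 0 corresponds to a probability of 0.5; positive coefficients push the probability up and negative ones pull it down.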


FUN FACT

If you buy diapers from a pharmacy, you are more likely to also buy beer – NCR and Osco Drug study


SOME ADDITIONAL CLASSIFICATION METHODS

Business Question: Classification

Statistical Tools and Descriptions
▪ k-Nearest Neighbor: Categorizes data based on where their nearest neighbors are in the data set.
▪ Classification or Decision Trees, Forests: Easy-to-use method that allows you to predict responses to data by following a series of branching conditions leading to a binary or categorical response.
▪ Discriminant Analysis: Classifies data by finding linear combinations of features.
▪ Other Methods: Neural Network, Naïve Bayes, Support Vector Machines

Example Applications
▪ Manufacturing: Using logged machine sensor data to predict equipment failures before they happen as part of a Total Productive Maintenance system
▪ Healthcare: Predicting pulmonary tuberculosis in hospitalized patients
▪ Retail: Consumer decision trees that classify shopper behavior and quantify shopper decision making

This is still only a partial list!
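To make the k-Nearest Neighbor description concrete, here is a minimal sketch that classifies a new machine reading by majority vote among its closest historical readings. The sensor values, labels, and the function name `knn_predict` are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Label a query point by majority vote among its k nearest training points.

    training is a list of (features, label) pairs; distance is Euclidean.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical logged readings: (temperature, vibration) with the observed outcome.
history = [
    ((60.0, 0.20), "ok"),
    ((62.0, 0.30), "ok"),
    ((61.0, 0.25), "ok"),
    ((85.0, 0.90), "failure"),
    ((88.0, 1.10), "failure"),
    ((90.0, 1.00), "failure"),
]

prediction = knn_predict(history, (87.0, 0.95), k=3)
print(prediction)
```

Because the method relies on distances, variables on large scales (temperature here) dominate those on small scales (vibration) unless the data are standardized first.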


CLUSTERING METHODS

Methods that assign data (or variables) into similar groups. Unlike classification, groups are not known beforehand.

Business Question: Similarity and Clustering

Statistical Tools and Descriptions
▪ Hierarchical: Creates nested sets of clusters by measuring similarities between pairs of objects and grouping them into a tree. Produces a dendrogram graphic that shows the hierarchy.
▪ k-Means: Partitions data into k mutually exclusive clusters based on the distance between each data point and the cluster’s center.
▪ Fuzzy c-Means: Similar to k-Means, but allows for overlap of the clusters. Clusters are not mutually exclusive.

Applications
▪ Financial Services: Place securities into groups based on similarities found amongst returns and investment strategies
▪ Healthcare: Identifying subgroups of patients with similar condition patterns to drive targeted care management
▪ Manufacturing: Part family identification for cell design and optimization
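The k-Means idea (assign each point to the nearest center, then move each center to the mean of its points) can be sketched as follows. This is a bare-bones version of the standard Lloyd iteration with caller-supplied starting centers; the data points are hypothetical, and production tools add smarter initialization and multiple restarts.

```python
import math
from statistics import mean


def k_means(points, centers, iterations=10):
    """Run Lloyd's algorithm from the given starting centers.

    Returns (centers, labels), where labels[i] is the cluster index of points[i].
    """
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(mean(dim) for dim in zip(*members)))
            else:
                new_centers.append(old)  # leave an empty cluster's center in place
        if new_centers == centers:
            break  # assignments have stabilized
        centers = new_centers
    return centers, labels


# Hypothetical two-feature data forming two visibly separate groups.
points = [(1, 8), (2, 9), (1, 9), (8, 2), (9, 1), (9, 2)]
centers, labels = k_means(points, centers=[points[0], points[3]])
print(labels)
```

Each point ends up in exactly one cluster, which is what makes k-Means a hard clustering method; Fuzzy c-Means would instead give each point a degree of membership in every cluster.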


EXAMPLE: CLUSTERING IN RETAIL

Business Need: Offer relevant similar items to customers during online shopping for whiskey

Predictors (“Big Data”): A historical database of review descriptions: Color, Nose, Body, Palate, Finish

Modeling (“Machine Learning”): k-Means Clustering

Prediction: Offer a group of whiskeys as alternate choices to customer’s first selection

Business Insights: Provides customer options, and keeps them on your site


ADVANCED PREDICTION TOOLS

Business Question: Prediction

Statistical Tools and Descriptions
▪ Regression Tree/Forest: Similar to decision trees for classification, but predicts a continuous response rather than a categorical one.
▪ Gaussian Process: Nonparametric models often used for spatial data.
▪ Support Vector Machines (SVM): Fits a “hyperplane” that deviates from measured data by no more than a small amount.
▪ Others: Neural Network Regression, Nearest Neighbor Methods

Example Applications
▪ Financial Services: Predicting likelihood a mortgage will go into default or be paid off early
▪ Healthcare: Predicting hospital average length of stay
▪ Manufacturing: Predicting wafer reject rates in semiconductor manufacturing

And more!


EXAMPLE: REGRESSION TREES IN MORTGAGES

"From the book 'Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die' by Eric Siegel,"

Business Need: Predict likelihood that mortgage will be paid off early

Predictors (“Big Data”): Interest rate, income, payoff amount, property type, loan to value ratio…

Modeling (“Machine Learning”): Regression Tree

Prediction: Target = probability of prepayment or default

Business Insights: Bank can screen refinance applicants and make better business decisions


REGRESSION TREE INTERPRETATION

▪ An application would be processed through the tree to calculate likelihood of pre-payment
▪ Greatest risk:
− Interest ≥ 7.94
− Mortgage ≥ $182,926
− Property not a condo or co-op

[Figure: regression tree diagram with Yes/No branches at each split]
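Processing an application through a tree amounts to answering a chain of yes/no questions. The sketch below encodes only the greatest-risk path listed on this slide, using its three thresholds; every other branch is collapsed into a generic leaf because the rest of the tree is not recoverable from the transcript. The leaf names and the function name `prepayment_leaf` are hypothetical, and no fitted leaf values are shown because the slide's figures did not survive extraction.

```python
def prepayment_leaf(interest_rate, mortgage_amount, property_type):
    """Walk a simplified tree and return the name of the leaf an application lands in."""
    if interest_rate >= 7.94:
        if mortgage_amount >= 182_926:
            if property_type not in {"condo", "co-op"}:
                # All three conditions on the slide's greatest-risk path are met.
                return "greatest_risk"
            return "high_rate_large_loan_condo_or_coop"
        return "high_rate_smaller_loan"
    return "lower_rate"


# Hypothetical applications, for illustration only.
print(prepayment_leaf(8.5, 200_000, "single-family"))
print(prepayment_leaf(6.0, 250_000, "condo"))
```

In a fitted regression tree each leaf would carry a numeric prediction (for example, the average observed prepayment rate of the training records that reached it), and the splits would be chosen by the software rather than written by hand.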


DATA REDUCTION METHODS

Methods that help you reduce the number of variables in your models to reduce collinearity, model noise and risk of overfit. Very helpful in regression models.

Business Question: Similarity and Clustering

Statistical Tools and Descriptions
▪ Principal Component Analysis: Transforms the data so that most of the variance in your data is accounted for in the first few principal components.
▪ Factor Analysis: Identifies underlying correlations between variables in your data set so you can identify commonality amongst factors.

Applications
▪ Model improvement
▪ Key factor identification


FUN FACT

Your reliability as a debtor varies by your use of your credit card: at a pool hall (less reliable); at the dentist (more reliable); to buy felt pads for under your furniture legs (most reliable) – 2002 study by Canadian Tire


CONCLUSIONS

▪ Big Data is more than a buzzword

▪ We are in a new age of analytics

▪ There is opportunity for LSS practitioners to expand skills in this rapidly growing and complementary area

▪ What data exists in your organization today where you could start to apply these tools?


SOURCES AND RESOURCES

▪ Eric Siegel, Predictive Analytics, Wiley, 2016 (particularly the Fun Facts)

▪ Foster Provost and Tom Fawcett, Data Science for Business, O’Reilly, 2013

▪ Tools and Tutorials for Data Mining and Predictive Analytics Software: https://www.salford-systems.com/


▪ Contact me at http://www.firefly-consulting.com

THANK YOU!