Supervised and Unsupervised Learning

Kwok-Leung Tsui
Industrial & Systems Engineering
Georgia Institute of Technology

1/7/2009



Data Mining (KDD) Process

Determine Business Objectives

Data Preparation

Mining & Modeling

Consolidation and Application


Data Mining and Modeling


Data Mining & Modeling

[Flowchart] Process steps: Start; Choose Models; Build/Fit Model; Refine / Tune Model (model size & diagnosis); Evaluate Model (e.g. prediction error); Meet accuracy requirement? (YES: Prediction, Make Decisions; NO: Collect more data, Consider Alternate Models).

Data sets shown: Sample Data, Train Data, Validation Data, Test Data (Evaluation Data), Score Data.


Data Description & Visualization

• Descriptive statistical measures
  – Central tendency / location
  – Dispersion / spread
  – Shape & symmetry

• Class Characterization and Comparisons
  – Analytical characterization
  – Attribute relevance analysis
  – Class discrimination and comparisons

• Data Visualization
  – Scatter-plot matrix & density plot
  – 3-D stereoscopic scatter-plot
  – Parallel coordinate plot
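The descriptive measures listed on this slide can be illustrated with a small sketch. The sample values and the `skewness` helper are assumptions made for illustration; the slides do not prescribe particular formulas.

```python
from statistics import mean, median, pstdev

def skewness(values):
    """Population skewness: the average cubed z-score (0 for symmetric data)."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Hypothetical sample with one unusually large value
data = [2.0, 3.0, 3.0, 4.0, 13.0]

summary = {
    "mean": mean(data),              # central tendency
    "median": median(data),          # central tendency, less affected by the large value
    "std": pstdev(data),             # dispersion
    "range": max(data) - min(data),  # spread
    "skewness": skewness(data),      # shape: positive means a longer right tail
}
```

The gap between the mean and the median, together with the positive skewness, signals the asymmetry that a density plot would show visually.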


Supervised & Unsupervised Learning

• Supervised learning:
  – Learning with a teacher
  – Classification, e.g. online shoppers (buyers vs. non-buyers)

• Unsupervised learning:
  – Learning without a teacher
  – Clustering, e.g. online shoppers (segmentation of non-buyers)

• Other related terms:
  – Machine Learning (analogies to human receiving)
  – Neural Networks (biological analogies to brain)


Supervised Learning

• Inputs (predictors, independent variables, x)
  – A set of variables which are measured or preset.

• Outputs (responses, dependent variables, y)
  – A set of measurable variables which are influenced by the inputs.

• Steps:
  – Establish models / systems (y hat) based on collected inputs & outputs (x and y).
  – Predict the values of outputs based on the established models / systems and a new set of specified inputs.
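The two steps (establish a model from collected x and y, then predict for new inputs) can be sketched as follows. This is a minimal illustration that assumes a single input variable and a straight-line model fitted by least squares; the data and the names `fit_line` and `predict` are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Step 1: establish the model y_hat = b0 + b1 * x by least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def predict(model, x_new):
    """Step 2: predict the output for a newly specified input."""
    b0, b1 = model
    return b0 + b1 * x_new

# Hypothetical collected inputs and outputs (generated from y = 1 + 2x)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

model = fit_line(xs, ys)
y_hat = predict(model, 5.0)
```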


Supervised Learning

• Learning with a teacher (generalization)
  – Student presents an answer $\hat{y}_i$ (given $x_i$)
  – Teacher provides the correct answer $y_i$ or an error for the student’s answer
  – The result is characterized by some loss function $L(y, \hat{y})$
  – Objective: minimize the expected loss

• Function approximation: $Y = f(x, \varepsilon)$
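As a rough illustration of a loss function and its expected value, the sketch below averages a loss over observed pairs of correct and predicted answers. The two losses shown (squared error and 0-1 loss) are common choices assumed here; the slide does not say which loss is intended, and the sample average only estimates the expected loss.

```python
def squared_error(y, y_hat):
    """Loss often used for continuous outputs."""
    return (y - y_hat) ** 2

def zero_one(y, y_hat):
    """Loss often used for categorical outputs: 1 for a wrong answer, 0 otherwise."""
    return 0 if y == y_hat else 1

def average_loss(loss, ys, y_hats):
    """Sample estimate of the expected loss over the given pairs."""
    return sum(loss(y, yh) for y, yh in zip(ys, y_hats)) / len(ys)

# Hypothetical teacher answers and student answers
reg_loss = average_loss(squared_error, [1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
clf_loss = average_loss(
    zero_one,
    ["buyer", "non-buyer", "buyer", "buyer"],
    ["buyer", "buyer", "buyer", "non-buyer"],
)
```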


Problems in Supervised Learning

• (Application/Problem Oriented)
  – Classification problem: output is categorical / qualitative.
  – Prediction (regression) problem: output is continuous / quantitative.
  – Forecasting problem: output lies in the future domain.


Supervised Learning Methods

[Diagram: X → Y]

• Y continuous: Prediction (Regression)
• Y categorical: Classification


Statistical Problems and Decision Theory


Formulation of Statistical Problems

• Estimation (Point and Interval)
• Hypothesis Testing
• Ranking and Selection
• Prediction and Forecasting
• Decision Making
• Etc.


Statistical Decision Theory


Statistical Decision Theory


Statistical Decision Theory

Least Squares Estimation


Statistical Decision Theory

Classification & Bayes Classifier


Statistical Decision Theory

Classification & Bayes Classifier

Bayes Classifier: Choose the class with maximum probability
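A minimal sketch of the rule "choose the class with maximum probability", assuming the class priors and the class-conditional probabilities of a single discrete feature are known. All numbers, class labels, and feature values below are hypothetical.

```python
def bayes_classify(priors, likelihoods, x):
    """Return (best_class, posteriors), where posteriors[k] = P(class k | x)."""
    # Bayes' rule: posterior is proportional to prior times likelihood
    joint = {k: priors[k] * likelihoods[k][x] for k in priors}
    evidence = sum(joint.values())
    posteriors = {k: joint[k] / evidence for k in joint}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors

# Hypothetical online-shopper example
priors = {"buyer": 0.2, "non-buyer": 0.8}
likelihoods = {
    "buyer": {"viewed_cart": 0.9, "no_cart": 0.1},
    "non-buyer": {"viewed_cart": 0.1, "no_cart": 0.9},
}

label, post = bayes_classify(priors, likelihoods, "viewed_cart")
other_label, _ = bayes_classify(priors, likelihoods, "no_cart")
```

In practice the priors and likelihoods are unknown and must be estimated from training data, which is where the specific classification methods differ.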


Model Complexity and Prediction/Classification Error


Datasets

• Training Set: dataset used for creating classifiers.

• Testing Set: dataset used for validating the classifier obtained from the training set.


Classification Example

Linear Regression Method for Classification


Classification Example

Nearest Neighbor Method for Classification
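The figure for this example did not survive in the transcript. As a stand-in, here is a minimal sketch of a k-nearest-neighbor classifier, assuming Euclidean distance and majority voting; the training points and labels are made up.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs, where each point is a tuple of numbers.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-class training data
train = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 6.5), "B"), ((6.5, 7.0), "B"),
]

near_a = knn_predict(train, (1.2, 1.1), k=3)
near_b = knn_predict(train, (6.4, 6.6), k=1)
```

Small k gives a flexible (high-complexity) classifier and large k a smoother one, which connects this method to the model complexity discussion that follows.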


[Figure: prediction or classification error (vertical axis) versus model complexity, low to high (horizontal axis), with one curve for training error and one for test error; the overfitting region is marked.]


Training Error, Cross-Validation Error, Testing Error

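[Diagram: the data are divided into testing data and training data; the training data are split into parts 1, 2, 3, ..., K for cross-validation.]

• Fitted model: built using the training data
• Training error: based on the training data
• Cross-validation error: based on the K parts of the training data
• Testing error: based on the testing data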


Cross-Validation Method

[Diagram: the data split into parts 1, 2, 3, ..., 10, shown for the 1st round, the 2nd round, ..., and the 10th round.]
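The round-by-round procedure can be sketched as below, under the usual assumption that each round holds out a different part for evaluation and fits the model on the remaining parts. The helper names, the toy "predict the training mean" model, and the data are hypothetical.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k consecutive folds of nearly equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(xs, ys, k, fit, predict, loss):
    """Average held-out loss over k rounds; each fold is held out exactly once."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = fit(train_x, train_y)
        err = sum(loss(ys[i], predict(model, xs[i])) for i in held_out) / len(held_out)
        fold_errors.append(err)
    return sum(fold_errors) / k

# Toy model for illustration only: always predict the mean of the training outputs
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)

def predict_mean(model, x):
    return model

def squared_error(y, y_hat):
    return (y - y_hat) ** 2

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
folds = k_fold_indices(len(xs), 10)
cv_error = cross_validate(xs, ys, 10, fit_mean, predict_mean, squared_error)
```

With real data the rows are normally shuffled before folding; the consecutive folds here keep the sketch deterministic.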


Models for Supervised Learning

(Methodology/Model Oriented)

• Regression Type Models:
  – Linear Models, GLM, Logistic Regression
  – Generalized additive models (Hastie & Tibshirani, 1990)
  – Classification and Regression Tree (CART) (Breiman, Friedman, Olshen, Stone, 1984)
  – Multivariate Adaptive Regression Spline (MARS)
  – Multiple Additive Regression Tree (MART)
  – Neural Networks


Models for Supervised Learning

• Segmentation Type Models
  – Support vector machines (SVM)
  – Generalized linear discriminant analysis (DA)
  – Flexible DA, Penalized DA, Mixture DA
  – K-Nearest Neighbors (NN), Adaptive k-NN
  – Bayesian Classification
  – Genetic Algorithms
  – Fuzzy Set Classification
  – Classification and Regression Tree (CART)


Supervised Learning

• It is not clear how to categorize the regression and segmentation type models.

• Most regression models can be used for both classification and prediction (regression) problems.

• Segmentation models can also be useful for regression problems, e.g., regression trees, SVR.

• Computer scientists focus on problems while statisticians focus on models (algorithms vs. models, e.g. boosting vs. MART).


Some Characteristics of Different Learning Methods (Hastie et al.)

Methods compared (table columns): Neural Net, SVM, Trees, MARS, K-NN / Kernel, MART

Characteristics rated (table rows):
• Natural handling of data of “mixed” type
• Handling of missing values
• Robustness to outliers in input space
• Insensitive to monotone transformations of inputs
• Computational scalability (large N)
• Ability to deal with irrelevant inputs
• Ability to extract linear combinations of features
• Interpretability
• Predictive power

[Table cells: each method is rated good, fair, or poor on each characteristic; the rating symbols are not preserved in this transcript.]


Unsupervised Learning

• Learning without a teacher

• Statistical Definition
  – Observe N vectors from the population distribution
  – Directly infer properties (e.g. relationships, groupings) of the population distribution

• Dimension of the observations (# of variables or attributes) is often very high (much higher than in supervised learning)

• No clear measure of success
  – Success is often judged (subjectively) by the value of the discovered knowledge or the effectiveness of the algorithm


Problems in Unsupervised Learning

• Association Rule:
  – Single-Dimensional Association Rule from Transaction Database
  – Multi-Level Association Rule from Transaction Database
  – Multi-Dimensional Association Rule from Relational Database and Data Warehouse
  – Correlation Analysis


Association Rules

• Examples:
  – Multi-Level Association Rule:
    • Computer: desktop (IBM, Dell), laptop (Toshiba, Sony)
    • Software: educational (Microsoft, …), financial (…, …)
    • Printer: color (HP, Epson), B/W (HP, Sony)
    • Rule e.g.: {IBM desktop computer => B/W printer}
  – Multi-Dimensional Association Rule:
    • buys(X, “IBM desktop computer”) => buys(X, “Sony B/W printer”)
    • Age(X, “20 to 29”) & Occupation(X, “students”) => buys(X, “laptops”)
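A rule such as {IBM desktop computer => B/W printer} is usually scored by its support and confidence. These two measures are standard in association rule mining but are not defined on the slides, so treat the sketch below, and its made-up transactions, as an assumption about how such a rule would be evaluated.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent => consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical transaction database (one set of purchased items per transaction)
transactions = [
    {"IBM desktop computer", "B/W printer"},
    {"IBM desktop computer", "B/W printer", "financial software"},
    {"IBM desktop computer", "color printer"},
    {"Toshiba laptop"},
]

rule_support = support(transactions, {"IBM desktop computer", "B/W printer"})
rule_confidence = confidence(transactions, {"IBM desktop computer"}, {"B/W printer"})
```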


Algorithms in Unsupervised Learning

• Clustering
  – Partitioning methods:
    • K-means, K-medoids
  – Hierarchical methods:
    • BIRCH, CURE, Chameleon algorithms
  – Density-based methods:
    • DBSCAN, OPTICS, DENCLUE
  – Grid-based methods:
    • STING, WaveCluster, CLIQUE
  – Model-based clustering:
    • COBWEB (tree model)
    • Neural network model
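As one concrete instance of a partitioning method, here is a minimal K-means sketch. It assumes Euclidean distance and a naive initialization from the first k points (practical implementations use random or smarter starts); the points are hypothetical.

```python
import math

def kmeans(points, k, iterations=20):
    """Alternate between assigning points to the nearest center and recomputing centers."""
    centers = [points[i] for i in range(k)]  # naive init: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:  # converged: assignments will not change
            break
        centers = new_centers
    return centers, clusters

# Hypothetical 2-D observations forming two groups
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
centers, clusters = kmeans(points, k=2)
```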


Models for Unsupervised Learning

• Association Rules
  – Market basket analysis
  – Generalized association rules

• Cluster Analysis
  – K-means algorithms
  – Clustering algorithms
  – Combinatorial algorithms

• Other Multivariate Methods
  – Principal components
  – Factor analysis and latent variables
  – Projection pursuit
  – Multi-dimensional scaling
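To make the principal components item concrete, the sketch below finds the first principal component as the leading eigenvector of the sample covariance matrix, using power iteration. The choice of power iteration, the all-ones starting vector, and the data are assumptions for illustration; the sketch also assumes the data are not constant and that the starting vector is not orthogonal to the leading direction.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, variance) of the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d)
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        # Power iteration: multiply by the covariance matrix, then renormalize
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance explained along the direction v (v^T cov v)
    variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    return v, variance

# Hypothetical observations lying exactly on the line y = 2x
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, variance = first_principal_component(rows)
```

Because these points lie on a line, a single component captures all of the variance; real data would need further components.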


Application Examples


Learn a method for predicting the instance class from pre‐labeled (classified) instances

Classification

Classification Models


Find “natural” grouping of data given un‐labeled data

Clustering


Classification Problem

[Figure: fish classification example showing sensing and feature extraction (width, length, lightness, etc.), and a plot of lightness against width for the two classes, salmon and bass.]


Index Problem (Query by Content)

• Given a query functional data item and some similarity measure (e.g., Euclidean distance), find the nearest matching functional data item in the database.

[Figure: query Q (template) compared with database C, which holds signals 1 to 10; C6 is the best match.]
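A minimal sketch of query by content, assuming every signal is sampled at the same points so that Euclidean distance applies directly. The signal values and the reduced three-entry database are hypothetical; only the labels Q and C6 come from the slide.

```python
import math

def best_match(query, database):
    """Return the name of the stored signal closest to the query in Euclidean distance."""
    return min(database, key=lambda name: math.dist(query, database[name]))

# Hypothetical signals, all sampled at the same four points
query = [0.0, 1.0, 2.0, 1.0]
database = {
    "C1": [2.0, 2.0, 2.0, 2.0],
    "C2": [0.0, -1.0, -2.0, -1.0],
    "C6": [0.1, 0.9, 2.1, 1.0],
}

match = best_match(query, database)
```

A linear scan like this costs one distance computation per stored signal, which is why indexing structures matter for large databases.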


Clustering Signals

• Find natural groups (clusters) of the functional data in the database

[Figure: signals 1 to 7 arranged into groups.]


Faulty Signal Detection

• Given a faulty signal from the monitoring procedure, how can it be classified into one of the known classes?

[Figure: signal plots labeled Fault 1, Fault 2, Fault 3, and Fault 4 (horizontal axis 0 to 150), plus two larger plots (horizontal axis 0 to 140), one of them labeled Fault 4.]


Handwriting Recognition


Bioinformatics – Microarray

• Microarray (e.g., 50,000 spots)

[Figure: microarray samples labeled “From Normal” and “From Disease”.]


Bioinformatics – Microarray

• Clustering problem

– Partition the genes into groups or clusters based on their expression patterns.


Bioinformatics – Gene Finding

Input: A DNA string (nucleotides) over the alphabet {A,C,G,T}

Output: An annotation of the string showing, for every nucleotide, whether it is coding (gene) or non‐coding.

AAAGCATGCATTTAACGAGTGCATCAGGACTCCATACGTAATGCCG

Gene finder

AAAGCATGCATTTAACGAGTGCATCAGGACTCCATACGTAATGCCG

Gene !!


Bioinformatics – Protein Structure Prediction

[Figure labels: amino acid sequence, secondary structure, 3D coordinates of atoms.]


Bioinformatics – Metabolomics

Finding inherent metabolic patterns in response to pathophysiological stimuli or genetic modification

[Figures: a spectrum plotted against chemical shift (ppm); principal component score plots (PC1 vs. PC2, and PC1, PC2, PC3) whose points are labeled with hourly time stamps from 8:30 through 8:30* and grouped as morning (7:30-12:30), afternoon/evening (13:30-22:30), and night (23:30-6:30).]


Industrial Projects

• AT&T business data mining
• Inventory management in military maintenance
• Sea cargo demand forecasting
• SMARTRAQ project in transportation policies
• Location problem of letterbox
• Home improvement store shrinkage analysis
• Hotels & Resorts chain data mining
• Fast food drive-through call center


Data Mining in Telecom. (Funded AT&T project)

• ~160 billion dollar per year industry (~70 billion long distance & ~90 billion local)

• 100 million+ customers/accounts/lines

• >1 billion phone calls per day
  – Book closing (estimating this month’s price/usage/revenue)
  – Budgeting (forecasting next year’s price/usage/revenue)
  – Segmentation (clustering of usage, growth, …)
  – Cross selling (association rules)
  – Churn (disconnect prediction & tracking)
  – Fraud (detection of unusual usage time series behavior)
  – Each of these problems is worth hundreds of millions of dollars


• A contractor manages parts inventory for aircraft maintenance

• Characterization and forecasting of demand and lead time distributions

• 60,000 different parts and 500 bench locations

• Data tracked by an automated system

• Demand data not available & stockout penalty

Inventory Management in Air Force (Funded project)


• Sea cargo network optimization

• Contract planning & booking control

• Characterize & forecast sea cargo demand distribution & cost structure

• Improve ocean carrier and terminal operation efficiency

Data Mining in Sea Cargo Application (Funded TLIAP project)


• Strategies for Metropolitan Atlanta’s Regional Transportation & Air Quality

• Five-year project sponsored by Transportation Dept., Federal Highway Admin., EPA, CDC, etc.

• Assess air quality, travel behavior, land use & transportation policies

• Reduce auto-dependence and vehicle emissions

• Highway Design based on detailed GPS data

SMARTRAQ Project for Transportation Policies


• Improve performance of express mail drop-off letter boxes

• 50,000 letter boxes & 8 months of transaction data

• Relate performance to important factors, e.g. regions, demographics, adjacent competition, pick-up schedule

• Comparison with direct competitors

• Customer demand analysis and forecast

Mining of Letter Box Transaction Data


• Inventory shrinkage costs US retailers 32 billion dollars

• Shrinkage = book inventory – inventory on hand

• Working with a home improvement store’s Loss Prevention Group

• Develop predictive model to relate shrinkage to important variables

• Extract hidden knowledge to reduce loss and improve operation efficiency

Data Mining for Shrinkage Analysis in Retail Industry


• Manage chain hotels and resorts of different scales

• Evaluate impact of promotional programs

• Forecasting of customer behavior in frequent stay program

• Monitor performance in customer survey

• Predict performance with important factors

Data Mining for Hotels and Resorts Chain Business


• Centralized call center for drive-through orders, operated by an independent company

• Profit = Revenue (fixed rate per call) – cost (#operators)

• Constraint: 3-second response time and 20-second route back to store

• Objective: Reduce cost by optimizing operation time and scheduling

• Tools: data mining & forecasting, simulation, optimization

Data Mining and Forecasting for Fast Food Call Center


A General Framework for Dynamic Modeling & Activity Monitoring (DMDA)

• Problem: Profile Data
  – Time domain profile
  – Profile w. controllable predictors
  – Profile w. uncontrollable predictors

• Objective
  – Detection/Classification
  – Interpretation
  – Forecasting/Prediction

• Segmentation & Model Selection
  – Segmentation: known or unknown
  – Model selection: global w/o segmentation, global w. segmentation, or local within segment

• Monitoring
  – Phase I: estimating unknown parameters
  – Phase II: monitoring and detecting
  – Anticipated drifts vs. unanticipated changes

• Dynamic Update

• Actions


Applications

• Manufacturing Processes
  – Stamping Tonnage Signal Data (functional data)
  – Mass Flow Controller (MFC) Calibration (linear profile)
  – Vertical Density Profile (VDP) Data (nonlinear profile)

• Service Operations
  – Telecom. Customer Usage
  – Sea Cargo Terminal Operation
  – Used Car Price Mining and Prediction
  – Hotel Performance Monitoring
  – Fast Food Drive-Through Call Center Forecasting & Scheduling


Manufacturing: Stamping Tonnage Signal Data

Figure 2: A Tonnage Signal and Some Possible Faults (Jin and Shi 1999)


Stamping Tonnage Signal Data

• Problem
  – Time domain profile (a tonnage signal represents the stamping force in a process cycle).

• Objective
  – Fault detection and classification

• Segmentation & Model Selection
  – Known segmentation: most process faults occur only in specific working stages. Boundaries and sizes of segments are determined by process knowledge. (Jin and Shi 1999)
  – Global model: wavelet transforms

• Monitoring
  – For each segment, use T² charts based on selected wavelet coefficients to conduct monitoring. (Jin and Shi 2001)

• Dynamic Update
  – Classify a signal as normal, a known fault, or a new fault (abnormal), and update the wavelet coefficient selection and parameter estimates (e.g. μ, Σ) using all available data.

• Actions
  – Identify and remove assignable causes.
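The monitoring step relies on T² charts. As a rough sketch, the Hotelling T² statistic for one observation x is (x − μ)ᵀ Σ⁻¹ (x − μ), with μ and Σ estimated from in-control data. The version below is restricted to two coefficients so the matrix inverse can be written out by hand; the numbers are hypothetical, and the control limit against which T² is compared is not given on the slides.

```python
def hotelling_t2(x, mu, sigma):
    """T^2 = (x - mu)^T Sigma^{-1} (x - mu) for a 2-dimensional observation.

    `sigma` is a 2x2 covariance matrix given as [[a, b], [c, d]].
    """
    (a, b), (c, d) = sigma
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix
    inv = [[d / det, -b / det], [-c / det, a / det]]
    diff = [x[0] - mu[0], x[1] - mu[1]]
    return sum(diff[i] * inv[i][j] * diff[j] for i in range(2) for j in range(2))

# Hypothetical in-control parameters for two selected coefficients
mu = (0.0, 0.0)
t2_uncorrelated = hotelling_t2((1.0, 2.0), mu, [[1.0, 0.0], [0.0, 4.0]])
t2_correlated = hotelling_t2((1.0, 1.0), mu, [[2.0, 1.0], [1.0, 2.0]])
```

A signal would be flagged when its T² value exceeds a chosen control limit; choosing that limit is outside this sketch.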


Telecom. Customer Usage

• Problem
  – Profile with uncontrollable predictors

• Objective
  – Abnormal behavior detection and classification
  – Forecasting/prediction

• Segmentation & Model Selection
  – Unknown segmentation: segment customers based on demographic, geographic, psychographic and/or behavioral information.
  – Local model: fit a model for each customer segment, e.g. linear regression.

• Monitoring
  – Use the model built for each segment to monitor customer behaviors, e.g. monitor the linear regression parameter vector β using a T² chart.

• Dynamic Update
  – Update customer segmentation, segmental model fitting and/or parameter monitoring, e.g. parameter updates based on a known trend.

• Actions
  – Service improvement, customer approval, etc.


Telecom. Customer Usage

Profile: profile with uncontrollable predictors

Objective
  – Abnormal behavior detection and classification
  – Forecasting/prediction

Segmentation
  – Unknown (segments are defined by customer information)

Model Selection
  – Segmental (e.g. linear regression on uncontrollable predictors for each segment)

Monitoring
  – Phase I: unknown control chart parameters estimated from data
  – Phase II: monitoring by control charts, such as the T² chart, EWMA chart, etc.

Actions: service improvement, customer approval, etc.

Dynamic Update
  – Update segmentation, model selection and/or parameter monitoring
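For the EWMA chart mentioned under Phase II, the plotted statistic is the exponentially weighted moving average z_t = λ·x_t + (1 − λ)·z_{t−1}. The sketch below computes only that statistic; the smoothing weight, the starting value, and the data are assumptions, and the control limits that would be estimated in Phase I are left out.

```python
def ewma(values, lam=0.2, start=0.0):
    """Return the EWMA series z_t = lam * x_t + (1 - lam) * z_{t-1}, with z_0 = start."""
    z, series = start, []
    for x in values:
        z = lam * x + (1 - lam) * z
        series.append(z)
    return series

# Hypothetical monitored values (for example, deviations of usage from its forecast)
smoothed = ewma([1.0, 1.0], lam=0.5, start=0.0)
```

A smaller `lam` gives more weight to history, which makes the chart more sensitive to small sustained shifts and slower to react to single spikes.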


Conclusions

[Diagram: Data Mining at the center, supported by three groups.]

• Subject matter experts: support data mining by producing data, business problems, and software & hardware equipment for testing and implementing results.

• Statisticians: support data mining by mathematical theory and statistical methods.

• Computer Scientists: support data mining by computational algorithms and relevant software.


Conclusions

• It is not hard to obtain interesting and useful knowledge from data mining.

• The challenge is to transform and implement the interesting knowledge for business decision making.

• Issues involved:
  – internal collaboration efforts (sales versus marketing),
  – external collaboration efforts (competitors within the industry),
  – privacy protection.