Copyright © 2006, SAS Institute Inc. All rights reserved.
Predictive Modeling Concepts and Algorithms
Russ Albright and David Duling
SAS Institute
Predictive Modeling Landscape
1. Background
2. Modeling Overview
3. Models
4. Model Assessment and Selection
5. Model Deployment / Scoring
Use Cases for Data Mining
1. Offline applications
   Campaign planning
   Adverse event detection
2. On-demand applications
   Front Office data collection & recommendation
3. Real-time applications
   Transaction processing
   Fraud detection
   Website product recommendation
4. Real-time modeling and scoring of data streams (the future!)
   Mega data streams
   Internet traffic
   Satellite transmissions
   Digital data acquisition
Background - Enterprise Miner Functionality
Sample
Explore
Modify
Model
Assess
Background - Predictive Modeling Terminology
[Figure: data tables labeled Training Data, Validation and Test Data, and Scoring Data. Rows are Observations; columns are Variables/Features/Attributes, Actual Target, and Predicted Target (Output)]
Modeling Overview
What do we mean by prediction?
What is a predictive model?
• Classification/discriminant model – target is categorical, usually binary
• Regression model – target is continuous
Given {x(i),y(i)},
y=f(x,θ)
E(y|x,θ)
p(y|x,θ)
Consider the following data
Predict the Response for a new value of Attribute
[Scatter plot: Response vs. Attribute]
The Simplest Model: $y = \bar{Y}$ (the mean response)
[Plot: Response vs. Attribute]
What about a polynomial?
[Plot: Response vs. Attribute]
What about a better polynomial?
[Plot: Response vs. Attribute]
Now acquire more data and call it “validation data”
The blue model is said to overfit the training data.
The mean model is said to underfit the training data.
[Plot: Response vs. Attribute, with Training and Validation points]
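The underfit/overfit contrast can be sketched in a few lines of plain Python. This is only an illustration: the data values are made up, and the three models (the mean, a straight line, and a degree-5 polynomial that passes through every training point) are assumptions chosen to mimic the plots, not the curves from the slides.

```python
def mean_model(xs, ys):
    """Underfit candidate: always predict the mean training response."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def linear_model(xs, ys):
    """Least-squares straight line for a single attribute."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def interpolating_polynomial(xs, ys):
    """Overfit candidate: Lagrange polynomial through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for m, xm in enumerate(xs):
                if m != i:
                    basis *= (x - xm) / (xi - xm)
            total += yi * basis
        return total
    return predict

def mse(model, xs, ys):
    """Mean squared error of a model on a data set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical training and validation data (roughly y = x plus noise).
train_x = [0, 1, 2, 3, 4, 5]
train_y = [0.1, 0.9, 2.3, 2.8, 4.2, 4.9]
valid_x = [0.5, 1.5, 2.5, 3.5, 4.5]
valid_y = [0.5, 1.5, 2.5, 3.5, 4.5]

models = {
    "mean": mean_model(train_x, train_y),
    "line": linear_model(train_x, train_y),
    "degree-5": interpolating_polynomial(train_x, train_y),
}
train_err = {name: mse(m, train_x, train_y) for name, m in models.items()}
valid_err = {name: mse(m, valid_x, valid_y) for name, m in models.items()}

for name in models:
    print(f"{name:9s} train={train_err[name]:.4f} valid={valid_err[name]:.4f}")
```

On this toy data the degree-5 polynomial has zero training error but a worse validation error than the line, while the mean model is poor on both sets.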
Models
Linear Regression
[3-D scatter plot: Y against X1 and X2]
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
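A minimal sketch of fitting the two-input linear model by ordinary least squares, using the normal equations and a small Gaussian-elimination helper. The function names and the data are hypothetical; a production fit would normally use a numerical library rather than this hand-rolled solver.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit_linear_regression(rows, ys):
    """Return [b0, b1, ..., bp] minimizing squared error."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Hypothetical noise-free data generated from y = 1 + 2*x1 - 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
ys = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]

coefficients = fit_linear_regression(rows, ys)
print(coefficients)
```

Because the example data lie exactly on a plane, the fit recovers the generating coefficients up to rounding.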
Logistic Regression (Generalized Linear Model)
$\log\left(\frac{p_j}{1-p_j}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$
0-1 target/response variable
Fit $p_j = p(y_j = 0 \mid x) = 1 - p(y_j = 1 \mid x)$
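As a rough illustration of how the logit model can be fit, the sketch below runs gradient ascent on the log-likelihood. Several things here are assumptions made for brevity: it uses a single input instead of two, it models the probability of y = 1 (modeling y = 0, as on the slide, only flips the signs of the coefficients), and the data, learning rate, and iteration count are invented.

```python
import math

def sigmoid(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, iterations=5000):
    """Fit logit(p) = b0 + b1*x by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            residual = y - sigmoid(b0 + b1 * x)  # observed minus fitted probability
            g0 += residual / n
            g1 += residual * x / n
        b0 += learning_rate * g0
        b1 += learning_rate * g1
    return b0, b1

# Hypothetical 0-1 target that becomes more likely as x grows.
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(xs, ys)
p_low = sigmoid(b0 + b1 * 0)
p_high = sigmoid(b0 + b1 * 5)
print(f"b0={b0:.3f} b1={b1:.3f} p(x=0)={p_low:.3f} p(x=5)={p_high:.3f}")
```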
Idea: What if we break the data into smaller chunks to identify local phenomena ?
[Plot: Response vs. Attribute]
Decision Trees
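The idea of breaking the data into chunks can be sketched with a single split (a "stump"): try each candidate threshold on the attribute, keep the one that minimizes the squared error within the two chunks, and predict each chunk's mean. This is a generic illustration with made-up data, not the specific splitting criterion of any product; a full tree would apply the same search recursively to each chunk.

```python
def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def fit_stump(xs, ys):
    """Return (threshold, left_mean, right_mean) for the best single split."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical attribute values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean

# Hypothetical data with a jump in the response between x = 3 and x = 4.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 5.1, 4.9, 5.0]

stump = fit_stump(xs, ys)
print(stump, predict_stump(stump, 2.5), predict_stump(stump, 5.5))
```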
Neural Networks
ftp://ftp.sas.com/pub/neural/FAQ.html
Evolution of model training error and validation error
[Plot: Model Error as training proceeds from Initialization. Curves: Training Error, Validation Error. Regions: Underfitting, Optimal fit, Overfitting]
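One common way to act on this picture is to record both errors at every training iteration and keep the weights from the iteration with the lowest validation error. The sketch below does this for a small polynomial model trained by gradient descent; the model, data, learning rate, and iteration count are all illustrative assumptions, and whether the validation error actually turns upward depends on the data.

```python
DEGREE = 5

def features(x):
    """Polynomial features 1, x, x^2, ..., x^DEGREE."""
    return [x ** k for k in range(DEGREE + 1)]

def predict(weights, x):
    return sum(w * f for w, f in zip(weights, features(x)))

def mse(weights, xs, ys):
    return sum((predict(weights, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical training and validation data on [-1, 1].
train_x = [-1.0, -0.6, -0.2, 0.2, 0.6, 1.0]
train_y = [-0.9, -0.8, -0.1, 0.4, 0.5, 1.1]
valid_x = [-0.8, -0.4, 0.0, 0.4, 0.8]
valid_y = [-0.8, -0.4, 0.0, 0.4, 0.8]

weights = [0.0] * (DEGREE + 1)
learning_rate = 0.05
train_history, valid_history = [], []
best_iteration, best_weights = 0, list(weights)

for iteration in range(2000):
    # Record both errors before taking the next step.
    train_history.append(mse(weights, train_x, train_y))
    valid_history.append(mse(weights, valid_x, valid_y))
    if valid_history[-1] < valid_history[best_iteration]:
        best_iteration, best_weights = iteration, list(weights)

    # One gradient-descent step on the training mean squared error.
    gradient = [0.0] * (DEGREE + 1)
    for x, y in zip(train_x, train_y):
        error = predict(weights, x) - y
        for k, f in enumerate(features(x)):
            gradient[k] += 2 * error * f / len(train_x)
    weights = [w - learning_rate * g for w, g in zip(weights, gradient)]

print("best iteration:", best_iteration)
print("validation error there:", valid_history[best_iteration])
print("final validation error:", valid_history[-1])
```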
Memory Based Reasoning (Nearest Neighbors)
[3-D scatter plot: Y against X1 and X2; label: Neighbors]
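A minimal nearest-neighbor sketch: to predict Y for a new point, find the k closest training observations in (X1, X2) and average their targets. The choice of Euclidean distance, k = 3, and the data are assumptions for illustration (`math.dist` requires Python 3.8 or later).

```python
import math

def knn_predict(training, query, k=3):
    """Average the targets of the k training points closest to query."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    return sum(target for _, target in nearest) / k

# Hypothetical training data: ((x1, x2), y) pairs forming two clusters.
training = [
    ((0, 0), 1.0), ((0, 1), 1.0), ((1, 0), 1.0),
    ((5, 5), 9.0), ((5, 6), 9.0), ((6, 5), 9.0),
]

near_origin = knn_predict(training, (0.2, 0.2))
near_far_cluster = knn_predict(training, (5.5, 5.5))
print(near_origin, near_far_cluster)
```

There is no training step beyond storing the data, which is why the approach is called memory based.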
Model Assessment and Selection – Lift charts
[Figure: Test Data table with columns Actual Target (0/1), Predicted Target (Output) (values .9, .8, .3, .6), and Decision (0/1)]
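A sketch of how cumulative lift is commonly computed: rank the test observations by predicted score, take the top fraction (the depth), and divide the response rate in that group by the overall response rate. The data below are hypothetical and are not the values from the slide; ties in score are broken arbitrarily by the sort.

```python
def cumulative_lift(actual, scores, depth):
    """Lift of the top `depth` fraction of observations ranked by score."""
    ranked = [a for _, a in sorted(zip(scores, actual), reverse=True)]
    top_n = max(1, round(depth * len(ranked)))
    top_rate = sum(ranked[:top_n]) / top_n
    overall_rate = sum(ranked) / len(ranked)
    return top_rate / overall_rate

# Hypothetical test data: actual 0-1 target and predicted probability.
actual = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2, 0.1]

lift_20 = cumulative_lift(actual, scores, 0.2)
lift_50 = cumulative_lift(actual, scores, 0.5)
lift_100 = cumulative_lift(actual, scores, 1.0)
print(lift_20, lift_50, lift_100)
```

A lift above 1 at a given depth means the model concentrates responders in its top-ranked group; at full depth the lift is 1 by construction.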
Model Assessment and Selection – ROC Curves
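An ROC curve plots the true positive rate against the false positive rate as the decision threshold on the predicted score is lowered. The sketch below builds those points and the area under the curve by the trapezoid rule; the data are hypothetical and the function names are illustrative.

```python
def roc_points(actual, scores):
    """Return (false positive rate, true positive rate) points, one per threshold."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Classify as positive every observation scoring at or above the threshold.
        tp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoid-rule area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical test data: actual 0-1 target and predicted probability.
actual = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2, 0.1]

points = roc_points(actual, scores)
auc = area_under_curve(points)
print(points)
print(auc)
```

An area of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives ahead of negatives.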
5. $ Model Deployment / “Scoring” $
It is definitely not (just) about building the models.
Scoring and Score Code
Monitoring
Batch Score Delivery to Offline Applications
[Architecture diagram. Labels: Data Mining, Model Development, ETL engine, ETL process, Scheduled Scoring, Operations, SAS Scoring, RDB Scoring, C code, PMML engine, Data Store, Scores, BI Application, Campaign Planning, Campaign Execution]
• ETL for model development and scoring
• Scores generated on a nightly basis
• ID and Score data pre-loaded into data store
• Score requests contain ID
• Decision server translates score to action
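The batch flow can be sketched as three small pieces: a scheduled job that scores every record, a store keyed by ID, and a decision step that looks up the pre-computed score and maps it to an action. Everything named here (the scoring formula, the threshold, the action labels, the in-memory dictionary standing in for the data store) is a hypothetical placeholder, not actual score code.

```python
def score_customer(record):
    """Hypothetical score code; a deployed model would supply its own."""
    return min(1.0, record["purchases"] / 10)

def nightly_batch(customers):
    """Scheduled scoring: compute a score per ID and pre-load it into the store."""
    return {customer_id: score_customer(record)
            for customer_id, record in customers.items()}

def decide(score_store, customer_id, threshold=0.5):
    """Decision step: a request carries only the ID; the score is looked up."""
    score = score_store.get(customer_id)
    if score is None:
        return "no_score"  # ID was not covered by the last batch run
    return "send_offer" if score >= threshold else "no_action"

# Hypothetical operational data.
customers = {"c1": {"purchases": 8}, "c2": {"purchases": 1}}

score_store = nightly_batch(customers)
print(decide(score_store, "c1"), decide(score_store, "c2"), decide(score_store, "c9"))
```

Because no model runs at request time, latency depends only on the lookup, at the cost of scores being as stale as the last batch run.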
Thanks!