Predictive Analytics: Modeling the World
Richard D. De Veaux, Professor of Statistics, Williams College
January 28, 2005
OR/MS Seminar
-
2
Getting to Know Your Customers
50 years ago this was easy
The customer data base could fit in one person's head
Retention of customers depended on the ability to do so
-
3
21st Century Data Bases
Ability to anticipate customers' needs is crucial for retention
Even Sam Walton didn't know all his customers' preferences
Amazon.com: "Earth's biggest selection"
$390,000 diamond necklace
World's biggest book
Yak cheese from Tibet
No one can do this without help
Well, almost no one!
-
4
Direct Marketing Example
Paralyzed Veterans of America (KDD Cup 1998)
Mailing list of 3.5 million potential donors
Lapsed donors: made their last donation to PVA 13 to 24 months prior to June 1997
200,000 (training and test sets)
Who should get the current mailing?
Cost-effective strategy
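One way to read "cost-effective strategy" is as an expected-value rule: mail a donor only when the predicted chance of a gift times the predicted gift size exceeds the cost of the mailing. The sketch below illustrates that rule. The per-piece cost, the probabilities and the gift sizes are hypothetical numbers chosen for illustration, not values from the PVA data.

```python
def expected_net(p_donate, expected_gift, cost_per_piece):
    """Expected profit from mailing one person."""
    return p_donate * expected_gift - cost_per_piece


def select_mailing(donors, cost_per_piece):
    """Return the ids of donors whose expected net donation is positive."""
    return [
        d["id"]
        for d in donors
        if expected_net(d["p_donate"], d["expected_gift"], cost_per_piece) > 0
    ]


# Hypothetical model output for three lapsed donors.
donors = [
    {"id": "A", "p_donate": 0.10, "expected_gift": 15.0},
    {"id": "B", "p_donate": 0.02, "expected_gift": 20.0},
    {"id": "C", "p_donate": 0.05, "expected_gift": 10.0},
]
COST = 0.70  # hypothetical cost of one mailing, in dollars

print(select_mailing(donors, COST))
```

Only donor A clears the bar here; B and C are expected to give less than the mailing costs.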
-
5
Why is this Hard?
Amount of information: 481 predictors, 2 responses
Cross tabs / OLAP: How many combinations? What to focus on?
Data preparation: this alone can be 60-95% of the effort
Categorical vs. quantitative
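To give a feel for "How many combinations?", the short calculation below counts how many two-way and three-way cross tabs could be formed from 481 predictors. It counts sets of variables only; the number of cells in each table would be larger still.

```python
from math import comb

N_PREDICTORS = 481  # from the slide

two_way = comb(N_PREDICTORS, 2)    # pairs of predictors
three_way = comb(N_PREDICTORS, 3)  # triples of predictors

print(f"two-way cross tabs:   {two_way:,}")
print(f"three-way cross tabs: {three_way:,}")
```

That is 115,440 pairs and 18,431,920 triples, before either response variable is even considered.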
-
6
What's Hard? -- Example
-
7
T-Code
-
8
So, what does it mean?
T-Code and Title
Code   Title              Code   Title              Code   Title              Code   Title
0      _                  16     DEAN               48     CORPORAL           109    LIC.
1      MR.                17     JUDGE              50     ELDER              111    SA.
1001   MESSRS.            17002  JUDGE & MRS.       56     MAYOR              114    DA.
1002   MR. & MRS.         18     MAJOR              59002  LIEUTENANT & MRS.  116    SR.
2      MRS.               18002  MAJOR & MRS.       62     LORD               117    SRA.
2002   MESDAMES           19     SENATOR            63     CARDINAL           118    SRTA.
3      MISS               20     GOVERNOR           64     FRIEND             120    YOUR MAJESTY
3003   MISSES             21002  SERGEANT & MRS.    65     FRIENDS            122    HIS HIGHNESS
4      DR.                22002  COLNEL & MRS.      68     ARCHDEACON         123    HER HIGHNESS
4002   DR. & MRS.         24     LIEUTENANT         69     CANON              124    COUNT
4004   DOCTORS            26     MONSIGNOR          70     BISHOP             125    LADY
5      MADAME             27     REVEREND           72002  REVEREND & MRS.    126    PRINCE
6      SERGEANT           28     MS.                73     PASTOR             127    PRINCESS
9      RABBI              28028  MSS.               75     ARCHBISHOP         128    CHIEF
10     PROFESSOR          29     BISHOP             85     SPECIALIST         129    BARON
10002  PROFESSOR & MRS.   31     AMBASSADOR         87     PRIVATE            130    SHEIK
10010  PROFESSORS         31002  AMBASSADOR & MRS   89     SEAMAN             131    PRINCE AND PRINCESS
11     ADMIRAL            33     CANTOR             90     AIRMAN             132    YOUR IMPERIAL MAJEST
11002  ADMIRAL & MRS.     36     BROTHER            91     JUSTICE            135    M. ET MME.
12     GENERAL            37     SIR                92     MR. JUSTICE        210    PROF.
12002  GENERAL & MRS.     38     COMMODORE          100    M.
13     COLONEL            40     FATHER             103    MLLE.
13002  COLONEL & MRS.     42     SISTER             104    CHANCELLOR
14     CAPTAIN            43     PRESIDENT          106    REPRESENTATIVE
14002  CAPTAIN & MRS.     44     MASTER             107    SECRETARY
15     COMMANDER          46     MOTHER             108    LT. GOVERNOR
15002  COMMANDER & MRS.   47     CHAPLAIN
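The T-Code looks numeric but is really a label, which is part of why data preparation takes so long. The sketch below shows the kind of lookup needed before modeling. The dictionary holds only a handful of entries copied from the table, the list of donor codes is hypothetical, and `decode_tcode` is an illustrative helper name.

```python
# A few entries copied from the T-Code table; the full table has many more.
TCODE_TITLES = {
    0: "_",
    1: "MR.",
    2: "MRS.",
    3: "MISS",
    4: "DR.",
    28: "MS.",
    1002: "MR. & MRS.",
    4002: "DR. & MRS.",
}


def decode_tcode(code):
    """Map a numeric T-Code to its title, treating it as a category."""
    return TCODE_TITLES.get(code, "UNKNOWN")


donor_codes = [1, 1002, 4, 28]  # hypothetical donors

# Averaging the codes is possible but meaningless: the values are labels.
print(sum(donor_codes) / len(donor_codes))
print([decode_tcode(c) for c in donor_codes])
```

Treating the column as categorical, for instance through a lookup like this one, keeps a model from reading 1002 as "about a thousand times more" than 1.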
-
9
Results for PVA Data Set
If the entire list (100,000 donors) is mailed, net donation is $10,500
Using data mining techniques, this was increased by 41.37%
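As a quick check on what a 41.37% increase means in dollars, the arithmetic can be written out as follows. Both inputs are the figures quoted on the slide; the result is simply their product.

```python
baseline = 10_500  # net donation when the entire list is mailed
increase = 0.4137  # improvement reported on the slide

improved = baseline * (1 + increase)
print(f"${improved:,.2f}")
```

This works out to roughly $14,844 in net donations, about $4,344 more than mailing everyone.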
-
10
KDD CUP 98 Results
-
11
KDD CUP 98 Results 2
-
12
Data Mining Is
"the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." --- Fayyad
"finding interesting structure (patterns, statistical models, relationships) in data bases." --- Fayyad, Chaudhuri and Bradley
"a knowledge discovery process of extracting previously unknown, actionable information from very large data bases" --- Zornes
"a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions." --- Edelstein
-
13
Data Mining Is
-
14
Case Study I: Ingot Cracking
953 30,000 lb. ingots
20% cracking rate
$30,000 per recast
90 potential explanatory variables
Water composition
Metal composition
Process variables
Other environmental variables
Can we predict under what conditions ingots will crack?
-
15
Case Study II: Car Insurance
42,800 mature policies
65 potential predictors
Can we find a pattern for the unprofitable policies?
-
16
Case Study III: Breast Cancer Diagnosis
Mammograms used as screening instrument
Expensive: read by a radiologist
Inaccurate: false positive and negative rates over 25%
Over a decade, nearly 100% false positive rate
Can we do better?
Automatically read by a scanning algorithm
Automatically diagnosed by a model
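The step from a 25% false positive rate to "nearly 100%" over a decade can be sketched with a simple probability calculation. The code assumes one screening per year for ten years and independence between screenings; both are simplifying assumptions, not statements from the talk.

```python
def prob_at_least_one_false_positive(rate, screenings):
    """P(at least one false positive) = 1 - (1 - rate) ** screenings."""
    return 1 - (1 - rate) ** screenings


# Assumed: 10 annual screenings, each with a 25% false positive rate.
p10 = prob_at_least_one_false_positive(0.25, 10)
print(f"{p10:.1%}")
```

Under these assumptions about 94% of women screened for a decade would see at least one false positive.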
-
17
Why not Queries?
Queries describe; models promote understanding
Models can be assessed both by their understanding and their predictions
"It's difficult to predict, especially the future"
Queries are event driven; models are phenomenon driven
Queries are reactive; models are proactive
-
18
What Happened on the Titanic?
[Figure: chart by Class (Crew, First, Second, Third)]
-
19
Mosaic Plot
[Figure: mosaic plot of the Titanic data; panel labels D/S, A/C, F/M and class 1, 2, 3, C]
-
20
Models
Powerful predictors for optimizing performance
Powerful summaries for understanding
Used to explore data set
Are not perfect
"All models are wrong, but some are useful"
"Statisticians, like artists, have the bad habit of falling in love with their models."
-
21
Tree Diagram
[Figure: classification tree for the Titanic data, with splits on sex (M/F), age (Child/Adult) and class (1, 2, 3, Crew); node percentages shown: 46%, 93%, 27%, 100%, 33%, 23%, 14%]
-
22
Why Models?
What's interesting?
Most associated variables in the census
What's associated with shampoo purchases?
Beer and Diapers
"In the convenience stores we looked at, on Friday nights, purchases of beer and purchases of diapers are highly associated"
Conclusions? Actions?
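"Highly associated" can be made concrete with a lift calculation: how much more often two items are bought together than independence would predict. The basket counts below are hypothetical, made up purely to show the computation.

```python
def lift(n_baskets, n_a, n_b, n_both):
    """Observed co-purchase rate divided by the rate expected under independence."""
    return (n_both * n_baskets) / (n_a * n_b)


# Hypothetical Friday-night counts for one store.
N_BASKETS = 1000
N_BEER = 200
N_DIAPERS = 100
N_BOTH = 60

beer_diaper_lift = lift(N_BASKETS, N_BEER, N_DIAPERS, N_BOTH)
print(beer_diaper_lift)
```

A lift of 1 means no association; values well above 1 flag a pair worth looking at. The number alone still leaves the "Conclusions? Actions?" questions open.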
-
23
Beer and Diapers
Picture from Tandem(TM) ad
-
24
Toy Problem
[Figure: ten scatterplot panels of train2$y (vertical axis, 5 to 25) against each predictor train2[, i] (horizontal axis, 0.0 to 1.0)]
-
25
Familiar Models
Linear Regression
-
26
Logistic Regression
-
27
Linear Regression
Term       Estimate  Std Error  t Ratio  Prob>|t|
Intercept  0.806     0.427      1.890    0.059
x1         7.269     0.273      26.590
-
28
Stepwise Regression
Term       Estimate  Std Error  t Ratio  Prob>|t|
Intercept  0.561     0.328      1.710    0.087
x1         7.252     0.273      26.550
-
29
Stepwise 2nd Order Model
Term       Estimate  Std Error  t Ratio  Prob>|t|
Intercept  0.000     0.000      .        .
x1         7.204     0.169      42.510
-
30
Next Steps
Higher order terms?
When to stop?
Transformations?
Too simple: underfitting, bias
Too complex: inconsistent predictions, overfitting, high variance
Selecting models is Occam's razor
Keep goals of interpretation vs. prediction in mind
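The trade-off between too simple and too complex can be seen in a small simulation. The sketch below is not one of the models from the talk: it fits a piecewise-constant function (the mean of y within equal-width bins) to simulated data, with the number of bins standing in for model complexity. The sine signal, noise level, sample sizes and bin counts are all assumptions made for illustration.

```python
import math
import random


def make_data(n, rng, noise_sd=0.3):
    """Simulate y = sin(2*pi*x) + noise for x uniform on [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def fit_bins(xs, ys, n_bins):
    """Piecewise-constant fit: the mean of y in each equal-width bin."""
    overall = sum(ys) / len(ys)
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    # Bins with no training points fall back to the overall mean.
    return [s / c if c else overall for s, c in zip(sums, counts)]


def predict(means, x):
    n_bins = len(means)
    return means[min(int(x * n_bins), n_bins - 1)]


def mse(means, xs, ys):
    return sum((y - predict(means, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
train_x, train_y = make_data(100, rng)
test_x, test_y = make_data(1000, rng)

results = {}
for n_bins in (1, 8, 512):  # too simple, moderate, too complex
    means = fit_bins(train_x, train_y, n_bins)
    results[n_bins] = (mse(means, train_x, train_y), mse(means, test_x, test_y))
    print(n_bins, results[n_bins])
```

Training error keeps falling as bins are added, while test error is lowest for the moderate model: the one-bin fit is biased, and the 512-bin fit chases noise.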
-
31
Tree Model
[Figure: tree diagram; first split on x4]
-
32
Feature Creation
New predictor based on original predictors
Often linear: $z_i = b_1 x_1 + \dots + b_p x_p$
Principal components
Factor analysis
Multidimensional scaling
-
33
Neural Nets
Don't resemble the brain
Are just a statistical model
-
34
A Single Neuron
[Figure: a single neuron with inputs x1 to x5 (weights 0.3, 0.7, -0.2, 0.4, -0.5), a constant input x0 (weight 0.8), input z1 and output s(z1)]
z1 = 0.8 + .3x1 + .7x2 - .2x3 + .4x4 - .5x5
Output: s(z1)
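The single neuron can be written in a few lines. The weights and the constant 0.8 are taken from the slide; the slide does not say what s is, so the logistic function used here is an assumption, and the input values are hypothetical.

```python
import math

BIAS = 0.8                              # weight on the constant input x0
WEIGHTS = [0.3, 0.7, -0.2, 0.4, -0.5]   # weights on x1..x5, from the slide


def logistic(z):
    """Assumed form of the activation s(z)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, activation=logistic):
    """Return (z1, s(z1)) for a list of five inputs."""
    z1 = BIAS + sum(w * xi for w, xi in zip(WEIGHTS, x))
    return z1, activation(z1)


z_ones, out_ones = neuron([1.0, 1.0, 1.0, 1.0, 1.0])    # hypothetical input
z_zeros, out_zeros = neuron([0.0, 0.0, 0.0, 0.0, 0.0])  # hypothetical input
print(z_ones, out_ones)
print(z_zeros, out_zeros)
```

With s removed, the neuron is exactly a linear regression in x1 to x5; with the logistic s it matches a logistic regression.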
-
35
More Exotic Neural Networks
[Figure: network with an input layer (x1, x2), a hidden layer (z1, z2, z3) and an output layer (y)]
-
36
Running a Neural Net
-
37
Predictions for Example
R-squared: 92.7% Train, 90.6% Test
-
38
What Does This Get Us?
Enormous flexibility
Ability to fit anything, including noise
Interpretation?
-
39
Case Study: Warranty Data
A new backpack inkjet printer is showing higher than expected warranty claims
What are the important variables? What's going on?
A neural network shows that Zipcode is the most important predictor
-
40
Spatial Analysis
Warranty data showing problem with inkjet printer
Use the model as a black box for variable selection
-
41
MARS
Multivariate Adaptive Regression Splines
What do they do? Replace each step function in a tree model by a pair of linear functions.
[Figure: three panels plotting y (vertical axis, -0.2 to 1.2) against x (horizontal axis, 0 to 10)]
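The "pair of linear functions" can be illustrated with the mirrored hinge functions max(0, x - t) and max(0, t - x) that MARS uses as building blocks. In the sketch below the knot, the step levels, the intercept and the slopes are hypothetical values, and the function names are illustrative.

```python
def step(x, knot, left_value, right_value):
    """What a tree split does: one constant on each side of the knot."""
    return left_value if x < knot else right_value


def hinge_pair(x, knot):
    """The two mirrored hinge functions: (max(0, x - knot), max(0, knot - x))."""
    return max(0.0, x - knot), max(0.0, knot - x)


def hinge_model(x, knot, intercept, slope_right, slope_left):
    """A continuous piecewise-linear fit built from one hinge pair."""
    right, left = hinge_pair(x, knot)
    return intercept + slope_right * right + slope_left * left


KNOT = 4.0  # hypothetical knot location
for x in (1.0, 4.0, 6.0):
    print(x, step(x, KNOT, 0.2, 1.0), hinge_model(x, KNOT, 1.0, 0.5, -0.25))
```

Unlike the step function, the hinge model is continuous at the knot and lets the slope differ on either side.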
-
42
MARS Variable Importance
R-squared: 95.0% Train, 94.3% Test (96.3%) (95.8%)
-
43
MARS Function Output
-
44
Collaborative Filtering
Goal: predict what movies people will like
Data: list of movies each person has watched
Lyle: Andre, Starwars
Ellen: Andre, Starwars, Coeur en Hiver
Fred: Starwars, Batman
Dean: Starwars, Batman, Rambo
Jason: Coeur en Hiver, Chocolat
-
45
Data Base
Data can be represented as a sparse matrix
Karen likes Andre. What else might she like?
CDNow doubled e-mail responses
Columns: Andre, Starwars, Batman, Rambo, Coeur, Chocolat
Lyle: y y
Ellen: y y y
Fred: y y
Dean: y y y
Jason: y y y
Karen: y ? ? ? ? ?
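A minimal way to answer "What else might she like?" is to score each unseen movie by how many viewers share it with something Karen already likes. The sketch below uses the viewing lists from the Collaborative Filtering slide; the co-occurrence scoring rule is an assumption chosen for simplicity, not necessarily the method behind the CDNow result.

```python
from collections import Counter

# Viewing lists as given on the Collaborative Filtering slide.
watched = {
    "Lyle": {"Andre", "Starwars"},
    "Ellen": {"Andre", "Starwars", "Coeur en Hiver"},
    "Fred": {"Starwars", "Batman"},
    "Dean": {"Starwars", "Batman", "Rambo"},
    "Jason": {"Coeur en Hiver", "Chocolat"},
}


def recommend(liked, watched):
    """Score unseen movies by overlap with viewers who share the user's tastes."""
    scores = Counter()
    for movies in watched.values():
        overlap = len(liked & movies)
        if overlap:
            for movie in movies - liked:
                scores[movie] += overlap
    return scores.most_common()


recs = recommend({"Andre"}, watched)
print(recs)
```

Both people who watched Andre also watched Starwars, and one of them watched Coeur en Hiver, so those are the two suggestions for Karen, in that order.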
-
46
How Do We Really Start?
Life is not so kind
Categorical variables
Missing data
500 variables, not 10
481 variables: where to start?
-
47
Where to Start?
EDM
Use a tree to find a smaller subset of variables to investigate
Explore this set graphically
Start the modeling process over
Build model
Compare model on small subset with full predictive model
-
48
Start With a Simple Model
Maybe a tree:
[Figure: tree diagram; first split on x4]
-
49
Automatic Models
KXEN
-
50
PVA Results from KXEN
-
51
Combining Models -- Bagging
Bagging (Bootstrap Aggregation)
Bootstrap a data set repeatedly
Take many versions of the same model (e.g. tree)
Form a committee of models
Take majority rule of predictions
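The bagging steps above can be sketched in plain Python. The base model here is a one-split "stump" rather than a full tree, the data are a hypothetical one-dimensional toy set, and the function names are illustrative.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Pick the threshold t minimizing errors of the rule: predict 1 if x >= t."""
    best_t, best_err = float("inf"), None
    for t in sorted(set(xs)) + [float("inf")]:
        err = sum((1 if x >= t else 0) != y for x, y in zip(xs, ys))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def bagged_stumps(xs, ys, n_models, rng):
    """Fit one stump per bootstrap sample (drawn with replacement)."""
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds


def predict(thresholds, x):
    """Majority vote of the committee."""
    votes = Counter(1 if x >= t else 0 for t in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: the class is 1 when x is at least 0.5.
xs = [i / 40 for i in range(40)]
ys = [1 if x >= 0.5 else 0 for x in xs]

committee = bagged_stumps(xs, ys, n_models=25, rng=random.Random(1))
print(predict(committee, 0.1), predict(committee, 0.9))
```

An odd committee size avoids tied votes. Each bootstrap sample yields a slightly different threshold, and the vote averages that variability away.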
-
52
Combining Models -- Boosting
Take the data and apply a simple classifier
Reweight the data, weighting the misclassified data much higher.
Reapply the classifier
Repeat over and over
The final prediction is a combination of the output of each classifier, weighted by the overall misclassification rate.
Details in Freund, Y. (1995), "Boosting a weak learning algorithm by majority", Information and Computation 121(2), 256-285.
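The reweighting loop can be sketched as follows. This version uses AdaBoost-style weights (Freund and Schapire), which is an assumption about the variant; the cited paper describes the earlier boost-by-majority algorithm. The weak classifier is a one-split stump and the ten-point data set is a hypothetical toy example that no single stump can classify correctly.

```python
import math


def stump_predict(x, threshold, polarity):
    """polarity=+1 predicts +1 left of the threshold; polarity=-1 the reverse."""
    return polarity if x < threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best stump."""
    values = sorted(set(xs))
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in thresholds:
        for polarity in (1, -1):
            err = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, t, polarity) != y
            )
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best


def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, t, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1 - 1e-12)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this classifier
        ensemble.append((alpha, t, polarity))
        # Raise the weight of misclassified points, lower the rest.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, t, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]  # hypothetical labels

model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])
```

No single stump gets more than seven of these ten points right, yet the weighted vote of three stumps classifies all ten correctly.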
-
53
Breast Cancer Diagnosis
-
54
Results from Random Forest
                False Positive Rate  False Negative Rate
Tree            32.20%               33.70%
Boosted Trees   24.90%               32.50%
Random Forest   19.30%               28.80%
Neural Network  25.50%               31.70%
Radiologists    22.40%               35.80%
Results from 1000 splits of training and test data
-
55
Case Study: Ingot Failures
Ingot cracking
953 30,000 lb. ingots
20% cracking rate
$30,000 per recast
90 potential explanatory variables
Water composition (reduced)
Metal composition
Process variables
Other environmental variables
-
56
Model Building Process
Model building: Train, Test
Evaluate
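The train/test step can be sketched as a random split of the cases. The 953 rows below stand in for the ingots; the 30% test fraction and the seed are hypothetical choices, and `train_test_split` is an illustrative helper written here, not a library call.

```python
import random


def train_test_split(rows, test_fraction, rng):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(953))  # one id per ingot
train, test = train_test_split(rows, 0.3, random.Random(42))
print(len(train), len(test))
```

The model is built on `train` only and evaluated on `test`; repeating the split with different seeds gives a sense of how stable the evaluation is.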
-
57
Most Important Variable
Take one (here we started with trees): Alloy. "We know that."
OK, take two: Yttrium. "What do you think is in the alloy?"
Third time's the charm? Selenium! "OH!"
-
58
Case Study: Car Insurance
Now that we have 40,000 mature policies, can we find other factors to price policies better?
65 potential predictors
Industry, vehicle age, color, number of vehicles, usage, location, etc.
-
59
Fast Fail
Not every modeling effort is a success
A model search can save lots of queries
Data took 8 months to get ready
Analyst spent 2 months exploring it
A new model search program (KXEN) running for several hours found no out-of-sample predictive ability
Tree model gave similar results
-
60
PVA Recap
Remember: 481 predictor variables
Need a way to trim this down
Need an exploratory model: Neural network? Tree?
-
61
Students in Data Mining Class
Student #1: $15,024
Student #2: $14,695
Student #3: $14,345
-
62
Take Home Messages
What a great time to be a Statistician!
Problems are exciting
Research is exciting
Success in data mining:
Requires team work
Requires flexibility in modeling
Means that you act on your results
Depends much more on the way you mine the data than on the specific model or tool that you use
Which method to use? Yes!!
Have fun!
-
63
Thank you!