Predictive Analytics: Modeling the World

Richard D. De Veaux
Professor of Statistics, Williams College
January 28, 2005
OR/MS Seminar

Getting to Know Your Customers

50 years ago this was easy
  Customer data base could fit in one person's head
  Retention of customers depended on ability to do so

21st Century Data Bases

Ability to anticipate customers' needs is crucial for retention
Even Sam Walton didn't know all his customers' preferences
Amazon.com: "Earth's biggest selection"
  $390,000 diamond necklace
  World's biggest book
  Yak cheese from Tibet
No one can do this without help
  Well, almost no one!

Direct Marketing Example

Paralyzed Veterans of America (PVA), KDD Cup 1998
Mailing list of 3.5 million potential donors
Lapsed donors
  Made their last donation to PVA 13 to 24 months prior to June 1997
  200,000 (training and test sets)
Who should get the current mailing?
  Cost-effective strategy

Why is this Hard?

Amount of information
  481 predictors
  2 responses
Cross tabs / OLAP
  How many combinations?
  What to focus on?
Data preparation
  This alone can be 60-95% of the effort
  Categorical vs. quantitative
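To put a number on "How many combinations?", the following minimal sketch counts the distinct two-way and three-way cross tabs that could be formed from the 481 predictors. The function name is made up for illustration, and the count ignores the two responses and the number of levels within each variable.

```python
from math import comb

N_PREDICTORS = 481  # predictor count quoted on the slide


def crosstab_count(n_vars: int, k: int) -> int:
    """Number of distinct k-way cross tabs among n_vars variables."""
    return comb(n_vars, k)


two_way = crosstab_count(N_PREDICTORS, 2)
three_way = crosstab_count(N_PREDICTORS, 3)

print(f"two-way tables:   {two_way:,}")    # 115,440
print(f"three-way tables: {three_way:,}")  # 18,431,920
```

Even restricted to pairs, there are far more tables than anyone can inspect by hand, which is the argument for letting a model suggest where to look.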

What's Hard? -- Example

T-Code

So, what does it mean?

T-Code and Title:

Code  Title             Code  Title             Code  Title              Code  Title
0     _                 16    DEAN              48    CORPORAL           109   LIC.
1     MR.               17    JUDGE             50    ELDER              111   SA.
1001  MESSRS.           17002 JUDGE & MRS.      56    MAYOR              114   DA.
1002  MR. & MRS.        18    MAJOR             59002 LIEUTENANT & MRS.  116   SR.
2     MRS.              18002 MAJOR & MRS.      62    LORD               117   SRA.
2002  MESDAMES          19    SENATOR           63    CARDINAL           118   SRTA.
3     MISS              20    GOVERNOR          64    FRIEND             120   YOUR MAJESTY
3003  MISSES            21002 SERGEANT & MRS.   65    FRIENDS            122   HIS HIGHNESS
4     DR.               22002 COLONEL & MRS.    68    ARCHDEACON         123   HER HIGHNESS
4002  DR. & MRS.        24    LIEUTENANT        69    CANON              124   COUNT
4004  DOCTORS           26    MONSIGNOR         70    BISHOP             125   LADY
5     MADAME            27    REVEREND          72002 REVEREND & MRS.    126   PRINCE
6     SERGEANT          28    MS.               73    PASTOR             127   PRINCESS
9     RABBI             28028 MSS.              75    ARCHBISHOP         128   CHIEF
10    PROFESSOR         29    BISHOP            85    SPECIALIST         129   BARON
10002 PROFESSOR & MRS.  31    AMBASSADOR        87    PRIVATE            130   SHEIK
10010 PROFESSORS        31002 AMBASSADOR & MRS  89    SEAMAN             131   PRINCE AND PRINCESS
11    ADMIRAL           33    CANTOR            90    AIRMAN             132   YOUR IMPERIAL MAJEST
11002 ADMIRAL & MRS.    36    BROTHER           91    JUSTICE            135   M. ET MME.
12    GENERAL           37    SIR               92    MR. JUSTICE        210   PROF.
12002 GENERAL & MRS.    38    COMMODORE         100   M.
13    COLONEL           40    FATHER            103   MLLE.
13002 COLONEL & MRS.    42    SISTER            104   CHANCELLOR
14    CAPTAIN           43    PRESIDENT         106   REPRESENTATIVE
14002 CAPTAIN & MRS.    44    MASTER            107   SECRETARY
15    COMMANDER         46    MOTHER            108   LT. GOVERNOR
15002 COMMANDER & MRS.  47    CHAPLAIN

Results for PVA Data Set

If the entire list (100,000 donors) is mailed, net donation is $10,500
Using data mining techniques, this was increased 41.37%
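As a quick check on the arithmetic, the sketch below applies the quoted 41.37% improvement to the $10,500 baseline. Both figures come from the slide; the function name is only illustrative.

```python
BASELINE_NET = 10_500.00  # net donation when the whole list is mailed (from the slide)
LIFT = 0.4137             # 41.37% improvement (from the slide)


def improved_net(baseline: float, lift: float) -> float:
    """Net donation after a relative improvement of `lift` over `baseline`."""
    return baseline * (1.0 + lift)


best_net = improved_net(BASELINE_NET, LIFT)
print(f"${best_net:,.2f}")  # $14,843.85
```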

KDD CUP 98 Results

KDD CUP 98 Results 2

Data Mining Is

"the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." --- Fayyad

"finding interesting structure (patterns, statistical models, relationships) in data bases." --- Fayyad, Chaudhuri and Bradley

"a knowledge discovery process of extracting previously unknown, actionable information from very large data bases" --- Zornes

"a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions." --- Edelstein

Data Mining Is

Case Study I

Ingot Cracking
  953 30,000 lb. ingots
  20% cracking rate
  $30,000 per recast
  90 potential explanatory variables
    Water composition
    Metal composition
    Process variables
    Other environmental variables

Can we predict under what conditions ingots will crack?

Case Study II

Car Insurance
  42,800 mature policies
  65 potential predictors

Can we find a pattern for the unprofitable policies?

Case Study III

Breast Cancer Diagnosis
Mammograms used as screening instrument
  Expensive radiologist read
  Inaccurate
    False positive and negative rates over 25%
    Over a decade, nearly 100% false positive rate
Can we do better?
  Automatically read by a scanning algorithm
  Automatically diagnosed by a model

Why not Queries?

Queries describe; models promote understanding
  Models can be assessed both by their understanding and their predictions
  "It's difficult to predict, especially the future"
Queries are event driven; models are phenomenon driven
Queries are reactive; models are proactive

What Happened on the Titanic?

[Figure: plot by Class (Crew, First, Second, Third)]

Mosaic Plot

[Figure: mosaic plot; surviving labels are D/S, A/C, F/M, and class 1/2/3/C]

Models

Powerful predictors for optimizing performance
Powerful summaries for understanding
Used to explore data set
Are not perfect
  "All models are wrong, but some are useful"
  "Statisticians, like artists, have the bad habit of falling in love with their models."

Tree Diagram

[Figure: classification tree with splits on sex (M/F), class (1, 2, 3, Crew), and age (Child/Adult); leaf percentages shown: 14%, 23%, 27%, 33%, 46%, 93%, 100%]

Why Models?

What's interesting?
  Most associated variables in the census
  What's associated with shampoo purchases?
Beer and Diapers
  "In the convenience stores we looked at, on Friday nights, purchases of beer and purchases of diapers are highly associated"
  Conclusions?
  Actions?

Beer and Diapers

Picture from Tandem™ ad

Toy Problem

[Figure: ten scatterplot panels of train2$y (vertical axis, ticks 5 to 25) against train2[, i] (horizontal axis, 0.0 to 1.0)]

Familiar Models

Linear Regression

Logistic Regression

Linear Regression

Term       Estimate  Std Error  t Ratio  Prob>|t|
Intercept  0.806     0.427      1.890    0.059
x1         7.269     0.273      26.590

Stepwise Regression

Term       Estimate  Std Error  t Ratio  Prob>|t|
Intercept  0.561     0.328      1.710    0.087
x1         7.252     0.273      26.550

Stepwise 2nd Order Model

Term       Estimate  Std Error  t Ratio  Prob>|t|
Intercept  0.000     0.000      .        .
x1         7.204     0.169      42.510

Next Steps

Higher order terms?
When to stop?
Transformations?
Too simple: underfitting (bias)
Too complex: inconsistent predictions, overfitting (high variance)
Selecting models is Occam's razor
  Keep goals of interpretation vs. prediction in mind
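The trade-off between "too simple" and "too complex" can be seen on simulated data. The sketch below is an illustration only: the sine-plus-noise data, the sample sizes, and the polynomial degrees are all assumptions, not the toy problem from these slides. It fits polynomials of increasing degree with NumPy and compares training and test error.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Hypothetical data: a smooth signal plus noise (not the slides' toy problem)."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)
    return x, y


x_train, y_train = make_data(30)
x_test, y_test = make_data(200)


def mse(coefs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))


results = {}
for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    results[degree] = (mse(coefs, x_train, y_train), mse(coefs, x_test, y_test))
    print(f"degree {degree}: train MSE {results[degree][0]:.3f}, "
          f"test MSE {results[degree][1]:.3f}")
```

Training error can only go down as terms are added, so it cannot tell us when to stop; the held-out error is the one to watch.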

Tree Model

[Figure: tree diagram with root split on x4]

Feature Creation

New predictor based on original predictors
Often linear: $z_i = b_1 x_1 + \cdots + b_p x_p$
  Principal components
  Factor analysis
  Multidimensional scaling
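One way to obtain such a linear feature is the first principal component. The sketch below is a minimal illustration on made-up data (three correlated predictors, invented for the example): it centers the predictors, takes the leading right singular vector as the weights b, and forms the new feature z.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 200 rows, 3 predictors driven by one shared latent factor.
latent = rng.normal(size=(200, 1))
X = latent @ np.array([[1.0, 0.8, 0.5]]) + 0.1 * rng.normal(size=(200, 3))

Xc = X - X.mean(axis=0)                           # center each predictor
_, s, vt = np.linalg.svd(Xc, full_matrices=False)

b = vt[0]                                         # weights b_1, ..., b_p (unit length)
z = Xc @ b                                        # new feature: one value per row
explained = s[0] ** 2 / np.sum(s ** 2)            # share of total variance captured

print("weights:", np.round(b, 3))
print(f"variance explained by z: {explained:.1%}")
```

The sign of b is arbitrary, so z and -z are equally valid features.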

Neural Nets

Don't resemble the brain
  Are just a statistical model

A Single Neuron

[Figure: inputs x1 to x5 with weights 0.3, 0.7, -0.2, 0.4, -0.5 and a constant input x0 with weight 0.8 feed a single unit; the output is s(z1)]

z1 = 0.8 + 0.3 x1 + 0.7 x2 - 0.2 x3 + 0.4 x4 - 0.5 x5
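The single neuron can be written out in a few lines. The weights and the constant 0.8 are taken from the slide; the choice of the logistic function for s is an assumption, since the slide does not say which squashing function is used.

```python
import math

BIAS = 0.8                               # weight on the constant input x0
WEIGHTS = [0.3, 0.7, -0.2, 0.4, -0.5]    # weights on x1..x5


def linear_part(x):
    """z1 = 0.8 + 0.3*x1 + 0.7*x2 - 0.2*x3 + 0.4*x4 - 0.5*x5."""
    return BIAS + sum(w * xi for w, xi in zip(WEIGHTS, x))


def s(z):
    """Logistic squashing function (assumed form of s)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x):
    """Output of the unit: s(z1)."""
    return s(linear_part(x))


x = [1.0, 1.0, 1.0, 1.0, 1.0]
print(linear_part(x))  # approximately 1.5
print(neuron(x))       # a value between 0 and 1
```

With the logistic choice, a single neuron is the same model as logistic regression, which is one sense in which a neural net is "just a statistical model".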

More Exotic Neural Networks

[Figure: network with an input layer (x1, x2), a hidden layer (z1, z2, z3), and an output layer (y)]

Running a Neural Net

Predictions for Example

R-squared: 92.7% (train), 90.6% (test)

What Does This Get Us?

Enormous flexibility
Ability to fit anything
  Including noise
Interpretation?

Case Study: Warranty Data

A new backpack inkjet printer is showing higher than expected warranty claims
  What are the important variables?
  What's going on?
A neural network shows that Zipcode is the most important predictor

Spatial Analysis

Warranty data showing problem with inkjet printer
Use the model as a black box for variable selection

MARS

Multivariate Adaptive Regression Splines
What do they do?
  Replace each step function in a tree model by a pair of linear functions.
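The "pair of linear functions" is usually written as two hinge functions that meet at a knot. The sketch below contrasts a tree-style step with such a pair; the knot, intercept, and slopes are made-up values for illustration.

```python
def hinge_pair(x, knot):
    """Return (max(0, x - knot), max(0, knot - x)): the two mirrored hinges."""
    return max(0.0, x - knot), max(0.0, knot - x)


def tree_step(x, knot, left_value, right_value):
    """Tree-style fit: a constant on each side of the split point."""
    return left_value if x < knot else right_value


def mars_term(x, knot, intercept, slope_right, slope_left):
    """MARS-style fit: a separate linear slope on each side of the knot."""
    right, left = hinge_pair(x, knot)
    return intercept + slope_right * right + slope_left * left


# Hypothetical knot at x = 4 on the 0-10 range used in the plots.
for x in (1.0, 4.0, 7.0):
    print(x, tree_step(x, 4.0, 0.2, 0.9), mars_term(x, 4.0, 0.5, 0.1, -0.2))
```

Unlike the step function, the hinge version is continuous at the knot, so predictions do not jump as x crosses the split.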

[Figure: three panels of y (vertical axis, about -0.2 to 1.2) against x (horizontal axis, 0 to 10)]

MARS Variable Importance

R-squared: 95.0% (train), 94.3% (test)
(96.3%) (95.8%)

MARS Function Output

Collaborative Filtering

Goal: predict what movies people will like
Data: list of movies each person has watched

Lyle   Andre, Starwars
Ellen  Andre, Starwars, Coeur en Hiver
Fred   Starwars, Batman
Dean   Starwars, Batman, Rambo
Jason  Coeur en Hiver, Chocolat

Data Base

Data can be represented as a sparse matrix
Karen likes Andre. What else might she like?
CDNow doubled e-mail responses

Columns: Andre, Starwars, Batman, Rambo, Coeur, Chocolat

Lyle   y y
Ellen  y y y
Fred   y y
Dean   y y y
Jason  y y y
Karen  y ? ? ? ? ?
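A very simple answer to "What else might she like?" is to count what the people who share Karen's taste have watched. The sketch below uses the viewing lists from the Collaborative Filtering slide; the overlap-count scoring rule is an assumption chosen for brevity, not the method CDNow used.

```python
from collections import Counter

# Viewing lists from the Collaborative Filtering slide.
watched = {
    "Lyle": {"Andre", "Starwars"},
    "Ellen": {"Andre", "Starwars", "Coeur en Hiver"},
    "Fred": {"Starwars", "Batman"},
    "Dean": {"Starwars", "Batman", "Rambo"},
    "Jason": {"Coeur en Hiver", "Chocolat"},
}


def recommend(liked, watched):
    """Score unseen movies by how much their viewers overlap with `liked`."""
    scores = Counter()
    for movies in watched.values():
        overlap = len(liked & movies)
        if overlap == 0:
            continue  # this person shares nothing with the new customer
        for movie in movies - liked:
            scores[movie] += overlap
    return scores.most_common()


karen = recommend({"Andre"}, watched)
print(karen)  # [('Starwars', 2), ('Coeur en Hiver', 1)]
```

Both people who watched Andre also watched Starwars, so it ranks first; Batman, Rambo, and Chocolat get no score because none of their viewers share a title with Karen.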

How Do We Really Start?

Life is not so kind
  Categorical variables
  Missing data
  500 variables, not 10
481 variables: where to start?

Where to Start?

EDM
  Use a tree to find a smaller subset of variables to investigate
  Explore this set graphically
  Start the modeling process over
Build model
  Compare model on small subset with full predictive model

Start With a Simple Model

Maybe a tree:

[Figure: tree diagram with root split on x4]

Automatic Models

KXEN

PVA Results from KXEN

Combining Models -- Bagging

Bagging (Bootstrap Aggregation)
  Bootstrap a data set repeatedly
  Take many versions of same model (e.g. tree)
  Form a committee of models
  Take majority rule of predictions
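The four steps above can be sketched in plain Python. This is an illustration under simplifying assumptions: the base model is a one-split tree (a "stump") on a single predictor, and the toy data set is invented for the example.

```python
import random
from collections import Counter


def fit_stump(data):
    """Fit a one-split tree on 1-D data given as (x, label) pairs, label in {0, 1}."""
    best = None
    for threshold in sorted({x for x, _ in data}):
        for above in (0, 1):  # label predicted when x >= threshold
            errors = sum((above if x >= threshold else 1 - above) != y
                         for x, y in data)
            if best is None or errors < best[0]:
                best = (errors, threshold, above)
    _, threshold, above = best
    return lambda x: above if x >= threshold else 1 - above


def bagged_classifier(data, n_models=25, seed=0):
    """Bootstrap the data repeatedly, fit one stump per sample, vote by majority."""
    rng = random.Random(seed)
    committee = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]  # draw n rows with replacement
        committee.append(fit_stump(sample))

    def predict(x):
        votes = Counter(model(x) for model in committee)
        return votes.most_common(1)[0][0]

    return predict


# Hypothetical toy data: label 0 below 0.5, label 1 above.
data = [(0.05, 0), (0.15, 0), (0.25, 0), (0.35, 0), (0.45, 0),
        (0.55, 1), (0.65, 1), (0.75, 1), (0.85, 1), (0.95, 1)]

predict = bagged_classifier(data)
print(predict(0.0), predict(1.0))
```

An odd number of models avoids tied votes in a two-class problem.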

Combining Models -- Boosting

Take the data and apply a simple classifier

Reweight the data, weighting the misclassified data much higher.

Reapply the classifier

Repeat over and over

The final prediction is a combination of the output of each classifier, weighted by the overall misclassification rate.

Details in Freund, Y. (1995), "Boosting a weak learning algorithm by majority", Information and Computation 121(2), 256-285.
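The reweighting loop can be sketched as follows. Note the swap: this is the AdaBoost variant of Freund and Schapire, used here because it is short and widely known, not the boost-by-majority algorithm of the cited paper. The one-split base classifier and the ten-point data set are assumptions for illustration.

```python
import math


def fit_weighted_stump(xs, ys, w):
    """Best one-split rule under weights w; predicts `sign` when x >= threshold."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if (sign if x >= threshold else -sign) != y)
            if best is None or err < best[0]:
                best = (err, threshold, sign)
    return best  # (weighted error, threshold, sign)


def adaboost(xs, ys, rounds=3):
    """AdaBoost with one-split classifiers; labels must be +1 or -1."""
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, sign = fit_weighted_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)  # low error gives a large say
        ensemble.append((alpha, threshold, sign))
        # Reweight: misclassified points gain weight, correct ones lose it.
        w = [wi * math.exp(-alpha * y * (sign if x >= threshold else -sign))
             for x, y, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]

    def predict(x):
        score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
        return 1 if score >= 0 else -1

    return predict


# Hypothetical data that no single split can classify perfectly.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

predict = adaboost(xs, ys, rounds=3)
print([predict(x) for x in xs])
```

On this data the best single split misclassifies three points, while the weighted combination of three splits classifies all ten training points correctly.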

Breast Cancer Diagnosis

Results from Random Forest

                False Positive Rate  False Negative Rate
Tree            32.20%               33.70%
Boosted Trees   24.90%               32.50%
Random Forest   19.30%               28.80%
Neural Network  25.50%               31.70%
Radiologists    22.40%               35.80%

Results from 1000 splits of training and test data

Case Study: Ingot Failures

Ingot cracking
  953 30,000 lb. ingots
  20% cracking rate
  $30,000 per recast
  90 potential explanatory variables
    Water composition (reduced)
    Metal composition
    Process variables
    Other environmental variables

Model building process

Model building
  Train
  Test
Evaluate

Most Important Variable

Take one
  Here we started with trees
  Alloy
  We know that
OK, take two
  Yttrium
  What do you think is in the alloy?
Third time's the charm?
  Selenium!
  OH!

Case Study: Car Insurance

Now that we have 40,000 mature policies, can we find other factors to price policies better?
65 potential predictors
  Industry, vehicle age, color, number of vehicles, usage, location, etc.

Fast Fail

Not every modeling effort is a success
  A model search can save lots of queries
Data took 8 months to get ready
Analyst spent 2 months exploring it
A new model search program (KXEN) running for several hours found no out-of-sample predictive ability
  Tree model gave similar results

PVA Recap

Remember --- 481 predictor variables
Need a way to trim this down
Need an exploratory model
  Neural network?
  Tree?

Students in Data Mining Class

Student #1  $15,024
Student #2  $14,695
Student #3  $14,345

Take Home Messages

What a great time to be a Statistician!
  Problems are exciting
  Research is exciting
Success in data mining
  Requires teamwork
  Requires flexibility in modeling
  Means that you act on your results
  Depends much more on the way you mine the data than on the specific model or tool that you use
Which method to use?
  Yes!!
Have fun!

Thank you!

[email protected]
