Predictive Analytics: Modeling the World Richard D. De Veaux Professor of Statistics, Williams College January 28, 2005 OR/MS Seminar


  • Predictive Analytics: Modeling the World

    Richard D. De Veaux
    Professor of Statistics, Williams College
    January 28, 2005
    OR/MS Seminar

  • 2

    Getting to Know Your Customers

    50 years ago this was easy
        Customer data base could fit in one person's head
        Retention of customers depended on the ability to do so

  • 3

    21st Century Data Bases

    Ability to anticipate customers' needs is crucial for retention

    Even Sam Walton didn't know all his customers' preferences

    Amazon.com: "Earth's biggest selection"
        $390,000 diamond necklace
        World's biggest book
        Yak cheese from Tibet

    No one can do this without help
        Well, almost no one!

  • 4

    Direct Marketing Example

    Paralyzed Veterans of America (PVA), KDD Cup 1998

    Mailing list of 3.5 million potential donors

    Lapsed donors
        Made their last donation to PVA 13 to 24 months prior to June 1997
        200,000 (training and test sets)

    Who should get the current mailing?
        Cost-effective strategy

  • 5

    Why is this Hard?

    Amount of information
        481 predictors
        2 responses

    Cross tabs / OLAP
        How many combinations?
        What to focus on?

    Data preparation
        This alone can be 60-95% of the effort
        Categorical vs. quantitative
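To put a number on "How many combinations?": a minimal sketch that counts the two-way and three-way cross tabs available among the 481 predictors. It counts only which variables are crossed, not the number of cells in each table.

```python
from math import comb

N_PREDICTORS = 481  # predictor count from the slide

# Number of distinct pairs and triples of predictors that could be cross-tabulated.
two_way = comb(N_PREDICTORS, 2)
three_way = comb(N_PREDICTORS, 3)

print(f"two-way cross tabs:   {two_way:,}")
print(f"three-way cross tabs: {three_way:,}")
```

That is 115,440 pairs and more than 18 million triples, before looking at either response.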

  • 6

    What's Hard? -- Example

  • 7

    T-Code

  • 8

    So, what does it mean?

    T-Code Title              T-Code Title              T-Code Title              T-Code Title
    0      _                  16     DEAN               48     CORPORAL           109    LIC.
    1      MR.                17     JUDGE              50     ELDER              111    SA.
    1001   MESSRS.            17002  JUDGE & MRS.       56     MAYOR              114    DA.
    1002   MR. & MRS.         18     MAJOR              59002  LIEUTENANT & MRS.  116    SR.
    2      MRS.               18002  MAJOR & MRS.       62     LORD               117    SRA.
    2002   MESDAMES           19     SENATOR            63     CARDINAL           118    SRTA.
    3      MISS               20     GOVERNOR           64     FRIEND             120    YOUR MAJESTY
    3003   MISSES             21002  SERGEANT & MRS.    65     FRIENDS            122    HIS HIGHNESS
    4      DR.                22002  COLNEL & MRS.      68     ARCHDEACON         123    HER HIGHNESS
    4002   DR. & MRS.         24     LIEUTENANT         69     CANON              124    COUNT
    4004   DOCTORS            26     MONSIGNOR          70     BISHOP             125    LADY
    5      MADAME             27     REVEREND           72002  REVEREND & MRS.    126    PRINCE
    6      SERGEANT           28     MS.                73     PASTOR             127    PRINCESS
    9      RABBI              28028  MSS.               75     ARCHBISHOP         128    CHIEF
    10     PROFESSOR          29     BISHOP             85     SPECIALIST         129    BARON
    10002  PROFESSOR & MRS.   31     AMBASSADOR         87     PRIVATE            130    SHEIK
    10010  PROFESSORS         31002  AMBASSADOR & MRS   89     SEAMAN             131    PRINCE AND PRINCESS
    11     ADMIRAL            33     CANTOR             90     AIRMAN             132    YOUR IMPERIAL MAJEST
    11002  ADMIRAL & MRS.     36     BROTHER            91     JUSTICE            135    M. ET MME.
    12     GENERAL            37     SIR                92     MR. JUSTICE        210    PROF.
    12002  GENERAL & MRS.     38     COMMODORE          100    M.
    13     COLONEL            40     FATHER             103    MLLE.
    13002  COLONEL & MRS.     42     SISTER             104    CHANCELLOR
    14     CAPTAIN            43     PRESIDENT          106    REPRESENTATIVE
    14002  CAPTAIN & MRS.     44     MASTER             107    SECRETARY
    15     COMMANDER          46     MOTHER             108    LT. GOVERNOR
    15002  COMMANDER & MRS.   47     CHAPLAIN

  • 9

    Results for PVA Data Set

    If the entire list (100,000 donors) is mailed, net donation is $10,500

    Using data mining techniques, this was increased by 41.37%
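In dollar terms, a quick worked calculation using only the two figures on the slide (the variable names are illustrative):

```python
baseline_net = 10_500   # dollars: net donation when the entire list is mailed (from the slide)
improvement = 0.4137    # the 41.37% increase reported on the slide

improved_net = baseline_net * (1 + improvement)
gain = improved_net - baseline_net

print(f"improved net donation: ${improved_net:,.2f} (gain of ${gain:,.2f})")
```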

  • 10

    KDD CUP 98 Results

  • 11

    KDD CUP 98 Results 2

  • 12

    Data Mining Is

    "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." (Fayyad)

    "finding interesting structure (patterns, statistical models, relationships) in data bases." (Fayyad, Chaudhuri and Bradley)

    "a knowledge discovery process of extracting previously unknown, actionable information from very large data bases" (Zornes)

    "a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions." (Edelstein)

  • 13

    Data Mining Is

  • 14

    Case Study I

    Ingot Cracking
        953 30,000 lb. ingots
        20% cracking rate
        $30,000 per recast
        90 potential explanatory variables
            Water composition
            Metal composition
            Process variables
            Other environmental variables

    Can we predict under what conditions ingots will crack?

  • 15

    Case Study II

    Car Insurance
        42,800 mature policies
        65 potential predictors

    Can we find a pattern for the unprofitable policies?

  • 16

    Case Study III

    Breast Cancer Diagnosis

    Mammograms used as screening instrument
        Expensive (read by a radiologist)
        Inaccurate
            False positive and negative rates over 25%
            Over a decade, nearly 100% false positive rate

    Can we do better?
        Automatically read by a scanning algorithm
        Automatically diagnosed by a model
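The "nearly 100%" figure can be checked with a short calculation. This is a sketch under two assumptions that are not on the slide: one screening per year for ten years, and independent readings with a 25% false positive rate each time.

```python
def prob_at_least_one_false_positive(rate_per_screen: float, n_screens: int) -> float:
    """P(at least one false positive) over n independent screenings."""
    return 1.0 - (1.0 - rate_per_screen) ** n_screens

# Assumed: 10 annual screenings, 25% false positive rate per screening.
p_decade = prob_at_least_one_false_positive(0.25, 10)
print(f"chance of at least one false positive in a decade: {p_decade:.1%}")
```

Under these assumptions the chance is about 94%, and it is higher still if the per-screening rate is above 25%.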

  • 17

    Why not Queries?

    Queries describe; models promote understanding
        Models can be assessed both by their understanding and their predictions
        "It's difficult to predict, especially the future"

    Queries are event driven; models are phenomenon driven

    Queries are reactive; models are proactive

  • 18

    What Happened on the Titanic?

    [Figure: chart by class, with categories Crew, First, Second, Third]

  • 19

    Mosaic Plot

    [Figure: mosaic plot of the Titanic data; axis labels D/S, A/C, F/M, and class 1, 2, 3, C]

  • 20

    Models

    Powerful predictors for optimizing performance

    Powerful summaries for understanding

    Used to explore data set

    Are not perfect
        "All models are wrong, but some are useful"
        "Statisticians, like artists, have the bad habit of falling in love with their models."

  • 21

    Tree Diagram

    [Figure: classification tree for the Titanic data, with splits on sex (M/F), age (Child/Adult), and class (1, 2, 3, Crew); node percentages shown: 46%, 93%, 27%, 100%, 33%, 23%, 14%]

  • 22

    Why Models?

    What's interesting?
        Most associated variables in the census
        What's associated with shampoo purchases?

    Beer and Diapers
        "In the convenience stores we looked at, on Friday nights, purchases of beer and purchases of diapers are highly associated"
        Conclusions?
        Actions?

  • 23

    Beer and Diapers

    Picture from Tandem™ ad

  • 24

    Toy Problem

    [Figure: ten scatterplot panels, each plotting train2$y (vertical axis, ticks 5 to 25) against train2[, i] (horizontal axis, 0.0 to 1.0)]

  • 25

    Familiar Models

    Linear Regression

  • 26

    Logistic Regression

  • 27

    Linear Regression

    Term       Estimate  Std Error  t Ratio  Prob>|t|
    Intercept  0.806     0.427      1.890    0.059
    x1         7.269     0.273      26.590

  • 28

    Stepwise Regression

    Term       Estimate  Std Error  t Ratio  Prob>|t|
    Intercept  0.561     0.328      1.710    0.087
    x1         7.252     0.273      26.550

  • 29

    Stepwise 2nd Order Model

    Term       Estimate  Std Error  t Ratio  Prob>|t|
    Intercept  0.000     0.000      .        .
    x1         7.204     0.169      42.510

  • 30

    Next Steps

    Higher order terms?

    When to stop?

    Transformations?

    Too simple: underfitting, bias

    Too complex: inconsistent predictions, overfitting, high variance

    Selecting models is Occam's razor
        Keep goals of interpretation vs. prediction in mind

  • 31

    Tree Model

    [Figure: tree model with root split on x4]

  • 32

    Feature Creation

    New predictor based on original predictors

    Often linear:

    $z_i = b_1 x_1 + \cdots + b_p x_p$

    Principal components
    Factor analysis
    Multidimensional scaling
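A linear feature is just a weighted sum of the original predictors. A minimal sketch follows; the coefficients and observations are made up for illustration, whereas a method such as principal components would estimate the coefficients from the data.

```python
def linear_feature(x, b):
    """Return z = b[0]*x[0] + ... + b[p-1]*x[p-1] for one observation x."""
    if len(x) != len(b):
        raise ValueError("x and b must have the same length")
    return sum(bj * xj for bj, xj in zip(b, x))

# Hypothetical coefficients for p = 3 predictors and two hypothetical observations.
b = [0.5, -1.0, 2.0]
observations = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.5]]

# One new feature value z_i per observation.
z = [linear_feature(x, b) for x in observations]
print(z)
```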

  • 33

    Neural Nets

    Don't resemble the brain
        Are just a statistical model

  • 34

    A Single Neuron

    [Figure: inputs x1 to x5 with weights 0.3, 0.7, -0.2, 0.4, -0.5 and a constant input x0 with weight 0.8 feed a single node, whose output is s(z1)]

    z1 = 0.8 + .3x1 + .7x2 - .2x3 + .4x4 - .5x5

    Output: s(z1)
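The single neuron can be sketched in a few lines using the weights from the slide. The slide does not say which function s is; the logistic sigmoid used here is an assumption, and the input vector is made up.

```python
import math

BIAS = 0.8                               # weight on the constant input x0
WEIGHTS = [0.3, 0.7, -0.2, 0.4, -0.5]    # weights on x1..x5, from the slide

def sigmoid(z):
    """Logistic squashing function (assumed form of s)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x):
    """Return (z1, s(z1)) for an input vector x = [x1, ..., x5]."""
    z1 = BIAS + sum(w * xi for w, xi in zip(WEIGHTS, x))
    return z1, sigmoid(z1)

# Hypothetical input: all five predictors equal to 1.
z1, output = neuron([1.0, 1.0, 1.0, 1.0, 1.0])
print(f"z1 = {z1:.2f}, s(z1) = {output:.3f}")
```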

  • 35

    More exotic Neural networks

    [Figure: network with an input layer (x1, x2), a hidden layer (z1, z2, z3), and an output layer (y)]

  • 36

    Running a Neural Net

  • 37

    Predictions for Example

    R-squared: 92.7% (train), 90.6% (test)

  • 38

    What Does This Get Us?

    Enormous flexibility

    Ability to fit anything
        Including noise

    Interpretation?

  • 39

    Case Study: Warranty Data

    A new backpack inkjet printer is showing higher than expected warranty claims
        What are the important variables?
        What's going on?

    A neural network shows that Zipcode is the most important predictor

  • 40

    Spatial Analysis

    Warranty Data showing problem with ink jet printer

    Use the model as a black box for variable selection

  • 41

    MARS

    Multivariate Adaptive Regression Splines

    What do they do?
        Replace each step function in a tree model by a pair of linear functions.

    [Figure: three panels plotting y (vertical axis, -0.2 to 1.2) against x (horizontal axis, 0 to 10)]
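The "pair of linear functions" can be illustrated with the usual hinge pair at a knot. This is a sketch only: the knot and coefficients below are made up, and a real MARS fit would choose them from the data.

```python
def step(x, knot, left_value, right_value):
    """Tree-style step function: one constant on each side of the knot."""
    return left_value if x < knot else right_value

def hinge_pair(x, knot):
    """MARS-style pair of linear pieces: (max(0, x - knot), max(0, knot - x))."""
    return max(0.0, x - knot), max(0.0, knot - x)

def mars_term(x, knot, b_right, b_left, intercept=0.0):
    """Piecewise-linear fit built from one hinge pair."""
    right, left = hinge_pair(x, knot)
    return intercept + b_right * right + b_left * left

# Hypothetical knot at x = 4 with slopes 0.2 (right) and -0.1 (left).
for x in (2.0, 4.0, 6.0):
    print(x, step(x, 4.0, 0.0, 1.0), mars_term(x, 4.0, 0.2, -0.1))
```

Unlike the step function, the hinge-based fit is continuous at the knot.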

  • 42

    MARS Variable Importance

    R-squared: 95.0% Train, 94.3% Test
    (96.3%) (95.8%)

  • 43

    MARS Function Output

  • 44

    Collaborative Filtering

    Goal: predict what movies people will like

    Data: list of movies each person has watched

    Lyle:  Andre, Starwars
    Ellen: Andre, Starwars, Coeur en Hiver
    Fred:  Starwars, Batman
    Dean:  Starwars, Batman, Rambo
    Jason: Coeur en Hiver, Chocolat

  • 45

    Data Base

    Data can be represented as a sparse matrix

    Karen likes Andre. What else might she like?

    CDNow doubled e-mail responses

    Andre Starwars Batman Rambo Coeur Chocolat

    Lyle y y
    Ellen y y y
    Fred y y
    Dean y y y
    Jason y y y

    Karen y ? ? ? ? ?
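One simple way to answer the Karen question is to count what else was watched by people who share a movie with her. This is a minimal co-occurrence sketch, not the method CDNow used (the slides do not describe it); the viewing lists are the ones given on the Collaborative Filtering slide.

```python
from collections import Counter

# Viewing lists from the Collaborative Filtering slide.
watched = {
    "Lyle": {"Andre", "Starwars"},
    "Ellen": {"Andre", "Starwars", "Coeur en Hiver"},
    "Fred": {"Starwars", "Batman"},
    "Dean": {"Starwars", "Batman", "Rambo"},
    "Jason": {"Coeur en Hiver", "Chocolat"},
}

def recommend(liked, watched):
    """Rank unseen movies by how many viewers share at least one liked movie."""
    counts = Counter()
    for movies in watched.values():
        if liked & movies:                 # this viewer overlaps with the new user
            for movie in movies - liked:   # count only movies the new user has not listed
                counts[movie] += 1
    return counts.most_common()

ranking = recommend({"Andre"}, watched)
print(ranking)
```

Lyle and Ellen both watched Andre, so Starwars (watched by both) ranks ahead of Coeur en Hiver (watched by Ellen only).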

  • 46

    How Do We Really Start?

    Life is not so kind
        Categorical variables
        Missing data
        500 variables, not 10

    481 variables: where to start?

  • 47

    Where to Start?

    EDM
        Use a tree to find a smaller subset of variables to investigate
        Explore this set graphically
        Start the modeling process over

    Build model
        Compare model on small subset with full predictive model

  • 48

    Start With a Simple Model

    Maybe a tree:

    [Figure: tree with root split on x4]

  • 49

    Automatic Models

    KXEN

  • 50

    PVA Results from KXEN

  • 51

    Combining Models -- Bagging

    Bagging (Bootstrap Aggregation)
        Bootstrap a data set repeatedly
        Take many versions of same model (e.g. tree)
        Form a committee of models
        Take majority rule of predictions
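The four bagging steps can be sketched with a one-split tree (a "stump") as the base model. The toy data, the number of models, and the seed are illustrative assumptions, not from the talk.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Fit a one-split tree: predict `sign` when x < threshold, otherwise -sign."""
    best = None
    thresholds = [min(xs) - 0.5] + [x + 0.5 for x in xs]
    for t in thresholds:
        for sign in (1, -1):
            errors = sum(1 for x, y in zip(xs, ys) if (sign if x < t else -sign) != y)
            if best is None or errors < best[0]:
                best = (errors, t, sign)
    _, t, sign = best
    return lambda x: sign if x < t else -sign

def bag(xs, ys, n_models=25, seed=0):
    """Bootstrap the data repeatedly and fit one stump per bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]   # sample n rows with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagged_predict(models, x):
    """Majority vote of the committee."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy data: class -1 for x = 0..4, class +1 for x = 5..9.
xs = list(range(10))
ys = [-1] * 5 + [1] * 5
models = bag(xs, ys)
print([bagged_predict(models, x) for x in xs])
```

An odd number of models avoids tied votes in the two-class case.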

  • 52

    Combining Models -- Boosting

    Take the data and apply a simple classifier

    Reweight the data, weighting the misclassified data much higher.

    Reapply the classifier

    Repeat over and over

    The final prediction is a combination of the output of each classifier, weighted by the overall misclassification rate.

    Details in Freund, Y. (1995), "Boosting a weak learning algorithm by majority", Information and Computation 121(2), 256-285.
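The reweight-and-repeat loop can be sketched as follows. This is an AdaBoost-style sketch (Freund and Schapire), swapped in because it matches the steps listed above; it is not necessarily the boost-by-majority algorithm of the cited paper. The toy data and the three rounds are illustrative assumptions.

```python
import math

def stump_predict(x, threshold, sign):
    """Simple classifier: predict `sign` when x < threshold, otherwise -sign."""
    return sign if x < threshold else -sign

def fit_weighted_stump(xs, ys, weights):
    """Return (weighted error, threshold, sign) of the best stump."""
    best = None
    thresholds = [min(xs) - 0.5] + [x + 0.5 for x in xs]
    for t in thresholds:
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, t, sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def adaboost(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, threshold, sign)
    for _ in range(n_rounds):
        err, t, sign = fit_weighted_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)      # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)    # classifier weight from its error rate
        ensemble.append((alpha, t, sign))
        # Misclassified points (y * prediction = -1) are weighted up, the rest down.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, sign))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, sign) for alpha, t, sign in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can classify perfectly: + + - - - - + + +
xs = list(range(9))
ys = [1, 1, -1, -1, -1, -1, 1, 1, 1]
ensemble = adaboost(xs, ys, n_rounds=3)
print([predict(ensemble, x) for x in xs])
```

On this toy set the best single stump still misclassifies two points, while the weighted combination of three stumps classifies every training point correctly.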

  • 53

    Breast Cancer Diagnosis

  • 54

    Results from Random Forest

                     False Positive Rate  False Negative Rate
    Tree             32.20%               33.70%
    Boosted Trees    24.90%               32.50%
    Random Forest    19.30%               28.80%
    Neural Network   25.50%               31.70%
    Radiologists     22.40%               35.80%

    Results from 1000 splits of Training and Test data
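For reference, a sketch of how the two rates in the table are typically computed from test-set predictions. The definitions used here (false positives over actual negatives, false negatives over actual positives) are an assumption, since the slide does not define them, and the labels are made up.

```python
def error_rates(actual, predicted):
    """Return (false positive rate, false negative rate) for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    return fp / (fp + tn), fn / (fn + tp)

# Hypothetical test set: 4 cancers (1) and 6 healthy cases (0).
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

fpr, fnr = error_rates(actual, predicted)
print(f"false positive rate {fpr:.1%}, false negative rate {fnr:.1%}")
```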

  • 55

    Case Study: Ingot failures

    Ingot cracking
        953 30,000 lb. ingots
        20% cracking rate
        $30,000 per recast
        90 potential explanatory variables
            Water composition (reduced)
            Metal composition
            Process variables
            Other environmental variables

  • 56

    Model building process

    Model building
        Train
        Test

    Evaluate
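The train / test / evaluate loop can be sketched with a holdout split and a one-predictor least-squares line. The data, the 70/30 split, and the seed are illustrative assumptions.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(xs, ys, intercept, slope):
    """1 - SS_residual / SS_total on the given data."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data: a straight line plus a small alternating disturbance.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(xs)]

# Train / test split (70% / 30%, an arbitrary choice).
rng = random.Random(0)
indices = list(range(len(xs)))
rng.shuffle(indices)
cut = len(indices) * 7 // 10
train_idx, test_idx = indices[:cut], indices[cut:]
train_x, train_y = [xs[i] for i in train_idx], [ys[i] for i in train_idx]
test_x, test_y = [xs[i] for i in test_idx], [ys[i] for i in test_idx]

# Fit on the training rows only, then evaluate on both sets.
intercept, slope = fit_line(train_x, train_y)
r2_train = r_squared(train_x, train_y, intercept, slope)
r2_test = r_squared(test_x, test_y, intercept, slope)
print(f"R-squared: {r2_train:.1%} (train), {r2_test:.1%} (test)")
```

The test figure is the one to report; the training figure is optimistic because the model was fit to those rows.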

  • 57

    Most Important Variable

    Take One
        Here we started with trees
        Alloy
        We know that

    OK, take two
        Yttrium
        What do you think is in the alloy?

    Third time's the charm?
        Selenium!
        OH!

  • 58

    Case Study: Car Insurance

    Now that we have 40,000 mature policies, can we find other factors to price policies better?

    65 potential predictors
        Industry, vehicle age, color, number of vehicles, usage, location, etc.

  • 59

    Fast Fail

    Not every modeling effort is a success
        A model search can save lots of queries

    Data took 8 months to get ready

    Analyst spent 2 months exploring it

    A new model search program (KXEN) running for several hours found no out-of-sample predictive ability
        Tree model gave similar results

  • 60

    PVA Recap

    Remember: 481 predictor variables

    Need a way to trim this down

    Need an exploratory model
        Neural network?
        Tree?

  • 61

    Students in Data Mining Class

    Student #1  $15,024
    Student #2  $14,695
    Student #3  $14,345

  • 62

    Take Home Messages

    What a great time to be a Statistician!

    Problems are exciting

    Research is exciting

    Success in data mining
        Requires team work
        Requires flexibility in modeling
        Means that you act on your results
        Depends much more on the way you mine the data than on the specific model or tool that you use

    Which method to use?
        Yes!!
        Have fun!

  • 63

    Thank you!

    [email protected]