Data Meetup Presentation

Uploaded by dynamo13 on 02-Apr-2018

TRANSCRIPT

  • 7/27/2019 Data Meetup Presentation

    1/51

    Data Science DC June 28, 2012

    Gerhard Pilcher [email protected]

Slide 2

[Figure: "Useful = ?" word diagram: Material? Attention? Work? Other? Divorce? New Car? Affordable? Sell Motorcycle? Presence? (0.75)]

Slide 3

Topics: Starting, Over-Fit, Amplify, Noise, Distributions, Transformations, Baseline, Confidence, Mahalanobis, Hidden/Rare

Slide 4

Possibly less than 0.1% of cases are events.

A model with 99% accuracy can still miss 9 of 10 events.

Signal is diffuse; patterns are faint.

Data sets are typically large and messy.

Rigorous process methods are needed.

Fraud is a stealthy case of human response.
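The accuracy-paradox claim above can be checked with simple arithmetic (a sketch; the exact counts are illustrative, not from the slides):

```python
# Illustrative counts (not from the slides): 0.1% event rate.
n = 1_000_000
events = 1_000

# A hypothetical model that is "99% accurate" yet catches only 1 in 10 events:
true_positives = 100
false_negatives = events - true_positives          # 900 missed events
total_errors = int(n * 0.01)                       # 10,000 wrong predictions
false_positives = total_errors - false_negatives   # the rest are false alarms

accuracy = (n - total_errors) / n
recall = true_positives / events
print(f"accuracy = {accuracy:.1%}, recall = {recall:.1%}")
```

So headline accuracy says almost nothing about rare-event performance; recall (the share of events caught) is the number to watch.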

Slide 5

Getting Started

Time

Amplify the response

Noise: variable selection / reduction

Avoiding over-fit: ensembles / bundling

Establishing a baseline: target shuffling

Gaining confidence: cross-validation / bundling

Slide 6

Create a variable to retain the original ordering.

Separate out-of-sample (Test) data first!

Don't split observations.

Stratification is a safer strategy.

Avoid variable selection before the split.

Temporal component.

Create random "canaries in the mine" variables.
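A minimal sketch of a stratified out-of-sample split (pure Python; `stratified_split` and its arguments are hypothetical names, not from the talk):

```python
import random

def stratified_split(rows, label_of, test_frac=0.25, seed=42):
    """Split rows into train/test while keeping the class mix in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    train, test = [], []
    for cls, members in by_class.items():
        rng.shuffle(members)
        k = int(round(len(members) * test_frac))
        test.extend(members[:k])
        train.extend(members[k:])
    return train, test

# 1,000 rows with a 5% rare class
data = [(i, 1 if i < 50 else 0) for i in range(1000)]
train, test = stratified_split(data, label_of=lambda r: r[1])
rare_in_test = sum(1 for _, y in test if y == 1)
print(len(test), rare_in_test)
```

Splitting within each class keeps the rare-event fraction the same in train and test, which a plain random split can easily distort when events are this scarce.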

Slide 7

Slide 8

Modify the data:

Up-sampling: duplicate rare cases.

Down-sampling: use all rare cases, sample the rest.

Modify the response:

Cost-sensitive learning: profit or cost matrix.

Essentially weights responses based on the cost or profit of each level.
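Down-sampling as described: keep every rare case and sample the majority (a sketch; the function name and fractions are my own):

```python
import random

def down_sample(rows, label_of, majority_frac=0.1, seed=7):
    """Keep ALL rare (positive) cases; keep only a fraction of the rest."""
    rng = random.Random(seed)
    rare = [r for r in rows if label_of(r) == 1]
    common = [r for r in rows if label_of(r) == 0]
    kept = rng.sample(common, int(len(common) * majority_frac))
    return rare + kept

# 10,000 rows, 20 rare events
data = [(i, 1 if i < 20 else 0) for i in range(10_000)]
sample = down_sample(data, label_of=lambda r: r[1])
print(len(sample))  # 20 rare + 998 sampled majority = 1018
```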

Slide 9

Modify the data: correcting predicted probabilities.

After sampling, rescale the model's score by the odds ratio between the original and the sampled event rates:

adjusted p = (p × β) / (1 − p + p × β), where β = [of / (1 − of)] / [sf / (1 − sf)]

(of = original event fraction, sf = sampled event fraction, p = model score.)

Slide 10

Modify the data, worked example: Orig Frac = 0.05, Sample Frac = 0.2, Score = 0.7.

β = (0.05 / 0.95) / (0.2 / 0.8) ≈ 0.2105

adjusted p = (0.7 × 0.2105) / (0.3 + 0.7 × 0.2105) ≈ 0.33 (hopefully the math is correct)
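The arithmetic above can be wrapped in a function. The odds-ratio form of the correction is my reconstruction, but it reproduces the slide's 0.33:

```python
def correct_probability(score, orig_frac, sample_frac):
    """Rescale a score from the sampled data back to the original event rate."""
    # odds multiplier between the original and the sampled class mix
    beta = (orig_frac / (1 - orig_frac)) / (sample_frac / (1 - sample_frac))
    odds = score / (1 - score) * beta
    return odds / (1 + odds)

p = correct_probability(0.7, orig_frac=0.05, sample_frac=0.2)
print(round(p, 2))  # 0.33
```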

Slide 11

Lots of variables → sparseness

Multicollinearity → principal components

Confounding

Interactions

Variable selection:

CART / Random Forest (Breiman, 2001)

Texas 2-Step (Elder, 1994)

Slide 12

Slide 13

Complexity: a war between accuracy and loss of generality on out-of-sample data.

Non-linear complexity: Generalized Degrees of Freedom (Ye, 1998).

Add random noise to the response; measure changes to the estimates.

Higher adaptation = higher complexity.
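A sketch of the perturb-and-refit idea behind Generalized Degrees of Freedom (a minimal implementation of my own, assuming Ye's recipe): add noise to the response, refit, and sum how much each fitted value tracks its own perturbation. For an ordinary least-squares line the answer should come out close to its 2 parameters.

```python
import numpy as np

def gdf(fit_predict, x, y, tau=0.1, trials=100, seed=0):
    """Estimate complexity as the summed sensitivity of fitted values
    to small random perturbations of the response (Ye, 1998, sketched)."""
    rng = np.random.default_rng(seed)
    deltas = np.array([rng.normal(0, tau, size=len(y)) for _ in range(trials)])
    fits = np.array([fit_predict(x, y + d) for d in deltas])
    # slope of fitted value vs. its own perturbation, per observation
    slopes = [np.polyfit(deltas[:, i], fits[:, i], 1)[0] for i in range(len(y))]
    return float(sum(slopes))

def line_fit(x, y):
    b, a = np.polyfit(x, y, 1)   # ordinary least-squares line
    return b * x + a

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 40)
y = 2 * x + rng.normal(0, 0.2, size=40)
g = gdf(line_fit, x, y)
print(round(g, 1))  # close to 2: a line adapts like its 2 parameters
```

A model that bends toward the injected noise (higher summed slopes) is more complex, regardless of how many nominal parameters it has.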

Slide 14

    A case for bundling / ensembles

Slide 15

    Add noise and take a sample

Slide 16

Measured complexity is lower for bagged trees!

[Chart: 5 bagged trees with 3 splits vs. 5 bagged trees with 7 splits (over-fitting)]

Slide 17

Target Shuffling

1. Break the link between the target, Y, and the features, X, by shuffling Y to form Ys.

2. Model the new Ys ~ f(X).

3. Note the quality of the resulting (random) model.

4. Repeat to build a distribution (of random models).

5. True model performance can be measured against this distribution. The mean (or best) shuffled model can be the baseline for comparison.
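Steps 1-5 can be sketched in a few lines (the toy data and R² metric are my own, not the talk's):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)        # a modest but real signal

def r_squared(x, y):
    """R^2 of a least-squares line y ~ a + b*x."""
    b, a = np.polyfit(x, y, 1)
    resid = y - (a + b * x)
    return 1 - resid.var() / y.var()

real = r_squared(x, y)                           # fit to the true target
shuffled = [r_squared(x, rng.permutation(y))     # steps 1-4: shuffle, refit
            for _ in range(500)]
baseline = np.percentile(shuffled, 95)           # step 5: random baseline
print(real > baseline)  # True: the real model beats 95% of random models
```

If the "real" model cannot beat models trained on shuffled targets, its apparent skill is indistinguishable from luck.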

Slide 18

Target Shuffling: a single test

Slide 19

Target Shuffling: 7% of the data (in each of 5 tests)

Slide 20

Target Shuffling: 100% of the data (in each of 5 tests)

Slide 21

Cut-off Decisions: Workload

              Actual F   Actual T
Predicted F   800,000    100
Predicted T   20,000     900
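Reading the matrix above: predicting T sends a case to investigators, so the cut-off directly sets the workload. A quick check with the slide's numbers:

```python
tn, fn = 800_000, 100    # predicted F: true negatives, missed events
fp, tp = 20_000, 900     # predicted T: false alarms, caught events

workload = fp + tp               # cases investigators must review
recall = tp / (tp + fn)          # share of true events caught
precision = tp / workload        # hit rate inside the workload
print(workload, f"{recall:.0%}", f"{precision:.1%}")  # 20900 90% 4.3%
```

Catching 90% of events costs 20,900 investigations, of which only about 1 in 23 is a real event; moving the cut-off trades recall against that workload.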

Slide 22

    Cut-off Decisions

Slide 23

Cut-off Decisions: two candidate cut-offs

               Actual OK   Actual BAD
Predicted OK   1540        246
Predicted BAD  49          150

               Actual OK   Actual BAD
Predicted OK   846         47
Predicted BAD  743         349

Slide 24

Cut-off Decisions

[Gains chart, both axes 0%-100%: set the investigation limit and note the expected response; or set the desired response and note the work requirements.]

Slide 25

V-fold Cross-Validation

Train V models on different (overlapping) data subsets.

Test each on its unseen data.

Use the distribution of test results to score the procedure realistically.

Slide 26

    4-fold Cross Validation
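The 4-fold scheme can be sketched as index bookkeeping (the function name is my own):

```python
import random

def v_fold_indices(n, v=4, seed=0):
    """Assign each of n rows to one of v folds; each fold is the test
    set once and part of the training set the other v-1 times."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::v] for i in range(v)]

folds = v_fold_indices(20, v=4)
for k, test in enumerate(folds):
    train = [i for f in folds if f is not test for i in f]
    print(f"fold {k}: train {len(train)} rows, test {len(test)} rows")
```

Every row is tested exactly once, so the V test scores together give a realistic distribution of out-of-sample performance.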

Slide 27

    Ensembles/Bundling (J. Elder with S. Lee, U. Idaho, 1997)


Slide 28

    Ensembles/Bundling (J. Elder with S. Lee, U. Idaho, 1997)

Slide 29

    Ensembles/Bundling (how many?)

Slide 30

Slide 31

Topics (recap): Starting, Over-Fit, Amplify, Noise, Distributions, Transformations, Baseline, Confidence, Mahalanobis, Hidden/Rare

Slide 32

No or few known events.

Subject-matter expertise is essential.

Defining anomalies.

Data quality and matching.

Sparse data.

Measuring goodness of the model.

Slide 33

Time

Transformations

Other distributions

Clustering

Mahalanobis distance measure

Slide 34

    Time

Slide 35

Transformations

Functions (log, square root, power, multiplicative inverse)

Normalization to (0, 1): outliers?

Standardization: Gaussian?

Categorical dummy variables

Variance-stabilizing: Fisher, Box-Cox, Anscombe
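A few of these transformations in NumPy (a sketch with made-up skewed data; Box-Cox and friends need a stats library and are omitted):

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0, 100.0, 10_000.0])   # strongly right-skewed

log_x = np.log(x)                    # compresses the long right tail
sqrt_x = np.sqrt(x)                  # milder compression
inv_x = 1.0 / x                      # multiplicative inverse
minmax = (x - x.min()) / (x.max() - x.min())   # normalization to (0, 1)
zscore = (x - x.mean()) / x.std()    # standardization (assumes ~Gaussian)

print(minmax.min(), minmax.max())  # 0.0 1.0
```

Note the slide's caveats: min-max normalization is distorted by a single outlier (here 10,000 squashes everything else near 0), and z-scores only behave well if the data is roughly Gaussian.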

Slide 36

    Transformations

Slide 37

    Transformations

Slide 38

    http://itl.nist.gov/div898/handbook/eda/section3/eda366.htm

Slide 39

Euclidean distance: d(A, B) = ‖A − B‖ = sqrt(Σ_i (A_i − B_i)²)

Sensitive to scale; blind to covariance.

Mahalanobis distance: D(x) = sqrt((x − μ)ᵀ S⁻¹ (x − μ))

Assumes normally distributed, continuous variables.

With zero covariance (and standardized variables), it reduces to Euclidean distance.

The following explanation is shamelessly borrowed from smart people like:

Antonia de Medinaceli and Kenny Darrell, Elder Research

Will Dwinnell, blogger, MATLAB user, data miner / statistician: http://matlabdatamining.blogspot.com/2006/11/mahalanobis-distance.html

Rick Wicklin, SAS: http://blogs.sas.com/content/iml/author/rickwicklin/

Slide 40

    Is this point an outlier?

Slide 41

    Is this point an outlier?

Slide 42

    Is this point an outlier?

Slide 43

    Maps to the same point in 2D

Slide 44

    Euclidean distance from centroid

Slide 45

So Euclidean distance is not the best distance metric; we need to account for the shape of the mass of data.

Hmmm, I wonder if the Mahalanobis distance would work?

Slide 46

D(X) = sqrt((X − μ)ᵀ S⁻¹ (X − μ)), where X = (X1, X2), μ = (μ1, μ2), and S is the covariance matrix of X1, X2.

Accounts for the variance of each variable and their covariance.

Essentially transforms the data into standardized, uncorrelated data and computes ordinary Euclidean distances on the transformed data.

Slide 47

Slide 48

Cool, so what's an outlier?

The squared Mahalanobis distance of multivariate normal data obeys a chi-square distribution with p degrees of freedom, where p is the number of variables (Rick Wicklin, SAS).

Cut-off = sqrt(χ²(0.975, p)):

2 variables → 2.72

10 variables → 4.53
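The distance and the cut-off can be sketched together in NumPy (the data and test points are my own; the 2.72 cut-off is the slide's sqrt(chi-square(0.975, 2)) for p = 2 variables):

```python
import numpy as np

def mahalanobis(points, data):
    """Distance of each point from the data's centroid, accounting
    for the variance and covariance of the variables."""
    mu = data.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = points - mu
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, S_inv, diff))

rng = np.random.default_rng(0)
# two correlated variables
z = rng.normal(size=(500, 2))
data = z @ np.array([[1.0, 0.8], [0.0, 0.6]])

cutoff = 2.72   # slide's value: sqrt(chi-square(0.975, 2))
d = mahalanobis(np.array([[0.0, 0.0], [4.0, -4.0]]), data)
print(d > cutoff)  # the centroid point is inside; the far point is flagged
```

The point (4, −4) lies against the grain of the correlation, so its Mahalanobis distance is large even though its Euclidean distance alone would not look extreme.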

Slide 49

    (Rick Wicklin, SAS)

Slide 50

Slide 51

Topics (recap): Starting, Over-Fit, Amplify, Noise, Distributions, Transformations, Baseline, Confidence, Mahalanobis, Hidden/Rare