Data Meetup Presentation

Uploaded by dynamo13 on 02-Apr-2018

TRANSCRIPT

  • 7/27/2019 Data Meetup Presentation

    1/51

    Data Science DC June 28, 2012

    Gerhard Pilcher [email protected]

Slide 2

[Figure: "Useful = ?" word diagram: Material? Attention? Work? Other? Divorce? New Car? Affordable? Sell Motorcycle? Presence? (0.75)]

Slide 3

Topics: Starting, Over-Fit, Amplify, Noise, Distributions, Transformations, Baseline, Confidence, Mahalanobis, Hidden/Rare

Slide 4

Possibly less than 0.1% of cases are events.

A model with 99% accuracy can still miss 9 of 10 events.

Signal is diffuse; patterns are faint.

Data sets are typically large and messy.

Rigorous process methods are needed.

Fraud is a stealthy case of human response.
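The accuracy-paradox claim above can be checked with simple arithmetic (a sketch; the exact counts are illustrative, not from the slides):

```python
# Illustrative counts (not from the slides): 0.1% event rate.
n = 1_000_000
events = 1_000

# A hypothetical model that is "99% accurate" yet catches only 1 in 10 events:
true_positives = 100
false_negatives = events - true_positives          # 900 missed events
total_errors = int(n * 0.01)                       # 10,000 wrong predictions
false_positives = total_errors - false_negatives   # the rest are false alarms

accuracy = (n - total_errors) / n
recall = true_positives / events
print(f"accuracy = {accuracy:.1%}, recall = {recall:.1%}")
```

So headline accuracy says almost nothing about rare-event performance; recall (the share of events caught) is the number to watch.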

Slide 5

Getting Started

Time

Amplify the response

Noise: variable selection / reduction

Avoiding over-fit: ensembles / bundling

Establishing a baseline: target shuffling

Gaining confidence: cross-validation / bundling

Slide 6

Create a variable to retain the original ordering.

Separate out-of-sample (Test) data first!

Don't split observations.

Stratification is a safer strategy.

Avoid variable selection before the split.

Temporal component.

Create random "canaries in the mine" variables.
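A minimal sketch of a stratified out-of-sample split (pure Python; `stratified_split` and its arguments are hypothetical names, not from the talk):

```python
import random

def stratified_split(rows, label_of, test_frac=0.25, seed=42):
    """Split rows into train/test while keeping the class mix in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    train, test = [], []
    for cls, members in by_class.items():
        rng.shuffle(members)
        k = int(round(len(members) * test_frac))
        test.extend(members[:k])
        train.extend(members[k:])
    return train, test

# 1,000 rows with a 5% rare class
data = [(i, 1 if i < 50 else 0) for i in range(1000)]
train, test = stratified_split(data, label_of=lambda r: r[1])
rare_in_test = sum(1 for _, y in test if y == 1)
print(len(test), rare_in_test)
```

Splitting within each class keeps the rare-event fraction the same in train and test, which a plain random split can easily distort when events are this scarce.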

Slide 7

Slide 8

Modify the data:

Up-sampling: duplicate rare cases.

Down-sampling: use all rare cases, sample the rest.

Modify the response:

Cost-sensitive learning: profit or cost matrix.

Essentially weights responses based on the cost or profit of each level.
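Down-sampling as described: keep every rare case and sample the majority (a sketch; the function name and fractions are my own):

```python
import random

def down_sample(rows, label_of, majority_frac=0.1, seed=7):
    """Keep ALL rare (positive) cases; keep only a fraction of the rest."""
    rng = random.Random(seed)
    rare = [r for r in rows if label_of(r) == 1]
    common = [r for r in rows if label_of(r) == 0]
    kept = rng.sample(common, int(len(common) * majority_frac))
    return rare + kept

# 10,000 rows, 20 rare events
data = [(i, 1 if i < 20 else 0) for i in range(10_000)]
sample = down_sample(data, label_of=lambda r: r[1])
print(len(sample))  # 20 rare + 998 sampled majority = 1018
```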

Slide 9

Modify the data: correcting predicted probabilities.

After sampling, rescale the model's score by the odds ratio between the original and the sampled event rates:

adjusted p = (p × β) / (1 − p + p × β), where β = [of / (1 − of)] / [sf / (1 − sf)]

(of = original event fraction, sf = sampled event fraction, p = model score.)

Slide 10

Modify the data, worked example: Orig Frac = 0.05, Sample Frac = 0.2, Score = 0.7.

β = (0.05 / 0.95) / (0.2 / 0.8) ≈ 0.2105

adjusted p = (0.7 × 0.2105) / (0.3 + 0.7 × 0.2105) ≈ 0.33 (hopefully the math is correct)
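The arithmetic above can be wrapped in a function. The odds-ratio form of the correction is my reconstruction, but it reproduces the slide's 0.33:

```python
def correct_probability(score, orig_frac, sample_frac):
    """Rescale a score from the sampled data back to the original event rate."""
    # odds multiplier between the original and the sampled class mix
    beta = (orig_frac / (1 - orig_frac)) / (sample_frac / (1 - sample_frac))
    odds = score / (1 - score) * beta
    return odds / (1 + odds)

p = correct_probability(0.7, orig_frac=0.05, sample_frac=0.2)
print(round(p, 2))  # 0.33
```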

Slide 11

Lots of variables → sparseness

Multicollinearity → principal components

Confounding

Interactions

Variable selection:

CART / Random Forest (Breiman, 2001)

Texas 2-Step (Elder, 1994)

Slide 12

Slide 13

Complexity: a war between accuracy and loss of generality on out-of-sample data.

Non-linear complexity: Generalized Degrees of Freedom (Ye, 1998).

Add random noise to the response; measure changes to the estimates.

Higher adaptation = higher complexity.
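A sketch of the perturb-and-refit idea behind Generalized Degrees of Freedom (a minimal implementation of my own, assuming Ye's recipe): add noise to the response, refit, and sum how much each fitted value tracks its own perturbation. For an ordinary least-squares line the answer should come out close to its 2 parameters.

```python
import numpy as np

def gdf(fit_predict, x, y, tau=0.1, trials=100, seed=0):
    """Estimate complexity as the summed sensitivity of fitted values
    to small random perturbations of the response (Ye, 1998, sketched)."""
    rng = np.random.default_rng(seed)
    deltas = np.array([rng.normal(0, tau, size=len(y)) for _ in range(trials)])
    fits = np.array([fit_predict(x, y + d) for d in deltas])
    # slope of fitted value vs. its own perturbation, per observation
    slopes = [np.polyfit(deltas[:, i], fits[:, i], 1)[0] for i in range(len(y))]
    return float(sum(slopes))

def line_fit(x, y):
    b, a = np.polyfit(x, y, 1)   # ordinary least-squares line
    return b * x + a

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 40)
y = 2 * x + rng.normal(0, 0.2, size=40)
g = gdf(line_fit, x, y)
print(round(g, 1))  # close to 2: a line adapts like its 2 parameters
```

A model that bends toward the injected noise (higher summed slopes) is more complex, regardless of how many nominal parameters it has.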

Slide 14

    A case for bundling / ensembles

Slide 15

    Add noise and take a sample

Slide 16

Measured complexity is lower for bagged trees!

[Chart: 5 bagged trees with 3 splits vs. 5 bagged trees with 7 splits (over-fitting)]

Slide 17

Target Shuffling

1. Break the link between the target, Y, and the features, X, by shuffling Y to form Ys.

2. Model the new Ys ~ f(X).

3. Note the quality of the resulting (random) model.

4. Repeat to build a distribution (of random models).

5. True model performance can be measured against this distribution. The mean (or best) shuffled model can be the baseline for comparison.
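Steps 1-5 can be sketched in a few lines (the toy data and R² metric are my own, not the talk's):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)        # a modest but real signal

def r_squared(x, y):
    """R^2 of a least-squares line y ~ a + b*x."""
    b, a = np.polyfit(x, y, 1)
    resid = y - (a + b * x)
    return 1 - resid.var() / y.var()

real = r_squared(x, y)                           # fit to the true target
shuffled = [r_squared(x, rng.permutation(y))     # steps 1-4: shuffle, refit
            for _ in range(500)]
baseline = np.percentile(shuffled, 95)           # step 5: random baseline
print(real > baseline)  # True: the real model beats 95% of random models
```

If the "real" model cannot beat models trained on shuffled targets, its apparent skill is indistinguishable from luck.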

Slide 18

Target Shuffling: a single test

Slide 19

Target Shuffling: 7% of the data (in each of 5 tests)

Slide 20

Target Shuffling: 100% of the data (in each of 5 tests)

Slide 21

Cut-off Decisions: Workload

              Actual F   Actual T
Predicted F   800,000    100
Predicted T   20,000     900
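Reading the matrix above: predicting T sends a case to investigators, so the cut-off directly sets the workload. A quick check with the slide's numbers:

```python
tn, fn = 800_000, 100    # predicted F: true negatives, missed events
fp, tp = 20_000, 900     # predicted T: false alarms, caught events

workload = fp + tp               # cases investigators must review
recall = tp / (tp + fn)          # share of true events caught
precision = tp / workload        # hit rate inside the workload
print(workload, f"{recall:.0%}", f"{precision:.1%}")  # 20900 90% 4.3%
```

Catching 90% of events costs 20,900 investigations, of which only about 1 in 23 is a real event; moving the cut-off trades recall against that workload.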

Slide 22

    Cut-off Decisions

Slide 23

Cut-off Decisions: two candidate cut-offs

               Actual OK   Actual BAD
Predicted OK   1540        246
Predicted BAD  49          150

               Actual OK   Actual BAD
Predicted OK   846         47
Predicted BAD  743         349

Slide 24

Cut-off Decisions

[Gains chart, both axes 0%-100%: set the investigation limit and note the expected response; or set the desired response and note the work requirements.]

Slide 25

V-fold Cross-Validation

Train V models on different (overlapping) data subsets.

Test each on its unseen data.

Use the distribution of test results to score the procedure realistically.

Slide 26

    4-fold Cross Validation
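The 4-fold scheme can be sketched as index bookkeeping (the function name is my own):

```python
import random

def v_fold_indices(n, v=4, seed=0):
    """Assign each of n rows to one of v folds; each fold is the test
    set once and part of the training set the other v-1 times."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::v] for i in range(v)]

folds = v_fold_indices(20, v=4)
for k, test in enumerate(folds):
    train = [i for f in folds if f is not test for i in f]
    print(f"fold {k}: train {len(train)} rows, test {len(test)} rows")
```

Every row is tested exactly once, so the V test scores together give a realistic distribution of out-of-sample performance.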

Slide 27

    Ensembles/Bundling (J. Elder with S. Lee, U. Idaho, 1997)


Slide 28

    Ensembles/Bundling (J. Elder with S. Lee, U. Idaho, 1997)

Slide 29

    Ensembles/Bundling (how many?)

Slide 30

Slide 31

Topics (recap): Starting, Over-Fit, Amplify, Noise, Distributions, Transformations, Baseline, Confidence, Mahalanobis, Hidden/Rare

Slide 32

No or few known events.

Subject-matter expertise is essential.

Defining anomalies.

Data quality and matching.

Sparse data.

Measuring goodness of the model.

Slide 33

Time

Transformations

Other distributions

Clustering

Mahalanobis distance measure

Slide 34

    Time

Slide 35

Transformations

Functions (log, square root, power, multiplicative inverse)

Normalization to (0, 1): outliers?

Standardization: Gaussian?

Categorical dummy variables

Variance-stabilizing: Fisher, Box-Cox, Anscombe
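A few of these transformations in NumPy (a sketch with made-up skewed data; Box-Cox and friends need a stats library and are omitted):

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0, 100.0, 10_000.0])   # strongly right-skewed

log_x = np.log(x)                    # compresses the long right tail
sqrt_x = np.sqrt(x)                  # milder compression
inv_x = 1.0 / x                      # multiplicative inverse
minmax = (x - x.min()) / (x.max() - x.min())   # normalization to (0, 1)
zscore = (x - x.mean()) / x.std()    # standardization (assumes ~Gaussian)

print(minmax.min(), minmax.max())  # 0.0 1.0
```

Note the slide's caveats: min-max normalization is distorted by a single outlier (here 10,000 squashes everything else near 0), and z-scores only behave well if the data is roughly Gaussian.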

Slide 36

    Transformations

Slide 37

    Transformations

Slide 38

    http://itl.nist.gov/div898/handbook/eda/section3/eda366.htm

Slide 39

Euclidean distance: d(A, B) = ‖A − B‖ = sqrt(Σ_i (A_i − B_i)²)

Sensitive to scale; blind to covariance.

Mahalanobis distance: D(x) = sqrt((x − μ)ᵀ S⁻¹ (x − μ))

Assumes normally distributed, continuous variables.

With zero covariance (and standardized variables), it reduces to Euclidean distance.

The following explanation is shamelessly borrowed from smart people like:

Antonia de Medinaceli and Kenny Darrell, Elder Research

Will Dwinnell, blogger, MATLAB user, data miner / statistician: http://matlabdatamining.blogspot.com/2006/11/mahalanobis-distance.html

Rick Wicklin, SAS: http://blogs.sas.com/content/iml/author/rickwicklin/

Slide 40

    Is this point an outlier?

Slide 41

    Is this point an outlier?

Slide 42

    Is this point an outlier?

Slide 43

    Maps to the same point in 2D

Slide 44

    Euclidean distance from centroid

Slide 45

So Euclidean distance is not the best distance metric; we need to account for the shape of the mass of data.

Hmmm, I wonder if the Mahalanobis distance would work?

Slide 46

D(X) = sqrt((X − μ)ᵀ S⁻¹ (X − μ)), where X = (X1, X2), μ = (μ1, μ2), and S is the covariance matrix of X1, X2.

Accounts for the variance of each variable and their covariance.

Essentially transforms the data into standardized, uncorrelated data and computes ordinary Euclidean distances on the transformed data.

Slide 47

Slide 48

Cool, so what's an outlier?

The squared Mahalanobis distance of multivariate normal data obeys a chi-square distribution with p degrees of freedom, where p is the number of variables (Rick Wicklin, SAS).

Cut-off = sqrt(χ²(0.975, p)):

2 variables → 2.72

10 variables → 4.53
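The distance and the cut-off can be sketched together in NumPy (the data and test points are my own; the 2.72 cut-off is the slide's sqrt(chi-square(0.975, 2)) for p = 2 variables):

```python
import numpy as np

def mahalanobis(points, data):
    """Distance of each point from the data's centroid, accounting
    for the variance and covariance of the variables."""
    mu = data.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = points - mu
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, S_inv, diff))

rng = np.random.default_rng(0)
# two correlated variables
z = rng.normal(size=(500, 2))
data = z @ np.array([[1.0, 0.8], [0.0, 0.6]])

cutoff = 2.72   # slide's value: sqrt(chi-square(0.975, 2))
d = mahalanobis(np.array([[0.0, 0.0], [4.0, -4.0]]), data)
print(d > cutoff)  # the centroid point is inside; the far point is flagged
```

The point (4, −4) lies against the grain of the correlation, so its Mahalanobis distance is large even though its Euclidean distance alone would not look extreme.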

Slide 49

    (Rick Wicklin, SAS)

Slide 50

Slide 51

Topics (recap): Starting, Over-Fit, Amplify, Noise, Distributions, Transformations, Baseline, Confidence, Mahalanobis, Hidden/Rare