data meetup presentation
TRANSCRIPT
-
7/27/2019 Data Meetup Presentation
1/51
Data Science DC June 28, 2012
Gerhard Pilcher [email protected]
[Slide graphic: a branching example of questions with weights — Useful? Material? Attention? Work? Other? Divorce? New car? Affordable? Sell motorcycle? Presence? 0.75]
[Roadmap diagram: Starting, Over-fit, Amplify, Noise, Distributions, Transformations, Baseline, Confidence, Mahalanobis, Hidden/Rare]
Rare events: possibly fewer than 0.1% of observations
A model with 99% accuracy can still miss 9 of 10 events
Signal is diffuse, patterns are faint
Data sets are typically large and messy
Rigorous process methods are needed
Fraud is a stealthy case of human response
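The accuracy trap above is simple arithmetic; a small sketch (illustrative numbers, not from the talk) makes it concrete:

```python
# With a 0.1% event rate, a model that flags nothing is 99.9% "accurate".
total = 1_000_000
events = 1_000          # 0.1% event rate

# Trivial model: predict "no event" for everything.
correct = total - events
accuracy = correct / total
print(f"accuracy = {accuracy:.1%}")   # 99.9%, yet every event is missed

# A 99%-accurate model can likewise catch only 1 in 10 events:
caught = 100            # 10% of the 1,000 events
missed = events - caught
print(f"missed {missed} of {events} events")
```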
Getting Started
Time
Amplify the response
Noise: variable selection / reduction
Avoiding over-fit: ensembles / bundling
Establishing a baseline: target shuffling
Gaining confidence: cross-validation / bundling
Create a variable to retain original ordering
Separate out-of-sample data (test set) first!
Don't split related observations across sets
Stratification is a safer strategy
Avoid variable selection before the split
Watch for a temporal component
Create random "canary in the coal mine" variables
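The "canary" idea can be sketched in a few lines (an illustrative version, not the talk's code, with a plain correlation standing in for any variable-importance score): add pure-noise columns before variable selection; if a canary ranks as important, the selection is fitting noise.

```python
import random

random.seed(7)

n_rows = 200
data = {
    "signal": [i / n_rows for i in range(n_rows)],          # a real predictor
    "canary_1": [random.random() for _ in range(n_rows)],   # pure noise
    "canary_2": [random.random() for _ in range(n_rows)],   # pure noise
}
# The response depends only on the real predictor (plus a little noise).
y = [x + random.gauss(0, 0.05) for x in data["signal"]]

def abs_correlation(xs, ys):
    """Plain Pearson correlation, absolute value."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return abs(sxy / (sxx * syy) ** 0.5)

scores = {name: abs_correlation(col, y) for name, col in data.items()}
# Any variable scoring at or below the best canary is suspect.
print(scores)
```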
Modify the data
Up-sampling: duplicate rare cases
Down-sampling: use all rare cases, sample the rest
Modify the response
Cost-sensitive learning: profit or cost matrix
Essentially weights responses based on the cost or profit of each level
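A minimal down-sampling sketch (illustrative data and target fraction, not from the talk): keep every rare case and draw a random sample of the common cases to hit a chosen event fraction.

```python
import random

random.seed(42)

rows = [{"id": i, "event": i % 100 == 0} for i in range(10_000)]  # 1% events
rare = [r for r in rows if r["event"]]
common = [r for r in rows if not r["event"]]

target_frac = 0.20  # desired event fraction in the training sample
n_common = int(len(rare) * (1 - target_frac) / target_frac)
sample = rare + random.sample(common, n_common)

event_frac = sum(r["event"] for r in sample) / len(sample)
print(len(sample), round(event_frac, 2))  # 500 rows, 20% events
```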
Modify the data: correcting predicted probabilities
After sampling, a model score must be mapped back to the original event rate. With original event fraction o, sampled event fraction s, and model score p̂:

p = (p̂ · o/s) / (p̂ · o/s + (1 − p̂) · (1 − o)/(1 − s))
Modify the data: Orig Frac = 0.05, Sample Frac = 0.2, Score = 0.7

p = (0.7 · 0.05/0.2) / (0.7 · 0.05/0.2 + 0.3 · 0.95/0.8) = 0.175 / 0.53125 ≈ 0.33
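The correction is easy to wrap in a function and check against the slide's numbers (the function name is mine):

```python
def correct_score(score, orig_frac, sample_frac):
    """Map a model score from the down-sampled data back to the
    original event rate."""
    num = score * orig_frac / sample_frac
    den = num + (1 - score) * (1 - orig_frac) / (1 - sample_frac)
    return num / den

# The slide's example: Orig Frac = 0.05, Sample Frac = 0.2, Score = 0.7
p = correct_score(0.7, 0.05, 0.2)
print(round(p, 2))  # 0.33
```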
Lots of variables
Sparseness
Multicollinearity: principal components
Confounding
Interactions
Variable selection
CART / Random Forest (Breiman, 2001)
Texas 2-Step (Elder, 1994)
Complexity: a war between accuracy and loss of generality on out-of-sample data
Non-linear complexity: Generalized Degrees of Freedom (Ye, 1998)
Add random noise to the response
Measure changes to the estimates: higher adaptation = higher complexity
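The perturb-and-measure idea can be sketched as a Monte-Carlo estimate of Ye's GDF (an illustrative pure-Python version, using a deliberately trivial "fit the mean" model whose true GDF is 1; all names are mine):

```python
import random

random.seed(0)

def fit_mean(y):
    """A deliberately simple 'model': fit every point with the sample mean."""
    m = sum(y) / len(y)
    return [m] * len(y)

def gdf(fit, y, sigma=0.1, reps=200):
    """Estimate Generalized Degrees of Freedom: perturb the response with
    small noise, refit, and measure how much the fit chases the noise."""
    n = len(y)
    base = fit(y)
    total = 0.0
    for _ in range(reps):
        delta = [random.gauss(0, sigma) for _ in range(n)]
        fitted = fit([a + d for a, d in zip(y, delta)])
        # Sensitivity: covariance between perturbation and change in fit.
        total += sum(d * (f - b) for d, f, b in zip(delta, fitted, base)) / sigma**2
    return total / reps

y = [random.gauss(0, 1) for _ in range(50)]
g = gdf(fit_mean, y)
print(round(g, 1))  # ≈ 1.0 for the sample mean
```

A model that adapts more to the injected noise (a deep tree, say) would return a larger GDF, which is the complexity measure the slide describes.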
A case for bundling / ensembles
Add noise and take a sample
Measured complexity is lower for bagged trees!
Panels: 5 bagged trees with 3 splits; 5 bagged trees with 7 splits (over-fitting)
Target Shuffling
1. Break the link between the target, Y, and the features, X, by shuffling Y to form Ys.
2. Model the new Ys ~ f(X).
3. Note the quality of the resulting (random) model.
4. Repeat to build a distribution (of random models).
5. Measure true model performance against this distribution. The mean (or best) shuffled model can be the baseline for comparison.
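The five steps above can be sketched directly (illustrative data, with |correlation| standing in for any model-quality measure):

```python
import random

random.seed(1)

n = 100
x = [random.random() for _ in range(n)]
y = [xi + random.gauss(0, 0.3) for xi in x]  # a real X-Y link

def abs_corr(xs, ys):
    """Absolute Pearson correlation as a simple model-quality score."""
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return abs(sxy / (sxx * syy) ** 0.5)

true_score = abs_corr(x, y)

# Steps 1-4: shuffle Y repeatedly; each shuffle breaks the X-Y link.
shuffled_scores = []
ys = y[:]
for _ in range(500):
    random.shuffle(ys)
    shuffled_scores.append(abs_corr(x, ys))

# Step 5: fraction of random models at least as good as the real one.
p_value = sum(s >= true_score for s in shuffled_scores) / len(shuffled_scores)
print(round(true_score, 2), round(p_value, 3))
```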
Target Shuffling Single Test
Target Shuffling 7% of the data (in each of 5 tests)
Target Shuffling 100% of the data (in each of 5 tests)
Cut-off Decisions
Workload

                Actual F    Actual T
Predicted F      800,000         100
Predicted T       20,000         900
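Reading the slide's counts into the usual metrics shows the workload trade-off at this cut-off:

```python
# Counts from the slide's confusion matrix:
#                 actual F   actual T
# predicted F      800,000        100
# predicted T       20,000        900
tn, fn = 800_000, 100
fp, tp = 20_000, 900

workload = tp + fp                       # cases flagged for investigation
recall = tp / (tp + fn)                  # fraction of events caught
precision = tp / workload                # hit rate among flagged cases
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(workload, recall, round(precision, 3), round(accuracy, 3))
# 20900 cases to investigate, 90% of events caught, ~4% hit rate
```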
Cut-off Decisions
Cut-off Decisions

                Actual OK   Actual BAD
Predicted OK        1540          246
Predicted BAD         49          150

                Actual OK   Actual BAD
Predicted OK         846           47
Predicted BAD        743          349
Cut-off Decisions
[Gains chart, both axes 0%–100%. Two uses: set an investigation limit and note the expected response, or set a desired response and note the work requirements.]
V-fold Cross Validation
Train V models on different (overlapping) data subsets
Test each on its unseen data
Use the distribution of test results to score the procedure realistically
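A minimal sketch of the V-fold split itself (illustrative; real libraries such as scikit-learn provide this, but the mechanics fit in a few lines):

```python
def vfold_indices(n, v):
    """Yield (train_idx, test_idx) pairs for V-fold cross validation:
    each observation lands in exactly one test fold, and each model
    trains on the other V-1 folds."""
    folds = [list(range(i, n, v)) for i in range(v)]
    for i in range(v):
        test = folds[i]
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test

n, v = 12, 4
splits = list(vfold_indices(n, v))
print(len(splits))                        # 4 folds
all_test = sorted(j for _, test in splits for j in test)
print(all_test == list(range(n)))         # every point tested exactly once
```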
4-fold Cross Validation
Ensembles/Bundling (J. Elder with S. Lee, U. Idaho, 1997)
Ensembles/Bundling (J. Elder with S. Lee, U. Idaho, 1997)
Ensembles/Bundling (how many?)
[Roadmap diagram, revisited: Starting, Over-fit, Amplify, Noise, Distributions, Transformations, Baseline, Confidence, Mahalanobis, Hidden/Rare]
No or few known events
Subject matter expertise is essential
Defining anomalies
Data quality and matching
Sparse data
Measuring goodness of the model
Time
Transformations
Other distributions
Clustering
Mahalanobis distance measure
Time
Transformations
Functions (log, square root, power, multiplicative inverse)
Normalization to (0, 1): watch for outliers
Standardization: is the data Gaussian?
Categorical dummy variables
Variance stabilizing: Fisher, Box-Cox, Anscombe
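The first two rescalings can be sketched side by side (illustrative data): min-max normalization to (0, 1) versus standardization to mean 0, and note how a single outlier squeezes the normalized values together.

```python
values = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 is an outlier

# Min-max normalization to (0, 1).
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization: subtract the mean, divide by the standard deviation.
n = len(values)
mean = sum(values) / n
std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
standardized = [(v - mean) / std for v in values]

print([round(v, 3) for v in normalized])   # first four crushed near 0
print(round(sum(standardized), 10))        # mean 0 by construction
```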
http://itl.nist.gov/div898/handbook/eda/section3/eda366.htm
Euclidean distance: ||A − B||
Sensitive to scale
Blind to covariance
Mahalanobis distance:
Assumes normally distributed, continuous variables
If covariance = 0, it reduces to (scaled) Euclidean distance
The following explanation is shamelessly borrowed from smart people like:
Antonia de Medinaceli and Kenny Darrell, Elder Research
Will Dwinnell, Blogger, MATLAB user, Data Miner / Statistician: http://matlabdatamining.blogspot.com/2006/11/mahalanobis-distance.html
Rick Wicklin, SAS: http://blogs.sas.com/content/iml/author/rickwicklin/
Is this point an outlier?
Is this point an outlier?
Is this point an outlier?
Maps to the same point in 2D
Euclidean distance from centroid
So Euclidean distance is not the best distance metric here.
We need to account for the shape of the mass of data.
Hmm, I wonder if the Mahalanobis distance would work?
D²(X) = (X − μ)ᵀ S⁻¹ (X − μ), where X = (X₁, X₂), μ = (μ₁, μ₂), and S is the covariance matrix of X₁, X₂.
Accounts for the variance of each variable and for their covariance.
Essentially transforms the data into standardized, uncorrelated data and computes ordinary Euclidean distances on the transformed data.
Cool, so what's an outlier?
The squared Mahalanobis distance of multivariate normal data obeys a chi-square distribution with p degrees of freedom, where p is the number of variables (Rick Wicklin, SAS).
Cutoff: √χ²(0.975, p)
2 variables → 2.72
10 variables → 4.53
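Putting the formula and the cutoff together in two dimensions (illustrative data and covariance; the 2.72 is the slide's √χ²(0.975, 2)):

```python
def mahalanobis_2d(x, mu, s):
    """Mahalanobis distance of point x from mean mu, with covariance
    matrix s given as [[s11, s12], [s12, s22]]."""
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    s11, s12, s22 = s[0][0], s[0][1], s[1][1]
    det = s11 * s22 - s12 * s12
    # Inverse of a symmetric 2x2 matrix, applied to (dx, dy).
    d2 = (s22 * dx * dx - 2 * s12 * dx * dy + s11 * dy * dy) / det
    return d2 ** 0.5

mu = (0.0, 0.0)
s = [[4.0, 3.0], [3.0, 4.0]]   # strongly correlated variables

cutoff = 2.72  # sqrt(chi2.ppf(0.975, df=2))
on_axis = (2.0, 2.0)     # lies along the correlation direction
off_axis = (2.0, -2.0)   # same Euclidean distance, against the correlation

for point in (on_axis, off_axis):
    d = mahalanobis_2d(point, mu, s)
    print(point, round(d, 2), "outlier" if d > cutoff else "ok")
```

Both points sit at the same Euclidean distance from the centroid, but only the one fighting the covariance structure crosses the chi-square cutoff, which is exactly the behavior the earlier "is this point an outlier?" slides illustrate.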
(Rick Wicklin, SAS)
[Closing roadmap recap: Starting, Over-fit, Amplify, Noise, Distributions, Transformations, Baseline, Confidence, Mahalanobis, Hidden/Rare]