PSY 1950: Outliers, Missing Data, and Transformations (September 22, 2008)
![Page 1: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/1.jpg)
PSY 1950
Outliers, Missing Data, and Transformations
September 22, 2008
![Page 2: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/2.jpg)
On Suspecting Fishiness
• Looking for outliers, gaps, and dips
  – e.g., tests of clairvoyance
• When gaps or dips are hypothesized
  – e.g., is dyslexia a distinct entity?
• Cliffs
  – e.g., differences between ratings of ingroup and outgroup
• Peaks
  – e.g., the blackout and baby boom
• The occurrence of impossible scores
![Page 3: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/3.jpg)
Visualize your data!
• “make friends with your data”
  – Rosenthal
• “don’t become lovers with your data”
  – Me
• Statistics condense data
• View raw data graphically
  – Frequency distribution graphs
  – Scatter plots
![Page 4: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/4.jpg)
![Page 5: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/5.jpg)
Outliers
• Extreme scores
• Come from samples other than those of interest
• Can lead to Type I and II errors
![Page 6: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/6.jpg)
Outlier Detection
• Graph
  – Box plots
  – Scatter plots
• Numerical criterion
  – Extremity (central tendency +/- spread)
    • Outside fences
      – lower: Q1 - 3(Q3 - Q1)
      – upper: Q3 + 3(Q3 - Q1)
    • z-score
  – Probability (Extremity + # measurements)
    • Chauvenet’s/Peirce’s criterion, Grubbs’ test
  – Absolute cutoff
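The extremity criteria above can be sketched in a few lines of Python. This is only an illustration, not the course's own procedure: the reaction times are made up, the function names are hypothetical, and the quartiles use the "inclusive" interpolation rule of `statistics.quantiles` (other quartile rules shift the fences slightly).

```python
import statistics


def outer_fences(scores, k=3.0):
    """Return (lower, upper) fences: Q1 - k*(Q3 - Q1) and Q3 + k*(Q3 - Q1)."""
    # Assumption: "inclusive" quartile interpolation; the slide does not say which rule to use.
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def fence_outliers(scores, k=3.0):
    """Scores that fall outside the fences."""
    lower, upper = outer_fences(scores, k)
    return [x for x in scores if x < lower or x > upper]


def z_outliers(scores, cutoff=3.0):
    """Scores whose absolute z-score (using the sample SD) exceeds the cutoff."""
    m = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return [x for x in scores if abs(x - m) / sd > cutoff]


# Hypothetical reaction times in ms, with one suspicious value.
rts = [400, 420, 430, 450, 460, 480, 500, 2000]

print(outer_fences(rts))
print(fence_outliers(rts))
print(z_outliers(rts, cutoff=3.0))
print(z_outliers(rts, cutoff=2.0))
```

With these eight scores the fences flag 2000 ms, but a z-score cutoff of 3 flags nothing: the outlier inflates the SD it is judged against, and with n = 8 no score can reach |z| = 3. A cutoff of 2 does catch it.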
![Page 7: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/7.jpg)
Outlier Analysis
• Determine nature of impact
  – Quantitative
    • Changes numbers, not inferences
  – Qualitative
    • Changes numbers and inferences
• Consider source of outlier
  – Quantitative
    • Same underlying mechanism/sample
  – Qualitative
    • Different underlying mechanisms/samples
      – e.g., digit span = 107, simple RT = 1200 ms
![Page 8: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/8.jpg)
Outlier Coping
• Options
  – Retain
  – Remove
  – Reduce
    • Winsorize
    • Normalizing transformation
• Considerations
  – Impact/Source
  – Convention
  – Believability
    • Justification
    • Replication
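As a rough illustration of the "reduce" option, the sketch below pulls extreme scores in to a boundary instead of deleting them. It assumes an SD-based boundary (mean ± 2 SD); Winsorizing is also often done at fixed percentiles, and the function name and data are hypothetical.

```python
import statistics


def winsorize_sd(scores, n_sd=2.0):
    """Replace scores beyond mean +/- n_sd sample SDs with the boundary value."""
    m = statistics.mean(scores)
    sd = statistics.stdev(scores)
    lower, upper = m - n_sd * sd, m + n_sd * sd
    # Clamp each score into [lower, upper]; scores already inside are unchanged.
    return [min(max(x, lower), upper) for x in scores]


# Hypothetical reaction times in ms.
rts = [400, 420, 430, 450, 460, 480, 500, 2000]
winsorized = winsorize_sd(rts)

print(winsorized)
print(statistics.mean(rts), statistics.mean(winsorized))
```

The sample size is preserved, which is the main difference from removal; the extreme score still pulls the mean upward, only less so.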
![Page 9: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/9.jpg)
Transformations
• Linear “rescaling”
  – unit conversion
    • e.g., # items correct, # items wrong
    • e.g., standardization
• Curvilinear “reexpression”
  – variable conversion
    • e.g., time (sec/trial) to speed (trials/sec)
    • e.g., normalization
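One way to see the difference between the two kinds of transformation is to compare z-scores before and after. In the sketch below (made-up numbers, hypothetical helper name), converting items correct to items wrong only flips the sign of each z-score, while converting time to speed changes the relative spacing of the scores.

```python
import statistics


def z_scores(xs):
    """Standardize scores using the sample SD."""
    m = statistics.mean(xs)
    sd = statistics.stdev(xs)
    return [(x - m) / sd for x in xs]


# Linear rescaling: items correct -> items wrong on a hypothetical 10-item test.
n_items = 10
correct = [4, 6, 7, 9, 10]
wrong = [n_items - c for c in correct]

# Curvilinear reexpression: time (sec/trial) -> speed (trials/sec).
secs_per_trial = [0.5, 1.0, 2.0, 4.0]
trials_per_sec = [1.0 / t for t in secs_per_trial]

print(z_scores(correct))
print(z_scores(wrong))           # same magnitudes, opposite signs
print(z_scores(secs_per_trial))
print(z_scores(trials_per_sec))  # not a mirror image of the time z-scores
```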
![Page 10: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/10.jpg)
Standardization
• Why standardize data?
  – Intra-distribution statistics
    • You got 8 questions wrong on one exam
    • You were one standard deviation below the mean
  – Inter-distribution statistics
    • You got 8 questions wrong on the midterm and 5 questions wrong on the final
    • Aggregation: Overall, you were one standard deviation below the mean
    • Comparison: You did better on the midterm than the final
![Page 11: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/11.jpg)
z-score
• # standard deviations above/below the mean
![Page 12: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/12.jpg)
![Page 13: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/13.jpg)
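Test Performance

| raw-score | z-score | IQ-scale |
|-----------|---------|----------|
| 20 | -2.48 | 75.2 |
| 25 | -1.01 | 89.9 |
| 25 | -1.01 | 89.9 |
| 26 | -0.71 | 92.9 |
| 27 | -0.42 | 95.8 |
| 27 | -0.42 | 95.8 |
| 27 | -0.42 | 95.8 |
| 28 | -0.12 | 98.8 |
| 28 | -0.12 | 98.8 |
| 29 | 0.17 | 101.7 |
| 30 | 0.47 | 104.7 |
| 31 | 0.76 | 107.6 |
| 31 | 0.76 | 107.6 |
| 31 | 0.76 | 107.6 |
| 32 | 1.06 | 110.6 |
| 33 | 1.35 | 113.5 |
| 33 | 1.35 | 113.5 |
| M = 28.41 | 0.00 | 100.0 |
| SD = 3.39 | 1.00 | 10.0 |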
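The z-score and rescaled columns of the Test Performance table can be reproduced from the raw scores. The sketch below assumes the sample SD (n - 1 in the denominator), which is what matches the tabled SD of 3.39, and a target scale with mean 100 and SD 10, as in the table's last two rows.

```python
import statistics

# Raw scores from the Test Performance table.
raw = [20, 25, 25, 26, 27, 27, 27, 28, 28, 29, 30, 31, 31, 31, 32, 33, 33]

m = statistics.mean(raw)
sd = statistics.stdev(raw)  # sample SD (n - 1)

# z-score: number of standard deviations above/below the mean.
z = [(x - m) / sd for x in raw]

# Linear rescaling to a new mean of 100 and SD of 10.
rescaled = [100 + 10 * zi for zi in z]

for x, zi, r in zip(raw, z, rescaled):
    print(f"{x:>3} {zi:6.2f} {r:6.1f}")
```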
![Page 14: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/14.jpg)
Normal Distributions
• “…normality is a myth; there never was, and never will be, a normal distribution.”
  – Geary (1947)
• “Experimentalists think that it is a mathematical theorem while the mathematicians believe it to be an experimental fact.”
  – Lippman (1917)
![Page 15: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/15.jpg)
Normalization
• Why normalize DV?
  – Meet statistical assumption of normality in situations when it matters
    • Small n
    • Unequal n
    • One-sample t and z tests
  – Increase power
• Why NOT normalize DV?
  – Interpretability
  – Affects measurement scale
![Page 16: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/16.jpg)
Tests of Normality
• Frequency distribution
• Skew/kurtosis statistics
• Kolmogorov-Smirnov test
• Probability plots (e.g., P-P plot)
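For the skew/kurtosis check, a minimal moment-based sketch is shown below. It uses the simple population-moment formulas; statistical packages typically apply small-sample corrections, so their values will differ somewhat. The function names and data are illustrative only.

```python
import statistics


def central_moment(xs, k):
    """k-th central moment, dividing by n."""
    m = statistics.mean(xs)
    return sum((x - m) ** k for x in xs) / len(xs)


def skewness(xs):
    """Third standardized moment; 0 for a symmetric distribution."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def excess_kurtosis(xs):
    """Fourth standardized moment minus 3; 0 for a normal distribution."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]
rts = [400, 420, 430, 450, 460, 480, 500, 2000]  # hypothetical, right-skewed

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(rts), excess_kurtosis(rts))
```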
![Page 17: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/17.jpg)
Types of Curvilinear Transformations
![Page 18: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/18.jpg)
Does normalization help?
• Games & Lucas (1966): Normalizing transformations hurt
  – Reduce interpretability, power
• Levine & Dunlap (1982): Transformations help
  – Increase power
• Games (1983): It depends, Levine and Dunlap are stupid
• Levine & Dunlap (1984): It depends, Games is stupid
• Games (1984): This debate is stupid
![Page 19: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/19.jpg)
Does non-normality hurt?
![Page 20: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/20.jpg)
Normalize If and Only If
• It matters
  – In theory: Got robust?
  – In practice: Got change?
• Must assume normality (i.e., no non-parametric test available)
![Page 21: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/21.jpg)
Missing Data
![Page 22: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/22.jpg)
Why are they missing?
– MCAR (missing completely at random)
  • Variable’s missingness unrelated to both its value and other variables’ values
  • e.g., equipment malfunction
  • No bias
– MAR (missing at random)
  • Variable’s missingness unrelated to its value after controlling for its relation to other variables
  • e.g., depression and income
  • Bias
– MNAR (missing not at random)
  • Variable’s missingness related to its value after controlling for its relation to other variables
  • e.g., income reporting
  • Bias
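A small simulation can make the bias claims concrete. The sketch below is an illustration under made-up assumptions: incomes are drawn from a normal distribution with arbitrary parameters, MCAR drops a random 20% of values, and MNAR drops the top 20% (high earners never report). MAR is left out because it needs a second, observed variable.

```python
import random
import statistics

rng = random.Random(1950)  # fixed seed, chosen arbitrarily

# Hypothetical incomes (in thousands); distribution and parameters are assumptions.
income = [rng.gauss(50, 10) for _ in range(10_000)]

# MCAR: every value has the same 20% chance of going missing.
mcar = [x for x in income if rng.random() >= 0.20]

# MNAR: missingness depends on the value itself; the top 20% are never reported.
threshold = statistics.quantiles(income, n=5)[-1]  # 80th percentile
mnar = [x for x in income if x <= threshold]

print("full:", statistics.mean(income))
print("MCAR:", statistics.mean(mcar))  # close to the full-sample mean
print("MNAR:", statistics.mean(mnar))  # noticeably lower
```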
![Page 23: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/23.jpg)
Diagnosing Missing Data
• How much?
• How concentrated?
• How essential?
• MCAR, MAR, MNAR?
• How influential?
![Page 24: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/24.jpg)
Dealing with Missing Data
– Treat missing data as data
– Note bias
  • “lower income individuals are underrepresented”
– Delete variables
– Delete cases
  • Listwise
  • Casewise
– Estimation
  • Prior knowledge
  • Mean substitution
  • Regression substitution
  • Expectation-maximization (EM)
  • Hot decking
  • Multiple imputation (MI)
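Two of the simpler options, listwise deletion and mean substitution, can be sketched as follows. The cases are invented, and `None` is used here as the missing-value marker purely for illustration.

```python
import statistics

# Hypothetical cases: (age, income in thousands); None marks a missing income.
cases = [(25, 30.0), (32, None), (41, 52.0), (38, 47.0), (29, None), (50, 71.0)]

# Listwise deletion: drop any case with a missing value on any variable.
complete = [case for case in cases if None not in case]

# Mean substitution: fill each missing income with the mean of the observed incomes.
observed = [inc for _, inc in cases if inc is not None]
mean_income = statistics.mean(observed)
imputed = [inc if inc is not None else mean_income for _, inc in cases]

print(len(cases), len(complete))
print(statistics.stdev(observed), statistics.stdev(imputed))
```

Mean substitution keeps every case and leaves the mean unchanged, but it shrinks the variance, because the filled-in values sit exactly at the mean.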
![Page 25: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/25.jpg)
Missing Data: Conclusions
• Avoid missing data!
• If rare (<5%), MCAR, nonessential, concentrated, or impotent, delete appropriately
• If frequent, patterned, essential, diffuse, influential, use MI
• If MNAR, treat missingness as DV
![Page 26: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/26.jpg)
• Question: What’s the best method for identifying and removing RT outliers?
• Alternatives
  – RT cutoff (5 values)
  – z-score cutoff (1, 1.5)
  – Transformation (log, inverse)
  – Trimming
  – Medians
  – Winsorizing (2 SD)
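Several of these alternatives are easy to compare on a single set of reaction times. The sketch below is only illustrative: the data, the 1500 ms cutoff, and the helper names are assumptions, not the values used in the study being summarized.

```python
import math
import statistics


def absolute_cutoff(rts, upper_ms):
    """Keep only RTs at or below a fixed cutoff."""
    return [rt for rt in rts if rt <= upper_ms]


def trimmed(rts, n_each_end=1):
    """Drop the n fastest and n slowest RTs."""
    ordered = sorted(rts)
    return ordered[n_each_end:len(ordered) - n_each_end]


def log_transform(rts):
    return [math.log(rt) for rt in rts]


def inverse_transform(rts):
    """Convert ms per response to responses per second."""
    return [1000.0 / rt for rt in rts]


# Hypothetical reaction times in ms.
rts = [400, 420, 430, 450, 460, 480, 500, 2000]

print("raw mean:       ", statistics.mean(rts))
print("cutoff 1500 ms: ", statistics.mean(absolute_cutoff(rts, 1500)))
print("trimmed mean:   ", statistics.mean(trimmed(rts)))
print("median:         ", statistics.median(rts))
# Means of transformed scores, converted back to ms for comparison.
print("log, back in ms:", math.exp(statistics.mean(log_transform(rts))))
print("inv, back in ms:", 1000.0 / statistics.mean(inverse_transform(rts)))
```

Every alternative gives a central value well below the raw mean, which the single 2000 ms score dominates.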
![Page 27: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/27.jpg)
Method
• Conduct series of simulations
  – DV: power (# sig simulations/1000)
• 2 x 2 ANOVA
  – One main effect (20, 30, 40 ms)
• 7 observations/condition
  – 10% outlier probability
  – Outliers 0-2000 ms
• 32 participants
• Between-participants variability
![Page 28: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/28.jpg)
Spread
Drift
ex-Gaussian distribution
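An ex-Gaussian variate is the sum of a normal and an exponential component, so it can be sampled directly from the standard library. The parameter values below (mu, sigma, tau, in ms) are arbitrary placeholders, not the ones used in the simulations.

```python
import random
import statistics


def ex_gaussian(rng, mu, sigma, tau):
    """One draw: normal(mu, sigma) plus exponential with mean tau."""
    return rng.gauss(mu, sigma) + rng.expovariate(1.0 / tau)


rng = random.Random(2008)  # fixed seed, chosen arbitrarily

# Hypothetical parameters in ms; the expected mean is mu + tau = 500.
sample = [ex_gaussian(rng, mu=400, sigma=40, tau=100) for _ in range(20_000)]

print(statistics.mean(sample))
print(statistics.median(sample))  # below the mean: the exponential tail skews right
```

In this parameterization, increasing mu moves the whole distribution, while increasing tau stretches the right tail.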
![Page 29: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/29.jpg)
![Page 30: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/30.jpg)
![Page 31: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/31.jpg)
![Page 32: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/32.jpg)
Inferences
• Absolute cutoffs resulted in greatest power
• Best cutoff values depended on type of effect
  – Shift: 10-15% cutoff
  – Spread: 5% cutoff
• Inverse transformation good, too
• With high between-participant variability, SD cutoff becomes effective
![Page 33: PSY 1950 Outliers, Missing Data, and Transformations September 22, 2008](https://reader036.vdocument.in/reader036/viewer/2022081512/56814a96550346895db7a336/html5/thumbnails/33.jpg)
Recommendations
• Try range of cutoffs to examine robustness
• Replicate with inverse transformation (or SD cutoff)
• Replicate novel, unexpected, or important effects
• Choose method before analyzing data
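The first recommendation can be sketched as a loop over candidate cutoffs, checking whether the effect survives each one. The two conditions, the cutoff values, and the helper names below are made up for illustration.

```python
import statistics

# Hypothetical RTs (ms) for two conditions; condition A contains one very slow trial.
cond_a = [420, 450, 480, 510, 1900]
cond_b = [450, 480, 510, 540, 560]


def mean_below(rts, cutoff):
    """Mean of the RTs at or below the cutoff (None means no cutoff)."""
    kept = rts if cutoff is None else [rt for rt in rts if rt <= cutoff]
    return statistics.mean(kept)


def effect(cutoff):
    """Condition B minus condition A after applying the same cutoff to both."""
    return mean_below(cond_b, cutoff) - mean_below(cond_a, cutoff)


for cutoff in (1000, 1500, 2000, None):
    print(cutoff, effect(cutoff))
```

Here the effect changes sign depending on whether the 1900 ms trial is kept, which is the kind of qualitative impact that a range of cutoffs is meant to expose.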