Page 1: Quantile Regression vs Winsorization as methods for dealing with outlier prone processes Savannah Guo, Blair Marquardt, Tim Thorley 3/12/15

Quantile Regression vs Winsorization as methods for dealing with outlier prone processes

Savannah Guo, Blair Marquardt, Tim Thorley
3/12/15

Outliers – What are they?

1. A data point that lies outside the range of the rest of our data.

2. An observation that is distant from other observations.

***Both definitions describe outliers adequately, but they are too vague to support critical decisions about your data.

Outliers – How do they occur?

Measurement error
- Examine or “scrub” your data for typical errors
- You may never catch all errors
- Some errors aren’t out on the extremes of x and y

Outliers – How do they occur?

Extreme, but non-error, values
- Extreme, but accurate, data may tell an important story
- Perhaps this data was generated from a different process, or you are being alerted to critical omitted variables

Outliers – How do they affect data analysis?

Different types of outliers will have different impacts on data analysis.

Some observations that don’t appear to be outliers may still be influential.
- Influential observations are data points that have a large impact on the calculated values of various estimates (e.g., mean, regression coefficients, standard errors, etc.).

Outliers and Leverage (Blatna, 2006)

Outliers on Y (Response, DV) variable: Outliers
- May represent model failure (a different process)

Outliers on X (Predictor, IV) variable: Leverage Points
- Good Leverage Points
- Bad Leverage Points
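
The leverage idea can be sketched in Python with NumPy. This is an illustration only: the data, the 2p/n rule-of-thumb cutoff, and all names are assumptions, not taken from the slides or from Blatna (2006). It treats a "good" leverage point as one with an extreme x that still follows the line, and a "bad" one as an extreme x whose y is far from the line.

```python
import numpy as np

# Hypothetical data: ten ordinary x values plus one far-out x value.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 30.0])
X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept

# Hat matrix H = X (X'X)^{-1} X'; its diagonal holds the leverage values.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# A common rule of thumb flags points with leverage above 2p/n.
n, p = X.shape
cutoff = 2 * p / n
high_leverage = np.where(leverage > cutoff)[0]

def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    return np.polyfit(x, y, 1)[0]

# Good leverage point: the extreme x still follows the line y = 2 + 3x.
y_good = 2.0 + 3.0 * x
# Bad leverage point: the extreme x has a y far from that line.
y_bad = y_good.copy()
y_bad[-1] = 10.0

slope_good = ols_slope(x, y_good)
slope_bad = ols_slope(x, y_bad)
print("high-leverage indices:", high_leverage)
print("slope with good leverage point:", round(slope_good, 3))
print("slope with bad leverage point:", round(slope_bad, 3))
```

Only the last observation is flagged. When it sits on the line, the fitted slope is unchanged; when its y is moved off the line, the slope is pulled strongly toward it.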

Outlier examples in R

***The code used in the outlier examples comes from Dr. Westfall’s Regression class webpage, including last year’s (2014) material.

Quantile Wage Example

First example on 3/12 in course webpage.

Quantile regression
- A regression function for each quantile (decile, quartile, quintile, etc.)
- Compare OLS & quantile regressions:
  - OLS: estimates the mean of the distribution of Y conditional on X
  - Quantile: estimates the quantiles of the distribution of Y conditional on X

Conditional quantile function
- Conditional distributions p(y|x)
- Parameters are estimated separately for each identified quantile (median, quartile, decile, etc.) of Y
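
A rough way to see why separate quantiles of p(y|x) are informative is to simulate a process whose spread grows with x and compare empirical quantiles of Y for small and large x. The sketch below is an assumption-laden illustration: the data generating process, the bin boundaries, and the chosen percentiles are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x = rng.uniform(1, 10, n)
# Hypothetical heteroscedastic process: the noise scale grows with x.
y = 1.0 + 2.0 * x + x * rng.normal(0, 1, n)

taus = [10, 50, 90]          # percentiles to compare
low = y[x < 3]               # observations with small x
high = y[x > 8]              # observations with large x

q_low = np.percentile(low, taus)
q_high = np.percentile(high, taus)

# Distance between the 10th and 90th percentiles in each group.
spread_low = q_low[2] - q_low[0]
spread_high = q_high[2] - q_high[0]

print("quantiles of y for small x:", np.round(q_low, 2))
print("quantiles of y for large x:", np.round(q_high, 2))
print("10-90 spread:", round(spread_low, 2), "vs", round(spread_high, 2))
```

The median shifts with x, but the outer quantiles move apart much faster, which a single conditional-mean line cannot show.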

Optimization
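
For reference, quantile regression is usually posed as the minimization problem below, following the standard form in Koenker and Hallock (2001). The symbols τ (the quantile level), β (the coefficient vector), and ρ_τ (the check loss) are assumed notation and may differ from the slide's own.

```latex
\hat{\beta}(\tau) = \arg\min_{\beta} \sum_{i=1}^{n} \rho_\tau\!\left(y_i - x_i^\top \beta\right),
\qquad
\rho_\tau(u) = u\,\bigl(\tau - \mathbf{1}\{u < 0\}\bigr)
```

At τ = 0.5 the loss is proportional to the absolute residual, so the fit is median (least absolute deviations) regression.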

Optimization (cont’d)
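
As a minimal sketch of the optimization idea, the Python code below minimizes the check (pinball) loss over a single constant rather than a regression line. The sample and the function names `check_loss` and `best_constant` are hypothetical; the point is only that the minimizer is a sample quantile, which is far less sensitive to an extreme value than the mean.

```python
import numpy as np

def check_loss(u, tau):
    """Check (pinball) loss: u * (tau - 1{u < 0})."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0))

def best_constant(y, tau):
    """Return the data value that minimizes the total check loss."""
    y = np.asarray(y, dtype=float)
    losses = [check_loss(y - c, tau).sum() for c in y]
    return y[int(np.argmin(losses))]

# Hypothetical sample with one extreme value.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median_fit = best_constant(y, 0.5)   # minimizer at tau = 0.5
upper_fit = best_constant(y, 0.9)    # minimizer at tau = 0.9
mean_fit = y.mean()                  # least-squares minimizer

print(median_fit, upper_fit, mean_fit)
```

With this sample the τ = 0.5 minimizer is the middle observation, while the mean is dragged toward the extreme value.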

Strengths
- Complete picture of the data generating process.
- “Outliers” (birth-weight example in the reading material).
- Systematic difference in the process (i.e., heteroscedasticity) (income example).

Note: requires the dependent variable to be continuous.

An accounting example (Armstrong et al. 2014)

Relation between the board’s financial knowledge and level of tax avoidance.

Prior studies show mixed results using OLS. This paper argues that it is important to examine the tails of the tax avoidance distribution.

Findings (next slide)

An accounting example (cont’d)

Other examples
- Peak energy use example often used by Dr. Westfall
- Household income example from the reading material (R code from Koenker 2012)

Winsorization

The transformation of statistics by limiting or replacing extreme values.

Developed by Charles P. Winsor, 1895-1951.

Used heavily in accounting and finance research.

Winsorization – How does it work?

Replaces data points that have extreme residuals with the values at chosen percentiles (e.g., 1% and 99%, or 5% and 95%).

Because it uses quantiles, the winsorized variable must be continuous (not a dummy variable).

Beneficial when the outliers represent a different process than that of interest in our study.

Note that this is not the same as truncation: winsorization replaces extreme values, whereas truncation drops them.
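
The difference between winsorizing and truncating can be sketched in a few lines of Python with NumPy. This is an illustrative assumption, not the presenters' R code: it winsorizes a variable directly at its own percentiles (the common practice) rather than by residuals, and the helper names `winsorize` and `truncate` and the sample data are hypothetical.

```python
import numpy as np

def winsorize(values, lower_pct=1.0, upper_pct=99.0):
    """Replace values outside the given percentiles with those percentiles."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

def truncate(values, lower_pct=1.0, upper_pct=99.0):
    """Drop values outside the given percentiles instead of replacing them."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return values[(values >= lo) & (values <= hi)]

# Hypothetical sample: the values 1..99 plus one extreme value.
data = np.append(np.arange(1.0, 100.0), 1000.0)

w = winsorize(data, 5, 95)   # same length, extremes pulled in
t = truncate(data, 5, 95)    # shorter, extremes removed

print("n original, winsorized, truncated:", len(data), len(w), len(t))
print("max original vs winsorized:", data.max(), w.max())
```

Winsorizing keeps the sample size and caps the extreme value at the 95th percentile; truncating shrinks the sample.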

Limitations

Changes the data we analyze.
- Not appropriate to extrapolate or develop prediction intervals.

Originally designed to replace data points based on the residuals.
- Often, researchers winsorize all or certain independent and dependent variables, even if the winsorized data point is close to the fitted value.

Reduces variability in the data.
- Biases standard errors downward.
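
The last limitation is easy to check numerically. In the hedged sketch below, a hypothetical heavy-tailed sample is winsorized at the 1st and 99th percentiles and the usual standard error of the mean is computed before and after; the distribution, sample size, and cutoffs are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical heavy-tailed sample (Student t with 3 degrees of freedom).
y = rng.standard_t(df=3, size=500)

# Winsorize at the 1st and 99th percentiles.
lo, hi = np.percentile(y, [1, 99])
y_w = np.clip(y, lo, hi)

def std_error(v):
    """Usual standard error of the sample mean."""
    return v.std(ddof=1) / np.sqrt(len(v))

se_raw = std_error(y)
se_w = std_error(y_w)

print("standard error before winsorizing:", round(se_raw, 4))
print("standard error after winsorizing: ", round(se_w, 4))
```

Because clipping can only pull values closer together, the winsorized standard error is never larger than the original, so it understates the uncertainty in the process that actually generated the data.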

Demonstrations

Winsorization in R
- Peak energy data from Dr. Westfall

Summary demonstrations comparing OLS, quantile regression, and winsorized OLS.
- Example 1: Heteroscedastic model
- Example 2: Simulation of a known process with emphasis on outliers
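
A comparison in the spirit of the second example can be sketched in Python. This is not the presenters' R demonstration: the simulated process, the contamination scheme, the 5%/95% winsorizing of Y only, and the helper names are all assumptions, and median regression is approximated with iteratively reweighted least squares rather than a dedicated quantile regression solver.

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares coefficients (intercept, slope)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def median_reg_irls(X, y, n_iter=200, eps=1e-6):
    """Approximate median (tau = 0.5) regression by iteratively
    reweighted least squares; a stand-in for a dedicated solver."""
    beta = ols(X, y)
    for _ in range(n_iter):
        resid = y - X @ beta
        # Weighting squared residuals by 1/|r| mimics the absolute loss.
        sw = np.sqrt(1.0 / np.maximum(np.abs(resid), eps))
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta

def winsorize(v, lower_pct=5.0, upper_pct=95.0):
    """Clip values to the given percentiles."""
    lo, hi = np.percentile(v, [lower_pct, upper_pct])
    return np.clip(v, lo, hi)

rng = np.random.default_rng(42)
n = 400
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)   # known process: slope 2

# Contaminate the 20 observations with the largest x (5% of the sample).
worst = np.argsort(x)[-20:]
y[worst] += 50.0

X = np.column_stack([np.ones(n), x])
b_ols = ols(X, y)
b_med = median_reg_irls(X, y)
b_win = ols(X, winsorize(y))              # winsorize Y only, then OLS

print("true slope: 2.0")
print("OLS slope:            ", round(b_ols[1], 3))
print("median-regression slope:", round(b_med[1], 3))
print("winsorized OLS slope: ", round(b_win[1], 3))
```

In this setup the contaminated points drag the OLS slope well away from 2, while both the median fit and the winsorized fit stay much closer. Winsorizing does well here partly because the cutoff happens to match the contamination rate, which would not be known in practice.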

References

Armstrong, C. S., Blouin, J. L., Jagolinzer, A. D., & Larcker, D. F. (2014). Corporate Governance, Incentives, and Tax Avoidance. Working Paper.
Blatna, D. (2006-03). Outliers in regression. Trutnov, Vol. 30.
Ghosh, D., & Vogt, A. (2012). Outliers: An evaluation of methodologies. In Joint Statistical Meetings, 3455-3460.
Koenker, R. (2012). Quantile Regression in R: A Vignette. Working Paper.
Koenker, R., & Hallock, K. F. (Fall 2001). Quantile Regression. Journal of Economic Perspectives, Vol. 15, No. 4, 143-156.
Kriegel, H., Kroger, P., & Zimek, A. (2010). Outlier detection techniques. Retrieved from http://www.dbs.ifi.lmu.de/~zimek/publications/KDD2010/kdd10-outlier-tutorial.pdf
Leone, A. J., Minutti-Meza, M., & Wasley, C. (2013, August). Influential Observations and Inference in Accounting Research. Working Paper. Retrieved from https://accounting.wharton.upenn.edu/acct/assets/File/Influential%20Observations%20and%20Inference%20in%20Accounting%20Research.pdf
Logan, J., & Petscher, Y. (2013, May 23). An Introduction to Quantile Regression. Retrieved from Modern Modeling Methods Conference Presentation: http://www.modeling.uconn.edu/
Mosteller, F., & Tukey, J. (1977). Data Analysis and Regression: A Second Course in Statistics. Reading, MA: Addison-Wesley.
Outlier. (2015, February 11). Retrieved from Wikipedia: http://en.wikipedia.org/wiki/Outlier
Quantile Regression. (2015, February 16). Retrieved from Wikipedia: http://en.wikipedia.org/wiki/Quantile_regression
Shaw-Allen, P., Suter II, G., Cormier, S., & Yuan, L. (2012, July 31). Quantile Regression: Details. Retrieved from US Environmental Protection Agency: http://www.epa.gov/caddis/da_basic_3.details.html
Winsorising. (2015, January 31). Retrieved from Wikipedia: http://en.wikipedia.org/wiki/Winsorising
Yale, C., & Forsythe, A. B. (August 1976). Winsorized Regression. Technometrics, Vol. 18, No. 3, 291-300.