Quantile Regression vs Winsorization as Methods for Dealing with Outlier-Prone Processes
Savannah Guo, Blair Marquardt, Tim Thorley
3/12/15
Outliers – What are they?
1. A data point that lies outside the range of the rest of our data.
2. An observation that is distant from other observations.
***Both definitions describe outliers adequately, but they are too vague to guide critical decisions about your data.
Outliers – How do they occur?
- Measurement error
  - Examine or “scrub” your data for typical errors
  - You may never catch all errors
  - Some errors aren’t out on the extremes of x and y
Outliers – How do they occur?
- Extreme, but non-error, values
  - Extreme, but accurate, data may tell an important story
  - Perhaps this data was generated from a different process, or you are being alerted to critical omitted variables
Outliers – How do they affect data analysis?
Different types of outliers will have different impacts on data analysis.
Some observations that don’t appear to be outliers may still be influential.
1. Influential observations are data points that have a large impact on the calculated values of various estimates (e.g., mean, regression coefficients, standard errors).
Outliers and Leverage (Blatna, 2006)
- Outliers on the Y (response, DV) variable: outliers
  - May represent model failure – different process
- Outliers on the X (predictor, IV) variable: leverage points
  - Good leverage points
  - Bad leverage points
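To make the difference between good and bad leverage points concrete, here is a minimal Python sketch (the presentation's own examples are in R and are not reproduced here). The data and the helper names `leverage` and `ols_fit` are made up for illustration. It assumes simple linear regression with an intercept, where the hat value is h_i = 1/n + (x_i - mean(x))^2 / Sxx.

```python
def leverage(xs):
    """Hat values for simple linear regression with an intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1 / n + (x - mean_x) ** 2 / sxx for x in xs]


def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: five points on y = 2x plus one point far out on X.
xs = [1, 2, 3, 4, 5, 20]
ys_good = [2, 4, 6, 8, 10, 40]  # far out on X, but consistent with the line
ys_bad = [2, 4, 6, 8, 10, 5]    # far out on X and off the line

hat = leverage(xs)
_, slope_good = ols_fit(xs, ys_good)
_, slope_bad = ols_fit(xs, ys_bad)

print("leverage of last point:", round(hat[-1], 3))
print("slope with good leverage point:", round(slope_good, 3))
print("slope with bad leverage point:", round(slope_bad, 3))
```

The last point has the same leverage in both data sets because leverage depends only on X. Only when its Y value disagrees with the rest of the data does it drag the fitted slope away from 2. A common rule of thumb treats hat values above 2p/n (here 4/6) as high leverage.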
Outlier examples in R
***The code used in the outlier examples comes from Dr. Westfall’s Regression class webpage, including last year’s (2014) material.
Quantile Wage Example
First example on 3/12 in course webpage.
Quantile regression
- A regression function for each quantile (decile, quartile, quintile, etc.)
- Compare OLS & quantile regressions:
  - OLS: estimates the mean of the distribution of Y conditional on X
  - Quantile: estimates the quantiles of the distribution of Y conditional on X
Conditional quantile function
- Conditional distributions p(y|x)
- Parameters are estimated separately for each identified quantile (median, quartile, decile, etc.) of Y
Optimization
Optimization (cont’d)
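As a hedged illustration of the optimization behind quantile regression, the sketch below assumes the standard check-loss formulation from Koenker & Hallock (2001): the tau-th quantile minimizes the sum of rho_tau(u) = u * (tau - 1[u < 0]) over the residuals u. To stay short, it fits only a constant (no predictors), so the minimizer is a sample quantile. The function names and data are made up for illustration.

```python
def check_loss(u, tau):
    """Check (pinball) loss: tau * u for u >= 0, (tau - 1) * u for u < 0."""
    return u * (tau - (1 if u < 0 else 0))


def total_loss(ys, q, tau):
    """Sum of check losses when every observation is predicted by q."""
    return sum(check_loss(y - q, tau) for y in ys)


def best_constant(ys, tau):
    """Search the observed values for the constant with the smallest loss.

    The loss is piecewise linear and convex with kinks at the data points,
    so a minimizer can always be found among the observed values.
    """
    return min(sorted(set(ys)), key=lambda q: total_loss(ys, q, tau))


# Hypothetical sample with one extreme value.
ys = [1, 2, 3, 4, 100]

mean_y = sum(ys) / len(ys)
median_fit = best_constant(ys, tau=0.5)
lower_quartile_fit = best_constant(ys, tau=0.25)

print("mean:", mean_y)
print("tau = 0.50 fit:", median_fit)
print("tau = 0.25 fit:", lower_quartile_fit)
```

The mean, which minimizes squared error, is pulled toward the extreme value, while the tau = 0.5 fit stays at the middle of the bulk of the data. A full quantile regression replaces the constant q with a linear function of X and is usually solved by linear programming, which this sketch does not attempt.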
Strengths
- Complete picture of the data-generating process.
- “Outliers” (birth-weight example in the reading material).
- Systematic difference in the process (i.e., heteroscedasticity) (income example).
- Requires the dependent variable to be continuous.
An accounting example (Armstrong et al. 2014)
Relation between the board’s financial knowledge and level of tax avoidance.
Prior studies show mixed results using OLS. This paper proposes that it’s very important to examine the tails of the tax avoidance distribution.
Findings (next slide)
An accounting example (cont’d)
Other examples
- Peak energy use example often used by Dr. Westfall
- Household income example from the reading material (R code from Koenker 2012)
Winsorization
The transformation of statistics by limiting or replacing extreme values.
Developed by Charles P. Winsor (1895-1951).
Used heavily in accounting and finance research.
Winsorization – How does it work?
Replaces data points that have extreme residuals with the values at chosen percentile cutoffs (e.g., the 1st and 99th, or the 5th and 95th).
Because it uses quantiles, the winsorized variable must be continuous (not a dummy variable).
Beneficial when the outliers represent a different process than that of interest in our study.
Note that this is not the same as truncation: winsorization replaces extreme values, whereas truncation removes them.
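A minimal Python sketch of the contrast between winsorizing and truncating, under stated assumptions: it clips a single variable at its own 5th and 95th percentiles (the variable-level practice common in applied work, not the residual-based version), and it uses linear interpolation between order statistics for the percentiles. The function names and data are made up for illustration.

```python
def percentile(sorted_vals, p):
    """Percentile p (0-100) of a sorted list, by linear interpolation."""
    k = (len(sorted_vals) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


def winsorize(values, lower=5, upper=95):
    """Pull values beyond the cutoffs in to the cutoffs; keep every row."""
    s = sorted(values)
    lo, hi = percentile(s, lower), percentile(s, upper)
    return [min(max(v, lo), hi) for v in values]


def truncate(values, lower=5, upper=95):
    """Drop values beyond the cutoffs; the sample gets smaller."""
    s = sorted(values)
    lo, hi = percentile(s, lower), percentile(s, upper)
    return [v for v in values if lo <= v <= hi]


# Hypothetical data: 1..19 plus one extreme value.
data = list(range(1, 20)) + [500]

winsorized = winsorize(data)
truncated = truncate(data)

print("n original:  ", len(data))
print("n winsorized:", len(winsorized), "max:", max(winsorized))
print("n truncated: ", len(truncated), "max:", max(truncated))
```

Winsorizing keeps all 20 observations and moves the two tail values to the cutoffs, while truncating drops them and leaves 18 observations.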
Limitations
- Changes the data we analyze.
  - Not appropriate for extrapolation or for developing prediction intervals.
- Originally designed to replace data points based on the residuals.
  - Often, researchers winsorize all or certain independent and dependent variables, even if the winsorized data point is close to the fitted value.
- Reduces variability in the data.
  - Biases standard errors downward.
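The loss of variability can be seen in a small Python sketch. It assumes the count-based form of winsorizing (the k smallest and k largest values are replaced by their nearest retained neighbours) and uses the standard error of the mean as the simplest stand-in for a regression standard error. The data and the name `winsorize_k` are made up for illustration.

```python
import statistics


def winsorize_k(values, k=1):
    """Replace the k smallest and k largest values with their nearest
    retained neighbours (count-based winsorizing)."""
    s = sorted(values)
    lo, hi = s[k], s[-k - 1]
    return [min(max(v, lo), hi) for v in values]


# Hypothetical sample with one extreme but possibly genuine value.
data = [2.1, 2.4, 2.5, 2.7, 2.9, 3.0, 3.2, 3.3, 3.6, 9.8]
wins = winsorize_k(data, k=1)

sd_raw = statistics.stdev(data)
sd_win = statistics.stdev(wins)

# Standard error of the mean: sd / sqrt(n). n is unchanged by winsorizing.
se_raw = sd_raw / len(data) ** 0.5
se_win = sd_win / len(wins) ** 0.5

print("sd raw:", round(sd_raw, 3), "sd winsorized:", round(sd_win, 3))
print("se raw:", round(se_raw, 3), "se winsorized:", round(se_win, 3))
```

Because the sample size stays the same while the spread shrinks, the reported standard error falls mechanically. If the extreme value belongs to the process under study, that smaller standard error understates the real uncertainty.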
Demonstrations
- Winsorization in R
  - Peak energy data from Dr. Westfall
- Summary demonstrations comparing OLS, quantile regression, and winsorized OLS.
  - Example 1 – Heteroscedastic model
  - Example 2 – Simulation of a known process with emphasis on outliers
References
Armstrong, C. S., Blouin, J. L., Jagolinzer, A. D., & Larcker, D. F. (2014). Corporate Governance, Incentives, and Tax Avoidance. Working Paper.
Blatna, D. (2006, March). Outliers in regression. Trutnov, Vol. 30.
Ghosh, D., & Vogt, A. (2012). Outliers: An evaluation of methodologies. In Joint Statistical Meetings, 3455-3460.
Koenker, R. (2012). Quantile Regression in R: A Vignette. Working Paper.
Koenker, R., & Hallock, K. F. (2001, Fall). Quantile Regression. Journal of Economic Perspectives, Vol. 15, No. 4, 143-156.
Kriegel, H., Kroger, P., & Zimek, A. (2010). Outlier detection techniques. Retrieved from http://www.dbs.ifi.lmu.de/~zimek/publications/KDD2010/kdd10-outlier-tutorial.pdf
Leone, A. J., Minutti-Meza, M., & Wasley, C. (2013, August). Influential Observations and Inference in Accounting Research. Working Paper. Retrieved from https://accounting.wharton.upenn.edu/acct/assets/File/Influential%20Observations%20and%20Inference%20in%20Accounting%20Research.pdf
Logan, J., & Petscher, Y. (2013, May 23). An Introduction to Quantile Regression. Retrieved from Modern Modeling Methods Conference Presentation: http://www.modeling.uconn.edu/
Mosteller, F., & Tukey, J. (1977). Data Analysis and Regression: A Second Course in Statistics. Reading, MA: Addison-Wesley.
Outlier. (2015, February 11). Retrieved from Wikipedia: http://en.wikipedia.org/wiki/Outlier
Quantile Regression. (2015, February 16). Retrieved from Wikipedia: http://en.wikipedia.org/wiki/Quantile_regression
Shaw-Allen, P., Suter II, G., Cormier, S., & Yuan, L. (2012, July 31). Quantile Regression: Details. Retrieved from US Environmental Protection Agency: http://www.epa.gov/caddis/da_basic_3.details.html
Winsorising. (2015, January 31). Retrieved from Wikipedia: http://en.wikipedia.org/wiki/Winsorising
Yale, C., & Forsythe, A. B. (1976, August). Winsorized Regression. Technometrics, Vol. 18, No. 3, 291-300.