Post on 19-Dec-2015


Quantile Regression vs Winsorization as methods for dealing with outlier prone processes

Savannah Guo, Blair Marquardt, Tim Thorley
3/12/15

Outliers – What are they?

1. A data point that lies outside the range of the rest of our data.

2. An observation that is distant from other observations.

***Both definitions are fine to describe outliers, but vague when it comes to making critical decisions about your data

Outliers – How do they occur?

Measurement error
  Examine or “scrub” your data for typical errors
  You may never catch all errors
  Some errors aren’t out on the extremes of x and y

Outliers – How do they occur?

Extreme, but non-error, values
  Extreme, but accurate, data may tell an important story
  Perhaps this data was generated from a different process, or you are being alerted to critical omitted variables

Outliers – How do they affect data analysis?

Different types of outliers will have different impacts on data analysis

Some observations that don’t appear to be outliers may still be influential.
  Influential observations are data points that have a large impact on the calculated values of various estimates (e.g., mean, reg. coefficients, standard errors, etc.).
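One way to see influence is to refit after dropping each observation in turn and check how far an estimate moves. The following is a minimal Python sketch of that idea (the course demonstrations use R; the data and the function name ols_slope are made up for illustration). The last point has an unremarkable y value but sits far out in x.

```python
def ols_slope(x, y):
    """Slope of the least-squares line for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx

# Hypothetical data: the first five points lie exactly on y = 2x.
# The sixth has y = 10 (inside the range of y) but x = 20.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 10.0]

full_slope = ols_slope(x, y)

# Leave-one-out: absolute change in the slope when observation i is dropped.
loo_changes = [
    abs(ols_slope(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) - full_slope)
    for i in range(len(x))
]
most_influential = max(range(len(x)), key=lambda i: loo_changes[i])

print(round(full_slope, 3), most_influential)
```

Dropping the sixth point restores a slope of 2, while dropping any other point changes the slope only slightly, so that single observation drives the fit.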

Outliers and Leverage (Blatna, 2006)

Outliers on Y (Response, DV) variable: Outliers
  May represent model failure – different process

Outliers on X (Predictor, IV) variable: Leverage Point
  Good Leverage Points
  Bad Leverage Points
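Leverage can be quantified with hat values. For a regression with one predictor, h_i = 1/n + (x_i - x_bar)^2 / Sxx. The Python sketch below (illustrative data; the 2p/n cutoff is a common rule of thumb, not something taken from the slides) flags a high-leverage point and then shows the good/bad distinction: a good leverage point follows the trend of the other data, a bad one does not.

```python
def hat_values(x):
    """Leverage h_i = 1/n + (x_i - x_bar)^2 / Sxx for simple regression."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

def ols_slope(x, y):
    """Slope of the least-squares line for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx

x = [1.0, 2.0, 3.0, 4.0, 5.0, 15.0]   # last x is far from the rest
h = hat_values(x)

# Rule-of-thumb cutoff (an assumption): 2 * p / n with p = 2 parameters.
cutoff = 2 * 2 / len(x)
flagged = [i for i, hi in enumerate(h) if hi > cutoff]

# Same x, two hypothetical responses for the flagged point.
y_good = [2.0, 4.0, 6.0, 8.0, 10.0, 30.0]  # on the line y = 2x
y_bad = [2.0, 4.0, 6.0, 8.0, 10.0, 5.0]    # far below the line

slope_good = ols_slope(x, y_good)
slope_bad = ols_slope(x, y_bad)
print(flagged, round(slope_good, 3), round(slope_bad, 3))
```

The leverage is the same in both cases because it depends only on x. Whether the point helps or harms the fit depends on its y value.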

Outlier examples in R

***The code used in the outlier examples comes from Dr. Westfall’s Regression class webpage, including last year’s (2014) material.

Quantile Wage Example

First example on 3/12 in course webpage.

Quantile regression
  A regression function for each quantile (decile, quartile, quintile, etc.)
  Compare OLS & quantile regressions:

OLS: estimate the mean of the distribution of Y conditional on X

quantile: estimate the quantiles of the distribution of Y conditional on X

Conditional quantile function
  Conditional distributions p(y|x)
  Parameters are estimated separately for each identified quantile (median, quartile, decile, etc.) of Y

Optimization

Optimization (cont’d)
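For reference, quantile regression is usually posed (Koenker & Hallock, 2001) as choosing coefficients that minimize the sum of check-function losses rho_tau(u) = u * (tau - 1[u < 0]) over the residuals u = y - x'b. The Python sketch below covers only the simplest case, a model with an intercept and no predictors, to show that minimizing this loss returns a sample quantile, while the mean (the least-squares answer) is pulled toward an extreme value. The data and function names are made up for illustration.

```python
def check_loss(u, tau):
    """Check function: rho_tau(u) = u * (tau - 1[u < 0])."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def total_loss(q, ys, tau):
    """Sum of check losses when every y is predicted by the constant q."""
    return sum(check_loss(y - q, tau) for y in ys)

def best_constant(ys, tau):
    # The objective is convex and piecewise linear in q, so a minimum
    # is attained at one of the observed values; search over those.
    return min(sorted(ys), key=lambda q: total_loss(q, ys, tau))

ys = [1.0, 2.0, 3.0, 4.0, 100.0]   # one extreme value

median_fit = best_constant(ys, 0.5)     # tau = 0.5 gives the median
lower_fit = best_constant(ys, 0.25)     # a lower quantile
mean_fit = sum(ys) / len(ys)            # least-squares constant

print(median_fit, lower_fit, mean_fit)
```

With predictors, the same objective is minimized over regression coefficients, typically by linear programming; that step is not shown here.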

Strengths
  Complete picture of the data generating process.
  “Outliers” (birth-weight example in the reading material).
  Systematic difference in the process (i.e. heteroscedasticity) (income example)

Requires the dependent variable to be continuous

An accounting example (Armstrong et al. 2014)

Relation between the board’s financial knowledge and level of tax avoidance.

Prior study shows mixed results using OLS. This paper proposes that it’s very important to examine the tails of the tax avoidance distribution.

Findings (next slide)

An accounting example (cont’d)

Other examples
  Peak energy use example often used by Dr. Westfall
  Household income example from the reading material (R code from Koenker 2012)

Winsorization

The transformation of statistics by limiting or replacing extreme values.

Developed by Charles P. Winsor, 1895-1951.

Used heavily in accounting and finance research.

Winsorization – How does it work?

Replaces data points that have extreme residuals with the values at chosen percentiles (e.g., 1% and 99%, or 5% and 95%).

Because it uses quantiles, the winsorized variable must be continuous (not a dummy variable).

Beneficial when the outliers represent a different process than that of interest in our study.

Note that this is not the same as truncation: winsorization replaces extreme values, whereas truncation drops them.
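The difference between winsorizing and truncating can be sketched in a few lines of Python. This example applies the cutoffs directly to a variable, which is the common practice, rather than to residuals. The nearest-rank percentile rule, the 5%/95% cutoffs, and the data are all assumptions chosen for illustration; other percentile conventions give slightly different cutoffs.

```python
import math

def percentile(sorted_vals, p):
    """Nearest-rank percentile for integer p in 1..100 (one convention of several)."""
    k = max(1, math.ceil(p * len(sorted_vals) / 100))
    return sorted_vals[k - 1]

def winsorize(values, lower_pct=5, upper_pct=95):
    """Replace values beyond the percentile cutoffs with the cutoffs."""
    s = sorted(values)
    lo, hi = percentile(s, lower_pct), percentile(s, upper_pct)
    return [min(max(v, lo), hi) for v in values]

def truncate(values, lower_pct=5, upper_pct=95):
    """Drop values beyond the percentile cutoffs."""
    s = sorted(values)
    lo, hi = percentile(s, lower_pct), percentile(s, upper_pct)
    return [v for v in values if lo <= v <= hi]

# Hypothetical sample: 38 ordinary values plus one extreme value in each tail.
data = [-300] + list(range(1, 39)) + [500]

w = winsorize(data)
t = truncate(data)
print(len(data), len(w), len(t), min(w), max(w))
```

The winsorized sample keeps all 40 observations with the tails pulled in, while the truncated sample has fewer observations.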

Limitations

Changes the data we analyze.
  Not appropriate to extrapolate or develop prediction intervals.

Originally designed to replace data points based on the residuals.
  Often, researchers winsorize all or certain independent and dependent variables, even if the winsorized data point is close to the fitted value.

Reduces variability in the data.
  Biases standard errors downward.
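The loss of variability is easy to check numerically. In this Python sketch the clipping bounds (1 and 37) are assumed to have come from a 5%/95% winsorization of made-up data; the sample standard deviation and the standard error of the mean both shrink once the tails are pulled in.

```python
import statistics

def clip(values, lo, hi):
    """Replace values below lo with lo and above hi with hi."""
    return [min(max(v, lo), hi) for v in values]

# Hypothetical sample with one extreme value in each tail.
raw = [-300] + list(range(1, 39)) + [500]
wins = clip(raw, 1, 37)   # assumed winsorization bounds

sd_raw = statistics.stdev(raw)
sd_wins = statistics.stdev(wins)

# Standard error of the mean: sd / sqrt(n); n is unchanged by winsorizing.
se_raw = sd_raw / len(raw) ** 0.5
se_wins = sd_wins / len(wins) ** 0.5

print(round(sd_raw, 1), round(sd_wins, 1))
```

If the extreme values belong to the process under study, the smaller standard error overstates the precision of the estimate.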

Demonstrations

Winsorization in R
  Peak energy data from Dr. Westfall

Summary demonstrations comparing OLS, quantile regression, and winsorized OLS.
  Example 1 – Heteroscedastic model
  Example 2 – Simulation of a known process with emphasis on outliers

Reference

Armstrong, C. S., Blouin, J. L., Jagolinzer, A. D., & Larcker, D. F. (2014). Corporate Governance, Incentives, and Tax Avoidance. Working Paper.

Blatna, D. (2006-03). Outliers in regression. Trutnov, Vol. 30.

Ghosh, D., & Vogt, A. (2012). Outliers: An evaluation of methodologies. In Joint Statistical Meetings, 3455-3460.

Koenker, R. (2012). Quantile Regression in R: A Vignette. Working Paper.

Koenker, R., & Hallock, K. F. (Fall 2001). Quantile Regression. Journal of Economic Perspectives, Vol. 15, No. 4, 143-156.

Kriegel, H., Kroger, P., & Zimek, A. (2010). Outlier detection techniques. Retrieved from http://www.dbs.ifi.lmu.de/~zimek/publications/KDD2010/kdd10-outlier-tutorial.pdf

Leone, A. J., Minutti-Meza, M., & Wasley, C. (2013, August). Influential Observations and Inference in Accounting Research. Working Paper. Retrieved from https://accounting.wharton.upenn.edu/acct/assets/File/Influential%20Observations%20and%20Inference%20in%20Accounting%20Research.pdf

Logan, J., & Petscher, Y. (2013, May 23). An Introduction to Quantile Regression. Modern Modeling Methods Conference Presentation. Retrieved from http://www.modeling.uconn.edu/

Mosteller, F., & Tukey, J. (1977). Data Analysis and Regression: A Second Course in Statistics. Reading, MA: Addison-Wesley.

Outlier. (2015, February 11). Retrieved from Wikipedia: http://en.wikipedia.org/wiki/Outlier

Quantile Regression. (2015, February 16). Retrieved from Wikipedia: http://en.wikipedia.org/wiki/Quantile_regression

Shaw-Allen, P., Suter II, G., Cormier, S., & Yuan, L. (2012, July 31). Quantile Regression: Details. Retrieved from US Environmental Protection Agency: http://www.epa.gov/caddis/da_basic_3.details.html

Winsorising. (2015, January 31). Retrieved from Wikipedia: http://en.wikipedia.org/wiki/Winsorising

Yale, C., & Forsythe, A. B. (August 1976). Winsorized Regression. Technometrics, Vol. 18, No. 3, 291-300.