Investigation of Macro Editing Techniques for Outlier Detection in Survey Data Katherine Jenny Thompson Office of Statistical Methods and Research for Economic Programs

Upload: aaliyah-mcmahon

Post on 27-Mar-2015

TRANSCRIPT

Page 1: Investigation of Macro Editing Techniques for Outlier Detection in Survey Data Katherine Jenny Thompson Office of Statistical Methods and Research for

Investigation of Macro Editing Techniques for Outlier Detection in Survey Data

Katherine Jenny Thompson

Office of Statistical Methods and Research for Economic Programs

Page 2:

Simplified Survey Processing Cycle

[Flow diagram; box labels as extracted: Data Collection; Analyst Review; Micro-editing and Imputation; Individual Returns; Macro-editing; Tabulated Initial Estimates; Analyst Investigation and Correction; Publication Estimates]

Page 3:

Identifying Outlying Estimates

• Set of Estimates
  – Unknown parametric distribution (robust)
  – Contains outliers (resistant)

• Outlier-identification problems (Multiple Outliers)
  – Masking: difficult to detect an individual outlier
  – Swamping: too many false outliers flagged

Page 4:

Outlier Detection Approaches

• Sets of “bivariate” (Ratio) comparisons
  – Same estimate from two consecutive collection periods (historic cell ratios)
  – Different estimates in same collection period (current cell ratios)

• Multivariate comparisons
  – Current period data

Page 5:

Method for Bivariate Comparisons

• Resistant Fences Methods
  – Symmetrized Resistant Fences
  – Asymmetric Fences

• Robust Regression

• Hidiroglou-Berthelot Edit

Page 6:

Bivariate Comparisons (Current Cell Ratios)

• Linear relationship between payroll and employment
• No intercept

[Figure: scatter plot "Paired Estimates". x-axis: Total Employment (0 to 120); y-axis: Annual Payroll (0 to 7000)]

Page 7:

“Traditional” Ratio Edit (Current Cell Ratio)

[Figure: scatter plot of paired estimates with Lower Tolerance and Upper Tolerance lines. x-axis: Total Employment (0 to 120); y-axis: Annual Payroll (0 to 8000). Region labels: Acceptance Region, Outlier Region, Outlier Region]

• “Cone-shaped” tolerances
• Goes through origin
• Strong statistical association

Page 8:

Resistant Fences Methods

[Diagram: q25 and q75 marked on a number line, with fences at q25 − 1.5H and q75 + 1.5H, where H = q75 − q25 is the interquartile range]

• Different numbers of interquartile ranges (1.5 = Inner, 3 = Outer)

• Implicitly assumes symmetry

• May want to “symmetrize”, apply rule, use inverse transformation
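The fence rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the production edit: the quartile definition (`method="inclusive"`), the use of a log transform as the "symmetrizing" step, and the sample ratios are all assumptions that do not come from the slides.

```python
import math
import statistics


def resistant_fences(values, k=1.5):
    """Return (lower, upper) = (q25 - k*H, q75 + k*H), with H = q75 - q25."""
    # Quartile definition is an assumption; the slides do not specify one.
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    h = q75 - q25
    return q25 - k * h, q75 + k * h


def symmetrized_fences(values, k=1.5):
    """Apply the fence rule on the log scale, then transform the fences back.

    The log transform is an assumed choice of symmetrizing transformation
    and requires strictly positive values.
    """
    lower, upper = resistant_fences([math.log(v) for v in values], k)
    return math.exp(lower), math.exp(upper)


def flag_outside(values, lower, upper):
    """Return the values that fall outside the fences."""
    return [v for v in values if v < lower or v > upper]


# Hypothetical cell ratios, with one suspicious value at the end.
ratios = [48.0, 50.0, 51.0, 52.0, 53.0, 55.0, 56.0, 58.0, 95.0]
inner = resistant_fences(ratios, k=1.5)   # inner fences
outer = resistant_fences(ratios, k=3.0)   # outer fences
sym = symmetrized_fences(ratios, k=1.5)
```

With k = 1.5 the rule gives the inner fences and with k = 3 the outer fences; the symmetrized version only changes the scale on which the same rule is applied.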

Page 9:

Asymmetric Fences Methods

[Diagram: fences at q25 − 3 (m − q25) and q75 + 3 (q75 − m), where m is the median]

• Different numbers of interquartile ranges (3 = Inner, 6 = Outer)

• Incorporates skewness of distribution in outlier rule (“Fences”)
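A short sketch of the asymmetric rule, assuming the lower fence sits below q25 by a multiple of (m − q25) and the upper fence sits above q75 by a multiple of (q75 − m). The quartile definition and the skewed sample data are assumptions for illustration only.

```python
import statistics


def asymmetric_fences(values, k=3.0):
    """Return (q25 - k*(m - q25), q75 + k*(q75 - m)), m being the median.

    k = 3 gives the inner fences and k = 6 the outer fences.
    """
    # Quartile definition is an assumption; the slides do not specify one.
    q25, m, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return q25 - k * (m - q25), q75 + k * (q75 - m)


# Hypothetical right-skewed ratios.
skewed = [1.0, 1.1, 1.2, 1.3, 1.5, 1.9, 2.6, 3.8, 9.0]

inner_lo, inner_hi = asymmetric_fences(skewed, k=3.0)
outer_lo, outer_hi = asymmetric_fences(skewed, k=6.0)

flag_inner = [v for v in skewed if v < inner_lo or v > inner_hi]
flag_outer = [v for v in skewed if v < outer_lo or v > outer_hi]
```

Because the upper quartile lies farther from the median than the lower one in this sample, the upper fence is placed farther out than the lower fence, which is how the rule reflects skewness.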

Page 10:

Robust Regression

• Least Trimmed Squares Robust Regression
• Resistant (minimizes median residual)
• Outlier = |residual| ≥ 3 × robust M.S.E.

[Figure: scatter plot of paired estimates with the robust regression line. x-axis: Total Employment (0 to 120); y-axis: Annual Payroll (0 to 7000)]
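For readers who want to experiment, here is a rough sketch of a least trimmed squares fit for the no-intercept model y = b·x. It is not the software used in the study. It starts from each point's own slope, refines with concentration steps, and keeps the slope with the smallest trimmed sum of squares. The coverage h, the scale estimate (root of the trimmed mean squared residual, with no consistency correction), and the data are all assumptions.

```python
import math


def lts_through_origin(x, y, h=None, max_steps=50):
    """Approximate LTS slope for y = b*x; returns (slope, robust_scale)."""
    n = len(x)
    if h is None:
        h = (n + 2) // 2  # assumed coverage: a bit more than half the points

    best_b, best_obj = None, math.inf
    for x0, y0 in zip(x, y):
        if x0 == 0:
            continue
        b = y0 / x0  # starting slope through one observed point
        for _ in range(max_steps):
            # Concentration step: refit on the h best-fitting points.
            keep = sorted(range(n), key=lambda i: (y[i] - b * x[i]) ** 2)[:h]
            sxx = sum(x[i] * x[i] for i in keep)
            if sxx == 0:
                break
            new_b = sum(x[i] * y[i] for i in keep) / sxx
            if new_b == b:
                break
            b = new_b
        obj = sum(sorted((yi - b * xi) ** 2 for xi, yi in zip(x, y))[:h])
        if obj < best_obj:
            best_b, best_obj = b, obj

    return best_b, math.sqrt(best_obj / h)


# Hypothetical (employment, payroll) pairs; the last pair is inconsistent.
employment = [10, 20, 30, 40, 50, 60, 70, 80]
payroll = [500, 1010, 1490, 2000, 2510, 2990, 3500, 7000]

slope, scale = lts_through_origin(employment, payroll)
flagged = [i for i, (xi, yi) in enumerate(zip(employment, payroll))
           if abs(yi - slope * xi) > 3 * scale]
```

Because the scale is estimated from only the best-fitting half of the data, it tends to be small, so points with modest residuals can also be flagged. That behavior is consistent with the swamping reported for robust regression later in the talk.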

Page 11:

Issue at Origin (Historic Cell Ratio)

[Figure: scatter plot. x-axis: Prior Month's Number of Employees (0 to 35); y-axis: Current Month's Number of Employees (0 to 80)]

Page 12:

Hidiroglou-Berthelot (HB) Edit

[Figure: HB "Effects" plotted against Employment, with Upper Bound and Lower Bound lines. x-axis: Employment (0 to 120); y-axis: HB "Effects" (-250 to 50)]

• Accounts for magnitude of unit (variability at origin)

Page 13:

Hidiroglou-Berthelot (HB) Edit

• Two-step transformation (Ei)
  – Centering transformation on ratios
  – Magnitude transformation that accounts for the relative importance of large cases

• Asymmetric Fences “Type” Outlier Rule

• Key Parameters
  – U = magnitude transformation parameter (0 ≤ U ≤ 1)
  – C = controls width of outlier region
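The two-step transformation and the fence-type rule can be sketched as follows. The formulas follow the HB edit as it is commonly described in the literature, not anything printed on these slides. The small constant `a` (set to 0.05 here), the quartile definition, and the data are assumptions. All values must be strictly positive.

```python
import statistics


def hb_edit(prev, curr, u=0.3, c=10.0, a=0.05):
    """Return (effects, (lower, upper), flags) for paired positive values."""
    ratios = [ci / pi for pi, ci in zip(prev, curr)]
    r_med = statistics.median(ratios)

    effects = []
    for pi, ci, r in zip(prev, curr, ratios):
        # Step 1: center the ratio symmetrically around the median ratio.
        s = 1 - r_med / r if r < r_med else r / r_med - 1
        # Step 2: scale by unit size so large cases carry more weight.
        effects.append(s * max(pi, ci) ** u)

    # Quartile definition is an assumption.
    q1, e_med, q3 = statistics.quantiles(effects, n=4, method="inclusive")
    # The a * |median| floor keeps the limits from collapsing when the
    # effects are tightly clustered.
    d_low = max(e_med - q1, abs(a * e_med))
    d_high = max(q3 - e_med, abs(a * e_med))
    lower, upper = e_med - c * d_low, e_med + c * d_high

    flags = [e < lower or e > upper for e in effects]
    return effects, (lower, upper), flags


# Hypothetical prior and current values; the last unit drops sharply.
prior = [10, 20, 30, 40, 50, 60, 70, 80, 90]
current = [11, 21, 33, 41, 52, 66, 72, 85, 30]

effects, limits, flags = hb_edit(prior, current, u=0.3, c=10.0)
```

Setting U closer to 0 makes the rule behave like a pure ratio edit, while U closer to 1 gives more weight to large units. Increasing C widens the acceptance region.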

Page 14:

Multivariate Methods: Mahalanobis Distance

• Multivariate normal (μ, Σ)
  – T(X) estimates μ
  – C(X) estimates Σ
  – p is the number of distinct variables (items)

• Prone to masking (difficult to detect individual outliers)

$$MD_i = (x_i - T(X))^{\top} \, C(X)^{-1} \, (x_i - T(X)) \sim \chi^2_p$$
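As an illustration of the classical (non-robust) Mahalanobis distance, the sketch below handles the two-variable case with the sample mean as T(X) and the sample covariance as C(X). It uses only the standard library, so the 2×2 inverse is written out by hand. For p = 2 the chi-square quantile has the closed form −2·ln(1 − level). The 0.975 level and the data are assumptions.

```python
import math


def mahalanobis_2d(points):
    """Squared Mahalanobis distances using the sample mean and covariance."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2

    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # Quadratic form with the explicit inverse of the 2x2 covariance.
        distances.append(
            (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        )
    return distances


# Hypothetical item pairs; the last record breaks the overall pattern.
records = [(1, 2.0), (2, 4.1), (3, 5.9), (4, 8.2), (5, 9.8),
           (6, 12.1), (7, 14.0), (8, 16.2), (9, 17.9), (10, 5.0)]

md = mahalanobis_2d(records)
cutoff = -2 * math.log(1 - 0.975)  # chi-square quantile for p = 2
flagged = [i for i, d in enumerate(md) if d > cutoff]
```

Since the outlying record also enters the mean and covariance, it inflates the estimates used to judge it. With several outliers this effect can hide them from one another, which is the masking problem noted on the slide.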

Page 15:

Robust Alternatives

• M-estimation (not considered)

• “Production Method”

• Minimum Volume Ellipse (MVE)
  – Resistant (50% breakdown) and robust

• Minimum Covariance Determinant (MCD)
  – Resistant (50% breakdown) and robust

• Assumption of Normality
  – Log-transformation

Page 16:

Evaluation: Classify Item Estimates

[Diagram: Input Value (Reported) and Final Value (Tabulated) are combined into Ratio = Input/Final; each ratio is classified as Outlier, Potential Outlier, or Not an Outlier]

[Figure: histogram of ratio values. x-axis: Ratio Values; y-axis: Frequency Counts (0 to 50)]

Page 17:

Evaluation: Classify Ratios (Bivariate)

• Conservative
  – Ratio is “outlier” if numerator or denominator is an outlier

• Anti-Conservative
  – Ratio is “outlier” if numerator or denominator is an outlier or a potential outlier

Page 18:

Evaluation: Classify Records (Multivariate)

• Conservative
  – Record is “outlier” if at least one estimate is an outlier

• Anti-Conservative
  – Record is “outlier” if at least one estimate is an outlier or a potential outlier
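The two classification rules differ only in which item-level labels count as "bad". A small sketch, with label strings and function names chosen here for illustration:

```python
OUTLIER = "outlier"
POTENTIAL = "potential outlier"
CLEAN = "not an outlier"


def is_outlier_unit(item_labels, conservative=True):
    """Classify a ratio (numerator, denominator) or a whole record.

    Conservative: bad only if some item is an outlier.
    Anti-conservative: potential outliers also count as bad.
    """
    bad = {OUTLIER} if conservative else {OUTLIER, POTENTIAL}
    return any(label in bad for label in item_labels)


# A ratio whose numerator is a potential outlier and denominator is clean.
ratio_labels = (POTENTIAL, CLEAN)
conservative_result = is_outlier_unit(ratio_labels, conservative=True)
anti_conservative_result = is_outlier_unit(ratio_labels, conservative=False)

# A record with several item estimates, one of which is an outlier.
record_labels = [CLEAN, CLEAN, OUTLIER, POTENTIAL]
record_result = is_outlier_unit(record_labels, conservative=True)
```

The same function covers both the bivariate case (two labels per ratio) and the multivariate case (one label per item in the record).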

Page 19:

Evaluation Statistics: Bivariate Comparisons

• Individual Test Level
  – Type I Error Rate: proportion of false rejects
  – Type II Error Rate: proportion of false accepts
  – Hit Rate: proportion of flagged estimates that are outliers

• All-Test Level
  – All-item Type II error rate

Page 20:

Evaluation Statistics: Multivariate Comparisons

• Type I error rate: the proportion of non-outlier records that are flagged as outliers

• Type II error rate: the proportion of outlier records that are not flagged as outliers (missed “bad” values)
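Given a "truth" classification and the flags produced by an edit, the evaluation statistics reduce to simple counts. The sketch below assumes two parallel lists of booleans; the function name and the example data are hypothetical.

```python
def evaluation_rates(is_outlier, is_flagged):
    """Return Type I rate, Type II rate, and hit rate as a dict.

    A rate is None when its denominator is zero.
    """
    pairs = list(zip(is_outlier, is_flagged))
    false_rejects = sum(1 for truth, flag in pairs if not truth and flag)
    false_accepts = sum(1 for truth, flag in pairs if truth and not flag)
    hits = sum(1 for truth, flag in pairs if truth and flag)

    n_outliers = sum(1 for truth, _ in pairs if truth)
    n_clean = len(pairs) - n_outliers
    n_flagged = sum(1 for _, flag in pairs if flag)

    def rate(count, total):
        return count / total if total else None

    return {
        "type_1": rate(false_rejects, n_clean),     # non-outliers flagged
        "type_2": rate(false_accepts, n_outliers),  # outliers missed
        "hit_rate": rate(hits, n_flagged),          # flagged that are outliers
    }


# Hypothetical truth and flags for eight units.
truth = [True, True, False, False, False, False, True, False]
flags = [True, False, True, False, False, False, True, False]
rates = evaluation_rates(truth, flags)
```

In this example one of five non-outliers is flagged, one of three outliers is missed, and two of the three flagged units are true outliers.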

Page 21:

Annual Capital Expenditures Survey (ACES)

• Sample Survey (Stratified SRS-WOR)
  – ACE-1: Employer companies
  – ACE-2: Non-employer companies (not discussed)

• New sample selection each year

• Total and year-to-year change estimates
  – Total Capital Expenditures
  – Structures (New and Used)
  – Equipment (New and Used)

Page 22:

Capital Expenditures Data

• Characterized by
  – Low year-to-year correlation (same company)
  – Weak association with available auxiliary data

• Editing procedures focus on additivity

• Outlier correction at micro-level

Page 23:

Bivariate Comparisons

Methods: Robust Regression, Resistant Fences, HB Edit

Ratio tests:
• Structures/Total
• New Structures/Structures
• New Structures/Used Structures
• Equipment/Total
• New Equipment/Equipment

• Resistant Fences: (Symmetric or Asymmetric) (Inner or Outer)

• HB Edit: (U = 0.3 or 0.5) (c = 10 or 20)

Page 24:

Results – Individual Tests

• Robust Regression prone to swamping
  – High Type I error rate (false rejects)

• Comparable performance with Asymmetric Inner Fences and HB Edit (U = 0.3, c = 10)
  – Low Type I error rates
  – High Hit Rates
  – High Type II error rates

• Other variations of Resistant Fences and HB edit not as good

Page 25:

Results – All-Tests

• Very large Type II error rates (approx. 50%)
  – Robust regression
  – Symmetric resistant outer fences
  – HB edit with c = 20

• Improved Type II error rates (30% - 40%)
  – Asymmetric inner fences
  – HB edit (U = 0.3, c = 10)

Page 26:

Multivariate Results

• Original Data: considered methods ineffective

• Log-transformed data: improved performance (MCD and MVE)
  – Reduced Type I error rates
  – Comparable Type II error rates (to original-data MCD and MVE)

[Figure: bar chart "Conservative Results: 2002" showing Type I Error Rates and Type II Error Rates (y-axis 0 to 1) for Production-MD, MCD (original), MVE (original), MCD (log-transformed), and MVE (log-transformed)]

Page 27:

Multivariate Versus Bivariate: Different Outcomes (Conservative)

Combined HB edits flag more “outliers”:
  – Higher Type I error rate
  – Lower Type II error rates for the complete set of HB edits

[Figure: bar chart "Counts of Non-Flagged Outliers: Type I Errors (False Rejects)", HB versus MVE for 2002 and 2003. Data labels as extracted: 8, 0, 11, 4]

[Figure: bar chart "Counts of Missed Outliers: Type II Errors (False Accepts)", HB versus MVE for 2002 and 2003. Data labels as extracted: 13, 14, 0, 4]

Page 28:

Comments

• Economic data with inconsistent statistical association between items in each collection period

• Critical values must be determined by the data set at hand (no “hard-coding”)

• Dynamically
  – Standardize the comparisons (HB edit, log transformation)
  – Compute outlier limits

• Could try hybrid approach:
  – Multivariate method plus a few current cell ratio tests with the HB edit
  – Perform all bivariate tests, but unduplicate cells before analyst review

Page 29:

Final Thoughts/Next Steps

• Examined one set of economic data and considered only two separate collections from this program

• Extrapolation would be foolish

• My results need to be validated on other economic data sets
  – a more typical periodic business survey and/or
  – a well-constructed simulation study

Page 30:

Any Questions?

Katherine Jenny Thompson

[email protected]