Eurostat Statistical Data Editing and Imputation


Page 1: Eurostat Statistical Data Editing and Imputation

Eurostat

Statistical Data Editing and Imputation

Presented by

• Sander Scholtus
• Statistics Netherlands

Introduction

• Data arrive at a statistical institute...

ID | size class | number of employees | turnover (x €1000) | labour costs (x €1000) | other costs (x €1000) | total costs (x €1000)

0001 large 213 49,827 0 30,479 30,479

0002 large 3 64,933

0003 medium 42 1,462 51 1,513

0004 medium 29 6,301 891 6,350

0005 small 4 875,000 98,000 547,000 645,000

0006 small 8 1,716 175 998

0007 small 0 614 47 153 570

Introduction

• Data arrive at a statistical institute...
  – …containing errors and implausible values
  – …containing missing values
• To produce statistical output of sufficient quality, these data problems have to be treated
  – Statistical data editing deals with errors
  – Imputation deals with missing values

Statistical data editing

• Overview
  – Goals
  – Edit rules
  – Different editing methods and how to combine them
  – Modules in the handbook

Statistical data editing – goals

• Traditional goal of editing:
  – Detect and correct all errors in the collected data
• Problems:
  – Very labour-intensive
  – Very time-consuming
  – Highly inefficient: measurement error is not the only source of error in statistical output

Statistical data editing – goals

• Modern goals of editing:
  1. To identify possible sources of errors so that the statistical process may be improved in the future.
  2. To provide information about the quality of the data collected and published.
  3. To detect and correct influential errors in the collected data.
  4. If necessary, to provide complete and consistent micro-data.

Sources: Granquist (1997), EDIMBUS (2007)

Statistical data editing – edit rules

• Edit rules (edits, edit checks, checking rules)
  – Used to detect errors
  – Can be either hard or soft
  – General form:
    IF (unit ∈ edit group) THEN (test variable ∈ acceptance region)

Statistical data editing – edit rules

• Examples of edit rules:
  – Turnover ≥ 0
    (non-negativity edit, hard)
  – Profit = Turnover – Total costs
    (balance edit, hard)
  – IF (Size class = “Small”) THEN (0 ≤ Number of employees < 10)
    (conditional edit, soft)
  – IF (Economic activity = “Construction”) THEN (a < Turnover / Number of employees < b)
    (ratio edit, soft)
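The four example edits can be written as simple checks on a record. The following is a minimal Python sketch and not part of the presentation: the field names, the record layout and the ratio bounds a = 10 and b = 500 are assumptions chosen for illustration.

```python
# Illustrative sketch: field names and ratio bounds are assumptions.
RATIO_BOUNDS = (10.0, 500.0)  # hypothetical values for a and b in the ratio edit


def check_edits(rec):
    """Return (edit name, kind) for every edit rule that the record fails."""
    failed = []
    # Non-negativity edit (hard): Turnover >= 0
    if rec["turnover"] < 0:
        failed.append(("non-negativity", "hard"))
    # Balance edit (hard): Profit = Turnover - Total costs
    if rec["profit"] != rec["turnover"] - rec["total_costs"]:
        failed.append(("balance", "hard"))
    # Conditional edit (soft): applies only to units in the "small" size class
    if rec["size_class"] == "small" and not (0 <= rec["employees"] < 10):
        failed.append(("conditional", "soft"))
    # Ratio edit (soft): skipped when the ratio is undefined (no employees)
    if rec["activity"] == "construction" and rec["employees"] > 0:
        a, b = RATIO_BOUNDS
        if not (a < rec["turnover"] / rec["employees"] < b):
            failed.append(("ratio", "soft"))
    return failed


record = {"size_class": "small", "activity": "construction", "employees": 12,
          "turnover": 600, "total_costs": 450, "profit": 100}
failures = check_edits(record)
print(failures)
```

A failed hard edit means the record certainly contains an error; a failed soft edit only marks the record as suspicious.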

Statistical data editing – methods

[Figure: flow chart of the editing process. Elements: raw microdata, deductive editing, selective editing (selected / not selected), manual editing, automatic editing, macro-editing, statistical microdata]

Statistical data editing – methods

• Deductive editing
  – Directed at systematic errors
  – Deterministic detection and amendment
    · if-then rules
    · algorithms
  – Examples:
    · unit of measurement errors (e.g. “4,000,000” instead of “4,000”)
    · sign errors (e.g. “–10” instead of “10”)
    · simple typing errors (e.g. “192” instead of “129”)
    · subject-matter specific errors
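Two of these deterministic corrections can be sketched as if-then rules. This is an illustrative Python sketch under stated assumptions: the use of a reference value (for instance the value reported in a previous period) and the thresholds 300 and 3000 are hypothetical choices, not taken from the presentation.

```python
def fix_thousand_error(value, reference, low=300.0, high=3000.0):
    """Detect a unit of measurement error by comparing with a reference value.

    If value / reference lies near 1000 (between the assumed thresholds),
    the value is taken to be reported in units instead of thousands.
    Returns (amended value, changed flag).
    """
    if reference > 0 and low <= value / reference <= high:
        return round(value / 1000), True
    return value, False


def fix_sign_error(value):
    """Flip the sign of a variable that cannot be negative."""
    if value < 0:
        return -value, True
    return value, False


turnover, turnover_changed = fix_thousand_error(4_000_000, reference=4_100)
costs, costs_changed = fix_sign_error(-10)
print(turnover, turnover_changed, costs, costs_changed)
```

Returning a flag next to the amended value makes it possible to record which values were changed by deductive editing.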


Statistical data editing – methods

• Selective editing
  – Prioritise records according to expected benefit of their manual amendment on target estimates
  – Records can be selected as they arrive (input editing)
  – Common approach based on score functions
    · Local scores for key target variables, e.g., [formula not preserved]
    · Use global score to summarise local scores (e.g., sum or maximum)
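The score function approach can be sketched as follows. The local score formula from the slide is not preserved in this transcript, so the form used here (survey weight times the absolute difference between the observed and an anticipated value, divided by an estimated total) is an assumption based on common practice. The data, the totals and the threshold are hypothetical.

```python
def local_score(weight, observed, anticipated, estimated_total):
    """Assumed local score: weighted absolute deviation relative to the total."""
    return weight * abs(observed - anticipated) / estimated_total


def global_score(local_scores, how="max"):
    """Summarise the local scores of one unit by their maximum or their sum."""
    return max(local_scores) if how == "max" else sum(local_scores)


# Hypothetical units: variable -> (weight, observed value, anticipated value)
units = {
    "0005": {"turnover": (10.0, 875_000, 900), "total_costs": (10.0, 645_000, 650)},
    "0006": {"turnover": (10.0, 1_716, 1_650), "total_costs": (10.0, 998, 1_000)},
}
totals = {"turnover": 1_000_000.0, "total_costs": 800_000.0}  # assumed estimates
THRESHOLD = 0.05  # hypothetical cut-off for manual editing


def score_unit(variables, how="max"):
    scores = [local_score(w, obs, ant, totals[name])
              for name, (w, obs, ant) in variables.items()]
    return global_score(scores, how)


selected = [uid for uid, variables in units.items()
            if score_unit(variables) > THRESHOLD]
print(selected)
```

Because each unit is scored on its own, records can be scored and routed as they arrive, which matches the idea of input editing.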


Statistical data editing – methods

• Manual editing
  – Requires:
    · Human editors (subject-matter specialists)
    · Dedicated software (interactive editing)
    · Edit rules (hard and soft)
    · Editing instructions
  – Re-contacts with businesses are sometimes used
  – Important as a source for improvements in future rounds of a repeated survey


Statistical data editing – methods

• Automatic editing
  – Obtain consistent micro-data for non-influential records
  – Paradigm of Fellegi and Holt (1976): Data should be made consistent with the edit rules by changing the fewest possible (weighted) number of items.
    · Leads to error localisation as a mathematical optimisation problem
    · Imputation of new values as a separate step
  – Requires:
    · (Hard) edit rules
    · Dedicated software (e.g.: Banff by Statistics Canada; SLICE by Statistics Netherlands; R package editrules)
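The Fellegi-Holt paradigm can be illustrated with a brute-force search on a small categorical record. This Python sketch is only an illustration: the variables, domains, edit rules and weights are hypothetical, and the exhaustive search shown here is feasible only for a handful of variables. It does not reproduce the algorithms used by the software packages named above.

```python
from itertools import combinations, product

# Hypothetical categorical variables and their domains
DOMAINS = {
    "age": ("child", "adult"),
    "marital": ("unmarried", "married"),
    "employment": ("none", "employed"),
}

# Hard edit rules: each function returns True when the record passes
EDITS = [
    lambda r: not (r["age"] == "child" and r["marital"] == "married"),
    lambda r: not (r["age"] == "child" and r["employment"] == "employed"),
]


def satisfies(record):
    return all(edit(record) for edit in EDITS)


def localise_errors(record, weights):
    """Return a minimum-weight set of variables whose values can be changed
    so that the record satisfies all edits (None if no such set exists)."""
    variables = list(record)
    candidates = []
    for k in range(len(variables) + 1):
        for subset in combinations(variables, k):
            candidates.append((sum(weights[v] for v in subset), subset))
    candidates.sort(key=lambda c: c[0])  # cheapest sets of variables first
    for _, subset in candidates:
        # Try every combination of new values for the variables in the subset
        for values in product(*(DOMAINS[v] for v in subset)):
            trial = dict(record)
            trial.update(zip(subset, values))
            if satisfies(trial):
                return set(subset)
    return None


record = {"age": "child", "marital": "married", "employment": "employed"}
equal = localise_errors(record, {"age": 1, "marital": 1, "employment": 1})
weighted = localise_errors(record, {"age": 3, "marital": 1, "employment": 1})
print(equal, weighted)
```

With equal weights the search points at the single variable involved in both failed edits. With a higher reliability weight on that variable, changing the two other variables becomes the cheaper solution. The search only identifies which values to change; the new values are supplied afterwards by imputation.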


Statistical data editing – methods

• Macro-editing
  – Also known as output editing
  – Same purpose as selective editing
  – Uses data from all available records at once
  – Aggregate method:
    · Compute high-level aggregates
    · Check their plausibility
    · Drill down to suspicious lower-level aggregates
    · Eventually: Drill down to suspicious individual records
    · Feedback to manual editing
  – Graphical aids (scatter plots, etc.) to find outliers
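The aggregate method can be sketched in a few lines of Python. The sketch below is an illustration under assumptions: plausibility is judged by comparing each aggregate with the previous period's value, the tolerance of 50% is arbitrary, and the records and previous totals are made up.

```python
from collections import defaultdict


def aggregate(records, key):
    """Total turnover per group (for example per size class)."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec["turnover"]
    return dict(totals)


def suspicious_groups(current, previous, max_rel_change=0.5):
    """Flag groups whose total changed by more than the assumed tolerance."""
    flagged = []
    for group, total in current.items():
        prev = previous.get(group)
        if prev and abs(total - prev) / prev > max_rel_change:
            flagged.append(group)
    return flagged


def drill_down(records, key, group, top=1):
    """Return the records that contribute most to a suspicious aggregate."""
    members = [r for r in records if r[key] == group]
    return sorted(members, key=lambda r: r["turnover"], reverse=True)[:top]


# Hypothetical data
records = [
    {"id": "A", "size_class": "small", "turnover": 875_000},
    {"id": "B", "size_class": "small", "turnover": 1_716},
    {"id": "C", "size_class": "small", "turnover": 614},
    {"id": "D", "size_class": "large", "turnover": 49_827},
    {"id": "E", "size_class": "large", "turnover": 64_933},
]
previous_totals = {"small": 3_200.0, "large": 110_000.0}

flagged = suspicious_groups(aggregate(records, "size_class"), previous_totals)
to_review = [r["id"] for g in flagged for r in drill_down(records, "size_class", g)]
print(flagged, to_review)
```

The records returned by the drill-down step are the ones that would be passed back to manual editing.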

Statistical data editing – modules

• Modules in the handbook:
  1. Main theme module
  2. Deductive editing
  3. Selective editing
  4. Automatic editing
  5. Manual editing
  6. Macro-editing
  7. Editing administrative data
  8. Editing for longitudinal data

Imputation

• Overview
  – Missing data
  – Imputation methods
  – Special topics
  – Modules in the handbook

Imputation – missing data

• Missing data may occur because of
  – Logical reasons
    · A particular question does not apply to a particular unit
  – Unit non-response
    · No data observed at all for a particular unit
  – Item non-response
    · Unit is not able to answer a particular question
    · Unit is not willing to answer a particular question
  – Editing
    · Originally observed value discarded during automatic editing

Imputation – missing data

• Imputation: filling in new (estimated) values for data items that are missing
• Commonly used for missing data due to item non-response and editing
• Obtain a completed micro-data file prior to estimation
  – Simplifies the estimation step
  – Prevents inconsistencies in the output

Imputation – methods

• Deductive imputation
• Model-based imputation
• Donor imputation

• Assumption: All observed values are correct
  – Imputation applied after error localisation

Imputation – methods

• Deductive imputation
  – Derive (rather than estimate) missing values from observed values based on
    · logical relations (edit rules)
    · substantive imputation rules
  – Can be very useful as a first imputation step

Observed data:

ID | turnover (sales) | turnover (services) | turnover (other) | turnover (total)
1001 | 154 | 10 | | 166
1002 | 147 | | | 147

After applying the balance edit (sales + services + other = total) to unit 1001:

ID | turnover (sales) | turnover (services) | turnover (other) | turnover (total)
1001 | 154 | 10 | 2 | 166
1002 | 147 | | | 147

After applying the balance edit together with non-negativity to unit 1002:

ID | turnover (sales) | turnover (services) | turnover (other) | turnover (total)
1001 | 154 | 10 | 2 | 166
1002 | 147 | 0 | 0 | 147
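The two deductions in the turnover example can be sketched in Python. This is an illustrative sketch rather than the handbook's procedure: the field names are assumptions, and missing values are represented by None.

```python
COMPONENTS = ["sales", "services", "other"]


def deductive_impute(rec):
    """Fill in missing components that follow logically from the observed
    values, using the balance edit (components sum to the total) and the
    non-negativity of every component."""
    rec = dict(rec)
    missing = [c for c in COMPONENTS if rec[c] is None]
    if rec["total"] is None or not missing:
        return rec
    remainder = rec["total"] - sum(rec[c] for c in COMPONENTS if rec[c] is not None)
    if len(missing) == 1 and remainder >= 0:
        # One unknown in the balance edit: its value is fully determined
        rec[missing[0]] = remainder
    elif remainder == 0:
        # Non-negative components that must sum to zero are all zero
        for c in missing:
            rec[c] = 0
    return rec


unit_1001 = deductive_impute({"sales": 154, "services": 10, "other": None, "total": 166})
unit_1002 = deductive_impute({"sales": 147, "services": None, "other": None, "total": 147})
print(unit_1001, unit_1002)
```

When several components are missing and the remainder is positive, nothing can be derived with certainty, so the values are left missing for a later imputation step.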

Imputation – methods

• Model-based imputation
  – Imputations based on a predictive model
  – Model fitted on the observed data, then used to impute the missing data

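Imputation – methods

• Model-based imputation
  – Special cases:
    · Mean imputation
      Model: [formula not preserved], with [formula not preserved]
      Imputed value: [formula not preserved]
    · Ratio imputation
      Model: [formula not preserved], with [formula not preserved]
      Imputed value: [formula not preserved]
    · (Linear) regression imputation
      Model: [formula not preserved]
      Imputed value: [formula not preserved]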
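The three special cases can be sketched in Python using their usual textbook forms, which are assumed here because the formulas on the slide are not preserved in this transcript: mean imputation fills in the mean of the observed values, ratio imputation fills in an estimated ratio times an auxiliary variable x, and regression imputation fills in the prediction from a least-squares line. The data are made up, and missing values are represented by None.

```python
from statistics import mean


def mean_impute(y):
    """Replace missing values by the mean of the observed values."""
    m = mean(v for v in y if v is not None)
    return [m if v is None else v for v in y]


def ratio_impute(y, x):
    """Replace missing y by R * x, with R = sum(observed y) / sum(matching x)."""
    obs = [(yi, xi) for yi, xi in zip(y, x) if yi is not None]
    r = sum(yi for yi, _ in obs) / sum(xi for _, xi in obs)
    return [r * xi if yi is None else yi for yi, xi in zip(y, x)]


def regression_impute(y, x):
    """Replace missing y by alpha + beta * x, fitted by ordinary least squares."""
    obs = [(yi, xi) for yi, xi in zip(y, x) if yi is not None]
    xbar = mean(xi for _, xi in obs)
    ybar = mean(yi for yi, _ in obs)
    beta = (sum((xi - xbar) * (yi - ybar) for yi, xi in obs)
            / sum((xi - xbar) ** 2 for _, xi in obs))
    alpha = ybar - beta * xbar
    return [alpha + beta * xi if yi is None else yi for yi, xi in zip(y, x)]


x = [10, 20, 30, 40]      # auxiliary variable, fully observed
y = [21, 41, None, 81]    # target variable with one missing value

by_mean = mean_impute(y)
by_ratio = ratio_impute(y, x)
by_regression = regression_impute(y, x)
print(by_mean[2], by_ratio[2], by_regression[2])
```

All three functions impute the model prediction without adding a disturbance term, so they are deterministic versions of these methods.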

Imputation – methods

• Model-based imputation
  – Choice of model depends on intended use of data
    · Estimating means and totals: mean or ratio imputation may be sufficient
    · General purpose micro-data: important to model relationships
  – Multivariate model-based imputation
    · Multivariate regression imputation (joint model for all variables)
    · Sequential regression / chained equations (separate model for each variable, conditional on the other variables)

Imputation – methods

• Donor imputation
  – Missing values imputed by ‘borrowing’ observed values from other (similar) units
    · Unit with observed value: donor
    · Unit with missing value: recipient
  – Hot deck: donor and recipient in the same data file

Imputation – methods

• Donor imputation
  – Special cases:
    · Random hot deck imputation
      Donor selected at random (within classes)
      Use auxiliary variables to define imputation classes
    · Nearest-neighbour imputation
      Donor selected with minimal distance to recipient
      Use auxiliary variables to define distance
    · Predictive mean matching
      Special case of nearest-neighbour imputation
      Distance based on predicted values from a regression model
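Random hot deck and nearest-neighbour imputation can be sketched as follows. This Python sketch is illustrative: the records, the choice of size class as imputation class and the use of the absolute difference in number of employees as distance are assumptions.

```python
import random


def random_hot_deck(records, target, class_var, rng):
    """Impute each missing target value from a donor drawn at random
    within the recipient's imputation class."""
    donors_by_class = {}
    for r in records:
        if r[target] is not None:
            donors_by_class.setdefault(r[class_var], []).append(r)
    result = []
    for r in records:
        r = dict(r)
        donors = donors_by_class.get(r[class_var])
        if r[target] is None and donors:
            r[target] = rng.choice(donors)[target]
        result.append(r)
    return result


def nearest_neighbour(records, target, aux):
    """Impute each missing target value from the donor that is closest
    to the recipient on the auxiliary variable."""
    donors = [r for r in records if r[target] is not None]
    result = []
    for r in records:
        r = dict(r)
        if r[target] is None and donors:
            donor = min(donors, key=lambda d: abs(d[aux] - r[aux]))
            r[target] = donor[target]
        result.append(r)
    return result


# Hypothetical data: unit 3 is the recipient
records = [
    {"id": 1, "size_class": "small", "employees": 4, "turnover": 300},
    {"id": 2, "size_class": "small", "employees": 8, "turnover": 700},
    {"id": 3, "size_class": "small", "employees": 7, "turnover": None},
    {"id": 4, "size_class": "large", "employees": 200, "turnover": 50_000},
]

hot_deck = random_hot_deck(records, "turnover", "size_class", random.Random(0))
nearest = nearest_neighbour(records, "turnover", "employees")
print(hot_deck[2]["turnover"], nearest[2]["turnover"])
```

Predictive mean matching would follow the nearest-neighbour function, with the distance computed on regression predictions of the target variable instead of on the auxiliary variable itself.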

Imputation – special topics

• Choice of method/model/auxiliary variables
  – General problem in multivariate analysis
  – Auxiliary variables should explain
    · the target variable(s)
    · the missing data mechanism
  – Compare model fit among item respondents
    · Can be misleading (“imputation bias”)
  – Simulation experiments with historical data

Imputation – special topics

• Imputation for longitudinal data
  – Repeated cross-sectional surveys
  – Panel studies
• Special imputation methods for longitudinal data
  – Last observation carried forward
  – Interpolation
  – Extrapolation
  – Little and Su method
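The first two longitudinal methods can be sketched for a single unit observed over equally spaced periods. This Python sketch is illustrative: the series is made up, missing values are represented by None, and linear interpolation is assumed as the form of interpolation.

```python
def locf(series):
    """Last observation carried forward: repeat the latest observed value."""
    result, last = [], None
    for value in series:
        if value is not None:
            last = value
        result.append(last)
    return result


def interpolate(series):
    """Linear interpolation between the nearest observed values on each side.
    Gaps before the first or after the last observation are left missing,
    since filling them would require extrapolation."""
    result = list(series)
    known = [i for i, value in enumerate(series) if value is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = series[left] + step * (i - left)
    return result


turnover_by_period = [100, None, None, 130, None]
carried = locf(turnover_by_period)
interpolated = interpolate(turnover_by_period)
print(carried, interpolated)
```

Last observation carried forward needs only past values, whereas interpolation needs a later observation as well and therefore cannot be applied to the most recent period.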

Imputation – special topics

• Imputations are estimates
  – Imputed values should be flagged
• Variance estimation with imputed data
  – Variance likely to be underestimated when…
    · …imputations are treated as observed variables
    · …model predictions are imputed without a disturbance term
    · …single imputation is used
  – Alternative approach: Multiple imputation
    · Not often used in official statistics (yet)

Imputation – special topics

• Imputed values may be invalid/inconsistent
  – Examples:
    · Turnover = –100 (invalid)
    · Labour costs = 0, Number of employees = 15 (inconsistent)
  – Need not be a problem for estimating aggregates
  – Can be a problem if micro-data are distributed further
• Imputation under edit constraints
  – One-step method: constrained imputation model
  – Two-step method: imputation followed by data reconciliation
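The second step of the two-step method can be sketched for a single balance edit. This Python sketch is an illustration under assumptions: the reconciliation rule used here (setting negative imputations to zero and then rescaling the imputed components proportionally so that the balance edit holds) is one simple choice and is not necessarily the method described in the handbook. The field names and values are made up.

```python
def reconcile(rec, components, total, imputed):
    """Adjust only the imputed components so that the record satisfies
    non-negativity and the balance edit sum(components) = total.
    Observed values are left unchanged."""
    rec = dict(rec)
    # Remove invalid (negative) imputations first
    for c in imputed:
        rec[c] = max(rec[c], 0.0)
    observed_sum = sum(rec[c] for c in components if c not in imputed)
    imputed_sum = sum(rec[c] for c in imputed)
    target = rec[total] - observed_sum  # amount the imputed values must add up to
    if imputed_sum > 0 and target >= 0:
        factor = target / imputed_sum
        for c in imputed:
            rec[c] *= factor
    return rec


# Hypothetical record after step one: services and other were imputed
after_imputation = {"sales": 100.0, "services": 30.0, "other": 10.0, "total": 120.0}
reconciled = reconcile(after_imputation,
                       components=["sales", "services", "other"],
                       total="total",
                       imputed=["services", "other"])
print(reconciled)
```

Proportional rescaling keeps the ratio between the imputed components intact, which is why it is used here; other adjustment rules would lead to different reconciled values.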

Imputation – modules

• Modules in the handbook:
  1. Main theme module
  2. Deductive imputation
  3. Model-based imputation
  4. Donor imputation
  5. Imputation for longitudinal data
  6. Little and Su method
  7. Imputation under edit constraints


Thank you for your attention!


References

• EDIMBUS (2007), Recommended Practices for Editing and Imputation in Cross-Sectional Business Surveys.

• Fellegi, I.P. and D. Holt (1976), A Systematic Approach to Automatic Edit and Imputation. Journal of the American Statistical Association 71, pp. 17–35.

• Granquist, L. (1997), The New View on Editing. International Statistical Review 65, pp. 381–387.