
Deliverable 2.6: Selective Editing

Hannah Finselbach¹ and Orietta Luzi²
¹ONS, UK; ²ISTAT, Italy

Overview

• Introduction
• Related projects
• Combining data sources
• Selective editing – data sources and tools
• Selective editing in SDWH framework
• Proposed case studies
• Deliverable outcomes and recommendations

Introduction

• Selective editing options for a Statistical Data Warehouse – including options for weighting the importance of different outputs

• UK and Italy
• Review or quality assure – Sweden (SELEKT)

• Q1: Would you like to review and give comments? (Yes/No)

Statistical Data Warehouse (SDWH)

• Benefits:
– Decreased cost of data access and analysis
– Common data model
– Common tools
– Drive increased use of administrative data
– Faster and more automated data management and dissemination

Statistical Data Warehouse (SDWH)

• Drawbacks:
– Can have high cost – maintenance and implementing changes
– Tools may need to be developed for statistical processes
– Methodological issues of the SDWH framework – covered by WP2
• Phase 1 (SGA-1): "Work in progress" for most NSIs

Combining data sources

• Many NSIs using admin data or registers to produce statistics

• Advantages include:
– Reduction in data collection and statistical production costs; large amount of data available; re-use of data to reduce respondent burden.
• Drawbacks include:
– Different unit types (statistical and legal); timeliness; variable definition discrepancies.
• A mixed-source approach is usually required

Editing

• UNECE Glossary of terms on Statistical Data Editing:
– "an activity that involves assessing and understanding data, and the three phases of detection, resolving, and treating anomalies…"
• Large amount of literature on:
– Editing business surveys
– Editing administrative data

Aims and related projects

• This deliverable aims to add value by investigating how to edit (selective editing) when combining sources
• Mapping with other projects:
– ESSnet on Data Integration
– ESSnet on Administrative Data
– MEMOBUST
– EDIMBUS project (2007)
– EUREDIT project (2000–2003)
– BLUE-ETS

• Q2: Do you know of any other relevant projects? (Yes/No)

Editing combined data sources

• SDWH will combine survey, register and admin data sources

• Editing required for:
– maintaining the business register and its quality;
– a specific output and its integrated sources;
– improving the statistical system.
• Part of quality control in the SDWH
• Split processes for data sources? (e.g. France)

Combined sources - Questions…

• Q3: Do you currently combine data sources?
– A. Yes; B. No; C. Unsure.

• Q4: Do you have separate editing processes for each data source?
– A. Only survey data edited (do not edit admin data);
– B. Data sources edited separately;
– C. Data sources edited separately, but units/variables in both sources edited for coherence;
– D. Other.

Selective editing

• Editing – traditionally time consuming and expensive
• Selective / significance editing:
– Prioritises records based on a score function that expresses the impact of their potential error on estimates (a common form is sketched below)
– The score should consist of risk (suspicion) and influence (potential impact) components
– Divides anomalies into a critical and a non-critical stream for possible clerical or manual resolution (possibly including follow-up)
– More efficient editing process
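
The slides leave the score function generic; the following is a minimal sketch of one common form from the selective editing literature, not necessarily the formulation used by ONS or ISTAT. The symbols (risk r_i, design weight w_i, anticipated value, domain total estimate) are illustrative assumptions.

% Illustrative local score for unit i and variable y:
% a risk component times an influence component.
s_i = r_i \cdot \frac{w_i \, \lvert y_i - \tilde{y}_i \rvert}{\hat{T}}
% r_i in [0,1]: suspicion that y_i is erroneous (risk);
% w_i |y_i - \tilde{y}_i| / \hat{T}: weighted size of the potential error
% relative to the estimated domain total \hat{T} (influence), with
% \tilde{y}_i an anticipated value, e.g. historical or register data.
% Units with s_i above a threshold enter the critical stream for
% clerical follow-up; the rest form the non-critical stream.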

Selective editing – Survey and Admin data

• Use admin data as auxiliary data in the selective editing score function for survey data (e.g. UK, Italy)

• Use a score of the differences between data sources to determine which units need manual intervention (e.g. France; sketched below)
• Use scores based on historical data
• Apply selective editing to admin data with the same score function as survey data, but weights = 1 (e.g. France SBS system)
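
A minimal Python sketch of the difference-score idea in the first bullet. The field names, the relative-difference form and the threshold are assumptions for illustration, not the actual French system.

# Flag units whose survey/admin discrepancy is influential.
# All names and values below are illustrative.
units = [
    {"id": 1, "turnover_survey": 120.0, "turnover_vat": 118.0},
    {"id": 2, "turnover_survey": 500.0, "turnover_vat": 50.0},
]
domain_total = 620.0  # assumed estimate of the domain total
threshold = 0.05      # assumed score threshold

def difference_score(survey_value, admin_value, total):
    """Score the survey/admin discrepancy relative to the domain total."""
    return abs(survey_value - admin_value) / total

flagged = [u["id"] for u in units
           if difference_score(u["turnover_survey"],
                               u["turnover_vat"],
                               domain_total) > threshold]
print(flagged)  # [2] - only the influential discrepancy goes to manual review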

Selective editing – question

• Q5: Is selective editing used in the processing of admin/register data at your organisation?
– A. No;
– B. No, but admin data used as auxiliary for selective editing of survey data;
– C. No, but a score function is used to compare data sources;
– D. Yes, selective editing is applied to admin data;
– E. Not sure.

Selective editing – tools

• SELEMIX – ISTAT
• SELEKT – Statistics Sweden
• Significance Editing Engine (SEE) – ABS
• SLICE – Statistics Netherlands

• Q6: Are you aware of any other selective editing tools?
– A. Yes, I can provide documentation;
– B. Yes;
– C. No.

Selective editing in SDWH

• Methodological issues:
– Survey weight not meaningful in SDWH
• Weight = 1?
• Several sets of weights tailored for different uses?
– Selective editing data "without purpose"
• Importance weight for all potential uses?
• Alternative editing approach?
– Scores to compare data sources
• Should score functions be used, or should all discrepancies be followed up, or automatically corrected?
– Selective editing of admin data – manual intervention?
• Is selective editing appropriate if manual intervention is not possible?
• Should automatic correction be applied to admin data identified as suspicious?

Any solutions? …

• Survey weights used in the selective editing score are not meaningful
– Q7: What do you think would be the best option?
• A. Everything in the SDWH represents itself and therefore weights = 1
• B. Calculate several survey weights for all known uses of a unit data item and incorporate them into one global score
• C. Calculate separate scores for all outputs, and combine them (max, average, sum; illustrated below)
• D. Other – discuss!
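
For option C, several combination rules appear in the selective editing literature; a minimal sketch, assuming a local score s_{ik} for unit i on each output k (notation is illustrative):

% Illustrative global scores built from K local scores per unit:
S_i^{\max} = \max_k s_{ik}, \qquad
S_i^{\mathrm{sum}} = \sum_{k=1}^{K} s_{ik}, \qquad
S_i^{\mathrm{avg}} = \frac{1}{K} \sum_{k=1}^{K} s_{ik}
% The max rule guarantees follow-up of any unit influential for at
% least one output; sum and average trade that guarantee against the
% total follow-up workload.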

Any solutions? …

• Selective editing data "without purpose"
– Q8: Is selective editing appropriate if the data will be used multiple times, with unknown purpose at collection?
• A. No;
• B. No, another editing approach would be better;
• C. Yes, we would use key known/likely outputs to calculate the score;
• D. Yes, I can suggest/recommend a solution;
• E. Not sure.

Any solutions? …

• Scores to compare data sources
– Q9: Should score functions be used to compare sources, or should all discrepancies be followed up, or automatically corrected?
• A. All discrepancies need to be investigated by a data expert;
• B. All discrepancies need to be flagged, and can then be corrected automatically;
• C. Scores should be used to flag only significant/influential discrepancies, which should be investigated by a data expert;
• D. Scores should be used to flag only significant/influential discrepancies, which can then be corrected automatically;
• E. Other – discuss?
• F. Not sure.

Any solutions? …

• Selective editing of admin data
– Q10: Is selective editing appropriate if manual intervention is not possible?
• A. No, only correct for fatal errors, systematic errors (e.g. unit errors), and suspicious reporting patterns;
• B. No, identify all errors/suspicious values and automatically correct/impute;
• C. Yes, identify only influential errors to avoid over-editing/imputing the admin source;
• D. Yes, as well as fatal errors, systematic errors and suspicious reporting patterns, also identify influential errors;
• E. Other;
• F. Not sure.

Experimental studies

• ISTAT: Prototype DWH for SBS
– Use SELEMIX
– Combine statistical and admin data sources at micro level to estimate variables on economic accounts, known domains
– Evaluate the quality of model-based selective editing and automatic correction
– Re-use available data for other outputs

• ONS: Combined sources for STS
– Use SELEKT
– Monthly Business Survey and VAT turnover data
– Compare selective editing with traditional editing of admin data (followed by automatic correction), known domains
– Re-use available data for other outputs

Deliverable outcomes and recommendations

• Draft report put on CROS-portal – will include input from this workshop

• Provide recommendations for methodological issues of using selective editing in the SDWH:
– using best practice from NSIs, and
– outcomes from experimental studies.

• Metadata checklist

Metadata requirements

• Input to editing:
– Quality indicators (e.g. of data source)
– Threshold for selective editing score
– Potential publication domains
– Question number
– Predictor/expected value for score (e.g. historical data, register data)
– Domain total and/or standard error estimate for score
– Edit identification
– …

• Output from editing:
– Raw and edited value
– Selective editing score
– Error number/description/type
– Flag if suspicious
– Flag if changed
– …
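
To make the checklist concrete, here is a minimal Python sketch of a per-value metadata record; every field name is an illustrative assumption, not a prescribed SDWH schema.

from dataclasses import dataclass
from typing import Optional

@dataclass
class EditingMetadata:
    """Illustrative per-value metadata record mirroring the checklist."""
    # Input to editing
    source_quality_indicator: Optional[float] = None  # quality of the data source
    score_threshold: Optional[float] = None           # selective editing threshold
    publication_domain: Optional[str] = None          # potential publication domain
    question_number: Optional[str] = None
    predicted_value: Optional[float] = None           # historical or register value
    domain_total_estimate: Optional[float] = None     # and/or its standard error
    edit_id: Optional[str] = None
    # Output from editing
    raw_value: Optional[float] = None
    edited_value: Optional[float] = None
    selective_editing_score: Optional[float] = None
    error_description: Optional[str] = None           # error number/description/type
    flag_suspicious: bool = False
    flag_changed: bool = False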

Thank you!