New and Emerging Methods

Maria Garcia and Ton de Waal

UN/ECE Work Session on Statistical Data Editing, 16-18 May 2005, Ottawa

Introduction

New methods of data editing and imputation

Subdivided into 5 different themes:
  Automatic editing
  Imputation
  E & I for demographic variables
  Selective editing
  Software

Invited Papers

WP 30: Methods and software for editing and imputation: recent advancements at ISTAT (ISTAT, Italy)

WP 32: Using a quadratic programming approach to solve simultaneous ratio and balance edit problems (USCB, US)

WP 31: Smoothing imputations for categorical data in the linear regression paradigm (USCB, US)

Automatic editing: papers (1/2)

Six papers:

WP 30: Methods and software for editing and imputation: recent advancements at ISTAT (ISTAT, Italy)

WP 32: Using a quadratic programming approach to solve simultaneous ratio and balance edit problems (USCB, US)

WP 33: Data editing and logic (Australia)

Automatic editing: papers (2/2)

WP 43: Automatic editing system for the case of two short-term business surveys (Republic of Slovenia)

WP 44: A variable neighbourhood local search approach for the continuous data editing problem (Spain)

WP 46: Implicit linear inequality edits and error localization in the SPEER edit system (USCB, US)

Automatic Editing: main developments

Methods based on Fellegi-Holt model
  Developments at SORS
    General system combines error localization with outlier detection
    Plans for automation of implied edit generation
  Further improvements of SPEER
    Preprocessing program for generation of implied edits
    Improve error localization
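
The Fellegi-Holt model named on this slide localizes errors by looking for the smallest set of fields that can be changed so that a record satisfies all edits. The following is only a minimal brute-force sketch of that idea in Python. The field names, domains and edits are hypothetical, all fields are weighted equally, and production systems avoid this kind of enumeration, for example by generating implied edits first.

```python
from itertools import combinations, product


def localize_errors(record, domains, edits):
    """Return (fields_to_change, repaired_record) for a smallest set of
    fields whose values can be changed so that every edit passes.
    Brute force: only suitable for tiny illustrative examples."""
    fields = list(record)
    for size in range(len(fields) + 1):          # smallest sets first
        for subset in combinations(fields, size):
            # Try every combination of replacement values for the subset.
            for values in product(*(domains[f] for f in subset)):
                candidate = {**record, **dict(zip(subset, values))}
                if all(edit(candidate) for edit in edits):
                    return set(subset), candidate
    return None  # the edits cannot be satisfied within the given domains


# Hypothetical categorical fields and their admissible values.
domains = {
    "age_group": ["child", "adult"],
    "marital": ["single", "married"],
    "employed": ["no", "yes"],
}

# Each edit returns True when the record is consistent.
edits = [
    lambda r: not (r["age_group"] == "child" and r["marital"] == "married"),
    lambda r: not (r["age_group"] == "child" and r["employed"] == "yes"),
]

record = {"age_group": "child", "marital": "married", "employed": "yes"}
changed, repaired = localize_errors(record, domains, edits)
print(changed, repaired)
```

Here both edits fail, yet changing the single field age_group resolves them, whereas changing marital and employed would require two changes.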

Automatic Editing: main developments

Framework of Fellegi-Holt theory in propositional logic
  Generation of implied edits framed as logical deduction
  Automatic tools that can potentially be used for finding minimal deletion set

Automatic Editing: main developments

Methods based on some other approach
  Erroneous unit measures
    Model as cluster analysis problem
  Ratio and balance constraints
    Hybrid ratio editing and quadratic programming
    Controlled rounding
  Error localization as a combinatorial optimization problem
    Continuous data
    Successful on very large data sets

Imputation: papers (1/2)

Six papers:

WP 30: Methods and software for editing and imputation: recent advancements at ISTAT (ISTAT, Italy)

WP 31: Smoothing imputations for categorical data in the linear regression paradigm (USCB, US)

WP 36: Integrated modeling approach to imputation and discussion on imputation variance (Statistics Finland)

Imputation: papers (2/2)

WP 40: Imputation of data subject to balance and inequality restrictions using the truncated normal distribution (Statistics Netherlands)

WP 41: On the imputation of categorical data subject to edit restrictions using loglinear models (Statistics Netherlands)

WP 48: Improving imputation: the plan to examine count, status, vacancy and item imputation in the decennial census (USCB, US)

Imputation: main developments

Model based methods
  Discrete Data
    Constrained loglinear model
    Linear regression model
  Continuous Data
    Truncated normal distribution followed by MCEM
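
To illustrate how a truncated normal distribution can keep imputed values inside inequality restrictions while a balance restriction is met, here is a small Python sketch. It is not the procedure of WP 40 and it does not include the MCEM step. The variable names, the balance edit a + b + c = total and the chosen mean and standard deviation are all assumptions made for illustration, and the draw uses simple rejection sampling.

```python
import random


def draw_truncated_normal(mean, sd, lower, upper, rng, max_tries=10_000):
    """Draw from N(mean, sd^2) restricted to [lower, upper] by rejection."""
    for _ in range(max_tries):
        x = rng.gauss(mean, sd)
        if lower <= x <= upper:
            return x
    raise RuntimeError("no draw fell inside the truncation interval")


rng = random.Random(2005)  # fixed seed so the example is reproducible

# Hypothetical record: total and a are observed, b and c are missing.
total, a = 100.0, 40.0
remainder = total - a

# Inequality restrictions b >= 0 and c >= 0 imply 0 <= b <= remainder.
b = draw_truncated_normal(mean=35.0, sd=15.0, lower=0.0, upper=remainder, rng=rng)
c = remainder - b  # the balance edit fixes c once b is drawn

print(round(b, 2), round(c, 2))
```

Drawing b inside [0, remainder] and deriving c from the balance edit guarantees that both restrictions hold for the imputed record.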

Imputation: main developments

Implementation of imputation methods
  Use Bayesian networks for imputation of discrete data
  Development of QUIS for imputation of continuous data
    written in SAS
    uses EM algorithm, nearest neighbor, and MI

Imputation: main developments

Implementation of imputation methods
  Integrated Modeling Approach (IMAI)
    Summary and analysis of principles of IMAI
    Estimation of imputation variance
  U.S. Decennial Census
    Research on alternative imputation options
      Administrative records, model based imputation, CANCEIS, hot deck
    Development of a truth deck for evaluation

E & I for demographic variables: papers

Three papers:

WP 30: Methods and software for editing and imputation: recent advancements at ISTAT (ISTAT, Italy)

WP 35: Edit and imputation for the 2006 Canadian Census (Statistics Canada)

WP 38: New procedures for editing and imputation of demographic variables (ISTAT, Italy)

E & I for demographic variables: main developments

Further improvement of CANCEIS
  capability of processing all census variables
  improved editing and imputation of alphanumeric, discrete, continuous and coded variables
  improved user interface
Development of DIESIS
  combined use of “data driven” approach (NIM) and “minimum change” approach (Fellegi-Holt)

E & I for demographic variables: main developments

Development of DIESIS
  Use of graph theory to improve quality of sequential imputation
  Optimization procedure to locate the household reference person
  New approach for selection of donors
    based on partitioning passed records into smaller subsets of similar characteristics
    search for donor records within the smaller clusters
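
The donor selection idea above (partition the passed records, then search for a donor only inside the matching cluster) can be sketched as follows. This is an illustrative Python sketch, not the DIESIS implementation: the field names, the clustering key and the absolute-difference distance are assumptions.

```python
from collections import defaultdict


def build_clusters(passed_records, cluster_key):
    """Partition records that passed all edits into clusters."""
    clusters = defaultdict(list)
    for rec in passed_records:
        clusters[cluster_key(rec)].append(rec)
    return clusters


def find_donor(failed, clusters, cluster_key, match_fields):
    """Return the nearest passed record within the failed record's cluster,
    or None when that cluster contains no passed records."""
    candidates = clusters.get(cluster_key(failed), [])
    if not candidates:
        return None

    def distance(donor):
        return sum(abs(donor[f] - failed[f]) for f in match_fields)

    return min(candidates, key=distance)


# Hypothetical passed records and one failed record with a missing age.
passed = [
    {"region": "N", "hh_size": 2, "age": 34},
    {"region": "N", "hh_size": 4, "age": 51},
    {"region": "S", "hh_size": 2, "age": 35},
]
failed = {"region": "N", "hh_size": 2, "age": None}


def by_region(rec):
    return rec["region"]


clusters = build_clusters(passed, by_region)
donor = find_donor(failed, clusters, by_region, match_fields=["hh_size"])
imputed = {**failed, "age": donor["age"]}
print(imputed)
```

Because the search is limited to the cluster for region N, the record from region S is never compared, which is what keeps the donor search cheap on large files.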

Selective editing: papers

Two papers:

WP 42: Evaluation of score functions for selective editing of annual structural business statistics (Statistics Netherlands)

WP 45: An editing procedure for low pay data in the annual survey of hours and earnings (Office for National Statistics, UK)

Selective editing: main developments

Continued use and development of selective editing
Evaluation of selective editing approaches
  experiments with different sets of score functions
Development of “hybrid editing”
  validate a sample of failed records
  use associated data to impute remaining records
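
As a rough illustration of what a score function does in selective editing, the Python sketch below ranks records by a commonly used kind of local score: the weighted absolute difference between the reported value and an anticipated value, scaled by an estimated total. It is not one of the score functions evaluated in WP 42, and the records, the anticipated values and the threshold are hypothetical.

```python
def local_score(weight, reported, anticipated, total_estimate):
    """Approximate influence of a possible error on the estimated total."""
    return weight * abs(reported - anticipated) / total_estimate


# Hypothetical records; "anticipated" could come from last year's data.
records = [
    {"id": "A", "weight": 10.0, "turnover": 500.0, "anticipated": 480.0},
    {"id": "B", "weight": 2.0, "turnover": 90000.0, "anticipated": 9000.0},
    {"id": "C", "weight": 25.0, "turnover": 120.0, "anticipated": 118.0},
]

total_estimate = sum(r["weight"] * r["anticipated"] for r in records)
threshold = 0.05  # assumed cut-off; in practice tuned on evaluation data

manual, automatic = [], []
for r in records:
    score = local_score(r["weight"], r["turnover"], r["anticipated"], total_estimate)
    (manual if score > threshold else automatic).append(r["id"])

print("manual review:", manual)
print("automatic treatment:", automatic)
```

Only record B, whose reported turnover is ten times the anticipated value, exceeds the threshold and is routed to manual review; the other records are left to automatic treatment.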

Software: papers

Four papers:

WP 34: The transition from GEIS to BANFF (Statistics Canada)

WP 37: Concepts, materials and IT modules for data editing of German statistics (Destatis, Germany)

WP 39: SLICE 1.5: a software framework for automatic edit and imputation (Statistics Netherlands)

WP 47: Improving an edit and imputation system for the US Census of agriculture (NASS, US)

Software: main developments

Flexibility
  modules rather than large systems are developed
  standard statistical packages are used (SAS in BANFF and US Census of Agriculture)
Testing and implementation of the software
Quality control measures
  e.g. for (donor) imputation
Integration of the edit and imputation software in entire production process
  process chain: planning, data collection, edit and imputation

General points for discussion

Are there any really new approaches?
  are new approaches extensions of existing ideas?
  are new approaches combinations of old ones?
Develop new approaches or consolidate old approaches?
  development versus evaluation studies and testing
  prototype software versus implementation of production software
Is our focus shifting?
  from editing towards imputation?
  from development towards implementation?
  from computational aspects towards quality issues?

Automatic editing: points for discussion

Can operations research techniques be combined with techniques from mathematical logic?

What are the (dis)advantages of using SAT solvers when compared to direct integer programming methods?

What is the quality of the imputations when editing data using the quadratic programming approach?

Automatic editing: points for discussion

What is the quality of the solutions found by using the combinatorial optimization approach on real survey data?

How fast is this approach on realistic data?

Can finite mixture models be used for detection of other types of systematic errors?

Should we invest in developing generic tools or software tools tailored to a particular application?

Automatic editing: points for discussion

Are there any other types of surveys that are worth the effort of generating implied edits prior to error localization?

What are the most cost-effective methods for edit/imputation in terms of resources, time, clerical intervention, quality of results?

Imputation: points for discussion

What are the (dis)advantages of using complex mathematical models for missing data imputation? Are these models too complex for survey practitioners?

What are the expected computational difficulties of applying complex models to real survey data?

What are the largest (most complex) surveys that can be imputed using these models?

Imputation: points for discussion

What is the quality of the imputations carried out using model based methods for filling in missing data?

Can we compare the different imputation models?

Imputation: points for discussion

Can more guidelines for the IMAI process be developed?

To what extent can we develop a systematic way of applying IMAI?

Is imputation variance an important issue at the moment, or should we (still) focus on imputation bias?

E & I for demographic variables: points for discussion

Can CANCEIS/DIESIS be used for other data besides demographic census data?

Can CANCEIS/DIESIS be further developed?

Should we use a combination of edit and imputation methods or a single method for demographic variables?

Selective editing: points for discussion

Can selective editing be successfully applied to large/complex surveys?

Can current methods for selective editing be further developed?

Can a general theory for selective editing be developed?

How promising is hybrid editing?

Software: points for discussion

Should we develop generic software or software tools for particular applications?

How can we ensure the flexibility of software?

Are the software tools fast enough for large/complex data sets?

To what extent should we aim to automate the editing process?