Edit and Imputation of the 2011 Abu Dhabi Census

Glenn Hui and Hanan AlDarmaki

Statistics Centre - Abu Dhabi

UNECE CES Work Session on Statistical Data Editing

(Oslo, Norway, 26 September 2012)

Outline

• Census Overview
• Edit and Imputation Methodologies
• Societal Differences and Challenges
• Performance Analysis
• Data Editing in the 2005 Census
• Conclusions

2011 Abu Dhabi Census Overview

• First census conducted by SCAD
• Main collection via CAPI, October 2011
• 20 questions
• Three methodologies used for edit and imputation, each with its own purpose:
  • Donor
  • Deterministic
  • Manual

Edit and Imputation Methodologies

Donor Imputation

• Canadian Census Edit and Imputation System v4.5 (CANCEIS) hot deck module

• Substitutes invalid value with value from “donor” record
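As a rough illustration of the donor idea, the sketch below fills an invalid value from the most similar valid record. It is a simplified toy, not the CANCEIS algorithm: the function name `hot_deck_impute`, the field names, and the distance (a count of mismatching fields) are all assumptions made for this example.

```python
def hot_deck_impute(records, field, match_fields, is_valid):
    """Fill invalid `field` values using the closest valid donor record.

    Hypothetical helper for illustration; CANCEIS itself is more elaborate.
    """
    donors = [r for r in records if is_valid(r[field])]
    if not donors:
        raise ValueError("no valid donor records available")

    imputed = []
    for rec in records:
        fixed = dict(rec)  # copy, so the input records are left untouched
        if not is_valid(rec[field]):
            # Distance = number of matching fields on which donor and
            # recipient differ; ties go to the first donor encountered.
            donor = min(
                donors,
                key=lambda d: sum(d[m] != rec[m] for m in match_fields),
            )
            fixed[field] = donor[field]
        imputed.append(fixed)
    return imputed


# Made-up example data: the third person has a missing age.
people = [
    {"sex": "F", "marital": "married", "age": 34},
    {"sex": "M", "marital": "single", "age": 21},
    {"sex": "F", "marital": "married", "age": None},
]
result = hot_deck_impute(
    people, "age", ["sex", "marital"], lambda v: v is not None
)
```

Here the third record matches the first on both `sex` and `marital`, so it receives that donor's age.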

Deterministic Imputation

• Correct data via hard-coded rules (SAS)
• Applied mostly for out-of-scope responses
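A deterministic edit of this kind can be pictured as a fixed if-then rule. The sketch below is written in Python rather than SAS, and the rule itself (an employment answer recorded for a child is out of scope), the age threshold, and the field names are hypothetical, chosen only to show the pattern.

```python
MIN_WORKING_AGE = 15  # assumed threshold, for illustration only


def apply_deterministic_edits(record):
    """Return a corrected copy of `record` using one hard-coded rule.

    Hypothetical rule: an employment status reported for a person below
    MIN_WORKING_AGE is out of scope and is reset to "not applicable".
    """
    fixed = dict(record)
    age = fixed["age"]
    if age is not None and age < MIN_WORKING_AGE:
        fixed["employment"] = "not applicable"
    return fixed


# Made-up records: only the child's answer is out of scope.
child = {"age": 9, "employment": "employed"}
adult = {"age": 40, "employment": "employed"}
fixed_child = apply_deterministic_edits(child)
fixed_adult = apply_deterministic_edits(adult)
```

Because the outcome depends only on the record itself, the same input always yields the same correction, which is what distinguishes this approach from donor imputation.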

Manual Imputation

• Manually check and modify data
• Applied to difficult cases, such as very large households

Societal and Cultural Differences

• Very large household sizes: ~5 persons average
• Contrast to typical ~2.5 averages in Western countries
• Error rates increase with family size; used less exacting DLTs (decision logic tables) to account for this
• Households of 17+ treated as individual records, with some manual imputation as well

Societal Differences continued

• Complex relationships in large households
  • Extended families
  • High proportion of household servants
  • Multiple wives – special consistency rules required
• Large Expatriate Population
  • Many live in shared living arrangements
  • Significant portion live in employer-provided camps
  • Shares and collectives treated as 1-person households
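The two unit-of-processing rules mentioned on these slides (households of 17+ handled as individual records; shares and collectives handled as 1-person households) could be expressed as a small routing step. This is only a sketch: the `type` values, the `members` field, and the function name are hypothetical, and the manual imputation applied to very large households is not modelled.

```python
LARGE_HOUSEHOLD_SIZE = 17  # threshold stated on the slides


def imputation_units(household):
    """Split a household into the units that would be edited together.

    Hypothetical structure: `household` has a "type" string and a
    "members" list. Shares, collectives, and very large households are
    broken into one-person units; all others stay whole.
    """
    members = household["members"]
    split = (
        household["type"] in {"shared", "collective"}
        or len(members) >= LARGE_HOUSEHOLD_SIZE
    )
    if split:
        return [[m] for m in members]
    return [members]


# Made-up examples.
family = {"type": "family", "members": ["p1", "p2", "p3", "p4", "p5"]}
camp = {"type": "collective", "members": ["w1", "w2", "w3"]}
big_family = {"type": "family", "members": [f"p{i}" for i in range(17)]}
```

Splitting trades away the within-household consistency checks in exchange for units that the edit rules can handle.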

Imputation Performance

Example Statistics

• Predictive Accuracy: R² generated by regressing true on imputed values, used to assess predictive ability.

• Estimation Accuracy: Difference in means of true and imputed values, m1, used to assess aggregate imputation accuracy.
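The two statistics above can be computed in a few lines. The sketch below assumes unweighted data, a simple linear regression of true on imputed values (so R² equals the squared Pearson correlation), and takes m1 as the absolute difference in means; survey weights, which a production evaluation may use, are left out. Function names and the sample ages are invented for illustration.

```python
from statistics import mean


def r_squared(true_vals, imputed_vals):
    """R² of a simple linear regression of true on imputed values.

    Assumes both inputs vary; a constant input would divide by zero.
    """
    mt, mi = mean(true_vals), mean(imputed_vals)
    sxy = sum((x - mi) * (y - mt) for x, y in zip(imputed_vals, true_vals))
    sxx = sum((x - mi) ** 2 for x in imputed_vals)
    syy = sum((y - mt) ** 2 for y in true_vals)
    return sxy ** 2 / (sxx * syy)


def m1(true_vals, imputed_vals):
    """Absolute difference between the means of imputed and true values."""
    return abs(mean(imputed_vals) - mean(true_vals))


# Made-up ages for four persons whose values were imputed.
true_age = [30, 42, 25, 61]
imputed_age = [32, 40, 25, 58]

r2 = r_squared(true_age, imputed_age)
bias = m1(true_age, imputed_age)       # 0.75
relative_bias = bias / mean(true_age)  # analogous to m1 / μ(Age)
```

R² reflects how well individual imputed values track the truth, while m1 only captures the aggregate shift in the mean, so the two can diverge.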

Imputation performance for Age

Test Dataset               R²      m1      m1 / μ(Age)
Missing only               0.850   0.315   1.26%
Missing and Interchange    0.813   0.531   2.12%

• Test Data: Starting with clean data, introduced two types of errors: missing data and “interchange” errors.

• Most performance measures from Euredit project (Charlton, 2003)

Charlton, J. C. (2003). “Evaluating New Methods for Data Editing and Imputation - Results from the Euredit Project”, UNECE Statistical Data Editing Work Session, Madrid, Spain.
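The test-data construction described on this slide (clean data with errors deliberately introduced) might be sketched as follows. The slides do not define "interchange" errors; this example assumes it means swapping a field's values between two records, which may not match the authors' definition. The function name, parameters, and seed are likewise hypothetical.

```python
import random


def inject_errors(records, field, n_missing, n_swaps, seed=0):
    """Return a damaged copy of `records` with known, controlled errors.

    n_missing records get `field` blanked; n_swaps pairs of records have
    their `field` values exchanged (the assumed meaning of "interchange").
    """
    rng = random.Random(seed)  # seeded so the test data is reproducible
    damaged = [dict(r) for r in records]

    # Pick distinct records so no record receives two kinds of error.
    touched = rng.sample(range(len(damaged)), n_missing + 2 * n_swaps)

    for i in touched[:n_missing]:
        damaged[i][field] = None

    swap_idx = touched[n_missing:]
    for a, b in zip(swap_idx[0::2], swap_idx[1::2]):
        damaged[a][field], damaged[b][field] = (
            damaged[b][field],
            damaged[a][field],
        )
    return damaged


# Made-up clean data: ten persons with distinct ages.
clean = [{"id": i, "age": 20 + i} for i in range(10)]
damaged = inject_errors(clean, "age", n_missing=2, n_swaps=1)
```

Because the clean values are retained, the imputed results can later be compared against the truth record by record.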

Data Editing in the 2005 Census

2005: Manual and Deterministic imputation

• Phase 1: validation edits, outlier detection via SQL
• Small subset imputed via deterministic imputation

• Phase 2: Most failed records corrected manually

Comparison to 2011

• 2005 performance unknown
• 2005: Three methodologists, several months’ preparation; 15 data clerks, 4+ months
• 2011: Two methodologists, 5 months total

Conclusions

• Modern edit and imputation methodology successfully applied in distinct cultural context

• Reliable results
• Measurable changes
• More efficient approach

• Special thanks to CANCEIS E&I unit, Statistics Canada
