Paolo Valente - UNECE Statistical Division Slide 1

Technology for census data coding, editing and imputation

Paolo Valente (UNECE)

UNECE Workshop on Census Technology for SPECA and CIS member countries

(Astana, 7-8 June 2007)

Paolo Valente - UNECE Statistical Division Slide 2

Content:

1. Coding
2. Editing and imputation

Reference material:

Handbook on Census Management for Population and Housing Censuses (Chapter IV, sections D-F)

Handbook on Population and Housing Census Editing

Paolo Valente - UNECE Statistical Division Slide 3

1. Census data coding

Questions:

1. How did you code the data in the last census?

2. Were you satisfied or not with coding?

3. What problems did you find in coding?

4. Any problems with specific variables?

Paolo Valente - UNECE Statistical Division Slide 4

Census data coding

Data coding = assigning classification codes to the responses written on the census form

Coding systems:

a) Manual
b) Computer-assisted
c) Automatic
d) Mix of a), b) or c)

Coding methodologies:

a) Simple (1 or 2 words): e.g. birth place
b) Structured (> 1 question): e.g. occupation
c) Hierarchical: e.g. address

Paolo Valente - UNECE Statistical Division Slide 5

Manual data coding

Clerks identify the code using “code books” and write it on the census form for later processing

Pros:

- Easy to implement
- No technology needed

Cons:

- Time consuming
- Labor intensive
- Risk of inconsistency

Paolo Valente - UNECE Statistical Division Slide 6

Computer-assisted coding

- Assisted by a computerized system
- Computer-based code books

How it works:

1) Coder types only a few characters
2) System selects a list of matching entries
3) Coder chooses the right code
4) Code is automatically recorded by the system
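The four steps of computer-assisted coding can be sketched as a simple prefix lookup. This is only an illustration, assuming a tiny in-memory code book; the codes, the function names `matching_entries` and `record_choice`, and the prefix-matching rule are all hypothetical and not taken from any actual census system.

```python
# Hypothetical code book: the codes below are invented for illustration only.
CODE_BOOK = [
    ("001", "Almaty"),
    ("002", "Astana"),
    ("003", "Aktobe"),
    ("004", "Bishkek"),
]


def matching_entries(typed, code_book=CODE_BOOK):
    """Steps 1-2: return the entries whose label starts with the typed characters."""
    prefix = typed.strip().lower()
    return [(code, label) for code, label in code_book
            if label.lower().startswith(prefix)]


def record_choice(candidates, index):
    """Steps 3-4: return the code of the entry the coder picked from the list."""
    code, _label = candidates[index]
    return code


# The coder types "a", sees three candidates, and picks the second one.
candidates = matching_entries("a")
chosen_code = record_choice(candidates, 1)
```

A real system would probably also apply coding rules (for structured variables such as occupation) before showing the list; that part is left out here.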

Paolo Valente - UNECE Statistical Division Slide 7

Computer-assisted coding

Pros:

- Efficiency
- Good quality
- Particularly suitable for structured coding (possibility to include coding rules)

Cons:

- Relatively complex system
- Long time needed for development
- Cost relatively high

Paolo Valente - UNECE Statistical Division Slide 8

Automatic coding

- Based on computerized algorithms
- No human intervention
- Text captured by ICR (intelligent character recognition) and matched against indexes
- A score is assigned by the system to the matched response:
  - If the score is above a certain level, the response is accepted
  - If the score is below that level, human intervention is needed (computer-assisted coding)
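The score-and-threshold logic can be sketched as follows. The slides do not say how the score is computed, so this example swaps in the similarity ratio from Python's standard `difflib` module purely for illustration; the index entries, the codes, and the threshold of 0.85 are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical index of known responses and their codes (invented values).
INDEX = {"teacher": "2330", "farmer": "6111", "nurse": "2221"}
THRESHOLD = 0.85  # hypothetical acceptance level


def auto_code(response, index=INDEX, threshold=THRESHOLD):
    """Return (code, score); code is None when the response needs a human coder."""
    text = response.strip().lower()
    best_entry, best_score = None, 0.0
    for entry in index:
        # Similarity between the captured text and the index entry, from 0 to 1.
        score = SequenceMatcher(None, text, entry).ratio()
        if score > best_score:
            best_entry, best_score = entry, score
    if best_entry is not None and best_score >= threshold:
        return index[best_entry], best_score   # accepted automatically
    return None, best_score                    # route to computer-assisted coding


code, score = auto_code("teachr")    # a misspelled response, still accepted
unresolved, _ = auto_code("xyz")     # no usable match, needs a human coder
```

Setting the threshold is a trade-off: a lower level raises the matching rate but also the risk of systematic errors.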

Paolo Valente - UNECE Statistical Division Slide 9

Automatic coding

Matching rates depend on algorithms used and type of variable

Maximum matching rates in ideal circumstances:

- For simple variables (birth place), approx. 80%
- For complex variables (occupation, industry), approx. 50%

All unmatched responses have to be processed with computer-assisted coding
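As a rough worked example, the matching rates translate directly into the workload left for computer-assisted coding. The rates are the approximate maxima quoted above; the volume of 1,000,000 responses per variable is a hypothetical figure chosen only to make the arithmetic concrete.

```python
def residual_workload(n_responses, match_rate):
    """Responses left for computer-assisted coding after automatic coding."""
    matched = round(n_responses * match_rate)
    return n_responses - matched


# Hypothetical volume; rates are the approximate maxima for each variable type.
simple_left = residual_workload(1_000_000, 0.80)    # e.g. birth place
complex_left = residual_workload(1_000_000, 0.50)   # e.g. occupation, industry
```

Because the quoted rates are maxima under ideal circumstances, the residual workload computed this way is a lower bound.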

Paolo Valente - UNECE Statistical Division Slide 10

Automatic coding

Pros:

- High efficiency
- Good quality (if system developed accurately)
- Consistency
- Particularly suitable for structured coding (possibility to include coding rules)

Cons:

- Very complex system
- Long time needed for development
- High cost
- Risk of systematic errors in case of faults in matching algorithms or indexes

Paolo Valente - UNECE Statistical Division Slide 11

Coding – Practices in 2000 round

- In general, CIS countries used manual coding
- About half of UNECE countries used automatic coding, in combination with computer-assisted or manual coding
- In most cases software developed in-house
- Software for automatic coding: ACTR (Automated Coding by Text Recognition) developed by Statistics Canada, also used by Italy and the UK
- Integrated software system, including computer-assisted coding: CSPro (US Census Bureau)

See “Measuring Population and Housing”, Chapter III

Paolo Valente - UNECE Statistical Division Slide 12

Coding in the 2010 census round

Questions:

1. What are your plans for coding the data of the next census?

2. Are you considering computer-assisted coding?

3. Why? …or why NOT?

Paolo Valente - UNECE Statistical Division Slide 13

2. Editing and imputation

Questions on editing:

1. Which data did you edit in the last census?

2. How did you edit the data?

3. Did you have any problems?

Paolo Valente - UNECE Statistical Division Slide 14

2. Editing and imputation

Questions on imputation:

1. Did you impute any missing data? If yes:

2. For which variables?

3. What method and software did you use?

4. Did you produce statistics on imputation rates?

Paolo Valente - UNECE Statistical Division Slide 15

Editing and imputation

Editing = Detecting and correcting errors in census data

Imputation = assigning values to missing data

The two concepts are related and the two terms are sometimes used in different ways

Paolo Valente - UNECE Statistical Division Slide 16

Editing and imputation

Different types of errors:

- Coverage errors (e.g. omissions, duplicates)
- Enumerator errors
- Respondent errors
- Coding errors
- Data entry errors

but also… Editing errors!

Paolo Valente - UNECE Statistical Division Slide 17

Editing and imputation

Important not only to detect errors, but also to identify causes, in order to take appropriate measures and improve overall quality

Objectives of editing and imputation:

Improve quality of census data

Facilitate analysis of census data

Identify types and sources of errors

Paolo Valente - UNECE Statistical Division Slide 18

Editing and imputation

Dilemma: what should be edited and what should NOT be edited?

Complex editing systems can be difficult and expensive to implement, and in some cases may introduce distortions

Go for a relatively simple editing system!

Paolo Valente - UNECE Statistical Division Slide 19

Editing and imputation

In general, the editing system should be:

- Minimalist (only obvious errors)
- Automated (as much as possible)
- Systematic
- Compliant with other NSI procedures
- Compliant with international standards

Paolo Valente - UNECE Statistical Division Slide 20

Editing and imputation

General guidelines for editing:

- Make the fewest required changes possible
- Eliminate obvious inconsistencies
- Supply entries for erroneous or missing items by using other entries for the housing unit, person, or other persons in the household or comparable group as a guide

Paolo Valente - UNECE Statistical Division Slide 21

Editing and imputation

Example of inconsistent information 1:

Reference person and spouse have same sex

Paolo Valente - UNECE Statistical Division Slide 22

Editing and imputation

Example of inconsistent information 2:

Excessive age difference between mother and children
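The two examples of inconsistent information can be expressed as consistency checks run on each household. This is a minimal sketch: the `Person` record layout, the relationship categories, and the 50-year threshold for an "excessive" age difference are assumptions made for illustration, not rules taken from the reference handbooks.

```python
from dataclasses import dataclass
from typing import Optional

MAX_MOTHER_CHILD_GAP = 50  # hypothetical threshold, in years


@dataclass
class Person:
    person_id: int
    relationship: str            # hypothetical categories: "reference", "spouse", "child"
    sex: str                     # "M" or "F"
    age: int
    mother_id: Optional[int] = None


def check_household(persons):
    """Return a list of (rule, person_ids) for every failed check."""
    failures = []
    by_id = {p.person_id: p for p in persons}
    reference = next((p for p in persons if p.relationship == "reference"), None)
    for p in persons:
        # Example 1: reference person and spouse report the same sex.
        if (p.relationship == "spouse" and reference is not None
                and p.sex == reference.sex):
            failures.append(("same_sex_spouse", (reference.person_id, p.person_id)))
        # Example 2: age difference between mother and child above the threshold.
        mother = by_id.get(p.mother_id)
        if mother is not None and mother.age - p.age > MAX_MOTHER_CHILD_GAP:
            failures.append(("mother_child_age_gap", (mother.person_id, p.person_id)))
    return failures


household = [
    Person(1, "reference", "F", 70),
    Person(2, "spouse", "F", 68),
    Person(3, "child", "M", 15, mother_id=1),
]
failures = check_household(household)
```

The checks only flag records; deciding which value to change is a separate step.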

Paolo Valente - UNECE Statistical Division Slide 23

Editing and imputation

Editing approaches:

Top-down:

- Items are edited in sequence, from first to last

Multiple variable (Fellegi-Holt):

1. A set of statements and relationships among variables is checked within the household
2. The edit keeps track of all false statements
3. The system assesses how best to change the data
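The multiple-variable approach can be illustrated with a heavily simplified sketch: check every statement, keep the false ones, then look for the smallest set of variables that is involved in all of them. The rules, the age limit of 15, and the record layout are hypothetical. A full Fellegi-Holt system also derives implied edits, so that changing the selected variables is guaranteed to produce a consistent record; that step is omitted here, and the search is brute force.

```python
from itertools import combinations

# Each rule: (name, variables involved, statement that should be true).
# The rules and the age limit of 15 are hypothetical.
RULES = [
    ("married_implies_15_plus", ("age", "marital_status"),
     lambda r: r["marital_status"] != "married" or r["age"] >= 15),
    ("employed_implies_15_plus", ("age", "activity_status"),
     lambda r: r["activity_status"] != "employed" or r["age"] >= 15),
]


def false_statements(record, rules=RULES):
    """Steps 1-2: check every statement and keep track of the false ones."""
    return [(name, fields) for name, fields, holds in rules if not holds(record)]


def fewest_fields_to_change(failed):
    """Step 3: smallest sets of variables that touch every false statement."""
    variables = sorted({f for _name, fields in failed for f in fields})
    for size in range(1, len(variables) + 1):
        covers = [set(combo) for combo in combinations(variables, size)
                  if all(set(fields) & set(combo) for _name, fields in failed)]
        if covers:
            return covers
    return []


record = {"age": 7, "marital_status": "married", "activity_status": "employed"}
failed = false_statements(record)
to_change = fewest_fields_to_change(failed)
```

In this record both statements are false, and `age` is the only single variable involved in both, so changing it alone is the smallest possible change. A top-down edit would instead process the items one by one in questionnaire order.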

Paolo Valente - UNECE Statistical Division Slide 24

Editing and imputation

Imputation methods:

Static imputation (or “cold deck”)

- Used mainly for missing values
- Value assigned from a predetermined set, or from a distribution of valid responses
- The set of values does not change over time

Dynamic imputation (or “hot deck”)

- Used for missing or inconsistent values
- Value assigned from a “donor” with similar characteristics; the donor changes constantly
- Response imputations change over time

See “Handbook on Census Editing”, Ch. II.E and Annex V
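The difference between the two methods can be sketched on a single variable. In this illustration the missing item is age and the only matching characteristic is sex; the deck values and the record layout are hypothetical, and the hot deck shown is a simple sequential one where the most recent valid record of the same sex acts as donor.

```python
# Hypothetical predetermined values by sex; the cold deck never changes.
COLD_DECK = {"M": 30, "F": 32}


def cold_deck_impute(records, deck=COLD_DECK):
    """Static imputation: missing ages always get the same predetermined value."""
    out = []
    for r in records:
        r = dict(r)
        if r["age"] is None:
            r["age"] = deck[r["sex"]]
        out.append(r)
    return out


def hot_deck_impute(records, initial_deck=COLD_DECK):
    """Dynamic imputation: the deck is refreshed by every valid record processed."""
    deck = dict(initial_deck)              # starting values until a donor is seen
    out = []
    for r in records:
        r = dict(r)
        if r["age"] is None:
            r["age"] = deck[r["sex"]]      # value from the latest donor of same sex
        else:
            deck[r["sex"]] = r["age"]      # this record becomes the new donor
        out.append(r)
    return out


records = [
    {"sex": "F", "age": 41},
    {"sex": "F", "age": None},
    {"sex": "M", "age": None},
    {"sex": "F", "age": 19},
    {"sex": "F", "age": None},
]
cold_ages = [r["age"] for r in cold_deck_impute(records)]
hot_ages = [r["age"] for r in hot_deck_impute(records)]
```

With the cold deck, both missing female ages receive the same value; with the hot deck they differ because the donor changed between them.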

Paolo Valente - UNECE Statistical Division Slide 25

Editing and imputation

Types of edits:

- Fatal edits identify errors with certainty
- Query edits identify suspected errors
- Structure edits check coverage and relations between different units: persons, households, housing units, enumeration areas etc.
- Edits for population and housing items

See “Handbook on Census Editing”, Chapters III, IV and V
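The distinction between edit types can be sketched with a few rules. All of them are hypothetical examples chosen for illustration (the sex code list, the review threshold of 100 years, and the requirement of exactly one reference person per household); the actual rules are defined in the handbook chapters cited above and by each statistical office.

```python
VALID_SEX_CODES = {"1", "2"}   # hypothetical code list
QUERY_AGE = 100                # hypothetical review threshold, in years


def item_findings(person):
    """Fatal and query edits on single items; returns (kind, message) pairs."""
    findings = []
    if person["sex"] not in VALID_SEX_CODES:
        findings.append(("fatal", "sex code not in code list"))
    if person["age"] < 0:
        findings.append(("fatal", "negative age"))
    elif person["age"] > QUERY_AGE:
        findings.append(("query", "age above review threshold"))
    return findings


def structure_findings(household):
    """Structure edit: assumes exactly one reference person per household."""
    count = sum(1 for p in household if p["relationship"] == "reference")
    if count != 1:
        return [("fatal", f"{count} reference persons in household")]
    return []


household = [
    {"relationship": "reference", "sex": "1", "age": 104},
    {"relationship": "spouse", "sex": "3", "age": 60},
]
per_person = [item_findings(p) for p in household]
per_household = structure_findings(household)
```

A fatal finding marks a value that cannot be correct, while a query finding only marks a value for review and may turn out to be valid.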

Paolo Valente - UNECE Statistical Division Slide 26

Editing and imputation – Practices in 2000 round

Most ECE countries (33 out of 40) performed computer-supported editing, including several CIS countries

22 countries performed automatic imputations

Most countries developed specific software

Some countries used SAS, Oracle, SQL, CSPro

See “Measuring Population and Housing”, Chapter III

Paolo Valente - UNECE Statistical Division Slide 27

Editing and imputation – Plans for 2010 round

Questions:

- What are your plans for editing and imputation?
- What editing approaches/methods are you considering?

Paolo Valente - UNECE Statistical Division Slide 28

Editing and imputation – Plans for 2010 round

Questions:

- For which variables would you consider imputation of missing values?