Paolo Valente - UNECE Statistical Division Slide 1
Technology for census data coding, editing and imputation
Paolo Valente (UNECE)
UNECE Workshop on Census Technology for SPECA and CIS member countries
(Astana, 7-8 June 2007)
Paolo Valente - UNECE Statistical Division Slide 2
Content:
1. Coding
2. Editing and imputation
Reference material:
Handbook on Census Management for Population and Housing Censuses (Chapter IV, sections D-F)
Handbook on Population and Housing Census Editing
Paolo Valente - UNECE Statistical Division Slide 3
1. Census data coding
Questions:
1. How did you code the data in the last census?
2. Were you satisfied or not with coding?
3. What problems did you find in coding?
4. Any problems with specific variables?
Paolo Valente - UNECE Statistical Division Slide 4
Census data coding
Data coding = Assigning classification codes to the responses written on the census form
Coding systems:
a) Manual
b) Computer assisted
c) Automatic
d) Mix of a), b) or c)
Coding methodologies:
a) Simple (1 or 2 words): ex. Birth place
b) Structured (> 1 question): ex. Occupation
c) Hierarchical: ex. Address
Paolo Valente - UNECE Statistical Division Slide 5
Manual data coding
Clerks identify the code using “code books”, and write it on the census form for later processing
Pros:
  Easy to implement
  No technology needed
Cons:
  Time consuming
  Labor intensive
  Risk of inconsistency
Paolo Valente - UNECE Statistical Division Slide 6
Computer-assisted coding
Assisted by a computerized system
Computer-based code books
How it works:
1) Coder types only a few characters
2) System selects the matching list
3) Coder chooses the right code
4) Code is automatically recorded by the system
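The four steps above can be sketched as follows. This is a minimal illustration, not a description of any actual census system: the code book entries, the codes and the function names `suggest` and `record` are all made up for the example.

```python
# Hypothetical code book: classification code -> label (made-up values).
CODE_BOOK = {
    "2310": "University teacher",
    "2320": "Secondary school teacher",
    "5220": "Shop salesperson",
    "6110": "Crop farmer",
}


def suggest(typed, code_book=CODE_BOOK):
    """Step 2: list (code, label) pairs where a word of the label
    starts with the few characters typed by the coder."""
    prefix = typed.strip().lower()
    if not prefix:
        return []
    return sorted(
        (code, label)
        for code, label in code_book.items()
        if any(word.startswith(prefix) for word in label.lower().split())
    )


def record(form, variable, chosen_code, code_book=CODE_BOOK):
    """Steps 3-4: the coder picks a code and the system stores it."""
    if chosen_code not in code_book:
        raise ValueError(f"unknown code: {chosen_code}")
    form[variable] = chosen_code
    return form


matches = suggest("tea")  # step 1: the coder typed only three characters
form = record({}, "occupation", matches[0][0])  # coder picks the first match
```

Coding rules for structured variables (for example, restricting the list by industry) would be added inside `suggest`.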
Paolo Valente - UNECE Statistical Division Slide 7
Computer-assisted coding
Pros:
  Efficiency
  Good quality
  Particularly suitable for structured coding (possibility to include coding rules)
Cons:
  Relatively complex system
  Long time needed for development
  Cost relatively high
Paolo Valente - UNECE Statistical Division Slide 8
Automatic coding
Based on computerized algorithms
No human intervention
Text captured by ICR and matched against indexes
A score is assigned by the system to the matched response:
  If the score is above a certain level, the response is accepted
  If the score is below that level, human intervention is needed (computer-assisted coding)
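A minimal sketch of the score-and-threshold logic. The index, the threshold of 0.85 and the use of Python's `difflib.SequenceMatcher` as the similarity score are assumptions made for illustration; production systems such as ACTR use their own parsing and matching algorithms.

```python
from difflib import SequenceMatcher

# Hypothetical index: normalized response text -> classification code.
INDEX = {
    "kazakhstan": "KZ",
    "kyrgyzstan": "KG",
    "tajikistan": "TJ",
    "uzbekistan": "UZ",
}
THRESHOLD = 0.85  # hypothetical acceptance level


def auto_code(text, index=INDEX, threshold=THRESHOLD):
    """Return (code, score, status) for a captured response.

    The best-scoring index entry is accepted when its score reaches
    the threshold; otherwise the response is routed to
    computer-assisted coding and no code is assigned.
    """
    cleaned = text.strip().lower()
    best_entry, best_score = None, 0.0
    for entry in index:
        score = SequenceMatcher(None, cleaned, entry).ratio()
        if score > best_score:
            best_entry, best_score = entry, score
    if best_entry is not None and best_score >= threshold:
        return index[best_entry], best_score, "accepted"
    return None, best_score, "computer-assisted"


print(auto_code("Kazakstan"))  # misspelled, but close enough to be accepted
print(auto_code("abroad"))     # low score, sent to a human coder
```

The threshold is the trade-off between matching rate and the risk of systematic errors: lowering it codes more responses automatically but accepts weaker matches.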
Paolo Valente - UNECE Statistical Division Slide 9
Automatic coding
Matching rates depend on algorithms used and type of variable
Maximum matching rates in ideal circumstances:
For simple variables (birth place), approx. 80%
For complex variables (occupation, industry), approx. 50%
All responses not matched have to be processed with computer-assisted coding
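To see what these rates mean for the workload, a small worked calculation. The figure of 1,000,000 responses is hypothetical; the rates are the approximate maxima quoted above.

```python
def residual_workload(n_responses, matching_rate):
    """Number of responses left for computer-assisted coding
    after automatic coding at the given matching rate."""
    if not 0.0 <= matching_rate <= 1.0:
        raise ValueError("matching_rate must be between 0 and 1")
    return n_responses - round(n_responses * matching_rate)


N = 1_000_000  # hypothetical number of written responses per variable
birth_place_left = residual_workload(N, 0.80)  # simple variable
occupation_left = residual_workload(N, 0.50)   # complex variable
print(birth_place_left, occupation_left)
```

Even in ideal circumstances, half of the responses for a complex variable would still need a human coder.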
Paolo Valente - UNECE Statistical Division Slide 10
Automatic coding
Pros:
  High efficiency
  Good quality (if system developed accurately)
  Consistency
  Particularly suitable for structured coding (possibility to include coding rules)
Cons:
  Very complex system
  Long time needed for development
  High cost
  Risk of systematic errors in case of faults in matching algorithms or indexes
Paolo Valente - UNECE Statistical Division Slide 11
Coding – Practices in 2000 round
In general CIS countries used manual coding
About half of UNECE countries used automatic coding, in combination with computer-assisted or manual coding
In most cases software developed in-house
Software for automatic coding:
ACTR (Automated Coding by Text Recognition) developed by Statistics Canada, also used by Italy, UK
See “Measuring Population and Housing”, Chapter III
Integrated software system, including computer assisted coding: CSPro (US Census Bureau)
Paolo Valente - UNECE Statistical Division Slide 12
Coding in the 2010 census round
Questions:
1. What are your plans for coding the data of the next census?
2. Are you considering computer-assisted coding?
3. Why? …or why NOT?
Paolo Valente - UNECE Statistical Division Slide 13
2. Editing and imputation
Questions on editing:
1. Which data did you edit in the last census?
2. How did you edit the data?
3. Did you have any problems?
Paolo Valente - UNECE Statistical Division Slide 14
2. Editing and imputation
Questions on imputation:
1. Did you impute any missing data? If yes:
2. For which variables?
3. What method and software did you use?
4. Did you produce statistics on imputation rates?
Paolo Valente - UNECE Statistical Division Slide 15
Editing and imputation
Editing = Detecting and correcting errors in census data
Imputation = assigning values to missing data
The two concepts are related and the two terms are sometimes used in different ways
Paolo Valente - UNECE Statistical Division Slide 16
Editing and imputation
Different types of errors:
  Coverage errors (ex. omissions, duplicates)
  Enumerator errors
  Respondent errors
  Coding errors
  Data entry errors
but also… Editing errors!
Paolo Valente - UNECE Statistical Division Slide 17
Editing and imputation
Important not only to detect errors, but also to identify causes, in order to take appropriate measures and improve overall quality
Objectives of editing and imputation:
Improve quality of census data
Facilitate analysis of census data
Identify types and sources of errors
Paolo Valente - UNECE Statistical Division Slide 18
Editing and imputation
Dilemma: what should be edited and what should NOT be edited?
Complex editing systems can be difficult and expensive to implement, and in some cases may introduce distortions
Go for relatively simple editing system!
Paolo Valente - UNECE Statistical Division Slide 19
Editing and imputation
In general, the editing system should be:
  Minimalist (only obvious errors)
  Automated (as much as possible)
  Systematic
  Compliant with other NSI procedures
  Compliant with intl. standards
Paolo Valente - UNECE Statistical Division Slide 20
Editing and imputation
General guidelines for editing:
Make the fewest required changes possible
Eliminate obvious inconsistencies
Supply entries for erroneous or missing items by using other entries for the housing unit, person, or other persons in the household or comparable group as a guide
Paolo Valente - UNECE Statistical Division Slide 21
Editing and imputation
Example of inconsistent information 1:
Reference person and spouse have same sex
Paolo Valente - UNECE Statistical Division Slide 22
Editing and imputation
Example of inconsistent information 2:
Excessive age difference between mother and children
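Both examples of inconsistent information can be written as simple consistency checks on a household. The sketch below only flags the records; it does not correct them. The field names, the `mother_id` link and the 50-year limit are assumptions for illustration, and each census would define its own rules and limits.

```python
MAX_MOTHER_CHILD_GAP = 50  # hypothetical limit, in years


def check_household(persons):
    """Return a list of (rule_name, person_ids) for each failed check."""
    failures = []

    # Example 1: reference person and spouse have the same sex.
    ref = next((p for p in persons if p["relation"] == "reference"), None)
    spouse = next((p for p in persons if p["relation"] == "spouse"), None)
    if ref and spouse and ref["sex"] == spouse["sex"]:
        failures.append(("same_sex_spouses", (ref["id"], spouse["id"])))

    # Example 2: excessive age difference between mother and child.
    by_id = {p["id"]: p for p in persons}
    for child in persons:
        mother = by_id.get(child.get("mother_id"))
        if mother is None:
            continue
        if mother["age"] - child["age"] > MAX_MOTHER_CHILD_GAP:
            failures.append(("mother_child_age_gap", (mother["id"], child["id"])))
    return failures


household = [
    {"id": 1, "relation": "reference", "sex": "M", "age": 60},
    {"id": 2, "relation": "spouse", "sex": "M", "age": 58},
    {"id": 3, "relation": "child", "sex": "F", "age": 2, "mother_id": 2},
]
print(check_household(household))
```

A check only shows that the records disagree; it does not say which of the conflicting values is wrong, which is why the choice of what to change needs a separate rule.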
Paolo Valente - UNECE Statistical Division Slide 23
Editing and imputation
Editing approaches:
Top-down:
  Items in sequence, from first to last
Multiple variable (Fellegi-Holt):
1. A set of statements and relationships among variables is checked in the household
2. The edit keeps track of all false statements
3. The system assesses how best to change the data
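A much simplified sketch of the multiple-variable idea: check all statements, keep the false ones, then look for the smallest set of items whose change would make every statement true. The rules, domains and field names are invented for the example, and the brute-force search stands in for the implied-edit and set-covering machinery of a real Fellegi-Holt system.

```python
from itertools import combinations, product

# Hypothetical domains of valid values for each item.
DOMAINS = {
    "age": range(0, 100),
    "marital": ("single", "married"),
    "relation": ("reference", "spouse", "child"),
}

# Hypothetical statements that must all be true for a person record.
RULES = {
    "married_implies_15_plus": lambda r: r["marital"] != "married" or r["age"] >= 15,
    "spouse_implies_married": lambda r: r["relation"] != "spouse" or r["marital"] == "married",
    "reference_implies_15_plus": lambda r: r["relation"] != "reference" or r["age"] >= 15,
}


def false_statements(record):
    """Steps 1-2: check every statement and keep track of the false ones."""
    return [name for name, rule in RULES.items() if not rule(record)]


def fields_to_change(record):
    """Step 3: smallest set of items that can be changed so that
    all statements become true (None if no such set exists)."""
    fields = list(DOMAINS)
    for size in range(len(fields) + 1):
        for subset in combinations(fields, size):
            for values in product(*(DOMAINS[f] for f in subset)):
                candidate = {**record, **dict(zip(subset, values))}
                if not false_statements(candidate):
                    return subset
    return None


person = {"age": 8, "marital": "married", "relation": "spouse"}
print(false_statements(person))  # one false statement
print(fields_to_change(person))  # changing age alone is enough
```

Changing the marital status instead would break the spouse rule, so age is the only single item that resolves the record. The new value itself would then come from imputation.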
Paolo Valente - UNECE Statistical Division Slide 24
Editing and imputation
Imputation methods:
Static imputation (or “cold deck”)
  Used mainly for missing values only
  Value assigned from predetermined set, or distribution of valid responses
  The set of values does not change over time
Dynamic imputation (or “hot deck”)
  Used for missing or inconsistent values
  Value assigned from “donor” with similar characteristics, that changes constantly
  Response imputations change over time
See “Handbook on Census Editing”, Ch. II.E and Annex V
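The difference between the two methods can be sketched in a few lines. This is an illustrative sequential hot deck with a cold-deck fallback, not the procedure of any particular census: the variable (age), the imputation class (sex) and the cold-deck values are assumptions.

```python
# Hypothetical cold deck: fixed values by sex, set before processing.
COLD_DECK = {"M": 30, "F": 32}


def impute_age(records, cold_deck=COLD_DECK):
    """Fill missing ages record by record.

    Hot deck: use the age of the most recent valid record with the
    same sex (the donor changes as processing moves on).
    Cold deck: used only when no donor has been seen yet.
    """
    last_donor = {}
    result = []
    for rec in records:
        rec = dict(rec)  # do not modify the input
        sex = rec["sex"]
        if rec["age"] is None:
            if sex in last_donor:
                rec["age"], rec["imputed"] = last_donor[sex], "hot deck"
            else:
                rec["age"], rec["imputed"] = cold_deck[sex], "cold deck"
        else:
            last_donor[sex] = rec["age"]  # only reported values become donors
        result.append(rec)
    return result


records = [
    {"sex": "F", "age": None},
    {"sex": "F", "age": 41},
    {"sex": "M", "age": 25},
    {"sex": "F", "age": None},
    {"sex": "M", "age": None},
]
imputed = impute_age(records)
rate = sum("imputed" in r for r in imputed) / len(imputed)  # imputation rate
```

Keeping the `imputed` flag on each record is what makes it possible to publish imputation rates afterwards.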
Paolo Valente - UNECE Statistical Division Slide 25
Editing and imputation
Types of edits:
Fatal edits identify errors with certainty
Query edits identify suspected errors
Structure edits
  Check coverage and relations between different units: persons, households, housing units, enumeration areas etc.
Edits for population and housing items
See “Handbook on Census Editing”, Chapters III, IV and V
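As an illustration of a structure edit, the sketch below checks the relation between persons and households. The two rules (exactly one reference person per household, no repeated person numbers) and the field names are assumptions chosen for the example.

```python
from collections import Counter


def structure_edits(persons):
    """Return (household_id, message) for each household that fails
    a structure check. `persons` is a list of dicts with the keys
    household_id, person_no and relation."""
    households = {}
    for p in persons:
        households.setdefault(p["household_id"], []).append(p)

    failures = []
    for hh_id, members in sorted(households.items()):
        n_ref = sum(1 for m in members if m["relation"] == "reference")
        if n_ref != 1:
            failures.append((hh_id, f"{n_ref} reference persons"))
        counts = Counter(m["person_no"] for m in members)
        if any(c > 1 for c in counts.values()):
            failures.append((hh_id, "duplicate person numbers"))
    return failures


persons = [
    {"household_id": 1, "person_no": 1, "relation": "reference"},
    {"household_id": 1, "person_no": 2, "relation": "spouse"},
    {"household_id": 2, "person_no": 1, "relation": "child"},
    {"household_id": 2, "person_no": 1, "relation": "child"},
]
print(structure_edits(persons))
```

Both checks here behave as fatal edits, since a failure is certainly an error; a query edit would instead flag an unusual but possible case for review.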
Paolo Valente - UNECE Statistical Division Slide 26
Editing and imputation
Practices in 2000 round
Most ECE countries (33 out of 40) performed computer-supported editing, including several CIS countries
22 countries performed automatic imputations
Most countries developed specific software
Some countries used SAS, Oracle, SQL, CSPro
See “Measuring Population and Housing”, Chapter III
Paolo Valente - UNECE Statistical Division Slide 27
Editing and imputation
Plans for 2010 round
Questions:
  What are your plans for editing and imputation?
  What editing approaches/methods are you considering?
Paolo Valente - UNECE Statistical Division Slide 28
Editing and imputation
Plans for 2010 round
Questions:
  For which variables would you consider imputation of missing values?