the use of administrative sources for statistical purposes matching and integrating data from...
TRANSCRIPT
The Use of Administrative Sourcesfor Statistical Purposes
Matching and Integrating Data from
Different Sources
What is Matching?• Linking data from different sources• Exact Matching - linking records from
two or more sources, often using
common identifiers• Probabilistic Matching - determining
the probability that records from different
sources should match, using a
combination of variables
Why Match?• Combining data sets can give more
information than is available from individual data sets
• Reduce response burden• Build efficient sampling frames• Impute missing data• To allow data integration
Models for Data Integration
• Statistical registers• Statistics from mixed source models
– Split population model– Split data approach– Pre-filled questionnaires– Using administrative data for non-
responders– Using administrative data for
estimation• Register-based statistical systems
Statistical Register
Survey data
Geographic information
systems
Administrative Sources
Other Statistical Registers
Satellite
registers
Statistical Registers
Mixed Source Models
• Traditionally one statistical output was based on one statistical survey
• Very little integration or coherence• Now there is a move towards more
integrated statistical systems• Outputs are based on several
sources
Split Population Model
• One source of data for each unit
• Different sources for different parts of the population
Split Population Model
Population of Statistical Units
Estimation Administrative Data
Statistical Survey
Statistics
Split Data Approach
• Several sources of data for each unit
Estimation Administrative Data
Statistical Survey
Unit 1 Unit 2 Unit 3 Unit n
Statistics
Pre-filled Questionnaires
• Survey questionnaires are pre-filled with data from other sources where possible
• Respondents check that the information is correct, rather than completing a blank questionnaire
• This reduces response burden ...... but may introduce a bias!
Example
Manufacture of wooden furniture
Using Administrative Data for Non-responders
• Administrative data are used directly to supply variables for units that do not respond to a statistical survey
• Often used for less important units, so that response-chasing resources can be focused on key units
Using Administrative
Data for Estimation• Administrative data are used as
auxiliary variables to improve the accuracy of statistical estimation
• Often used to estimate for small sub-populations or small geographic areas
Register-based Statisti
cal System
s
Real Estate Register
Business Register
Jobs and Other
Activities
Ad
min
istr
ati
ve
So
urc
es
Sta
tist
ical
Su
rve
ys
Statistical Outputs
Statistical Registers
Population Register
MatchingTerminology
Matching Keys
• Data fields used for matching e.g.• Reference Number• Name• Address• Postcode/Zip Code/Area Code• Birth/Death Date• Classification (e.g. ISIC, ISCO)• Other variables (age, occupation, etc.)
Distinguishing Power 1
• This relates to the uniqueness of the matching key
• Some keys or values have higher distinguishing powers than others
• High - reference number, full name, full address
• Low - sex, age, city, nationality
Distinguishing Power 2
• Can depend on level of detail– Born 1960, Paris
– Born 23 June 1960, rue de l’Eglise, Montmartre, Paris
• Choose variables, or combinations of variables with the highest distinguishing power
Match
• A pair that represents the same entity in reality
A A
Non-match
• A pair that represents two different entities in reality
A B
Possible Match
• A pair for which there is not enough information to determine whether it is a match or a non-match
A a
False Match
• A pair wrongly designated as a match in the matching process (false positive)
A B=
False Non-match
• A pair which is a match in reality, but is designated as a non-match in the matching process (false negative)
A A
MatchingTechniques
Clerical Matching
• Expensive
• Inconsistent
• Slow
• Intelligent
Automatic Matching
• Cheap
• Consistent
• Quick
• Limited intelligence
The Solution
• Use an automatic matching tool to find obvious matches and no-matches
• Refer possible matches to specialist staff
• Maximise automatic matching rates and minimise clerical intervention
How Automatic
Matching Works
Standardisation
• Generally used for text variables
• Abbreviations and common terms are replaced with standard text
• Common variations of names are standardised
• Postal codes, dates of birth etc. are given a common format
Blocking• If the file to be matched against is
very large, it may be necessary to break it down into smaller blocks to save processing time– e.g. if the record to be matched is in a
certain town, only match against other records from that town, rather than all records for the whole country
Blocking• Blocking must be used carefully, or
good matches will be missed
• Experiment with different blocking criteria on a small test data set
• Possible to have two or more passes with different blocking criteria to maximise matches
Parsing
• Names and words are broken down into matching keyse.g. Steven Vale stafan val
Stephen Vael stafan val
• Improves success rates by allowing matching where variables are not identical
Scoring
• Matched pairs are given a score based on how closely the matching variables agree
• Scores determine matches, possible matches and non-matches
Score100
x
y
0
Matches
PossibleMatches
Non-matches
How to DetermineX and Y
• Mathematical methodse.g. Fellegi / Sunter method
• Trial and Error
• Data contents and quality may change over time so periodic reviews are necessary
Enhancements
• Re-matching files at a later date reduces false non-matches (if at least one file is updated)
• Link to data cleaning software, e.g. address standardisation
Matching Software• Commercial products e.g.
Informatica, Trillium, Automatch
• In-house products– Jasper (Statistics Canada)
– Relais (ISTAT)
• Open-source products e.g. FEBRL
• No “off the shelf” products - all require tuning to specific needs
Internet Applications• Google (and other search engines)
– www.google.com
• Cascot – an automatic coding tool based on text matching– http://www2.warwick.ac.uk/fac/soc/ier/
publications/software/cascot/choose_classificatio/
• Address finders e.g. Postes Canada– http://www.postescanada.ca/tools/pcl/bin/
advanced-f.asp
Software Applications• Trigram method applied in SAS code
(freeware) for matching in the Eurostat business demography project
• Works by comparing groups of 3 letters, and counting matching groups
Trigram Method• Match “Steven Vale”
– Ste/tev/eve/ven/en /n V/ Va/Val/ale
• To “Stephen Vale”– Ste/tep/eph/phe/hen/en /n V/ Va/Val/ale– 6 matching trigrams
• And “Stephen Vael”– Ste/tep/eph/phe/hen/en /n V/ Va/Vae/ael– 4 matching trigrams
• Parsing would improve these scores
Matching in
Practice
Matching Records Without a Common Identifier
The UK Experience
by
Steven Vale (Eurostat / ONS)
and Mike Villars (ONS)
The Challenge
• The UK statistical business register relies on several administrative sources
• It needs to match records from these different sources to avoid duplication
• There is no system of common business identification numbers in UK
The Solution
• Records are matched using business name, address and post code
• The matching software used is Identity Systems / SSA-NAME3
• Matching is mainly automatic via batch processing, but a user interface also allows the possibility of clerical matching
Batch Processing 1
• Name is compressed to form a namekey, the last word of the name is the major key
• Major keys are checked against those of existing records at decreasing levels of accuracy until possible matches are found
• The name, address and post codes of possible matches are compared, and a score out of 100 is calculated
Batch Processing 2
• If the score is >79 it is considered to be a definite match
• If the score is between 60 and 79 it is considered a possible match, and is reported for clerical checking
• If the score is <60 it is considered a non-match
Clerical Processing
• Possible matches are checked and linked where appropriate using an on-line system
• Non-matches with >9 employment are checked - if no link is found they are sent a Business Register Survey questionnaire
• Samples of definite matches and smaller non-matches are checked periodically
Problems Encountered 1
• “Trading as” or “T/A” in the namee.g. Mike Villars T/A Mike’s Coffee Bar, Bar would be the major key, but would give too many matches as there are thousands of bars in the UK.
• Solution - split the name so that the last word prior to “T/A” e.g. Villars is the major key, improving the quality of matches.
Problems Encountered 2• The number of small non-matched units
grows over time leading to increasing duplication
• Checking these units is labour intensive
• Solutions
– Fine tune matching parameters
– Re-run batch processes
– Use extra information e.g. legal form / company number where available
Future Developments• Clean and correct addresses prior to
matching using “QuickAddress” and the Post Office Address File
• Links to geographical referencing
• Business Index - plans to link registers of businesses across UK government departments
• Unique identifiers?