Taking the Best out of Both Worlds? The Linkage of Surveys and Administrative Data
Gesis, Mannheim, September 18, 2014
Stefan Bender, Manfred Antoni, Joe Sakshaug, Frauke Kreuter, Alexandra Schmucker
Use of Administrative Data in Publications in Leading Journals, 1980-2010 (Raj Chetty)
Motivation
! Starting point:
  ‐ Increasing demand for comprehensive, longitudinal data in the social sciences.
  ‐ Rising problems with surveys, e.g. falling response rates and increasing costs (Groves 2011).
  ‐ Process-produced data (Big Data, administrative data) are increasingly examined regarding their value for research (Kreuter/Peng 2014).
  ‐ Each of these data sources has its specific shortcomings.
! Remedy:
  ‐ Balancing the disadvantages of different data sources by combining their advantages
! Implementation:
  ‐ Create more comprehensive datasets using data linkage
Outline
! What is administrative data?
! Differences between administrative and survey data
! Advantages/disadvantages
! Linking administrative and survey records ‐ Examples from the IAB ‐ Linkage ‐ Informed Consent
! International access to linked data (FDZ)
! Conclusions
! Extra: Big data and informed consent
Advantages and disadvantages: Survey data
! Advantages:
  ‐ Specifically designed for research purposes (see Groves 2011)
  ‐ Total survey error framework (see Groves/Lyberg 2010)
  ‐ Subjective information on behaviours, attitudes etc.
! Disadvantages:
  ‐ Missing data (unit-nonresponse, item-nonresponse, panel attrition)
  ‐ Misreporting (e.g. recall errors in retrospective interviews)
  ‐ Time restrictions
  ‐ High costs
What is administrative data?
! “Secondary” or “process” data that is collected and used primarily for administrative purposes
! Often generated by government agencies and public/private sector organizations, who keep records of the services they deliver and processes that they register
! Examples
  ‐ Social security system (health, pension and employment)
  ‐ Unemployment, active labor market programs, social benefits
  ‐ Pupil records
  ‐ Tax and income records
  ‐ Information collected from birth/death certificates
Administrative vs. Survey Data
! Unlike survey data, the primary use of administrative data is not for research purposes
! Administrative data is usually collected for a population
! Administrative populations often differ from traditional survey populations
  ‐ Patients covered under a particular health insurance organization
  ‐ Persons diagnosed with a particular illness (e.g., cancer, HIV/AIDS)
  ‐ Persons with an established credit history
! Cannot add specific measures to administrative data
! Administrative data sources are usually longitudinal
Advantages of Using Administrative Data
! Relatively inexpensive to obtain and use
! Saves money and resources for data collection, since data are already available ‐ No respondent burden
! Can be more accurate than survey data because some measurement issues (e.g., forgetting, social desirability) are avoided
! Can provide detailed longitudinal information
  ‐ Lifetime earnings, medical expenditures
  ‐ Such information may be too burdensome for respondents to report
Advantages (cont.)
! Often contains very large sample sizes that would be too costly to achieve in surveys
! Databases are regularly updated, sometimes continuously
! Data are collected systematically with quality control checks
! Nearly 100% coverage of the population of interest
  ‐ Includes individuals who may not respond to surveys
Disadvantages of Using Administrative Data
! Administrative data alone is usually not sufficient to answer most research questions
! Researcher has no control over administrative content
! Such data may not contain all relevant variables of interest
  ‐ E.g., socio-demographic characteristics (e.g., education), household composition, self-employment income, habits and behaviors, opinions and attitudinal measures, expectations, retirement plans
  ‐ Surveys can collect these variables, which can be used in conjunction with administrative data
! Concepts, definitions, reference dates, and coverage of administrative variables may not meet the research objectives
Disadvantages (cont.)
! There may be quality issues associated with variables that are not central to the administrative tasks
! Variable definitions may change over time without notice and without being recoded to a consistent scheme ‐ E.g., occupation/industry codes
! Metadata (description and background) may be very limited
! Administrative data sources are often very large and their use can lead to significant processing costs
! Strong data protection laws may complicate the data access process and/or place restrictions on the publication of results
Combining Strengths of Both
! Combining survey and administrative data may provide the best of both worlds, and mitigate their disadvantages
! Increases the number of relevant variables for research purposes
! Administrative variables with poor quality could be replaced with higher quality survey variables, and vice versa
! Researcher has more control over the content of the data
! If administrative database serves as the sampling frame, then it is possible to do extensive nonresponse bias analysis
! At the IAB, they have a strategy for utilizing administrative data pre- and post-survey data collection
Administrative Data of the RDC (FDZ) of the German Federal Employment Agency (BA)
! Micro labor market data on individuals/households and establishments
Data available at the FDZ
[Figure: data available at the FDZ. Surveys; administrative data (social security notifications, process-generated data of the BA); big data.]
Exemplary project I: WeLL-ADIAB (I) (see Bender et al. 2009)
! Data sources ([S] = survey data, [A] = administrative data):
  ‐ Employee survey (project ‘Further Training as Part of Lifelong Learning’) [S]
  ‐ IAB Establishment Panel [S]
  ‐ Employment biographies [A]
  ‐ Establishment histories [A]
! Data linkage:
  ‐ Informed consent for linkage
  ‐ Linkage using social security and establishment number
! Data access:
  ‐ On-site use at the FDZ or via job submission
Exemplary project I: WeLL-ADIAB (II)
[Figure: linkage structure of WeLL-ADIAB. Survey data: WeLL Employee Panel, IAB Establishment Panel. Administrative data: Employment Biographies, Establishment Histories.]
IAB Example: PASS
! “Labor Market and Social Security” (PASS) survey
! Mixed-mode study (telephone and face-to-face) conducted in Germany
! Survey consists of two independent subsamples:
! General population sample
  ‐ Drawn from commercial database covering all private household addresses
! Benefit recipient sample
  ‐ Drawn from Federal Employment records of persons who received unemployment benefits at the reference date
IAB Example: PASS (cont.)
! Drawing from administrative records permits the study of nonresponse bias in each subsample
  ‐ The IAB routinely exploits this opportunity
! However, these records cannot be released to the public without informed consent from the survey unit
! PASS asks respondents for consent to link survey and administrative records for research purposes
  ‐ Consent rate almost 80 percent
Conceptual Pathway to Linkage
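[Figure: flow from the sample frame/administrative data (Y) to the sample (Y_n), which splits into responders (Y_r) and non-responders (Y_nr); responders split into consenters (Y_c) and non-consenters (Y_nc); consenters' records end up linked (Y_L) or non-linked (Y_NL).]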
Linking Survey and Administrative Data
! Linking survey and administrative data is becoming increasingly common in the social and health sciences
! Basic idea of linkage
  ‐ Identify common variables in both data sets
  ‐ Link each survey record to the corresponding administrative record based on matching variables
! Different methods for linking survey and administrative data
  ‐ Exact matching
  ‐ Probabilistic matching
  ‐ Statistical matching (or data fusion)
(Exact and probabilistic matching link records belonging to the same unit.)
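Probabilistic matching is listed above without detail. The sketch below shows one common way to set it up: agreement on each matching variable adds a log-likelihood weight, disagreement subtracts one, and a pair is linked only if the total passes a threshold. The field names, the m- and u-probabilities, and the threshold are hypothetical values chosen for illustration; they are not taken from the IAB or GRLC procedures.

```python
import math

# Hypothetical parameters per matching variable:
# m = P(values agree | records belong to the same unit)
# u = P(values agree | records belong to different units)
FIELD_PARAMS = {
    "last_name": (0.95, 0.01),
    "birth_year": (0.98, 0.05),
    "postcode": (0.90, 0.02),
}


def match_weight(survey_rec, admin_rec):
    """Sum the log2 agreement or disagreement weight over all fields."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if survey_rec[field] == admin_rec[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def best_link(survey_rec, admin_records, threshold=5.0):
    """Return the best-scoring admin record, or None if no pair is convincing."""
    if not admin_records:
        return None
    scored = [(match_weight(survey_rec, a), a) for a in admin_records]
    weight, candidate = max(scored, key=lambda pair: pair[0])
    return candidate if weight >= threshold else None


# Made-up records: the postcode differs, but name and birth year agree.
survey_person = {"last_name": "Meier", "birth_year": 1970, "postcode": "68159"}
admin_pool = [
    {"id": "A1", "last_name": "Meier", "birth_year": 1970, "postcode": "68161"},
    {"id": "A2", "last_name": "Maier", "birth_year": 1982, "postcode": "90478"},
]
stranger = {"last_name": "Schulz", "birth_year": 1990, "postcode": "10115"}

link = best_link(survey_person, admin_pool)
no_link = best_link(stranger, admin_pool)
```

Unlike exact matching, this tolerates an error in a single field; the price is that the threshold has to be tuned and some links will be wrong.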
Exact Linkage
! A link is established based on a single unique identifier ‐ Social Security number ‐ Establishment number
! Purely deterministic approach
! Exact 1-to-1 matching
! Usually the survey must request the unique identifier from the respondent prior to linkage ‐ Bundled into the informed consent statement ‐ Assumed that the identifier is recorded without error
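The exact linkage described above can be sketched as a deterministic one-to-one lookup on a single identifier, attempted only for respondents who consented. The record layout, the `consent` flag, the `admin_` prefix and the identifiers are all hypothetical; this is an illustration under those assumptions, not the FDZ's implementation.

```python
def exact_link(survey_records, admin_records, key="ssn"):
    """Deterministic 1-to-1 linkage on one unique identifier.

    Returns (linked, unlinked): consenting survey records with and
    without an administrative match. Non-consenters are never linked.
    """
    admin_by_key = {}
    for rec in admin_records:
        if rec[key] in admin_by_key:
            # A unique identifier must not occur twice in the admin data.
            raise ValueError(f"duplicate identifier: {rec[key]}")
        admin_by_key[rec[key]] = rec

    linked, unlinked = [], []
    for rec in survey_records:
        if not rec.get("consent"):
            continue  # no informed consent, so no linkage attempt
        match = admin_by_key.get(rec.get(key))
        if match is None:
            unlinked.append(rec)
        else:
            # Prefix admin fields so they cannot overwrite survey answers.
            admin_part = {f"admin_{k}": v for k, v in match.items() if k != key}
            linked.append({**rec, **admin_part})
    return linked, unlinked


# Made-up example data with obviously fake identifiers.
survey = [
    {"ssn": "ID-001", "consent": True, "job_satisfaction": 4},
    {"ssn": "ID-002", "consent": False, "job_satisfaction": 2},
    {"ssn": "ID-999", "consent": True, "job_satisfaction": 5},
]
admin = [
    {"ssn": "ID-001", "wage": 3100},
    {"ssn": "ID-002", "wage": 2500},
]
linked, unlinked = exact_link(survey, admin)
```

Because the approach assumes the identifier is recorded without error, a mistyped number simply ends up in `unlinked`; that is the case probabilistic matching is meant to recover.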
German Record Linkage Center (GRLC)
FDZ Nuremberg (focus: service facility)
  ‐ Project advisory center
  ‐ Conducting (privacy-preserving) record linkage
  ‐ Secure access to linked data
University of Duisburg-Essen (focus: research unit)
  ‐ Development and evaluation of linkage methods
  ‐ Development of free linkage software
  ‐ Dissemination of current research results
  ‐ Tutorials on record linkage
Financed by DFG
Informed Consent
§ Informed consent is believed to be an effective means of respecting individuals as autonomous decision makers with rights of self-determination.
§ In Germany, informed consent is defined by law.
§ Before linking administrative with survey data, informed consent of the surveyed units is needed.
The Selectivity of Consent
! Correlates of consent
  ‐ Age, race/ethnicity, gender, education, marital status, wealth, earnings, health status, health insurance, employment (Sala et al., 2014; Sala et al., 2012; Bates and Pascale, 2006; Jenkins et al., 2006; Banks et al., 2005; Dunn et al., 2003; Young et al., 2001; Woolf et al., 2000; Olson, 1999; Pullen et al., 1992)
  ‐ Item missing data, interviewer characteristics, prior-wave outcomes (Sala et al., 2012; Jenkins et al., 2006)
  ‐ Wording and placement of consent request (Sala et al., 2014; Sakshaug and Kreuter, 2014; Sakshaug et al., 2013)
! Most studies have only looked at the selectivity of survey estimates, but selectivity of key administrative estimates is also a concern
PASS Example: Estimating Linkage Consent Bias
! “Labour Market and Social Security” (PASS) survey
! Almost 80% of respondents consented to linkage of Federal employment records
! Employment records contain several variables used to administer welfare and unemployment benefits
  ‐ Wages, benefits, and employment spells considered to be most reliable
! Research questions
  ‐ Do linkage consent biases exist for some administrative variables?
  ‐ How do consent biases compare to other sources of error?
Conceptual Pathway to Linkage
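[Figure: flow from the sample frame/administrative data (Y) to the sample (Y_n), which splits into responders (Y_r) and non-responders (Y_nr); responders split into consenters (Y_c) and non-consenters (Y_nc); consenters' records end up linked (Y_L) or non-linked (Y_NL). The figure is annotated with three error sources: nonresponse bias, non-consent bias, and measurement bias.]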
Estimates of Linkage Consent Bias Relative to Other Sources of Bias
Variable            Nonresponse bias   Measurement bias   Non-consent bias
Age                  0.1                0.03               -0.3*
Foreign (%)         -5.6*              -2.5*               -0.9*
UB II (%)            3.2*              -7.1*               -0.3
Disability (%)       0.4                6.0*                0.01
Employed (%)         1.0               -0.6                 0.3
Income (30 days)   -71.4*             394.5*                1.7
Source: Sakshaug and Kreuter (2012). * p < 0.05
• Non-consent biases are present, but generally smaller than other sources of error
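When the administrative value is known for every sampled unit, the three error sources can be estimated as differences of means along the pathway from sample to responders to consenters. The sketch below uses one common set of definitions; the exact estimators in Sakshaug and Kreuter (2012) may differ, and the field names and numbers are made up for illustration.

```python
from statistics import mean


def bias_estimates(records, admin_var, survey_var):
    """Estimate three biases for one variable from sample-level records.

    Each record holds the administrative value for a sampled unit, the
    flags 'responded' and 'consented', and the survey report (None for
    non-responders).
    """
    responders = [r for r in records if r["responded"]]
    consenters = [r for r in responders if r["consented"]]

    sample_mean = mean([r[admin_var] for r in records])
    resp_admin_mean = mean([r[admin_var] for r in responders])
    cons_admin_mean = mean([r[admin_var] for r in consenters])
    resp_survey_mean = mean([r[survey_var] for r in responders])

    return {
        # responders vs. full sample, both measured with admin data
        "nonresponse": resp_admin_mean - sample_mean,
        # survey report vs. admin record among responders
        "measurement": resp_survey_mean - resp_admin_mean,
        # consenters vs. all responders, both measured with admin data
        "nonconsent": cons_admin_mean - resp_admin_mean,
    }


# Made-up sample of four units.
sample = [
    {"admin_income": 1000, "survey_income": 1100, "responded": True, "consented": True},
    {"admin_income": 2000, "survey_income": 2100, "responded": True, "consented": False},
    {"admin_income": 3000, "survey_income": None, "responded": False, "consented": False},
    {"admin_income": 2000, "survey_income": 2400, "responded": True, "consented": True},
]
biases = bias_estimates(sample, "admin_income", "survey_income")
```

Putting all three on the same variable and scale is what allows the side-by-side comparison in the table.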
Linkage and Informed Consent Rates
Name                 Linkage (%)   Consent (%)
ALWA                  86            92
PASS                  90            86
SAVE                  78            49 (57)
AeKo                  78            99
SFB 882               73            88
IAB-SOEP Migration    96            50
WeLL                 100            92
The Research Data Centre of the BA in the IAB
! Tasks of the Research Data Centre (FDZ):
  ‐ Preparation, standardization and documentation of research data
  ‐ Secure data access
  ‐ Advisory service on analytic potential, scope, validity and handling of data
! Several projects on data linkage using different sources since the FDZ’s establishment in 2004
! Provision of (linked) data to external researchers
Data Access
Access is Easy, Quick and Cheap
! Easy
  ‐ Non-technical project proposal
  ‐ Approval by RDC (off-site use) or Federal Ministry (on-site use)
  ‐ Use agreement with the institution of the researcher
! Quick
  ‐ (Estimated) time until user/institution receives contract:
    ‐ 2 weeks for off-site access (scientific use file)
    ‐ 6 weeks for on-site access
! Cheap
  ‐ Data access is free of charge
  ‐ No lab fees
  ‐ No restrictions on hours/visits of on-site use facilities or runs of remote executions
Data Access I
§ UKDA, Essex will be next
Summary & outlook
! Data linkage allows a combination of traditional / designed research data and process-produced data from various sources.
! Linked data may help researchers to understand the data-generating process and to determine whether model assumptions are met.
! Total survey error framework has to be applied more thoroughly on process-produced data.
! Granting access to linked micro-data is generally possible, but
  ‐ Increased richness of data also increases the risk of deanonymisation.
  ‐ Ways of access to single data sources may not be suitable for their combination.
→ The FDZ needs to improve in terms of (remote) access to linked data.
→ Anyone is welcome to do research with and on our linked data sets!
Quality, analytic potential and accessibility of linked administrative, survey and publicly available data
www.cambridge.org/9781107637689 www.dataprivacybook.org
Privacy, Big Data, and the Public Good: Frameworks for Engagement
Edited by
Julia Lane, American Institutes for Research, Washington DC
Victoria Stodden, Columbia University
Stefan Bender, Institute for Employment Research of the German Federal Employment Agency
Helen Nissenbaum, New York University
Massive amounts of data on human beings can now be analyzed. Pragmatic purposes abound, including selling goods and services, winning political campaigns, and identifying possible terrorists. Yet “big data” can also be harnessed to serve the public good: scientists can use big data to do research that improves the lives of human beings, improves government services, and reduces taxpayer costs. In order to achieve this goal, researchers must have access to this data – raising important privacy questions. What are the ethical and legal requirements? What are the rules of engagement? What are the best ways to provide access while also protecting confidentiality? Are there reasonable mechanisms to compensate citizens for privacy loss?
The goal of this book is to answer some of these questions. The book’s authors paint an intellectual landscape that includes legal, economic, and statistical frameworks. The authors also identify new practical approaches that simultaneously maximize the utility of data access while minimizing information risk.
Contributors Katherine J. Strandburg; Solon Barocas and Helen Nissenbaum; Alessandro Acquisti; Paul Ohm; Victoria Stodden; Steven E. Koonin and Michael J. Holland; Robert M. Goerge; Peter Elias; Daniel Greenwood, Arkadiusz Stopczynski, Brian Sweatt, Thomas Hardjono, and Alex Pentland; Carl Landwehr; John Wilbanks; Frauke Kreuter and Roger Peng; Alan F. Karr and Jerome P. Reiter; Cynthia Dwork
Order Today! Visit www.cambridge.org/9781107637689 or call 1.800.872.7423
20% Discount Promo Code: F4LANE
Forthcoming for 2014
Book Goals
§ Massive amounts of data on human beings can now be analyzed.
§ Pragmatic purposes abound, including selling goods, winning political campaigns, and identifying possible terrorists.
§ Big data can also be harnessed to serve the public good: scientists can use big data to do research that improves the lives of human beings and more.
§ To achieve this goal, researchers must have access to this data, which raises important privacy questions.
  • What are the legal requirements?
  • What are the rules of engagement?
  • What are the best ways to provide access while also protecting confidentiality?
  • Are there reasonable mechanisms to compensate citizens for privacy loss?
Anonymity, Reachability, Information flow
§ Anonymity and consent are attractive: anonymization seems to take data outside the scope of privacy.
§ The value of anonymity inheres not in namelessness but in something the chapter authors call “reachability”, which can hold with or without access to identifying information.
§ Even when individuals are not ‘identifiable’, they may still be ‘reachable’, and may be subject to consequential inferences and predictions taken on that basis.
§ Big data involves practices that have radically disrupted entrenched information flows.
Book-Chapter by Barocas, Nissenbaum
The Tyranny of the Minority
§ The willingness of a few individuals to disclose certain information implicates everyone else who happens to share the more easily observable traits that correlate with the revealed trait.
§ This is the tyranny of the minority: the volunteered information of the few can unlock the same information about the many.
Book-Chapter by Barocas, Nissenbaum
Inference
§ A lot can be predicted about a person’s actions without knowing anything personal about them (especially in a big data context).
Book-Chapter by Barocas, Nissenbaum
Informed Consent
§ Informed consent is believed to be an effective means of respecting individuals as autonomous decision makers with rights of self-determination.
§ Thus, where anonymity is unachievable or simply does not make sense, informed consent often is the mechanism sought out by conscientious collectors and users of personal information.
§ Understood as a crucial mechanism for ensuring privacy, informed consent is a natural corollary of the idea that privacy means control over information about oneself.
Book-Chapter by Barocas, Nissenbaum
Transparency
§ The ideal of informed consent offers data subjects or human subjects true freedom of choice based on a sound and sufficient understanding of what the choice entails.
§ Yet the simplicity and clarity needed for understanding unavoidably result in losses of fidelity.
§ Plain-language notices cannot provide the information that people need to make decisions about complex contents in big data.
Book-Chapter by Barocas, Nissenbaum
My Conclusion for Big Data
! Blend big data and survey-based/official data.
! Use RDC structure for access to big data or combined data.
! No longer hands-on work with data.
! Discussion of many topics needed: informed consent, non-participation, inference, privacy …
! Main issues: data protection, access and trust.
→ We have to be more active in the public discussion, because big data is affecting our daily work!
www.iab.de
http://fdz.iab.de/en.aspx
Stefan Bender [email protected]
German Administrative Data
Social Security Notifications
! Procedure:
  ‐ Employers submit notifications to the social security system
  ‐ For every employee and marginal worker covered by the social security system (notification requirement)
  ‐ Annually, or when an event occurs: beginning or end of employment, employment interruption, change of health insurance
  ‐ Identification: social security number and establishment number
! Purpose of data collection:
  ‐ Calculation of social security contributions and (unemployment) benefits
  ‐ Statistics
Procedure of Social Security Notifications (simplified)
[Figure: simplified notification flow, with boxes in this order: establishments/employers; receiving offices of the notification procedure (health insurance companies); German Federal Pension Fund; Federal Employment Agency.]
Notification to the Social Security System:
§ Social Security Number
§ Establishment Number
§ Last Name
§ First Name
§ Address
§ Reason for Notification
§ Times of Employment (on a daily basis)
§ Nationality
§ School Education
§ Vocational Training
§ Type of Employment
§ Wages
§ Occupational Status
2. Administrative Data
! No information about civil servants, freelancers or self-employed
‐ Internal processes of the Federal Employment Agency
‐ Payment/receipt of unemployment benefits
‐ Participation in labour market programs
‐ Registered job search
! Exact start and end dates
! Computer-aided processes
! Since 2011 new information: new occupation classification, working hours
2. Administrative Data
! Federal Employment Agency transmits data to IAB
! IAB merges social security notifications and BA data
→ (complete) individual employment biographies
  ‐ Employment history covered by social security system (since 1975)
  ‐ Unemployment benefit receipt (since 1975)
  ‐ Registered job search (since 2000)
  ‐ Participation in labor market programs (since 2000)
[Figure: timeline of an individual employment biography, from apprenticeship to pension.]
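The merge of social security notifications and BA data into individual employment biographies can be pictured as pooling dated spells per person and ordering them in time. The sketch below is a toy version under that assumption: the `Spell` layout, the source labels and the person identifier are hypothetical, and the real IAB processing is certainly more involved (overlaps, gaps, corrections).

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Spell:
    person_id: str  # hypothetical pseudonymous identifier
    source: str     # e.g. "employment" or "benefit_receipt"
    start: date     # exact start date of the spell
    end: date       # exact end date of the spell


def build_biographies(*spell_sources):
    """Pool spells from several sources into one sorted list per person."""
    biographies = defaultdict(list)
    for spells in spell_sources:
        for spell in spells:
            if spell.end < spell.start:
                raise ValueError(f"spell ends before it starts: {spell}")
            biographies[spell.person_id].append(spell)
    for spells in biographies.values():
        spells.sort(key=lambda s: (s.start, s.end))
    return dict(biographies)


# Made-up spells for one person from two sources.
notifications = [
    Spell("P1", "employment", date(2001, 1, 1), date(2003, 6, 30)),
    Spell("P1", "employment", date(2004, 2, 1), date(2010, 12, 31)),
]
ba_records = [
    Spell("P1", "benefit_receipt", date(2003, 7, 1), date(2004, 1, 31)),
]
biographies = build_biographies(notifications, ba_records)
sequence = [s.source for s in biographies["P1"]]
```

Daily start and end dates are what make this ordering unambiguous; with only yearly information the benefit spell could not be placed between the two jobs.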
Inconsistencies
! Purpose of data collection:
  ‐ Calculation of social security contributions and (unemployment) benefits
  ‐ Statistics
! Variables that are highly accurate:
  ‐ Sex, birthdate (included in the social security number)
  ‐ Wage, and the beginning and end of a job
! All other variables play a minor role for these purposes and are therefore not highly accurate.