February 1, 2011 Workshop: Persistent Identifiers for the Social Sciences 1
SOEP and DOI: Requirements and Challenges
Jan Goebel
Content
1. SOEP Overview
2. Problems
3. Conclusions
SOEP Overview
• Socio-Economic Panel Study (SOEP) is a representative longitudinal study of private households in Germany
• Annual survey since 1984 of about 10,000 households (around 20,000 persons)
• Some of the many topics include household composition, occupational biographies, employment, earnings, health and indicators of subjective well-being
SOEP is an ongoing Survey
• Common with all panel surveys
• Each year we distribute an enhanced version with new and changed data
• Questions are changing, new topics, ...
→ We do a lot, but not just replication!
• Even changes for "archived data", like a change in the coding scheme of ISCO
SOEP is not one dataset but a complex data structure
• The SOEP currently (User DVD) consists of:
  – More than 320 data files
  – About 40,000 variables
• Granularity to choose for citation?
  – Complete SOEP distribution of one year?
  – "Connected" SOEP parts, e.g. individual questionnaires, HH-questionnaires, generated datasets
  – Each data file
  – Each variable (for each year or only once, longitudinal concept?)
"The SOEP" is available in different versions
• European users: 100% version (English, German, different formats for SAS/SPSS/Stata/ASCII)
• Non-EU users: 95% version (of cases)
• International comparative research: part of the CNEF (Cross National Equivalent File)
• SOEP Geocodes (supplementary CD): Regional Planning Regions, community types, etc.
• Country codes, community codes, zip codes, microm: only by remote execution or at the Research Data Center (RDC SOEP)
• SOEP Pretests
• SOEP Related Studies
SOEP can change during the period, because of updates
• Updates of weighting schemes or even bug fixes (also possible for older waves)
• Sometimes more than one update between distributions (cumulative updates?)
• How can a user know what version she is using?
  – Message-Digest Algorithm (MD5)
  – Secure Hash Algorithm (SHA-2)
  – Universal Numeric Fingerprint (UNF)
• Does rounding matter?
  – German/English labels, different formats (SPSS, Stata, …)
  – Only update of a label bug?
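The trade-off between a file-level hash and a content-level fingerprint can be illustrated with a minimal Python sketch. It assumes two hypothetical text exports of the same values. The function `rounded_fingerprint` is an illustrative stand-in that rounds values before hashing; it is not the actual UNF algorithm, and the number of significant digits is an assumption.

```python
import hashlib


def sha256_bytes(data: bytes) -> str:
    """Hash the raw bytes of a file: any change in format or label changes the digest."""
    return hashlib.sha256(data).hexdigest()


def rounded_fingerprint(values, digits=7):
    """Illustrative content fingerprint (NOT the real UNF algorithm).

    Each value is written with a fixed number of significant digits,
    so differences in file formatting do not affect the result.
    """
    normalized = "\n".join(f"{v:.{digits - 1}e}" for v in values)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical exports: the same two income values written by two different tools.
export_a = b"1500.5\n2300\n"
export_b = b"1500.50\n2300.00\n"

values_a = [float(x) for x in export_a.decode("ascii").split()]
values_b = [float(x) for x in export_b.decode("ascii").split()]

# The byte-level hashes differ although the data are identical ...
files_match = sha256_bytes(export_a) == sha256_bytes(export_b)
# ... while the rounded fingerprint treats both exports as the same content.
content_match = rounded_fingerprint(values_a) == rounded_fingerprint(values_b)

print("file hashes match:", files_match)
print("content fingerprints match:", content_match)
```

A byte-level hash therefore needs one published digest per file and format, while a content-level fingerprint has to fix a rounding rule, which is where the question "does rounding matter?" comes in.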
Conclusions
• Nesting of DOI should be possible:
  Print                      DOI        SOEP example                          SOEP DOI
  Edited book                Survey     SOEP DVD                              10.1000/soep.26
  Article in book            Data file  SOEP dataset $PGEN                    10.1000/soep.26.hgen
  Table in article in book   Variable   SOEP dataset $pgen variable ihinc$$   10.1000/soep.26.hgen.ihinc
• It should be possible for a user to identify the data, including version
  – The metadata of a DOI should include a hash for each data file and format, computed with an algorithm that is itself persistent, such as SHA-2
• Commitment about the persistence of the data provider
• It is not enough to identify the data source to make a scientific empirical analysis reproducible; you normally also need the syntax
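The first two conclusions can be sketched together in Python, as a rough illustration rather than a proposed standard. The sketch assumes the dot-separated suffix scheme from the example DOIs on this slide, with the first two suffix parts (`soep.26`) forming the distribution level. The metadata field names (`doi`, `parents`, `checksums`) and the file contents are hypothetical and do not follow any registration agency's schema.

```python
import hashlib


def doi_levels(doi: str) -> list:
    """Return the DOI of each nesting level, from the distribution down to the given DOI.

    Assumes the slide's example scheme: prefix/study.distribution[.datafile[.variable]].
    """
    prefix, suffix = doi.split("/", 1)
    parts = suffix.split(".")
    return [f"{prefix}/{'.'.join(parts[:i])}" for i in range(2, len(parts) + 1)]


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_metadata(doi: str, files_by_format: dict) -> dict:
    """Build a hypothetical metadata record with one SHA-256 digest per format."""
    return {
        "doi": doi,
        "parents": doi_levels(doi)[:-1],
        "checksums": {fmt: sha256_hex(content)
                      for fmt, content in sorted(files_by_format.items())},
    }


def identify_format(metadata: dict, content: bytes):
    """Return the format whose published digest matches the user's file, or None."""
    digest = sha256_hex(content)
    for fmt, published in metadata["checksums"].items():
        if published == digest:
            return fmt
    return None


# Placeholder bytes standing in for the same data file in two formats.
files = {"stata": b"hypothetical Stata file", "spss": b"hypothetical SPSS file"}
metadata = build_metadata("10.1000/soep.26.hgen", files)

print(doi_levels("10.1000/soep.26.hgen.ihinc"))
print(identify_format(metadata, files["stata"]))
print(identify_format(metadata, b"file after an unrecorded update"))
```

With such a record, a user could check a local file against the published digests to see whether it is the cited version, and a variable-level DOI could be resolved upward to its data file and distribution.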
Thank you for your attention!