NCRM, Session 27, 1 July 2008
Session 27: Resources for Data Management and Handling Social Science Data
3rd ESRC Research Methods Festival, Oxford, 1 July 2008
Workshop organised by the ‘Data Management through e-Social Science’ (DAMES) research Node of the National Centre for e-Social Science
www.dames.org.uk / www.ncess.ac.uk
Resources for Data Management and Handling Social Science Data
1400-1430 Key issues, concerns, and the relevance of e-Science (Paul Lambert, Univ. Stirling)
1430-1500 Metadata, say what? (Jesse Blum, Univ. Stirling)
1500-1530 Software for Data Management: The Contribution of Stata (Karen Robson, Geary Inst., Univ. College Dublin)
1600-1630 Helping users see the wood for the trees: ESDS resources for managing and analysing data (Beate Lichtwardt, Univ. Essex)
1630-1700 Social Care Data: Exploring Issues (Alison Dawson & Alison Bowes, Univ. Stirling)
1700-1730 Handling data on occupations, educational qualifications, and ethnicity (Paul Lambert, Univ. Stirling)
Data management & handling social science data: Key issues, concerns, & the relevance of e-Science
1) The nature of data management
2) Key issues and concerns
   • Good habits and principles
   • Challenges
3) The contributions of…
   • e-Social Science
   • the DAMES Node (www.dames.org.uk)
‘Data management’ means… ‘the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis’ […DAMES research Node..]
Usually performed by social scientists themselves
Most overt in quantitative survey data analysis
• ‘variable constructions’, ‘data manipulations’
• navigating abundance of data – thousands of variables
Usually a substantial component of the work process
Some components…
Manipulating data
• Recoding categories / ‘operationalising’ variables
Linking data
• Linking related data (e.g. longitudinal studies)
• Combining / enhancing data (e.g. linking micro- and macro-data)
Secure access to data
• Linking data with different levels of access permission
• Detailed access to micro-data cf. access restrictions
Harmonisation standards
• Approaches to linking ‘concepts’ and ‘measures’ (‘indicators’)
• Recommendations on particular ‘variable constructions’
Cleaning data
• ‘Missing values’; implausible responses; extreme values
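As an illustration of the ‘cleaning data’ component, here is a minimal Python sketch (not the SPSS/Stata syntax used elsewhere in these slides). The missing-value codes, the plausible range and the function name `clean_income` are all assumptions made for the example; a real survey documents its own codes and ranges.

```python
# Hypothetical missing-value codes; check the survey documentation for real ones.
MISSING_CODES = {-9, -7}


def clean_income(value, low=0, high=1_000_000):
    """Return (cleaned_value, flag) for one raw response.

    The plausible range [low, high] is an assumption for illustration.
    """
    if value is None or value in MISSING_CODES:
        return None, "missing"
    if not (low <= value <= high):
        return None, "implausible"
    return value, "ok"


# Hypothetical raw responses.
raw = [1200, -9, 45000, -7, 99999999, 300]
cleaned = [clean_income(v) for v in raw]
flags = [flag for _, flag in cleaned]
```

Keeping the flag alongside the cleaned value records why each case was set aside, which supports the paper trail discussed later in the talk.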
Example – recoding data
Count: highest educational qualification (rows) by educ4 (columns)
educ4 columns: -9.00; 1.00 Degree; 2.00 Diploma; 3.00 Higher school or vocational; 4.00 School level or below

Highest educational qualification   -9.00   1.00   2.00   3.00   4.00   Total
-9 Missing or wild                    323      0      0      0      0     323
-7 Proxy respondent                   982      0      0      0      0     982
1 Higher Degree                         0    425      0      0      0     425
2 First Degree                          0   1597      0      0      0    1597
3 Teaching QF                           0      0    340      0      0     340
4 Other Higher QF                       0      0   3434      0      0    3434
5 Nursing QF                            0      0    161      0      0     161
6 GCE A Levels                          0      0      0   1811      0    1811
7 GCE O Levels or Equiv                 0      0      0      0   2518    2518
8 Commercial QF, No O Levels            0      0      0    331      0     331
9 CSE Grade 2-5, Scot Grade 4-5         0      0      0      0    421     421
10 Apprenticeship                       0      0      0    257      0     257
11 Other QF                           102      0      0      0      0     102
12 No QF                                0      0      0      0   2787    2787
13 Still At School No QF              138      0      0      0      0     138
Total                                1545   2022   3935   2399   5726   15627
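The recode behind this cross-tabulation can be written as an explicit lookup table. The following is a minimal Python sketch rather than the syntax used in the original analysis: the mapping and the category counts are read off the table, while the names `EDUC4_MAP`, `SOURCE_COUNTS` and `recode_educ4` are hypothetical.

```python
from collections import Counter

# Source qualification code -> educ4 code, as implied by the cross-tabulation.
EDUC4_MAP = {
    -9: -9, -7: -9, 11: -9, 13: -9,  # missing, proxy, other QF, still at school
    1: 1, 2: 1,                      # Degree
    3: 2, 4: 2, 5: 2,                # Diploma
    6: 3, 8: 3, 10: 3,               # Higher school or vocational
    7: 4, 9: 4, 12: 4,               # School level or below
}


def recode_educ4(code):
    """Recode one source value; an undocumented code raises KeyError."""
    return EDUC4_MAP[code]


# Row totals from the table, used here in place of case-level data.
SOURCE_COUNTS = {
    -9: 323, -7: 982, 1: 425, 2: 1597, 3: 340, 4: 3434, 5: 161, 6: 1811,
    7: 2518, 8: 331, 9: 421, 10: 257, 11: 102, 12: 2787, 13: 138,
}

educ4_counts = Counter()
for code, n in SOURCE_COUNTS.items():
    educ4_counts[recode_educ4(code)] += n
```

Aggregating the source counts through the mapping should reproduce the column totals of the table, which is a quick way to check that a recode did what was intended.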
Example – Linking data
Linking via ‘ojbsoc00’: c1-5 = original data / c6 = derived from data / c7 = derived from www.camsis.stir.ac.uk
A bit of focus…
I tend to emphasise two data management activities:
1) Variable constructions
   o Coding and re-coding values
2) Linking datasets
   o Internal and external linkages
So why this workshop?
1. DM is a big part of the research process
   ..but receives limited methodological attention
2. Poor practice in soc. sci. DM is easily observed
   • Not keeping adequate records
   • Not linking relevant data
   • Not trying out relevant variable operationalisations
3. Even though..
   • There are plenty of existing resources and standards relevant to data management activities
   • There are suitable software and internet facilities
   • People are working on DM support (e.g. DAMES)
DAMES research Node
Social researchers often spend more time on data management than any other part of the research process
[Diagram: stages of the research process: Data access / collection; Data Management; Data Analysis]
Resources shown in the diagram: UK Data Archive; Qualidata; Flagship social surveys; Office for National Statistics; Administrative data; Specialist academic outputs; DAMES; ONS support; ESDS support; NCRM workshops; Essex summer school; ESRC RDI initiatives; CQeSS
DM: Some further considerations
DM as stumbling block in research conduct
• UK has ample data, ample analytical resources, but low levels of exploitation (esp. of complex data)
• Capacity building aims in DAMES
Lots of previous work in this field
• ..See below..
‘Data management’ also sometimes means..
• Data distributors supplying and monitoring use of particular datasets (e.g. UK Data Archive DM guides)
2. Key issues and concerns
(4) good habits and principles
(3) Challenges
..Not solely about survey research..
(2.1) Good habit: Keep clear records of your DM activities
• Reproducible (for self)
• Replicable (for all)
• Paper trail for whole lifecycle
• Cf. Dale 2006; Freese 2007
In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata)
Syntax Examples: www.longitudinal.stir.ac.uk
Stata syntax example (‘do file’)
Some comments on survey analysis software..
“A program like SPSS .. has two main components: the statistical routines, .. and the data management facilities. Perhaps surprisingly, it was the latter that really revolutionised quantitative social research” [Procter, 2001: 253]
“Socio-economic processes require comprehensive approaches as they are very complex (‘everything depends on everything else’). The data and computing power needed to disentangle the multiple mechanisms at work have only just become available.” [Crouchley and Fligelstone 2004]
Some personal comments on survey analysis software..
Data management and data analysis must be seen as integrated processes
Stata is the most effective software, as it achieves advanced DM and DA functionality and makes good documentation easy
Others argue that more advanced analytical techniques necessitate other packages – I’m not convinced
(2.2) Principle: Use existing standards and previous research
Variable operationalisations
Use recognised recodes / standard classifications
• ONS harmonisation standards
• [Shaw et al. 2007]
• Cross-national standards [Hoffmeyer-Zlotnik & Wolf 2003]
Use reproducible recodes / classifications (paper trail)
Other data file manipulations
• Missing data treatments
• Matching data files (finding the right data)
(2.3) Principle: Do something, not nothing
We currently put much more effort into data collection and data analysis, and neglect data manipulation
Survey research – the influence of ‘what was on the archive version’
…In my experience, a common reason why people didn’t do more DM was because they were frightened to…
(2.4) Principle: Learn how to match files
Complex data (complex research) is distributed across different files
In surveys, use a key linking variable for...
One-to-one matching
  SPSS: match files /file="file1.sav" /file="file2.sav" /by=pid.
  Stata: merge pid using file2.dta
         (Stata 11 and later: merge 1:1 pid using file2.dta)
One-to-many matching (‘table distribution’)
  SPSS: match files /file="file1.sav" /table="file2.sav" /by=pid.
  Stata: merge pid using file2.dta
         (Stata 11 and later: merge m:1 pid using file2.dta)
Many-to-one matching (‘aggregation’)
  SPSS: aggregate outfile="file3.sav" /break=pid /meaninc=mean(income).
  Stata: collapse (mean) meaninc=income, by(pid)
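The three kinds of matching can also be sketched outside a statistics package. The example below is a minimal Python illustration using only the standard library, not a replacement for the SPSS/Stata commands above. All records, the household identifier `hid`, and the helper names `lookup_merge` and `aggregate_mean` are hypothetical; only `pid`, `income` and `meaninc` come from the slide.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical example records.
persons = [
    {"pid": 1, "hid": 10, "age": 34},
    {"pid": 2, "hid": 10, "age": 51},
    {"pid": 3, "hid": 20, "age": 27},
]
incomes = [{"pid": 1, "income": 21000}, {"pid": 2, "income": 18000},
           {"pid": 3, "income": 30000}]
households = [{"hid": 10, "region": "Scotland"}, {"hid": 20, "region": "Wales"}]
waves = [{"pid": 1, "income": 20000}, {"pid": 1, "income": 22000},
         {"pid": 2, "income": 18000}]


def lookup_merge(master, table, key):
    """Attach the columns of `table` to each row of `master`.

    `table` must hold at most one row per key value; `master` may repeat keys.
    """
    index = {}
    for row in table:
        if row[key] in index:
            raise ValueError(f"duplicate key in lookup table: {row[key]!r}")
        index[row[key]] = row
    # Master rows without a match are kept unchanged.
    return [{**row, **index.get(row[key], {})} for row in master]


def aggregate_mean(rows, key, var, out):
    """Collapse `rows` to one record per key, holding the mean of `var`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[var])
    return [{key: k, out: mean(v)} for k, v in groups.items()]


one_to_one = lookup_merge(persons, incomes, "pid")            # one row per pid in both
one_to_many = lookup_merge(persons, households, "hid")        # table distribution
many_to_one = aggregate_mean(waves, "pid", "income", "meaninc")  # aggregation
```

One-to-one and one-to-many matching share the same lookup here; what differs is whether the key repeats in the master file. Refusing duplicate keys in the lookup table is a design choice that catches a common linkage error early.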
Some challenges for data management..
(2.5) Agreeing about variable constructions
Unresolved debates about optimal measures and variables
Esp. in comparative research such as across time, between countries
http://www.longitudinal.stir.ac.uk/variables/
Some challenges for data management..
(2.6) Worrying about data security
DM activities could challenge data security
• Inspecting individual cases
• Multiple copies of related data files
• Ability to link with other datasets
• ‘Hands-on’ model of data review
New and exciting data resources
• have more individual information
• are more likely to be released with stringent conditions
• may jeopardize traditional DM approaches
Some challenges for data management..
(2.7) Incentivising documentation / replicability
There is little to press researchers to better document DM, but much to press them not to
• Make DM and its documentation easier?
• Reward documentation (e.g. citations)?
3) The relevance of e-Science
‘Data management through e-Social Science’
‘E-Science’ refers to adopting a number of particular approaches and standards from computing science, to applied research areas
These approaches include ‘the Grid’; distributed computing; data and computing standardisation; metadata; security; research infrastructures
DAMES (2008-11) – developing services / resources using e-Science approaches which will help social scientists in undertaking data management tasks
E-Science and Data Management
E-Science isn’t essential to good DM, but it has capacity to improve and support conduct of DM…
1) Concern with standards setting in communication and enhancement of data
2) Linking distributed/heterogeneous/dynamic data
   • Coordinating disparate resources; interrogating live resources
3) Contribution of metadata tools/standards for variable harmonisation and standardisation
4) Linking data subject to different security levels
5) The workflow nature of many DM tasks
E.g. of GEODE: Organising and distributing specialist data resources (on occupations)
The contribution of DAMES
8 project themes
1.1) Grid Enabled Specialist Data Environments (‘GE*DE’)
1.2) Data resources for micro-simulation on social care data
1.3) Linking e-Health and social science databases
1.4) Training and interfaces for management of complex survey data
2.1) Description, discovery & service use through metadata and data abstraction
2.2) Techniques to handle data from multiple sources
2.3) Workflow modelling for social science
2.4) Security driven data management
DAMES agenda
Useful social science provisions
• Specialist data topics – occupations; education qualifications; ethnicity; social care; health
• Mainstream packages and accessible resources
To exploit / engage with existing DM resources
• In social science – e.g. CESSDA
• In e-Science – e.g. OGSA-DAI; OMII
..End of talk 1..
Appendix
Existing resources – sources and types of support for data management in the social sciences:
Existing resources (i): Data providers
a) Documentation and metadata files
Existing resources (i): Data providers
b) Resources for variables
• CESSDA PPP on key variables http://www.nsd.uib.no/cessda/project/
• UK Question Bank http://qb.soc.surrey.ac.uk/
• ONS Harmonisation http://www.statistics.gov.uk/about/data/
c) Resources for datasets
• UK Census data portal, http://census.ac.uk/
• IPUMS international census data facilities, www.ipums.org
• European Social Survey, www.europeansocialsurvey.org
d) Data manipulations prior to data release
• Missing data imputation / documentation
• Survey design / weighting information
• Influential – most analysts use ‘the archive version’
Existing resources (ii) Resource projects / infrastructures
- UK ESDS www.esds.ac.uk
  ESDS International | ESDS Government | ESDS Longitudinal | ESDS Qualidata
- Helpdesks; online instructions; user support..
- UK ESRC NCRM / NCeSS / RDI initiatives
  - Longitudinal data - www.longitudinal.stir.ac.uk
  - Linking micro/macro - www.mimas.ac.uk/limmd/
- Other resources / projects / initiatives
  - EDACwowe - http://recwowe.vitamib.com/datacentre
  - ….
Existing resources (iii) Analytical and software support
Textbooks featuring data management
• [Levesque 2008]
• [Sarantakos 2007]
Software training covering DM
• Stata’s ‘data management’ manual
• SPSS user group course on syntax and data management, www.spssusers.co.uk
But generally, sustained marginalisation of DM as a topic
• Advanced methods texts use simplistic data
• Advanced software for analysis isn’t usually combined with extended DM requirements
Existing resources (iv) Data analysts’ contributions
Academic researchers often generate and publish their own DM resources, e.g.
Harry Ganzeboom on education and occupations, http://home.fsw.vu.nl/~ganzeboom/pisa/
Provision of whole or partial syntax programming examples
Analysts often drive wider resource provisions related to DM
CAMSIS project on occupational scales, www.camsis.stir.ac.uk
CASMIN project on education and social class
Existing resources (v) Literatures on harmonisation and standardisation
National Statistics Institutes’ principles and practices
E.g. ONS www.statistics.gov.uk/about/data/harmonisation/
Cross-national organisations
E.g. UNSTATS - http://unstats.un.org/unsd/class/
Academic studies
E.g. [Harkness et al 2003]; [Hoffmeyer-Zlotnik & Wolf 2003]; [Jowell et al. 2007]
References
Blossfeld, H. P., & Rohwer, G. (2002). Techniques of Event History Modelling: New Approaches to Causal Analysis, 2nd Edition. Mahwah, NJ: Lawrence Erlbaum Associates.
Crouchley, R., & Fligelstone, R. (2004). The Potential for High End Computing in the Social Sciences. Lancaster: Centre for Applied Statistics, Lancaster University, and http://redress.lancs.ac.uk/document-pool/hecsspotential.pdf.
Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158.
Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2).
Harkness, J., van de Vijver, F. J. R., & Mohler, P. P. (Eds.). (2003). Cross-Cultural Survey Methods. New York: Wiley.
Hoffmeyer-Zlotnik, J. H. P., & Wolf, C. (Eds.). (2003). Advances in Cross-national Comparison: A European Working Book for Demographic and Socio-economic Variables. Berlin: Kluwer Academic / Plenum Publishers.
Jowell, R., Roberts, C., Fitzgerald, R., & Eva, G. (2007). Measuring Attitudes Cross-Nationally. London: Sage.
Levesque, R., & SPSS Inc. (2008). Programming and Data Management for SPSS 16.0: A Guide for SPSS and SAS Users. Chicago, IL: SPSS Inc.
Procter, M. (2001). Analysing Survey Data. In G. N. Gilbert (Ed.), Researching Social Life, Second Edition (pp. 252-268). London: Sage.
Sarantakos, S. (2007). A Tool Kit for Quantitative Data Analysis Using SPSS. London: Palgrave Macmillan.
Shaw, M., Galobardes, B., Lawlor, D. A., Lynch, J., Wheeler, B., & Davey Smith, G. (2007). The Handbook of Inequality and Socioeconomic Position: Concepts and Measures. Bristol: Policy Press.