preparing data for sharing: the fair principles
TRANSCRIPT
PREPARING DATA FOR
SHARING
The FAIR Principles
Gareth Knight
London School of Hygiene & Tropical Medicine
ADMIT Network Meeting
01 December 2015
FAIR Principles
Findable
• Descriptive metadata
• Persistent Identifiers
Accessible
• Determining what to share
• Participant consent and risk management
• Access status
Interoperable
• XML standards
• Data Documentation Initiative
• CDISC
Reusable
• Rights and licence models
• Permitted and non-permitted use
http://datafairport.org/
Make your data:
• Findable
• Accessible
• Interoperable
• Reusable
Data Sharing in the sciences
• Data sharing has always taken place in some form
• The Enlightenment of the 17th to 18th centuries built upon open debate and sharing of knowledge
• Science depends on openness and transparency to advance
– Replicate results
– Correct errors & address bias
• Negative as well as positive findings need to be in the public domain
“Systematic Dictionary of the Sciences, Arts, and Crafts”
Diderot & d'Alembert (1751 onwards)
Data Sharing in the News
“To make progress in science, we need to be open and share.”
Neelie Kroes (2012), Vice-President of the European Commission
http://europa.eu/rapid/press-release_SPEECH-12-258_en.htm
Key Motivators
• Research / Policy development
• Ensure validity
• Funder requirement
• Publisher requirements
Data reuse improves citation rate
• Studies that made data available in a public repository received 9% more citations than similar studies where data was not available
• Creators tend to cite their own data for up to 2 years
• Third party use grew over time: for 100 datasets deposited in year 0,
– 40 reuse papers in PubMed in year 2
– 100 by year 4
– 150+ by year 5.
Piwowar, H.A. & Vision, T.J. (2013). Data reuse and the open data citation advantage. https://peerj.com/articles/175/
Study of 10,557 articles published between 2001 and 2009 that collected gene expression microarray data
Plan for Sharing
Data Management Plan
• Data to be produced
• Management approach
• Sharing approach
– In what form?
– When will it take place?
– How will it be shared?
Planning
Data Collection
Database Setup
Data Capture
Data Processing & curation
Archiving & sharing
https://globalhealthdatamanagement.tghn.org/data-dudes/tools-templates/
DATA
DISCOVERY
Is your data findable?
Discovery Metadata
• Descriptive metadata created to describe key attributes of data:
– Title
– Creator
– Content description
• Data repositories/journals capture and publish discovery metadata in several formats (DC, DataCite, DDI)
• Metadata ‘harvested’ by research data catalogues & search engines
• Metadata available to all, even if data is not
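As a rough illustration of the discovery metadata described above, the sketch below builds a tiny Dublin Core-style record holding a title, creator and content description. The `record` wrapper element, the function name and the sample values are assumptions made for this example; real repositories define their own record envelopes and usually require more fields.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set ("DC" in the slide above).
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(title, creator, description, identifier=None):
    """Return a minimal Dublin Core-style record as an XML string.

    The <record> wrapper is a hypothetical container chosen for this sketch.
    """
    root = ET.Element("record")
    fields = (
        ("title", title),
        ("creator", creator),
        ("description", description),
        ("identifier", identifier),
    )
    for name, value in fields:
        if value is not None:  # skip optional fields that were not supplied
            ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


# Sample values are invented for illustration only.
record = build_dc_record(
    title="Example survey dataset",
    creator="Example Research Group",
    description="Anonymised responses from a hypothetical household survey.",
)
print(record)
```

A record like this can be published and harvested even when the underlying data files remain under controlled access.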
Registry of Research Data Repositories
http://service.re3data.org
Citing Data
• Research data are a citable resource, same as papers & books
• 44-75 days is the estimated average lifespan of web URLs
• A unique, long-term identifier is necessary to enable citation
• Many persistent ID systems developed to solve problem
– DOI, Handle, ARK, etc.
• Data citation in reports and publications
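To make the role of the persistent identifier concrete, here is a small sketch that assembles a data citation ending in a DOI link. The field order is one common pattern rather than a prescribed style, and the function name, the sample dataset and the DOI `10.1234/example` are placeholders invented for this example.

```python
def format_data_citation(creator, year, title, publisher, doi):
    """Assemble a simple data citation that resolves through a DOI.

    The layout is an assumption for illustration; check the citation
    guidance of the repository or journal you are using.
    """
    return f"{creator} ({year}). {title} [dataset]. {publisher}. https://doi.org/{doi}"


# All values below are placeholders, not a real dataset.
citation = format_data_citation(
    creator="Example Research Group",
    year=2015,
    title="Example survey dataset",
    publisher="Example Data Repository",
    doi="10.1234/example",
)
print(citation)
```

Because the citation points at the identifier rather than a web address, it keeps working if the repository reorganises its URLs.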
UK Data Service: Citing Data
https://www.ukdataservice.ac.uk/use-data/citing-data
DATA
ACCESS
Do you have permission to share? If so, what?
Data Selection
Motivation
• Meet funder / journal obligations
• Encourage research use
• Higher citation rate
• Reproduce & validate results
Constraints
• Concern that sharing will attract a lower response rate or that people will be less honest
• Intellectual Property Rights issues
• Participant consent doesn’t address sharing
• Data Protection legislation
Data sharing decisions built upon recognition of all influencing factors
Information Commissioner Office. Data Sharing Code of Practice
http://www.ico.org.uk/for_organisations/data_protection/topic_guides/data_sharing/
Handling individual level data
• Collected and analysed for specific purpose
• Stored no longer than is necessary
• Kept securely and safely to prevent unauthorised or unlawful access, processing, loss, or destruction
EU Data Protection Directive 95/46/EC establishes limitations on how information on living individuals is held and used
Reform of the data protection legal framework in the EU
http://ec.europa.eu/justice/data-protection/reform/index_en.htm
Informed Consent
Covered data:
• Variables
• Anonymised / identifiable
Allowed activities:
• Use in current project, e.g. topics
• Preserve and archive with 3rd party
• Future research – access & use
Communication method:
• Information Sheet
• F2f discussion
Time period for decision:
• Prior to capture
• Following capture & review
https://globalhealthtrainingcentre.tghn.org/articles/informed-consent/
http://retractionwatch.com/2014/02/05/journal-and-authors-apologize-unreservedly-for-distress-caused-to-deceased-childs-family-by-case-report/
Data Sharing as a barrier
Investigation of influence of open data policies on consent rate:
• No participants declined to participate, regardless of condition
• Rates of drop-out vs completion did not vary between open/non-open policies
• No significant change in potential consent rates when participants were openly asked about the influence of open data policies on their likelihood of consent.
Some researchers consider sharing obligations to be a barrier to research participation
Risk Management
Assess likelihood that data can be used to:
• Identify a person directly
• Infer information about a person
• Link records relating to person to other info
Determine action to address issue:
• Randomisation - noise addition, permutation
• Generalisation - aggregating results, limiting geographic details
• Pseudonymisation - hash functions
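The three families of techniques listed above can be sketched in a few lines. This is an illustration under stated assumptions, not a vetted anonymisation procedure: all function names are invented here, the band width and noise scale are arbitrary, and the pseudonymisation step uses a keyed hash (HMAC-SHA256) rather than a bare hash function, which is a deliberate substitution because a plain hash of a guessable identifier can often be reversed.

```python
import hashlib
import hmac
import random


def pseudonymise(participant_id, secret_key):
    """Pseudonymisation: replace an ID with a keyed hash (HMAC-SHA256).

    The key must be kept secret and stored separately from the shared data.
    """
    return hmac.new(secret_key, participant_id.encode("utf-8"), hashlib.sha256).hexdigest()


def generalise_age(age, band_width=10):
    """Generalisation: report an age band instead of the exact age."""
    lower = (age // band_width) * band_width
    return f"{lower}-{lower + band_width - 1}"


def add_noise(value, scale, rng):
    """Randomisation (noise addition): shift a value by at most +/- scale."""
    return value + rng.uniform(-scale, scale)


def permute(values, rng):
    """Randomisation (permutation): shuffle a column so values detach from rows."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(0)            # fixed seed only to make the demo repeatable
key = b"hypothetical-secret-key"  # placeholder; never hard-code a real key

print(pseudonymise("participant-001", key))
print(generalise_age(37))
print(add_noise(62.5, 2.0, rng))
print(permute([21, 34, 58], rng))
```

None of these steps guarantees anonymity on its own; the risk assessment above still has to consider whether records can be linked to other information.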
Is there a risk of sharing personal or sensitive information?
UK Information Commissioner Office: Anonymisation Code of Practice
http://www.ico.org.uk/for_organisations/data_protection/topic_guides/anonymisation
https://www.flickr.com/photos/estherase/2190068148
When anonymisation goes wrong
New York City Taxi & Limousine Commission released an anonymised 20 GB file on 173 million journeys under FOI
Drivers' Hack License & Medallion numbers were re-generated, revealing drivers' annual income
Identify home address and destinations of residents
Identify journeys made by celebrities?
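The weakness can be illustrated with a short sketch. The reports linked from this slide describe the identifiers as having been hashed with MD5 without a salt; when the space of possible identifiers is small, every candidate can simply be hashed and looked up. The identifier format used here (digit, letter, digit, digit) is an assumption chosen to keep the example small, and the value `5X41` is made up.

```python
import hashlib
from itertools import product
from string import ascii_uppercase, digits


def md5_hex(text):
    """Unsalted MD5 of an identifier, hex encoded."""
    return hashlib.md5(text.encode("ascii")).hexdigest()


def build_lookup():
    """Hash every identifier in the assumed format: digit, letter, digit, digit."""
    table = {}
    for d1, letter, d2, d3 in product(digits, ascii_uppercase, digits, digits):
        candidate = f"{d1}{letter}{d2}{d3}"
        table[md5_hex(candidate)] = candidate
    return table


lookup = build_lookup()        # 10 * 26 * 10 * 10 = 26,000 candidates

released = md5_hex("5X41")     # what a "de-identified" file would contain
recovered = lookup[released]   # reverse the hash by table lookup
print(released, "->", recovered)
```

Building the table takes well under a second on ordinary hardware, which is why hashing a small, structured identifier space does not protect the people behind it.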
http://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset/
http://arstechnica.com/tech-policy/2014/06/poorly-anonymized-logs-reveal-nyc-cab-drivers-detailed-whereabouts/
Access Status
Control method
• Data Transfer Agreement
• Access controls
Application process:
• Request form
• Review process
Access criteria:
• Permitted users – how do you identify?
• Permitted use – topic, academic use
• Other criteria: encryption, time period
Open vs. controlled access
https://www.flickr.com/photos/toruokada/16958186672/
DATA
INTEROPERABILITY
Can data be analysed and harmonized?
Data Standards
Data exchange is dependent upon:
• Open formats
• Common standards
• Documented metadata specification
• Consistent vocabulary
• Documented workflows
https://biosharing.org/
Clinical Data Interchange Standards Consortium
Standards intended to improve consistency across the clinical trial lifecycle
• Protocol: Protocol Representation Model
• Data Collection: Clinical Data Acquisition Standards Harmonization (CDASH)
• Data Tabulation: Study Data Tabulation Model (SDTM)
• Data Analysis: Analysis Data Model (ADaM)
• Archiving and exchange: Operational Data Model (ODM) and Define-XML
Data Documentation Initiative
• Maintained & developed by DDI Alliance
• Supported by data archives, producers, research data centers, university data libraries, statistics organizations, etc.
• Two versions:
– DDI2 / Codebook: An archived instance of a study
– DDI3 / DDI Lifecycle: Suitable for longitudinal and repeated surveys
An XML-based metadata standard developed for social science and economic statistics
http://www.ddialliance.org/
Survey Data Model
[Diagram. Entities: Study, Concepts, Survey Instruments, Questions, Universes, Responses, Variables, Categories/Codes, Numbers, Data Files. Relationship labels: measures, using, made up of, about, collect, resulting in, with values of, comprised of.]
Slide source: https://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.33/2011/mtg2/WP_1_Arofan.ppt
DDI Codebook
A codeBook consists of:
1. docDscr: describes the DDI document
2. stdyDscr: Title, abstract, methodologies, agencies, access policy
3. fileDscr: a description of files in the dataset
4. dataDscr: variables (name, code, etc.), variable groups, cubes
5. othMat: other related materials, e.g. document citation
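The five sections listed above can be pictured as a skeleton XML tree. The sketch below is illustrative only and not schema-valid: it omits the DDI namespace, required attributes and most content, and the nesting below the five top-level sections (`citation`, `titlStmt`, `titl`, `var`, `labl`) reflects my understanding of DDI Codebook and should be checked against the published schema. The function name and sample values are invented.

```python
import xml.etree.ElementTree as ET


def build_codebook_skeleton(study_title, variables):
    """Build an illustrative codeBook tree with the five top-level sections.

    `variables` maps a variable name to a human-readable label.
    """
    code_book = ET.Element("codeBook")

    # 1. docDscr: describes the DDI document itself (left empty here)
    ET.SubElement(code_book, "docDscr")

    # 2. stdyDscr: study-level description; only the title is filled in
    study = ET.SubElement(code_book, "stdyDscr")
    title_stmt = ET.SubElement(ET.SubElement(study, "citation"), "titlStmt")
    ET.SubElement(title_stmt, "titl").text = study_title

    # 3. fileDscr: description of the files in the dataset (left empty here)
    ET.SubElement(code_book, "fileDscr")

    # 4. dataDscr: one entry per variable
    data = ET.SubElement(code_book, "dataDscr")
    for name, label in variables.items():
        var = ET.SubElement(data, "var", name=name)
        ET.SubElement(var, "labl").text = label

    # 5. othMat: other related materials (left empty here)
    ET.SubElement(code_book, "othMat")
    return code_book


# Sample values are invented for illustration.
codebook = build_codebook_skeleton(
    "Example household survey", {"V101": "Age at last birthday"}
)
xml_text = ET.tostring(codebook, encoding="unicode")
print(xml_text)
```

In practice a codebook would be produced with one of the DDI tools rather than assembled by hand.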
3 levels - Study, dataset, variable
Preserves the collection of files associated with an archival copy of a survey
DDI Lifecycle
http://www.ddialliance.org/what
Data Collector
Data Analyst
Data Curator
Secondary User
Each stage may be performed by different groups
DDI Metadata reuse
Basic metadata can be reused during study life:
• Concepts, questions, responses, variables, categories, codes, survey instruments, etc. may be adopted from earlier waves
Referencing earlier iterations:
• Unique identifier
• Version number - control over time
Common metadata ‘groups’ maintained by specific agencies:
• Schemes: lists of items of a single type
• Modules: metadata for a specific purpose or lifecycle stage
• All maintainable metadata has a known owner or agency
Unique ID example
urn="urn:ddi:3_0:VariableScheme.Variable=pop.umn.edu:STUDY0145_VarSch01(1_0).V101(1_1)"
• This is a URN
• From DDI Version 3.0
• For a variable
• The scheme agency is pop.umn.edu
• With identifier STUDY0145_VarSch01
• Version 1.0
• Variable ID is V101
• Version 1.1
http://www.iza.org/conference_files/eddi09/ppt/thomas_wendy_course.pdf
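The parts of the example URN can be pulled apart mechanically. The sketch below uses a regular expression derived only from the single example on this slide; it is not the full DDI URN grammar, and the function name and dictionary keys are invented for illustration.

```python
import re

# Pattern inferred from the one example URN above; an assumption, not the
# official DDI URN syntax.
DDI_URN_PATTERN = re.compile(
    r"^urn:ddi:(?P<ddi_version>\d+_\d+):"
    r"(?P<scheme_type>\w+)\.(?P<item_type>\w+)="
    r"(?P<agency>[^:]+):"
    r"(?P<scheme_id>[^()]+)\((?P<scheme_version>[\d_]+)\)\."
    r"(?P<item_id>[^()]+)\((?P<item_version>[\d_]+)\)$"
)


def parse_ddi_urn(urn):
    """Split a DDI 3.0-style URN into its named parts.

    Raises ValueError if the URN does not match the pattern above.
    """
    match = DDI_URN_PATTERN.match(urn)
    if match is None:
        raise ValueError(f"Unrecognised DDI URN: {urn!r}")
    parts = match.groupdict()
    # Versions are written with underscores in the URN (1_0 means 1.0).
    for key in ("ddi_version", "scheme_version", "item_version"):
        parts[key] = parts[key].replace("_", ".")
    return parts


example = "urn:ddi:3_0:VariableScheme.Variable=pop.umn.edu:STUDY0145_VarSch01(1_0).V101(1_1)"
parts = parse_ddi_urn(example)
for name, value in parts.items():
    print(f"{name}: {value}")
```

Carrying the agency, identifier and version in one string is what allows a later wave of a study to reference exactly the variable definition it is reusing.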
DDI Cross-study comparison
Variables are comparable if they possess same properties:
• Age is comparable if it has:
– Same concept (e.g., age at last birthday)
– Same top-level universe (people)
– Same representation (i.e., an integer from 0-99)
DDI Comparison module:
• Place similar items in same group and perform tailored comparison
• Mappings are context-dependent, i.e. sufficient for purposes of particular research
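The comparability rule above (same concept, same universe, same representation) can be expressed as a simple check. This is a sketch of the idea only, not the DDI Comparison module; the class, its fields and the sample variables are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableDescription:
    """Hypothetical summary of the properties that matter for comparison."""
    name: str
    concept: str         # e.g. "age at last birthday"
    universe: str        # e.g. "people"
    representation: str  # e.g. "integer 0-99"


def are_comparable(a, b):
    """Two variables are comparable if concept, universe and representation match."""
    return (a.concept, a.universe, a.representation) == (
        b.concept,
        b.universe,
        b.representation,
    )


# Invented variables from two hypothetical survey waves.
wave1_age = VariableDescription("AGE", "age at last birthday", "people", "integer 0-99")
wave2_age = VariableDescription("Q12_AGE", "age at last birthday", "people", "integer 0-99")
banded_age = VariableDescription("AGEGRP", "age at last birthday", "people", "10-year bands")

print(are_comparable(wave1_age, wave2_age))   # names differ, properties match
print(are_comparable(wave1_age, banded_age))  # representation differs
```

A strict equality test like this is the simplest case; as the slide notes, real mappings are context-dependent and may accept a looser match for a particular research purpose.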
DDI Tools
DDI Codebook:
• Nesstar Publisher & Server
• IHSN Microdata Management Toolkit
• Colectica
• NADA
• UKDA - DExT, ODaF DeXtris
DDI Lifecycle:
• Colectica Designer, Colectica for Excel, Colectica Portal
• Sledgehammer
DDI Tools
http://www.ddialliance.org/resources/tools
DATA
REUSE
Can data be used for further research?
Data Rights
• Many rights apply to data
– Copyright
– Moral
– Database
– Patents & trade secrets
• Rights issues vary between countries
• Ensure your project has clarified rights issues before sharing
https://www.flickr.com/photos/riekhavoc/4813140176/
Rights issues influence how data can be shared, used and cited
Data Licence Models
Many licence models exist, which can be applied at different granularity
• Creative Commons
• Open Data Commons
• GNU GPL, BSD and others for software
Do you have a standard Data Sharing Agreement within your institution?
A data licence outlines permitted & prohibited use
What secondary use is allowed?
http://www.bbc.co.uk/news/uk-scotland-tayside-central-14744240
http://www.theguardian.com/society/2011/sep/01/cigarette-university-smoking-research-information
FAIR data
Findable
• Describe your data in a data repository
• Apply a persistent identifier
Accessible
• Consider what will be shared
• Obtain participant consent & perform risk management
Interoperable
• Use open formats
• Consistent vocabulary
• Common metadata standards
Reusable
• Consider permitted use
• Apply appropriate licence
Thank You for your attention!
Questions