How to write a Data Management Plan
Sarah Jones (DCC), Marjan Grootveld (DANS)
both involved in EUDAT and OpenAIRE
This work is licensed under the Creative Commons CC-BY 4.0 licence
Who we are
OpenAIRE: Open Access Infrastructure for Research in Europe, www.openaire.eu
EUDAT: Research Data Services, Expertise & Technology, https://www.eudat.eu
Introduction to RDM
Joint webinar held on 26 May 2016 covering:
• Reasons to manage data
• Horizon 2020 Open Research Data Pilot
• How to manage and share data
• EUDAT & OpenAIRE services
Slides, webinar recording and Q&A document online:
www.openaire.eu/research-data-management-an-introductory-webinar-from-openaire-and-eudat
Overview
• What is a DMP and why write one?
• Requirements under Horizon 2020
• Example plans
• Lessons and guidance
WHAT IS A DMP & WHY WRITE ONE?
Image CC-BY-NC-SA by Leo Reynolds www.flickr.com/photos/lwr/13442910354
Data Management Plans
A DMP is a brief plan to define:
• how the data will be created
• how it will be documented
• who will be able to access it
• where it will be stored
• who will back it up
• whether (and how) it will be shared & preserved
DMPs are often submitted as part of grant applications, but are useful whenever researchers are creating data.
Why manage data?
NON PECUNIAE INVESTIGATIONIS CURATORE SED VITAE FACIMUS PROGRAMMAS DATORUM PROCURATIONIS
(Not for the research funder, but for life we make data management plans)
• Make your research easier
• Stop yourself drowning in irrelevant stuff
• Save data for later
• Avoid accusations of fraud or bad science
• Write a data paper
• Share your data for re-use
• Get credit for it
Research data lifecycle
• CREATING DATA: designing research, DMPs, planning consent, locating existing data, data collection and management, capturing and creating metadata
• PROCESSING DATA: entering, transcribing, checking, validating and cleaning data, anonymising data, describing data, managing and storing data
• ANALYSING DATA: interpreting & deriving data, producing outputs, authoring publications, preparing for sharing
• PRESERVING DATA: data storage, back-up & archiving, migrating to best format & medium, creating metadata and documentation
• GIVING ACCESS TO DATA: distributing data, sharing data, controlling access, establishing copyright, promoting data
• RE-USING DATA: follow-up research, new research, undertaking research reviews, scrutinising findings, teaching & learning
Ref: UK Data Archive: http://www.data-archive.ac.uk/create-manage/life-cycle
Planning trick 1: think backwards
What data organisation would a re-user like?
[Figure: the research data lifecycle, from CREATING DATA through to RE-USING DATA]
Data organisation exercises
Design a data organisation for the project (folder structure, file naming convention, …)
Research Data Netherlands data support training: http://datasupport.researchdata.nl/en/start-de-cursus/iii-onderzoeksfase/organising-data/
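One way to approach the exercise is to write the convention down as code, so that every project member produces the same folder names and file names. The sketch below is only an illustration: the folder names, the `<project>_<content>_<date>_v<NN>` pattern and the function names are assumptions, not a convention prescribed by the training material.

```python
import re
from datetime import date
from pathlib import Path

# Hypothetical top-level folders; adapt them to your own project.
FOLDERS = ["raw", "processed", "analysis", "docs"]


def create_layout(root: Path) -> list[Path]:
    """Create the project folders under root and return their paths."""
    paths = [root / name for name in FOLDERS]
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)
    return paths


def build_filename(project: str, content: str, day: date, version: int, ext: str) -> str:
    """Compose <project>_<content>_<YYYYMMDD>_v<NN>.<ext> from sanitised parts."""

    def clean(part: str) -> str:
        # Lower-case and replace anything that is not a letter or digit with "-".
        return re.sub(r"[^a-z0-9]+", "-", part.lower()).strip("-")

    suffix = ext.lstrip(".").lower()
    return f"{clean(project)}_{clean(content)}_{day:%Y%m%d}_v{version:02d}.{suffix}"


print(build_filename("My Project", "Interview notes", date(2016, 5, 26), 1, ".CSV"))
# -> my-project_interview-notes_20160526_v01.csv
```

Putting the date in year-month-day order and zero-padding the version number keeps files sorted chronologically in a plain directory listing, which is the kind of thing a later re-user tends to appreciate.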
Planning trick 2: include stakeholders
• Institution: RDM policy
• Facilities
• Research funders
• Publishers: Data Availability policy
• Commercial partners
Responsibilities in RDM: https://www.openaire.eu/briefpaper-rdm-infonoads
A DMP is about ‘keeping’ data
• Storing data ≠ archiving data
• Archived data ≠ findable data
• Findable ≠ accessible
• Accessible ≠ understandable
• Understandable ≠ usable
• A USB stick is not safe
• A persistent ID is essential but no guarantee of usability
• Data in a proprietary format is not sustainable
Making data FAIR
• Findable
  – Assign persistent IDs, provide rich metadata, register in a searchable resource,...
• Accessible
  – Retrievable by their ID using a standard protocol, metadata remain accessible even if data aren’t...
• Interoperable
  – Use formal, broadly applicable languages, use standard vocabularies, qualified references...
• Reusable
  – Rich, accurate metadata, clear licences, provenance, use of community standards...
www.force11.org/group/fairgroup/fairprinciples
How to deal with data and context?
• Versioning, back-up, storage and archiving
  – During the project and in the long term
• Ethics, consent forms, legal access
• Security and technical access
• Usage licences
What should be preserved and shared?
• The data needed to validate results in scientific publications (minimally!).
• The associated metadata: the dataset’s creator, title, year of publication, repository, identifier etc.
  – Follow a metadata standard in your line of work, or a generic standard, e.g. Dublin Core or DataCite, and be FAIR.
  – The repository will assign a persistent ID to the dataset: important for discovering and citing the data.
• Documentation: code books, lab journals, informed consent forms – domain-dependent, and important for understanding the data and combining them with other data sources.
• Software, hardware, tools, syntax queries, machine configurations – domain-dependent, and important for using the data. (Alternative: information about the software etc.)
Basically, everything that is needed to replicate a study should be available. Plus everything that is potentially useful for others.
Research Data Alliance (RDA): http://rd-alliance.github.io/metadata-directory/standards/
FAIR Guiding Principles for scientific data management & stewardship: http://www.nature.com/articles/sdata201618
How to select and appraise research data: www.dcc.ac.uk/resources/how-guides/appraise-select-research-data
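The metadata elements listed above (creator, title, year of publication, repository, identifier) can be kept as a small machine-readable record next to the data and checked for gaps before deposit. The following is a minimal sketch: the key names are illustrative assumptions and do not reproduce the official Dublin Core or DataCite schemas, and the example values are placeholders.

```python
import json

# Elements named on the slide; the key names are illustrative, not an official schema.
REQUIRED_FIELDS = ("creator", "title", "publication_year", "repository", "identifier")


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in the record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Placeholder values for illustration only.
record = {
    "creator": "Example, Researcher",
    "title": "Example interview dataset",
    "publication_year": 2016,
    "repository": "Example Repository",
    "identifier": "",  # left empty until the repository assigns a persistent ID
}

print(missing_fields(record))
print(json.dumps(record, indent=2))
```

A check like this only confirms that the fields are filled in; whether the metadata are rich and accurate enough to be FAIR still needs human judgement.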
DMPS IN HORIZON 2020
Image “Open Data” CC BY 2.0 by http://www.descrier.co.uk
Some funders that require DMPs
Common themes in DMPs
1. Description of data to be collected / created (i.e. content, type, format, volume...)
2. Standards / methodologies for data collection & management
3. Ethics and Intellectual Property (highlight restrictions on data sharing e.g. embargoes, confidentiality)
4. Plans for data sharing and access (i.e. how, when, to whom)
5. Strategy for long-term preservation
Start planning and communicating early
Horizon 2020: Open Research Data Pilot
http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
• Open access to research data refers to the right to access and re-use digital research data. Openly accessible research data can typically be accessed, mined, exploited, reproduced and disseminated free of charge for the user.
• The use of a Data Management Plan (DMP) is required for projects participating in the Open Research Data Pilot, detailing what data the project will generate, whether and how they will be exploited or made accessible for verification and re-use, and how they will be curated and preserved.
Who’s involved in this pilot?
Current situation:
• Researchers funded by Horizon 2020 within 9 specified call areas - https://www.openaire.eu/opendatapilot
• Opt out and opt in are possible.
• A DMP per dataset
As of 2017:
• European Cloud Initiative to give Europe a global lead in the data-driven economy.
• For new projects open data will become the default option. The pilot will be extended to cover all call areas. Opting out remains possible.
• http://europa.eu/rapid/press-release_IP-16-1408_en.htm
Open, unless…
• The EC’s goal is Open Access to research data: as open as possible, as closed as necessary.
• Grant Agreement, Art. 29.3, Open Access to research data:
• When applicable: explain in the DMP why you need to (partially) opt out.
Timing the DMP
• Note that the Commission does NOT require applicants to submit a DMP at the proposal stage (see next slide).
• A DMP is therefore NOT part of the evaluation.
• DMPs are a deliverable for those in the pilot (due by month 6).
• Note that the Commission requires updates. A DMP is a living or “active” document.
Proposal phase
Where relevant*, H2020 proposals can include a section on data management which is evaluated under the criterion ‘Impact’.
• What types of data will the project generate/collect?
• What standards will be applied?
• How will this data be exploited &/or shared/made accessible for verification and reuse?
• If data cannot be made available, why not?
• How will this data be curated and preserved?
Your data management policy should reflect the current state of consortium agreements on RDM.
* For “Research and Innovation actions” and “Innovation Actions”
http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
Initial DMP (at 6 months)
The DMP should address the points below on a dataset by dataset basis:
• Dataset reference and name
• Dataset description
• Standards and metadata
• Data sharing
• Archiving and preservation (including storage and backup)
See Annex 1 at: http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
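Because the initial DMP is written dataset by dataset, the five points above can be held as one structured record per dataset, which makes it easy to see what is still unanswered. This is a sketch under assumptions: the class and field names are hypothetical and merely mirror the bullet points, and the example answers are invented placeholders rather than text from a real plan.

```python
from dataclasses import dataclass, fields


@dataclass
class DatasetEntry:
    """One DMP entry per dataset; field names mirror the five points above."""

    reference_and_name: str
    description: str
    standards_and_metadata: str
    data_sharing: str
    archiving_and_preservation: str


def unanswered(entry: DatasetEntry) -> list[str]:
    """List the points that are still blank for this dataset."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name).strip()]


# Placeholder answers for illustration only.
entry = DatasetEntry(
    reference_and_name="DS1 survey responses",
    description="Anonymised questionnaire answers in tabular form",
    standards_and_metadata="Generic metadata standard, e.g. DataCite",
    data_sharing="",
    archiving_and_preservation="Deposit in a research data repository",
)

print(unanswered(entry))  # -> ['data_sharing']
```

Keeping the entries in a structured form also helps with the later updates the Commission asks for, since a new version of the plan only needs the changed fields revisited.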
More elaborate DMP
Scientific research data should be easily:
1. Discoverable
Are the data discoverable and identifiable by a standard mechanism e.g. DOIs?
2. Accessible
Are the data accessible and under what conditions e.g. licenses, embargoes?
3. Assessable and intelligible
Are the data and software assessable and intelligible to third parties for peer-review? E.g. can judgements be made about their reliability and the competence of those who created them?
4. Useable beyond the original purpose for which it was collected
Are the data properly curated and stored together with the minimum software and documentation to be useful by third parties in the long-term?
5. Interoperable to specific quality standards
Are the data and software interoperable, allowing data exchange? E.g. were common formats and standards for metadata used?
See Annex 2 at: http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
DMPonline
A web-based tool to help researchers write DMPs
Includes a template for Horizon 2020
Guidance from EUDAT and OpenAIRE being added
https://dmponline.dcc.ac.uk
How the tool works
Click to write a generic DMP
Or choose your funder to get their specific template
Pick your uni to add local guidance and to get their template if no funder applies
Choose any additional optional guidance
EUDAT guidance
OpenAIRE support
• Summary on the Open Research Data pilot: https://www.openaire.eu/opendatapilot
• Brief guide on developing a DMP: https://www.openaire.eu/opendatapilot-dmp
• Selecting a data repository: https://www.openaire.eu/opendatapilot-repository
• Developing guidance to add to DMPonline
• Will be adding an ‘export to Zenodo’ feature in early 2017 to allow DMPs to be published and assigned a DOI
Deliver the DMP and keep it up to date
• EC: “Since DMPs are expected to mature during the project, more developed versions of the plan can be included as additional deliverables at later stages. (…) New versions of the DMP should be created whenever important changes to the project occur due to inclusion of new data sets, changes in consortium policies or external factors.”
Focus on how you will ensure your data are “FAIR”
Active DMPs
• Interested in ways to support this active quality, where “active” is understood as “able to evolve and be monitored”?
• Join the RDA’s Active Data Management Plans interest group https://rd-alliance.org/groups/active-data-management-plans.html
• And see recordings, slides and notes of the international and interdisciplinary ADMP Workshop 28-30 June 2016 https://indico.cern.ch/event/520120
Option: add SSI template for software projects
Two templates available for Software Management Plans in DMPonline courtesy of SSI
www.software.ac.uk/resources/guides/software-management-plans
EXAMPLE PLANS
Example plans
• 108 DMPs from the National Endowment for the Humanities: www.neh.gov/divisions/odh/grant-news/data-management-plans-successful-grant-applications-2011-2014-now-available
• 20+ scientific DMPs submitted to the NSF (USA) provided by UCSD: http://libraries.ucsd.edu/services/data-curation/data-management/dmp-samples.html
• Example DMP collection from Leeds University: https://library.leeds.ac.uk/research-data-tools
• Further examples: www.dcc.ac.uk/resources/data-management-plans/guidance-examples
Example: OpenMinTeD
OpenMinTeD aims to create an infrastructure for Text and Data Mining (TDM) of scientific and scholarly content
The project adopted its own structure to create a ‘Data and Software Management Plan’
http://openminted.eu
Example: OpenMinTeD – Data chapter
Six high-level datasets identified:
1. Scholarly publications
2. Language and knowledge resources
3. Services and workflows
4. Automatically and manually generated annotations
5. Consortium publications
6. Metadata
Described in a table per dataset (see illustration)
OpenMinTeD – Software examples
Example: CAPSELLA
CAPSELLA aims to develop ICT solutions for farmers and other actors engaged in agrobiodiversity
Devised a questionnaire to collate dataset information from project partners
Identified 13 datasets, 6 of which are imported as is, 3 aggregated, 3 transformed and 1 generated
www.capsella.eu
Example dataset record
Data description examples
The final dataset will include self-reported demographic and behavioural data from interviews with the subjects and laboratory data from urine specimens provided.
From NIH data sharing statements
Every two days, we will subsample E. affinis populations growing under our treatment conditions. We will use a microscope to identify the life stage and sex of the subsampled individuals. We will document the information first in a laboratory notebook and then copy the data into an Excel spreadsheet. The Excel spreadsheet will be saved as a comma separated value (.csv) file.
From DataOne – E. affinis DMP example
Metadata examples
Metadata will be tagged in XML using the Data Documentation Initiative (DDI) format. The codebook will contain information on study design, sampling methodology, fieldwork, variable-level detail, and all information necessary for a secondary analyst to use the data accurately and effectively.
From ICPSR Framework for Creating a DMP
We will first document our metadata by taking careful notes in the laboratory notebook that refer to specific data files and describe all columns, units, abbreviations, and missing value identifiers. These notes will be transcribed into a .txt document that will be stored with the data file. After all of the data are collected, we will then use EML (Ecological Metadata Language) to digitize our metadata. EML is one of the accepted formats used in ecology, and works well for the types of data we will be producing. We will create these metadata using Morpho software, available through KNB. The metadata will fully describe the data files and the context of the measurements.
From DataOne – E. affinis DMP example
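The intermediate step in the E. affinis example, a .txt document stored with the data file that describes columns, units and missing value identifiers, can be generated from a small structure rather than typed by hand. The sketch below is an assumption about what such a file might look like; the layout, the function name and the example column are hypothetical and are not taken from the DataOne plan. It does not produce EML.

```python
def render_data_dictionary(data_file: str, columns: list[dict], missing_value: str) -> str:
    """Render a plain-text description of a data file's columns and units."""
    lines = [
        f"Data file: {data_file}",
        f"Missing value identifier: {missing_value}",
        "Columns:",
    ]
    for column in columns:
        lines.append(f"- {column['name']} [{column['unit']}]: {column['description']}")
    return "\n".join(lines) + "\n"


# Hypothetical column description for illustration only.
columns = [
    {"name": "stage", "unit": "category", "description": "life stage of the individual"},
]

text = render_data_dictionary("counts.csv", columns, "NA")
print(text)
# To store it next to the data file, write `text` to e.g. counts_README.txt.
```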
Data sharing examples
We will make the data and associated documentation available to users under a data-sharing agreement that provides for: (1) a commitment to using the data only for research purposes and not to identify any individual participant; (2) a commitment to securing the data using appropriate computer technology; and (3) a commitment to destroying or returning the data after analyses are completed.
From NIH data sharing statements
The videos will be made available via the bristol.ac.uk website (both as streaming media and downloads). HD and SD versions will be provided to accommodate those with lower bandwidth. Videos will also be made available via Vimeo, a platform that is already well used by research students at Bristol. Appropriate metadata will also be provided to the existing Vimeo standard.
All video will also be available for download and re-editing by third parties. To facilitate this, Creative Commons licenses will be assigned to each item. In order to ensure this usage is possible, the required permissions will be gathered from participants (using a suitable release form) before recording commences.
From University of Bristol Kitchen Cosmology DMP
Examples of restrictions
Because the STDs being studied are reportable diseases, we will be collecting identifying information. Even though the final dataset will be stripped of identifiers prior to release for sharing, we believe that there remains the possibility of deductive disclosure of subjects with unusual characteristics. Thus, we will make the data and associated documentation available to users only under a data-sharing agreement.
From NIH data sharing statements
1. Share data privately within 1 year. Data will be held in Private Repository, but metadata will be public.
2. Release data to public within 2 years. Encouraged after one year to release data for public access.
3. Request, in writing, data privacy up to 4 years. Extensions beyond 3 years will only be granted for compelling cases.
4. Consult with creators of private CZO datasets prior to use. PIs required to seek consent before using private data they can access.
From Boulder Creek Critical Zone Observatory DMP
Archiving examples
The investigators will work with staff at the UKDA to determine what to archive and how long the deposited data should be retained. Future long-term use of the data will be ensured by placing a copy of the data into the repository.
From ICPSR Framework for Creating a DMP
Data will be provided in file formats considered appropriate for long-term access, as recommended by the UK Data Service. For example, SPSS Portable format and tab-delimited text for qualitative tabular data and RTF and PDF/A for interview transcripts. Appropriate documentation necessary to understand the data will also be provided. Anonymised data will be held for a minimum of 10 years following project completion, in compliance with LSHTM’s Records Retention and Disposal Schedule. Biological samples (output 3) will be deposited with the UK BioBank for future use.
From Writing a Wellcome Trust Data Management and Sharing Plan
Share your example DMPs!
Send us links to your DMPs
We will add them to the DCC list
Aim to cover wide range of disciplines and funders
www.dcc.ac.uk/share-DMPs
LESSONS AND RESOURCES
Image ‘Energy Resources | Energie Quelle’ CC-BY-NC by K. H. Reichert www.flickr.com/photos/reupa/19502634575
Tips for writing DMPs
• Seek advice - consult and collaborate
• Consider good practice for your field
• Base plans on available skills & support
• Make sure implementation is feasible
• Think about things early…
Plan to share data from the outset
• Licence negotiations and consent agreements may preclude later sharing if not handled carefully
• Costings can’t be included retrospectively
• Useful to consider data issues at the consortium negotiation stage to make sure potential issues are identified and sorted asap
Decisions made early on affect what you can do later
Sharing data: what is meant?
• Sharing with collaborators while research is active: data are mutable
• (Open) data sharing: data are stable, searchable, citable, clearly licensed
Storing data: what is meant?
• Storing and backing up files while research is active: likely to be on a networked filestore or hard drive; easy to change or delete
• Archiving or preserving data in the long-term: likely to be deposited in a digital repository; safeguarded and preserved
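Because working files are easy to change or delete, it can help to check now and then that a backup copy still matches the working copy. A common technique for this is to compare checksums. The sketch below assumes two local folders and uses SHA-256 from the Python standard library; the function names are hypothetical, and a repository's own preservation checks are a separate matter.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Map each file's path (relative to folder) to its checksum."""
    return {
        path.relative_to(folder).as_posix(): sha256_of(path)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def changed_files(original: dict[str, str], backup: dict[str, str]) -> list[str]:
    """Return files that are missing from the backup or differ from the original."""
    return sorted(name for name in original if backup.get(name) != original[name])
```

Typical use would be `changed_files(build_manifest(Path("work")), build_manifest(Path("backup")))`, where both folder names are placeholders; an empty list means every working file has an identical copy in the backup.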
Archiving, repositories, ehm?
• Horizon 2020 ORD pilot participants are asked to “deposit your data in a research data repository”: a digital archive collecting and displaying datasets and their metadata.
• Select a data repository that will preserve your data, metadata and possibly tools in the long term.
• It is advisable to contact the repository of your choice when writing the first version of your DMP.
• Repositories may offer guidelines for sustainable data formats and metadata standards, as well as support for dealing with sensitive data and licensing.
Where to find a repository?
• More information: https://www.openaire.eu/opendatapilot-repository
• Zenodo: http://www.zenodo.org
• Re3data.org: http://www.re3data.org
Searching with Re3data.org
www.fosteropenscience.eu/content/re3data-demo
How to select a repository?
• Certification as a ‘Trustworthy Digital Repository’ with an explicit ambition to keep the data available in the long term.
• Matches your particular data needs: e.g. formats accepted; mixture of Open and Restricted Access.
• Provides guidance on how to cite the deposited data.
• Gives your submitted dataset a persistent and globally unique identifier for sustainable citations and to link back to particular researchers and grants.
www.openaire.eu/opendatapilot-repository
Data Seal of Approval, nestor seal, ISO 16363
Keep everything? For always?
• When regenerating data would be cheaper than archiving, don’t archive. Select what data you’ll need and want to retain.
• 10 years is often stated in data policies and academic codes, but data can be valuable for ages, in climatology, sociology, health sciences, astronomy, linguistics, … Look beyond minimal retention periods where relevant.
• Explain your selection criteria in the DMP.
DCC How-to guide: http://www.dcc.ac.uk/resources/how-guides/appraise-select-data
RDNL Selection criteria: http://www.researchdata.nl/en/services/data-management/selecting-research-data/
Licensing research data
• Horizon 2020 guidelines point to CC-BY or CC-0
• EUDAT licensing wizard helps you pick a licence for data & software: http://ufal.github.io/public-license-selector
• DCC How-to guide helps you to license data: www.dcc.ac.uk/resources/how-guides/license-research-data
Metadata standards
Metadata Standards Directory
• Broad, disciplinary listing of standards and tools
• Maintained by RDA group
http://rd-alliance.github.io/metadata-directory
Biosharing
• A portal of data standards, databases, and policies
• Focused on life, environmental and biomedical sciences
https://biosharing.org
Guidelines on DMPs
• How to develop a DMP: www.dcc.ac.uk/resources/how-guides/develop-data-plan
• RDM brochure and template: https://dans.knaw.nl/en/about/organisation-and-policy/information-material?set_language=en
• OpenAIRE guidelines: www.openaire.eu/opendatapilot-dmp
• ICPSR framework for a DMP: www.icpsr.umich.edu/icpsrweb/content/datamanagement/dmp/framework.html
EC guidance
• Guidelines on Data Management in Horizon 2020
• Provides summary of requirements
• Includes templates for DMPs
http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
KEY MESSAGES
Image “Fishbone” CC BY-NC-ND 2.0 by https://www.flickr.com/photos/mrjnl/
Key messages
• Data management is part of good research practice whether you plan to make the data open or not – it benefits you!
• The process of planning and reflecting is most important. Think about the desired end result and plan for this.
• Approach the DMP in whatever way best fits your project
  – adopt a different template to suit
  – add sections / elements e.g. ethics, software
  – decide whether to describe each dataset in detail
  – focus effort on datasets you’ll create rather than reuse…
www.eudat.eu www.openaire.eu
Thanks – any questions?
Contact us:
Marjan Grootveld: [email protected]
Sarah Jones: [email protected]
Acknowledgements:
Thanks to DANS and DCC for reuse of slides, and to the OpenMinTeD and CAPSELLA projects for sharing their Data Management Plans
Please let us know what you thought of the webinar
https://eudat.eu/evaluation-form-for-the-webinar-how-to-write-a-data-management-plan