Creating a Data Management Plan (or how to get started with RDM)
Myriam Mertens | Ghent University Library
Nele Pauwels | Knowledge Centre for Health Ghent
Aims of today’s workshop
1. provide you with a basic understanding of data management planning and why it’s important
2. give you a broad overview of the range of issues/topics involved in research data management
3. help you get started with writing your own data management plan
4. point you to existing resources for further information, training and advice
2
Some housekeeping information
• today: mix of us presenting & activities
• slides are available via osf.io/zse46
• useful links document: goo.gl/E2jnz2
3
Data management plan (DMP)
document outlining how data will be handled during and after a project
increasingly required by research funders/institutions
good practice even if not required, because…
5
Planning is first step towards good research data management (RDM)
“[RDM] ensure[s] that data are of a high quality, are well organized, documented, preserved, sustainable, accessible and reusable.” (Corti et al. 2014)
6
Data in the digital age
• data “explosion”
- navigating and using data is the challenge
• digital data are fragile, e.g. because of
- hardware/software failure
- human error
- malicious attacks
- natural disasters
- passage of time! (changing technology, loss of information…)
- …
7
Why you should care about RDM
• increases research efficiency
• facilitates collaboration
• encourages data reuse (increased visibility!)
• minimizes risk of data loss
• supports research integrity & quality
8
RDM is a crucial part of good research practice
Expectations regarding data include:
• secure preservation for a reasonable period
• access: as open as possible, as closed as necessary
• FAIR principles (Findable, Accessible, Interoperable & Reusable data)
• data = legitimate & citable products of research
9
DMPs – a mere administrative burden?
- takes time and effort upfront, but…
+ saves time and problems later on
+ helps consider whole range of RDM activities/issues
+ makes expectations, procedures & responsibilities explicit
+ leads to more informed decisions about data
+ helps identify resources required (& obtain funding)
12
Key DMP topics
“[…] plans typically state what data will be created and how, and outline the plans for sharing and
preservation, noting what is appropriate given the nature of the data and any restrictions that may need to be applied.” (DCC website)
1. description of data to be generated or used
2. methods, standards for collecting/creating & documenting data
3. ethics & legal issues
4. plans for data sharing
5. strategy for preserving data beyond project end
13
‘Research data’ can mean a lot…
any information collected/created for the purposes of analysis to generate scientific claims
For example:
• content: numerical, textual, audiovisual, multimedia... data
• data format/object: spreadsheets/tabular data, fieldwork notes, databases, images, audio recordings, marked up texts, surveys, instrument readings…
• mode of data collection: experimental, observational, simulation, derived/compiled… data
• digital or non-digital data
• primary or secondary data
• raw, processed or analyzed data
15
File formats
• potential problems
- not interoperable: other people cannot open file
- obsolescence: file may be difficult or impossible to open at a later date
• formats for long-term access
- non-proprietary: no specific (version of) software required to open file
- open, documented standard
- widely used
- lossless (rather than lossy)
16
Some examples of recommended formats
• be aware of potential errors/information loss when converting
• if necessary, consider saving data in both proprietary and open format
17
Type of data    Formats
tabular data    .csv; .tab; .por (SPSS portable format); .xml
textual data    .rtf; .txt; .xml
image data      .tif (TIFF 6.0 uncompressed)
audio data      .flac
Source: https://www.ukdataservice.ac.uk/manage-data/format/recommended-formats
Data volume
• how much data will you produce? how fast will it grow?
• where will you store data? do you have enough storage capacity?
• what about back-ups?
18
3-2-1 backup rule
• have at least 3 copies of important files, on at least 2 different types of storage media, with at least 1 offsite copy
• back up regularly & automatically
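The 3-2-1 rule can be expressed as a simple check. The sketch below is only an illustration: the `Copy` class, its fields and the example locations are assumptions made for this example, not part of any UGent tooling.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One copy of a dataset (all values used below are illustrative)."""
    location: str   # e.g. "laptop", "network share"
    medium: str     # type of storage medium, e.g. "ssd", "network", "cloud"
    offsite: bool   # True if kept away from the primary work location


def satisfies_3_2_1(copies: list[Copy]) -> bool:
    """Return True if the copies meet the 3-2-1 backup rule."""
    enough_copies = len(copies) >= 3                      # at least 3 copies
    enough_media = len({c.medium for c in copies}) >= 2   # on at least 2 media types
    has_offsite = any(c.offsite for c in copies)          # at least 1 offsite
    return enough_copies and enough_media and has_offsite


copies = [
    Copy("laptop", "ssd", offsite=False),
    Copy("network share", "network", offsite=False),
    Copy("cloud storage", "cloud", offsite=True),
]
print(satisfies_3_2_1(copies))      # True
print(satisfies_3_2_1(copies[:2]))  # False: only two copies, none offsite
```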
19
Central storage options (DICT)
• shared network drives (‘shares’): secure, regular backups
• OneDrive for Business (http://onedrive.ugent.be): 1 TB; no confidential data unless encrypted
• Sharepoint (http://sharepoint.ugent.be): no confidential data unless encrypted
20
Requirements: Very large datasets; Active data; Sharing outside of UGent; Local copy
Storage options DICT: HPC (High Performance Computing); Archival Shares; ACL (Access Control List) Shares; Sharepoint; OneDrive for Business
Confidential data
Yes No ✓ ✓
Yes Yes, on HPC ✓ Encrypt
Yes Yes ? ?
No No No ✓ ✓
No No Yes ✓ Encrypt
No Yes No ✓ ✓ ✓
No Yes Yes No ✓ ✓ Encrypt
No Yes Yes Yes ✓ Encrypt
(No) case-by-case … consider deposit
Activity:
• Introduce yourself to your neighbour and describe the types of data you produce/use
- e.g. content
- e.g. mode of data collection
- e.g. digital/non-digital
- e.g. formats
21
Have a logical system for organizing data (files)
• should be meaningful to you and your colleagues
• should allow you to find files/data easily
• develop standards early in project & use consistently
• don’t forget non-digital materials
24
For example:
• hierarchical folder structure
• database for large, complex datasets
• non-digital data: filing system, labels to identify content of data folders
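A hierarchical folder structure can be set up once at the start of a project and then reused. The following is a minimal sketch; the folder names are hypothetical and should be adapted to what is meaningful to you and your colleagues.

```python
from __future__ import annotations

from pathlib import Path

# Hypothetical folder names; adapt them to your own project.
FOLDERS = [
    "01_raw_data",
    "02_processed_data",
    "03_analysis/scripts",
    "03_analysis/output",
    "04_documentation",
]


def create_project_tree(root: Path) -> list[Path]:
    """Create the folder hierarchy under root and return the folder paths."""
    created = []
    for name in FOLDERS:
        folder = root / name
        # parents=True also creates intermediate folders such as 03_analysis
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `create_project_tree(Path("my_project"))` would create the folders under `my_project`; running it again is harmless because existing folders are left untouched.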
Use file naming conventions
• names should
- uniquely identify and reflect content of a file
- be consistent
- contain no special characters or spaces
- not be too long
- use date conventions (YYYYMMDD or YYYY-MM-DD)
27
For example:
• 19991021_WesternBlot07
- western blot experiment number 7 from 21 Oct 1999
• Int024_MP_2008-06-05.doc
- interview with participant 24, interviewed by Marc Peters on 5 June 2008
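A naming convention is easier to apply consistently when names are generated and checked by a small script. The sketch below assumes a YYYYMMDD_DescriptionNN pattern like the western blot example; the function names, the allowed character set and the 50-character limit are choices made for this illustration only.

```python
import re
from datetime import date

# Letters, digits, dot, underscore and hyphen only: no spaces or special characters.
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")
MAX_LENGTH = 50  # arbitrary limit chosen for this sketch


def is_valid_name(name: str) -> bool:
    """Check for allowed characters and a reasonable length."""
    return SAFE_NAME.fullmatch(name) is not None and len(name) <= MAX_LENGTH


def build_name(day: date, description: str, number: int, extension: str = "") -> str:
    """Build a name such as 19991021_WesternBlot07 (YYYYMMDD_DescriptionNN)."""
    name = f"{day:%Y%m%d}_{description}{number:02d}{extension}"
    if not is_valid_name(name):
        raise ValueError(f"name breaks the convention: {name!r}")
    return name


print(build_name(date(1999, 10, 21), "WesternBlot", 7))  # 19991021_WesternBlot07
print(is_valid_name("Int024_MP_2008-06-05.doc"))         # True
print(is_valid_name("western blot #7.doc"))              # False
```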
Versioning
• would you know which version of the data to use? Or how the versions differ from each other?
28
Have a strategy for file versioning
• record changes made to files
• identify different versions of files
• decide which versions to keep & how to organize them (e.g. master & working copies)
29
Possible strategies:
• dates or version numbers in file names (v1.0, v1.1, v2.0...)
• keep a log or file history table
• use version control software (e.g. Git), or version control features in software
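Version numbers in file names and a file history table can be combined. The sketch below assumes file names carry a tag of the form `_v1.0`; the helper names and the example file name are hypothetical.

```python
import re

# Matches a version tag such as _v1.0 or _v2.13 inside a file name.
VERSION = re.compile(r"_v(\d+)\.(\d+)")


def bump_version(filename: str, major: bool = False) -> str:
    """Return filename with its _vMAJOR.MINOR tag increased by one step."""
    match = VERSION.search(filename)
    if match is None:
        raise ValueError(f"no version tag like _v1.0 in {filename!r}")
    major_no, minor_no = int(match.group(1)), int(match.group(2))
    if major:
        new_tag = f"_v{major_no + 1}.0"          # major change: reset minor number
    else:
        new_tag = f"_v{major_no}.{minor_no + 1}"  # minor change
    return filename[:match.start()] + new_tag + filename[match.end():]


def log_entry(filename: str, day: str, author: str, change: str) -> str:
    """Format one tab-separated row for a file history table."""
    return "\t".join([filename, day, author, change])


print(bump_version("survey_data_v1.0.csv"))              # survey_data_v1.1.csv
print(bump_version("survey_data_v1.1.csv", major=True))  # survey_data_v2.0.csv
```

For anything beyond a handful of files, version control software such as Git records the same information automatically.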
Data documentation
• any descriptive or contextual information necessary to find, assess, understand & properly use your data
• start as soon as you start collecting data
31
Levels of data documentation
• study level
- published article may not be enough!
- conditions of data collection (e.g. study description, protocols, instruments & software/hardware used…)
- any changes made to collected data (e.g. processing & analysis procedures, scripts)
- overall structure of files
• data level
- information about individual data items/elements within data files (e.g. variable names & descriptions, value codes, within-file annotations,…)
32
Capturing data documentation
• in separate files, e.g.
- readme.txt files
- lab notebooks
- data dictionaries/codebooks
- project reports, publications
• embedded within data files
• using metadata (‘data about data’)
- highly structured format for describing data
- useful for searching through large amounts of data, to facilitate exchange & comparison of data
- elements from a controlled list, as defined by a metadata schema
Metadata example for image data, based on the Dublin Core
metadata schema (reproduced from Briney 2015)
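To illustrate what structured metadata with elements from a controlled list can look like, the sketch below writes a few Dublin Core elements as XML. It is not the example from Briney 2015: the record values are invented, and the `metadata` wrapper element and function name are assumptions of this sketch. Only the element names (title, creator, date, format, identifier) and the namespace come from the Dublin Core element set.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict[str, str]) -> str:
    """Serialise element-name/value pairs as a simple Dublin Core XML record."""
    root = ET.Element("metadata")  # wrapper element chosen for this sketch
    for element, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Invented example values for an image file.
record = dublin_core_record({
    "title": "Western blot, experiment 7",
    "creator": "Example Researcher",
    "date": "1999-10-21",
    "format": "image/tiff",
    "identifier": "19991021_WesternBlot07.tif",
})
print(record)
```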
33
Don’t reinvent the wheel: use existing standards!
• standard = an agreed way of doing something
- Standards are agreed-upon conventions for doing something, e.g. managing a process or delivering a service, and are established by community consensus or an authority.
• examples:
- minimal reporting standards for research and publications (e.g. systematic review, qualitative research, diagnostic tests)
- minimal reporting standards for biomedical investigations (e.g. Flow Cytometry, quantitative Real-Time PCR, microarray experiment, T cell assay, immunohistochemistry, genotyping experiment, gel electrophoresis)
- standard vocabularies, ontologies (e.g. International Classification of Diseases, Breast Cancer Grading Ontology)
- standard data structures/formats (e.g. proteomics, x-ray data, statistical data)
34
Activity
• which minimal information should I report when performing my experiment(s), and/or writing my article?
=> Search
35
Personal data
• any information relating to an identified/identifiable natural person
• handling is regulated by European and Belgian privacy/data protection legislation (e.g. new GDPR), e.g.
- certain requirements for lawful processing of personal data (including data subject’s informed consent)
- appropriate organisational & technical measures to protect personal data
• sensitive personal data
- info about racial or ethnic origin, political opinions, religious/philosophical beliefs, trade-union membership, health or sex life, criminal offences…
- benefit from stronger protection
37
Otherwise confidential data, e.g.
• you have signed a non-disclosure agreement/contract with a confidentiality clause
• data are otherwise sensitive
- i.e. disclosure may harm endangered species, vulnerable sites or groups, national security…
- also see ethical standards governing confidentiality in research
• data have economic valorisation potential
- duty to report to UGent’s TechTransfer office before disclosing anything!
38
Personal/confidential data - some things to consider
• don’t collect more personal data than needed
• seek the permissions required to collect & handle these data
- e.g. informed consent, ethical approval, …
- permission for data collection, but also for processing, archiving, sharing, destroying…
• anonymize/pseudonymize personal data to protect privacy
• pay attention to data security to prevent unauthorized access & disclosure
- physical security (e.g. lock labs, offices…)
- security of computer systems and files (e.g. passwords, encryption, up-to-date software, controlled access to files/folders, …)
- network security (e.g. firewall protection, no confidential data on servers/computers connected to external network…)
familiarize yourself with UGent Information Security Policy & Guidelines!
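One possible way to pseudonymize identifiers is to replace them with codes derived from a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library; the function name, the key and the example identifier are hypothetical. It is an illustration of the idea, not a statement of what the UGent guidelines require, and pseudonymized data generally still count as personal data.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes, length: int = 12) -> str:
    """Map an identifier to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return "P" + digest.hexdigest()[:length]


# Hypothetical key: in practice, generate it randomly and store it separately
# from the data, with restricted access.
key = b"replace-with-a-randomly-generated-secret"

code_a = pseudonymize("participant-024@example.org", key)
code_b = pseudonymize("participant-024@example.org", key)
print(code_a == code_b)  # True: the same person always gets the same code
```

Using a secret key rather than a plain hash makes it harder to recover identities by hashing a list of known names or addresses.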
39
Intellectual property (IP)
• IP issues can affect your ability to (re)use, archive and/or share data
• data may be protected by IP rights
- e.g. copyright, database right
- permission from rights owner(s) required, e.g. to reproduce or publicly communicate data
• your data may have economic valorisation potential
- UGent normally owns such research results
- sharing may not be possible (to protect confidential knowhow), or only after an embargo period (to seek patent protection first) – always check with UGent TechTransfer first!
40
Intellectual property (IP)
• (rights to) data may be (co-)owned by a third party
- e.g. when you re-use existing data
- e.g. when research is funded by or conducted in collaboration with external partner
- check/obtain necessary third-party permissions!
41
Sharing data with others
• easier than ever in the digital age
• nothing new in certain domains + changing culture among researchers
- e.g. P. Masuzzo & L. Martens (2017), “Do you speak open science? Resources and tips to learn the language”, PeerJ Preprints 5: e2689v1. doi: 10.7287/peerj.preprints.2689v1
• increasingly required by research funders and publishers
- funder policies on access to research data (e.g. European Commission – H2020)
- journal data availability policies (e.g. PLOS journals, Nature, BioMed Central journals…)
43
Sharing does not necessarily mean “open data”
• fully ‘open’: “anyone can freely access, use, modify, and share for any purpose” (opendefinition.org)
• “as open as possible, as closed as necessary” - approach (cf. ethical & legal restrictions)
• possible to share data under more restricted conditions
- e.g. only a subset of the data
- only with certain (types of) users
- only for certain types of use
- after an embargo period…
44
Various ways of sharing data
• email data “upon request”
• disseminate via a project or personal website
• make data available via a trusted database or data repository
- general-purpose/multi-disciplinary, domain-specific, or institutional
- helps make data citable and FAIR
G. Polanczyk et al., “The Worldwide Prevalence of ADHD: A Systematic Review and Metaregression Analysis”, The American Journal of Psychiatry 164 (2007) 6: 942-948. doi: 10.1176/ajp.2007.164.6.942
45
A trusted data repository
• assigns a unique persistent identifier (e.g. DOI) to dataset, which resolves to a landing page
• provides online access to metadata (always public) + access to data & documentation (open or more restricted)
• states data reuse rights (via licenses)
• uses standards to promote interoperability
Dataset record from the 4TU.ResearchData repository.
doi: 10.4121/uuid:3106fb06-9723-49d1-b829-94778fa5aa6d
46
Publish a data paper
• extensive dataset description, published in a journal
• link to data deposited in repository
• paper and data are peer-reviewed
• cited like a traditional article
• format offered by regular journals + dedicated data journals (e.g. Scientific Data)
S.M. Kadri et al., “A variant reference data set for the Africanized honeybee, Apis mellifera”, Scientific Data 3 (2016), Article number: 160097. doi: 10.1038/sdata.2016.97
47
Licensing research data
• use licenses to make reuse rights clear
• many repositories use standard (rather than bespoke) licenses
- Creative Commons Licenses
- Open Data Commons Licenses
- but less suitable for restricted data
Licenses conformant with Open Definition principles. From “Conformant Licenses” by Open Definition, licensed under CC-BY
48
Reusing data
• find existing datasets via repositories’ data catalogues
• cite any datasets you reuse in your publications
• minimum elements:
- Author
- Publication Year
- Title
- Publisher
- Location (usually PID + resolver service)
Example
Benkman C (2016). Data from: Matching habitat choice in nomadic crossbills appears most pronounced when food is most limiting. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.dg41r
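The minimum citation elements can be assembled mechanically. The sketch below reproduces the layout of the Benkman example above; the function name is hypothetical, and repositories or journals may prescribe a different citation style.

```python
def format_data_citation(author: str, year: int, title: str,
                         publisher: str, location: str) -> str:
    """Combine the minimum citation elements into a single reference string."""
    return f"{author} ({year}). {title}. {publisher}. {location}"


citation = format_data_citation(
    author="Benkman C",
    year=2016,
    title=("Data from: Matching habitat choice in nomadic crossbills "
           "appears most pronounced when food is most limiting"),
    publisher="Dryad Digital Repository",
    location="http://dx.doi.org/10.5061/dryad.dg41r",
)
print(citation)
```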
49
Activity
• Why NOT share research data: give at least 3 arguments AGAINST data sharing
• Why share research data: give at least 3 arguments IN FAVOUR OF sharing
50
Why not share research data?
Common arguments AGAINST sharing
impossible because of privacy or IP
fear of being ‘scooped’
fear of errors being exposed
fear of misinterpretation or misuse
too much effort/too costly
data not of interest to anyone else
lack of reward
Adapted from Corti et al. 2014
51
Advantages of sharing research data
52
Common arguments FOR sharing
helps uncover errors, fraud, irreproducible results
avoids duplication of effort (greater ‘return on investment’)
public access to publicly funded research
results in citations (citation advantage for papers with shared data + citations when others reuse your data)
opportunities for new collaborations, co-authorships
advances science (accelerates discovery, facilitates new research questions/new forms of research)
returning the favour (already reusing other people’s data)
Adapted from Corti et al. 2014
Preserving data
• what happens to research data once a project is completed?
54
Vines, Timothy et al. “The Availability of Research Data Declines Rapidly with Article Age.” Current Biology 24 (2014) 1: 94–97. doi:10.1016/j.cub.2013.11.014
probability of supporting data still being available declined by 17% every year
Don’t keep everything (indefinitely)
• select what to keep and how long, based on e.g.:
- obligations to keep data/documentation (legal obligations, funder policies…): e.g. clinical trial documents need to be kept for 20 years!
- what is needed to verify & validate your publications
- what cannot be recreated or is too expensive to recreate
- potential re-use value
- scientific, historical, cultural significance
- …
55
Preserving takes more than storing data
• keeping data files readable and usable over time requires appropriate strategies, e.g.:
- preparing data for preservation (e.g. sustainable file formats, documentation/metadata)
- moving files to new storage hardware every 3 to 5 years
- monitoring for file corruption using checksums
- making backups is still necessary
- …
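Monitoring for file corruption with checksums amounts to recording a checksum when a file is archived and recomputing it later. A minimal sketch with the Python standard library follows; SHA-256 and the function names are choices made for this example, and other algorithms are also in common use.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_checksum: str) -> bool:
    """Return True if the file still matches the checksum recorded earlier."""
    return sha256_of(path) == recorded_checksum
```

A typical routine would store the output of `sha256_of` next to each archived file and run `is_unchanged` at regular intervals, or after moving files to new storage hardware.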
56
If possible, outsource preservation
57
• for example, to a trusted external data repository
- suitable for publicly shareable data that need longer retention periods
- check explicit commitment to preservation (e.g. preservation policy, certificate, statement on how long data will be supported...)
• confidential data may need to stay in-house
Things to consider when choosing a repository
• does it
- provide a persistent & unique identifier to your dataset?
- provide a landing page for each dataset, with metadata?
- help you track usage (e.g. access & download statistics)?
- have a certificate to indicate trustworthiness (e.g. DSA)?
- match your data needs (e.g. your type of data are accepted)?
- meet legal requirements in terms of data protection and allowing reuse without unnecessary licensing conditions?
- provide guidance on how to cite data?
- charge for its services?
Adapted from https://www.openaire.eu/opendatapilot-repository
59
Don’t forget non-digital materials
• for example:
- biological materials: BCCM culture collection (3 hosted @ Ghent University, Faculty of Sciences)
- human tissue: Bimetra biobank @ Ghent University Hospital
60
Activity
Search for a suitable domain-specific or general-purpose repository for your research data in re3data.org
Examples general-purpose: Dryad, Zenodo
Examples domain-specific: UniProt Knowledgebase, International Mouse Strain Resource, GenBank
61
DMPonline.be
• local instance of open source software developed by DCC (UK)
• launched as a pilot at UGent in 2015
• now hosted on BELNET servers + accessible to researchers from institutions within the DMPbelgium consortium
- currently:
64
Further tips for writing a DMP
• check applicable data policies
- e.g. Ghent University RDM Policy Framework
• keep it simple, but be as specific as possible
• justify your decisions
• consider it a ‘living’ document
• have a look at example DMPs
• familiarize yourself with RDM terminology & best practices (for your field)
66
Example plans
• examples on the Digital Curation Centre (DCC) website
http://www.dcc.ac.uk/resources/data-management-plans/guidance-examples
• examples in the Zenodo repository
https://zenodo.org/search?page=1&size=20&q=data management plans
• public DMPs on the DMPTool website
https://dmptool.org/public_dmps
• DMPs published in RIO (Research Ideas and Outcomes OA journal)
http://riojournal.com/browse_user_collection_documents?collection_id=3
67
Online RDM training resources
• FOSTER training portal
• OpenAIRE webinars
• EUDAT training materials
• Digital Curation Centre How-to Guides & Checklists
• UK Data Service ‘Prepare & Manage Data’ webpages
• MANTRA – Research Data Management Training
• ‘Research Data Management and Sharing’ MOOC on Coursera
• Data Management Training Clearinghouse
• Data4lifesciences Handbook for Adequate Natural Data Stewardship
• FAIRDOM Knowledge Hub
68
Want to get some feedback on your DMP?
• have a look at our Generic DMP Review Rubric
- a (self-)evaluation form for DMPs based on UGent generic DMP template
- available at https://osf.io/ezxm5/
• or… send us your DMP!
69
Credits
• slides draw heavily and/or adapt materials from:
K. Briney (2015), Data Management for Researchers: Organize, Maintain and Share your Data for Research Success (Pelagic Pub Ltd).
L. Corti, V. Van den Eynden & M. Woollard (2014), Managing and Sharing Research Data. A Guide to Good Practice (Sage).
S. Jones (2013), ‘Research Data Management’, Licensed under CC-BY
S. Jones (2016), ‘What is a Data Management Plan?’, Licensed under CC-BY 4.0
T. Ross-Hellauer & S. Jones (2016), ‘Research Data Management: An Introductory Webinar from OpenAIRE and EUDAT’, Licensed under CC-BY 4.0
71
• images
[slide 2]: From ‘Research Data Management: An Overview - 2014-05-12’ by Research Support Team, IT Services, University of Oxford, licensed under CC-BY-NC-SA 4.0
[slide 5]: ‘Writing’ by Aiconica, licensed under CC0 1.0
[slide 6]: ‘Database’ by Jørgen Stamp, attribution: digitalbevaring.dk, licensed under CC-BY 2.5 DK
[slide 7]: ‘Data Ocean’ by Auke Herrema – Het Bouwteam, licensed under CC-BY
[slide 8]: ‘FAIRDOM – Research Data Management’ by Stiftfilm.de, all rights reserved
[slide 9]: ‘The European Code of Conduct for Research Integrity. Revised Edition’ by ALLEA – All European Academies, redistribution permitted for educational, scientific and private purposes if the source is quoted.
[slide 10]: ‘Publications and Data’ by Auke Herrema, licensed under CC-BY 4.0
[slide 11]: From ‘Policy for Research Data Management’ by University of Copenhagen – Faculty of Health and Medical Sciences
[slide 12]: ‘Planning’ by Jørgen Stamp, attribution: digitalbevaring.dk, licensed under CC-BY 2.5 DK
[slide 16]: ‘File formats collection’ created by Freepik
[slide 18]: ‘Social media information overload’ by Mark Smiciklas, licensed under CC-BY-NC 2.0
[slide 19]: From ‘Research Data Management: An Overview - 2014-05-12’ by Research Support Team, IT Services, University of Oxford, licensed under CC-BY-NC-SA 4.0
[slide 20]: storage options DICT (UGent) by Johan Van Camp
72
• images
[slide 23]: From ‘Analyzing DMPs to inform Research Data Services’ by A. L. Whitmire, licensed under CC-BY 4.0
[slide 24]: From ‘Template Research Data Management workshop for STEM researchers’ by R. Higman and M. Teperek, licensed under CC-BY 4.0
[slide 26]: From ‘Template Research Data Management workshop for STEM researchers’ by R. Higman and M. Teperek, licensed under CC-BY 4.0
[slide 28]: From ‘Introduction to Research Data Management’ by A. Whitmire & S. Van Tuyl, licensed under CC-BY
[slide 30]: From ‘Data Handling: Documentation, Organization and Storage’ by Sebastian Netscher, licensed under CC-BY
[slide 31]: From ‘Research Data Management: An Overview - 2014-05-12’ by Research Support Team, IT Services, University of Oxford, licensed under CC-BY-NC-SA 4.0
[slide 32]: ‘Loss of Data’ by Auke Herrema – Het Bouwteam, licensed under CC-BY
[slide 35]: ‘Metadata’ by Jørgen Stamp, attribution: digitalbevaring.dk, licensed under CC-BY 2.5 DK
[slide 37]: ‘Privacy’ by Nick Youngson, licensed under CC-BY-SA 3.0
[slide 38]: ‘File: Lorenzo Federici 2’ by Walteroma10, licensed under CC-BY-SA 3.0
[slide 40]: ‘Property’ by Nick Youngson, licensed under CC-BY-SA 3.0
[slide 44]: ‘Data Tree’ by Auke Herrema – Het Bouwteam, licensed under CC-BY
73
• images
[slide 48]: From ‘Conformant Licenses’ by Open Definition, licensed under CC-BY
[slide 51]: ‘Data Sharing’ by Auke Herrema – Het Bouwteam, licensed under CC-BY
[slide 55]: ‘How to Choose’ by Jørgen Stamp, attribution: digitalbevaring.dk, licensed under CC-BY 2.5 DK
[slide 56]: ‘A Domesday system at the Vintage Computer Festival 2010, Bletchley UK’ by Regregex, licensed under CC-BY 3.0
[slide 57]: ‘Trustworthy Digital Preservation’ by Jørgen Stamp, attribution: digitalbevaring.dk, licensed under CC-BY 2.5 DK
[slide 58]: From ‘How to select a repository’ by OpenAIRE, licensed under CC-BY 4.0
[slide 59]: From ‘Research Data Management Briefing Paper’ by OpenAIRE, licensed under CC-BY 4.0
[slide 60]: BCCM consortium by BCCM
[slide 63]: ‘Tools’ by Jørgen Stamp, attribution: digitalbevaring.dk, licensed under CC-BY 2.5 DK
[slide 66]: From V. Van den Eynden et al., Managing and Sharing Data. Best practice for Researchers (UK Data Archive, 2009), licensed under CC-BY-NC-SA 3.0
[slide 70]: ‘Knowledge’ by Jørgen Stamp, attribution: digitalbevaring.dk, licensed under CC-BY 2.5 DK
74
75
How the tool works
https://dmponline.be
Log in with:
- institutional credentials (BELNET Federation)
- local account
- ORCID (if profile linked to ORCID)
1. Viewing existing plans
77
Click ‘View plans’ button to see the list of plans you have created, and/or plans that others have shared with you
2. Creating a new plan
Select funder to get its template
Select institution to get local guidance, as well as institutional template(s) - if funder not applicable
Choose additional optional guidance
78
2. Creating a new plan: features
Progress indicator
Section
Question
Write down your answer here
Leave a comment for collaborators
Custom guidance from funder, university, group…
80
81
3. Sharing your plan
Manage collaborators
Add collaborator by entering email address
Select permission level