Creating a Data Management Plan (or how to get started with RDM)

Myriam Mertens | Ghent University Library

Nele Pauwels | Knowledge Centre for Health Ghent


Post on 20-Nov-2021

Aims of today’s workshop

1. provide you with a basic understanding of data management planning and why it’s important

2. give you a broad overview of the range of issues/topics involved in research data management

3. help you get started with writing your own data management plan

4. point you to existing resources for further information, training and advice

2

Some housekeeping information

• today: mix of us presenting & activities

• slides are available via osf.io/zse46

• useful links document: goo.gl/E2jnz2

3

Introduction

What is a Data Management Plan, and why create one?

4

Data management plan (DMP)

document outlining how data will be handled during and after a project

increasingly required by research funders/institutions

good practice even if not required, because…

5

Planning is first step towards good research data management (RDM)

“[RDM] ensure[s] that data are of a high quality, are well organized, documented, preserved, sustainable, accessible and reusable.” (Corti et al. 2014)

6

Data in the digital age

• data “explosion”

- navigating and using data is the challenge

• digital data are fragile, e.g. because of

- hardware/software failure

- human error

- malicious attacks

- natural disasters

- passage of time! (changing technology, loss of information…)

- …

7

Why you should care about RDM

• increases research efficiency

• facilitates collaboration

• encourages data reuse (increased visibility!)

• minimizes risk of data loss

• supports research integrity & quality

8

RDM is a crucial part of good research practice

Expectations regarding data include:

• secure preservation for a reasonable period

• access: as open as possible, as closed as necessary

• FAIR principles (Findable, Accessible, Interoperable & Reusable data)

• data = legitimate & citable products of research

9

Research data: from neglect…

Research data lifecycle

10

… to valuable scientific output!

Research data lifecycle

11

DMPs – a mere administrative burden?

- takes time and effort upfront, but…

+ saves time and problems later on

+ helps consider whole range of RDM activities/issues

+ makes expectations, procedures & responsibilities explicit

+ leads to more informed decisions about data

+ helps identify resources required (& obtain funding)

12

Key DMP topics

“[…] plans typically state what data will be created and how, and outline the plans for sharing and preservation, noting what is appropriate given the nature of the data and any restrictions that may need to be applied.” (DCC website)

1. description of data to be generated or used

2. methods, standards for collecting/creating & documenting data

3. ethics & legal issues

4. plans for data sharing

5. strategy for preserving data beyond project end

13

1. What types of data will you use or produce?

14

‘Research data’ can mean a lot…

any information collected/created for the purposes of analysis to generate scientific claims

For example:

• content: numerical, textual, audiovisual, multimedia... data

• data format/object: spreadsheets/tabular data, fieldwork notes, databases, images, audio recordings, marked up texts, surveys, instrument readings…

• mode of data collection: experimental, observational, simulation, derived/compiled… data

• digital or non-digital data

• primary or secondary data

• raw, processed or analyzed data

15

File formats

• potential problems

- not interoperable: other people cannot open file

- obsolescence: problems to open file at a later date

• formats for long-term access

- non-proprietary: no specific (version of) software required to open file

- open, documented standard

- widely used

- lossless (rather than lossy)

16

Some examples of recommended formats

• be aware of potential errors/information loss when converting

• if necessary, consider saving data in both proprietary and open format

17

Type of data | Formats

tabular data | .csv; .tab; .por (SPSS portable format); .xml

textual data | .rtf; .txt; .xml

image data | .tif (TIFF 6.0 uncompressed)

audio data | .flac

Source: https://www.ukdataservice.ac.uk/manage-data/format/recommended-formats
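One way to guard against information loss when converting to an open format is a round-trip check: write the data out, read it back, and compare. The sketch below does this for a small table written as .csv. The column names and values are invented for illustration, and all values are kept as text so the comparison is exact.

```python
import csv
import io

# Invented example rows; values are stored as strings, as a CSV reader returns them.
rows = [
    {"sample_id": "S01", "concentration_mg_l": "0.52"},
    {"sample_id": "S02", "concentration_mg_l": "1.10"},
]

# Write the rows to CSV (an in-memory buffer stands in for a file here).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["sample_id", "concentration_mg_l"])
writer.writeheader()
writer.writerows(rows)

# Read the CSV back and compare with the original rows.
buffer.seek(0)
round_trip = list(csv.DictReader(buffer))
print(round_trip == rows)
```

A check like this only covers what CSV can represent: formatting, formulas and multiple sheets in a spreadsheet are not preserved, which is one reason to keep the proprietary original as well.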

Data volume

• how much data will you produce? how fast will it grow?

• where will you store data? do you have enough storage capacity?

• what about back-ups?

18

3-2-1 backup rule

• have at least 3 copies of important files, on at least 2 different types of storage media, with at least 1 offsite copy

• back up regularly & automatically
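The 3-2-1 rule can be written down as a simple check. The sketch below is illustrative only: the `BackupCopy` record, its field names and the example storage locations are assumptions, not part of any UGent tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of a dataset (hypothetical record, for illustration only)."""
    label: str
    medium: str    # e.g. "laptop disk", "network share", "cloud"
    offsite: bool  # True if stored away from the primary location


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """Return True if there are >= 3 copies, on >= 2 media types, with >= 1 offsite."""
    enough_copies = len(copies) >= 3
    enough_media = len({copy.medium for copy in copies}) >= 2
    has_offsite = any(copy.offsite for copy in copies)
    return enough_copies and enough_media and has_offsite


plan = [
    BackupCopy("working copy", "laptop disk", offsite=False),
    BackupCopy("department share", "network share", offsite=False),
    BackupCopy("cloud copy", "cloud", offsite=True),
]
print(satisfies_3_2_1(plan))
```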

19

Central storage options (DICT)

• shared network drives (‘shares’): secure, regular backups

• OneDrive for Business (http://onedrive.ugent.be): 1 TB; no confidential data unless encrypted

• Sharepoint (http://sharepoint.ugent.be): no confidential data unless encrypted

20

Requirements: Very large datasets | Active data | Sharing outside of UGent | Local copy

Storage options (DICT): HPC (High Performance Computing) | Archival Shares | ACL (Access Control List) Shares | Sharepoint | OneDrive for Business

Confidential data

Yes No ✓ ✓

Yes Yes, on HPC ✓ Encrypt

Yes Yes ? ?

No No No ✓ ✓

No No Yes ✓ Encrypt

No Yes No ✓ ✓ ✓

No Yes Yes No ✓ ✓ Encrypt

No Yes Yes Yes ✓ Encrypt

(No) case-by-case … consider deposit

Activity:

• Introduce yourself to your neighbour and describe the types of data you produce/use

- e.g. content

- e.g. mode of data collection

- e.g. digital/non-digital

- e.g. formats

21

2. What methods, standards will you use to create & document your data?

22

Organizing data

23

Have a logical system for organizing data (files)

• should be meaningful to you and your colleagues

• should allow you to find files/data easily

• develop standards early in project & use consistently

• don’t forget non-digital materials

24

For example:

• hierarchical folder structure

• database for large, complex datasets

• non-digital data: filing system, labels to identify content of data folders

Organizing data

• which example looks better, and why?

25

File naming

• would you know what these files are in 3 years’ time?

26

Use file naming conventions

• names should

- uniquely identify and reflect content of a file

- be consistent

- have no special characters and spaces

- not be too long

- use date conventions (YYYYMMDD or YYYY-MM-DD)
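A naming convention is easier to apply consistently when names are generated and checked by a small script. The sketch below is one possible approach: the helper names, the allowed character set and the 50-character limit are assumptions chosen for illustration.

```python
import re
from datetime import date

# Letters, digits, underscore, hyphen and dot only: no spaces or special characters.
SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")
MAX_LENGTH = 50  # arbitrary limit chosen for this sketch


def build_name(prefix: str, when: date, extension: str) -> str:
    """Compose '<prefix>_<YYYY-MM-DD>.<extension>'."""
    return f"{prefix}_{when.isoformat()}.{extension}"


def is_valid_name(name: str) -> bool:
    """Check the length limit and the allowed character set."""
    return len(name) <= MAX_LENGTH and SAFE_NAME.fullmatch(name) is not None


name = build_name("Int024_MP", date(2008, 6, 5), "doc")
print(name, is_valid_name(name))
print(is_valid_name("my file (final).doc"))
```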

27

For example:

• 19991021_WesternBlot07

- western blot experiment number 7 from 21 Oct 1999

• Int024_MP_2008-06-05.doc

- interview with participant 24, interviewed by Marc Peters on 5 June 2008

Versioning

• would you know which version of the data to use? Or how the versions differ from each other?

28

Have a strategy for file versioning

• record changes made to files

• identify different versions of files

• decide which versions to keep & how to organize them (e.g. master & working copies)

29

Possible strategies:

• dates or version numbers in file names (v1.0, v1.1, v2.0...)

• keep a log or file history table

• use version control software (e.g. Git), or version control features in software
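Version numbers in file names and a file history table can be combined in a few lines of code. The sketch below is a minimal illustration: the function names, the log columns and the example entries are assumptions, and a temporary folder stands in for a real project folder.

```python
import csv
import tempfile
from datetime import date
from pathlib import Path


def versioned_name(stem: str, major: int, minor: int, extension: str) -> str:
    """Return a name such as 'survey_v1.1.csv'."""
    return f"{stem}_v{major}.{minor}.{extension}"


def log_version(log_path: Path, file_name: str, author: str, change: str) -> None:
    """Append one row to the file history table, writing a header if the log is new."""
    is_new = not log_path.exists()
    with log_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(["date", "file", "author", "change"])
        writer.writerow([date.today().isoformat(), file_name, author, change])


# Demonstration in a temporary folder; file names and changes are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "file_history.csv"
    log_version(log_path, versioned_name("survey", 1, 0, "csv"), "MP", "first complete export")
    log_version(log_path, versioned_name("survey", 1, 1, "csv"), "MP", "corrected coding of item 12")
    with log_path.open(newline="", encoding="utf-8") as handle:
        history = list(csv.reader(handle))

for row in history:
    print(row)
```

For code and text files, version control software such as Git records the same information automatically; a manual log like this is mainly useful for files that Git handles poorly, such as large binary data.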

Documenting data

30

Data documentation

• any descriptive or contextual information necessary to find, assess, understand & properly use your data

• start as soon as you start collecting data

31

Levels of data documentation

• study level

- published article may not be enough!

- conditions of data collection (e.g. study description, protocols, instruments & software/hardware used…)

- any changes made to collected data (e.g. processing & analysis procedures, scripts)

- overall structure of files

• data level

- information about individual data items/elements within data files (e.g. variable names & descriptions, value codes, within-file annotations,…)

32

Capturing data documentation

• in separate files, e.g.

- readme.txt files

- lab notebooks

- data dictionaries/codebooks

- project reports, publications

• embedded within data files

• using metadata (‘data about data’)

- highly structured format for describing data

- useful for searching through large amounts of data, to facilitate exchange & comparison of data

- elements from a controlled list, as defined by a metadata schema

Metadata example for image data, based on the Dublin Core metadata schema (reproduced from Briney 2015)
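To show what a structured metadata record looks like in practice, the sketch below builds a small record using element names from the Dublin Core element set (title, creator, date, format, identifier) and serializes it as XML. It is not the example from Briney 2015: the wrapper element `record` and all field values are invented placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict[str, str]) -> ET.Element:
    """Build a flat record with one dc:<element> child per field."""
    record = ET.Element("record")  # wrapper element name is an assumption
    for element, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{element}")
        child.text = value
    return record


# All values below are invented placeholders.
record = dublin_core_record({
    "title": "Western blot image, experiment 7",
    "creator": "Example Researcher",
    "date": "1999-10-21",
    "format": "image/tiff",
    "identifier": "19991021_WesternBlot07.tif",
})
print(ET.tostring(record, encoding="unicode"))
```

Because the element names come from a controlled list, records like this can be searched and exchanged between systems that understand the same schema.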

33

Don’t reinvent the wheel: use existing standards!

• standard = an agreed way of doing something

- Standards are agreed-upon conventions for doing something, e.g. managing a process or delivering a service, and are established by community consensus or an authority.

• examples:

- minimal reporting standards for research and publications (e.g. systematic review, qualitative research, diagnostic tests)

- minimal reporting standards for biomedical investigations (e.g. Flow Cytometry, quantitative Real-Time PCR, microarray experiment, T cell assay, immunohistochemistry, genotyping experiment, gel electrophoresis)

- standard vocabularies, ontologies (e.g. International Classification of Diseases, Breast Cancer Grading Ontology)

- standard data structures/formats (e.g. proteomics, x-ray data, statistical data)

34

Activity

• which minimal information should I report when performing my experiment(s), and/or writing my article?

=> Search

35

3. How will you handle ethics & legal issues?

36

Personal data

• any information relating to an identified/identifiable natural person

• handling is regulated by European and Belgian privacy/data protection legislation (e.g. the new GDPR), including:

- certain requirements for lawful processing of personal data (including data subject’s informed consent)

- appropriate organisational & technical measures to protect personal data

• sensitive personal data

- info about racial or ethnic origin, political opinions, religious/philosophical beliefs, trade-union membership, health or sex life, criminal offences…

- benefit from stronger protection

37

Otherwise confidential data, e.g.

• you have signed a non-disclosure agreement/contract with a confidentiality clause

• data are otherwise sensitive

- i.e. disclosure may harm endangered species, vulnerable sites or groups, national security…

- also see ethical standards governing confidentiality in research

• data have economic valorisation potential

- duty to report to UGent’s TechTransfer office before disclosing anything!

38

Personal/confidential data - some things to consider

• don’t collect more personal data than needed

• seek the permissions required to collect & handle these data

- e.g. informed consent, ethical approval, …

- permission for data collection, but also for processing, archiving, sharing, destroying…

• anonymize/pseudonymize personal data to protect privacy

• pay attention to data security to prevent unauthorized access & disclosure

- physical security (e.g. lock labs, offices…)

- security of computer systems and files (e.g. passwords, encryption, up-to-date software, controlled access to files/folders, …)

- network security (e.g. firewall protection, no confidential data on servers/computers connected to external network…)

familiarize yourself with UGent Information Security Policy & Guidelines!
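One common way to pseudonymize identifiers is keyed hashing. The sketch below is an illustration only, not a procedure prescribed by these slides or by UGent policy. The function name `pseudonymize`, the `P-` prefix and the 12-character truncation are assumptions made for the example. The key must be stored separately and securely: anyone holding it can re-link pseudonyms to known identifiers, so the output should still be treated as personal data.

```python
import hashlib
import hmac
import secrets


def pseudonymize(identifier: str, key: bytes) -> str:
    """Return a stable pseudonym for an identifier using keyed hashing (HMAC-SHA256)."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated for readability; a longer prefix lowers the chance of collisions.
    return f"P-{digest[:12]}"


# Generate the secret key once per project and keep it apart from the data.
key = secrets.token_bytes(32)

first = pseudonymize("participant-024", key)
second = pseudonymize("participant-024", key)
other = pseudonymize("participant-025", key)
print(first == second, first == other)
```

The same identifier always maps to the same pseudonym under one key, which keeps records linkable across files without exposing the original identifier.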

39

Intellectual property (IP)

• IP issues can affect your ability to (re)use, archive and/or share data

• data may be protected by IP rights

- e.g. copyright, database right

- permission from rights owner(s) required, e.g. to reproduce or publicly communicate data

• your data may have economic valorisation potential

- UGent normally owns such research results

- sharing may not be possible (to protect confidential knowhow), or only after an embargo period (to seek patent protection first) – always check with UGent TechTransfer first!

40

Intellectual property (IP)

• (rights to) data may be (co-)owned by a third party

- e.g. when you re-use existing data

- e.g. when research is funded by or conducted in collaboration with external partner

- check/obtain necessary third-party permissions!

41

4. What are your plans for externally sharing data?

42

Sharing data with others

• easier than ever in the digital age

• nothing new in certain domains + changing culture among researchers

- e.g. P. Masuzzo & L. Martens (2017), “Do you speak open science? Resources and tips to learn the language”, PeerJ Preprints 5: e2689v1. doi: 10.7287/peerj.preprints.2689v1

• increasingly required by research funders and publishers

- funder policies on access to research data (e.g. European Commission – H2020)

- journal data availability policies (e.g. PLOS journals, Nature, BioMed Central journals…)

43

Sharing does not necessarily mean “open data”

• fully ‘open’: “anyone can freely access, use, modify, and share for any purpose” (opendefinition.org)

• “as open as possible, as closed as necessary” - approach (cf. ethical & legal restrictions)

• possible to share data under more restricted conditions

- e.g. only a subset of the data

- only with certain (types of) users

- only for certain types of use

- after an embargo period…

44

Various ways of sharing data

• email data “upon request”

• disseminate via a project or personal website

• make data available via a trusted database or data repository

- general-purpose/multi-disciplinary, domain-specific, or institutional

- helps make data citable and FAIR

G. Polanczyk et al., “The Worldwide Prevalence of ADHD: A Systematic Review and Metaregression Analysis”, The American Journal of Psychiatry 164 (2007) 6: 942-948. doi: 10.1176/ajp.2007.164.6.942

45

A trusted data repository

• assigns a unique persistent identifier (e.g. DOI) to dataset, which resolves to a landing page

• provides online access to metadata (always public) + access to data & documentation (open or more restricted)

• states data reuse rights (via licenses)

• uses standards to promote interoperability

Dataset record from the 4TU.ResearchData repository.

doi: 10.4121/uuid:3106fb06-9723-49d1-b829-94778fa5aa6d

46

Publish a data paper

• extensive dataset description, published in a journal

• link to data deposited in repository

• paper and data are peer-reviewed

• cited like a traditional article

• format offered by regular journals + dedicated data journals (e.g. Scientific Data)

S.M. Kadri et al., “A variant reference data set for the Africanized honeybee, Apis mellifera”, Scientific Data 3 (2016), Article number: 160097. doi: 10.1038/sdata.2016.97

47

Licensing research data

• use licenses to make reuse rights clear

• many repositories use standard (rather than bespoke) licenses

- Creative Commons Licenses

- Open Data Commons Licenses

- but less suitable for restricted data

Licenses conformant with Open Definition principles. From “Conformant Licenses” by Open Definition, licensed under CC-BY

48

Reusing data

• find existing datasets via repositories’ data catalogues

• cite any datasets you reuse in your publications

• minimum elements:

- Author

- Publication Year

- Title

- Publisher

- Location (usually PID + resolver service)

Example

Benkman C (2016). Data from: Matching habitat choice in nomadic crossbills appears most pronounced when food is most limiting. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.dg41r
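The minimum citation elements can be assembled programmatically, which helps keep dataset citations consistent across a reference list. The sketch below is one possible rendering: the `DatasetCitation` class and its field names are hypothetical, and it uses the `https://doi.org/` resolver rather than the older `http://dx.doi.org/` form shown in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetCitation:
    """Minimum elements of a dataset citation (hypothetical helper class)."""
    author: str
    year: int
    title: str
    publisher: str
    doi: str  # persistent identifier, without the resolver prefix

    def formatted(self) -> str:
        """Render 'Author (Year). Title. Publisher. Location'."""
        location = f"https://doi.org/{self.doi}"  # PID + resolver service
        title = self.title.rstrip(".")
        return f"{self.author} ({self.year}). {title}. {self.publisher}. {location}"


citation = DatasetCitation(
    author="Benkman C",
    year=2016,
    title=(
        "Data from: Matching habitat choice in nomadic crossbills "
        "appears most pronounced when food is most limiting"
    ),
    publisher="Dryad Digital Repository",
    doi="10.5061/dryad.dg41r",
)
print(citation.formatted())
```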

49

Activity

• Why NOT share research data: give at least 3 arguments AGAINST data sharing

• Why share research data: give at least 3 arguments IN FAVOUR OF sharing

50

Why not share research data?

Common arguments AGAINST sharing

impossible because of privacy or IP

fear of being ‘scooped’

fear of errors being exposed

fear of misinterpretation or misuse

too much effort/too costly

data not of interest to anyone else

lack of reward

Adapted from Corti et al. 2014

51

Advantages of sharing research data

52

Common arguments FOR sharing

helps uncover errors, fraud, irreproducible results

avoids duplication of effort (greater ‘return on investment’)

public access to publicly funded research

results in citations (citation advantage for papers with shared data + citations when others reuse your data)

opportunities for new collaborations, co-authorships

advances science (accelerates discovery, facilitates new research questions/new forms of research)

returning the favour (already reusing other people’s data)

Adapted from Corti et al. 2014

5. How will your data be preserved for long-term access & use?

53

Preserving data

• what happens to research data once a project is completed?

54

Vines, Timothy et al. “The Availability of Research Data Declines Rapidly with Article Age.” Current Biology 24 (2014) 1: 94–97. doi:10.1016/j.cub.2013.11.014

the odds of supporting data still being available declined by 17% every year
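A 17% yearly decline compounds quickly. The short calculation below shows how much of a starting value is left after a number of years, assuming the decline applies multiplicatively each year. That compounding assumption and the function name are ours; the result is a relative factor, not an absolute availability rate.

```python
ANNUAL_DECLINE = 0.17  # from the slide: 17% decline per year


def remaining_fraction(years: int, annual_decline: float = ANNUAL_DECLINE) -> float:
    """Fraction of the starting value left after compounding the decline for `years` years."""
    return (1.0 - annual_decline) ** years


for years in (1, 5, 10, 20):
    print(years, round(remaining_fraction(years), 3))
```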

Don’t keep everything (indefinitely)

• select what to keep and how long, based on e.g.:

- obligations to keep data/documentation (legal obligations, funder policies…): e.g. clinical trial documents need to be kept for 20 years!

- what is needed to verify & validate your publications

- what cannot be recreated or is too expensive to recreate

- potential re-use value

- scientific, historical, cultural significance

- …

55

Preserving takes more than storing data

• keeping data files readable and usable over time requires appropriate strategies, e.g.:

- preparing data for preservation (e.g. sustainable file formats, documentation/metadata)

- moving files to new storage hardware every 3 to 5 years

- monitoring for file corruption using checksums

- making backups is still necessary

- …
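Monitoring for file corruption with checksums can be sketched as follows: record a checksum for every file once, then recompute and compare later. This is a minimal illustration using SHA-256. The function names and the in-memory manifest are assumptions, and a temporary folder with an invented file stands in for a real data folder.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Map each file's path (relative to the folder) to its checksum."""
    return {
        str(path.relative_to(folder)): sha256_of(path)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def changed_files(folder: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or whose checksum no longer matches the manifest."""
    current = build_manifest(folder)
    return [name for name, checksum in manifest.items() if current.get(name) != checksum]


# Demonstration: a changed file is detected on the second check.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "data.csv").write_text("id,value\n1,42\n", encoding="utf-8")
    manifest = build_manifest(folder)
    unchanged = changed_files(folder, manifest)
    (folder / "data.csv").write_text("id,value\n1,43\n", encoding="utf-8")
    corrupted = changed_files(folder, manifest)

print(unchanged, corrupted)
```

In practice the manifest would be saved alongside the data (for example as a text file) and checked on a schedule, and again after every move to new storage hardware.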

56

If possible, outsource preservation

57

• for example, to a trusted external data repository

- suitable for publicly shareable data that need longer retention periods

- check explicit commitment to preservation (e.g. preservation policy, certificate, statement on how long data will be supported...)

• confidential data may need to stay in-house

Find a data repository via Re3data.org

58

Things to consider when choosing a repository

• does it

- provide a persistent & unique identifier to your dataset?

- provide a landing page for each dataset, with metadata?

- help you track usage (e.g. access & download statistics)?

- have a certificate to indicate trustworthiness (e.g. DSA)?

- match your data needs (e.g. your type of data are accepted)?

- meet legal requirements in terms of data protection and allowing reuse without unnecessary licensing conditions?

- provide guidance on how to cite data?

- charge for its services?

Adapted from https://www.openaire.eu/opendatapilot-repository

59

Don’t forget non-digital materials

• for example:

- biological materials: BCCM culture collection (3 hosted @ Ghent University, Faculty of Sciences)

- human tissue: Bimetra biobank @ Ghent University Hospital

60

Activity

Search for a suitable domain-specific or general-purpose repository for your research data in re3data.org

Examples general-purpose: Dryad, Zenodo

Examples domain-specific: UniProt Knowledgebase, International Mouse Strain Resource, GenBank

61

Putting it all together: Create your own DMP

62

Use an online planning tool: DMPonline.be

DMPonline.be

• local instance of open source software developed by DCC (UK)

• launched as a pilot at UGent in 2015

• now hosted on BELNET servers + accessible to researchers from institutions within the DMPbelgium consortium

- currently:

64

65

How the tool works: demo

https://dmponline.be

Further tips for writing a DMP

• check applicable data policies

- e.g. Ghent University RDM Policy Framework

• keep it simple, but be as specific as possible

• justify your decisions

• consider it a ‘living’ document

• have a look at example DMPs

• familiarize yourself with RDM terminology & best practices (for your field)

66

Example plans

• examples on the Digital Curation Centre (DCC) website

http://www.dcc.ac.uk/resources/data-management-plans/guidance-examples

• examples in the Zenodo repository

https://zenodo.org/search?page=1&size=20&q=data management plans

• public DMPs on the DMPTool website

https://dmptool.org/public_dmps

• DMPs published in RIO (Research Ideas and Outcomes OA journal)

http://riojournal.com/browse_user_collection_documents?collection_id=3

67

Online RDM training resources

• FOSTER training portal

• OpenAIRE webinars

• EUDAT training materials

• Digital Curation Centre How-to Guides & Checklists

• UK Data Service ‘Prepare & Manage Data’ webpages

• MANTRA – Research Data Management Training

• ‘Research Data Management and Sharing’ MOOC on Coursera

• Data Management Training Clearinghouse

• Data4lifesciences Handbook for Adequate Natural Data Stewardship

• FAIRDOM Knowledge Hub

68

Want to get some feedback on your DMP?

• have a look at our Generic DMP Review Rubric

- a (self-)evaluation form for DMPs based on UGent generic DMP template

- available at https://osf.io/ezxm5/

• or… send us your DMP!

- [email protected]

- [email protected]

69

Thank you for listening!

70

Credits

• slides draw heavily on and/or adapt materials from:

K. Briney (2015), Data Management for Researchers: Organize, Maintain and Share your Data for Research Success (Pelagic Pub Ltd).

L. Corti, V. Van den Eynden & M. Woollard (2014), Managing and Sharing Research Data. A Guide to Good Practice (Sage).

S. Jones (2013), ‘Research Data Management’, Licensed under CC-BY

S. Jones (2016), ‘What is a Data Management Plan?’, Licensed under CC-BY 4.0

T. Ross-Hellauer & S. Jones (2016), ‘Research Data Management: An Introductory Webinar from OpenAIRE and EUDAT’, Licensed under CC-BY 4.0

71

• images

[slide 2]: From ‘Research Data Management: An Overview - 2014-05-12’ by Research Support Team, IT Services, University of Oxford, licensed under CC-BY-NC-SA 4.0

[slide 5]: ‘Writing’ by Aiconica, licensed under CC0 1.0

[slide 6]: ‘Database’ by Jørgen Stamp, attribution: digitalbevaring.dk, licensed under CC-BY 2.5 DK

[slide 7]: ‘Data Ocean’ by Auke Herrema – Het Bouwteam, licensed under CC-BY

[slide 8]: ‘FAIRDOM – Research Data Management’ by Stiftfilm.de, all rights reserved

[slide 9]: ‘The European Code of Conduct for Research Integrity. Revised Edition’ by ALLEA – All European Academies, redistribution permitted for educational, scientific and private purposes if the source is quoted.

[slide 10]: ‘Publications and Data’ by Auke Herrema, licensed under CC-BY 4.0

[slide 11]: From ‘Policy for Research Data Management’ by University of Copenhagen – Faculty of Health and Medical Sciences

[slide 12]: ‘Planning’ by Jørgen Stamp, attribution: digitalbevaring.dk, licensed under CC-BY 2.5 DK

[slide 16]: ‘File formats collection’ created by Freepik

[slide 18]: ‘Social media information overload’ by Mark Smiciklas, licensed under CC-BY-NC 2.0

[slide 19]: From ‘Research Data Management: An Overview - 2014-05-12’ by Research Support Team, IT Services, University of Oxford, licensed under CC-BY-NC-SA 4.0

[slide 20]: storage options DICT (UGent) by Johan Van Camp

72

• images

[slide 23]: From ‘Analyzing DMPs to inform Research Data Services’ by A. L. Whitmire, licensed under CC-BY 4.0

[slide 24]: From ‘Template Research Data Management workshop for STEM researchers’ by R. Higman and M. Teperek, licensed under CC-BY 4.0

[slide 26]: From ‘Template Research Data Management workshop for STEM researchers’ by R. Higman and M. Teperek, licensed under CC-BY 4.0

[slide 28]: From ‘Introduction to Research Data Management’ by A. Whitmire & S. Van Tuyl, licensed under CC-BY

[slide 30]: From ‘Data Handling: Documentation, Organization and Storage’ by Sebastian Netscher, licensed under CC-BY

[slide 31]: From ‘Research Data Management: An Overview - 2014-05-12’ by Research Support Team, IT Services, University of Oxford, licensed under CC-BY-NC-SA 4.0

[slide 32]: ‘Loss of Data’ by Auke Herrema – Het Bouwteam, licensed under CC-BY

[slide 35]: ‘Metadata’ by Jørgen Stamp, attribution: digitalbevaring.dk, licensed under CC-BY 2.5 DK

[slide 37]: ‘Privacy’ by Nick Youngson, licensed under CC-BY-SA 3.0

[slide 38]: ‘File: Lorenzo Federici 2’ by Walteroma10, licensed under CC-BY-SA 3.0

[slide 40]: ‘Property’ by Nick Youngson, licensed under CC-BY-SA 3.0

[slide 44]: ‘Data Tree’ by Auke Herrema – Het Bouwteam, licensed under CC-BY

73

• images

[slide 48]: From ‘Conformant Licenses’ by Open Definition, licensed under CC-BY

[slide 51]: ‘Data Sharing’ by Auke Herrema – Het Bouwteam, licensed under CC-BY

[slide 55]: ‘How to Choose’ by Jørgen Stamp, attribution: digitalbevaring.dk, licensed under CC-BY 2.5 DK

[slide 56]: ‘A Domesday system at the Vintage Computer Festival 2010, Bletchley UK’ by Regregex, licensed under CC-BY 3.0

[slide 57]: ‘Trustworthy Digital Preservation’ by Jørgen Stamp, attribution: digitalbevaring.dk, licensed under CC-BY 2.5 DK

[slide 58]: From ‘How to select a repository’ by OpenAIRE, licensed under CC-BY 4.0

[slide 59]: From ‘Research Data Management Briefing Paper’ by OpenAIRE, licensed under CC-BY 4.0

[slide 60]: BCCM consortium by BCCM

[slide 63]: ‘Tools’ by Jørgen Stamp, attribution: digitalbevaring.dk, licensed under CC-BY 2.5 DK

[slide 66]: From V. Van den Eynden et al., Managing and Sharing Data. Best practice for Researchers (UK Data Archive, 2009), licensed under CC-BY-NC-SA 3.0

[slide 70]: ‘Knowledge’ by Jørgen Stamp, attribution: digitalbevaring.dk, licensed under CC-BY 2.5 DK

74

75

How the tool works

https://dmponline.be

Log in with:

- institutional credentials (BELNET Federation)

- local account

- ORCID (if profile linked to ORCID)

76

How the tool works

https://dmponline.be

1. Viewing existing plans

77

Click ‘View plans’ button to see the list of plans you have created, and/or plans that others have shared with you

2. Creating a new plan

Select funder to get its template

Select institution to get local guidance, as well as institutional template(s) - if funder not applicable

Choose additional optional guidance

78

2. Creating a new plan: answering questions

79

Click ‘+’ sign to open up section and see questions

2. Creating a new plan: features

Progress indicator

Section

Question

Write down your answer here

Leave a comment for collaborators

Custom guidance from funder, university, group…

80

81

3. Sharing your plan

Manage collaborators

Add collaborator by entering email address

Select permission level

4. Exporting a plan

Select export format

Adjust export settings as needed

82

5. Finding help

83

Click ‘Help’ button for guidance