Critical Thinking: Data Management in Research – Why and How?
TRANSCRIPT
||Ana Sesartic / Matthias Töwe 1
Ana Sesartic & Matthias Töwe, Digital Curation Office
Critical Thinking: Data Management in Research – Why and How?
20.04.2016
||
Nice to meet you, we’re…
from the ETH Library’s Digital Curation Office, sharing a scientific background ourselves
here to discuss what you need to think about in data management as part of your research
here to learn more about your needs in the process
and here to motivate you to think critically about the opportunities and limitations of data management and re-use.
||
Briefly tell us something about…
Your scientific background
Previous experience with data management
Motivation to attend this course
||
Today’s Programme
What is data management and why should it concern you?
Data exercise – what it takes to understand someone’s data
Long term preservation of data
COFFEE BREAK
Data management plans
ETH regulations, intellectual property, privacy and access rights
Data management tools
Feedback round
||
Placemat Exercise
What does Research Data Management mean to you?
||
Placemat Method
[Figure: placemat with a field "Summarised Group Inputs" and one input field per participant]
||
Introduction
What is data management and why should it concern you?
Digital Research Data
Hypothesis / Research question
Data capture or collection
Analysis and interpretation
Synthesis / Publication
Access and verification
Re-use
||
Two major issues
Manage your data while doing research
Share, publish, preserve your data for others – and for yourself!
||
What is data?
“A reinterpretable representation of information in a formalized manner suitable for communication, interpretation, or processing.”
Digital Curation Centre
Slide adapted from the PrePARe Project
||
What is data management?
Data management is a general term covering how you organize, structure, store, and care for the information used or generated during a research project
It includes:
How you deal with information on a day-to-day basis over the lifetime of a project
What happens to data in the longer term – what you do with it after the project concludes
||
"A story told in file names":
Does this remind you of something?
Source: http://www.phdcomics.com/comics/archive.php?comicid=1323
Copyright: Jorge Cham
||
Why spend time and effort on this?
So you can work efficiently and effectively
Save yourself time and reduce frustration
Facilitate collaboration in your group
Highlight patterns or connections that might otherwise be missed
Because your data is precious – and costly
To enable data re-use and sharing – even for yourself
To meet funders’ and institutional requirements
You are the experts who can make informed decisions!
||
Why spend time and effort on this?
You as the authors are responsible for the work you deliver during your PhD
As group leaders you will be held responsible for complying with your institution’s and funders’ regulations…
…and therefore you will have to set and enforce rules for your group to achieve this
You oversee students’ and staff members’ work and need to rely on them following good scientific practice
You may be able to influence the discussion in your community, in your institution and with funders
||
Idealised Data Life Cycle
Digital Research Data
Hypothesis / Research question
Data capture or collection
Analysis and interpretation
Synthesis / Publication
Access and verification
Re-use
||
Manage the transition from private to public
(Derived from Andrew Treloar’s Diagram: Private Research, Shared Research, Publication, and the Boundary Transitions, http://andrew.treloar.net/research/diagrams/data_curation_continuum.pdf, 2012)
||
Benefits of data sharing and publication
SPEED: The research process becomes faster
EFFICIENCY: Data collection can be funded once, and used many times for a variety of purposes
ACCESSIBILITY: Interested third parties can (where appropriate) access and build upon publicly-funded research resources with minimal barriers to access
IMPACT and LONGEVITY: Open publications and data receive more citations, over a longer period
TRANSPARENCY and QUALITY: The evidence that underpins research can be made open for anyone to scrutinise, and attempt to replicate findings. This leads to a more robust scholarly record
||
Benefits of Open Data: Impact and Longevity
“In genomics research, a large-scale analysis of data sharing shows that studies that made data available in repositories received 9% more citations, when controlling for other variables; and that whilst self-reuse citation declines steeply after two years, reuse by third parties increases even after six years.”
(Piwowar and Vision, 2013)
Van den Eynden, V. and Bishop, L. (2014). Incentives and motivations for sharing research data, a researcher’s perspective. A Knowledge Exchange Report, http://repository.jisc.ac.uk/5662/1/KE_report-incentives-for-sharing-researchdata.pdf
||
Mixed News
Bad News
There is no Nobel Prize for data management…
…but there is some data management behind any prize
We are no experts in your field(s) of work – you are!
Good News
Data management can spare you (real) trouble
This is not Rocket Science
You may already be doing a lot of this – as part of your research
You are the experts on your data, anyway!
||
Constraints for preservation and sharing
Data is usually created without its publication in mind
Research data needs comprehensive documentation to be properly interpreted, let alone re-used
For text documents metadata can be improved at a later date
For research data, only technical metadata can be extracted later, but little to no documentation of content and context can be added:
“Garbage in, garbage out”
||
Exercise: What it takes to understand someone’s data
||
Discuss
1. What information is needed to understand this data?
2. What does the metadata tell you and what not?
3. Is this information sufficient? If not, what else do you need to know?
Data, metadata and context are needed to properly understand a data set.
||
Critically re-thinking the re-use of third party data
Data management does not only concern your own data; it also includes a critical view of other people’s data you use:
Do you understand how they were produced?
Do you have enough information to evaluate their reliability?
Are you comfortable with using data without talking to its producers?
Will you know in a few months’ time which data you re-used from other researchers?
Do you know how to cite the data you use?
||
Long term preservation of data
And how to prepare for it
||
What does long term mean?
Different time horizons and purposes
Keeping data for at least ten years (or for longer as is deemed appropriate in your field) to ensure accountability if results are challenged
Potentially unlimited retention of data with permanent value (e.g. long running series of observational data)
Permanent retention of published data which is considered as part of the scientific record and is expected to remain available just like articles and journals are
In general “long term” signifies any time period which spans technological changes in the way data is being used
||
Old files?
Cartoon «Old Files», taken from: http://xkcd.com/1360/
How old is your oldest file? Is it in use?
Age of files need not be a problem in itself…
…but obsolescence is
How much are you willing to invest to recover unreadable files?
Files in regular use will not become obsolete unnoticed
||
“Storage is becoming cheaper”
This is not (only) about storage
Cheaper storage helps – unless data grows faster than costs fall
Just keeping everything is neither archiving nor preservation
Preservation is about ensuring
Integrity and authenticity of stored data
Technical usability of data in the future: it can be rendered and used
Scientific usability of data in the future: it can be understood and interpreted
This is closely related to the issue of sharing and publishing data
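One common way to check the integrity of stored data over time is a checksum manifest. The following is a minimal sketch, not a description of any ETH service: the function names (`sha256_of`, `build_manifest`, `changed_files`) are hypothetical, and SHA-256 is assumed as the hash. It only detects changed or missing files; it does not cover authenticity or usability.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map every file below root (relative POSIX path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """List manifest entries that are missing or no longer match."""
    current = build_manifest(root)
    return sorted(
        name for name, digest in manifest.items()
        if current.get(name) != digest
    )
```

A manifest built when the data is deposited can be stored alongside it and re-checked with `changed_files` after every copy or migration.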
||
What is needed to ensure re-usability of data?
What? | Why? | Who?
Data Curation | Ensure intellectual re-usability | Data Producers (You!)
Content Preservation | Ensure technical re-usability | ETH-Bibliothek
Bitstream Preservation | Ensure technical stability | IT-Services ETH Zurich
Adapted after Jens Ludwig, Wissgrid
||
How does this relate to data management?
Proper data management, or its absence, determines whether preservation of data will be possible
For a period of ten years, data management alone might suffice, but thinking further ahead is useful
If data is to be kept and used for longer periods, further measures apply:
Data should be as self-contained as possible, including documentation of any tools used or, better, the tools themselves; remember, for example, to include reference outputs for model algorithms
More care is required in the choice and use of file formats
||
Preferences for file formats
Open standards (non-proprietary)
Well documented
Widely used and supported by many tools
Uncompressed (or at least losslessly compressed)
Unencrypted
When in doubt, keep the original and create a copy in an open or exchange format
Don’t rely on file extensions
Consider that data might be used in different operating systems
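To illustrate why file extensions alone are not reliable, a format can often be recognised from the first bytes of a file instead. This is a minimal sketch under stated assumptions: the signature table is illustrative and far from complete, and the names `SIGNATURES`, `sniff_format` and `sniff_file` are hypothetical. Dedicated format identification tools go much further.

```python
from pathlib import Path

# Leading "magic bytes" of a few common formats (illustrative, not exhaustive).
SIGNATURES = {
    b"%PDF-": "PDF",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"II*\x00": "TIFF (little-endian)",
    b"MM\x00*": "TIFF (big-endian)",
    b"PK\x03\x04": "ZIP container (e.g. ODF, OOXML)",
}


def sniff_format(data: bytes) -> str:
    """Guess a format from the leading bytes, ignoring any file extension."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return "unknown"


def sniff_file(path: Path) -> str:
    """Read only the first bytes of a file and guess its format."""
    with path.open("rb") as handle:
        return sniff_format(handle.read(16))
```

A file named `results.csv` that `sniff_file` reports as a ZIP container is probably a renamed spreadsheet rather than plain CSV.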
||
Examples
Images: uncompressed TIFF; JPEG2000
Text: ASCII, including XML etc. Add encoding information and dependencies such as stylesheets or TeX libraries
Text (page based): PDF/A-1b, (PDF)
Data from spreadsheets: CSV
Spreadsheets: (CSV), (ODF, OOXML)
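As a sketch of the CSV recommendation, tabular data can be written with an explicitly chosen encoding so that the choice can be documented with the file. The helper names `write_csv` and `read_csv` are hypothetical and UTF-8 is an assumption; only the Python standard library `csv` module is used.

```python
import csv
from pathlib import Path


def write_csv(path: Path, header: list[str], rows: list[list[object]]) -> None:
    """Write a table as CSV with an explicit, documented encoding (UTF-8)."""
    # newline="" lets the csv module control line endings itself.
    with path.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def read_csv(path: Path) -> list[list[str]]:
    """Read the table back; all values come back as strings."""
    with path.open("r", encoding="utf-8", newline="") as handle:
        return list(csv.reader(handle))
```

Note that CSV keeps values only: formulas, formatting and cell types from the original spreadsheet are lost, which is one reason to keep the original as well.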
||
Note
This does not mean you «must not» keep data in other formats
Just be aware that proprietary or undocumented formats (even your own!) might cause trouble in the future
Think about adding an alternative format (yes, redundantly) for a proprietary one…
…and add any context information you yourself would like to have on your own formats in a few years’ time in a readme-file, an accompanying document or as metadata
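A readme-file can be as simple as a plain-text file of key/value lines stored next to the data. The sketch below is only one possible layout: the function name `write_readme`, the `.README.txt` suffix and the field names in the usage note are assumptions, not a prescribed standard.

```python
from pathlib import Path


def write_readme(data_file: Path, fields: dict[str, str]) -> Path:
    """Write '<data file name>.README.txt' next to the data file."""
    readme = data_file.with_name(data_file.name + ".README.txt")
    lines = [f"README for {data_file.name}", ""]
    lines += [f"{key}: {value}" for key, value in fields.items()]
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme
```

For example, `write_readme(Path("run42.dat"), {"Format": "custom binary, little-endian", "Written by": "in-house logger"})` creates `run42.dat.README.txt` with one line per field.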
||
Data Management Planning
||
Consequences for preservation and sharing
A structured process with the aim of data publication and/or long term preservation must apply early in the life cycle: data management
Quality depends on producers; any later curator can only perform certain limited checks
Preservation and re-use have limits as a matter of principle and strongly depend on today’s preparatory efforts
||
What is a Data Management Plan (DMP)?
A brief plan written at the start of a project and updated during its course to define:
What data will be collected or created?
How will the data be documented and described?
Where will the data be stored?
Who will be responsible for data security and backup?
Which data will be shared and/or preserved?
How will the data be shared and with whom?
||
Why develop a DMP?
DMPs are increasingly required with grant applications, but are useful whenever researchers are creating data.
They can help researchers to:
Make informed decisions to anticipate & avoid problems
Develop procedures early on for consistency
Ensure data are accurate, complete, reliable and secure
Avoid duplication, data loss and security breaches
Save time and effort to make their lives easier!
||
What to do? Data Management Planning
Data Sharing and Publishing should be part of Data Management Planning from the beginning of a project
Check mandates with funders, in your community and with your institution
What can and may be made publicly available at a later stage?
Which means of publication should be used?
Will there be additional or consequential costs?
What are the prerequisites for re-using your data? Are there dependencies on software or hardware?
Will any restrictions for access and re-use be required?
||
What to do? Data Management Checklist ETH / EPFL
Supports you in the creation of a DMP or in discussing data management in general
Covers general planning and the phases of the data life-cycle, from data collection and creation to data sharing and long-term management
Special sections cover documentation and metadata, file formats, storage, ethical and intellectual property issues
http://bit.ly/rdmchecklist
||
DMPOnline
https://dmponline.dcc.ac.uk/
The DMPOnline tool by the UK Digital Curation Centre helps you create Horizon 2020 compliant data management plans by answering a questionnaire, ensuring that your scientific data are:
Discoverable
Accessible
Assessable and intelligible
Usable beyond the original purpose for which they were collected
Interoperable to specific quality standards
||
What to do? Data production and filing
From the start of data production, data publishing should be considered and prepared – if appropriate
Complying with requirements from third parties (funders, universities, community standards, law)
Use of comprehensible, documented structures
Use of transparent, self-explanatory or documented naming conventions (e.g. with key to abbreviations); names should not exceed reasonable limits
Versioning, supporting an easy selection of data e.g. for publication or further processing
Try to make your data as self-contained as possible
Avoid unclear redundancies
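A documented naming convention can also be checked automatically. The sketch below assumes a purely hypothetical convention, `<project>_<YYYYMMDD>_<description>_v<NN>.<ext>`, with a 64-character limit; the pattern and the names `PATTERN`, `MAX_LENGTH`, `parse_name` and `latest_version` are illustrative and should be replaced by whatever your group agrees on.

```python
import re
from typing import Optional

# Hypothetical convention: <project>_<YYYYMMDD>_<description>_v<NN>.<ext>
PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_"
    r"(?P<date>\d{8})_"
    r"(?P<description>[a-z0-9-]+)_"
    r"v(?P<version>\d{2})"
    r"\.(?P<ext>[a-z0-9]+)$"
)
MAX_LENGTH = 64  # assumed "reasonable limit" for a file name


def parse_name(filename: str) -> Optional[dict]:
    """Return the name's components, or None if it breaks the convention."""
    if len(filename) > MAX_LENGTH:
        return None
    match = PATTERN.match(filename)
    return match.groupdict() if match else None


def latest_version(filenames: list) -> Optional[str]:
    """Pick the conforming file name with the highest version number."""
    parsed = [
        (int(info["version"]), name)
        for name in filenames
        if (info := parse_name(name))
    ]
    return max(parsed)[1] if parsed else None
```

With such a check, selecting the version intended for publication or further processing becomes a lookup rather than a guess, and non-conforming names such as `final_FINAL_really.doc` are flagged early.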
||
What to do? Group Policy
Self-critical question:
What must data look like to enable us to re-use it with scientific conviction and trust in its quality and correctness?
Is this true for our own data? What is missing?
Tasks for group leaders within the group
Agree on binding rules and fix them in writing
Instruct new members
«Archive in an ordered way or delete» - do not create «data graveyards»
Control implementation while staff members are still there
||
An altruistic exercise?
There are benefits for yourself!
Your data is transparent and well managed – no matter if it’s public or not
Data can be shared more easily within the group, e.g. handed over to new students
Sharing and publishing with little additional effort and at short notice
Visibility of your work increases through Open Access to transparently organised and qualitatively relevant data
Should a publication be challenged, all material is within reach
Requirements by funders and institutions can be fulfilled with little additional effort
||
Limits to planning – our know-how
Learning processes within communities
What will really work over the next years, what is missing, what is not necessary?
It is not clear how much re-use will actually happen. Are expectations realistic?
In which disciplines and for which data types is re-use an attractive option?
Where does the balance lie between effort and opportunities for re-use?
Researchers and service units can and must learn together
||
ETH regulations, intellectual property, privacy and access rights
||
Recent Overview
https://itsecurity.ethz.ch/en/#/manage_your_data
||
Guidelines for Research Integrity
At the ETH Zurich research is founded on intellectual honesty. Researchers […] are committed to scientific integrity and truthfulness in research and peer review.
https://www.ethz.ch/content/dam/ethz/main/research/pdf/forschungsethik/Broschure.pdf
||
Article 11. Collection, documentation and storage of primary data
All steps in the treatment of primary data (statistical analyses, reorganizations, etc.) must be documented in a form appropriate to the discipline in question (e.g. laboratory logs, other data carriers) in such a way as to ensure that the results obtained from the primary data can be reproduced completely.
The project management is responsible for data management (data collection, storage, data access, compliance with data protection requirements, etc.). In particular, it should ensure that, following completion of the project, the data and materials are retained for the period prescribed in the discipline, and are duly destroyed within the period prescribed by law, if appropriate.
||
Compliance Guide
[…] all ETH members […] are required to integrate the general conditions and internal directives into the work process.
The following Compliance Guide is designed to serve as an orientation tool. […] To facilitate implementation, each point is supplemented with further information and contact persons available for consultation.
https://rechtssammlung.sp.ethz.ch/Dokumente/133en.pdf
||
Privacy
People-related data need to be preserved according to Swiss data protection law
Appropriate anonymization might be required
The deletion of individual datasets must be possible at all times
The study subjects need to sign a declaration of consent
||
Intellectual Property Rights
Intellectual Property Rights (IPR) can be defined as rights acquired over any work created or invented with the intellectual effort of an individual. These rights are time-limited and do not require explicit assertion.
Owners of works are granted exclusive rights
to publish the work
to license the work’s distribution to others
to sue in case of unlawful or deceptive copying
It may be possible to waive these rights through a public declaration similar to a licence
||
What you need to consider
Respect the rights of others
Third parties
Individuals you work with
In case of doubt: seek permission even when a CC-licence is assigned
Note that according to ETH law, ETH reserves most immaterial rights in works by its employees. When in doubt, contact ETH transfer (www.transfer.ethz.ch)
Make sure you keep sufficient rights
E.g. for Open Access Publishing (green path)
E.g. with respect to patent applications: ETH transfer (www.transfer.ethz.ch)
||
[Figure: Creative Commons icons and terms: by, share-alike, non-derivative, non-commercial, public domain, share, remix, "Some rights reserved"]
||
Licensing research data
www.dcc.ac.uk/resources/how-guides/license-research-data
Outlines pros and cons of each approach and gives practical advice on how to implement your licence
CREATIVE COMMONS LIMITATIONS
NC Non-Commercial: What counts as commercial?
SA Share Alike: Reduces interoperability
ND No Derivatives: Severely restricts use
Horizon 2020 guidelines point to [licence icon] OR [licence icon]
||
Open Access & Article Processing Charges
For a number of OA publishers, ETH-Bibliothek either covers the Article Processing Charges (APC) or has negotiated reduced rates
For more information: http://www.library.ethz.ch/en/ms/Open-Access-at-ETH-Zurich/Publishing-in-open-access-journals/Publishing-in-open-access-journals-Funding
||
Further Information on Open Access
Critical Thinking Workshop: Think, check, submit: Making informed decisions in open access publishing
Thu, 26 May 2016
http://www.library.ethz.ch/en/Dienstleistungen/Schulungen-Tutorials-Fuehrungen/Workshops/Think-check-submit-Informierte-Entscheidungen-ueber-das-Open-Access-Publizieren-treffen
Open Access at ETH Zurich
http://www.library.ethz.ch/en/Open-Access
Open Research Data
http://www.library.ethz.ch/en/ms/Open-Access-at-ETH-Zurich/What-is-Open-Access/Open-Research-Data
||
Group discussion: current practice
Versioning: How do you currently handle it? What works well? What went wrong?
Naming conventions: Do you have any? Which rules apply?
Sharing: Which tools or services do you use? What are your experiences?
Literature Management: Which tools do you use? What are their pros and cons?
||
Repositories and Registries
http://www.re3data.org
http://datadryad.org
https://zenodo.org
http://figshare.com
https://www.openaire.eu/search/data-providers
||
Deposit in a repository – but in which one?
http://databib.org
http://service.re3data.org/search
||
OpenAIRE
Open Access Infrastructure for Research in Europe
Aggregates data on OA publications
Mines and enriches its content by linking things together
Provides services & APIs, e.g. to generate publication lists
www.openaire.eu
||
Criteria for choosing services and tools
Where will your data reside?
Which legislation applies, e.g. in terms of data protection?
Is the service sustainable?
Do you trust the provider?
Who else can access and use which of your data?
Research Data Personal Data
How can you get your data back?
Is a certain license required?
Are there immediate or longer term costs?
||
Collaboration - Organisation
https://www.openproject.org
http://www.redmine.org
https://trello.com
https://slack.com
https://tagpacker.com
||
Collaboration - Sharing
https://owncloud.org
https://www.dropbox.com
https://www.switch.ch/drive/
https://cifex.ethz.ch/
https://polybox.ethz.ch
https://www.wetransfer.com
||
Collaboration – Versioning
https://subversion.apache.org
https://github.com
https://bitbucket.org
||
Collaboration - Writing
https://www.overleaf.com
https://www.authorea.com
https://atlas.oreilly.com
https://hypothes.is
https://evernote.com
http://simplenote.com
||
Collaboration – Reference Management
www.zotero.org
www.mendeley.com
endnote.com
www.citavi.com
www.citeulike.org
www.bibsonomy.org
||
Services at ETH Zurich I
ETH-Bibliothek Digital Curation (http://www.library.ethz.ch/Digital-Curation)
ETH Data Archive
Docuteam packer
DMP-Checklist
Archival formats
Consulting and training
DOI registration (http://www.library.ethz.ch/DOI-Desk-EN)
Open Access (http://www.library.ethz.ch/en/Open-Access)
ETH E-Collection (http://e-collection.library.ethz.ch/index.php?lang=en)
ETH E-Citations (http://e-citations.ethbib.ethz.ch/index.php?lang=en)
||
Services at ETH Zurich II
Scientific IT Services
Research Data Analysis and Management (http://www.sis.id.ethz.ch/researchdatamanagement/)
openBIS / openBIS ELN-LIMS (https://sis.id.ethz.ch/software/openbis.html)
IT Services
Storage provision, usually via your IT Support Group, e.g.
HSM (Hierarchical Storage Management) (http://www1.ethz.ch/id/services/list/storage/nas)
LTS (Long-Term Storage) (http://www1.ethz.ch/id/services/list/storage/LTS)
||
Take home message
Think about what you do!
Start early
Agree on clean concepts and simple tools
You do not need the latest sophisticated apps
Talk to colleagues
Check what your local service providers can offer
«Keep it as simple as possible – but distrust it!»
||
Thank you! Are there any questions?
Dr. Matthias Töwe, Head Digital Curation, ETH-Bibliothek, Rämistrasse 101, 8092 Zurich, 044 632 60 [email protected]
Dr. Ana Sesartic, Digital Curation, ETH-Bibliothek, Rämistrasse 101, 8092 Zurich, 044 632 73 [email protected]
www.library.ethz.ch/Digital-Curation