Citing data in research articles: principles, implementation, challenges - and the benefits of changing our ways
Jo McEntyre, Europe PMC, EMBL-EBI, www.ebi.ac.uk
Life Science Data
Familiar Complexity!
Article ‘Package’
External Resources
“Recognized” data repos: file|structured record, Accession|DOI|API + Accession
Institutional repos: file|structured record, URL|DOI|API + Accession
Author database|‘website’: file|struct record, URL|DOI|API + Accession
Supp info tables/data: file, URL|DOI
Cross-reference
Dataset list
Ref to external res
Ref to external res
Reference list
Fig Source data: file, URL|DOI
Fig (caption + graphic)
Cross-reference
Ref to external resource
Adapted from Thomas Lemberger, EMBO
Europe PMC literature database
Europe PMC
• Abstracts: 30 million
• Full-text articles: 3 million
• Article citation counts
• Grants
• ORCIDs
• Semantic annotation
• Data citations
• Data integration
Europe PMC is a member of the PMC International Collaboration.
Funded by 28 European funders of life science research
About EMBL-EBI
• Part of the European Molecular Biology Laboratory
• International, non-profit research institute
• Europe’s hub for biological data services and research
Making data discoverable
Labs around the world deposit data and we…
Archive it
Classify it
Share it with other data providers
Analyse, add value and integrate it
…provide tools to help researchers use it
A collaborative enterprise
Journal Data Publishing
Data Citation in Europe PMC full text
Literature*
Added-Value
Submitted
*OMIM, Clinical trials, GO
Submission statements vs reuse?
260K
Data Citation Principles Engender Two Big Ideas
"sound, reproducible scholarship rests upon a foundation of robust, accessible data"
"data should be considered legitimate, citable products of research"
These slides are adapted from: http://www.slideshare.net/joanstarr/data-citation-a-joint-declaration-of-principles
1. Importance
2. Credit and Attribution
3. Evidence
4. Unique Identification
5. Access
6. Persistence
7. Specificity and Verifiability
8. Interoperability and flexibility
Full Principles: https://www.force11.org/datacitation
Joint Declaration on Data Citation Principles
1. Importance
Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.
2. Credit and Attribution
Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data, recognizing that a single style or mechanism of attribution may not be applicable to all data.
3. Evidence
In scholarly literature, whenever and wherever a claim relies upon data, the corresponding data should be cited.
4. Unique identification
A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.
etc.. !!!
5. Access
Data citations should facilitate access to the data themselves and to such associated metadata, documentation, code, and other materials, as are necessary for both humans and machines to make informed use of the referenced data.
6. Persistence
Unique identifiers, and metadata describing the data, and its disposition, should persist -- even beyond the lifespan of the data they describe.
7. Specificity and Verifiability
Data citations should facilitate identification of, access to, and verification of the specific data that support a claim. Citations or citation metadata should include information about provenance and fixity sufficient to facilitate verifying that the specific timeslice, version and/or granular portion of data retrieved subsequently is the same as was originally cited.
8. Interoperability and flexibility
Data citation methods should be sufficiently flexible to accommodate the variant practices among communities, but should not differ so much that they compromise interoperability of data citation practices across communities.
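Principle 7 asks for "fixity" information so that a reader can verify that the data retrieved later are the same as the data originally cited. The principles do not prescribe a mechanism; one common approach, assumed here purely for illustration, is to record a checksum alongside the citation. The function names and the sample bytes below are hypothetical.

```python
import hashlib


def fixity_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of the dataset bytes."""
    return hashlib.sha256(data).hexdigest()


def is_same_as_cited(retrieved: bytes, cited_digest: str) -> bool:
    """Check retrieved data against the digest recorded at citation time."""
    return fixity_digest(retrieved) == cited_digest.lower()


# Hypothetical dataset content, standing in for a deposited file.
original = b"sample_id,value\nA,1.0\nB,2.5\n"

# Digest stored with the citation metadata when the data were cited.
cited = fixity_digest(original)

# Later retrieval: unchanged bytes verify, modified bytes do not.
unchanged_ok = is_same_as_cited(original, cited)
modified_ok = is_same_as_cited(original + b"C,3.0\n", cited)
```

A checksum only covers one exact byte sequence, so a citation to a timeslice or subset would need the digest of that specific portion rather than of the whole dataset.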
Many organizational endorsements
An implementation example
Principle 2: Credit and Attribution
Principles 4, 5, 6: Unique ID, Access, Persistence
Principle 7: Specificity and Verifiability
Principle 8: Interoperability and flexibility
Creators, Year, Dataset Title, DOI, Data Repository, version
(Resolves to landing page with access to metadata, docs, and data)
Slide from Mercè Crosas, Ph.D., Harvard University
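The citation elements listed on this slide (creators, year, dataset title, DOI, data repository, version) can be assembled mechanically. The sketch below is one possible rendering, not a prescribed style: the class name, the punctuation, and all example values (including the placeholder DOI `10.1234/example`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataCitation:
    creators: List[str]
    year: int
    title: str
    repository: str
    doi: str
    version: Optional[str] = None

    def format(self) -> str:
        """Render the elements in the order given on the slide."""
        parts = [
            f"{'; '.join(self.creators)} ({self.year}).",
            f"{self.title}.",
            f"{self.repository}.",
        ]
        if self.version:
            parts.append(f"Version {self.version}.")
        # The DOI is written as a resolvable link to the landing page.
        parts.append(f"https://doi.org/{self.doi}")
        return " ".join(parts)


# Hypothetical dataset; none of these values refer to a real deposit.
example = DataCitation(
    creators=["Doe, J.", "Roe, R."],
    year=2014,
    title="Example survey dataset",
    repository="Example Data Repository",
    doi="10.1234/example",
    version="2",
)
rendered = example.format()
```

Keeping the elements as separate fields, rather than a single string, is what lets the same record be rendered in different journal styles (Principle 8).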
http://europepmc.org/articles/PMC3089613
Large dataset:
http://europepmc.org/articles/PMC3535838
http://europepmc.org/articles/PMC3766260
http://europepmc.org/articles/PMC3704603
http://europepmc.org/articles/PMC3710810
Fig. 2
!! 2469 references !!
http://europepmc.org/articles/PMC2672098
Examples of Implementations of Data Citations in Reference Lists
http://europepmc.org/articles/PMC3661987
<mixed-citation publication-type="other">
Occurrence in reference list:
Occurrence in text:
Tagged in reference list as:
http://europepmc.org/articles/PMC3646594
<mixed-citation publication-type="thesis">
Occurrence in text:
Occurrence in reference list:
Tagged in reference list as:
http://europepmc.org/articles/PMC3722494
<mixed-citation publication-type="webpage">
Also in this reference list: a non-DOI data citation
Occurrence in text:
Occurrence in reference list:
Tagged in reference list as:
http://europepmc.org/articles/PMC3626513
<mixed-citation publication-type="journal">
Occurrence in text:
Occurrence in reference list:
Tagged in reference list as:
Cite data generated in the course of the work described?
JATS support for data citation
<mixed-citation publication-type='data'>
  <name><surname>Heinz</surname><given-names>D.W.</given-names></name>,
  <name><surname>Baase</surname><given-names>W.A.</given-names></name>,
  <etal>et. al.</etal>
  <data-title>How amino-acid insertions are allowed in an alpha-helix of T4 lysozyme</data-title>.
  <source>PDB Europe</source>, accession
  <pub-id pub-id-type='accession' assigning-authority='pdb' xlink:href='http://www.ebi.ac.uk/pdbe/entry/search/index?text:102L'>102l</pub-id>.
  <pub-id pub-id-type='doi' xlink:href='http://dx.doi.org/10.2210/pdb102l/pdb'>10.2210/pdb102l/pdb</pub-id>
</mixed-citation>
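Tagging a reference as publication-type 'data' makes it machine readable. As a rough sketch of what a downstream consumer could do with the example above, the following uses Python's standard xml.etree.ElementTree to pull out the identifiers. The wrapping <ref> element with an xmlns:xlink declaration is added here as an assumption so that the fragment parses on its own, the author list is shortened, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# The slide's citation, wrapped so that the xlink prefix is declared.
SNIPPET = """<ref xmlns:xlink="http://www.w3.org/1999/xlink">
<mixed-citation publication-type="data">
<name><surname>Heinz</surname><given-names>D.W.</given-names></name>,
<data-title>How amino-acid insertions are allowed in an alpha-helix of T4 lysozyme</data-title>.
<source>PDB Europe</source>, accession
<pub-id pub-id-type="accession" assigning-authority="pdb" xlink:href="http://www.ebi.ac.uk/pdbe/entry/search/index?text:102L">102l</pub-id>.
<pub-id pub-id-type="doi" xlink:href="http://dx.doi.org/10.2210/pdb102l/pdb">10.2210/pdb102l/pdb</pub-id>
</mixed-citation>
</ref>"""


def extract_data_citations(xml_text):
    """Return one dict per mixed-citation tagged as publication-type 'data'."""
    root = ET.fromstring(xml_text)
    results = []
    for cit in root.iter("mixed-citation"):
        if cit.get("publication-type") != "data":
            continue  # skip journal, thesis, webpage, other
        ids, links = {}, {}
        for pub_id in cit.iter("pub-id"):
            kind = pub_id.get("pub-id-type")
            ids[kind] = (pub_id.text or "").strip()
            links[kind] = pub_id.get(f"{{{XLINK}}}href")
        results.append({
            "title": cit.findtext("data-title"),
            "source": cit.findtext("source"),
            "ids": ids,
            "links": links,
        })
    return results


citations = extract_data_citations(SNIPPET)
```

The same loop would skip the earlier examples tagged as "other", "thesis", "webpage" or "journal", which is the practical argument for a dedicated 'data' type.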
Minimal, maximal & extensible citation
Resource name, ID
Resource name, Resolution ‘template’, ID
Author list, Resource name, Resolution ‘template’, ID, Time, ?
Author list, Resource name, Resolution ‘template’, ID, Time, ?
For example: new data vs pre-existing data
For example: version
Thomas Lemberger, EMBO
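The minimal-to-extensible idea can be sketched as a record whose only required fields are a resource name and an ID, with a per-resource resolution template turning the ID into a link. This is an illustration under assumptions: the class and field names are hypothetical, and the two templates are simply the URL patterns that already appear elsewhere in these slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Resolution templates keyed by resource name (patterns taken from URLs in this deck).
RESOLUTION_TEMPLATES = {
    "doi": "http://dx.doi.org/{id}",
    "europepmc": "http://europepmc.org/articles/{id}",
}


@dataclass
class Citation:
    # Minimal citation: resource name + ID.
    resource: str
    identifier: str
    # Optional additions for a fuller citation.
    authors: List[str] = field(default_factory=list)
    time: Optional[str] = None
    # Extension slot for the '?' fields, e.g. version or new vs pre-existing data.
    extra: Dict[str, str] = field(default_factory=dict)

    def resolve(self) -> str:
        """Fill the resource's resolution template with the ID."""
        return RESOLUTION_TEMPLATES[self.resource].format(id=self.identifier)


minimal = Citation("europepmc", "PMC3089613")
extended = Citation(
    "doi",
    "10.2210/pdb102l/pdb",
    authors=["Heinz D.W.", "Baase W.A."],
    extra={"status": "pre-existing data"},
)
minimal_url = minimal.resolve()
extended_url = extended.resolve()
```

Because resolution depends only on the resource name and ID, the minimal and extended forms resolve the same way; the extra fields add credit and context without changing how the data are found.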
Integrated Research
Reused from: seier+seier, Flickr
Reused from: Images Money, Flickr
Articles
Data
People
Institutions
Funders
A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.
4. Unique identification
etc..
Joint Declaration
1. Discoverability through accessibility
• Deposit in a public/open database
• Where possible, structured archive (e.g. PDB, ENA) >> unstructured archive (e.g. Zenodo, Figshare)
• Uniquely identify it: PID, Accession number, DOI, ROI
• Give it context: metadata (and more)
• All of the above = citable = Discoverable
2. Discoverability through structured data
Structured data is one of the true enablers of life science
- Discovery of homology between genes across species
- Predicting function based on protein folds
• Structured data can be cross-analysed, compared by algorithm, and encourages development of new products and tools
Structured data is good value for money
Annual cost of generating new protein structure data in labs around the world
Annual cost of maintaining it in a central database
Degrees of Data
Unstructured/semi-structured
Structured
Added Value
Metadata
A picture of a graph
A spreadsheet of my results
A record in a DNA sequence database
A graphical display of a genome
A narrative with citations, pictures and attachments
Article
Metadata – critical to discoverability
Generic: title, submitters, date, file format, version.
citation, basic search
Wagner F.F., 23-APR-2002, TPA: Homo sapiens SMP1 gene, RHD gene and RHCE gene, INSDC, 14-NOV-2006 (Rel. 89, Last updated, Version 7). BN000065
Specific: organism, tissue, assay, page number …
deep search, analysis, computation
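The split between generic and specific metadata can be made concrete with a toy index. In this sketch, generic fields support a basic title search and specific fields support a filtered "deep" search. The first record reuses the BN000065 citation above; the second record (accession EX000001) is entirely hypothetical, as are the function and field names.

```python
RECORDS = [
    {
        "accession": "BN000065",
        "generic": {
            "title": "TPA: Homo sapiens SMP1 gene, RHD gene and RHCE gene",
            "submitters": ["Wagner F.F."],
            "date": "23-APR-2002",
            "version": "7",
        },
        "specific": {"organism": "Homo sapiens"},
    },
    {
        # Hypothetical record, included only to give the filters something to exclude.
        "accession": "EX000001",
        "generic": {
            "title": "Example mouse liver expression set",
            "submitters": ["Doe J."],
            "date": "01-JAN-2014",
            "version": "1",
        },
        "specific": {"organism": "Mus musculus", "tissue": "liver"},
    },
]


def basic_search(records, term):
    """Generic metadata: case-insensitive match on the title."""
    term = term.lower()
    return [r["accession"] for r in records
            if term in r["generic"]["title"].lower()]


def deep_search(records, **criteria):
    """Specific metadata: every requested field must match exactly."""
    return [r["accession"] for r in records
            if all(r["specific"].get(k) == v for k, v in criteria.items())]


title_hits = basic_search(RECORDS, "rhd")
human_hits = deep_search(RECORDS, organism="Homo sapiens")
liver_hits = deep_search(RECORDS, organism="Mus musculus", tissue="liver")
```

A record lacking a specific field (here, tissue for BN000065) can never match a query on that field, which is why sparse metadata limits deep search even when the data themselves are deposited.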
BioStudy, EBI
BioStudy database for unstructured data
Study
Publications
Ontologies
Data files
Other DBs
Metadata
Other DBs
Elixir: An international distributed infrastructure for
• Data
• Standards
• Tools
• Compute
• Training
• Industry
THE END