Leveraging publication metadata to help overcome the data ingest bottleneck

Todd J. Vision, National Evolutionary Synthesis Center, Department of Biology, University of North Carolina at Chapel Hill. ORCID Participant Meeting, Harvard, May 2011.


DESCRIPTION

A talk on Dryad given at the ORCID Participant Meeting in Boston, 5/18/2011

TRANSCRIPT

Page 1: Leveraging publication metadata to help overcome the data ingest bottleneck

Todd J. Vision

National Evolutionary Synthesis Center

Department of Biology, University of North Carolina at Chapel Hill

ORCID Participant Meeting, Harvard, May 2011

Leveraging publication metadata to help overcome the data ingest bottleneck

Page 2: Leveraging publication metadata to help overcome the data ingest bottleneck

• The End: to make data archiving integral to scientific publishing.
• The Scope: data underlying findings in the peer-reviewed biological literature.
• The Means:
    - Integrated submission of data with the manuscript
    - Low barrier to submission (at the datafile level)
    - Free reuse of data (free as in both speech & beer)
    - Journals share responsibility for governance and sustainability

Page 4: Leveraging publication metadata to help overcome the data ingest bottleneck

The long tail of orphan data in “small science”

[Figure, built up over pages 3 and 4, after B. Heidorn: volume plotted against rank frequency of datatype. Labeled regions: specialized repositories (e.g. GenBank, PDB) and orphan data. Page 4 adds the citation below.]

Bumpus HC (1898) The Elimination of the Unfit as Illustrated by the Introduced Sparrow, Passer domesticus. A Fourth Contribution to the Study of Variation. pp. 209-226 in Biological Lectures from the Marine Biological Laboratory, Woods Hole, Mass.

Page 7: Leveraging publication metadata to help overcome the data ingest bottleneck

A publication package

[Diagram, built up over pages 5 to 7, with two numbered callouts:]

1. Integrated manuscript and data submission

2. Handshaking with specialized repositories

Page 14: Leveraging publication metadata to help overcome the data ingest bottleneck

[Diagram, built up over pages 8 to 14: the integrated submission workflow. Elements, in the order they are introduced: Submit manuscript; Manuscript metadata; Submit data; Peer review; Review passcode; Acceptance notification; Curation; Data DOI; Production; Article metadata; Curation; Article publication; Data publication.]

Page 17: Leveraging publication metadata to help overcome the data ingest bottleneck

[Diagram, built up over pages 15 to 17: the integrated and non-integrated submission workflows side by side.]

Integrated: Submit manuscript; Manuscript metadata; Submit data; Peer review; Review passcode; Acceptance notification; Curation; Data DOI; Production; Article metadata; Curation; Article publication; Data publication.

Non-integrated: Submit data; Author adds DOI; Data DOI; Article publication; Article metadata harvested.

Page 18: Leveraging publication metadata to help overcome the data ingest bottleneck

Article

Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA (2011) Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in phylogenetic trees of phylogenetic marker genes. PLoS ONE 6(3): e18011. doi:10.1371/journal.pone.0018011

Dryad data package

Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA (2011) Data from: Stalking the fourth domain in metagenomic data: searching for, discovering, and interpreting novel, deep branches in phylogenetic trees of phylogenetic marker genes. Dryad Digital Repository. doi:10.5061/dryad.8384
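The two citations on this slide follow a visible pattern: the data package reuses the article's authors and year, prefixes the title with "Data from:", and names Dryad Digital Repository with its own DOI. A minimal Python sketch of that pattern follows; the function name and its arguments are illustrative assumptions, not Dryad code.

```python
def data_package_citation(authors, year, article_title, data_doi):
    """Format a data package citation in the style shown on the slide.

    Hypothetical helper: the name and signature are assumptions made
    for illustration only.
    """
    return (f"{authors} ({year}) Data from: {article_title}. "
            f"Dryad Digital Repository. doi:{data_doi}")


citation = data_package_citation(
    authors="Wu D, Wu M, Halpern A, Rusch DB, Yooseph S, Frazier M, Venter JC, Eisen JA",
    year=2011,
    article_title=("Stalking the fourth domain in metagenomic data: searching for, "
                   "discovering, and interpreting novel, deep branches in "
                   "phylogenetic trees of phylogenetic marker genes"),
    data_doi="10.5061/dryad.8384",
)
print(citation)
```

Because everything except the data DOI comes from the article's own metadata, a citation like this can be assembled without asking the author to retype anything.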

Page 19: Leveraging publication metadata to help overcome the data ingest bottleneck

• Integrated submission
    - Currently integrated or in process: 20
    - All journals with Dryad content: >70
    - A minority require data prior to review
• Journals published by a variety of organizations
    - Traditional (incl. Oxford University Press, Wiley-Blackwell)
    - Open Access (incl. BMC, BMJ Open)
    - Society publishers (e.g. with Allen Press, or independent)

Page 20: Leveraging publication metadata to help overcome the data ingest bottleneck

Dryad vs. Supplementary Online Materials (SOM)

| Feature | Dryad | SOM |
| --- | --- | --- |
| Article citations: reuse of data leads to article citations | ✔ | ✔ |
| Identifiable: data DOIs within articles serve as permanent, resolvable identifiers | ✔ | ✔/✗ |
| Curated: quality control of data submissions and indexing metadata | ✔ | ✔/✗ |
| Economy of scale: cost efficiency from shared infrastructure | ✔ | ✔/✗ |
| Discoverable: indexed and exposed to both web and bibliographic search engines | ✔ | ✔/✗ |
| Ease of deposit: streamlined deposit, allows large and complex datasets | ✔/✗ | ✔/✗ |
| Formatted for reuse, i.e. not PDF | ✔/✗ | ✔/✗ |
| Updatable: new versions of data files can be added, metadata can be enhanced | ✔ | ✗ |
| Preservation planning: integrity audits, format migration, replication, etc. | ✔ | ? |
| Support for embargoes: can delay release of data in accordance with journal policy | ✔ | ? |
| Free reuse: no paywall, no unnecessary IP restrictions/ambiguities | ✔ | ? |
| Data citations: reuse of data leads to data citations | ? | ✗ |

Page 21: Leveraging publication metadata to help overcome the data ingest bottleneck

612 downloads

Page 22: Leveraging publication metadata to help overcome the data ingest bottleneck

Investigator toolkit

Member nodes

• Dryad, ORNL DAAC, Knowledge Network for Biocomplexity, etc.

Coordinating nodes

Page 23: Leveraging publication metadata to help overcome the data ingest bottleneck

Why Dryad yearns for ORCIDs

• Replace name strings with identities
    - Disambiguation of like names
    - Clustering of synonymous names
    - Confidently recognizing different data packages that share an author
• Enabling
    - Accurate author searches
    - Internal and external author hyperlinks
    - Aggregation of author contributions
    - Inclusion of data records in the profiles of coauthors
    - Propagation of ORCIDs with Dryad metadata
• Manual curation of names not feasible
    - Only ~20% of Dryad authors in Library of Congress name authority file
    - Manual control would explode curation costs
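The difference between name strings and identities can be illustrated with a small sketch. The records, names, and identifiers below are invented placeholders (they are not real Dryad records or real ORCIDs), and the grouping function is a hypothetical helper written only for this example.

```python
from collections import defaultdict

# Hypothetical records: (data package, author name string, author identifier).
# The identifiers are placeholders, not real ORCIDs.
records = [
    ("package-1", "Smith J", "id-A"),
    ("package-2", "Smith, John", "id-A"),  # synonymous name, same person
    ("package-3", "Smith J", "id-B"),      # like name, different person
]


def group_packages(records, key_index):
    """Group data packages by the chosen field of each record."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record[0])
    return dict(groups)


by_name = group_packages(records, 1)      # grouping on the name string
by_identity = group_packages(records, 2)  # grouping on the identifier

print(by_name)
print(by_identity)
```

Grouping on the name string merges two different people under "Smith J" and splits one person across two spellings; grouping on the identifier avoids both problems, which is the case the slide makes for identities.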

Page 24: Leveraging publication metadata to help overcome the data ingest bottleneck

How to get ORCIDs into Dryad

• Ideally sent to Dryad by integrated journals
    - Pre-review/Pre-production: allows coauthors to edit data packages
    - Post-production: works for all other uses
• Non-integrated journals
    - Lookup API based on article or affiliation data
• To be avoided
    - Authors required to enter ORCIDs during submission
    - Authors required to register during submission

Page 25: Leveraging publication metadata to help overcome the data ingest bottleneck

What do we know about authors?

• Names
    - Often abbreviated except for corresponding or submitting author
• At least one article they have written
    - Title, journal, volume, pages, DOI, abstract
• Other identifiable information
    - An email for submitting authors
    - Sometimes: institutional affiliation and contact information for corresponding authors
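As a rough sketch of how this information might feed a lookup based on article or affiliation data, the Python below models what is known about one author and builds a query from only the fields that are present. The class, field names, and query shape are assumptions for illustration; no actual ORCID or Dryad lookup API is being described, and no network call is made.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class KnownAuthor:
    """What a repository might know about one author (hypothetical model)."""
    name: str                          # often abbreviated, e.g. "Wu D"
    article_doi: str                   # at least one article they have written
    article_title: Optional[str] = None
    email: Optional[str] = None        # usually only for submitting authors
    affiliation: Optional[str] = None  # sometimes, for corresponding authors


def lookup_query(author: KnownAuthor) -> dict:
    """Build a query from the fields that are actually known."""
    return {key: value for key, value in asdict(author).items() if value is not None}


coauthor = KnownAuthor(name="Wu D", article_doi="10.1371/journal.pone.0018011")
print(lookup_query(coauthor))
```

For most coauthors the query reduces to an abbreviated name plus an article DOI, which is why a lookup keyed on article metadata matters more than one keyed on email or affiliation.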

Page 26: Leveraging publication metadata to help overcome the data ingest bottleneck

Some requirements

• Recognizing ORCIDs for authenticated users
    - Mapping to InCommon Silver profiles
    - ORCIDs for organizations (e.g. consortia)
• DSpace support
    - Curator interface for ORCID lookup/verification
    - Lookup/registration option from submission interface
    - Allowing metadata relationships (e.g. of an ORCID with a name)
• Mechanisms for curator to
    - Flag duplicates and errors
    - Register provisional ORCIDs
    - Map to other profiles (e.g. InCommon)

Page 27: Leveraging publication metadata to help overcome the data ingest bottleneck

Business model issues

• Dryad is (will be) supported by subscriptions and deposit charges, primarily from journals.
    - With a not-for-profit budget
• Feasibility requires wide adoption by publishers
    - And manuscript-submission system developers!
• Favored model
    - Pay for use of automated lookup services, with costs scaled by usage level
    - Credit for curator contributions

Page 28: Leveraging publication metadata to help overcome the data ingest bottleneck

For more information:

http://datadryad.org

http://blog.datadryad.org

http://datadryad.org/wiki

http://code.google.com/p/dryad

[email protected]: Dryad

Twitter: @datadryad

"Cherish old knowledge that you may acquire new" The Analects of Confucius

Special thanks to Elena Feinstein, Jane Greenberg, Ryan Scherle

Page 29: Leveraging publication metadata to help overcome the data ingest bottleneck

Dryad Metadata Profile (v3.0)

Data Package

• dc.identifier = doi of data package
• dc.relation.hasPart = dois of data files
• dc.references = handle of article description record
• dc.title = title of data package
• dc.description (not article abstract, optional)
• dc.creator = authors of data package
• dc.date (with refinements: dates associated with submission to Dryad and archiving in the repository)
• dryad.external = GenBank accession number, TreeBASE identifier
• dc.relation = URL of related resource
• dc.subject = general keywords
• DarwinCore.ScientificName = taxon keywords
• dc.spatial = geographic keywords
• dc.temporal = timespan keywords
• dryad.curatorNote

Article

• dc.identifier = doi of article
• bibo.status = article publication status
• dc.creator = authors of article
• dc.issued = article publication date
• dc.title = title of article
• bibo.journal = journal title
• bibo.issn and bibo.eissn
• bibo.volume
• bibo.issue
• bibo.pageStart and bibo.pageEnd
• dc.abstract = article abstract
• dc.isReferencedBy = data package doi

Datafile

• dc.identifier = doi of data file
• dc.relation.isPartOf = doi of data package
• file-specific description: keywords, authors, format, size, checksum, etc.
• embargo information (type, end date)
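The profile ties the three record types together through reciprocal links: a data package lists its files with dc.relation.hasPart, each data file points back with dc.relation.isPartOf, and the article points at the package with dc.isReferencedBy. The Python sketch below checks that those links agree. It is an illustration only: the identifiers are placeholders rather than real DOIs, the records are plain dictionaries rather than Dryad's actual storage format, and the package's dc.references link (a handle, per the profile) is left out.

```python
# Placeholder identifiers for illustration; these are not real DOIs.
PACKAGE_DOI = "doi:example/package"
FILE_DOIS = ["doi:example/package/1", "doi:example/package/2"]
ARTICLE_DOI = "doi:example/article"

package = {
    "dc.identifier": PACKAGE_DOI,
    "dc.relation.hasPart": list(FILE_DOIS),
    "dc.title": "Data from: An example article",
}
files = [
    {"dc.identifier": doi, "dc.relation.isPartOf": PACKAGE_DOI}
    for doi in FILE_DOIS
]
article = {"dc.identifier": ARTICLE_DOI, "dc.isReferencedBy": PACKAGE_DOI}


def links_are_consistent(package, files, article):
    """Check that package, file, and article records point at each other."""
    file_ids = {f["dc.identifier"] for f in files}
    parts_match = set(package["dc.relation.hasPart"]) == file_ids
    parents_match = all(
        f["dc.relation.isPartOf"] == package["dc.identifier"] for f in files
    )
    article_match = article["dc.isReferencedBy"] == package["dc.identifier"]
    return parts_match and parents_match and article_match


print(links_are_consistent(package, files, article))
```

A check of this kind is cheap to run at curation time and catches a file that was added to a package without the package record being updated, or the reverse.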