Building an NIH Data Catalog: Bit by Bit
DESCRIPTION
OBJECTIVE: The purpose of the project was to a) develop a set of core, minimal metadata elements that would be used to describe data sets, and b) carry out a study to identify data sets in NIH-funded articles from PubMed and PubMed Central (PMC) that do not indicate that their data is stored in a specific place such as a repository or registry. These efforts will inform the BD2K initiative and a planned NIH Data Catalog.
METHODS: The metadata schemas of all NIH data repositories were analyzed. Commonalities among these repositories were identified, mapped to existing data-specific metadata standards from DataCite and Dryad, and then integrated into MEDLINE XML metadata in an attempt to establish a sustainable and integrated metadata schema. The second phase of the project identified data sets in articles from PubMed and PMC by searching specifically for NIH-funded articles from the year 2011. After excluding articles that mention data sets being deposited in existing repositories, thirty staff members from NLM and BD2K were recruited to analyze a random sample of the results to identify how many and what types of data sets were created per article.
RESULTS: A preliminary set of minimal metadata elements was developed that could sufficiently describe NIH-funded data sets and be integrated within MEDLINE’s schema, with minor additions. Results of the second phase, the analysis of PubMed and PMC articles for data sets, are pending until all submissions from NLM staff are complete.
CONCLUSION: The efforts to develop a minimal set of metadata elements and to identify the number and types of data sets produced in NIH-funded articles will inform the BD2K initiative to build an NIH Data Catalog going forward.
TRANSCRIPT
![Page 1: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/1.jpg)
Building an NIH Data Catalog: Bit by Bit
Kevin Read
NLM Associate Fellowship Presentation
July 24, 2013
1
![Page 2: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/2.jpg)
NIH Big Data to Knowledge
Facilitating Broad Use of Biomedical Big Data
2
![Page 3: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/3.jpg)
NIH Data Catalog
What is it designed to do?
3
![Page 4: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/4.jpg)
NIH Data Catalog
Data sets are CITABLE
Data sets are DISCOVERABLE
Data sets are LINKED TO THE LITERATURE
Data sets are PART OF THE RESEARCH ECOSYSTEM
4
![Page 5: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/5.jpg)
NIH Data Catalog
What do we need to know in order to build it?
Minimal Metadata Elements
How do current data repositories describe their data?
Orphaned Data Sets
How many data sets are not currently represented in a data repository?
5
![Page 6: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/6.jpg)
Finding Common Metadata Elements
Exploring how NIH Data Repositories describe their data
6
![Page 7: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/7.jpg)
7
![Page 8: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/8.jpg)
Categorizing Metadata Descriptors
Common Metadata Elements
Authorship
Data Description
Title Information
8
![Page 9: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/9.jpg)
Identifying Metadata Variations
Date: Study Date, Date Processed, Release Date, Completion Date, Last Updated Date, Prepared on Date
Authorship: Authors, Creators, Data Provider, Principal Investigator(s), Contributors, Data Authors
9
![Page 10: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/10.jpg)
Mapping Metadata Commonalities to Existing Standards
Common Metadata Elements
10
![Page 11: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/11.jpg)
11
Mapping Metadata to MEDLINE

| Common Metadata Elements | Proposed Definition |
| --- | --- |
| Data Unique Identifier | A unique ID string that identifies a data set within the catalog |
| Author | Individuals involved in producing or contributing to data |
| Affiliation | Affiliation of each author associated with the appropriate author occurrence |
| Data Title | Name or title by which the data set is known |
| Data Location | The name of the entity that holds, archives, publishes, distributes, releases, issues, or produces the data, with its associated accession number |
| Date | The year, month and day when the data was made available |
| Data Description (structured narrative) | Structured narrative description for efficient indexing |
| Data Descriptors | Metadata describing data contents using controlled labels (e.g. Organism, Disease, Perturbation, Gender, Cell type) |
| PMID | Identifier that will link the data set to associated article(s) and be provided for the data catalog entry |
| Availability/Accessibility of Data | Indication of whether the data is available to use and how to access it |
| Award Number | Grant/award numbers associated with the data set |
| Related Data | Data that was used in the creation of the new data set |
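To make the proposed element set concrete, here is a minimal Python sketch of one catalog record. The class names, field names, and the `missing_elements` helper are hypothetical and chosen only for illustration; the slides define the elements, not a data model. The example values are taken from the sample citation used later in the deck.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List


@dataclass
class Author:
    """One author, with the affiliation tied to that author occurrence."""
    name: str
    affiliation: str = ""


@dataclass
class DataCatalogRecord:
    """Hypothetical container for the minimal metadata elements."""
    data_unique_identifier: str = ""
    authors: List[Author] = field(default_factory=list)
    data_title: str = ""
    data_location: str = ""       # holding entity plus accession number
    date: str = ""                # when the data was made available
    data_description: str = ""    # structured narrative
    data_descriptors: Dict[str, str] = field(default_factory=dict)
    pmids: List[str] = field(default_factory=list)
    availability: str = ""
    award_numbers: List[str] = field(default_factory=list)
    related_data: List[str] = field(default_factory=list)

    def missing_elements(self) -> List[str]:
        """Return the names of elements that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


record = DataCatalogRecord(
    data_unique_identifier="DUID00001",
    authors=[Author("Marazita ML"), Author("Feingold E")],
    data_title="Dental Caries: Whole Genome Association and Gene x Environment Studies",
    data_location="dbGaP/pht002543.v2.p1",
    pmids=["22123456"],
)
print(record.missing_elements())
```

A check like `missing_elements` is one possible way to flag incomplete records before they enter a catalog; whether every element should be mandatory is not stated in the slides.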
![Page 12: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/12.jpg)
Data Catalog Citation
Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D. Dental Caries: Whole Genome Association and Gene x Environment Studies. NIH Data Catalog. 2014 Jan;1(1):DUID00001. PubMed PMID: 22123456.
SI: dbGaP/pht002543.v2.p1
Author: Marazita ML, Weynat RJ, Feingold E, Weeks D, Crout R, McNeill D
Data Title: Dental Caries: Whole Genome Association and Gene x Environment Studies
Data Description Location: NIH Data Catalog
Date of NIH Data Catalog issue: 2014 Jan
NIH Data Catalog Volume (Issue): 1(1)
Data Unique Identifier: DUID00001
PMID Assigned to NIH Data Catalog Record: 22123456
Secondary source ID (Link to actual dataset): dbGaP/pht002543.v2.p1
12
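The sample citation follows a MEDLINE-style pattern, so it can be assembled mechanically from its parts. The sketch below is an illustration only: the function name and parameter names are hypothetical, and the layout is inferred from the single example on the slide rather than from any published citation specification.

```python
from typing import List


def format_catalog_citation(
    authors: List[str],
    title: str,
    year: int,
    month: str,
    volume: int,
    issue: int,
    duid: str,
    pmid: str,
    secondary_id: str,
) -> str:
    """Build a citation string in the layout shown on the slide."""
    author_part = ", ".join(authors)
    return (
        f"{author_part}. {title}. NIH Data Catalog. "
        f"{year} {month};{volume}({issue}):{duid}. PubMed PMID: {pmid}.\n"
        f"SI: {secondary_id}"
    )


citation = format_catalog_citation(
    authors=["Marazita ML", "Weynat RJ", "Feingold E", "Weeks D", "Crout R", "McNeill D"],
    title="Dental Caries: Whole Genome Association and Gene x Environment Studies",
    year=2014,
    month="Jan",
    volume=1,
    issue=1,
    duid="DUID00001",
    pmid="22123456",
    secondary_id="dbGaP/pht002543.v2.p1",
)
print(citation)
```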
![Page 13: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/13.jpg)
Searching for NIH-funded ‘Orphaned’ data sets in PubMed and PubMed Central
13
![Page 14: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/14.jpg)
NIH-funded articles for 2011: 113,089
Remaining articles with orphaned data sets, by exclusion step:
Non-PMC Articles: 88,592
Non-research Articles: 78,901
Molecular Sequence Data MH: 75,441
SI Field: 71,913
PMC Acknowledgements: 71,680
XML: 69,857
14
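The exclusion funnel on this slide can be checked with simple arithmetic. The sketch below assumes that each count is the number of articles remaining after the named exclusion step and that the steps were applied in descending order of count; both are readings of the figure, not statements made on the slide. The names `FUNNEL` and `exclusions_per_step` are hypothetical.

```python
from typing import List, Tuple

# (step label, articles remaining after that step); pairing is an assumption
FUNNEL: List[Tuple[str, int]] = [
    ("NIH-funded articles for 2011", 113_089),
    ("Non-PMC Articles", 88_592),
    ("Non-research Articles", 78_901),
    ("Molecular Sequence Data MH", 75_441),
    ("SI Field", 71_913),
    ("PMC Acknowledgements", 71_680),
    ("XML", 69_857),
]


def exclusions_per_step(funnel: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Return how many articles each step removed from the previous total."""
    return [
        (label, previous - remaining)
        for (_, previous), (label, remaining) in zip(funnel, funnel[1:])
    ]


for label, removed in exclusions_per_step(FUNNEL):
    print(f"{label}: {removed:,} articles excluded")
```

Under these assumptions, the largest single reduction comes from dropping articles that are not in PMC.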
![Page 15: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/15.jpg)
SI Field Exclusions
Bar chart of excluded articles (axis 0 to 1,600) by databank: ClinicalTrials.gov, PDB, GEO, GenBank, PubChem, RefSeq, ISRCTN, OMIM
15
![Page 16: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/16.jpg)
16
PMC Acknowledgement Exclusions
Bar chart of excluded keywords (axis 0 to 800): PDB, ClinicalTrials.gov, GenBank, GEO, IRD, MGI, DIP, FlyBase, dbGaP, SRA, WormBase, MPD, NURSA, RGD, ICPSR, VectorBase
![Page 17: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/17.jpg)
17
XML Keyword Exclusions
Bar chart of excluded keywords (axis 0 to 600): GenBank, PDB, GEO, dbSNP, ClinicalTrials.gov, RGD, FlyBase, SRA, DIP, dbGaP, WormBase, MGI, BioGRID, VectorBase, Multiple Keywords
Multiple-keyword examples: FlyBase:GeneNetwork:Mouse Genome Informatics:Neuroscience Information Framework:Rat Genome Database:WormBase:Zebrafish Model Organism Database; GenBank:PDB
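The keyword exclusions amount to flagging any article whose text mentions a known repository. The sketch below shows one way this could be done; it is not the search actually run for the project, whose queries are not given in the slides. The function names are hypothetical, and the keyword list is limited to repository names that appear on the exclusion slides.

```python
import re
from typing import Set

# Repository names taken from the exclusion charts
REPOSITORY_KEYWORDS = [
    "GenBank", "PDB", "GEO", "dbSNP", "ClinicalTrials.gov", "RGD",
    "FlyBase", "SRA", "DIP", "dbGaP", "WormBase", "MGI", "BioGRID",
    "VectorBase", "IRD", "MPD", "NURSA", "ICPSR",
]

_CANONICAL = {name.lower(): name for name in REPOSITORY_KEYWORDS}

# Match a keyword only when it is not embedded in a longer alphanumeric token
_PATTERN = re.compile(
    r"(?<![A-Za-z0-9])("
    + "|".join(re.escape(name) for name in REPOSITORY_KEYWORDS)
    + r")(?![A-Za-z0-9])",
    re.IGNORECASE,
)


def find_repository_mentions(text: str) -> Set[str]:
    """Return the canonical names of repositories mentioned in the text."""
    return {_CANONICAL[m.group(1).lower()] for m in _PATTERN.finditer(text)}


def is_orphan_candidate(text: str) -> bool:
    """True when the text mentions none of the listed repositories."""
    return not find_repository_mentions(text)


print(find_repository_mentions("Data were deposited in GenBank and the pdb."))
print(is_orphan_candidate("Blood pressure was measured in mice."))
```

Case-insensitive matching catches variants such as "Flybase", but short acronyms like DIP or GEO can also match ordinary words, so hits from a filter like this would still need manual review.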
![Page 18: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/18.jpg)
Total # of articles collected for 2011 after exclusion: 69,657
Random sample with 95% confidence interval: 383
18
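A sample of 383 is consistent with the standard sample-size formula for a proportion with a finite population correction. The slide states only the 95% confidence level; the 5% margin of error and the 0.5 expected proportion used below are assumptions, chosen because they are the conventional defaults and reproduce the slide's figure. The function name is hypothetical.

```python
import math


def sample_size(population: int, z: float = 1.96,
                margin: float = 0.05, p: float = 0.5) -> int:
    """Cochran's sample size with finite population correction.

    z=1.96 corresponds to 95% confidence; margin and p are assumed values.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2      # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)           # finite population correction
    return math.ceil(n)


print(sample_size(69_657))
```

With a population this large the correction changes the result very little, so the sample size is driven almost entirely by the confidence level and margin of error.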
![Page 19: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/19.jpg)
383
What category of data set was used for the research described in the article?
Were live human or animal subjects used in the collection of the data?
What were the subject(s) of study (from which or whom the data was collected)?
If new data set(s) were created, what type(s) of data were collected?
What existing data set(s), if any, were used?
How many data sets are there in each article?
19
![Page 20: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/20.jpg)
20
Measuring blood pressure in mice
Measuring left hemisphere of brain for growth factor
Staining and imaging
Analysis of images using software
![Page 21: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/21.jpg)
Preliminary Results
‘Orphaned’ Data
50 articles
21
![Page 22: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/22.jpg)
Average number of data sets per article:
5.84
22
![Page 23: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/23.jpg)
% of data sets that use live subjects: 51%
Human: 60%
Animal: 40%
23
![Page 24: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/24.jpg)
% of data sets that were considered to be new: 74%
% of data sets that used existing data with mods or added value: 12%
% of data sets that used existing data as is: 13%
% with no data: 1%
24
![Page 25: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/25.jpg)
25
% of articles that collected only new data: 56%
% of articles that used only existing data: 32%
% of articles that used a combination of data: 8%
% of articles that used no data: 4%
![Page 26: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/26.jpg)
Data Types
INSUFFICIENT
26
![Page 27: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/27.jpg)
Building an NIH Data Catalog
Questions to Consider
27
![Page 28: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/28.jpg)
What do we consider to be a data set?
All of the data created within a paper?
Multiple data sets of different data types within a paper?
Every individual collection of data within a paper?
28
![Page 29: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/29.jpg)
Where in the collection/processing pipeline should data be described?
29
![Page 30: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/30.jpg)
Is there a convenient way to point to data sets within an article?
Abstract? Labeled area? Reference list?
30
![Page 31: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/31.jpg)
How do we adequately describe data sets so that they are discoverable?
Develop a strategy to create appropriate data descriptors
31
![Page 32: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/32.jpg)
How do we adequately describe data sets so that they are discoverable?
Is there a convenient way to point to data sets within an article?
Where in the data collection/processing pipeline should data be described?
What do we consider to be a data set?
32
![Page 33: Building an NIH Data Catalog: Bit by Bit](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6c27a4a79593d7c8b45fe/html5/thumbnails/33.jpg)
Acknowledgements
Project Sponsors: Jerry Sheehan & Mike Huerta
Special Thanks: Lou Knecht & Jim Mork
Annotators: Preeti Kochar, Helen Ochej, Susan Schmidt, Melissa Yorks, Shari Mohary, Olga Printseva, Janice Ward, Oleg Rodionov, Sally Davidson, Jennie Larkin, Peter Lyster, Matt McAuliffe, Greg Farber, Betsy Humphreys, Jerry Sheehan, Mike Huerta, Lou Knecht, Suzy Roy, Swapna Abhyankar, Olivier Bodenreider, Karen Gutzman, Dina Demner Fusman, Laritza Rodriguez, Sonya Shooshan, Samantha Tate, Matthew Simpson, Tracy Edinger, Olubumi Akiwumi, Mary Ann Hantakas, Corinn Sinnott
Support: Kathel Dunn & David Gillikin
Library Operations: Joyce Backus & Dianne Babski
NLM Leadership: Donald Lindberg & Betsy Humphreys
All images are CC
33