introduction to dats v2.2 - nih may 2017
TRANSCRIPT
Supported by the NIH grant 1U24 AI117966-01 to UCSDPI , Co-Investigators at:
The annotated with schema.org
Susanna-Assunta Sansone,
Alejandra Gonzalez-Beltran, Philippe Rocca-SerraOxford e-Research Centre, University of Oxford, UK
bioCADDIE - DATS Workshop, Bethesda, 7 May 2017
What is ?
Like the JATS (Journal Article Tag Suite) is used by PubMed to index literature, a DATS (DatA Tag Suite) is needed for a scalable way to index data sources in
the DataMed prototype
Where do I find the documentation?
Like the JATS (Journal Article Tag Suite) is used by PubMed to index literature, a DATS (DatA Tag Suite) is needed for a scalable way to index data sources in
the DataMed prototype
Mar
15
Jun
15
Dec
15
Jun
16
Aug
15
May
16
Sep
16
Mar
17
bioCADDIE team’s work on DATS
Our community engagement: input, feedback and links
May
17
Mar
15
Jun
15
Dec
15
Jun
16
Aug
15
May
16
Sep
16
Mar
17
bioCADDIE team’s work on DATS
Our community engagement: input, feedback and links
Phase 1 Phase 2 Phase 3
Design and development Continued evaluation & consolidation
May
17
Evaluation & iterative refinement
Mar
15
Jun
15
Dec
15
Jun
16
Aug
15
May
16
Sep
16
Mar
17
bioCADDIE team’s work on DATS
Our community engagement: input, feedback and links
Phase 1 Phase 2 Phase 3
Design and development• Use cases collection= George, ExcTeam• Method development (SOP); competency
questions; metadata mapping and definition= Susanna, Alejandra, Philippe
SOP and metadata strawman
<DATS> name
May
17
Metadata specification V1.0
with JSON schema
Use cases workshop
WG3 formed; telecons start;
dissemination viaWG7 formed; telecons start
Evaluation & iterative refinement Continued evaluation & consolidation
Mar
15
Jun
15
Dec
15
Jun
16
Aug
15
May
16
Sep
16
Mar
17
bioCADDIE team’s work on DATS
Our community engagement: input, feedback and links
Phase 1 Phase 2 Phase 3
Design and development• Use cases collection= George, ExcTeam• Method development (SOP); competency
questions; metadata mapping and definition= Susanna, Alejandra, Philippe
SOP and metadata strawman
<DATS> name
DATS v1.1
May
17
DATS v2.0 (with access
metadata, WG7)
Metadata specification V1.0
with JSON schema
Use cases workshop
1st DATS workshop
WG3 formed; telecons start;
dissemination viaWG7 formed; telecons start
Evaluation & iterative refinement• DATS specification, serializations,
refinement= (Susanna), Alejandra, Philippe and CoreDevTeam, also based on community feedback
Continued evaluation & consolidation
Mar
15
Jun
15
Dec
15
Jun
16
Aug
15
May
16
Sep
16
Mar
17
bioCADDIE team’s work on DATS
Our community engagement: input, feedback and links
Phase 1 Phase 2 Phase 3
Design and development• Use cases collection= George, ExcTeam• Method development (SOP); competency
questions; metadata mapping and definition= Susanna, Alejandra, Philippe
SOP and metadata strawman
<DATS> name
DATS v1.1
May
17
DATS v2.0 (with access
metadata, WG7)
DATS v2.1 (schema.org
JSON-LD)DATS v2.2
Metadata specification V1.0
with JSON schema
Use cases workshop
1st DATS workshop
WG3 formed; telecons start;
dissemination via2nd DATS workshop
WG7 formed; telecons start
WG12 formed; telecons start
Evaluation & iterative refinement• DATS specification, serializations,
refinement= (Susanna), Alejandra, Philippe and CoreDevTeam, also based on community feedback
Continued evaluation & consolidation• Alignment with other community efforts;
documentation and curation guidelines = Susanna, Alejandra, Philippe, Jared, ExcTeam and CoreDevTeam
Mar
15
Jun
15
Dec
15
Jun
16
Aug
15
May
16
Sep
16
Mar
17
Use cases workshop
bioCADDIE team’s work on DATS
Our community engagement: input, feedback and links
Phase 1 Phase 2 Phase 3
Design and development• Use cases collection= George, ExcTeam• Method development (SOP); competency
questions; metadata mapping and definition= Susanna, Alejandra, Philippe
Evaluation & iterative refinement• DATS specification, serializations,
refinement= (Susanna), Alejandra, Philippe and CoreDevTeam, also based on community feedback
Continued evaluation & consolidation• Alignment with other community efforts;
documentation and curation guidelines = Susanna, Alejandra, Philippe, Jared, ExcTeam and CoreDevTeam
1st DATS workshop
SOP and metadata strawman
WG3 formed; telecons start;
dissemination via
<DATS> name
DATS v1.1
May
17
2nd DATS workshop
DATS v2.0 (with access
metadata, WG7)
WG7 formed; telecons start
DATS v2.1 (schema.org
JSON-LD)DATS v2.2
primarily metadata modelersprimarily implementers
Metadata specification V1.0
with JSON schema
WG12 formed; telecons start
❖ Enabling discoverability: find and access datasets
❖ Focusing on surfacing key metadata descriptors, such as ✧ information and relations between authors, datasets, publication,
funding sources, nature of biological signal and perturbation etc.
✧ Not the perfect model to represent the experimental details✧ the level of details and metadata needed to ensure interoperability
and reusability are left to the indexed databases
❖ Better than just having keywords✧ we have aimed to have maximum coverage of use cases with
minimal number of data elements and relations
What is supposed to do and be?
Metadata elements identified by combining the two complementary approaches
USE CASES: top-down approach SCHEMAS: bottom-up approach
The development process in a nutshell
(v1.0, v1.1, v2.0, v2.1, v2.2)
Extracting requirements from use cases❖ Selected competency questions
✧ representative set collected from: use cases workshop, white paper, submitted by the community and from NIH and Phil Bourne’s ADDS office
✧ key metadata elements processed: abstracted, color-coded and terms binned binned as Material, Process, Information, Properties; relation identified
top-down approach
bottom-up approach
Standing on the shoulders of giants
❖ schema.org❖ DataCite❖ RIF-CS❖ W3C HCLS dataset descriptions (mapping of many models including DCAT, PROV, VOID, Dublin
Core)❖ Project Open Metadata (used by HealthData.gov is being added in this new iteration)❖ ……(full list in the DATS specification)
❖ ISA❖ BioProject❖ BioSample❖ ……(full list in the DATS specification)
❖ MiNIML❖ PRIDE-ml❖ MAGE-tab❖ GA4GH metadata schema❖ SRA xml❖ CDISC SDM / element of BRIDGE model❖ ……(full list in the DATS specification)
Convergence of elements extracted from competencyquestionsand existing (generic and biomedical) data models(incl. DataCite, DCAT, schema.org, HCLS dataset, RIF-CS, ISA-Tab, SRA-xml etc.)
model for scalable indexing
Adoption
of elements extracted from
and from
core entities
extended entities
❖ The descriptors for each metadata element (Entity), include
✧ Property (describing the Entity), Definition (of each Entity and Property), Value(s) (allowed for each Property)
Key features of
❖ The descriptors for each metadata element (Entity), include
✧ Property (describing the Entity), Definition (of each Entity and Property), Value(s) (allowed for each Property)
❖ We have defined a set of core and extended entities✧ Core elements are generic and applicable to any type of datasets, like the
JATS can describe any type of publication.
✧ Extended elements includes an additional elements, some of which are specific for life, environmental and biomedical science domains
✧ this set can be further extended as needed
Key features of
❖ The descriptors for each metadata element (Entity), include
✧ Property (describing the Entity), Definition (of each Entity and Property), Value(s) (allowed for each Property)
❖ We have defined a set of core and extended entities✧ Core elements are generic and applicable to any type of datasets, like the
JATS can describe any type of publication.
✧ Extended elements includes an additional elements, some of which are specific for life, environmental and biomedical science domains
✧ this set can be further extended as needed
❖ Entities are not mandatory, in both core and extended set
✧ An entity is used only when applicable to the dataset to be described
✧ In that case, only few of its properties are defined as mandatory
Key features of
❖ Dataset, a core entity catering for any unit of information✧ archived experimental datasets, which do not change after deposition to the
repository => examples available for dbGAP, GEO, ClinicalTrials.org
✧ datasets in reference knowledge bases, describing dynamic concepts, such as “genes”, whose definition morphs over time => examples available for UniProt
❖ Dataset entity is also linked to other digital research objects ✧ Software and Data Standard, which are also part of the NIH Commons, but
the focus on other discovery indexes and therefore are not described in detail in this model
General design of the
❖ What is the dataset about?✧ Material
❖ How was the dataset produced ? Which information does it hold?✧ Dataset / Data Type with its Information, Method, Platform,
Instrument❖ Where can a dataset be found?
✧ Dataset, Distribution, Access objects (links to License)❖ When was the datasets produced, released etc.?
✧ Dates to specify the nature of an event {create, modify, start, end...} and its timestamp
❖ Who did the work, funded the research, hosts the resources etc.?✧ Person, Organization and their roles, Grant
Core elements provide the basic info
also follows the W3C Data on the Web Best Practices
DATS follows these, which also recommend DatasetDistribution
https://www.w3.org/TR/dwbp
Serializations and use of schema.org
❖ DATS model in JSON schema, serialized as:✧ JSON* format, and ✧ JSON-LD** with vocabulary from schema.org
✧ serializations in other formats can also be done, as / if needed
❖ Benefits for DataMed and databases index by DataMed✧ Increased visibility (by both popular search engines), accessibility
(via common query interfaces) and possibly improved ranking
❖ Extending schema.org✧ Submitted to their tracker missing DATS core elements✧ Coordinating via the bioschemas.org initiative (ELIXIR is also part of)
the extension of schema.org for life science
* JavaScript Object Notation** JavaScript Object Notation for Linked Data