introduction to dats v2.2 - nih may 2017

27
Supported by the NIH grant 1U24 AI117966-01 to UCSD PI , Co-Investigators at: The annotated with schema.org Susanna-Assunta Sansone, Alejandra Gonzalez-Beltran, Philippe Rocca-Serra Oxford e-Research Centre, University of Oxford, UK bioCADDIE - DATS Workshop, Bethesda, 7 May 2017

Upload: susanna-assunta-sansone

Post on 21-Jan-2018

362 views

Category:

Data & Analytics


6 download

TRANSCRIPT

Supported by the NIH grant 1U24 AI117966-01 to UCSDPI , Co-Investigators at:

The annotated with schema.org

Susanna-Assunta Sansone,

Alejandra Gonzalez-Beltran, Philippe Rocca-SerraOxford e-Research Centre, University of Oxford, UK

bioCADDIE - DATS Workshop, Bethesda, 7 May 2017

The model

What is ?

Like the JATS (Journal Article Tag Suite) is used by PubMed to index literature, a DATS (DatA Tag Suite) is needed for a scalable way to index data sources in

the DataMed prototype

Where do I find the documentation?

Like the JATS (Journal Article Tag Suite) is used by PubMed to index literature, a DATS (DatA Tag Suite) is needed for a scalable way to index data sources in

the DataMed prototype

Mar

15

Jun

15

Dec

15

Jun

16

Aug

15

May

16

Sep

16

Mar

17

bioCADDIE team’s work on DATS

Our community engagement: input, feedback and links

May

17

Mar

15

Jun

15

Dec

15

Jun

16

Aug

15

May

16

Sep

16

Mar

17

bioCADDIE team’s work on DATS

Our community engagement: input, feedback and links

Phase 1 Phase 2 Phase 3

Design and development Continued evaluation & consolidation

May

17

Evaluation & iterative refinement

Mar

15

Jun

15

Dec

15

Jun

16

Aug

15

May

16

Sep

16

Mar

17

bioCADDIE team’s work on DATS

Our community engagement: input, feedback and links

Phase 1 Phase 2 Phase 3

Design and development• Use cases collection= George, ExcTeam• Method development (SOP); competency

questions; metadata mapping and definition= Susanna, Alejandra, Philippe

SOP and metadata strawman

<DATS> name

May

17

Metadata specification V1.0

with JSON schema

Use cases workshop

WG3 formed; telecons start;

dissemination viaWG7 formed; telecons start

Evaluation & iterative refinement Continued evaluation & consolidation

Mar

15

Jun

15

Dec

15

Jun

16

Aug

15

May

16

Sep

16

Mar

17

bioCADDIE team’s work on DATS

Our community engagement: input, feedback and links

Phase 1 Phase 2 Phase 3

Design and development• Use cases collection= George, ExcTeam• Method development (SOP); competency

questions; metadata mapping and definition= Susanna, Alejandra, Philippe

SOP and metadata strawman

<DATS> name

DATS v1.1

May

17

DATS v2.0 (with access

metadata, WG7)

Metadata specification V1.0

with JSON schema

Use cases workshop

1st DATS workshop

WG3 formed; telecons start;

dissemination viaWG7 formed; telecons start

Evaluation & iterative refinement• DATS specification, serializations,

refinement= (Susanna), Alejandra, Philippe and CoreDevTeam, also based on community feedback

Continued evaluation & consolidation

Mar

15

Jun

15

Dec

15

Jun

16

Aug

15

May

16

Sep

16

Mar

17

bioCADDIE team’s work on DATS

Our community engagement: input, feedback and links

Phase 1 Phase 2 Phase 3

Design and development• Use cases collection= George, ExcTeam• Method development (SOP); competency

questions; metadata mapping and definition= Susanna, Alejandra, Philippe

SOP and metadata strawman

<DATS> name

DATS v1.1

May

17

DATS v2.0 (with access

metadata, WG7)

DATS v2.1 (schema.org

JSON-LD)DATS v2.2

Metadata specification V1.0

with JSON schema

Use cases workshop

1st DATS workshop

WG3 formed; telecons start;

dissemination via2nd DATS workshop

WG7 formed; telecons start

WG12 formed; telecons start

Evaluation & iterative refinement• DATS specification, serializations,

refinement= (Susanna), Alejandra, Philippe and CoreDevTeam, also based on community feedback

Continued evaluation & consolidation• Alignment with other community efforts;

documentation and curation guidelines = Susanna, Alejandra, Philippe, Jared, ExcTeam and CoreDevTeam

Mar

15

Jun

15

Dec

15

Jun

16

Aug

15

May

16

Sep

16

Mar

17

Use cases workshop

bioCADDIE team’s work on DATS

Our community engagement: input, feedback and links

Phase 1 Phase 2 Phase 3

Design and development• Use cases collection= George, ExcTeam• Method development (SOP); competency

questions; metadata mapping and definition= Susanna, Alejandra, Philippe

Evaluation & iterative refinement• DATS specification, serializations,

refinement= (Susanna), Alejandra, Philippe and CoreDevTeam, also based on community feedback

Continued evaluation & consolidation• Alignment with other community efforts;

documentation and curation guidelines = Susanna, Alejandra, Philippe, Jared, ExcTeam and CoreDevTeam

1st DATS workshop

SOP and metadata strawman

WG3 formed; telecons start;

dissemination via

<DATS> name

DATS v1.1

May

17

2nd DATS workshop

DATS v2.0 (with access

metadata, WG7)

WG7 formed; telecons start

DATS v2.1 (schema.org

JSON-LD)DATS v2.2

primarily metadata modelersprimarily implementers

Metadata specification V1.0

with JSON schema

WG12 formed; telecons start

❖ Enabling discoverability: find and access datasets

❖ Focusing on surfacing key metadata descriptors, such as ✧ information and relations between authors, datasets, publication,

funding sources, nature of biological signal and perturbation etc.

✧ Not the perfect model to represent the experimental details✧ the level of details and metadata needed to ensure interoperability

and reusability are left to the indexed databases

❖ Better than just having keywords✧ we have aimed to have maximum coverage of use cases with

minimal number of data elements and relations

What is supposed to do and be?

Metadata elements identified by combining the two complementary approaches

USE CASES: top-down approach SCHEMAS: bottom-up approach

The development process in a nutshell

(v1.0, v1.1, v2.0, v2.1, v2.2)

Extracting requirements from use cases❖ Selected competency questions

✧ representative set collected from: use cases workshop, white paper, submitted by the community and from NIH and Phil Bourne’s ADDS office

✧ key metadata elements processed: abstracted, color-coded and terms binned binned as Material, Process, Information, Properties; relation identified

top-down approach

bottom-up approach

Standing on the shoulders of giants

❖ schema.org❖ DataCite❖ RIF-CS❖ W3C HCLS dataset descriptions (mapping of many models including DCAT, PROV, VOID, Dublin

Core)❖ Project Open Metadata (used by HealthData.gov is being added in this new iteration)❖ ……(full list in the DATS specification)

❖ ISA❖ BioProject❖ BioSample❖ ……(full list in the DATS specification)

❖ MiNIML❖ PRIDE-ml❖ MAGE-tab❖ GA4GH metadata schema❖ SRA xml❖ CDISC SDM / element of BRIDGE model❖ ……(full list in the DATS specification)

Convergence of elements extracted from competencyquestionsand existing (generic and biomedical) data models(incl. DataCite, DCAT, schema.org, HCLS dataset, RIF-CS, ISA-Tab, SRA-xml etc.)

model for scalable indexing

Adoption

of elements extracted from

and from

core entities

extended entities

❖ The descriptors for each metadata element (Entity), include

✧ Property (describing the Entity), Definition (of each Entity and Property), Value(s) (allowed for each Property)

Key features of

❖ The descriptors for each metadata element (Entity), include

✧ Property (describing the Entity), Definition (of each Entity and Property), Value(s) (allowed for each Property)

❖ We have defined a set of core and extended entities✧ Core elements are generic and applicable to any type of datasets, like the

JATS can describe any type of publication.

✧ Extended elements includes an additional elements, some of which are specific for life, environmental and biomedical science domains

✧ this set can be further extended as needed

Key features of

❖ The descriptors for each metadata element (Entity), include

✧ Property (describing the Entity), Definition (of each Entity and Property), Value(s) (allowed for each Property)

❖ We have defined a set of core and extended entities✧ Core elements are generic and applicable to any type of datasets, like the

JATS can describe any type of publication.

✧ Extended elements includes an additional elements, some of which are specific for life, environmental and biomedical science domains

✧ this set can be further extended as needed

❖ Entities are not mandatory, in both core and extended set

✧ An entity is used only when applicable to the dataset to be described

✧ In that case, only few of its properties are defined as mandatory

Key features of

❖ Dataset, a core entity catering for any unit of information✧ archived experimental datasets, which do not change after deposition to the

repository => examples available for dbGAP, GEO, ClinicalTrials.org

✧ datasets in reference knowledge bases, describing dynamic concepts, such as “genes”, whose definition morphs over time => examples available for UniProt

❖ Dataset entity is also linked to other digital research objects ✧ Software and Data Standard, which are also part of the NIH Commons, but

the focus on other discovery indexes and therefore are not described in detail in this model

General design of the

core and extended elements

Of the 20 core elements none is mandatory

Only few properties of the 20 core elements are mandatory

❖ What is the dataset about?✧ Material

❖ How was the dataset produced ? Which information does it hold?✧ Dataset / Data Type with its Information, Method, Platform,

Instrument❖ Where can a dataset be found?

✧ Dataset, Distribution, Access objects (links to License)❖ When was the datasets produced, released etc.?

✧ Dates to specify the nature of an event {create, modify, start, end...} and its timestamp

❖ Who did the work, funded the research, hosts the resources etc.?✧ Person, Organization and their roles, Grant

Core elements provide the basic info

Interlinking to other indexes

also follows the W3C Data on the Web Best Practices

DATS follows these, which also recommend DatasetDistribution

https://www.w3.org/TR/dwbp

Serializations and use of schema.org

❖ DATS model in JSON schema, serialized as:✧ JSON* format, and ✧ JSON-LD** with vocabulary from schema.org

✧ serializations in other formats can also be done, as / if needed

❖ Benefits for DataMed and databases index by DataMed✧ Increased visibility (by both popular search engines), accessibility

(via common query interfaces) and possibly improved ranking

❖ Extending schema.org✧ Submitted to their tracker missing DATS core elements✧ Coordinating via the bioschemas.org initiative (ELIXIR is also part of)

the extension of schema.org for life science

* JavaScript Object Notation** JavaScript Object Notation for Linked Data