Standardization of the HIPC Data Templates: The Story So Far

Ahmad C. Bukhari, Ph.D., Kei-Hoi Cheung, Ph.D. and Steven H. Kleinstein, Ph.D.

Yale University, School of Medicine

User Group (HIPC)

Upload: ahmad-c-bukhari

Post on 28-Jan-2018



ImmPort

● An important resource for raw data and protocols from clinical trials, mechanistic studies and novel methods for cellular and molecular measurements
● Provides templates and standard operating procedures to facilitate data representation and transfer
● Provides a variety of tools for data access and manipulation

ImmPort SQL dump for local hosting

Human Immunology Project Consortium (HIPC)

● Well-characterized human cohorts are studied using a variety of modern analytic tools, including multiplex transcriptional, cytokine, and proteomic assays.
● HIPC-submitted data is an important subset of the ImmPort database.
● Submitted HIPC data is not standardized.
● Inconsistent naming and data reporting

Our aim is to make HIPC data FAIR

● Findability
   ○ Finding a large variety of related datasets is an important step toward knowledge discovery.
● Accessibility
   ○ A growing number of datasets are being submitted to public repositories such as ImmPort. These datasets can be accessed through different methods, including web-based search, bulk download and API access.
● Interoperability
   ○ Data mining and analysis often require multiple datasets to be integrated within a single repository or across multiple repositories.
● Reusability
   ○ Entering enough metadata as part of the data submission process facilitates data reuse.

❖ FAIR is a set of Digital Object Compliance principles that describe the properties of digital objects, defined under the NIH Commons initiative.

Current practices towards data FAIRness

● Minimum information standards (checklists) specify the minimum amount of information (metadata) needed to report results in a reproducible and reusable fashion. For example:
   ○ MIAME: Minimum Information About a Microarray Experiment
   ○ MIAPE: Minimum Information About a Proteomics Experiment
● Scientific communities have developed templates incorporating detailed checklists of the metadata needed to describe particular types of experimental data.
● Standard identifiers, terminologies and ontologies have been created for different domains.

We propose an ontological mapping for the ImmPort data submission templates.

● Ontology term mapping achieves semantic normalization across different repositories.
● Ontologically annotated datasets allow context-aware queries and data integration.
● Mapping to controlled vocabularies, relationships and rules facilitates run-time data validation.
● Together, these help achieve data FAIRness.
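To make the run-time validation point concrete, here is a minimal sketch of checking submitted template rows against a controlled vocabulary. The column names and allowed values are illustrative assumptions, not the actual ImmPort value sets, and `validate_rows` is a hypothetical helper.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabulary: column name -> allowed values.
# These entries are illustrative only, not the real ImmPort value sets.
CONTROLLED_VOCAB = {
    "biosample type": {"PBMC", "serum", "whole blood"},
    "measurement technique": {"ELISA", "flow cytometry", "HAI"},
}


@dataclass
class ValidationError:
    row: int      # zero-based index of the offending row
    column: str   # column whose value is not in the vocabulary
    value: str    # the rejected value


def validate_rows(rows, vocab=CONTROLLED_VOCAB):
    """Return one ValidationError per value that is not in the vocabulary."""
    errors = []
    for i, row in enumerate(rows):
        for column, allowed in vocab.items():
            if column not in row:
                continue  # columns absent from a row are not checked here
            if row[column] not in allowed:
                errors.append(ValidationError(i, column, row[column]))
    return errors
```

Matching is exact and case-sensitive in this sketch, so a value such as `pbmc` would be reported. A real validator would probably normalize values or resolve them to ontology term identifiers first.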

Ontology mapping of templates

[Figure: workflow for mapping template elements to ontologies, with steps numbered 1 to 6. Labels: A collection of ontologies (OBI, OBO, Cell, PR); Ontology Recommender; Concept mapper; Terms Suggestion; Suggested Alteration; Expert Verification; Finalizing Mapping; Retrieve annotation (concept URI, definitions, etc.); Incorporate into CEDAR and ImmPort]

The concept mapper (CM) uses NCBO web services to suggest suitable mappings.
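As a rough illustration of how a tool might ask NCBO web services for candidate terms, the sketch below builds a request for the BioPortal term-search endpoint and reads suggestions from the response. The slides do not show the concept mapper's actual calls. The endpoint URL, the `q`, `ontologies` and `apikey` parameters, and the `collection`, `prefLabel` and `@id` response keys reflect my understanding of the public BioPortal REST API and should be treated as assumptions. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the NCBO BioPortal term-search endpoint.
BIOPORTAL_SEARCH = "https://data.bioontology.org/search"


def build_search_url(term, ontologies, api_key):
    """Build a search URL restricted to the given ontology acronyms."""
    params = {"q": term, "ontologies": ",".join(ontologies), "apikey": api_key}
    return BIOPORTAL_SEARCH + "?" + urlencode(params)


def extract_suggestions(response):
    """Return (label, concept URI) pairs from a parsed search response.

    Assumes matches are listed under "collection", each carrying
    "prefLabel" and "@id" keys.
    """
    return [
        (hit.get("prefLabel"), hit.get("@id"))
        for hit in response.get("collection", [])
    ]


def suggest_terms(term, ontologies, api_key):
    """Query the service and return suggestions (needs network and an API key)."""
    with urlopen(build_search_url(term, ontologies, api_key)) as resp:
        return extract_suggestions(json.load(resp))
```

Keeping URL construction and response parsing separate from the network call lets both be checked offline. An expert would still review each suggestion before a mapping is finalized, as in the workflow above.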

Our mapping strategy

• For certain value sets, such as cell populations and cytokines, the concept mapper (CM) maps the values to domain-specific ontologies such as the Cell Ontology (CL) and the Protein Ontology (PR).
• For other elements, CM maps them to terms in the Ontology for Biomedical Investigations (OBI).
• Elements that have no match in OBI are mapped to terms in the ontologies ranked highest by the OBO Foundry.
• For elements that have no ontology term match at all, we search BioPortal and other available repositories manually for the missing terms.
• We work closely with individual ontology groups (e.g., CL, OBI) to fill the gaps.
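The tiered strategy above can be sketched as a simple fallback chain. This is only an illustration under stated assumptions: the lookup tables stand in for real ontology searches, every label and term ID is a hypothetical placeholder, and `OTHER_A` is a made-up stand-in for a highly ranked OBO Foundry ontology.

```python
# Illustrative lookups: ontology acronym -> {lower-cased label: term ID}.
# All labels and IDs are hypothetical placeholders, not real ontology terms.
LOOKUPS = {
    "CL": {"t cell": "CL:EXAMPLE_1"},
    "PR": {"interleukin-6": "PR:EXAMPLE_1"},
    "OBI": {"assay": "OBI:EXAMPLE_1"},
    "OTHER_A": {"biosample type": "OTHER_A:EXAMPLE_1"},
}

# Value-set kinds that go straight to a domain-specific ontology.
DOMAIN_ONTOLOGY = {"cell population": "CL", "cytokine": "PR"}

# Hypothetical stand-in for the top-ranked OBO Foundry ontologies, in rank order.
OBO_RANKED = ["OTHER_A"]


def map_element(label, value_set_kind=None, lookups=LOOKUPS):
    """Return (ontology, term ID), or ("MANUAL", None) when nothing matches."""
    key = label.strip().lower()
    if value_set_kind in DOMAIN_ONTOLOGY:
        # Tier 1: cell populations and cytokines use CL and PR.
        candidates = [DOMAIN_ONTOLOGY[value_set_kind]]
    else:
        # Tier 2: OBI first, then tier 3: ranked OBO Foundry ontologies.
        candidates = ["OBI"] + OBO_RANKED
    for ontology in candidates:
        term_id = lookups.get(ontology, {}).get(key)
        if term_id is not None:
            return ontology, term_id
    # Tier 4: no match anywhere, so flag the element for manual search.
    return "MANUAL", None
```

Elements that come back as `"MANUAL"` correspond to the terms that are searched by hand and, where still missing, raised with the relevant ontology groups.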

Template elements mapped to ontologies

• Assay types (e.g., gene expression, flow cytometry, ELISA, HAI, Luminex)
• Template types (e.g., human subject, biosample)
• Column names (e.g., biosample type, measurement technique)
• Value sets (e.g., set of cell populations, set of measurement techniques)
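One way to picture these four kinds of mapped elements is as a small data structure in which a template carries its assay type and template type, and each column carries a term for its name plus terms for its allowed values. The class names, field names and term IDs below are hypothetical and are not taken from ImmPort or CEDAR.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Column:
    name: str
    term_id: Optional[str] = None  # ontology term for the column name
    # Allowed value -> ontology term (None while still unmapped).
    value_terms: Dict[str, Optional[str]] = field(default_factory=dict)


@dataclass
class Template:
    assay_type: str
    template_type: str
    columns: List[Column] = field(default_factory=list)

    def unmapped(self):
        """List column names, then values, that still lack an ontology term."""
        missing = [c.name for c in self.columns if c.term_id is None]
        for c in self.columns:
            missing += [v for v, t in c.value_terms.items() if t is None]
        return missing
```

For example, a template whose "measurement technique" column has no term yet, and whose "biosample type" column has one unmapped value, would report both through `unmapped()`, giving a worklist for expert review.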

Mapping Statistics

Assay Type                  # Templates  # Sub-Templates  # Concept  # Value Set
Microarray gene expression  6            10               113        209
Flow cytometry              6            -                67         262
ELISA                       2            -                39         602
HAI                         2            -                37         117
Luminex                     7            -                102        1032
General                     6            -                115        190
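The per-assay counts can be tallied with a few lines of code. This sketch copies the figures from the mapping statistics and treats a "-" in the sub-templates column as zero, which is an assumption. The names `ROWS` and `column_totals` are hypothetical.

```python
# (assay type, templates, sub-templates, concepts, value set)
# Counts copied from the mapping statistics; "-" is assumed to mean 0.
ROWS = [
    ("Microarray gene expression", 6, 10, 113, 209),
    ("Flow cytometry", 6, 0, 67, 262),
    ("ELISA", 2, 0, 39, 602),
    ("HAI", 2, 0, 37, 117),
    ("Luminex", 7, 0, 102, 1032),
    ("General", 6, 0, 115, 190),
]


def column_totals(rows):
    """Sum each numeric column across all assay types."""
    names = ("templates", "sub_templates", "concepts", "value_set")
    return {name: sum(row[i + 1] for row in rows) for i, name in enumerate(names)}


totals = column_totals(ROWS)
```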

Newly added OBI terms

A device that moves charged particles through a .... OBI_0001121

A cytometry assay in which the presence of molecules OBI_0002115

CEDAR helps to generate ontology-linked metadata

Use case: CEDAR immunology data submission templates

CEDAR has employed our suggested mapping

● Map to cell term in the Cell Ontology
● Manual mapping to “assay” in OBI
● Automatic mapping with NCIT
● Automatic mapping with OBI

https://cedar.metadatacenter.net

Future plan

• Refine the mapping of new assay types with an updated algorithm.
• Map clinical metadata to ontology terms.
• Incorporate our ontology-term mapping approach into CEDAR and ImmPort.
• Submit missing terms to relevant ontologies (e.g., OBI).

Acknowledgment

• ImmPort
   • Jeff Wiser, Patrick Dunn
• Yale
   • Hailong Meng, Subhasis Mohanty
• Cell Ontology
   • Alex Diehl
• NCBO BioPortal and CEDAR
   • Mark Musen, John Graybeal, Martin O’Connor
• OBI
   • Bjoern Peters