Alignment of Ontologies for Biological Research Judith A. Blake, Ph.D. Bioinformatics and Computational Biology The Jackson Laboratory

Post on 19-Dec-2015

Dagstuhl - 2007

What is my perspective?

Biological data is voluminous and complex

Data integration is hard work

Bio-ontologies provide semantic structure and standards that aid in data analysis and hypothesis generation.

There are many challenges to the effective use of bio-ontologies (in addition to challenges to the development of ontologies)

What is my approach?

Goal is to facilitate ‘translational research’ through effective integration of experimental data from mouse models of human conditions with human clinical data from disease studies

Bio-ontologies provide a mechanism to support comprehensive data integration and analysis

Interesting….

- Refine Relations Ontology (RO)
- Identify critical datasets
- Focus on bottlenecks
- Create views

Overview of Mouse Genome Informatics

Phenotype
• mutant allele definitions
• QTL
• strain characteristics
• phenotype vocabularies
• disease models (human)
• comparative phenotypes

Genes & Gene Products
• nomenclature
• gene characterization
• transcripts, proteins, gene products
• functional annotation
• orthologs & paralogs

Sequences & Maps
• sequence representation
• C57BL/6J genomic sequence
• SNPs and strain variants
• adding biological context to computational gene models

Gene Expression
• mouse anatomy
• time, tissue, level of expression
• range of assays & results
• emphasis on embryonic stages

Tumor Biology
• tumor classifications & descriptions
• strain incidence
• histopathology images
• tumor genetics

Data acquisition is constant

Load Program: Summary of Data Loaded

Mouse EntrezGene: EntrezGene IDs for mouse markers, plus marker-to-sequence associations from EntrezGene not already in MGD
Human/Rat EntrezGene: Nomenclature, map position, and other data regarding human and rat genes; OMIM associations for human
GenBank Seq: Mouse sequence records from GenBank
RefSeq Seq: Mouse sequence records from RefSeq
UniProt/TrEMBL Seq: Mouse sequence records from UniProt and TrEMBL
TIGR/DoTS/NIA Seq: Mouse consensus sequence records from TIGR/DoTS/NIA clusters
TIGR/DoTS/NIA Association: Associations between TIGR/DoTS/NIA cluster sequences and markers
Ensembl Gene Model: Ensembl gene model sequences, coordinates, and associations between these and markers
NCBI Gene Model: NCBI gene model sequences, coordinates, and associations between these and markers
UniProt Association: UniProt/TrEMBL IDs and additional GenBank IDs for mouse markers, plus GO and InterPro annotations
UniGene Association: UniGene cluster IDs for mouse markers
EST cDNA Clone: Mouse IMAGE, NIA, MGC, Riken cDNAs and EST sequence associations
MGC Association: MGC IDs and associations between MGC full-length sequences and MGC cDNAs
RPCI Clone: RPCI 23/24 BAC clones and sequence associations
GO Vocabulary: Updated Gene Ontology (GO) vocabularies from the central GO site
OMIM Vocabulary: Updated OMIM disease terms
MP Vocabulary: Updated MP vocabulary (from OBO-Edit)
Anatomy: Updated adult mouse anatomy ontology (from OBO-Edit)
Mapping panel: JAX, EUCIB, Copeland-Jenkins and many others
PIRSF: Mouse PIR superfamily terms and associations to markers
SNPs: Mouse SNPs from dbSNP and associations between SNPs and markers

Snapshot of MGI data content

MGI data statistics, March 2007

Number of genes with sequence data: 28,292
Number of genes (incl. unmapped mutants): 35,733
Number of markers (including genes): 69,639
Number of markers mapped: 65,345
Number of genes with protein sequence information: 24,293
Number of genes with GO annotations: 17,664
Number of mouse/human orthologies: 16,127
Number of mouse/rat orthologies: 15,802
Number of genes with one or more phenotypic alleles: 6,979
Number of cataloged phenotypic alleles: 17,494
Number of references: 113,508
Number of integrated mouse nucleotide sequences (+ ESTs): 8,3574,701

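Build 36: Ensembl and NCBI

[Figure: unification of Ensembl and NCBI gene models by exon overlap detection. Input gene models: 28807 and 24237. Equivalent: 22182 (1:1 20663; 1:n 365; n:1 874; n:m 280). Counts shown for the sets unique to Ensembl and unique to NCBI: 6910 and 2646.]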
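
The slide names the unification step only as exon overlap detection and does not give the procedure. As a rough illustration, the sketch below assumes each gene model is a list of exon intervals on the same chromosome and strand, links two models whenever any of their exons overlap, and labels each connected group by its cardinality. The function names, the input layout, and the reading of "1:n" as one Ensembl model to many NCBI models are all assumptions made for this example.

```python
from collections import defaultdict


def exons_overlap(exons_a, exons_b):
    """Return True if any exon in exons_a overlaps any exon in exons_b.

    Exons are (start, end) tuples with inclusive coordinates.
    """
    return any(s1 <= e2 and s2 <= e1
               for s1, e1 in exons_a
               for s2, e2 in exons_b)


def classify_gene_models(ensembl, ncbi):
    """Group gene models linked by exon overlap and label each group.

    ensembl, ncbi: dicts mapping a (hypothetical) model ID to a list of exons.
    Returns a dict: label -> list of (ensembl_ids, ncbi_ids) groups.
    """
    # Link every Ensembl/NCBI pair that shares at least one overlapping exon.
    # This all-pairs comparison is quadratic; fine for a sketch only.
    links = defaultdict(set)
    for e_id, e_exons in ensembl.items():
        for n_id, n_exons in ncbi.items():
            if exons_overlap(e_exons, n_exons):
                links[("E", e_id)].add(("N", n_id))
                links[("N", n_id)].add(("E", e_id))

    groups = defaultdict(list)
    seen = set()
    nodes = [("E", i) for i in ensembl] + [("N", i) for i in ncbi]
    for start in nodes:
        if start in seen:
            continue
        # Depth-first walk to collect one connected component.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(links.get(node, set()) - component)
        seen |= component

        e_ids = sorted(i for source, i in component if source == "E")
        n_ids = sorted(i for source, i in component if source == "N")
        if not n_ids:
            label = "unique_to_ensembl"
        elif not e_ids:
            label = "unique_to_ncbi"
        elif len(e_ids) == 1 and len(n_ids) == 1:
            label = "1:1"
        elif len(e_ids) == 1:
            label = "1:n"
        elif len(n_ids) == 1:
            label = "n:1"
        else:
            label = "n:m"
        groups[label].append((e_ids, n_ids))
    return dict(groups)
```

Counting the groups under each label would give a breakdown of the same shape as the one on the slide, although a production pipeline would likely add thresholds on the amount of overlap.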

Multiple Controlled Vocabularies in MGI

• Gene Nomenclature
• Gene/Marker Type
• Allele Type
• Developmental and Adult Anatomies
• Assay Type: Expression, Mapping
• Molecular Mutation
• Inheritance Mode
• Gene Ontology
• Mammalian Phenotype Ontology
• Tissue Types
• Cell Types
• Cell Lines
• Units: Cytogenetic, Molecular
• ES Cell Line
• Strain Nomenclature

Mammalian Phenotype Ontology

• Compositional terms
• ‘Working’ ontology
• Projected xref to ‘core’ ontologies: Anatomy, GO
• Built with attention to ontological principles, but with the primary goal of supporting annotation of diverse experimental results from many research groups and perspectives

We are exploring ontological representations that relate human clinical data with mouse phenotypes

Create compositional view for annotation of mouse models and human clinical data

Provide xref / RO back to core ontologies

Support both annotation and ontology alignment efforts

Develop tools to support complex queries

We modeled gangliosidoses as a test case. Two types of gangliosidoses are Sandhoff and Tay-Sachs diseases.

Curators use controlled terms from structured vocabularies (ontologies) to curate complex biological systems described in the literature

The knowledge is in the details

Including the relationship to human disease

More mouse models – Tay-Sachs

Different core ontologies need to be combined to describe complex biological systems

• Chemical Ontology: Dopamine (CHEBI:18243)
• Cell Type Ontology: Dopaminergic Neuron (CL:0000700)
• Biological Process: Synaptic transmission (GO:0007268)
• Anatomical Dictionary: Brain (MA:0000168)
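
One way to picture such a combination is a small record that holds one term from each core ontology. The sketch below uses the four identifiers shown on the slide; the class names, field names, and the choice of a flat record are illustrative assumptions, not the representation used by MGI or by the ontologies themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """A single ontology term: source ontology, identifier, and label."""
    ontology: str
    term_id: str
    label: str


@dataclass(frozen=True)
class CompositeDescription:
    """Hypothetical record combining one term from each core ontology."""
    chemical: Term
    cell_type: Term
    process: Term
    anatomy: Term

    def terms(self):
        return (self.chemical, self.cell_type, self.process, self.anatomy)

    def ids(self):
        """Term identifiers, in a fixed order, usable as a lookup key."""
        return tuple(t.term_id for t in self.terms())

    def describe(self):
        """Human-readable listing of the combined terms."""
        return " + ".join(f"{t.label} ({t.term_id})" for t in self.terms())


# The four terms shown on the slide.
example = CompositeDescription(
    chemical=Term("Chemical Ontology", "CHEBI:18243", "Dopamine"),
    cell_type=Term("Cell Type Ontology", "CL:0000700", "Dopaminergic Neuron"),
    process=Term("Biological Process", "GO:0007268", "Synaptic transmission"),
    anatomy=Term("Anatomical Dictionary", "MA:0000168", "Brain"),
)
```

Because the record is frozen and hashable, two annotations that combine the same four terms compare equal, which is the property a cross-ontology lookup would need.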

Dilemma: No formal links currently exist between the separate ontologies

Solution?

1. Generate cross-products (compositional terms) as necessary for annotations of characteristics of disease cases and disease models;

2. Annotate specific instances of human cases and mouse models;

3. Visualize and mine co-annotated data
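
The three steps above can be sketched in a few lines. This is a toy illustration under stated assumptions: a cross-product is modeled as a plain (quality, entity) pair, annotations as (species, subject, term) tuples, and "mining" as finding terms annotated in more than one species. The quality label, the subject names, and every annotation record are invented for the example; only the two entity identifiers come from the slides.

```python
from collections import defaultdict
from itertools import product


def cross_products(qualities, entities):
    """Step 1: pair each quality with each entity to form compositional terms."""
    return list(product(qualities, entities))


def co_annotated(annotations):
    """Step 3: return terms annotated in more than one species.

    annotations: iterable of (species, subject, term) tuples.
    Returns a dict: term -> set of species that use it.
    """
    species_by_term = defaultdict(set)
    for species, _subject, term in annotations:
        species_by_term[term].add(species)
    return {t: s for t, s in species_by_term.items() if len(s) > 1}


# Step 1: compositional terms from a hypothetical quality and two entity IDs.
terms = cross_products(["abnormal morphology"], ["CL:0000700", "MA:0000168"])

# Step 2: annotate specific instances (all records here are made up).
annotations = [
    ("human", "case_1", terms[0]),
    ("mouse", "model_1", terms[0]),
    ("mouse", "model_2", terms[1]),
]

# Step 3: terms shared between human cases and mouse models.
shared = co_annotated(annotations)
```

In this toy data only the first compositional term is shared, so it is the only candidate link between a human case and a mouse model.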

[Figure: annotation graph including the term "Abnormal neuron morphology". Legend: Human, Mouse, Both; Is a, Describes.]

Next Steps

Perspective (views)
• Lung Cancer
• Provide Disease Ontology
• Build compositional view

Mouse Data
• Curate comprehensive annotations for genes implicated in lung phenotypes

Human Data
• Curate clinical data for ontology annotation

Data Analysis
• Use ontological structures to facilitate data exploration and hypothesis generation

Next conference?

“enabling technologies for ontological access to clinical and animal model data”

A hands-on problem-solving workshop – a problem use case

Gene Ontology www.geneontology.org

MGI projects are supported by NIH [NHGRI, NICHD, and NCI].

Bar Harbor, Maine, USA

Mouse Genome Informatics www.informatics.jax.org

GO Consortium is supported by NIH-NHGRI and by the European Union RTD Programme