Ontologically Modeling Sample Variables in Gene Expression Data James Malone [email protected] EBI, Cambridge, UK

Post on 29-Jan-2016

Page 1: Ontologically Modeling Sample Variables in Gene Expression Data

Ontologically Modeling Sample Variables in Gene Expression Data

James Malone [email protected] EBI, Cambridge, UK

Page 2: Ontologically Modeling Sample Variables in Gene Expression Data

Overview

• Application background
• Motivation for ontologies – questions we want to answer
• Methodology
• Ontology and application
• Future work/things we’d like to do

Page 3: Ontologically Modeling Sample Variables in Gene Expression Data

Gene Expression: Archive to Atlas

Diagram labels: AE/GEO acquire; ArrayExpress (>250,000 assays, >10,000 experiments); curation; re-annotate & summarize; Atlas; curation

Page 4: Ontologically Modeling Sample Variables in Gene Expression Data

Gene Expression Sample Variable Annotations

Annotations                   Archive     Atlas
Species                           330         9
Samples                       238,000    34,650
Annotations on samples        860,700   101,830
Unique sample annotations      37,500     6,600
Assays (Hybridizations)       246,000    30,000
Annotations on assays         569,700    67,000
Unique assay annotations       25,000     4,000

Page 5: Ontologically Modeling Sample Variables in Gene Expression Data

Use Cases

• Query support (e.g., query for 'cancer' and also get ‘leukemia')
• Data visualisation – e.g., presenting an ontology tree to the user of what is in the database
• Data integration by ontology terms – e.g., we assume that 'kidney' in independent studies roughly means the same, so we can count how many kidney samples we have in the database
• Intelligent template generation for different experiment types in submission or data presentation
• Summary level data
• Nonsense detection – e.g., telling us that something marked as cancer cannot also be marked as healthy
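The query support and nonsense detection use cases can be sketched roughly as follows. This is a minimal illustration only: the toy is-a hierarchy, the sample ids and the helper names (`expand_query`, `search`, `nonsense`) are assumptions made for the example, not EFO content or Atlas code.

```python
# Toy is-a hierarchy: child -> parent. Terms are illustrative, not EFO content.
PARENT = {
    "leukemia": "cancer",
    "adenocarcinoma": "cancer",
    "cancer": "disease",
}

# Hypothetical sample annotations keyed by sample id.
SAMPLES = {
    "s1": {"leukemia"},
    "s2": {"healthy"},
    "s3": {"adenocarcinoma", "healthy"},
}


def ancestors(term):
    """Return all ancestors of a term by following is-a links upward."""
    result = []
    while term in PARENT:
        term = PARENT[term]
        result.append(term)
    return result


def expand_query(term):
    """Return the query term plus every term that has it as an ancestor."""
    return {term} | {t for t in PARENT if term in ancestors(t)}


def search(term, samples=SAMPLES):
    """Return ids of samples annotated with the term or any of its subtypes."""
    wanted = expand_query(term)
    return sorted(s for s, ann in samples.items() if ann & wanted)


def nonsense(samples=SAMPLES, disjoint=("cancer", "healthy")):
    """Flag samples annotated with both of two terms assumed to be disjoint."""
    a, b = (expand_query(t) for t in disjoint)
    return sorted(s for s, ann in samples.items() if ann & a and ann & b)
```

Here `search("cancer")` also returns the sample annotated only with 'leukemia', and `nonsense()` flags the sample marked as both a cancer subtype and healthy.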

Page 6: Ontologically Modeling Sample Variables in Gene Expression Data

Questions we want to answer

• Diverse nature of annotations on data
• Need to support complex queries which contain semantic information
• E.g. which genes are under-expressed in brain samples in human or mouse

• If we annotate with one term, do we get this data when querying with the other?

Diagram labels: cancer, adenocarcinoma

Page 7: Ontologically Modeling Sample Variables in Gene Expression Data

Primary Question: Where to place our semantics?

Diagram labels: cancer, adenocarcinoma, Atlas/AE

Page 8: Ontologically Modeling Sample Variables in Gene Expression Data

Decoupling knowledge from data

Atlas/AE

Page 9: Ontologically Modeling Sample Variables in Gene Expression Data

Methodology: Reference vs Application Ontology

• Debate in community about the difference; here is our thesis
• A reference ontology describes a knowledge space: an explicitly delineated part of a domain.

Diagram labels: Cell type, Human Anatomy, GO Process, Biomedicine

Page 10: Ontologically Modeling Sample Variables in Gene Expression Data

Methodology: Reference vs Application Ontology

• An application ontology describes an application or data space: an explicitly delineated part of a domain.
• Should consume reference ontologies to meet application needs

Diagram labels: Cell type, Human Anatomy, GO Process, Biomedicine

Page 11: Ontologically Modeling Sample Variables in Gene Expression Data

Building the Experimental Factor Ontology

• We consume parts of reference ontologies from the domain
• Construct new classes and relations to answer our use cases
• Aim is reuse of existing resources, shared frameworks and mapping of equivalencies where they exist

Diagram labels: EFO; Disease Ontology; Anatomy Reference Ontology; Ontology for Biomedical Investigations; Chemical Entities of Biological Interest (ChEBI); Various Species Anatomy Ontologies; Relation Ontology; Text mining

Page 12: Ontologically Modeling Sample Variables in Gene Expression Data

Identify Upper Level Structure

• We have taken a BFO-lite approach, hiding labels from users for application purposes and sometimes using a different definition

information content entity (IAO)

site (BFO)

material entity (BFO)

processual entity (BFO)

specifically dependent continuant (BFO)

Specifically dependent continuant: A continuant [snap:Continuant] that inheres in or is borne by other entities. Every instance of A requires some specific instance of B which must always be the same.

Material property: A property or characteristic of some other entity. For example, the mouse has the colour white.

Page 13: Ontologically Modeling Sample Variables in Gene Expression Data

Adding New Classes @ www.ebi.ac.uk/efo/tools

• We wish to maximise our interoperability
• Submitters and other groups use many ontologies
• Trade-off: open to their data and preferences vs imposing a more ordered view on semantics
• Our goal: where orthogonality exists we aim to import only that class. Where it does not, we perform ‘mappings’ in our EFO classes via annotation property references (in a similar way to xrefs)
• E.g. for ChEBI classes, import the ChEBI URI; for ‘cancer’, create an EFO class and add multiple mappings

Page 14: Ontologically Modeling Sample Variables in Gene Expression Data

Creating Class Mappings

• For overlapping ontologies, we aim to create a ‘mapping class’
• Use semi-automated text mining “double-metaphone” algorithm
• Perform matching of our values in database to ontology class labels and definitions.
• Also perform mappings from EFO to other ontologies, so that EFO: cancer = NCI: cancer, DO: cancer et al.
• Sanity checking over mappings before adding to ontology
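The semi-automated matching step can be illustrated with a small sketch that proposes candidate classes for a database value and leaves the decision to a curator. Note the substitution: this sketch uses `difflib.SequenceMatcher` string similarity from the Python standard library as a stand-in for the double-metaphone phonetic algorithm named on the slide. The class identifiers, labels and the 0.8 threshold are made up for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical ontology class labels keyed by made-up identifiers.
ONTOLOGY_LABELS = {
    "EX:0001": "leukemia",
    "EX:0002": "adenocarcinoma",
    "EX:0003": "kidney",
}


def normalise(text):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def propose_mappings(value, labels=ONTOLOGY_LABELS, threshold=0.8):
    """Return (class_id, label, score) candidates for a value, best first.

    The threshold is an arbitrary illustrative cut-off; candidates are only
    suggestions and still need a manual sanity check before being added.
    """
    value_n = normalise(value)
    scored = []
    for class_id, label in labels.items():
        score = SequenceMatcher(None, value_n, normalise(label)).ratio()
        if score >= threshold:
            scored.append((class_id, label, round(score, 2)))
    return sorted(scored, key=lambda c: c[2], reverse=True)
```

With these toy labels, the spelling variant "Leukaemia" is proposed as a match for the 'leukemia' class, while an unrelated string yields no candidates and would go to manual curation.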

Page 15: Ontologically Modeling Sample Variables in Gene Expression Data

Keeping Up To Date with External Classes

• Use of tool to automatically update metadata every release (monthly)
• Uses BioPortal web services to access latest definitions and synonyms

Diagram label: Class URI/ID

Page 16: Ontologically Modeling Sample Variables in Gene Expression Data

Detecting Change in External Ontologies

• Bubastis tool for detecting axiomatic changes between two ontologies (in our case two versions of the same ontology)
• @todo: detect annotation property changes
• We also detect missing annotation properties with Watchman tool (not released yet) – mainly used for labels presently
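As a rough illustration of the two checks on this slide, the sketch below compares two snapshots of an ontology and lists classes without labels. It is not how Bubastis or Watchman are implemented: the snapshot format (class id mapped to a set of axiom strings), the function names and the example identifiers are all assumptions for the example.

```python
def diff_versions(old, new):
    """Compare two {class_id: set_of_axiom_strings} snapshots of an ontology."""
    changed = {}
    for class_id in sorted(old.keys() & new.keys()):
        added = new[class_id] - old[class_id]
        removed = old[class_id] - new[class_id]
        if added or removed:
            changed[class_id] = {
                "added": sorted(added),
                "removed": sorted(removed),
            }
    return {
        "new_classes": sorted(new.keys() - old.keys()),
        "deleted_classes": sorted(old.keys() - new.keys()),
        "changed_classes": changed,
    }


def missing_labels(labels, class_ids):
    """Return class ids that have no non-empty label."""
    return sorted(c for c in class_ids if not labels.get(c, "").strip())


# Hypothetical snapshots of two versions of the same external ontology.
V1 = {
    "EX:cancer": {"SubClassOf EX:disease"},
    "EX:old": {"SubClassOf EX:thing"},
}
V2 = {
    "EX:cancer": {"SubClassOf EX:disease", "SubClassOf EX:neoplasm"},
    "EX:new": set(),
}
LABELS_V2 = {"EX:cancer": "cancer", "EX:new": " "}
```

`diff_versions(V1, V2)` reports one new class, one deleted class and one added axiom on `EX:cancer`; `missing_labels(LABELS_V2, V2)` reports `EX:new`, whose label is blank.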

Page 17: Ontologically Modeling Sample Variables in Gene Expression Data

Creating Relations and Equivalent Classes

cell line (HeLa)

organism part (cervix)

cell type (epithelial)

disease (cervical adenocarcinoma)

species (human)

Page 18: Ontologically Modeling Sample Variables in Gene Expression Data

Structure for queries

Page 19: Ontologically Modeling Sample Variables in Gene Expression Data

Gene Expression Atlas

• Linking data to the ontology

Assay Table

Sample Table

Ontology Term Table

Query

OWL Model

Database formulated query
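One way to picture the flow on this slide (query, OWL model, database formulated query) is the sketch below: a term is first expanded through the model and the expanded term list is then turned into a parameterised SQL query over the assay, sample and ontology term tables. The table and column names, the toy subclass map and the rows are all hypothetical; the real Atlas schema is not shown in the slides.

```python
import sqlite3

# Toy subclass map standing in for the OWL model (illustrative terms only).
SUBCLASSES = {"cancer": ["leukemia", "adenocarcinoma"]}


def expand(term):
    """Return the term plus its direct and indirect subclasses."""
    terms = [term]
    for child in SUBCLASSES.get(term, []):
        terms.extend(expand(child))
    return terms


def build_db():
    """Create an in-memory database with hypothetical tables and rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE sample (sample_id TEXT PRIMARY KEY);
        CREATE TABLE assay (assay_id TEXT PRIMARY KEY, sample_id TEXT);
        CREATE TABLE ontology_term (sample_id TEXT, term TEXT);
        INSERT INTO sample VALUES ('s1'), ('s2'), ('s3');
        INSERT INTO assay VALUES ('a1', 's1'), ('a2', 's2'), ('a3', 's3');
        INSERT INTO ontology_term VALUES
            ('s1', 'leukemia'), ('s2', 'kidney'), ('s3', 'cancer');
    """)
    return db


def assays_for(db, term):
    """Expand the term via the model, then formulate and run the SQL query."""
    terms = expand(term)
    placeholders = ", ".join("?" for _ in terms)
    sql = (
        "SELECT DISTINCT a.assay_id FROM assay a "
        "JOIN ontology_term t ON t.sample_id = a.sample_id "
        f"WHERE t.term IN ({placeholders}) ORDER BY a.assay_id"
    )
    return [row[0] for row in db.execute(sql, terms)]
```

Calling `assays_for(build_db(), "cancer")` returns the assays for samples annotated with 'cancer' or with either of its toy subclasses, without the database itself holding any hierarchy.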

Page 20: Ontologically Modeling Sample Variables in Gene Expression Data

Gene Expression Atlas @ www.ebi.ac.uk/gxa

Query for Cell adhesion genes in all ‘organism parts’

‘View on EFO’

Page 21: Ontologically Modeling Sample Variables in Gene Expression Data

ArrayExpress Archive @ www.ebi.ac.uk/arrayexpress

Page 22: Ontologically Modeling Sample Variables in Gene Expression Data

Future Work: Linked Data

Linking data by dereferenceable URI for human and machine

http://www.ebi.ac.uk/gxa/Experiment12345

Page 23: Ontologically Modeling Sample Variables in Gene Expression Data

Future Work: RDF Triple Store @ www.ebi.ac.uk/efo/semanticweb/atlas

• Q: Is a SPARQL query run against an RDF triple store quicker than a SPARQL query translated into SQL?

Diagram labels: OWL Ontology; Atlas Data; RDFizer; SPARQL; RDF Triple Store; SQL Translation Layer

Page 24: Ontologically Modeling Sample Variables in Gene Expression Data

Future Work: Data Integration

• Consuming reference ontologies and mapping to multiple ontologies where overlap exists offers us maximum interoperability
• The advantage of triple stores is not yet apparent
• Impetus required: “should we champion this technology?”

Diagram labels: RDF triples; QUERY; Atlas; SwissProt; Amino Acid Ontology

Page 25: Ontologically Modeling Sample Variables in Gene Expression Data

Summary

• We have created a sustainable approach to consuming multiple reference ontologies
• Tooling solutions to expedite process
• We consider EFO to be a ‘view’ of such ontologies for our application needs
• The primary aim of this work is to enable novel research with the experimental data we have
• Specifically, we can answer new questions, integrate across our data resources, visualise and summarise the data
• Our belief is that describing such data should be the driving force behind ontology development
• Future work will look at linked data and RDF triple stores

Page 26: Ontologically Modeling Sample Variables in Gene Expression Data

Acknowledgements

• Ontology creation: James Malone, Tomasz Adamusiak, Ele Holloway, Helen Parkinson, Jie Zheng (U Penn)
• Ontology Mapping tools and text mining evaluation: Tim Rayner, Holly Zheng, Margus Lukk
• GUI Development: Misha Kapushesky, Pasha Kurnosov, Anna Zhukova, Nikolay Kolesnikov
• External Review and anatomy: Jonathan Bard, Jie Zheng
• ArrayExpress Production Staff
• EBI Rebholz Group (Whatizit text mining tool)
• Many source ontologies for terms and definitions esp. Disease Ontology, Cell Type Ontology, FMA, NCIT, OBI
• Funders: EC (Gen2Phen, FELICS, MUGEN, EMERALD, ENGAGE, SLING), EMBL, NIH
• Eric Neumann, Joanne Luciano and Alan Ruttenberg
• W3C & HCLS Group - Eric Prud'hommeaux and Scott Marshall
• OBI developers
