ontologies neo4j-graph-workshop-berlin
Post on 16-Mar-2018
Building a repository of biomedical
ontologies with Neo4j
Simon Jupp
Samples, Phenotypes and Ontologies Team
European Bioinformatics Institute
Cambridge, UK.
The challenge - thousands of data
attributes
European Archive for molecular data
ENA, EVA, EGA, BioSample, ArrayExpress
How do we make sense of the data?
SPOT team builds tools to support the mapping of this data to ontologies and other standards
Why we need terminology standards (or
ontologies)
Dyschromatopsia
Search PubMed for color blindness
Search PubMed for Dyschromatopsia
Search PubMed for "abnormality of the eye"
The ontology of color blindness
HP:0011518 (Dichromacy)
HP:0011518 (Eye)
HP:0000551 (Abnormality of color vision)
HP:0007641 (Dyschromatopsia)
Is-a
Is-a
Disease-location
Ontology powered applications
Query expansion in the Gene Expression Atlas: searching for "eye disease" finds
genes expressed in Turner syndrome
https://www.ebi.ac.uk/gxa/home
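A minimal sketch of how ontology-driven query expansion can work: a search term is expanded to itself plus all of its is-a descendants before the data is queried. The edges below are a toy hierarchy loosely based on the color blindness slide; the exact parent/child links and the helper names `build_children` and `expand_query` are assumptions for illustration, not the Gene Expression Atlas implementation.

```python
from collections import defaultdict

# Toy is-a edges (child, parent). These links are assumed for illustration
# and are not taken from a Human Phenotype Ontology release.
IS_A = [
    ("Dichromacy", "Dyschromatopsia"),
    ("Dyschromatopsia", "Abnormality of color vision"),
    ("Abnormality of color vision", "Abnormality of the eye"),
]


def build_children(edges):
    """Index the edges as parent -> set of direct children."""
    children = defaultdict(set)
    for child, parent in edges:
        children[parent].add(child)
    return children


def expand_query(term, children):
    """Return the term plus every descendant reachable over is-a edges."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


children = build_children(IS_A)
expanded = expand_query("Abnormality of the eye", children)
```

Searching with `expanded` instead of the single input term is what lets a broad query also retrieve records annotated with more specific terms.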
Ontology powered applications
Visualising Gene-Disease associations in Open Targets
https://www.opentargets.org
Ontology powered applications
SNP trait
associations in the
GWAS catalog
All traits mapped to
disease, phenotype
and measurements
in EFO
https://www.ebi.ac.uk/gwas/
Cardiovascular disease traits
Genotype Phenotype
Sequence
Proteins
Gene products Transcript
Pathways
Cell type
BRENDA tissue /
enzyme source
Development
Anatomy
Phenotype
Plasmodium
life cycle
-Sequence types
and features
-Genetic Context
- Molecule role
- Molecular Function
- Biological process
- Cellular component
-Protein covalent bond
-Protein domain
-UniProt taxonomy
-Pathway ontology
-Event (INOH pathway
ontology)
-Systems Biology
-Protein-protein
interaction
-Arabidopsis development
-Cereal plant development
-Plant growth and developmental stage
-C. elegans development
-Drosophila development
-Human developmental anatomy, abstract
version
-Human developmental anatomy, timed version
-Mosquito gross anatomy
-Mouse adult gross anatomy
-Mouse gross anatomy and development
-C. elegans gross anatomy
-Arabidopsis gross anatomy
-Cereal plant gross anatomy
-Drosophila gross anatomy
-Dictyostelium discoideum anatomy
-Fungal gross anatomy FAO
-Plant structure
-Maize gross anatomy
-Medaka fish anatomy and development
-Zebrafish anatomy and development
-NCI Thesaurus
-Mouse pathology
-Human disease
-Cereal plant trait
-PATO (attribute and value)
-Mammalian phenotype
- Human phenotype
-Habronattus courtship
-Loggerhead nesting
-Animal natural history and life history
eVOC (Expressed
Sequence Annotation
for Humans)
Ontologies for life sciences
Ontology Lookup Service
Ontology search engine
Ontology term history tracking
Ontology visualisation
Powerful RESTful API
Repository of over 160 pre-selected biomedical ontologies (4.5 million terms, 11
million relationships)
http://www.ebi.ac.uk/ols
Provides unified mechanism to access multiple ontologies
Large community of users (~5000 per month, 100s of millions of hits per month)
Open source and dockerised
Ontology visualisation tools
Build process
Nightly crawl of
all registered
ontologies
Multiple indexes created
with standalone Spring Boot
applications
API and website
run with Spring data
https://ebispot.github.io
Open Source Software
Loading ontologies into Neo4j
Ontologies usually published in W3C OWL format
RDF based (so already a graph)
but not a very friendly graph for our
use-cases (more on this this afternoon)
Primary OLS use-cases for a graph
Term hierarchy (parent/child)
Simple view over other relationships
Part of, develops from
Extracting subgraphs/subsets
e.g. taxon specific subsets
OWL to Neo4j schema
Every term is a node with a label for each ontology
Each relationship and subset relation is labeled (is-a, part-of, develops-from, etc.)
Powerful yet simple queries
Get the transitive closure for heart following parent and partonomy relations from the UBERON anatomy ontology
MATCH path = (n:Class)-[r:SUBCLASSOF|RelatedTree*]->(parent)
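The variable-length pattern in the query above follows any number of parent or partonomy edges from the start node. As a rough in-memory analogue, the sketch below computes the same kind of transitive closure with a breadth-first search restricted to a chosen set of relation types. The edge list, the relation names `part_of` and `develops_from`, and the function `ancestors` are hypothetical and are not extracted from UBERON or from the OLS schema.

```python
from collections import defaultdict, deque

# Hypothetical (source, relation, target) edges; illustrative anatomy only.
EDGES = [
    ("heart", "SUBCLASSOF", "organ"),
    ("heart", "part_of", "cardiovascular system"),
    ("cardiovascular system", "SUBCLASSOF", "anatomical system"),
    ("organ", "SUBCLASSOF", "anatomical structure"),
    ("heart", "develops_from", "heart primordium"),
]


def ancestors(start, edges, relations):
    """Return every node reachable from start over the allowed relation types."""
    outgoing = defaultdict(list)
    for source, relation, target in edges:
        if relation in relations:
            outgoing[source].append(target)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in outgoing[node]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Follow parent and partonomy relations only; develops_from is ignored.
closure = ancestors("heart", EDGES, {"SUBCLASSOF", "part_of"})
```

Restricting the traversal to a set of relation types plays the same role as the `SUBCLASSOF|RelatedTree` alternation in the Cypher pattern.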
Ontology Mappings
We now have too many ontologies with overlapping scope!
Millions of mappings exist to interlink the ontologies
[Figure: Datasource 1 and Datasource 2, annotated with Human Phenotype Ontology and SNOMED-CT, linked by mappings (xrefs)]
Ontology Mapping Service (OxO)
New database of mappings built with Neo4j
Crawls OLS ontologies and UMLS for mappings and provides UI and API to access all known mappings
Went live March 2017
http://www.ebi.ac.uk/spot/oxo
Exploring the Xref graph
We build a graph in Neo4j of known xrefs
Direct mappings to NCIt Retinoblastoma from Disease Ontology (DO) and EFO
Discover new mappings
If we traverse 1 hop in the graph we can infer more mappings
1 hop
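A minimal sketch of the idea, assuming xrefs are treated as undirected edges: direct mappings sit at distance 1 from a term, and traversing one further hop yields candidate mappings at distance 2. The identifiers below are placeholders rather than real accessions, and `mappings_within` is a hypothetical helper, not the OxO API.

```python
from collections import defaultdict, deque

# Placeholder xref pairs for one disease concept; the IDs are made up.
XREFS = [
    ("DOID:rb", "NCIT:rb"),
    ("EFO:rb", "NCIT:rb"),
    ("DOID:rb", "MESH:rb"),
    ("EFO:rb", "OMIM:rb"),
]


def mappings_within(term, xrefs, max_distance):
    """Return {mapped term: distance} for terms within max_distance xref hops."""
    neighbours = defaultdict(set)
    for left, right in xrefs:
        neighbours[left].add(right)
        neighbours[right].add(left)

    distance = {term: 0}
    queue = deque([term])
    while queue:
        node = queue.popleft()
        if distance[node] == max_distance:
            continue  # do not expand beyond the requested distance
        for other in neighbours[node]:
            if other not in distance:
                distance[other] = distance[node] + 1
                queue.append(other)

    del distance[term]  # the query term is not a mapping of itself
    return distance


direct = mappings_within("NCIT:rb", XREFS, 1)
inferred = mappings_within("NCIT:rb", XREFS, 2)
```

Keeping the distance alongside each result lets a curator treat inferred mappings with more caution than direct ones.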
Problems with mappings
But this exposes inconsistencies in public mappings
Use this as basis for fixing and confirming mappings
Conclusion
Neo4j is being adopted in multiple projects across the institute
Liked because it provides a simple and effective solution to some of our data modelling challenges
Neo4j is a good fit for working with ontologies and taxonomic data
Excellent developer integration for building applications e.g. Spring-data-neo4j
Ontology team
Helen Parkinson, Tony Burdett
Sira Sarntivijai, Olga Vrousgou, Thomas Liener
Funding
EMBL
CORBEL This project receives funding from the
European Union's Horizon 2020 research and
innovation programme under grant agreement No
654248.
EXCELERATE ELIXIR-EXCELERATE is funded by
the European Commission within the Research
Infrastructures programme of Horizon 2020, grant
agreement number 676559.
Predicting annotation
We do a lot of data curation with ontologies
Need better support for mapping prediction
E.g. samples like these are usually annotated with these
terms
Need species specificity e.g. only mapping plant samples
with plant ontology terms
Input from submission | Ontology class
2-deoxy-5-azacytidine | 5-aza-2-deoxycytidine
Ovarian Cancer | ovarian carcinoma
Anterior tibialis | tibialis anterior
Endothelium, Vascula | cardiovascular system endothelium
Tagging with ontologies
We have built a large corpus of known mappings between data values and ontology terms
Piloting building a recommendation engine for our curation tools with Neo4j
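As a rough illustration of recommending terms from a corpus of known mappings, the sketch below counts how often each submitted value was mapped to each ontology term and falls back to fuzzy string matching for unseen values. It uses plain Python rather than Neo4j, and the corpus rows, the `cutoff` value, and the functions `normalise`, `build_index`, and `suggest` are assumptions for illustration, not the pilot described above.

```python
from collections import Counter, defaultdict
from difflib import get_close_matches

# Small example corpus of (submitted value, ontology class) pairs.
# The repeated "ovarian cancer" row is added for illustration.
CORPUS = [
    ("Ovarian Cancer", "ovarian carcinoma"),
    ("ovarian cancer", "ovarian carcinoma"),
    ("2-deoxy-5-azacytidine", "5-aza-2-deoxycytidine"),
]


def normalise(value):
    """Lower-case the value and collapse runs of whitespace."""
    return " ".join(value.lower().split())


def build_index(corpus):
    """Map each normalised value to a Counter of the terms it was mapped to."""
    index = defaultdict(Counter)
    for value, term in corpus:
        index[normalise(value)][term] += 1
    return index


def suggest(value, index, cutoff=0.8):
    """Return candidate terms for a value, most frequently used first."""
    key = normalise(value)
    if key not in index:
        # Fall back to the closest previously seen value, if any is similar.
        close = get_close_matches(key, list(index), n=1, cutoff=cutoff)
        if not close:
            return []
        key = close[0]
    return [term for term, _ in index[key].most_common()]


index = build_index(CORPUS)
```

A species filter, as called for on the earlier slide, could be layered on by storing the taxon with each corpus row and restricting the candidates accordingly.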