2016 bmdid-mappings
Post on 15-Apr-2017
ISWC2016:::BMDID::Dumontier
ONTOLOGY MAPPING FOR LIFE SCIENCE LINKED DATA
Amrapali Zaveri and Michel Dumontier
Stanford Center for Biomedical Informatics Research
Stanford University
Large and growing network of Linked Data
Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/
Linked Data for the Life Sciences
Bio2RDF is an open source project to unify the representation and interlinking of biological data using RDF.
chemicals/drugs/formulations, genomes/genes/proteins, domains
Interactions, complexes & pathways
animal models and phenotypes
Disease, genetic markers, treatments
Terminologies & publications
• 11B+ interlinked statements from 35 biomedical datasets and 400+ ontologies
• dataset description, provenance & statistics
• A growing interoperable ecosystem with the EBI, NCBI, DBCLS, NCBO, OpenPHACTS, and commercial tool providers
Biomedical Linked Data
The lack of coordination around a global schema makes Linked Data chaotic and unwieldy
Federated queries require intimate knowledge of each dataset schema
Get all protein catabolic processes (and more specific GO terms) in biomodels
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?go ?label (COUNT(DISTINCT ?x) AS ?count)
WHERE {
  SERVICE <http://bioportal.bio2rdf.org/sparql> {
    ?go rdfs:label ?label .
    ?go rdfs:subClassOf+ ?tgo .
    ?tgo rdfs:label ?tlabel .
    FILTER regex(?tlabel, "^protein catabolic process")
  }
  SERVICE <http://biomodels.bio2rdf.org/sparql> {
    ?x <http://bio2rdf.org/biopax_vocabulary:identical-to> ?go .
    ?x a <http://www.biopax.org/release/biopax-level3.owl#BiochemicalReaction> .
  }
}
GROUP BY ?go ?label
Previous work involved manual mappings between Bio2RDF types and relations and the Semanticscience Integrated Ontology (SIO)
[Figure: "is a" links among instances (uniprot:P05067, pharmgkb:PA30917, omim:189931), dataset types (uniprot:Protein, refseq:Protein, omim:Gene, pharmgkb:Gene) and the SIO class sio:gene. Layers are labeled dataset, ontology and Knowledge Base.]
Querying Bio2RDF Linked Open Data with a Global Schema. Alison Callahan, José Cruz-Toledo and Michel Dumontier. Bio-ontologies 2012.
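The idea of a global schema can be sketched in a few lines of Python. This is a minimal sketch, assuming a hand-written lookup table from dataset-specific types to SIO classes. The specific pairs (for example omim:Gene to sio:gene), the pairing of each instance with its type, and the helper names `to_global_type` and `group_by_global_type` are illustrative assumptions, not the published mappings.

```python
# Illustrative (assumed) manual mappings from dataset types to SIO classes.
TYPE_TO_SIO = {
    "omim:Gene": "sio:gene",
    "pharmgkb:Gene": "sio:gene",
    "uniprot:Protein": "sio:protein",
    "refseq:Protein": "sio:protein",
}


def to_global_type(dataset_type):
    """Return the SIO class for a dataset-specific type, or None if unmapped."""
    return TYPE_TO_SIO.get(dataset_type)


def group_by_global_type(typed_instances):
    """Group (instance, dataset_type) pairs under their shared SIO class."""
    groups = {}
    for instance, dataset_type in typed_instances:
        sio_class = to_global_type(dataset_type)
        if sio_class is not None:
            groups.setdefault(sio_class, []).append(instance)
    return groups


# Instance/type pairings here are assumed for illustration.
instances = [
    ("omim:189931", "omim:Gene"),
    ("pharmgkb:PA30917", "pharmgkb:Gene"),
    ("uniprot:P05067", "uniprot:Protein"),
]
groups = group_by_global_type(instances)
```

With such a table in place, a query can ask for sio:gene once instead of naming each dataset's own gene type.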
Semanticscience Integrated Ontology (SIO)
An effective upper level ontology.
1500+ classes
207 object properties (inc. inverses)
1 datatype property
Bio2RDF and SIO powered SPARQL federated query: Find chemicals (from CTD) and proteins (from SGD) that participate in the same process (from GOA)
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sio: <http://semanticscience.org/resource/>
SELECT ?chem ?prot ?proc
FROM <http://bio2rdf.org/ctd>
WHERE {
  SERVICE <http://ctd.bio2rdf.org/sparql> {
    ?chemical a sio:chemical-entity .
    ?chemical rdfs:label ?chem .
    ?chemical sio:is-participant-in ?process .
    ?process rdfs:label ?proc .
    FILTER regex(str(?process), "http://bio2rdf.org/go:")
  }
  SERVICE <http://sgd.bio2rdf.org/sparql> {
    ?protein a sio:protein .
    ?protein sio:is-participant-in ?process .
    ?protein rdfs:label ?prot .
  }
}
Many vocabularies, ontologies and community-based standards
are now available
PubChem uses multiple terminologies
Existing limitations with Bio2RDF mappings
• New datasets have been added
• Existing datasets have changed
• The target ontology (SIO) has changed
• The target ontology (SIO) is incomplete and there may be better ontologies to use
• These ontologies are evolving; today's mappings may be invalid or imprecise tomorrow
• Manual process -> not easy and not reproducible -> must automate
Goal
Develop a semi-automated procedure to generate high quality mappings between Bio2RDF and SIO.
Approach
[Figure: mapping approaches placed on a spectrum from automated to manual: distance metrics, graph-based, instance-based, BioPortal, crowdsourcing. Markers distinguish previous work from our work (*).]
Idea: Create mappings between SIO and Bio2RDF using ontologies in BioPortal
Bio2RDF
NCBO Annotator/Recommender
SIO
Bio2RDF-SIO mappings via transitive closure through BioPortal ontologies
[Figure: a Bio2RDF class is matched to a mapped class in a BioPortal ontology, whose super class is matched to a SIO class]
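One way to read "transitive closure through BioPortal ontologies" is as a walk up the class hierarchy from the mapped class until an ancestor with a SIO match is reached. The following is a minimal sketch under that assumption. The function `map_via_hierarchy` and its three lookup tables are hypothetical, and the toy data is modeled on the clinical-trial example in these slides rather than taken from the actual pipeline.

```python
from collections import deque


def map_via_hierarchy(bio2rdf_class, bio2rdf_to_portal, superclasses, portal_to_sio):
    """Walk up from the mapped BioPortal class to the nearest ancestor with a SIO match.

    Returns (mapped_class, matched_class, sio_class), or None if nothing matches.
    """
    mapped = bio2rdf_to_portal.get(bio2rdf_class)
    if mapped is None:
        return None
    queue = deque([mapped])
    seen = {mapped}  # guards against cycles in the hierarchy
    while queue:  # breadth-first, so the nearest matching class wins
        current = queue.popleft()
        if current in portal_to_sio:
            return mapped, current, portal_to_sio[current]
        for parent in superclasses.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return None


# Toy data (assumed), modeled on the clinical-trial example.
bio2rdf_to_portal = {"clinicaltrials:Clinical-Study": "edda:clinical_trial"}
superclasses = {"edda:clinical_trial": ["edda:Study_Design"]}
portal_to_sio = {"edda:Study_Design": "sio:001041"}

result = map_via_hierarchy(
    "clinicaltrials:Clinical-Study", bio2rdf_to_portal, superclasses, portal_to_sio
)
```

Breadth-first order is a design choice here: it prefers the most specific ancestor that has a SIO match, which limits mappings to overly general root classes.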
Results
Bio2RDF: remove blank nodes, general resources, OWL vocabulary & non-Bio2RDF types/relations
319 (of 6093) classes remain after pruning
1 NCBO Annotator: 174 Bio2RDF classes matched directly and exactly to SIO
2 NCBO Recommender: 94 Bio2RDF classes matched to BioPortal ontologies
Results
SIO: 1500 classes
3 BioPortal: 475 ontologies, 393 matched to SIO
Results
Bio2RDF: 319 classes
SIO: 1500 classes
94 Bio2RDF classes matched to BioPortal ontologies
393 BioPortal ontologies matched to SIO
4 Traverse hierarchy (mapped class to super class): 71 matches
Results — Example
Bio2RDF class: clinicaltrials:Clinical-Study
Mapped class: edda:clinical_trial
Super class: edda:Study_Design
SIO class: sio:001041 (study design)
Relation: skos:broader
Mappings often occurred to more than one class
sider:Drug-Indication-Association
sio:010038 (drug)
sio:010299 (disease)
sio:000897 (association)
Manual validation of mappings
Bio2RDF Class                  SIO Class                   Annotation
drugbank:Biotech               -                           no match
clinicaltrials:Organization    sio:00012 (organization)    exact
drugbank:toxicity              sio:001008 (toxicity)       exact
sgd:GlycineCount               sio:000794 (count)          partial (is-a)
wormbase:Genetic-Interaction   sio:010035 (gene)           partial (part-of)
clinicaltrials:Serious-Event   sio:000614 (attribute)      incorrect
drugbank:Source                sio:000510 (model)          incorrect
All results available at https://goo.gl/eiijmQ
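A simple tally per annotation category is one way to summarize a validation pass like this. The sketch below counts only the seven sample rows shown on this slide, so it says nothing about the full result set. The `tally` helper and the grouping of both partial variants under one label are assumptions made for illustration.

```python
from collections import Counter

# The seven sample rows from the slide: (Bio2RDF class, annotation).
rows = [
    ("drugbank:Biotech", "no match"),
    ("clinicaltrials:Organization", "exact"),
    ("drugbank:toxicity", "exact"),
    ("sgd:GlycineCount", "partial is-a"),
    ("wormbase:Genetic-Interaction", "partial part-of"),
    ("clinicaltrials:Serious-Event", "incorrect"),
    ("drugbank:Source", "incorrect"),
]


def tally(rows):
    """Count rows per top-level annotation (exact, partial, incorrect, no match)."""
    counts = Counter()
    for _, annotation in rows:
        top_level = "partial" if annotation.startswith("partial") else annotation
        counts[top_level] += 1
    return counts


counts = tally(rows)
```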
Conclusion
• Developed a semi-automated methodology to map Bio2RDF classes to SIO via BioPortal ontologies
• 245 of 319 Bio2RDF classes matched to SIO
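The 245 figure appears to be the sum of the two match counts reported in the results: 174 direct matches plus 71 matches found by traversing the hierarchy. The slides do not state this derivation explicitly, so the short check below treats it as an assumption. The variable names are illustrative.

```python
direct_matches = 174     # Bio2RDF classes matched directly to SIO (NCBO Annotator step)
hierarchy_matches = 71   # matches found by traversing BioPortal hierarchies
total_classes = 319      # Bio2RDF classes considered after pruning

matched = direct_matches + hierarchy_matches
coverage = matched / total_classes
summary = f"{matched} of {total_classes} classes matched ({coverage:.1%})"
print(summary)
```

Under that assumption, roughly three quarters of the pruned Bio2RDF classes received at least one SIO mapping.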
Limitations
• Unmatched classes: neither SIO nor other ontologies have complete coverage
• Overly general concepts: Semantically incompatible classes
• Incorrect mappings: Matches to part of the class
• Mappings are insufficient to precisely retrieve data across different datasets
Future Work
• Extend SIO to include classes that are ultimately not found
• Explore mid-level portion of SIO to eliminate root level mappings
• Scalable validation via crowdsourcing
• Pursue query rewriting
michel.dumontier@stanford.edu
Website: http://dumontierlab.com