Bio2RDF and Beyond!
DESCRIPTION
The Bio2RDF project aims to transform silos of bioinformatics data into a distributed platform for biological knowledge discovery. Initial work focused on building a public database of linked open data with web-resolvable identifiers that provides information about named entities. This involved a syntactic normalization to convert open data represented in a variety of formats (flat file, tab-delimited, XML, web services) to RDF-based linked data with normalized names (HTTP URIs) and basic typing from source databases. Bio2RDF entities also refer to other linked open data networks (e.g. DBpedia), thus facilitating traversal across information spaces. However, a significant problem arises when attempting more sophisticated knowledge discovery approaches such as question answering or symbolic data mining. This is because each source represents knowledge in a fundamentally different manner, requiring one to know the underlying data model and reconcile the artefactual differences when they arise. In this talk, we describe our data integration strategy, which uses both syntactic and semantic normalization to consistently marshal knowledge to a common data model while leveraging explicit logic-based mappings with community ontologies to further enhance the biological knowledgescope.
TRANSCRIPT
EBI : 14-01-101
Bio2RDF and Beyond! Large Scale, Distributed Biological Knowledge Discovery
Michel Dumontier, Ph.D.
Associate Professor of Bioinformatics
Carleton University
Department of Biology
School of Computer Science
Institute of Biochemistry
Ottawa Institute of Systems Biology
Ottawa-Carleton Institute of Biomedical Engineering
Carole Goble (ISWC 2005)
Web-based Knowledge Discovery: a very painful process
Syntactic Web… It takes a lot of digging to get answers
Portals provide structured information and give better results
Surface web: 167 terabytes
Deep web: 91,000 terabytes
545-to-one
We need to expose the deep web
Data silos – not made for sharing
How do we integrate these resources?
We want to simultaneously query the 1000+ biological databases
The Semantic Web is a web of knowledge.
It is about standards for publishing, sharing and querying knowledge drawn from diverse sources
It enables the answering of sophisticated questions
A growing web of linked data
Life Science Data Contributors
• HCLS (LODD)
• Neurocommons
• Bio2RDF
Resource Description Framework (RDF)
Uniform Resource Identifiers (URIs) can be used as entity names
Bio2RDF specifies the naming convention
http://bio2rdf.org/uniprot:P05067
is a name for Amyloid precursor protein
http://bio2rdf.org/omim:104300
is a name for Alzheimer disease
uniprot:P05067
omim:104300
Allows one to talk about anything
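The naming convention on this slide can be sketched in a few lines of Python. This is only an illustration: the helper names `bio2rdf_uri` and `split_uri` are hypothetical, and lowercasing the namespace is an assumption. The only thing taken from the slide is the `http://bio2rdf.org/namespace:identifier` pattern.

```python
BASE = "http://bio2rdf.org/"


def bio2rdf_uri(namespace: str, identifier: str) -> str:
    """Build a Bio2RDF-style URI (hypothetical helper; lowercasing is an assumption)."""
    return f"{BASE}{namespace.lower()}:{identifier}"


def split_uri(uri: str) -> tuple:
    """Recover (namespace, identifier) from a Bio2RDF-style URI."""
    if not uri.startswith(BASE):
        raise ValueError(f"not a Bio2RDF-style URI: {uri}")
    namespace, _, identifier = uri[len(BASE):].partition(":")
    return namespace, identifier


print(bio2rdf_uri("uniprot", "P05067"))  # name for Amyloid precursor protein
print(bio2rdf_uri("omim", "104300"))     # name for Alzheimer disease
```

The short forms `uniprot:P05067` and `omim:104300` are then just the part of the URI after the base.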
Resource Description Framework (RDF)
An RDF statement consists of:
– Subject: resource identified by a URI
– Predicate: resource identified by a URI
– Object: resource or literal
Example: uniprot:P05067 is a uniprot:Protein
Allows one to express statements
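As a rough sketch of the subject, predicate, object structure, a statement can be held as a small tuple and written out in N-Triples form. The `Triple` class and `to_ntriples` function are illustrative names, not part of Bio2RDF; the example reuses the URIs from the slides.

```python
from typing import NamedTuple

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


class Triple(NamedTuple):
    subject: str                     # resource identified by a URI
    predicate: str                   # resource identified by a URI
    obj: str                         # resource URI or literal value
    object_is_literal: bool = False  # True when obj is a plain string literal


def to_ntriples(t: Triple) -> str:
    """Serialize one statement as an N-Triples line (plain string literals only)."""
    if t.object_is_literal:
        escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{escaped}"'
    else:
        obj = f"<{t.obj}>"
    return f"<{t.subject}> <{t.predicate}> {obj} ."


app = "http://bio2rdf.org/uniprot:P05067"
print(to_ntriples(Triple(app, RDF_TYPE, "http://bio2rdf.org/uniprot:Protein")))
print(to_ntriples(Triple(app, RDFS_LABEL, "Amyloid precursor protein", True)))
```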
Multi-Source Data Integration
Statements:
uniprot:P05067 is a uniprot:Protein
uniprot:P05067 located in go:Membrane
uniprot:P05067 interacts with uniprot:P05067
Sources: UniProt + Gene Ontology + iRefIndex
Unified view of uniprot:P05067: has name, is a uniprot:Protein, located in go:Membrane, interacts with
depends on consistent naming
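A minimal sketch of why the unified view depends on consistent naming: statements from different sources merge into one description only when they use the same name for the subject. The assignment of each statement to a source is one reading of the diagram, and `UniProtKB/P05067` is a made-up alternative spelling used only to show the failure case.

```python
# One statement per source, as read from the slide (the mapping is an assumption).
UNIPROT = {("uniprot:P05067", "is a", "uniprot:Protein")}
GENE_ONTOLOGY = {("uniprot:P05067", "located in", "go:Membrane")}
IREFINDEX = {("uniprot:P05067", "interacts with", "uniprot:P05067")}


def unified_view(*sources):
    """Merge statement sets and group them by subject, then by predicate."""
    merged = set().union(*sources)
    view = {}
    for subject, predicate, obj in sorted(merged):
        view.setdefault(subject, {}).setdefault(predicate, []).append(obj)
    return view


consistent = unified_view(UNIPROT, GENE_ONTOLOGY, IREFINDEX)
print(len(consistent))  # one entity carrying all three statements

# A source using a different (hypothetical) name for the same protein does not merge.
inconsistent = unified_view(UNIPROT, {("UniProtKB/P05067", "located in", "go:Membrane")})
print(len(inconsistent))  # two unrelated entities
```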
Building statements creates knowledge
uniprot:P05067 is a Protein
uniprot:P05067 label "Amyloid precursor protein"
omim:104300 is a Disease
omim:104300 label "Alzheimer Disease"
uniprot:P05067 is involved in omim:104300
RDF/XML
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://bio2rdf.org/uniprot:Q16665">
    <rdf:type rdf:resource="http://bio2rdf.org/uniprot:Protein"/>
  </rdf:Description>
</rdf:RDF>
RDF/N3
@prefix u: <http://bio2rdf.org/uniprot:> .
u:Q16665 a u:Protein .
RDF has multiple representations
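To illustrate that these are representations of the same statement, the sketch below reads a tiny RDF/XML document with Python's standard library and prints it as an N-Triples line. It handles only `rdf:Description` elements with `rdf:about` and `rdf:resource` attributes, which is an assumption made for brevity; a full RDF parser would be needed for anything beyond this toy subset.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://bio2rdf.org/uniprot:Q16665">
    <rdf:type rdf:resource="http://bio2rdf.org/uniprot:Protein"/>
  </rdf:Description>
</rdf:RDF>"""


def rdfxml_to_ntriples(xml_text):
    """Convert a very small subset of RDF/XML into N-Triples lines."""
    root = ET.fromstring(xml_text)
    lines = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            obj = prop.get(f"{{{RDF_NS}}}resource")
            if subject and obj:
                # ElementTree tags look like "{namespace}local"; join them into a URI.
                predicate = prop.tag[1:].replace("}", "", 1)
                lines.append(f"<{subject}> <{predicate}> <{obj}> .")
    return lines


for line in rdfxml_to_ntriples(RDF_XML):
    print(line)
```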
Bio2RDF is a framework to create and provision linked data networks
Francois Belleau, Laval University
Marc-Alexandre Nolin, Laval University
Peter Ansell, Queensland University of Technology
Michel Dumontier, Carleton University
Bio2RDF’s RDFized data fits together
Bio2RDF now serving over 5 / 15 billion triples of linked biological data
Bio2RDF linked data
Bioinformatics Discovery Registry
• SharedName initiative to provide stable URI patterns for data records.
• We added the relationship between entities and records.
Directory Service
• ~1700 datasets & dozens of resolvers.
Discovery Service
• Registry links entities to data records, their formats (RDF/XML, HTML, etc.) and provider (Bio2RDF, UniProt).
Redirection Service
• Automatic redirection to data provider document.
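The discovery and redirection services might be pictured roughly as a lookup table followed by a redirect. Everything concrete here is hypothetical: the `REGISTRY` contents, the `provider.example.org` URL templates, and the function names are placeholders, not the actual registry.

```python
# Hypothetical registry: namespace -> {format: URL template for the provider document}.
REGISTRY = {
    "uniprot": {
        "rdf": "http://provider.example.org/uniprot/{id}.rdf",
        "html": "http://provider.example.org/uniprot/{id}",
    },
}


def discover(entity):
    """Discovery: list the data records (by format) known for an entity like 'uniprot:P05067'."""
    namespace, _, identifier = entity.partition(":")
    templates = REGISTRY.get(namespace, {})
    return {fmt: template.format(id=identifier) for fmt, template in templates.items()}


def redirect(entity, fmt="html"):
    """Redirection: return an HTTP-style (status, location) pair for the requested format."""
    url = discover(entity).get(fmt)
    return (302, url) if url else (404, None)


print(redirect("uniprot:P05067", "rdf"))
print(redirect("unknown:1"))
```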
Something you can look up or search for, with rich descriptions
Bio2RDF: Raw Data!
SPARQL is the new cool kid on the query block
SQL SPARQL
Bio2RDF’s describe service uses SPARQL
CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  ?s ?p ?o .
  FILTER(?s = <http://bio2rdf.org/ns:id>) .
}
Sent to http://ns.bio2rdf.org/sparql?query=...
http://bio2rdf.org/ns:id
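A sketch of how a describe request for `http://bio2rdf.org/ns:id` could be turned into the SPARQL query and endpoint URL shown on this slide. It assumes that `ns` in `http://ns.bio2rdf.org/sparql` stands for the namespace of the requested entity; the function names are illustrative.

```python
from urllib.parse import urlencode


def describe_query(namespace, identifier):
    """Build the CONSTRUCT query that returns every statement about one entity."""
    uri = f"http://bio2rdf.org/{namespace}:{identifier}"
    return (
        "CONSTRUCT { ?s ?p ?o . } "
        "WHERE { ?s ?p ?o . "
        f"FILTER(?s = <{uri}>) }}"
    )


def describe_url(namespace, identifier):
    """Build the request URL (assumes one SPARQL endpoint per namespace)."""
    endpoint = f"http://{namespace}.bio2rdf.org/sparql"
    return endpoint + "?" + urlencode({"query": describe_query(namespace, identifier)})


print(describe_query("uniprot", "P05067"))
print(describe_url("uniprot", "P05067"))
```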
Bio2RDF’s search service uses SPARQL
http://bio2rdf.org/search/hexokinase
[Figure: search results from kegg, uniprot, gene on bio2rdf.org]
Bio2RDF
Scalable, Decentralized Data Provision
Globally Mirrored and Point Provision
Customizable Query Resolution
Customizable Configuration (in N3)
Single Query, Single Provider
Query Resolution
700,000 queries in November 2009
Yay for data!
But how do we discover more than what was in the data?
Ontology as Strategy
Reasoning and Inference through Semantics
fact: uniprot:P05067 is a uniprot:Protein
ontology: uniprot:Protein is a chebi:PolyatomicEntity
Knowledge base (inferred): uniprot:P05067 is a chebi:PolyatomicEntity
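The inference on this slide can be sketched as a small forward-chaining loop: if an entity is a member of a class, and the ontology says that class is a kind of another class, the entity is also a member of the other class. This is a toy stand-in for an OWL reasoner, with illustrative names, using only the identifiers shown on the slide.

```python
def infer_types(facts, subclass_of):
    """Return facts plus every (entity, class) pair implied by the class hierarchy."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        for entity, cls in list(inferred):
            for parent in subclass_of.get(cls, ()):
                if (entity, parent) not in inferred:
                    inferred.add((entity, parent))
                    changed = True
    return inferred


facts = {("uniprot:P05067", "uniprot:Protein")}             # fact
ontology = {"uniprot:Protein": {"chebi:PolyatomicEntity"}}  # ontology
knowledge_base = infer_types(facts, ontology)               # fact + ontology

for entity, cls in sorted(knowledge_base):
    print(entity, "is a", cls)
```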
The Web Ontology Language (OWL) Has Explicit Semantics
Can therefore be used to capture knowledge in a machine understandable way
Over 170 bio-ontologies
From linked data to linked knowledge through syntactic and semantic normalization.
Multiple Ways To Represent Knowledge
Three ways to model the relationship between a protein and the volume it occupies.
Web-based Knowledge Discovery
Some of our queries need services
The Holy Grail:
Align the promoters of all serine threonine kinases involved exclusively in the regulation of cell sorting during wound healing in blood vessels.
Retrieve and align 2000nt 5' from every serine/threonine kinase in Mus musculus expressed exclusively in the tunica [I | M | A] whose expression increases 5X or more within 5 hours of wounding but is not activated during the normal development of blood vessels, and is <40% similar in the active site to kinases known to be involved in cell-cycle regulation in any other species.
Semantic Automated Discovery and Integration
http://sadiframework.org
Mark Wilkinson, UBC
Michel Dumontier, Carleton University
Christopher Baker, UNB
SADI – described oriented service matching based on registered predicates
What pathways does UniProt protein P47989 belong to?
PREFIX pred: <http://sadiframework.org/ontologies/predicates.owl#>
PREFIX ont: <http://ontology.dumontierlab.com/>
PREFIX uniprot: <http://lsrn.org/UniProt:>
SELECT ?gene ?pathway
WHERE {
  uniprot:P47989 pred:isEncodedBy ?gene .
  ?gene ont:isParticipantIn ?pathway .
}
EBI : 14-01-1045
SADI
• Describe the input and output using OWL-DL classes
• Subject of input and output must be the same
• Web services correspond to predicates
• BioCatalogue to register SADI-compliant services
• Simplified migration path for existing web services (Java, Perl)
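The idea that web services correspond to predicates can be sketched as a registry keyed by predicate, where each service takes a subject and returns statements about that same subject. This is a toy model, not the SADI API: the two stub services, their placeholder data (`gene:EXAMPLE_GENE`, `pathway:EXAMPLE_PATHWAY`), and the function names are all hypothetical.

```python
def encoded_by_service(protein):
    """Stub service for pred:isEncodedBy (placeholder data, not real annotations)."""
    data = {"uniprot:P47989": ["gene:EXAMPLE_GENE"]}
    return [(protein, "pred:isEncodedBy", gene) for gene in data.get(protein, [])]


def participant_in_service(gene):
    """Stub service for ont:isParticipantIn (placeholder data, not real annotations)."""
    data = {"gene:EXAMPLE_GENE": ["pathway:EXAMPLE_PATHWAY"]}
    return [(gene, "ont:isParticipantIn", pathway) for pathway in data.get(gene, [])]


# Services are registered under the predicate they can attach to their input.
SERVICES = {
    "pred:isEncodedBy": encoded_by_service,
    "ont:isParticipantIn": participant_in_service,
}


def follow(start, predicates):
    """Resolve a chain of predicates by calling the service registered for each one."""
    frontier, statements = [start], []
    for predicate in predicates:
        service = SERVICES[predicate]
        next_frontier = []
        for node in frontier:
            for subject, pred, obj in service(node):
                statements.append((subject, pred, obj))  # subject is always the input node
                next_frontier.append(obj)
        frontier = next_frontier
    return statements


for statement in follow("uniprot:P47989", ["pred:isEncodedBy", "ont:isParticipantIn"]):
    print(statement)
```

Each call returns statements whose subject is the input, which mirrors the rule that the subject of input and output must be the same.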
Build a knowledge base from a series of questions
You want to join the knowledge web
Share your data
Build semantic web services
Get to where you want to be … faster!
Next Steps
Service and Data Buildout
Formal Partnerships
Applications
We’re interested in Personalized Medicine
The ability to offer
• The Right Drug
• To The Right Patient
• For The Right Disease
• At The Right Time
• With The Right Dosage
Genetic and metabolic data will allow drugs to be tailored to patient subgroups
PharmGKB is an emerging resource for pharmacogenomics
+ Role of genes, gene variants, drugs + pharmacokinetics + pharmacodynamics + clinical outcomes.
+ Links to publications
- Natural language descriptions
- Variant details in publications
PHARMACOGENOMICS OF DEPRESSION KNOWLEDGE BASE
Contains statements from 11/40 relevant publications involving 45 genes / gene variants, 57 drugs annotated with 19 classes of antidepressants, 45 drug treatments, 47 drug-gene interactions, 29 clinical outcomes, 10 drug-induced side-effects, and 8 gene-disease interactions.
Nortriptyline induced side effects for ABCB1 gene variants
'side effect' that 'is realized by' some ('drug treatment' that 'involves' some 'nortriptyline' and 'involves' some ('variant of' some 'ABCB1'))
QUERYING THE PDKB
Protégé 4, FaCT++, DL Query Tab
postural hypotension is a side effect of nortriptyline treatment of depression for individuals presenting the 3435C>T genotype