Introduction to RDF and the Semantic Web for the life sciences
Simon Jupp
Samples, Phenotypes and Ontologies Team
European Bioinformatics Institute
Day 2 practical session
• Exploring EBI RDF platform
• Querying EBI resources
• Federated queries from one SPARQL endpoint to another
Exercise 17
• Explore the EBI RDF platform at http://www.ebi.ac.uk/rdf
• A) On the ChEMBL endpoint get ChEMBL activities, assays and targets for the drug Clotrimazole (CHEMBL104)
• B) On the Atlas endpoint find expression for ENSDARG00000042641 (Cyp51)
• B2) Filter the results to those whose property type contains “organism_part”
• C) On the Reactome endpoint find pathways that reference Cyp51 (http://purl.uniprot.org/uniprot/Q1JPY5)
• D) Query the UniProt endpoint to describe http://purl.uniprot.org/uniprot/Q1JPY5
Exercise 17 solution A
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX chembl_molecule: <http://rdf.ebi.ac.uk/resource/chembl/molecule/>
SELECT ?activity ?assay ?target ?targetcmpt ?uniprot
WHERE {
  ?activity a cco:Activity ;
    cco:hasMolecule chembl_molecule:CHEMBL104 ;
    cco:hasAssay ?assay .
  ?assay cco:hasTarget ?target .
  ?target cco:hasTargetComponent ?targetcmpt .
  ?targetcmpt cco:targetCmptXref ?uniprot .
  ?uniprot a cco:UniprotRef
}
Exercise 17 solution B
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX identifiers: <http://identifiers.org/ensembl/>
SELECT DISTINCT ?diffValue ?expUri ?propertyType ?propertyValue ?pvalue
WHERE {
  ?expUri atlasterms:hasAnalysis ?analysis .
  ?analysis atlasterms:hasExpressionValue ?value .
  ?value rdfs:label ?diffValue .
  ?value atlasterms:hasFactorValue ?factor .
  ?factor atlasterms:propertyType ?propertyType .
  ?factor atlasterms:propertyValue ?propertyValue .
  ?value atlasterms:pValue ?pvalue .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref identifiers:ENSDARG00000042641 .
}
Exercise 17 solution B2
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX identifiers: <http://identifiers.org/ensembl/>
SELECT DISTINCT ?diffValue ?expUri ?propertyType ?propertyValue ?pvalue
WHERE {
  ?expUri atlasterms:hasAnalysis ?analysis .
  ?analysis atlasterms:hasExpressionValue ?value .
  ?value rdfs:label ?diffValue .
  ?value atlasterms:hasFactorValue ?factor .
  ?factor atlasterms:propertyType ?propertyType .
  ?factor atlasterms:propertyValue ?propertyValue .
  ?value atlasterms:pValue ?pvalue .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref identifiers:ENSDARG00000042641 .
  FILTER regex(?propertyType, "organism_part")
}
Exercise 17 solution C
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biopax3: <http://www.biopax.org/release/biopax-level3.owl#>
SELECT DISTINCT ?pathway ?pathwayname
WHERE {
  ?pathway rdf:type biopax3:Pathway .
  ?pathway biopax3:displayName ?pathwayname .
  ?pathway biopax3:pathwayComponent ?reaction .
  ?reaction rdf:type biopax3:BiochemicalReaction .
  {
    { ?reaction ?rel ?protein . }
    UNION
    {
      ?reaction ?rel ?complex .
      ?complex rdf:type biopax3:Complex .
      ?complex ?comp ?protein .
    }
  }
  ?protein rdf:type biopax3:Protein .
  ?protein biopax3:entityReference <http://purl.uniprot.org/uniprot/Q1JPY5>
}
LIMIT 100
Exercise 17 solution D
DESCRIBE <http://purl.uniprot.org/uniprot/Q1JPY5>
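The solutions above can also be submitted from a script rather than the web form. The following is a minimal sketch, assuming the endpoint accepts the standard SPARQL 1.1 Protocol "query via GET" form, in which the query text is passed in a URL-encoded `query` parameter. The helper name `build_sparql_get_url` is hypothetical, the endpoint URL is the UniProt one used later in these slides (it may have moved since), and the sketch only builds the URL without sending anything.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_sparql_get_url(endpoint: str, query: str) -> str:
    """Build a SPARQL 1.1 Protocol "query via GET" URL (nothing is sent)."""
    # Append with '&' if the endpoint URL already carries a query string.
    separator = "&" if urlsplit(endpoint).query else "?"
    return endpoint + separator + urlencode({"query": query})


# Assumption: endpoint URL as it appears in these slides; it may have moved.
UNIPROT_ENDPOINT = "http://beta.sparql.uniprot.org/sparql"
SOLUTION_D = "DESCRIBE <http://purl.uniprot.org/uniprot/Q1JPY5>"

url = build_sparql_get_url(UNIPROT_ENDPOINT, SOLUTION_D)
print(url)

# Round trip: decoding the URL gives back the original query text.
assert parse_qs(urlsplit(url).query)["query"] == [SOLUTION_D]
```

The resulting URL could then be fetched with any HTTP client, typically with an `Accept` header naming the desired result format.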
Federated querying
• One of the biggest advantages of SPARQL and triple stores is the ability to federate queries across endpoints
• Integrating data at query time with SPARQL
[Figure: integrating UniProt and GXA at query time. Labels: GO annotation, Expression value, UniProt Protein, UniProt, GXA]
Federated SPARQL
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT DISTINCT ?experiment ?description WHERE {
SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
?experiment a atlasterms:Experiment .
?experiment dcterms:identifier ?id .
?experiment dcterms:description ?description .
FILTER regex(?id, "E-GEOD-2852")
}
}
http://tinyurl.com/o9kvvzn
We can execute this query from any other endpoint using the SPARQL SERVICE keyword
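To make the role of SERVICE concrete, here is a small sketch that takes the Atlas graph pattern from the query above and wraps it in a SERVICE block, producing the same federated query as text. The helper `wrap_in_service` is hypothetical and does nothing beyond string assembly; it is meant only to show that the remote part of a federated query is an ordinary graph pattern with an endpoint URL attached.

```python
def wrap_in_service(endpoint: str, pattern: str) -> str:
    """Wrap a SPARQL graph pattern in a SERVICE block for `endpoint`."""
    body = "\n".join("  " + line.strip() for line in pattern.strip().splitlines())
    return f"SERVICE <{endpoint}> {{\n{body}\n}}"


ATLAS_ENDPOINT = "http://www.ebi.ac.uk/rdf/services/atlas/sparql"

# The graph pattern that will be evaluated remotely by the Atlas endpoint.
ATLAS_PATTERN = """
?experiment a atlasterms:Experiment .
?experiment dcterms:identifier ?id .
?experiment dcterms:description ?description .
FILTER regex(?id, "E-GEOD-2852")
"""

PREFIXES = (
    "PREFIX dcterms: <http://purl.org/dc/terms/>\n"
    "PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>\n"
)

service_block = wrap_in_service(ATLAS_ENDPOINT, ATLAS_PATTERN)
# Indent the SERVICE block one level so it sits inside the outer WHERE.
indented = "\n".join("  " + line for line in service_block.splitlines())
federated_query = (
    PREFIXES
    + "SELECT DISTINCT ?experiment ?description WHERE {\n"
    + indented
    + "\n}"
)
print(federated_query)
```

The printed text can be pasted into any endpoint that supports SPARQL 1.1 federation; the endpoint that receives it forwards the wrapped pattern to the Atlas.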
Exercise 19
• Execute the following federated query on
• 1. The UniProt SPARQL endpoint
• 2. Your Sesame workbench SPARQL endpoint
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT DISTINCT ?experiment ?description WHERE {
SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
?experiment a atlasterms:Experiment .
?experiment dcterms:identifier ?id .
?experiment dcterms:description ?description .
FILTER regex(?id, "E-GEOD-2852")
}
}
Constructing a Federated query
• Basic query to get genes out of our dataset
• How can we integrate this with data from the EMBL-EBI Gene Expression Atlas?
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata: <http://www.mydomain.com/mydata#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
SELECT DISTINCT ?geneid ?label
WHERE {
  ?result mydata:dbXref ?geneid .
  ?geneid rdfs:label ?label .
}
Querying the Atlas
Example query 3 (http://www.ebi.ac.uk/rdf/services/atlas/sparql)
SELECT DISTINCT ?diffValue ?expUri ?propertyType ?propertyValue ?pvalue
WHERE {
  ?expUri atlasterms:hasAnalysis ?analysis .
  ?analysis atlasterms:hasExpressionValue ?value .
  ?value rdfs:label ?diffValue .
  ?value atlasterms:hasFactorValue ?factor .
  ?factor atlasterms:propertyType ?propertyType .
  ?factor atlasterms:propertyValue ?propertyValue .
  ?value atlasterms:pValue ?pvalue .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref identifiers:ENSMUSG00000034450 .
}
ORDER BY ASC(?pvalue)
Our query
SELECT DISTINCT ?geneid ?label
WHERE {
  ?result mydata:dbXref ?geneid .
  ?geneid rdfs:label ?label .
}
Integration point: the gene URI (identifiers:ENSMUSG00000034450 in the Atlas query, ?geneid in ours)
Querying the Atlas
Example query 3 (http://www.ebi.ac.uk/rdf/services/atlas/sparql)
SELECT DISTINCT ?diffValue ?expUri ?propertyType ?propertyValue ?pvalue
WHERE {
  ?expUri atlasterms:hasAnalysis ?analysis .
  ?analysis atlasterms:hasExpressionValue ?value .
  ?value rdfs:label ?diffValue .
  ?value atlasterms:hasFactorValue ?factor .
  ?factor atlasterms:propertyType ?propertyType .
  ?factor atlasterms:propertyValue ?propertyValue .
  ?value atlasterms:pValue ?pvalue .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref identifiers:ENSMUSG00000034450 .
}
ORDER BY ASC(?pvalue)
Our query
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata: <http://www.mydomain.com/mydata#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT DISTINCT ?geneid ?label ?probe
WHERE {
  ?result mydata:dbXref ?geneid .
  ?geneid rdfs:label ?label .
  SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
    ?probe atlasterms:dbXref ?geneid
  }
}
1st gotcha
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata: <http://www.mydomain.com/mydata#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT ?geneid ?label ?probe ?value
WHERE {
  ?result mydata:dbXref ?geneid .
  ?geneid rdfs:label ?label .
  SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
    ?value atlasterms:isMeasurementOf ?probe .
    ?probe atlasterms:dbXref ?geneid
  }
}
This should work but there is an issue with querying the EBI RDF Platform with this version of Sesame (fix coming soon!)
1st gotcha
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata: <http://www.mydomain.com/mydata#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT ?label ?probe ?value
WHERE {
  ?result mydata:dbXref <http://identifiers.org/ensembl/ENSMUSG00000024673> .
  <http://identifiers.org/ensembl/ENSMUSG00000024673> rdfs:label ?label .
  SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
    ?value atlasterms:isMeasurementOf ?probe .
    ?probe atlasterms:dbXref <http://identifiers.org/ensembl/ENSMUSG00000024673>
  }
}
Bind on gene <http://identifiers.org/ensembl/ENSMUSG00000024673>
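Binding the gene by hand works for one gene. For several genes, one possible workaround (an assumption, not something the slides prescribe) is to generate one bound query per gene from a script. The sketch below fills the query above with a gene URI; the function name `gene_query` is hypothetical and the code only produces query text.

```python
def gene_query(gene_iri: str) -> str:
    """Return the federated query with the gene bound to a fixed IRI."""
    gene = f"<{gene_iri}>"
    # Doubled braces are literal braces in an f-string.
    return f"""PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata: <http://www.mydomain.com/mydata#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT ?label ?probe ?value
WHERE {{
  ?result mydata:dbXref {gene} .
  {gene} rdfs:label ?label .
  SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {{
    ?value atlasterms:isMeasurementOf ?probe .
    ?probe atlasterms:dbXref {gene}
  }}
}}"""


# The gene used on the slide; further gene IRIs could be added to this list.
GENES = ["http://identifiers.org/ensembl/ENSMUSG00000024673"]

queries = [gene_query(iri) for iri in GENES]
print(queries[0])
```

Because the gene is now a constant, it no longer appears in the SELECT clause, which matches the bound query on the slide.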
Exercise 20
• A) Extend the previous query so that it also returns, from the Atlas endpoint, the experiment ID and factors (property values) where Ms4a1 (ENSMUSG00000024673) is expressed
• B) Filter those results to only include experiments where the factor contains “liver”
Exercise 20 solution A)
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata: <http://www.mydomain.com/mydata#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX identifiers: <http://identifiers.org/ensembl/>
SELECT ?label ?expUri ?propertyValue
WHERE {
  ?result mydata:dbXref identifiers:ENSMUSG00000024673 .
  identifiers:ENSMUSG00000024673 rdfs:label ?label .
  SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
    ?expUri atlasterms:hasAnalysis ?analysis .
    ?analysis atlasterms:hasExpressionValue ?value .
    ?value atlasterms:hasFactorValue ?factor .
    ?factor atlasterms:propertyValue ?propertyValue .
    ?value atlasterms:isMeasurementOf ?probe .
    ?probe atlasterms:dbXref identifiers:ENSMUSG00000024673
  }
}
Exercise 20 solution B)
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mydata: <http://www.mydomain.com/mydata#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX identifiers: <http://identifiers.org/ensembl/>
SELECT ?label ?expUri ?propertyValue
WHERE {
  ?result mydata:dbXref identifiers:ENSMUSG00000024673 .
  identifiers:ENSMUSG00000024673 rdfs:label ?label .
  SERVICE <http://www.ebi.ac.uk/rdf/services/atlas/sparql> {
    ?expUri atlasterms:hasAnalysis ?analysis .
    ?analysis atlasterms:hasExpressionValue ?value .
    ?value atlasterms:hasFactorValue ?factor .
    ?factor atlasterms:propertyValue ?propertyValue .
    ?value atlasterms:isMeasurementOf ?probe .
    ?probe atlasterms:dbXref identifiers:ENSMUSG00000024673
  }
  FILTER regex(?propertyValue, "liver", "i")
}
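The filter in solution B matches the pattern anywhere in the value, and the "i" flag makes the match case-insensitive. The sketch below imitates that behaviour in Python for simple literal patterns such as "liver" (SPARQL and Python regular expression syntax differ in the details). The function `sparql_like_regex` and the sample values are hypothetical, made up for illustration and not taken from the Atlas.

```python
import re


def sparql_like_regex(text: str, pattern: str, flags: str = "") -> bool:
    """Approximate SPARQL regex(): unanchored match, optional "i" flag."""
    py_flags = re.IGNORECASE if "i" in flags else 0
    return re.search(pattern, text, py_flags) is not None


# Hypothetical property values, for illustration only.
values = ["Liver", "liver carcinoma", "kidney", "DELIVERY"]

matching = [v for v in values if sparql_like_regex(v, "liver", "i")]
print(matching)
```

Note that "DELIVERY" also passes, because the match is on substrings. If whole-word matching matters, the pattern would need word boundaries or anchors.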
Alzheimer’s Use Case – EBI RDF platform
• EFO term for Alzheimer’s: EFO_0000249
• Get genes differentially expressed in Alzheimer’s
• Get proteins encoded by those genes
• Get GO annotations from UniProt for those genes
• Get pathways from Reactome in which those proteins are involved
• Get drugs that target proteins within those pathways
Q1. Get Ensembl genes differentially expressed in Alzheimer’s
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT DISTINCT ?expressionValue ?dbXref ?pvalue ?propertyValue
WHERE {
  ?expUri atlasterms:hasAnalysis ?analysis .
  ?analysis atlasterms:hasExpressionValue ?value .
  ?value rdfs:label ?expressionValue .
  ?value atlasterms:pValue ?pvalue .
  ?value atlasterms:hasFactorValue ?factor .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref ?dbXref .
  ?dbXref rdf:type atlasterms:EnsemblDatabaseReference .
  ?factor atlasterms:propertyType ?propertyType .
  ?factor atlasterms:propertyValue ?propertyValue .
  ?factor rdf:type efo:EFO_0000249 .
}
Q2. Get UniProt proteins for those genes
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT DISTINCT ?expressionValue ?dbXref ?pvalue ?propertyValue
WHERE {
  ?expUri atlasterms:hasAnalysis ?analysis .
  ?analysis atlasterms:hasExpressionValue ?value .
  ?value rdfs:label ?expressionValue .
  ?value atlasterms:pValue ?pvalue .
  ?value atlasterms:hasFactorValue ?factor .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref ?dbXref .
  ?dbXref rdf:type atlasterms:UniprotDatabaseReference .
  ?factor atlasterms:propertyType ?propertyType .
  ?factor atlasterms:propertyValue ?propertyValue .
  ?factor rdf:type efo:EFO_0000249 .
}
Q3. Get UniProt GO Annotations for those genes
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX upc: <http://purl.uniprot.org/core/>
PREFIX identifiers: <http://identifiers.org/ensembl/>
SELECT DISTINCT ?valueLabel ?goid ?golabel
WHERE {
  ?expUri atlasterms:hasAnalysis ?analysis .
  ?analysis atlasterms:hasExpressionValue ?value .
  ?value rdfs:label ?expressionValue .
  ?value atlasterms:pValue ?pvalue .
  ?value atlasterms:hasFactorValue ?factor .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref ?dbXref .
  ?dbXref rdf:type atlasterms:EnsemblDatabaseReference .
  ?factor atlasterms:propertyType ?propertyType .
  ?factor atlasterms:propertyValue ?propertyValue .
  ?factor rdf:type efo:EFO_0000249 .
  ?value rdfs:label ?valueLabel .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref ?uniprot .
  SERVICE <http://beta.sparql.uniprot.org/sparql> {
    ?uniprot a upc:Protein .
    ?uniprot upc:classifiedWith ?keyword .
    ?keyword rdfs:seeAlso ?goid .
    ?goid rdfs:label ?golabel .
  }
}
Q4. Get pathways from Reactome for those proteins
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX biopax3: <http://www.biopax.org/release/biopax-level3.owl#>
SELECT DISTINCT ?pathway ?dbXref
WHERE {
  ?expUri atlasterms:hasAnalysis ?analysis .
  ?analysis atlasterms:hasExpressionValue ?value .
  ?value rdfs:label ?expressionValue .
  ?value atlasterms:pValue ?pvalue .
  ?value atlasterms:hasFactorValue ?factor .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref ?dbXref .
  ?dbXref rdf:type atlasterms:UniprotDatabaseReference .
  ?factor atlasterms:propertyType ?propertyType .
  ?factor atlasterms:propertyValue ?propertyValue .
  ?factor rdf:type efo:EFO_0000249 .
  SERVICE <http://www.ebi.ac.uk/rdf/services/reactome/sparql> {
    ?pathway rdf:type biopax3:Pathway .
    ?pathway biopax3:displayName ?pathwayname .
    ?pathway biopax3:pathwayComponent ?reaction .
    ?reaction rdf:type biopax3:BiochemicalReaction .
    {
      { ?reaction ?rel ?protein . }
      UNION
      {
        ?reaction ?rel ?complex .
        ?complex rdf:type biopax3:Complex .
        ?complex ?comp ?protein .
      }
    }
    ?protein rdf:type biopax3:Protein .
    ?protein biopax3:entityReference ?dbXref
  }
}
Q5. Get drugs that target proteins within those pathways
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX biopax3: <http://www.biopax.org/release/biopax-level3.owl#>
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
SELECT DISTINCT ?dbXrefProt ?pathwayname ?moleculeLabel ?expressionValue ?propertyValue
WHERE {
  # Get differentially expressed genes (and proteins) where the factor is Alzheimer's
  ?value atlasterms:pValue ?pvalue .
  ?value atlasterms:hasFactorValue ?factor .
  ?value rdfs:label ?expressionValue .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref ?dbXrefProt .
  ?dbXrefProt a atlasterms:UniprotDatabaseReference .
  ?factor atlasterms:propertyType ?propertyType .
  ?factor atlasterms:propertyValue ?propertyValue .
  ?factor rdf:type efo:EFO_0000249 .
  # Compounds that target them
  SERVICE <http://www.ebi.ac.uk/rdf/services/chembl/sparql> {
    ?act a cco:Activity ;
      cco:hasMolecule ?molecule ;
      cco:hasAssay ?assay .
    ?molecule rdfs:label ?moleculeLabel .
    ?assay cco:hasTarget ?target .
    ?target cco:hasTargetComponent ?targetcmpt .
    ?targetcmpt cco:targetCmptXref ?dbXrefProt .
    ?targetcmpt cco:taxonomy <http://identifiers.org/taxonomy/9606> .
    ?dbXrefProt a cco:UniprotRef .
  }
  SERVICE <http://www.ebi.ac.uk/rdf/services/reactome/sparql> {
    ?protein rdf:type biopax3:Protein .
    ?protein biopax3:memberPhysicalEntity
      [biopax3:entityReference ?dbXrefProt] .
    ?pathway biopax3:displayName ?pathwayname .
    ?pathway biopax3:pathwayComponent ?reaction .
    ?reaction ?rel ?protein
  }
}
Summary
• Why there is a need for new technologies in the life sciences
• Why RDF is a good fit for some of the problems
• The role of ontologies
• Generating RDF triples from data
• Working with an RDF database
• How to write a SPARQL query
• How the EBI is using RDF
Conclusions
• Generating RDF triples is relatively easy
• Extracting the schema from your data can be tricky
• Avoid over modeling – have good use cases
• Look for appropriate ontologies, reuse terms where possible
• Good tooling now available
• RDF APIs for most programming languages
• Lots of scalable triple stores
• SPARQL is a powerful query language for RDF
• Also very unforgiving; debugging queries is hard
• Treat it as you would SQL: not for your average user
Conclusions (cont.)
• Lots of interest in Linked Data and RDF
• See LOD clouds and DBpedia
• Big name companies using/generating RDF content (Facebook, Google, Oracle)
• Some good examples of applications
• Pharma industry (OpenPhacts project), Semantic publishing (BBC), Government data (data.gov.uk)
• Tread cautiously
• This technology is still maturing
• Not a panacea
• Good solutions for some problems
Thinking beyond RDF and SPARQL
• Selling SPARQL endpoints to biologists is hard, i.e. near impossible
• Entry level is too high and advantages too intangible
• Let programmers code against SPARQL
• Let everyone else use more familiar modes through Apps
RDFApps
• Our first RDFApp targets the existing community of R users – an ArrayExpress R package already exists
• Goal is to expose the power of the Atlas RDF+SPARQL behind a conventional R interface
• Enables those working with raw data to also use the power of the Atlas
Codefest
• Got an idea for an RDF App? Join us at Codefest 2014
• http://www.open-bio.org/wiki/Codefest_2014
• 18th/19th September, Cambridge, UK
Interesting RDF resources for biology
• EBI RDF (http://www.ebi.ac.uk/rdf )
• Bio2RDF (http://bio2rdf.org )
• BioPortal (https://bioportal.bioontology.org )
• OpenPhacts (https://www.openphacts.org )
• PubChem RDF (https://pubchem.ncbi.nlm.nih.gov/rdf/ )
• Identifiers.org (http://identifiers.org )
• Wikipathways (http://wikipathways.org )
• DisGeNet (http://ibi.imim.es/web/DisGeNET/v01/ )
• W3C Healthcare and Life Sciences Working Group (HCLS - http://www.w3.org/blog/hcls/ )
Acknowledgments
• Samples, Phenotypes and Ontologies Group and Functional Genomics Production Team
• James Malone, Robert Petryszak, Tony Burdett, Helen Parkinson
• EBI RDF platform
• Andy Jenkinson, Mark Davies, Marco Brandizi, Sarala Wimalaratne, Leyla Garcia, Jerven Bolleman
Funding
Components of the RDF platform pilot are supported by a number of sources, including:
• EMBL
• European Commission:
• BioMedBridges [284209]
• Diachron [601043]
• OpenPhacts (Innovative Medicines Initiative)
• National Institutes of Health
Questions?
Sign up for our mailing list: http://listserver.ebi.ac.uk/mailman/listinfo/rdf-announce