bio2rdf : a biological knowledge base for the semantic web

Post on 10-May-2015

DESCRIPTION

A presentation given at the University of Toronto on June 18, 2009 describing the current state of Bio2RDF with respect to biological knowledge representation on the semantic web as linked data with services to describe and answer questions.

TRANSCRIPT

Bio2RDF: A biological knowledge base for the Semantic Web

Michel Dumontier, François Belleau, Marc-Alexandre Nolin, Peter Ansell

Web search for biological information is hit or miss

Something you can look up and search for, with rich descriptions

Introducing...

Surface web: 167 terabytes

Deep web: 91,000 terabytes

545-to-one

Bio-Portals provide database access and give better results

We want to simultaneously query the 1000+ biological databases

Data silos – not made for sharing

How do we integrate these resources?

Bio2RDF provides the methodology to create and glue these different networks.

Bio2RDF is building the linked data web for biological data

Contributing to a growing linked data web

What is the semantic web?

The Semantic Web is a web of knowledge.

It is about standard formats for representing and querying knowledge drawn from diverse sources and making statements about real objects.

Goals for the Semantic Web

• Provide a common knowledge representation (syntax & semantics)

• Facilitate publishing, data integration and information retrieval

• Make possible semantically interoperable web applications and services

• Enable the answering of questions across global repositories of knowledge

Resource Description Framework (RDF)

• Allows one to express propositions, and reason about them

• Uniform Resource Identifiers (URIs) are entity names, e.g. http://purl.uniprot.org/uniprot/Q16665

• An RDF statement consists of:
  – Subject: resource identified by a URI
  – Predicate: resource identified by a URI
  – Object: resource or literal

[Diagram: u:Q16665 rdf:type Protein]

Semantic Knowledge Base

[Diagram: fact: Q16665 rdf:type Protein; ontology: Protein rdfs:subClassOf Molecule; the fact and the ontology together form the knowledge base]
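The fact and ontology statements on the knowledge-base slide can be combined mechanically. The following is a minimal sketch, assuming triples are held as plain tuples and that the only rule applied is the rdfs:subClassOf typing rule; the `Triple` type, the `infer_types` function and the `u:Protein` / `u:Molecule` names are illustrative and not part of Bio2RDF.

```python
from typing import NamedTuple, Set

RDF_TYPE = "rdf:type"
RDFS_SUBCLASS_OF = "rdfs:subClassOf"


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object (hypothetical helper type)."""
    subject: str
    predicate: str
    obj: str


def infer_types(triples: Set[Triple]) -> Set[Triple]:
    """Return the input plus every rdf:type statement implied by rdfs:subClassOf."""
    closed = set(triples)
    changed = True
    while changed:  # repeat until no new statement can be derived
        changed = False
        for typed in list(closed):
            if typed.predicate != RDF_TYPE:
                continue
            for sub in list(closed):
                if sub.predicate == RDFS_SUBCLASS_OF and sub.subject == typed.obj:
                    derived = Triple(typed.subject, RDF_TYPE, sub.obj)
                    if derived not in closed:
                        closed.add(derived)
                        changed = True
    return closed


# Assumed prefixed names; the slide shows "Protein" and "Molecule" unprefixed.
fact = Triple("u:Q16665", RDF_TYPE, "u:Protein")
ontology = Triple("u:Protein", RDFS_SUBCLASS_OF, "u:Molecule")
knowledge_base = infer_types({fact, ontology})
```

With these two statements, the knowledge base also contains the derived statement that u:Q16665 is a Molecule, which neither source states on its own.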

RDF/XML

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:u="http://purl.uniprot.org/uniprot/">
  <rdf:Description rdf:about="http://purl.uniprot.org/uniprot/Q16665">
    <rdf:type rdf:resource="http://purl.uniprot.org/uniprot/Protein"/>
  </rdf:Description>
</rdf:RDF>

N3

@prefix u: <http://purl.uniprot.org/uniprot/> .

u:Q16665 a u:Protein .
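To show that the RDF/XML and N3 forms carry the same statement, here is a minimal sketch that reads a small RDF/XML document with Python's standard library and prints each statement as one line of the form `<subject> <predicate> <object> .`. It is not a full RDF/XML parser: it assumes only `rdf:Description` elements with `rdf:about` and resource-valued properties, and the function name `to_ntriples` is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="http://purl.uniprot.org/uniprot/Q16665">
    <rdf:type rdf:resource="http://purl.uniprot.org/uniprot/Protein"/>
  </rdf:Description>
</rdf:RDF>"""


def to_ntriples(rdf_xml: str) -> list:
    """Extract resource-valued statements from a restricted RDF/XML subset."""
    root = ET.fromstring(rdf_xml)
    lines = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree names tags "{namespace}local"; joining both parts
            # gives the full predicate URI.
            predicate = prop.tag[1:].replace("}", "")
            obj = prop.get(f"{{{RDF_NS}}}resource")
            if obj is not None:  # literal-valued properties are skipped here
                lines.append(f"<{subject}> <{predicate}> <{obj}> .")
    return lines


for line in to_ntriples(RDF_XML):
    print(line)
```

The single printed line names the same subject, predicate (`rdf:type`, which N3 abbreviates to `a`) and object as the N3 statement.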

Syntactic Data Integration

[Diagram: three statements about u:Q16665 (has name HIF1-alpha; located in go:nucleus; interacts with u:vhl), drawn from UniProt, Gene Ontology and BIND, are added together (+) into a unified view of u:Q16665.]

depends on consistent naming
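Why the unified view depends on consistent naming can be sketched in a few lines. The example below assumes each source is a set of (subject, predicate, object) tuples; which source contributes which statement is an assumption made for illustration, and `unified_view` is a hypothetical helper.

```python
from collections import defaultdict

# Assumed attribution of statements to sources, for illustration only.
uniprot = {("u:Q16665", "has name", "HIF1-alpha")}
gene_ontology = {("u:Q16665", "located in", "go:nucleus")}
bind = {("u:Q16665", "interacts with", "u:vhl")}


def unified_view(*sources):
    """Group statements from all sources by subject identifier."""
    view = defaultdict(dict)
    for source in sources:
        for subject, predicate, obj in source:
            view[subject].setdefault(predicate, set()).add(obj)
    return view


merged = unified_view(uniprot, gene_ontology, bind)

# A source that names the same protein differently is not merged:
# the view then holds two unrelated entries instead of one.
renamed = {("uniprot:Q16665", "located in", "go:nucleus")}
split = unified_view(uniprot, renamed)
```

Merging is a plain union keyed on the identifier, so it only works when every source uses exactly the same name for the same entity.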

Semantic Data Integration

[Diagram labels: Protein, u:Q16665, u:vhl, rdf:type]

depends on accurate typing

Linked Data

http://www.w3.org/DesignIssues/LinkedData

Bio2RDF Design Principles

http://bio2rdf.wiki.sourceforge.net/Banff+Manifesto

Over 1800 namespaces

Compiled From: NAR, BioMoby, UniProt, NCBI, SRS

Naming Convention

http://bio2rdf.org/namespace:identifier

http://bio2rdf.org/pdb:1AM0

http://bio2rdf.org/gi:99
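The naming convention is simple enough to express as a pair of helper functions. This is a minimal sketch based only on the pattern and the two example URIs above; the function names are hypothetical, and any further normalization rules Bio2RDF may apply (for example to letter case) are not modelled.

```python
BASE = "http://bio2rdf.org/"


def bio2rdf_uri(namespace: str, identifier: str) -> str:
    """Build a URI of the form http://bio2rdf.org/namespace:identifier."""
    return f"{BASE}{namespace}:{identifier}"


def parse_bio2rdf_uri(uri: str) -> tuple:
    """Split a Bio2RDF-style URI back into (namespace, identifier)."""
    if not uri.startswith(BASE):
        raise ValueError(f"not a Bio2RDF-style URI: {uri}")
    # Split on the first colon only, so identifiers may contain colons.
    namespace, sep, identifier = uri[len(BASE):].partition(":")
    if not (sep and namespace and identifier):
        raise ValueError(f"expected namespace:identifier in {uri}")
    return namespace, identifier


print(bio2rdf_uri("pdb", "1AM0"))
print(parse_bio2rdf_uri("http://bio2rdf.org/gi:99"))
```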

Bio2RDF network = 2.3 billion triples (BT)

Namespace | Domain | Updated | Triples | Topics | Namespaces | SPARQL
Affymetrix | Probeset | loading | 45560115 | 1708777 | 20 | affymetrix
BIND | Network information | 09/04/1930 | | | | bind
BioCYC | Pathway/BioPAX | | 4418699622 + xref | | | biocyc
ChEBI@EBI | Chemistry | 09/03/2025 | 4764030 | 50377 | 25 | chebi
CPD@KEGG | Chemistry | 09/04/2014 | 177199 | 14071 | 10 | kegg
cPath | Pathway/BioPAX | 09/04/2007 | 28052098 | | 51 | cpath
DBpedia | Encyclopedia | 09/03/2023 | 190790 | 0 | 21 | dbpedia
DR@KEGG | Drug | 09/04/2014 | 116822 | 8117 | 8 | dr
EC@KEGG | Enzyme | 09/04/2014 | 556888 | 4245 | 4 | ec
EC@UniProt | Enzyme | 09/04/2014 | 36109 | | | enzyme
GeneID@NCBI | Gene | loading | 1.73E+08 | | 86 | geneid
GL@KEGG | Chemistry | 09/04/2014 | 94148 | 10965 | 2 | kegg
GO | Ontology | 09/03/2015 | 8188649 | 804979 | 144 | go
HGNC | Genome | 09/03/2025 | 1085662 | 125256 | 14 | hgnc
HomoloGene@NCBI | Homolog | 09/03/1931 | 6598206 | | 7 | homologene
IProClass@PIR | Protein | loading | 1.92E+08 | | 19 | iproclass
MGI | Genome | 09/03/2025 | 3089976 | | 12 | mgi
OBO | Ontology | 09/03/2027 | 4507016 | 4954332 | 165 | obo
OMIM@NCBI | Disease | 09/03/2024 | 1048053 | 32102 | 7 | omim
Path@KEGG | Pathway | 09/03/2028 | 50793314 | | | kegg
PDB | Protein | 09/03/2021 | 1215254 | 44569 | 2 | pdb
Pubmed@UniProt | Article | 09/03/1931 | | | | pubmed
Pubmed@NCBI | Article | 09/03/1931 | | | | pubmed
Reactome | Pathway/BioPAX | 09/04/2015 | 57527092 | | 22 | reactome
RN@KEGG | Pathway | 09/04/2015 | 110971 | 7755 | 5 | kegg
SGD | Genome | 09/04/2015 | 1437648 | | 13 | sgd
Taxonomy@UniProt | Taxon | 09/04/2014 | 3230933 | | | taxonomy
UniParc@UniProt | Sequence | 09/04/2009 | 5.59E+08 | | 53 | uniparc
UniPathway@UniProt | Pathway | 09/04/2014 | 8508 | | | unipathway
UniProtKB@UniProt | Protein | 09/04/2016 | 4.56E+08 | | 135 | uniprot
UniRef@UniProt | Homolog | 09/04/2008 | 3.9E+08 | | 5 | uniref
UniSTS@NCBI | Marker | 09/03/1931 | 7542235 | | 7 | unists

Mouse and Human Atlas (65 million triples)

Free, Open Source software

Bio2RDF Software

• http://sourceforge.net/projects/bio2rdf/
• Virtuoso Triple Store gives SPARQL endpoint
• Bio2RDF software transforms URIs to SPARQL queries directed to one or more endpoints
• RDFizers – transform legacy data into RDF
  – OMIM, KEGG
• SW DBs – rules to create Bio2RDF URIs
  – DBpedia, BioPAX

SPARQL Endpoints

http://ns.bio2rdf.org/sparql

http://atlas.bio2rdf.org/sparql

Describe service

http://bio2rdf.org/ns:id

Corresponding SPARQL query:

CONSTRUCT {
  ?s ?p ?o .
}
WHERE {
  ?s ?p ?o .
  FILTER(?s = <http://bio2rdf.org/ns:id>) .
}

Sent to http://ns.bio2rdf.org/sparql?query=...

DNS subdomain resolution service
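The translation from a Bio2RDF URI to an endpoint request can be sketched as follows. This is an illustration under stated assumptions, not the Bio2RDF implementation: it assumes the `ns` in the endpoint host name is the namespace of the URI being described, it uses only the `query` parameter shown on the slide, and the function names are hypothetical. No request is actually sent.

```python
from urllib.parse import urlencode


def describe_query(uri: str) -> str:
    """Build the CONSTRUCT query that returns every statement about `uri`."""
    return (
        "CONSTRUCT { ?s ?p ?o . }\n"
        "WHERE {\n"
        "  ?s ?p ?o .\n"
        f"  FILTER(?s = <{uri}>) .\n"
        "}"
    )


def describe_request_url(namespace: str, identifier: str) -> str:
    """Build the endpoint URL for describing http://bio2rdf.org/namespace:identifier."""
    uri = f"http://bio2rdf.org/{namespace}:{identifier}"
    # Assumption: each namespace is served from its own subdomain.
    endpoint = f"http://{namespace}.bio2rdf.org/sparql"
    return f"{endpoint}?{urlencode({'query': describe_query(uri)})}"


print(describe_request_url("pdb", "1AM0"))
```

`urlencode` percent-encodes the query text, so the angle brackets, colons and slashes of the subject URI survive transport as a single `query` value.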

Search Service

http://bio2rdf.org/search/hexokinase

Virtuoso 6.0 Facet Browsing

http://lod.openlinksw.com/

Multiple Ways To Represent Knowledge

Fig. 2. Three ways to model the relationship between a protein and the volume it occupies.

Fig. 1. From linked data to linked knowledge through syntactic and semantic normalization.

Ontology as Strategy

OWL Has Explicit Semantics

Can therefore be used to capture knowledge in a machine-understandable way

A generalized Biological Data Model

Semantic normalization will improve facet browsing and question answering

You want to join the knowledge web

Share your data

Bridge your data with others in semantic communities (data networks).

Time-sensitive or frequently updated data is one way to encourage more visits.

Bioinformatics Discovery Registry
• Part of SharedName initiative to provide stable URI patterns for data records.
• We add the relationship between entities and records

Discovery Service
• Registry links entities to data records, their formats (RDF/XML, HTML, etc.) and provider (Bio2RDF, UniProt)
http://registry.semanticscience.org/ns:id

Redirection Service
• Automatic redirection to data provider document
http://registry.semanticscience.org/doc/provider/format/ns:id

Build a knowledge base from a series of questions

Carole Goble (ISWC 2005)

Web-based knowledge discovery is a very painful process

The Knowledge Web

• Merging data & services

• Reasoning & question answering

• Persistent (RESTful)

• Trust & Security

Data consumers must be able to rely upon your data to use it as a foundation for their own applications.

2009 Goals

• Add more data!
  – Standardize RDFizers
  – Enrichment from small producer data!
• Design more RESTful services (Workflow)
• Start using Virtuoso 6 cluster
• Add mirrors
• Approval from data providers to distribute RDF dump and publish SPARQL endpoints
  – Confirmed: UniProt, BioCyc, Pathway Commons, BIND

Triplified Data and Virtuoso DB

http://quebec.bio2rdf.org/download

RDFizer Cookbook

http://bio2rdf.wiki.sourceforge.net/

BIO2RDF Materials

Thanks To:

• The Bio2RDF community
• Dumontier Lab – Alex De Leon, Jose Cruz, Natalia Villanueva-Rosales
• Quebec Researchers – François Belleau, Marc-Alexandre Nolin
• Australian Researchers – Peter Ansell
• OpenLink Virtuoso Team

dumontierlab.com
michel_dumontier@carleton.ca
