Scientific RDF Databases
Michael Mertens, K.U.Leuven
Outline
• Introduction to RDF
• RDF Databases
• Advantages for scientific R&D
• In practice
• Criticism
RDF: Resource Description Framework
• Originally: metadata data model
• Now: general method for the conceptual description of web resources (Semantic Web)
Introduction
• Traditional Web in 2009:
– Sharing documents
– URL as retrieval mechanism
– HTML as standard format
– Hypertext links
Image taken from “The Emerging Web of Linked Data”, Chris Bizer
> Semantic Web
• Data on the web
– HTML describes documents and links between them
– Semantic web:
  • Publish data in RDF, OWL, XML, ..
  • Describe arbitrary things: people, books, events, ..
  • Link between these concepts
  • Machine-readable, web-accessible databases
• Tim Berners-Lee: LINKED DATA
• Connected structured data
• 3 simple principles:
– Use URLs to name conceptual things
– Looking up a URL returns useful data about that thing
– Relationships link to other URLs
> Semantic Web > Linked Data
• Before: scientific data usually not shared
• Pharmaceutical drug discovery
– A lot of spread-out data
  • DrugBank, ClinicalTrials.gov, Health Care and Life Science
– Genomics data, protein data, ..
• A question nobody examined before: “What proteins are involved in signal transduction AND are related to pyramidal neurons?”
Example taken from “Tim Berners-Lee on the next Web”
> Semantic Web > Linked Data > Example
• The web: 223,000 hits, 0 results
• Linked Data: 32 hits, 32 results
DRD1, 1812 adenylate cyclase activation
ADRB2, 154 adenylate cyclase activation
ADRB2, 154 arrestin mediated desensitization of G-protein coupled …
DRD1IP, 50632 dopamine receptor signaling pathway
DRD1, 1812 dopamine receptor, adenylate cyclase activating pathway
DRD2, 1813 dopamine receptor, adenylate cyclase inhibiting pathway
GRM7, 2917 G-protein coupled receptor protein signaling pathway
GNG3, 2785 G-protein coupled receptor protein signaling pathway
GNG12, 55970 G-protein coupled receptor protein signaling pathway
DRD2, 1813 G-protein coupled receptor protein signaling pathway
ADRB2, 154 G-protein coupled receptor protein signaling pathway
CALM3, 808 G-protein coupled receptor protein signaling pathway
HTR2A, 3356 G-protein coupled receptor protein signaling pathway
DRD1, 1812 G-protein signaling, coupled to cyclic nucleotide second…
SSTR5, 6755 G-protein signaling, coupled to cyclic nucleotide second…
MTNR1A, 4543 G-protein signaling, coupled to cyclic nucleotide …
HTR6, 3362 G-protein signaling, coupled to cyclic nucleotide second …
GRIK2, 2898 glutamate signaling pathway
GRIN1, 2902 glutamate signaling pathway
GRIN2A, 2903 glutamate signaling pathway
GRIN2B, 2904 glutamate signaling pathway
ADAM10, 102 integrin-mediated signaling pathway
GRM7, 2917 negative regulation of adenylate cyclase activity
LRP1, 4035 negative regulation of Wnt receptor signaling pathway
ADAM10, 102 Notch receptor processing
ASCL1, 429 Notch signaling pathway
HTR2A, 3356 serotonin receptor signaling pathway
ADRB2, 154 transmembrane receptor protein tyrosine kinase …
PTPRG, 5793 transmembrane receptor protein tyrosine kinase …
EPHA4, 2043 transmembrane receptor protein tyrosine kinase …
NRTN, 4902 transmembrane receptor protein tyrosine kinase …
CTNND1, 1500 Wnt receptor signaling pathway
PREFIX go: <http://purl.org/obo/owl/GO#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX mesh: <http://purl.org/commons/record/mesh/>
# The sc: and ro: prefix declarations are not shown on the slide.
SELECT ?genename ?processname
WHERE {
  graph <http://purl.org/commons/hcls/pubmesh> {
    ?paper ?p mesh:D017966 .
    ?article sc:identified_by_pmid ?paper .
    ?gene sc:describes_gene_or_gene_product_mentioned_by ?article .
  }
  graph <http://purl.org/commons/hcls/goa> {
    ?protein rdfs:subClassOf ?res .
    ?res owl:onProperty ro:has_function .
    ?res owl:someValuesFrom ?res2 .
    ?res2 owl:onProperty ro:realized_as .
    ?res2 owl:someValuesFrom ?process .
    graph <http://purl.org/commons/hcls/20070416/classrelations> {
      { ?process <http://purl.org/obo/owl/obo#part_of> go:GO_0007166 }
      union
      { ?process rdfs:subClassOf go:GO_0007166 }
    }
    ?protein rdfs:subClassOf ?parent .
    ?parent owl:equivalentClass ?res3 .
    ?res3 owl:hasValue ?gene .
  }
  graph <http://purl.org/commons/hcls/gene> { ?gene rdfs:label ?genename }
  graph <http://purl.org/commons/hcls/20070416> { ?process rdfs:label ?processname }
}
Related to Pyramidal Neurons
Part of Signal Transduction
Used 4 sources
• What do we need?
– Identifiers: URIs
– Linking mechanism: HTTP
– Vocabulary: Web Ontology Language (OWL)
– Serialization: RDF/XML
• Identifiers: URIs
– Use of HTTP URLs
– Link to “Resources”
– Possibly many documents per resource
– Shift to non-information resources:
http://dbpedia.org/resource/London
HTML: http://dbpedia.org/page/London
RDF/XML: http://dbpedia.org/data/London.rdf
N-Triples: http://dbpedia.org/data/London.ntriples
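The mapping from one resource URI to several documents can be sketched in a few lines. This is only an illustration: it assumes the URL pattern visible in the London example holds for other DBpedia resources, and the function name `document_urls` is hypothetical.

```python
RESOURCE_PREFIX = "http://dbpedia.org/resource/"

def document_urls(resource_uri):
    """Derive the document URLs for a resource, following the London example."""
    if not resource_uri.startswith(RESOURCE_PREFIX):
        raise ValueError("not a DBpedia resource URI: " + resource_uri)
    name = resource_uri[len(RESOURCE_PREFIX):]
    # One non-information resource, several documents describing it.
    return {
        "HTML": "http://dbpedia.org/page/" + name,
        "RDF/XML": "http://dbpedia.org/data/" + name + ".rdf",
        "N-Triples": "http://dbpedia.org/data/" + name + ".ntriples",
    }

urls = document_urls("http://dbpedia.org/resource/London")
for fmt, url in urls.items():
    print(fmt, url)
```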
• Linking mechanism: HTTP
– Accessible through generic data browsers
– Can be crawled by search engines
– Connects different sources
– In contrast, Web APIs each use a different interface
• Vocabulary: Web Ontology Language (OWL)
– Knowledge representation language
– Designed to be interpreted by computers
– Describes data in terms of classes, individuals and property assertions (relationships)
<owl:Class rdf:ID="Money">
  <rdfs:subClassOf rdf:resource="http://www.w3.org/2002/07/owl#Thing"/>
</owl:Class>
<owl:DatatypeProperty rdf:ID="currency">
  <rdfs:domain rdf:resource="#Money"/>
  <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#string"/>
</owl:DatatypeProperty>
• URIs about the same thing are linked with ‘owl:sameAs’
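Since ‘owl:sameAs’ links can chain, a consumer usually wants the full set of URIs that denote one thing. A minimal sketch using a union-find structure follows; the example.org URIs and the name `same_as_groups` are assumptions for illustration, and only the DBpedia London URI comes from the slides.

```python
def same_as_groups(pairs):
    """Group URIs connected, directly or indirectly, by owl:sameAs statements."""
    parent = {}

    def find(uri):
        # Follow parent links to the representative of the group.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for left, right in pairs:
        parent[find(left)] = find(right)

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return sorted(groups.values(), key=lambda group: sorted(group))

# Hypothetical sameAs statements (subject, object).
pairs = [
    ("http://dbpedia.org/resource/London", "http://example.org/geo/london"),
    ("http://example.org/geo/london", "http://example.org/places/42"),
    ("http://example.org/people/alice", "http://example.org/staff/a.smith"),
]

groups = same_as_groups(pairs)
for group in groups:
    print(sorted(group))
```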
• Based on triples
– Subject, predicate, object
• Resources identified by URI
• URIs can be dereferenced to look up RDF information
• RDF information links to other URIs
< http://dbpedia.org/resource/London,
  http://dbpedia.org/ontology/country,
  http://dbpedia.org/resource/United_Kingdom >
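A triple maps naturally onto a 3-tuple. The sketch below stores the London triple from the slide and looks up its object; the second triple and the helper `objects_of` are illustrative assumptions, added only to show that an object URI can itself be the subject of further statements.

```python
from typing import List, NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"

triples = [
    # The triple shown on the slide.
    Triple(DBR + "London", DBO + "country", DBR + "United_Kingdom"),
    # Hypothetical second triple: the object above reused as a subject.
    Triple(DBR + "United_Kingdom", DBO + "capital", DBR + "London"),
]

def objects_of(subject: str, predicate: str) -> List[str]:
    """Return every object linked from `subject` via `predicate`."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]

country = objects_of(DBR + "London", DBO + "country")
print(country)
```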
This looks a lot like XML..
Why don’t we just use XML??
RDF: <Page, author, Name>
XML (several possible encodings of the same statement):
<document href="Page">
  <author>Name</author>
</document>

<document>
  <details>
    <uri>Page</uri>
    <author>Name</author>
  </details>
</document>

<author>
  <uri>Page</uri>
  <name>Name</name>
</author>
...
RDF vs XML
• RDF/XML: proposed by W3C
• N3 or Turtle: designed for human readability
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:title>Tony Benn</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>

@prefix dc: <http://purl.org/dc/elements/1.1/>.
<http://en.wikipedia.org/wiki/Tony_Benn>
  dc:title "Tony Benn";
  dc:publisher "Wikipedia".
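Both serializations encode the same two triples. As a rough check, the sketch below pulls the triples out of the RDF/XML version using only Python's standard XML parser. It handles just the flat shape used in this example (literal-valued children of rdf:Description) and is not a general RDF/XML parser; the function name `simple_triples` is an assumption.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

rdf_xml = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Tony_Benn">
    <dc:title>Tony Benn</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>"""

def simple_triples(xml_text):
    """Extract (subject, predicate, literal) triples from flat rdf:Description blocks."""
    root = ET.fromstring(xml_text)
    result = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for child in desc:
            # ElementTree renders qualified names as "{namespace}local".
            namespace, local = child.tag[1:].split("}")
            result.append((subject, namespace + local, child.text))
    return result

triples = simple_triples(rdf_xml)
for s, p, o in triples:
    # Print one statement per line, in N-Triples style.
    print(f'<{s}> <{p}> "{o}" .')
```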
RDF: Serialization
• Also called “Triple Store”
• Data in the form of triples: subject – predicate – object
• Dominant query language: SPARQL
RDF Databases
PREFIX abc: <nul://sparql/exampleOntology#>
SELECT ?capital ?country
WHERE {
  ?x abc:cityname ?capital ;
     abc:isCapitalOf ?y .
  ?y abc:countryname ?country ;
     abc:isInContinent abc:Africa .
}
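To show what such a query does, the sketch below evaluates the same four triple patterns over a tiny in-memory triple set by joining variable bindings pattern by pattern. The data and the `ex:` identifiers are hypothetical (the slide gives only the query), and a real triple store would use indexes rather than a full scan.

```python
ABC = "nul://sparql/exampleOntology#"

# Hypothetical data; the slide shows only the query, not the dataset.
triples = {
    ("ex:c1", ABC + "cityname", "Nairobi"),
    ("ex:c1", ABC + "isCapitalOf", "ex:k"),
    ("ex:k", ABC + "countryname", "Kenya"),
    ("ex:k", ABC + "isInContinent", ABC + "Africa"),
    ("ex:c2", ABC + "cityname", "Paris"),
    ("ex:c2", ABC + "isCapitalOf", "ex:f"),
    ("ex:f", ABC + "countryname", "France"),
    ("ex:f", ABC + "isInContinent", ABC + "Europe"),
}

def match(pattern, binding):
    """Yield extended bindings for one triple pattern; '?name' terms are variables."""
    for triple in triples:
        new = dict(binding)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if new.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield new

def query(patterns):
    """Join all patterns, as a SPARQL basic graph pattern does."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(pattern, b)]
    return bindings

rows = query([
    ("?x", ABC + "cityname", "?capital"),
    ("?x", ABC + "isCapitalOf", "?y"),
    ("?y", ABC + "countryname", "?country"),
    ("?y", ABC + "isInContinent", ABC + "Africa"),
])
result = sorted((r["?capital"], r["?country"]) for r in rows)
print(result)
```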
• Built on W3C’s “Linked Data”
• Subset of “Graph databases”
• Nodes (entities), edges (relationships), properties
• Directed, labeled graph structure (predicate URI as label)
Graph View
Image taken from w3.org
• Only standardised NoSQL database
• In contrast to normal RDBMS:
– Very flexible data model
  • Does not require a fixed table schema
– Information as most basic building blocks
  • Enables improvements on data-intensive operations
• Examples: eBay, Facebook, Digg, ..
• Scalable: distributed design
• Self-documenting data
– Vocabulary identified in OWL or RDFS definitions
– Allows multiple schemata
• Open
– Discover new data sources at run-time
• Often weak consistency guarantees
– Solved with additional middleware
Limitations of Relational Databases:
• Not directly visible to web agents
• Primary-foreign key relationships
– Meaning is implicit, unspecified semantics
• No relationships across separate databases
• Parent-child relationships are not natural
– “Self-joins” for each level in hierarchy
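The hierarchy point can be illustrated with a short sketch: when parent-child links are stored as triples, all descendants at any depth come out of one graph traversal, where a plain relational table would need one self-join per level. The class names and the `subClassOf` label below are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical hierarchy, stored as (child, "subClassOf", parent) triples.
triples = [
    ("dopamine_receptor_signaling", "subClassOf", "gpcr_signaling"),
    ("gpcr_signaling", "subClassOf", "signal_transduction"),
    ("glutamate_signaling", "subClassOf", "signal_transduction"),
    ("signal_transduction", "subClassOf", "cellular_process"),
]

# Index the edges by parent so each step of the traversal is a lookup.
children = defaultdict(list)
for child, predicate, parent in triples:
    if predicate == "subClassOf":
        children[parent].append(child)

def descendants(root):
    """Collect all direct and indirect subclasses of `root`, breadth-first."""
    seen, queue = set(), deque([root])
    while queue:
        for child in children[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

found = descendants("signal_transduction")
print(sorted(found))
```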
Advantages for Scientific R&D
• Studies continue to show that research in all fields is increasingly collaborative
• Example: genomic research
– Complex data distributed over many datasets
  • Entrez Gene (EG), Gene Ontology (GO), Swiss-Prot, GenBank, ..
• Problem = lack of well-defined standards
– Integration nightmare:
  • Data scattered, different formats, lacking information
  • Synonyms, ambiguity
– Changing models:
  • Maintenance not feasible
– Understanding and reasoning
  • Need for connecting ontologies
• Challenge: syntactic and semantic heterogeneity
• Localization of resources
– Identify relevant web resources
• Data formats
– Resources are represented in HTML, TXT, images, ..
• Synonyms
– Researchers can name their own data differently
Integration of Databases > Challenges
• Ambiguity
– E.g. “insulin” can represent a drug, protein, gene, ..
• Relations
– One-to-one / one-to-many between identifiers
• Granularity
– Can cause missing data, ..
• Data Warehouse Approach
– Translate data into one local database
– Eliminates unavailability & slow response
– Allows data processing and optimization
– Maintenance problem
  • Evolution of content and structure
– Examples: BioWarehouse, Biozon, DataFoundry
Integration of Databases > Approaches
• Federated Database Approach
– Translate queries for individual sources
– Easier to maintain (e.g. adding a new source)
– Poor performance
– Examples: BioKleisli, DiscoveryLink, QIS
• Semantic Web Approach
– No need to map data models
– Rely on standardized ontologies
– Less work, better performance
– But only if sources comply
In Practice
• Scientists need:– Access to data
– Ability to utilize data
– Handle uncertainty
• Linked Open Data:
– “We all need the same databases, for different decisions or applications”
– Complements data in internal/licensed sources
– Stimulates cross scientific sharing
• Biological data: Human Genome Project
– Increase in web-accessible databases
  • GenBank, Gene Ontology, UniProt, PhenoDB, ..
– Integration is key problem
– Increase in RDF availability
Examples
• YeastHub
– Registration of web-accessible databases
  • Metadata according to Dublin Core standards, using RSS 1.0 to describe an ontology
– Data conversion
  • XML or RDB to RDF conversion (e.g. unique ID = RDF ID, rest of columns are properties)
– Data integration
  • Ad hoc RDF queries
  • Form-based queries (supervised)
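The conversion rule on this slide (unique ID becomes the RDF ID, the remaining columns become properties) can be sketched as below. This is not YeastHub's actual code; the namespace, column names and rows are all hypothetical.

```python
BASE = "http://example.org/db/"  # hypothetical namespace, not YeastHub's

# Hypothetical relational rows; "id" plays the role of the unique ID column.
rows = [
    {"id": "G001", "name": "geneA", "organism": "yeast"},
    {"id": "G002", "name": "geneB", "organism": None},
]

def row_to_triples(row, id_column="id", base=BASE):
    """Turn one table row into triples: ID -> subject URI, columns -> properties."""
    subject = base + str(row[id_column])
    return [
        (subject, base + column, value)
        for column, value in row.items()
        # The ID column is already used as the subject; NULLs produce no triple.
        if column != id_column and value is not None
    ]

triples = [t for row in rows for t in row_to_triples(row)]
for t in triples:
    print(t)
```

Skipping NULL values is a design choice here: in RDF a missing value is simply an absent triple, so no placeholder is needed.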
• Feasibility
– Human behavior and personal preferences
• ‘Database hugging’
– Organizations tend to keep data for themselves
• Censorship and privacy
Criticism
• Published data reusable in research?
– Requires:
  • Provenance information
  • Quality
  • Attribution
  • Consistency
  • ...
– Out-of-context data fails to respect scientific research methodology
• Has anyone ever worked with linked (RDF) data before? What are your experiences?
• Will the semantic web grow to become the Giant Global Graph?
• Why haven’t RDF databases taken off like Relational Databases?
Discussion