What is Linked Open Data (LOD)? A Brief Primer
Ecoinformatics International Technical Collaboration Partnership
International Web Meeting - Linked Open Data and Environmental Information
December 6 and 7, 2010
Bruce Bargmeyer & Kevin Keck
Lawrence Berkeley National Laboratory
Tel: +1 [email protected]
1
Topics
What is LOD – the big picture
Potential applications
What is LOD – EPA examples
  EPA data in LOD form
  Data descriptions in voiD
  Extensions to voiD using ISO/IEC 11179
  Linkage using ontology
Ecoinformatics challenges
2
Potential Applications
Environmental data
  UNEP, EEA, EPA data …
  Ecoinformatics
  Eye on Earth Summit, Abu Dhabi
  Technology Infrastructure Workgroup
Health
Energy
…
3
What is Linked Data?
A term Sir Tim Berners-Lee uses to describe HTTP-based Data Access for the Web
A linking mechanism for the Web that takes us from hypertext links (Document to Document) to hyperdata links
4
Web of Documents
Based on HTTP & HTML
Analog: a global file system
Primary objects: documents
Hypertext links between documents (or sub-parts of)
Degree of structure in objects fairly low
Semantics of content and links implicit
Designed for human consumption
5
Web of Data
Based on Linked Data (HTTP, RDF, RDFa)
Analog: a global database
Primary objects: things (or descriptions of things)
Links between things
Degree of structure in (descriptions of) things high
Semantics of content and links explicit
Needs metadata, e.g., Vocabulary of Interlinked Datasets (voiD)
Designed for machines to better assist humans
6
A Bit of a Distinction
Linked Data – a standards-based approach (e.g., HTTP, RDF, and RDFa) for making data available on the WWW
Open Data – the notion that data should be made openly available, with an appropriate use license
Linked Open Data combines the two
Governments in the US and Europe are in the lead
7
8
Web Information Sharing between Data Creators and Data Users
[Figure: Data Creators share text and data with Users over the Web. Subject labels on the sources (environ, agriculture, climate, human health, industry, tourism, soil, water, air) appear in English and in Spanish.]
Publish the Documents and Data
9
Web Information Sharing
Lots of Progress with Documents (HTTP & HTML)
Still Problems for Data
[Figure: Data Creators share text and data with Users over the Web; subject labels appear in English and in Spanish.]
Data problems - a confusion of platforms, interfaces, file formats, …
10
Web Information Sharing between Data Creators and Data Users
[Figure: Data Creators share text and data with Users over the Web; subject labels appear in English and in Spanish.]
Publish the Data using standards
Publish the Metadata
W3C, Web 2.0, Web 3.0 View
Suppose Sir TBL gave a presentation at the EOE Summit. Likely topic: Linked Data (Linked Open Data), a spirited, inspirational presentation like those he recently gave at the TED and Gov 2.0 conferences.
See: http://www.ted.com/index.php/search?q=berners+lee
Publishing Linked Data involves 3 basic steps:
1. Assign URIs to the entities described by the data set and provide for dereferencing these URIs over the HTTP protocol into RDF representations.
2. Set RDF links to other data sources on the Web, so that clients can navigate the Web of Data as a whole by following RDF links.
3. Provide metadata about published data, so that clients can assess the quality of published data and choose between different means of access.
11
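The three publishing steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `example.org` and `example.com` URIs, the choice of `owl:sameAs` as the link predicate, and the use of `dcterms:license` as the metadata statement are all hypothetical and do not come from the slides. Serving the minted URIs over HTTP (the dereferencing half of step 1) is outside the sketch.

```python
# Hypothetical namespace and URIs; none of these come from the slides.
BASE = "http://example.org/toxicity/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
DCT_LICENSE = "http://purl.org/dc/terms/license"


def mint_uri(local_id):
    """Step 1: give each entity an HTTP URI in a namespace the publisher controls."""
    return BASE + local_id


def triple(subject, predicate, obj):
    """Serialize one URI-to-URI statement as an N-Triples line."""
    return "<{}> <{}> <{}> .".format(subject, predicate, obj)


def publish(local_id, external_uri, license_uri):
    """Return the link (step 2) and metadata (step 3) triples for one entity."""
    entity = mint_uri(local_id)
    return [
        # Step 2: an RDF link to the same thing in another data source.
        triple(entity, OWL_SAME_AS, external_uri),
        # Step 3: metadata that helps clients decide whether to use the data.
        triple(BASE, DCT_LICENSE, license_uri),
    ]


lines = publish(
    "taxon/mytilus",
    "http://example.com/taxa/79452",
    "http://example.org/licenses/open",
)
print("\n".join(lines))
```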
Berners-Lee “five star system” for Linked Open Data
★ make your stuff available on the web (whatever format)
★★ make it available as structured data (e.g. excel instead of image scan of a table)
★★★ use non-proprietary format (e.g. csv instead of excel)
★★★★ use URLs to identify things, so that people can point at your stuff
★★★★★ link your data to other people’s data to provide context
Presented by TBL at TED and other conferences
For examples and benefits of each star level see: http://lab.linkeddata.deri.ie/2010/star-scheme-by-example/
12
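As a rough illustration of how the star levels build on one another, the following Python sketch rates a dataset from five yes/no properties. It assumes the levels are cumulative (a dataset earns a star only if it also meets every lower level); the function and parameter names are made up for this example.

```python
def star_rating(on_web, structured, open_format, uses_uris, linked):
    """Return 0 to 5 stars, stopping at the first criterion that is not met."""
    stars = 0
    for criterion in (on_web, structured, open_format, uses_uris, linked):
        if not criterion:
            break
        stars += 1
    return stars


# A scanned table image on a web page: available, but not structured.
scan = star_rating(True, False, False, False, False)
# An Excel file: structured, but in a proprietary format.
excel = star_rating(True, True, False, False, False)
# A CSV file: structured and non-proprietary, but without URIs for things.
csv_file = star_rating(True, True, True, False, False)
print(scan, excel, csv_file)
```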
Linking Data
Integrated Taxonomic Information System (ITIS)
Mytilus, ITIS TSN 79452
Gulf of Maine Data
NOAA National Benthic Inventory Data
Databases that commit to ITIS
For example
13
Linking Data
Integrated Taxonomic Information System (ITIS)
National Center for Biotechnology Information (NCBI) Taxonomy
Mytilus, ITIS TSN 79452
Gulf of Maine Data
NOAA National Benthic Inventory Data
Databases that commit to ITIS
More databases that map to ITIS
14
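The idea behind these two slides is that datasets which commit to the same ITIS identifiers can be joined on them. A minimal Python sketch follows. Only the Mytilus TSN (79452), the station name, and the score of -58 come from the slides; the record layout, site names, abundance figures, and the second TSN are invented for illustration.

```python
from collections import defaultdict

# Records keyed by ITIS Taxonomic Serial Number (TSN).
gulf_of_maine = [
    {"station": "Spinney Creek", "tsn": 79452, "score": -58},
]
benthic_inventory = [
    {"site": "hypothetical-site-1", "tsn": 79452, "abundance": 12},
    {"site": "hypothetical-site-2", "tsn": 999999, "abundance": 3},  # made-up TSN
]


def index_by_tsn(records):
    """Group records by their shared taxonomic identifier."""
    index = defaultdict(list)
    for record in records:
        index[record["tsn"]].append(record)
    return index


def link_on_tsn(left, right):
    """Pair every left record with the right records that use the same TSN."""
    right_index = index_by_tsn(right)
    return [(l, r) for l in left for r in right_index.get(l["tsn"], [])]


pairs = link_on_tsn(gulf_of_maine, benthic_inventory)
for toxicity, benthic in pairs:
    print(toxicity["station"], "<->", benthic["site"])
```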
A Practical Example
Provided by Pasky Pascual, EPA
Inspired by his article: “Evidence-based decisions for the wiki world”, Pasky Pascual, International Journal of Metadata, Semantics and Ontologies (IJMSO), Volume 4, Issue 4, 2009, DOI: 10.1504/IJMSO.2009.029232
He provided data for the Gulf of Maine to LBNL
LBNL is using this to demonstrate environmental linked data and voiD files
15
Toxicity Data for Maine in Excel Format
Mytilus
16
Creation of LOD Files from Gulf of Maine Toxicity Data Files (.xls)
Shared ontology:
@base <http://xmdr.org/ont/toxicity.owl> .
@prefix obs: <http://xmdr.org/ont/observations.owl> .
<> a owl:Ontology ;
    owl:imports <http://xmdr.org/ont/observations.owl> .
:Toxicity a owl:Class ;
    rdfs:subClassOf obs:Observation .
:species a owl:ObjectProperty ;
    rdfs:domain :Toxicity ;
    rdfs:range <http://purl.bioontology.org/ontology/NCBI_NMO/Species> .
...
RDF data:
@prefix tox: <http://xmdr.org/ont/toxicity.owl> .
...
<> a owl:Ontology ;
    owl:imports <http://xmdr.org/ont/toxicity.owl> .
<#Place-1> a geo:Point ;
    geo:lat 43.1 ;
    geo:long -70.77 ;
    rdfs:label "Spinney Creek" .
<#_2> a tox:Toxicity ;
    w5h:where <#Place-1> ;
    w5h:when "1985-04-16"^^xsd:date ;
    tox:species ncbi:Mytilus ;
    rdf:value -58 .
17
Transformation to LOD
Fit into W5H Ontology
Observations are Events
Who: collecting agency
What: observable measured
When, Where
How: method/protocol
Use rdf:value for measurement
18
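A minimal Python sketch of the W5H mapping, assuming each spreadsheet row has already been reduced to a node id, a place id, an ISO date, a taxon term, and a measured value. The function name and argument layout are hypothetical; the prefixes (`tox:`, `w5h:`, `xsd:`, `rdf:`) follow the Turtle example on the earlier slide and would still need to be declared in a real file.

```python
from datetime import date


def observation_to_turtle(node_id, place_id, when, taxon, value):
    """Render one toxicity observation as a Turtle fragment (where/when/what/value)."""
    date.fromisoformat(when)  # raises ValueError if the date is not YYYY-MM-DD
    return (
        "<#{}> a tox:Toxicity ;\n".format(node_id)
        + "    w5h:where <#{}> ;\n".format(place_id)
        + '    w5h:when "{}"^^xsd:date ;\n'.format(when)
        + "    tox:species {} ;\n".format(taxon)
        + "    rdf:value {} .\n".format(value)
    )


# Values taken from the Spinney Creek row shown on the slides.
fragment = observation_to_turtle("_2", "Place-1", "1985-04-16", "ncbi:Mytilus", -58)
print(fragment)
```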
Creation of LOD Files from Gulf of Maine Toxicity Data Files (.xls)
Where
When
What
Value
19
Mytilus in Integrated Taxonomic Information System (ITIS)
20
NCBI Taxonomy Browser Shows Link between NCBI Taxonomy and ITIS for Mytilus
21
Linking the Gulf of Maine Data to ITIS Taxonomy
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rdf:RDF [
  <!ENTITY ITIS "http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&amp;search_value=">
  <!ENTITY xsd "http://www.w3.org/2001/XMLSchema#">
]>
<rdf:RDF xmlns:w5h="http://samo.lbl.gov/ont/W5H.owl#"
         xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <ToxicityScore rdf:nodeID="sheet1_row2">
    <w5h:what>
      <Genus rdf:about="&ITIS;79452">
        <rdfs:label>Mytilus</rdfs:label>
      </Genus>
    </w5h:what>
    <w5h:where>
      <rdf:Description rdf:nodeID="station1">
        <rdfs:label>Spinney Creek</rdfs:label>
        <geo:lat rdf:datatype="&xsd;decimal">43.1</geo:lat>
        <geo:long rdf:datatype="&xsd;decimal">-70.77</geo:long>
      </rdf:Description>
    </w5h:where>
    <w5h:when rdf:datatype="&xsd;date">1985-04-16</w5h:when>
    <rdf:value rdf:datatype="&xsd;int">-58</rdf:value>
  </ToxicityScore>
  <ToxicityScore rdf:nodeID="sheet1_row3">
    <w5h:what rdf:resource="&ITIS;79452"/>
    <w5h:where rdf:nodeID="station1"/>
    <w5h:when rdf:datatype="&xsd;date">1985-06-10</w5h:when>
    <rdf:value rdf:datatype="&xsd;int">197</rdf:value>
  </ToxicityScore>
</rdf:RDF>
22
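To show how a client might pick the ITIS links out of such a file, here is a small Python sketch using only the standard library's XML parser. It is not a full RDF parser: it assumes the link is always written as an `rdf:resource` attribute on a `w5h:what` element, and it uses a cut-down sample document (with `rdf:Description` in place of the typed node) rather than the full file from the slide.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
W5H = "http://samo.lbl.gov/ont/W5H.owl#"
ITIS_PREFIX = (
    "http://www.itis.gov/servlet/SingleRpt/SingleRpt"
    "?search_topic=TSN&search_value="
)

# Simplified sample; the ampersand in the URI is escaped as &amp; for XML.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:w5h="http://samo.lbl.gov/ont/W5H.owl#">
  <rdf:Description rdf:nodeID="sheet1_row3">
    <w5h:what rdf:resource="http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&amp;search_value=79452"/>
    <w5h:when>1985-06-10</w5h:when>
  </rdf:Description>
</rdf:RDF>"""


def itis_tsns(rdf_xml):
    """Return the ITIS TSNs referenced by w5h:what links in the document."""
    root = ET.fromstring(rdf_xml)
    tsns = []
    for what in root.iter("{%s}what" % W5H):
        target = what.get("{%s}resource" % RDF, "")
        if target.startswith(ITIS_PREFIX):
            tsns.append(int(target[len(ITIS_PREFIX):]))
    return tsns


found = itis_tsns(SAMPLE)
print(found)
```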
Linking Pasky’s Gulf of Maine Data (Mytilus) to NOAA National Benthic Inventory Data through ITIS
23
Metadata for LOD: voiD
“In order to support clients in choosing the most efficient way to access Web data for the specific task they have to perform, data publishers can provide additional technical metadata about their data set and its interlinkage relationships with other data sets …. The Vocabulary Of Interlinked Datasets … defines terms and best practices to categorize and provide statistical metainformation about data sets as well as the linksets connecting them.”
-- Tim Berners-Lee, Massachusetts Institute of Technology, et al
voiD is a vocabulary and a set of instructions that enables the discovery and usage of linked datasets. A dataset is a collection of data, published and maintained by a single provider, available as RDF, and accessible, for example, through dereferenceable HTTP URIs or a SPARQL endpoint. Based on the voiD vocabulary this document explains how to use voiD in a practical setup, for both data consumers and data providers. -- from voiD Guide
24
voiD: Vocabulary of Interlinked Datasets
• Motivation
– Effective Dataset Selection
– Efficient Discovery of Datasets, by search engines or data publishers
– SPARQL query optimisation and query federation
• Two high-level concepts
– Dataset: a dataset is published and maintained by a single provider and accessible on the Web through de-referenceable URIs or a SPARQL endpoint
– Linkset: a subset of a void:Dataset; stores triples to express the interlinking relationship between datasets
• voiD Vocabulary, http://rdfs.org/ns/void/html
• voiD User's Guide, http://rdfs.org/ns/void-guide
Source: Kei Cheung, Yale Center for Medical Informatics
25
voiD File with Linkage Statistics
Maine Dataset Links to 29 Taxa in ITIS
@prefix void: <http://rdfs.org/ns/void#> .
@prefix scovo: <http://purl.org/NET/scovo#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix : <#> .

<http://www.itis.gov/> a void:Dataset .
:toxicity a void:Dataset ;
    void:vocabulary <http://samo.lbl.gov/ont/w5h.owl> ;
    void:subset :ME_toxicity .
:ME_toxicity void:subset :ME_linkset .
:ME_linkset void:linkPredicate <http://samo.lbl.gov/ont/w5h.owl#what> ;
    void:subjectsTarget :ME_toxicity ;
    void:objectsTarget <http://www.itis.gov/> ;
    void:statItem [ scovo:dimension void:numberOfDistinctObjects ; rdf:value 29 ] .
26
Dataset Description in voiD Format
:senselabontology a void:Dataset ;
    dcterms:title "SenseLab Neuron Ontology" ;
    dcterms:description "Neuroscience ontology derived from the SenseLab NeuronDB database." ;
    dcterms:license <> ; # TODO
    foaf:homepage <http://neuroweb.med.yale.edu/senselab/> ;
    void:exampleResource <http://purl.org/science/owl/sciencecommons/identified_by_pmid> ;
    void:exampleResource <http://purl.org/ycmi/senselab/neuron_ontology.owl#has_Receptor> ;
    void:exampleResource <http://purl.org/ycmi/senselab/neuron_ontology.owl#NMDA> ;
    dcterms:creator :senselab ; ## this organization can be further defined
    dcterms:source <http://purl.org/ycmi/senselab/neuron_ontology.owl#> ;
    dcterms:subject <http://purl.org/ycmi/senselab/neuron_ontology.owl#Receptor> ;
    dcterms:subject <http://dbpedia.org/resource/Receptor_(biochemistry)> ;
    dcterms:subject <http://dbpedia.org/resource/Neurotransmitter_receptor> ;
    dcterms:subject <http://dbpedia.org/resource/Sensory_receptor> ;
    dcterms:source <doi:10.1093/bib/bbm018> ;
    void:feature :owl ; ## this technical feature can be further defined
    void:sparqlEndpoint <http://hcls.deri.org:8080/> ;
    void:vocabulary <http://www.obofoundry.org/ro/ro.owl> .
Source: Adapted from Kei Cheung, Yale Center for Medical Informatics
27
voiD is Extensible
The voiD vocabulary is extensible.
It may be useful to extend it as needed for evaluating and documenting data for environmental decision making, e.g.:
  EPA data standards
  ISO/IEC 11179 data descriptions
28
voiD Deployment
Deploy a voiD file (in Turtle, RDF/XML, or RDFa format) onto the Web server
Make it accessible to search engines, such as Sindice (http://sindice.com/)
Publish a Semantic Sitemap file (sitemap.xml) on the server
  “...... allows Data publishers to state where documents containing RDF data are located, and to advertise alternative means to access it ......” [1]
  Use the datasetURI property in the sitemap.xml to point to the voiD description of a dataset, e.g., http://neuroweb.med.yale.edu/senselab/senselab-void.ttl#senselabontology
[1] http://sw.deri.org/2007/07/sitemapextension/
Source: Kei Cheung, Yale Center for Medical Informatics
29
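A sketch of generating such a sitemap.xml in Python follows. The `datasetURI` property is named on the slide; the extension namespace URI, the `sc:` prefix, and the `dataset` and `datasetLabel` element names are assumptions based on the Semantic Sitemap extension referenced in [1] and should be checked against that document before use.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Assumed namespace for the Semantic Sitemap extension; verify against [1].
SC_NS = "http://sw.deri.org/2007/07/sitemapextension/scschema.xsd"


def build_sitemap(label, void_uri):
    """Build a sitemap whose dataset entry points at a voiD description."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("sc", SC_NS)
    urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
    dataset = ET.SubElement(urlset, "{%s}dataset" % SC_NS)
    ET.SubElement(dataset, "{%s}datasetLabel" % SC_NS).text = label
    ET.SubElement(dataset, "{%s}datasetURI" % SC_NS).text = void_uri
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap(
    "SenseLab Neuron Ontology",
    "http://neuroweb.med.yale.edu/senselab/senselab-void.ttl#senselabontology",
)
print(sitemap_xml)
```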
Some Questions/Challenges
Additional metadata for provenance
  W3C Provenance Vocabulary
  Very similar to Open Provenance Model
ISO/IEC 11179 – can use to extend voiD
Standard measurement units ontology?
  NIST? OMG? SWEET? OBO?
How to manage the terminologies
Which management tools for terminologies/ontologies
  Open Ontology Repository
  BioPortal-based implementation
30
Challenge: Managing Terminologies/Ontologies
31
Some Questions/Challenges
Standard RDF version of ITIS?
  Use LSIDs, or not?
Vocabulary for data collection purpose?
  Critical for determining comparability
Standardize on usage of W5H for datasets?
Standard for numerical ranges (e.g., <44)?
  c.f. GoodRelations ….
32
Acknowledgements
Kevin Keck, LBNL
Mark Musen, Natasha Noy, et al, Stanford
Pasky Pascual, EPA
Others as noted on slides
This material is based upon work supported by the National Science Foundation, under Grant No. 0637122, by USEPA and by DOE. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, DOE, or USEPA.
33