lecture linked data cloud & sparql
1
COMP3725 Knowledge Enriched Information Systems
Lecture 11-12: Linked Data & SPARQL
Dhavalkumar Thakker (Dhaval)
School of Computing, University of Leeds
2
Reading & Reflections
Bizer, et al. Linked Data – The Story so far
• What is Linked Data?
  – Is it the same as the Web of Data?
• What excited you most about Linked Data while reading this article? Or what did you find most interesting?
• Is Linked Data happening in real life? Have you seen it anywhere?
3
Outline
• What is Linked Data?
• Why Linked Data?
• How to publish as part of Linked Data
  – Linked Data Principles
  – Finding existing sources
  – Possible software architectures
  – Query Language: SPARQL
5
Web of Documents
About:
• United States
• Barack Obama
• Presidential Election (past)
• Some relevance to the election currently being held
• Democrats & Republicans
• Winner & Loser
• Chicago
• Etc.
THINGS
About:
• Location, Event, Places, Persons, Groups, Abstract concepts (winning, losing)
6
..people can parse documents and extract meaning
7
The web of documents
• Analogy
  – Global file system
• Designed for
  – Human consumption
• Primary objects
  – documents
• Links between
  – documents (or sub-parts of)
• Semantics
  – implicit
8
The web of documents: Issues
• Web of Documents, but primarily about data
  – But the connection is implicit
• Integration & Querying
  – Show me all the news stories by US Presidents coming from Chicago
9
Semantic Web
• We need to help machines understand the web, so that machines can help us understand things.
• If machines have access to data about things (i.e. knowledge), they can do a better job of processing documents.
10
Linked Data
An Introduction to Linked Data - Tom Heath, Talis
[Figure: Linking Things. Several Things connected to one another by relationship links.]
11
Linked Data…
• …is about creating a global database of linked things
• …refers to a set of best practices for publishing and interlinking data on the Web
• …is a method of publishing data [on the Web] so that it can be interlinked and become more useful
12
The Web of Linked Data
• Analogy
  – a global database
• Designed for
  – machines first, humans later
• Primary objects
  – things (or descriptions of things)
• Links between
  – things
• Semantics
  – explicit
13
Linked Data: Technologies
• Prerequisites
  – URIs
  – HTTP
  – RDF
  – (RDFS/OWL)
14
Linked Data Technologies : URIs
• Like URLs, but not just for Web pages
  – For things (cars, people, places, organisations, coursework, etc.)
• “A Uniform Resource Identifier (URI) provides a simple and extensible means for identifying a resource.” -- RFC 3986
• Many different schemes: http://, ftp://, mailto:
• Examples:
  http://imash.leeds.ac.uk/ontologies/foaf/dhaval/me.rdf
  http://dbpedia.org/resource/University_of_Leeds
15
HTTP
• Data access mechanism between clients (e.g. web browsers) and servers
• HTTP messages consist of requests from clients to servers and responses from servers to clients
• HTTP request methods: GET, POST, etc.
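A minimal Python sketch of how a client might ask for an RDF description of a thing over HTTP. It only builds the GET request and does not send it. The helper name `rdf_request` is illustrative, and whether a given server honours the `text/turtle` media type in the Accept header is an assumption.

```python
from urllib.request import Request

def rdf_request(uri, media_type="text/turtle"):
    """Build (but do not send) a GET request asking for an RDF representation."""
    return Request(uri, headers={"Accept": media_type}, method="GET")

# Example URI taken from the URIs slide.
req = rdf_request("http://dbpedia.org/resource/University_of_Leeds")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing `req` to `urllib.request.urlopen` would actually perform the lookup.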
16
RDF
• Data model for describing things and their interrelations
• Based on triples
  – Subject, predicate, object
  – <The sky> <has the colour> <blue>
RDF
From my profile in RDF:
  dt:dhaval  rdf:type         foaf:Person
  dt:dhaval  foaf:name        "Dhaval Thakker"
  dt:dhaval  foaf:based_near  dbpedia:Leeds
Prefixes
  dt:      <http://imash.leeds.ac.uk/ontologies/foaf/dhaval/me.rdf#>
  rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  foaf:    <http://xmlns.com/foaf/0.1/>
  dbpedia: <http://dbpedia.org/resource/>
18
Data Merging with RDF
From my profile in RDF:
  dt:dhaval  rdf:type         foaf:Person
  dt:dhaval  foaf:name        "Dhaval Thakker"
  dt:dhaval  foaf:based_near  dbpedia:Leeds
From DBpedia:
  dbpedia:Leeds  dbp-prop:population  "751,500"
  dbpedia:Leeds  dbp-prop: is part of  dbpedia:West_Yorkshire
Prefixes
  dt:       <http://imash.leeds.ac.uk/ontologies/foaf/dhaval/me.rdf#>
  rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  foaf:     <http://xmlns.com/foaf/0.1/>
  dbpedia:  <http://dbpedia.org/resource/>
  dbp-prop: <http://dbpedia.org/ontology/>
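A rough Python sketch of why merging works, assuming each source is a set of tuples. Because both sources use the same URI for Leeds, a plain set union joins them. The property name `dbp-prop:isPartOf` is a guess at the property drawn on the slide, and `population_near` is a hypothetical helper.

```python
# Triples from the FOAF profile (first source).
profile = {
    ("dt:dhaval", "foaf:name", "Dhaval Thakker"),
    ("dt:dhaval", "foaf:based_near", "dbpedia:Leeds"),
}
# Triples from DBpedia (second source); the population value is the one on the slide.
dbpedia = {
    ("dbpedia:Leeds", "dbp-prop:population", "751,500"),
    ("dbpedia:Leeds", "dbp-prop:isPartOf", "dbpedia:West_Yorkshire"),  # assumed name
}

# Merging two RDF graphs is a set union: the shared URI links the data.
merged = profile | dbpedia

def population_near(graph, person):
    """Follow person -> foaf:based_near -> place -> population."""
    places = {o for s, p, o in graph if s == person and p == "foaf:based_near"}
    return sorted(o for s, p, o in graph
                  if s in places and p == "dbp-prop:population")

print(population_near(merged, "dt:dhaval"))
```

The same question asked of `profile` alone returns nothing, since the population only exists in the second source.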
20
Linked Data Principles
• Use URIs as names for things
  – anything, not just documents
• Use HTTP URIs
  – globally unique names, distributed ownership
  – allows people to look up those names
• Provide useful information in RDF
  – when someone looks up a URI
• Include RDF links to other URIs
  – to enable discovery of related information
Tim Berners-Lee, 2007: http://www.w3.org/DesignIssues/LinkedData.html
23
Provide useful information in RDF
http://imash.leeds.ac.uk/ontologies/foaf/dhaval/me.rdf#me
From my profile in RDF:
  dt:me  rdf:type         foaf:Person
  dt:me  foaf:name        "Dhaval Thakker"
  dt:me  foaf:based_near  dbpedia:Leeds
Prefixes
  dt:      <http://imash.leeds.ac.uk/ontologies/foaf/dhaval/me.rdf#>
  rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  foaf:    <http://xmlns.com/foaf/0.1/>
  dbpedia: <http://dbpedia.org/resource/>
24
RDF is Data Model, Not Serialisation Format
• RDF Serialisation Formats : RDF/XML, Turtle, N-Triples
– RDF/XML
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:ID="me">
    <foaf:name>Dhavalkumar Thakker</foaf:name>
    <foaf:title>Dr</foaf:title>
    <foaf:based_near rdf:resource="http://dbpedia.org/resource/Leeds"/>
  </foaf:Person>
</rdf:RDF>
25
RDF is Data Model, Not Serialisation Format
• RDF Serialisation Formats : RDF/XML, Turtle, N-Triples
– Turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dt: <http://imash.leeds.ac.uk/ontologies/foaf/dhaval/me.rdf#> .

dt:me rdf:type foaf:Person ;
    foaf:name "Dhavalkumar Thakker" ;
    foaf:title "Dr" .
26
RDF is Data Model, Not Serialisation Format
• RDF Serialisation Formats : RDF/XML, Turtle, N-Triples
– N-Triples
<http://imash.leeds.ac.uk/ontologies/foaf/dhaval/me.rdf#me> <http://xmlns.com/foaf/0.1/name> "Dhavalkumar Thakker" .
<http://imash.leeds.ac.uk/ontologies/foaf/dhaval/me.rdf#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
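To illustrate that a serialisation is just one way of writing down the same triples, here is a small Python sketch that prints tuples as N-Triples lines. It is a simplification made for illustration: the function `to_ntriples` is hypothetical, objects starting with "http://" are treated as URIs, and everything else becomes a plain literal without datatype or language tag.

```python
def to_ntriples(triples):
    """Serialise (subject, predicate, object) tuples as N-Triples lines."""
    lines = []
    for s, p, o in triples:
        if o.startswith("http://"):
            obj = f"<{o}>"                      # URI object
        else:
            escaped = o.replace("\\", "\\\\").replace('"', '\\"')
            obj = f'"{escaped}"'                # plain literal object
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)

ME = "http://imash.leeds.ac.uk/ontologies/foaf/dhaval/me.rdf#me"
print(to_ntriples([
    (ME, "http://xmlns.com/foaf/0.1/name", "Dhavalkumar Thakker"),
    (ME, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://xmlns.com/foaf/0.1/Person"),
]))
```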
28
Including Links to other Things: Relationship Links
• Relationship Links point at related things in other data sources, for instance, other people, places or genes.
• For example, relationship links enable people to point to background information about the place they live, or to bibliographic data about the publications they have written.
29
Including Links to other Things: Relationship Links
From my profile in RDF:
  dt:dhaval  rdf:type         foaf:Person
  dt:dhaval  foaf:name        "Dhaval Thakker"
  dt:dhaval  foaf:based_near  dbpedia:Leeds
From DBpedia:
  dbpedia:Leeds  dbp-prop:population  "751,500"
  dbpedia:Leeds  dbp-prop: is part of  dbpedia:West_Yorkshire
Prefixes
  dt:       <http://imash.leeds.ac.uk/ontologies/foaf/dhaval/me.rdf#>
  rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  foaf:     <http://xmlns.com/foaf/0.1/>
  dbpedia:  <http://dbpedia.org/resource/>
  dbp-prop: <http://dbpedia.org/ontology/>
30
Including Links to other Things: Identity Links
• Different URIs may refer to the same object
  – <URI1> in one dataset is the same as <URI2> defined somewhere else
<http://dbpedia.org/resource/Kirkgate_Markets> owl:sameAs <http://rdf.freebase.com/ns/guid.9202a8c04000641f8000000000c5f680> .
• Such a need exists due to:
  – Different opinions
  – Traceability
  – No central points of failure
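One way a consumer might use identity links is to group URIs that are declared the same. The Python sketch below does this with a small union-find, treating owl:sameAs as symmetric and transitive. The function name and the short placeholder identifiers in the example are illustrative, not real URIs.

```python
def same_as_groups(links):
    """Group identifiers connected by (a, owl:sameAs, b) links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)          # merge the two groups

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return sorted(sorted(group) for group in groups.values())

# Placeholder identifiers: a = b and b = c, so a, b and c name one thing.
print(same_as_groups([("a", "b"), ("b", "c"), ("x", "y")]))
```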
31
Including Links to other Things: Vocabulary Links
• Reuse existing vocabularies to further specify your own
<http://mydomain.co.uk/myvocab/enterprise#SmallMediumEnterprise>
    rdfs:subClassOf <http://dbpedia.org/ontology/Company> ;
    rdfs:subClassOf <http://umbel.org/umbel/sc/Business> ;
    rdfs:subClassOf <http://rdf.freebase.com/ns/m/0qb7t> .
32
Linked Data Principles: Summary
• Use URIs as names for things
  – anything, not just documents
• Use HTTP URIs
  – globally unique names, distributed ownership
  – allows people to look up those names
• Provide useful information in RDF
  – when someone looks up a URI
  – RDF serialisation formats: RDF/XML, N-Triples & Turtle
• Include RDF links to other URIs
  – to enable discovery of related information
  – Include links: Relationship, Vocabulary & Identity Links
33
Finding Existing Datasets or Vocabularies
• All of the scenarios about including links to other things assume some knowledge of existing vocabularies/datasets
• Where to find such datasets?
• How to find such datasets?
  – Two steps:
    • Find datasets/vocabularies that contain certain Things or Concepts
    • Once found, inspect their coverage and suitability
34
Where to Find: Web of Data
• A significant number of individuals and organisations have adopted Linked Data as a way to publish their data
• The result is a global data space we call the Web of Data
• The Web of Data forms a giant global graph consisting of billions of RDF triples from numerous sources covering all sorts of topics
35
Web of Data
http://richard.cyganiak.de/2007/10/lod/
36
Statistics about Web of Data (2011)
Domain                  Datasets  Triples          (Out-)Links   % of (out-)links
Media                   25        1,841,852,061    50,440,705    10.01 %
Geographic              31        6,145,532,484    35,812,328    7.11 %
Government              49        13,315,009,400   19,343,519    3.84 %
Publications            87        2,950,720,693    139,925,218   27.76 %
Cross-domain            41        4,184,635,715    63,183,065    12.54 %
Life sciences           41        3,036,336,004    191,844,090   38.06 %
User-generated content  20        134,127,413      3,449,143     0.68 %
Total                   295       31,634,213,770   503,998,829
More statistics from: http://www4.wiwiss.fu-berlin.de/lodcloud/state/
37
Step 1: Finding existing datasets and vocabularies: publishing sites -> Data Hub
Available from: http://datahub.io/
38
Step 1: Finding existing datasets and vocabularies: search engines-> Sindice
Available from: http://sindice.com/
41
Step 1: Finding existing datasets and vocabularies: search engines -> Falcons
Available from: http://ws.nju.edu.cn/falcons/conceptsearch/index.jsp
42
Finding existing datasets and vocabularies: search engines-> Watson
Available from: http://kmi-web05.open.ac.uk/WatsonWUI/
43
Finding existing datasets and vocabularies: search engines-> Swoogle
Available from: http://swoogle.umbc.edu/
44
Step 1: Finding existing datasets and vocabularies: search engines-> SWSE
Available from: http://swse.deri.org/
45
Step 2: Once found, how to inspect further for coverage, suitability
• Linked Data sources usually provide a SPARQL endpoint for their dataset(s)
• A SPARQL endpoint is an access point to the dataset(s) that receives queries and returns results
• If you have used MySQL, you might be familiar with phpMyAdmin
  – SPARQL endpoints are similar in nature and functionality
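A hedged Python sketch of what "receiving a query" looks like at the HTTP level. Under the SPARQL Protocol, a query can be sent as the `query` parameter of a GET request. The sketch only builds the request and does not send it. The helper name `sparql_get` is illustrative, and requesting JSON results through the Accept header assumes the endpoint supports that format.

```python
from urllib.parse import urlencode, urlsplit, parse_qs
from urllib.request import Request

def sparql_get(endpoint, query):
    """Build (but do not send) a SPARQL Protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})

QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 5"
req = sparql_get("http://dbpedia.org/sparql", QUERY)
print(req.full_url)

# The query text survives the round trip through URL encoding.
print(parse_qs(urlsplit(req.full_url).query)["query"][0])
```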
46
Web of Data
http://richard.cyganiak.de/2007/10/lod/
DBpedia: Extracting Infobox
http://en.wikipedia.org/wiki/Calgary
http://dbpedia.org/resource/Calgary
  dbpedia:native_name "Calgary" ;
  dbpedia:altitude "1048" ;
  dbpedia:population_city "988193" ;
  dbpedia:population_metro "1079310" ;
  mayor_name dbpedia:Dave_Bronconnier ;
  governing_body dbpedia:Calgary_City_Council ;
  ...
DBpedia: SPARQL Endpoint
Web address: dbpedia.org/sparql
49
SPARQL
• Query Language for RDF
  – Based on the RDF data model
• Possible to write complex joins of disparate datasets
• Implemented by all major RDF databases
See more: http://www.w3.org/TR/rdf-sparql-query/
50
Structure of a SPARQL Query
51
#prefix declaration
prefix dbp-ont: <http://dbpedia.org/ontology/>
#result clause
SELECT *
#dataset definition
FROM <http://dbpedia.org>
#query pattern
WHERE {
dbp-ont:Person ?p ?o.
}
SELECT query: Find everything about Concept of “Person” as in Dbpedia
53
#prefix declaration
prefix dbp-ont: <http://dbpedia.org/ontology/>
Prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
#result clause
SELECT ?o
#dataset definition
FROM <http://dbpedia.org>
#query pattern
WHERE {
dbp-ont:Person rdfs:subClassOf ?o.
}
SELECT query: Find superclasses of Concept of “Person” as in Dbpedia
54
#prefix declaration
prefix dbp-ont: <http://dbpedia.org/ontology/>
Prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
#result clause
SELECT ?s
#dataset definition
FROM <http://dbpedia.org>
#query pattern
WHERE {
?s rdf:type dbp-ont:Person .
}
SELECT query: Find all persons in Dbpedia
55
#prefix declaration
prefix dbp-ont: <http://dbpedia.org/ontology/>
Prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
#result clause
SELECT ?s
#dataset definition
FROM <http://dbpedia.org>
#query pattern
WHERE {
?s rdf:type dbp-ont:Person .
?s rdf:type dbp-ont:Astronaut.
}
SELECT query: Find specific types of persons in Dbpedia
Some one who is Person & Astronaut
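How a WHERE block with several triple patterns behaves can be sketched in Python: each pattern is matched against every triple, and a variable such as ?s must keep the same value across patterns, which is what makes the patterns act as a join. This is a toy evaluator, not how a real SPARQL engine is implemented; `match`, `select` and the `ex:` resources are hypothetical.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):                   # variable
            if new.setdefault(term, value) != value:
                return None                        # clashes with earlier binding
        elif term != value:                        # constant must match exactly
            return None
    return new

def select(graph, patterns):
    """Evaluate a list of triple patterns as a join over shared variables."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in graph:
                result = match(pattern, triple, binding)
                if result is not None:
                    extended.append(result)
        bindings = extended
    return bindings

graph = [  # made-up data for illustration
    ("ex:alice", "rdf:type", "dbp-ont:Person"),
    ("ex:alice", "rdf:type", "dbp-ont:Astronaut"),
    ("ex:bob", "rdf:type", "dbp-ont:Person"),
]
print(select(graph, [("?s", "rdf:type", "dbp-ont:Person"),
                     ("?s", "rdf:type", "dbp-ont:Astronaut")]))
```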
56
#prefix declaration
prefix dbp-ont: <http://dbpedia.org/ontology/>
Prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
#result clause
SELECT ?s
#dataset definition
FROM <http://dbpedia.org>
#query pattern
WHERE {
?s rdf:type dbp-ont:Person .
?s rdf:type dbp-ont:Astronaut.
?s dbp-ont:status "Retired"@en.
}
SELECT query: Find specific types of persons in Dbpedia
Some one who is Person & Astronaut& Retired
57
#prefix declaration
prefix dbp-ont: <http://dbpedia.org/ontology/>
Prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
#result clause
SELECT ?s
#dataset definition
FROM <http://dbpedia.org>
#query pattern
WHERE {
?s rdf:type dbp-ont:Person .
?s rdf:type dbp-ont:Astronaut.
?s dbp-ont:status "Retired"@en.
}
LIMIT 10
SELECT query: Find 10 of this, LIMIT
Some one who is Person & Astronaut& Retired
58
#prefix declaration
prefix dbp-ont: <http://dbpedia.org/ontology/>
Prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
#result clause
SELECT *
#dataset definition
FROM <http://dbpedia.org>
#query pattern
WHERE {
?s rdf:type dbp-ont:Person .
?s rdf:type dbp-ont:Astronaut.
?s dbp-ont:status "Retired"@en.
?s dbp-ont:birthDate ?date
}
# DESC puts the latest birth date, i.e. the youngest person, first
ORDER BY DESC(?date)
LIMIT 10
SELECT query: Find 10 of this and order it by date: ORDER BY
Some one who is Person & Astronaut& Retired & youngest first
59
Mathematical operations & Filtering results
• Find me all landlocked countries with a population greater than 15 million, with the highest population country first
PREFIX type: <http://dbpedia.org/class/yago/>
PREFIX prop: <http://dbpedia.org/property/>
# rdfs: must be declared because rdfs:label is used below
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?country_name ?population
WHERE {
  ?country a type:LandlockedCountries .
  ?country rdfs:label ?country_name .
  ?country prop:populationEstimate ?population .
  FILTER (?population > 15000000 && langMatches(lang(?country_name), "EN"))
}
ORDER BY DESC(?population)
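For readers more used to general-purpose code, the FILTER and ORDER BY steps correspond roughly to a list filter followed by a descending sort. The Python sketch below assumes the result rows are already available as dictionaries. The rows are made up, and `lang_matches` is only a simplified stand-in for SPARQL's langMatches.

```python
def lang_matches(tag, wanted):
    """Simplified langMatches: same tag or a sub-tag such as en-GB, ignoring case."""
    tag, wanted = tag.lower(), wanted.lower()
    return tag == wanted or tag.startswith(wanted + "-")

def big_countries(rows, minimum):
    """FILTER on population and label language, then ORDER BY DESC(population)."""
    kept = [r for r in rows
            if r["population"] > minimum and lang_matches(r["lang"], "EN")]
    return sorted(kept, key=lambda r: r["population"], reverse=True)

# Hypothetical rows; names and numbers are invented for illustration.
rows = [
    {"name": "Country A", "lang": "en", "population": 16_000_000},
    {"name": "Pays A", "lang": "fr", "population": 16_000_000},
    {"name": "Country B", "lang": "en", "population": 30_000_000},
    {"name": "Country C", "lang": "en", "population": 9_000_000},
]
print([r["name"] for r in big_countries(rows, 15_000_000)])
```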
60
ASK query: Is India a Landlocked country?
• Is India a landlocked country?
• ASK query:
PREFIX yago: <http://dbpedia.org/class/yago/>
PREFIX prop: <http://dbpedia.org/property/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
ASK
{ <http://dbpedia.org/resource/India> rdf:type yago:LandlockedCountries . }
• Try replacing India with Afghanistan
• The WHERE keyword does not have to be specified in an ASK query
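Conceptually, an ASK query with a single fully specified triple is a membership test that returns true or false. A tiny Python sketch over a hand-made graph (one triple, chosen for illustration; the function `ask` is hypothetical):

```python
graph = {
    ("dbpedia:Afghanistan", "rdf:type", "yago:LandlockedCountries"),
}

def ask(graph, triple):
    """Return True if the exact triple is present, like a one-pattern ASK."""
    return triple in graph

print(ask(graph, ("dbpedia:India", "rdf:type", "yago:LandlockedCountries")))
print(ask(graph, ("dbpedia:Afghanistan", "rdf:type", "yago:LandlockedCountries")))
```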
61
Exercise: Write a SPARQL query
• Write a SPARQL query to retrieve all the rock bands from the Republic of Ireland.
Prefix dbpedia: <http://dbpedia.org/resource/>
Prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
Prefix dbp-onto: <http://dbpedia.org/ontology/>
Use the following classes or properties:
dbp-onto:Band, dbp-onto:genre, dbpedia:Rock_music, dbpedia:Republic_of_Ireland, dbp-onto:hometown
62
Exercise: Write a SPARQL query
• Write a SPARQL query to retrieve all the rock bands from the Republic of Ireland.
Prefix dbpedia: <http://dbpedia.org/resource/>
Prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
Prefix dbp-onto: <http://dbpedia.org/ontology/>
Select * where {
?s rdf:type dbp-onto:Band.
?s dbp-onto:genre dbpedia:Rock_music.
?s dbp-onto:hometown dbpedia:Republic_of_Ireland
}
63
Summary: Finding existing datasets/vocabularies
• Use of search engines to find a dataset
• Use of SPARQL endpoints to inspect the dataset further
• SPARQL queries
  – SELECT query for selecting a set of results to display
  – ASK query to ask a specific yes/no question about something
  – Variations in terms of LIMIT, ORDER BY
64
Publishing Linked Data: Software Architecture Patterns
• Follow the Linked Data principles
  – They are good-practice principles, NOT norms or rules
• The software architecture needs to support this way of publishing
  – Existing architectures using structured or unstructured data
  – Publishing Linked Data from scratch is different from working with existing applications and infrastructure already in place
65
Architecture scenarios
67
Type of data
• Structured data
  – Database tables
  – XML documents
• Unstructured data
  – Textual documents
    • News stories, reports, textual descriptions (as text files)
Example database table:
Name | Address | Post code | Author of
A    | ----    | -------   | Book B
Architecture scenarios
69
Query-able Structured Data to Linked Data
• Example: a movie business that holds its movie data in a relational database
• Can be published relatively easily as Linked Data by using relational-database-to-RDF wrappers
• Wrappers map database schemas to RDF schemas
• Wrappers
  – Virtuoso RDF Views
  – Triplify
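The general idea of such a mapping can be sketched in Python with an in-memory SQLite table: each row becomes a subject URI and each column becomes a predicate. This is a hand-rolled illustration, not how Virtuoso RDF Views or Triplify are configured; the table, the `example.org` namespaces and the function name are all hypothetical.

```python
import sqlite3

BASE = "http://example.org/movie/"    # hypothetical URI namespace for rows
VOCAB = "http://example.org/vocab#"   # hypothetical vocabulary for columns

def rows_to_triples(conn):
    """Map each row of the `movie` table to triples (one predicate per column)."""
    triples = []
    query = "SELECT id, title, year FROM movie ORDER BY id"
    for movie_id, title, year in conn.execute(query):
        subject = f"{BASE}{movie_id}"
        triples.append((subject, VOCAB + "title", title))
        triples.append((subject, VOCAB + "year", year))
    return triples

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
conn.execute("INSERT INTO movie VALUES (1, 'Example Film', 2001)")
print(rows_to_triples(conn))
```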
70
Architecture scenarios
71
Static Structured Data to Linked Data
• Example: a UK government department that holds performance data for each department in Excel sheets
• Such data must undergo a conversion process that outputs static RDF files or loads the converted data directly into an RDF store
• RDFizing tools
  – http://www.w3.org/wiki/ConverterToRdf
  – Tools to convert data from various formats to RDF
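A conversion step of this kind can be sketched in Python for a sheet exported as CSV: one column supplies the subject identifier and every other column becomes a property. The column names, the `example.org` base URI and the function `csv_to_triples` are assumptions for illustration; real RDFizing tools offer far richer mappings.

```python
import csv
import io

def csv_to_triples(text, base, key):
    """Turn CSV rows into triples; `key` names the column used for subject URIs."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = base + row[key]
        for column, value in row.items():
            if column != key:
                triples.append((subject, base + "prop/" + column, value))
    return triples

# Made-up performance data standing in for an exported spreadsheet.
sheet = "dept,score\nA,7\nB,9\n"
for triple in csv_to_triples(sheet, "http://example.org/", "dept"):
    print(triple)
```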
72
RDF store
• Also called “triple store” or “semantic repository”
• They are engines similar to a DBMS: they allow for storage, querying, and management of structured data. Major differences:
  – They use ontologies as semantic schemata. This allows them to automatically reason about the data.
  – They work with flexible and generic physical data models (e.g. graphs). This allows them to easily interpret and adopt new ontologies or metadata schemata "on the fly".
• Available RDF stores: OWLIM, AllegroGraph, Virtuoso, Sesame, Jena TDB
73
Architecture scenarios
74
From Text Documents to Linked Data
• Example: News publisher with a corpus of news stories produced in the last month
• It is possible to pass these documents through a Linked Data entity extractor such as Open Calais (http://www.opencalais.com/) or DBpedia Spotlight (http://dbpedia-spotlight.github.com/demo/index.html), which annotates documents with the Linked Data URIs of the entities referenced in them.
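What "annotating a document with URIs" produces can be shown with a toy Python lookup. This is a deliberately naive dictionary match, not the method used by Open Calais or DBpedia Spotlight; the `GAZETTEER` table and the function `annotate` are hypothetical.

```python
GAZETTEER = {  # surface form -> Linked Data URI (tiny hand-made lookup table)
    "Leeds": "http://dbpedia.org/resource/Leeds",
    "Chicago": "http://dbpedia.org/resource/Chicago",
}

def annotate(text):
    """Return sorted (surface form, URI) pairs for known names found in `text`."""
    return sorted((name, uri) for name, uri in GAZETTEER.items() if name in text)

print(annotate("A news story filed from Chicago."))
```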
75
From Text Documents to Linked Data
• Publishing these annotations together with the documents
  – increases the discoverability of the documents
  – enables applications to use the referenced Linked Data sources as background knowledge to display complementary information on web pages
  – or to enhance information retrieval tasks, for instance by offering faceted browsing instead of simple full-text search
• Applications like this will be presented in the next lecture(s)
76
Summary
• Linked Data is a way of publishing and interlinking structured data on the web
• Linked Data principles to follow to create such data
• How to find existing datasets: Web of Data
• How to query existing datasets: SPARQL
• Possible software architecture patterns
77
Next Lecture
• Consuming Linked Data
  – Linked Data Applications
    • What datasets they use from the Web of Data
    • What software architecture they follow
  – Benefits
    • Integration (for organisations)
    • Browsing and interaction (for users)
78
References
• Tom Heath, An Introduction to Linked Data, Linked Data Tutorial, Austin, Texas, 2009.
• Raimond et al., A skim-read introduction to linked data
• Tom Heath, Christian Bizer: Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web, Morgan & Claypool Publishers 2011
• Cambridge Semantics, SPARQL by example
79
TED talk from Tim Berners Lee on Linked Data
• http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html