Samad Paydar
WTLab Research Group
Ferdowsi University of Mashhad
An Introduction to Linked Data,
Its Applications and Challenges
2nd October 2009
Outline
The Web of Documents vs. the Web of Data
Linked Data
Linking Open Data Project
Linked Data Technology Stack
Linking Data Applications
Outlook
Similar Developments
Challenges
The Web of Documents vs. the Web of Data
The Web of Documents
Traditional Web, Hypertext Web
Analogy: a global filesystem
Designed for: human consumption
Primary objects: documents
Links: untyped, between documents (or parts of documents)
Degree of structure in objects: fairly low
Semantics of content and links: implicit
The Web of Documents : Challenges
The Web has radically altered the way people share knowledge
  By lowering the barrier to publishing and accessing documents
But the same is not yet true for applications and data
Traditionally, data on the Web is published in formats like
  HTML tables, CSV or XML files, …
  Much of the structure and semantics of the data is sacrificed.
Data integration
  “Show me all the publications from Semantic Web-related conferences in 2007”
Querying across data sources
  “Which WWW2008 papers have been written by people from companies of fewer than 100 people?”
Note that all the data required to answer the above questions might be available on the Web.
The Web of Data
Analogy: a global data space
Designed for: machines first, humans later
Primary objects: things (descriptions of things)
Links: typed, between things
Degree of structure in objects: high
Semantics of content and links: explicit
The Web of Linked Data
Linked Data
Is about using the Web to create typed links between data from different sources
Refers to data published on the Web in such a way that
  It is machine-readable
  Its meaning is explicitly defined
  It is linked to other datasets
  It can be linked to from external datasets
Linked Data and Web of Data
The term Linked Data refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions - the Web of Data.
Properties of the Web of Data
It is generic
  Can contain any type of data
  Data about anything
Anyone can publish data
No constraints on choice of vocabularies
Entities are connected by RDF links
A Taste of Linked Data
Linking Open Data Project
LOD Project
Linking Open Data Project
A community project
Founded in January 2007
Supported by the W3C Semantic Web Education and Outreach Group
Goal: to bootstrap the Web of Data by identifying existing datasets that are available under open licenses, converting them to RDF (according to the Linked Data principles), interlinking them with other datasets, and publishing them on the Web
LOD Cloud : May 2007
LOD Cloud
The image shows only datasets that are published according to the Linked Data principles and are interlinked with at least one other dataset in the cloud
Each circle represents a dataset
The size of a circle corresponds to the number of triples
Arrows represent the links between datasets
The thickness of an arrow indicates the number of links between datasets
Some datasets act as hubs
  E.g. DBpedia, Geonames, …
DBpedia
Extracts structured information from Wikipedia and makes it available on the Web under an open license
Geonames
Contains over eight million geographical names
6.5 million unique features
2.2 million populated places and 1.8 million alternate names
Features categorized into one of nine feature classes
Further subcategorized into one of 645 feature codes
LOD Cloud : July 2007
LOD Cloud : August 2007
LOD Cloud : November 2007
LOD Cloud : February 2008
LOD Cloud : March 2009
LOD Cloud : July 2009
LOD Cloud
The content of the cloud is diverse
  Data about geographic locations, people, companies, books, scientific publications, films, music, TV programs, genes, proteins, …
Some statistics
  The Web of Data currently consists of 4.7 billion RDF triples, interlinked by around 142 million RDF links (May 2009)
A Programmer’s Point of View
Semantic technologies like Linked Data decouple applications from data through the use of a simple, abstract data model
Any application that understands the model can consume any data source published according to the model
Don’t Miss Books
To really get a feel for it, I recommend studying
Linked Data Technology Stack
Linked Data Principles
Berners-Lee, 2006
1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names
3. When someone looks up a URI, provide useful information
4. Include links to other URIs, so that they can discover more things
URI: Uniform Resource Identifier
“URI provides a simple and extensible means for identifying a resource” RFC 3986
URL: for documents and other entities that can be located on the Web
URI is a more generic means to identify any entity existing in the world
HTTP
Provides URI dereferencing: a simple mechanism for retrieving
Resources that can be serialized as a stream of bytes
  E.g. a picture of a dog
Descriptions of entities that cannot themselves be sent across the network
  E.g. the dog itself
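As a rough illustration of dereferencing, here is a minimal Python sketch that prepares an HTTP lookup for a URI taken from these slides. The use of an Accept header asking for RDF/XML is an assumption about how a client might request a description rather than a document; the function names are hypothetical, and the sketch only builds the request unless dereference() is called explicitly.

```python
from urllib.request import Request, urlopen

# URI taken from the RDF link example in these slides.
FILM_URI = "http://dbpedia.org/resource/Pulp_Fiction_%28film%29"


def build_dereference_request(uri: str, accept: str = "application/rdf+xml") -> Request:
    """Prepare an HTTP GET for a URI (hypothetical helper).

    The Accept header is an assumption: it asks the server for an RDF
    description of the thing instead of an HTML page about it.
    """
    return Request(uri, headers={"Accept": accept})


def dereference(uri: str, accept: str = "application/rdf+xml") -> bytes:
    """Perform the lookup. Needs network access and a server that honours the header."""
    with urlopen(build_dereference_request(uri, accept)) as response:
        return response.read()


# Only the request is built here; no network traffic happens at import time.
request = build_dereference_request(FILM_URI)
```

Calling dereference(FILM_URI) would return whatever bytes the server chooses to send; the content of that response is not something this sketch can promise.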
RDF
HTML provides a means to structure and link documents
RDF provides a generic, graph-based data model to structure and link data that describes things
A triple [subject, predicate, object]
  Subject: a URI
  Object: a URI or a string literal
  Predicate: a URI
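To make the triple model concrete, the following Python sketch stores a tiny graph as a set of (subject, predicate, object) tuples and looks up values in it. It is only an illustration: the Triple type and the objects() helper are hypothetical names, literals are kept as plain strings, and the URIs are the ones that appear elsewhere in these slides.

```python
from typing import NamedTuple, Set

# Identifiers that appear elsewhere in these slides.
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
FOAF_NICK = "http://xmlns.com/foaf/0.1/nick"
TOBY = "http://kiwitobes.com/toby.rdf#ts"


class Triple(NamedTuple):
    """One RDF statement (hypothetical helper type for this sketch)."""
    subject: str    # a URI
    predicate: str  # a URI
    obj: str        # a URI or a string literal


# A graph is simply a set of triples.
graph: Set[Triple] = {
    Triple(TOBY, FOAF_NICK, "kiwitobes"),
    Triple(
        "http://data.linkedmdb.org/resource/film/77",
        OWL_SAME_AS,
        "http://dbpedia.org/resource/Pulp_Fiction_%28film%29",
    ),
}


def objects(graph: Set[Triple], subject: str, predicate: str) -> Set[str]:
    """Return every object stated for the given subject and predicate."""
    return {t.obj for t in graph if t.subject == subject and t.predicate == predicate}


nicknames = objects(graph, TOBY, FOAF_NICK)
```

Because every source uses the same three-part shape, the same objects() function works on data from any publisher, which is the decoupling the model is meant to provide.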
RDF Link
RDF links take the form of RDF triples, where the subject of the triple is a URI reference in the namespace of one dataset, while the object of the triple is a URI reference in the other
  S: http://data.linkedmdb.org/resource/film/77
  P: http://www.w3.org/2002/07/owl#sameAs
  O: http://dbpedia.org/resource/Pulp_Fiction_%28film%29
RDF links allow client applications to navigate between data sources to discover additional data
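The definition of an RDF link can be sketched as a small Python check. This is a simplification under a stated assumption: the "namespace of a dataset" is approximated by the host part of the URI, which is not how every dataset is delimited. The function names are hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse


def host_of(term: str) -> Optional[str]:
    """Return the host of an HTTP(S) URI, or None for literals and other terms."""
    parsed = urlparse(term)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return parsed.netloc
    return None


def is_rdf_link(subject: str, obj: str) -> bool:
    """True when subject and object are URIs hosted by different sites.

    Assumption: one host stands for one dataset namespace.
    """
    subject_host, object_host = host_of(subject), host_of(obj)
    return (
        subject_host is not None
        and object_host is not None
        and subject_host != object_host
    )


# The example triple from the slide links LinkedMDB to DBpedia.
crosses_datasets = is_rdf_link(
    "http://data.linkedmdb.org/resource/film/77",
    "http://dbpedia.org/resource/Pulp_Fiction_%28film%29",
)
# A triple whose object is a literal is not an RDF link.
literal_case = is_rdf_link("http://kiwitobes.com/toby.rdf#ts", "kiwitobes")
```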
RDFS / OWL
Provide a basis for creating vocabularies that can be used to describe entities in the world and how they are related
Linked Data
Linked Data employs
  HTTP URIs to identify resources
  The HTTP protocol to retrieve resources
  The RDF data model to represent resources
Therefore, it is built on the general architecture of the Web
Linking Data Applications
Current Applications
Numerous efforts are underway to research and build applications that exploit this Web of Data. At present, these efforts can be broadly classified into three categories:
1. Linked Data browsers
2. Linked Data search engines and indexes
3. Domain-specific Linked Data applications
Linked Data Applications
Linked Data Browsers
Browse things, not just documents
Browse and navigate between data
E.g. Disco, Tabulator, Marbles
Data about Berlin on DBpedia is linked to data about Berlin on Geonames
Linked Data Search Engines and Indexes
Crawl Linked Data from the Web and provide query capabilities over aggregated data
Human-oriented
  E.g. Falcons, SWSE
Application-oriented
  E.g. Swoogle, Watson
Domain-Specific Applications
Revyu
DBpedia Mobile
Talis Aspire
BBC Programmes and BBC Music
Revyu
DBpedia Mobile
Uses DBpedia, Revyu, and Flickr
Outlook
Future Queries
Which European city has the greatest concentration of works by Caravaggio?
  and has direct flights from my home town?
  with an airline that is rated good or excellent?
  by me? ...by my friends?
Whereabouts near my home can I see buildings by architects who were influenced by the Bauhaus?
  On a Monday?
  and with a student discount?
Similar Developments
RDFa
A common serialization format for Linked Data is RDF/XML
Alternatively, Linked Data can also be serialized as RDFa
  A way of annotating XHTML Web pages with RDF data
  Idea: to publish content once, mixing the human-readable and machine-readable content together
RDFa
<body>
....
Toby's nickname is: kiwitobes
...
</body>
Simple HTML: human-readable
RDFa
<body>
....
Toby's nickname is:
<span xmlns:foaf="http://xmlns.com/foaf/0.1/"
about="http://kiwitobes.com/toby.rdf#ts"
property="http://xmlns.com/foaf/0.1/nick">
kiwitobes
</span>
...
</body>
RDFa embedded in HTML: both human-readable and machine-readable
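To show how a machine might read the annotated span, here is a minimal Python sketch built on the standard html.parser module. It is not a conforming RDFa processor: it only handles an element that carries both about and property attributes, ignores prefixes and nesting, and the class name RdfaSpanParser is hypothetical.

```python
from html.parser import HTMLParser

PAGE = """<body>
Toby's nickname is:
<span xmlns:foaf="http://xmlns.com/foaf/0.1/"
      about="http://kiwitobes.com/toby.rdf#ts"
      property="http://xmlns.com/foaf/0.1/nick">
  kiwitobes
</span>
</body>"""


class RdfaSpanParser(HTMLParser):
    """Collect (subject, predicate, literal) from elements with about + property."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._open = None   # (tag, about, property) of the element being read
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "about" in attributes and "property" in attributes:
            self._open = (tag, attributes["about"], attributes["property"])
            self._text = []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._open is not None and tag == self._open[0]:
            _, about, prop = self._open
            self.triples.append((about, prop, "".join(self._text).strip()))
            self._open = None


parser = RdfaSpanParser()
parser.feed(PAGE)
parser.close()
triples = parser.triples
```

The same page that a person reads as "Toby's nickname is: kiwitobes" yields one triple with the FOAF nick property as predicate.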
Microformats
Aim at extending the traditional Web with structured data
Define a set of simple data formats that are embedded into HTML via class attributes
Differences with Linked Data
  Linked Data is not limited in its vocabularies; vocabulary development is completely open
  Microformats are restricted to a set of vocabularies developed by a specific community
Microformats
Data items that are included in HTML pages via Microformats do not have their own identifier. This makes it impossible to assert relationships between data items or to connect data items across pages and sites. By using URIs as global identifiers and RDF to represent relationships, Linked Data does not have these limitations.
Microformats
…
<div> Toby Segaran </div>
<div> Organization: The Semantic Programmers </div>
<div> Tel: 919-555-1234 </div>
…
Simple HTML: human-readable
Microformats : hCard
…
<div class="vcard">
<div class="fn">Toby Segaran</div>
<div class="org">The Semantic Programmers</div>
<div class="tel">919-555-1234</div>
</div>
…
Semantics embedded in HTML: both human-readable and machine-readable
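As a rough sketch of how software could read the hCard markup, the following Python code collects the fn, org, and tel fields by their class attribute. It assumes the flat structure shown on the slide and is not a full Microformats parser; HCardParser and HCARD_FIELDS are hypothetical names.

```python
from html.parser import HTMLParser

# Only the three hCard class names used on the slide are handled.
HCARD_FIELDS = {"fn", "org", "tel"}

PAGE = """<div class="vcard">
<div class="fn">Toby Segaran</div>
<div class="org">The Semantic Programmers</div>
<div class="tel">919-555-1234</div>
</div>"""


class HCardParser(HTMLParser):
    """Collect field values from elements whose class is a known hCard field."""

    def __init__(self):
        super().__init__()
        self.card = {}
        self._field = None  # hCard field of the element currently open, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        matches = [name for name in classes if name in HCARD_FIELDS]
        self._field = matches[0] if matches else None

    def handle_data(self, data):
        if self._field is not None and data.strip():
            self.card[self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None


parser = HCardParser()
parser.feed(PAGE)
parser.close()
card = parser.card
```

Note that the result is a plain record with no identifier of its own: nothing in it could serve as the subject or object of a link from another site.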
Web APIs
Many major Web data sources such as Amazon, eBay, Yahoo!, and Google provide access to their data via Web APIs.
Web APIs are accessed using a wide range of different mechanisms, and data retrieved from these APIs is represented using various content formats.
  E.g. they return JSON, XML, RDF, …
Most Web APIs do not assign globally unique identifiers to data items. Therefore it is not possible to set links between items in different data sources in order to connect data into a global data space.
Web APIs slice the Web into Walled Gardens
Semantic Web
“The first step is putting data on the Web in a form that machines can naturally understand, or converting it to that form. This creates what I call a Semantic Web – a web of data that can be processed directly or indirectly by machines” TBL, 2000
The Semantic Web, or Web of Data, is the goal or end result of this process; Linked Data provides the means to reach that goal.
Over time, with Linked Data as a foundation, some of the more sophisticated proposals associated with the Semantic Web vision, such as intelligent agents, may become a reality.
Challenges
HCI-related issues
Application architectures for Linked Data access
  Crawling and caching issues
    Search engines
  On-the-fly link traversal
  Federated querying
Link management
  Link discovery, automatic link generation
    Linking algorithms
      String matching (lexical distance between labels)
      Common key matching (ISBN, MusicBrainz IDs)
      Property-based matching
  Link validation, link maintenance
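Two of the linking approaches listed above, string matching and common key matching, can be sketched in Python as follows. This is an illustration under assumptions, not a reference linking algorithm: similarity is measured with difflib's SequenceMatcher rather than a specific lexical distance, the 0.9 threshold is arbitrary, and all example.org / example.com URIs and key values are made up.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
Link = Tuple[str, str, str]


def label_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] between two labels."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_by_label(source: Dict[str, str], target: Dict[str, str],
                  threshold: float = 0.9) -> List[Link]:
    """String matching: propose a link when two labels are similar enough."""
    links = []
    for s_uri, s_label in source.items():
        for t_uri, t_label in target.items():
            if label_similarity(s_label, t_label) >= threshold:
                links.append((s_uri, OWL_SAME_AS, t_uri))
    return links


def link_by_key(source: Dict[str, str], target: Dict[str, str]) -> List[Link]:
    """Common key matching: propose a link when two items share a key such as an ISBN."""
    uri_by_key = {key: uri for uri, key in target.items()}
    return [(s_uri, OWL_SAME_AS, uri_by_key[key])
            for s_uri, key in source.items() if key in uri_by_key]


# Hypothetical datasets: URI -> label.
places_a = {"http://example.org/a/berlin": "Berlin"}
places_b = {"http://example.com/b/1": "berlin", "http://example.com/b/2": "Bern"}
label_links = link_by_label(places_a, places_b)

# Hypothetical datasets: URI -> shared key (placeholder values, not real ISBNs).
books_a = {"http://example.org/a/book1": "key-0001"}
books_b = {"http://example.com/b/x": "key-0001", "http://example.com/b/y": "key-0002"}
key_links = link_by_key(books_a, books_b)
```

Key matching is exact and cheap, while label matching depends heavily on the threshold chosen; generated links would still need the validation and maintenance steps listed above.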
Schema mapping and data fusion
Today, most Linked Data applications display data from different sources alongside each other but do little to integrate it further. Doing so requires mapping terms from different vocabularies to the application's target schema, as well as fusing data about the same entity from different sources by resolving data conflicts.
Data sources can publish correspondences between their local terminology and the terminology of related data sources on the Web of Data. RDFS and OWL define basic terms such as owl:equivalentClass, owl:equivalentProperty, rdfs:subClassOf, and rdfs:subPropertyOf that can be used to publish basic correspondences.
Languages are required to specify more fine-grained schema mappings
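The idea of publishing correspondences and using them to map data onto a target schema can be sketched in Python. This is a minimal illustration that only follows direct owl:equivalentProperty statements (no chains, no class mappings); the example.org vocabulary and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]

OWL_EQUIVALENT_PROPERTY = "http://www.w3.org/2002/07/owl#equivalentProperty"
FOAF_NICK = "http://xmlns.com/foaf/0.1/nick"
LOCAL_NICKNAME = "http://example.org/vocab#nickname"  # hypothetical local term
TOBY = "http://kiwitobes.com/toby.rdf#ts"


def build_property_map(correspondences: Iterable[Triple],
                       target_properties: Set[str]) -> Dict[str, str]:
    """Map source properties to target-schema properties.

    Only direct owl:equivalentProperty statements are used, in either direction.
    """
    mapping = {}
    for subject, predicate, obj in correspondences:
        if predicate != OWL_EQUIVALENT_PROPERTY:
            continue
        if obj in target_properties:
            mapping[subject] = obj
        elif subject in target_properties:
            mapping[obj] = subject
    return mapping


def translate(triples: Iterable[Triple], mapping: Dict[str, str]) -> List[Triple]:
    """Rewrite predicates into the target schema; unmapped predicates pass through."""
    return [(s, mapping.get(p, p), o) for s, p, o in triples]


# A data source states that its local term matches foaf:nick.
published = [(LOCAL_NICKNAME, OWL_EQUIVALENT_PROPERTY, FOAF_NICK)]
mapping = build_property_map(published, target_properties={FOAF_NICK})

source_data = [(TOBY, LOCAL_NICKNAME, "kiwitobes")]
translated = translate(source_data, mapping)
```

Anything beyond such one-to-one renaming, for example splitting a full name into given and family name, is where the more fine-grained mapping languages mentioned above would be needed.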
Data fusion
  The process of integrating multiple data items representing the same real-world object into a single, consistent, and clean representation.
  Conflict resolution challenge
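One simple way to approach conflict resolution is sketched below in Python. The policy used here, majority vote with ties going to the earliest source, is an assumption chosen for illustration and only one of many possible strategies; the property names and values are made up and are not real statistics.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, List


def fuse(descriptions: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge several descriptions of the same entity into one record.

    Conflict resolution (an assumed policy): keep the most frequent value
    for each property; on a tie, keep the value seen first.
    """
    values = defaultdict(list)
    for description in descriptions:
        for prop, value in description.items():
            values[prop].append(value)

    fused = {}
    for prop, candidates in values.items():
        counts = Counter(candidates)
        best = max(counts.values())
        fused[prop] = next(v for v in candidates if counts[v] == best)
    return fused


# Three hypothetical sources describing the same city (values are invented).
source_1 = {"name": "Berlin", "population": 100}
source_2 = {"name": "Berlin", "population": 120}
source_3 = {"population": 120, "country": "Germany"}

fused = fuse([source_1, source_2, source_3])
```

In practice the choice of policy matters: a trusted source might override a majority, or the most recent value might be preferred, which ties data fusion to the trust and quality questions raised elsewhere in these slides.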
Licensing
  Applications that consume data from the Web must be able to access explicit specifications of the terms under which data can be reused and republished.
Trust, quality and relevance
  To ensure that the data most relevant or appropriate to the user's needs is identified and made available
  Content-, context-, and rating-based techniques can be used to heuristically assess the relevance, quality and trustworthiness of data
Equivalents to the PageRank algorithm will likely be important in determining coarse-grained measures of the popularity or significance of a particular data source, as a proxy for relevance or quality of the data. However, such algorithms will need to be adapted to the linkage patterns that emerge on the Web of Data.
Berners-Lee (1997) proposed that browser interfaces should be enhanced with an “Oh, yeah?” button to support the user in assessing the reliability of information encountered on the Web. Whenever a user encounters a piece of information that they would like to verify, pressing such a button would produce an explanation of the trustworthiness of the displayed information.
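To give a feel for the PageRank-style popularity measure mentioned above, here is a small Python sketch of the classic iterative computation applied to links between datasets. It is the plain textbook formulation, not an adaptation to Web of Data linkage patterns; the damping factor, iteration count, and the four-node link graph are all assumptions made for illustration.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Iteratively compute a popularity score for each node.

    links maps a dataset name to the datasets it links to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node without outgoing links spreads its rank evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three datasets link to one hub, which links back to one.
dataset_links = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
scores = pagerank(dataset_links)
most_linked = max(scores, key=scores.get)
```

In this toy graph the heavily linked dataset ends up with the highest score, mirroring the role of hub datasets in the LOD cloud; whether such a score is a good proxy for data quality is exactly the open question raised in the slide.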
Privacy protection
  One problematic area is the opportunity to violate privacy that arises from integrating data from distinct sources.
LDOW Workshops
Linked Data on the Web
  The goal of the LDOW workshop is to provide a forum for the Linked Data community
  2008, 2009