scalable integration and processing of linked data · rdf recommendation in 1999, update in 2004...

65
Tutorial at ISWC 2011, http://sild.cs.vu.nl/ Scalable Integration and Processing of Linked Data Andreas Harth, Aidan Hogan, Spyros Kotoulas, Jacopo Urbani

Upload: others

Post on 22-Oct-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

  • Tutorial at ISWC 2011, http://sild.cs.vu.nl/

    Scalable Integration and Processing of Linked Data Andreas Harth, Aidan Hogan, Spyros Kotoulas, Jacopo Urbani

  • 2

    Outline

    Session 1: Introduction to Linked Data Foundations and Architectures Accessing Linked Data Introduction to SPARQL

    Session 2: Integrating Web Data with Reasoning Introduction to RDFS/OWL on the Web Introduction and Motivation for Reasoning

    Session 3: Distributed Reasoning: Because Size Matters Problems and Challenges MapReduce and WebPIE

    Session 4: Putting Things Together (Demo) The LarKC Platform Implementing a LarKC Workflow

    24.10.2011

  • 3

    INTRODUCTION

    24.10.2011

  • 4

    Motivation

    With increased use of computers more and more data is being stored Organisations rely on data for business decisions Data drives policy decisions in government Individuals rely on data from the Web for information and communication

    Data volumes explode More and more data available on the Web is represented in Semantic Web standards Linking Open Data (LOD) initiative

    Semantic Web technologies facilitate the integration of data from multiple sources Combining data from multiple sources enables insights

    24.10.2011

  • 5

    Linked Data Now!

    24.10.2011

    http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html

  • 6

    Linked Data on the Web

    24.10.2011

    2007-10

  • 7

    Linked Data on the Web

    24.10.2011

    2007-11

  • 8

    Linked Data on the Web

    24.10.2011

    2008-02

  • 9

    Linked Data on the Web

    24.10.2011

    2008-03

  • 10

    Linked Data on the Web

    24.10.2011

    2008-09

  • 11

    Linked Data on the Web

    24.10.2011

    2009-03

  • 12

    Linked Data on the Web

    24.10.2011

    2009-07

  • 13

    Linked Data on the Web

    24.10.2011

    2010-09

  • 14

    Linked Data on the Web

    24.10.2011

    2011-09

  • 15

    Types of Data in the Linking Open Data Cloud

    24.10.2011

    http://www4.wiwiss.fu-berlin.de/lodcloud/state/ (Sept 2010)

  • 16

    Scenario Overview

    Semantic Technologies facilitate access to data Q: data about Berlin? Q: famous people that died in Berlin? Q: data about Hegel? Q: Hegel’s publications? Q: data about Marlene Dietrich? Q: Dietrich’s songs?

    24.10.2011

    1. Query

    2. Answer

    ? !

  • 17

    DBpedia

    Linked Data version of Wikipedia Scripts that extract data (text, links, infoboxes) from Wikipedia Published as Linked Data Interlinking hub in the Linked Data web Berlin

    http://dbpedia.org/resource/Berlin Hegel

    http://dbpedia.org/resource/Georg_Wilhelm_Friedrich_Hegel Marlene Dietrich

    http://dbpedia.org/resource/Marlene_Dietrich

    24.10.2011

  • 18

    BBC Music

    Data about BBC (radio) programmes, artists, songs… Combination of BBC-internal data (playlists), MusicBrainz (artists, albums), Wikipedia (artists) Underpinning the BBC Music website Data published according to Linked Data principles Marlene Dietrich

    http://www.bbc.co.uk/music/artists/191cba6a-b83f-49ca-883c-02b20c7a9dd5

    24.10.2011

  • 19

    Virtual International Authority File (VIAF)

    Joint project of national libraries and related organisations 21 institutions, among them the Deutsche Nationalbibliothek

    Provide access to “authority files” Matching and interlinking collections from participating institutions Hegel

    http://viaf.org/viaf/89774942 Marlene Dietrich

    http://viaf.org/viaf/97773925

    24.10.2011

  • 20

    LINKED DATA PRINCIPLES

    24.10.2011

  • 21

    Semantic Technologies

    Semantic Web technologies, standardised by the W3C, are mature:

    RDF recommendation in 1999, update in 2004 RDFa (RDF in HTML) note in 2008 RDFS recommendation in 2004 SPARQL recommendation in 2008 OWL recommendation in 2004, update in 2009

    Linked Data is a subset of the Semantic Web stack, including web architecture:

    IRI (IETF RFC 3987, 2005) HTTP (IETF RFC 2616, 1999)

    24.10.2011

  • 22

    Linked Data Principles

    1. Use URIs as names for things 2. Use HTTP URIs so that people can look up those names. 3. When someone looks up a URI, provide useful information, using the

    standards (RDF*, SPARQL) 4. Include links to other URIs. so that they can discover more things.

    24.10.2011

    http://www.w3.org/DesignIssues/LinkedData

  • 23

    1. Use URIs as Names for Things

    Use a unique identifier to denote things URIs are defined in RFC 2396 Hegel, Georg Wilhelm Friedrich

    http://dbpedia.org/resource/Georg_Wilhelm_Friedrich_Hegel http://viaf.org/viaf/89774942 …

    Hegel, Georg Wilhelm Friedrich: Gesammelte Werke / Vorlesungen über die Logik

    urn:isbn:978-3-7873-1964-0

    24.10.2011

  • 24

    Names for Things

    24.10.2011

  • 25

    2. Use HTTP URIs

    Enables “lookup” of URIs Via Hypertext Transfer Protocol (HTTP) Piggy-backs on hierarchical Domain Name System to guarantee uniqueness of identifiers Uses established HTTP infrastructure Connects logical level (thing) with physical level (source) Important: distinction between name/“thing URI” and location/“source URI” („other resource“/„non-information resource“ vs. „information resource“)

    24.10.2011

  • 26

    Information Resources vs. Other Resources

    24.10.2011

    Name? Creator? Birth date? Last change date? License? Copyright? …

    Marlene Dietrich, the person

    File containing data about Marlene Dietrich

  • 27

    Correspondence between thing-URI and source-URI („hash URIs“)

    24.10.2011

    User Agent

    Web Server

    HTTP GET

    RDF

    http://www.bbc.co.uk/music/artists/191cba6a-b83f-49ca-883c-02b20c7a9dd5#artist

    http://www.bbc.co.uk/music/artists/191cba6a-b83f-49ca-883c-02b20c7a9dd5.rdf

  • 28

    Hypertext Transfer Protocol (HTTP)

    $ curl -H "Accept: application/rdf+xml" -v http://www.w3.org/People/Berners-Lee/card#i

    > GET /People/Berners-Lee/card HTTP/1.1 > User-Agent: curl/7.21.0 > Host: www.w3.org > Accept: application/rdf+xml < HTTP/1.1 200 OK < Date: Mon, 28 Mar 2011 17:16:30 GMT < Server: Apache/2 < Content-Location: card.rdf < Last-Modified: Wed, 29 Sep 2010 15:39:28 GMT < Content-Type: application/rdf+xml; qs=0.9 < Connection: close

    24.10.2011

    REQ

    UES

    T R

    ESPO

    NSE

    http://www.w3.org/People/Berners-Lee/card�

  • 29

    Correspondence between thing-URI and source-URI („slash URIs“)

    24.10.2011

    User Agent

    Web Server

    http://dbpedia.org/resource/Marlene_Dietrich

    http://dbpedia.org/data/Marlene_Dietrich

    HTTP GET

    303 HTTP GET

    RDF

    http://dbpedia.org/page/Marlene_Dietrich

  • 30

    3. Provide Useful Information

    When somebody looks up a URI, return data using the standards (RDF*, SPARQL) Resource Description Framework, a format for encoding graph-structured data (with URIs to identify nodes/vertices and links/edges)

    24.10.2011

  • 31

    Resource Description Framework

    Directed, labeled graph triple(subject, predicate, object)

    subject: URI (or blank node) predicate: URI object: URI (or blank node) or RDF literal (string, integer, date…)

    RDF/XML is the most widely deployed serialisation Other serialisations possible (N-Triples, Turtle, Notation3…) Quadruples (or quads) used as internal representation when integrating data quad(subject, predicate, object, context)

    context: URI (used to store origin of triple)

    24.10.2011

  • 32

    Merging Data with RDF

    24.10.2011

    +

    =

  • 33

    4. Link to Other URIs

    Enable people (and machines) to jump from server to server External links vs. internal links (for any predicate) Using external vocabularies enables linking Vocabularies might be interlinked, too Special owl:sameAs links to denote equivalence of identifiers (useful for data merging)

    24.10.2011

  • 34

    Equivalences via owl:sameAs

    http://viaf.org/viaf/89774942 http://dbpedia.org/resource/Georg_Wilhelm_Friedrich_Hegel http://www.idref.fr/026917467/id http://libris.kb.se/resource/auth/190350 http://d-nb.info/gnd/118547739

    http://www.bbc.co.uk/music/artists/191cba6a-b83f-49ca-883c-02b20c7a9dd5#artist http://dbpedia.org/resource/Marlene_Dietrich

    http://viaf.org/viaf/97773925 http://dbpedia.org/resource/Marlene_Dietrich . http://d-nb.info/gnd/118525565 http://libris.kb.se/resource/auth/238817 http://www.idref.fr/027561844/id

    http://dbpedia.org/resource/Berlin http://mpii.de/yago/resource/Berlin http://data.nytimes.com/N50987186835223032381 - Berlin (Germany) http://www4.wiwiss.fu-berlin.de/flickrwrappr/photos/Berlin http://data.nytimes.com/16057429728088573361 - Gaspe Peninsula (Quebec) (?) 24.10.2011

  • 35

    Benefits of Linked Data

    Explicit, simple data representation Common data representation (Resource Description Framework, RDF) hides underlying technologies and systems

    Distributed System Decentralised distributed ownership and control facilitates adoption and scalability

    Cross-referencing Allows for linking and referencing of existing data, via reuse of URIs

    Loose coupling with common language layer Large scale systems require loose coupling, via HTTP as common access protocol

    Ease of publishing and consumption Simple and easy-to-use systems and technologies to facilitate uptake

    Incremental data integration Start with merged RDF graphs and provide mappings as you go

    24.10.2011

  • 36

    Challenges

    Ramp-up cost for data conversion May be alleviated by semi-automatic mappings and adequate tool support for manual conversion

    Integrated data may be messy at first But can be refined as need arises

    Distributed creation and loose coordination may result in inconsistencies

    Can be detected, diagnosed, and fixed with appropriate tools

    24.10.2011

  • 37

    The Pedantic Web Group

    Get the community to contact publishers about errors/issues as they arise Get involved: http://pedantic-web.org/ ~200 members! Acknowledgements to: Aidan Hogan, Alex Passant, Me, Antoine Zimmermann, Axel Polleres, Michael Hausenblas, Richard Cyganiak, Stéphane Corlosquet

    24.10.2011

  • 38

    SYSTEM ARCHITECTURES

    24.10.2011

  • 39

    Data Integration System Architecture

    24.10.2011

    ! ?

    Source 1 Source 2 Source n

    Wrapper 1 Wrapper 2 Wrapper n

    Integration

    Wrapper 1

  • 40

    Semantic Web Components

    24.10.2011 40

    ( )

  • 41

    (

    Linked Data: Minimal Components

    24.10.2011 41

    1. Q

    uery

    2. A

    nsw

    er

    ? !

    )

  • 42

    Architecture Styles

    24.10.2011

    1. Q

    uery

    2. A

    nsw

    er

    ? !

    0. Crawl- Index

    ? ! Warehousing/ Crawl-Index-Serve

    Virtual Integration/ Distributed Querying

  • 43

    Basic Application: Entity Browsing

    24.10.2011

    Warehousing/ Crawl-Index-Serve

    Virtual Integration/ Distributed Querying

    SWSE, Falcons, Sindice, Watson, FactForge…

    Tabulator, Disco, Zitgist…

  • 44

    ACCESSING LINKED DATA

    24.10.2011

  • 45

    RDF Graph vs. Information Resource Graph

    24.10.2011

    RDF Graph

    Information Resource Graph

  • 46

    Information Resource Graph Explosion

    Directed graph rooted in http://danbri.org/foaf.rdf Level 0: 1 Level 1: 251 (avg), 1051 (worst) Level 2: 252 (avg), 1052 (worst) Level 3: 253 (avg), 1053 (worst) Level k: nk

    24.10.2011

  • 47

    Linked Data Access Continuum

    24.10.2011

    Materialise Distributed Query Processing

    LDSpider [Isele et al. 2010]

    [Hartig et al. 2009] [Harth et al. 2010] [Ladwig et al. 2010] [Schmedding 2011]

    swget [Fionda et al. 2011]

  • 48

    Querying Data Across Sources

    Data warehousing or materialisation-based approaches (MAT)

    24.10.2011

    CRAWL INDEX SERVE

    Distributed query processing approaches (DQP) over Linked Data sources SELECT ?s WHERE…

    TP TP

    TP TP

    HTTP GET

    HTTP GET

    DQP over RDF stores (not covered)

  • 49

    (Linked Data) Crawling

    24.10.2011

    1. Get URI from a queue 2. Open connection and fetch content 3. Process and store content 4. Extract new links and put into queue 5. At defined intervals: schedule URIs in

    queue

    Image: Carlos Castillo: Effective Web Crawling, via Wikipedia

  • 50

    LDSpider [Isele et al. 2010]

    API to access Linked Data Lightweight implementation in Java GPL at Google Code (http://code.google.com/p/ldspider) Multithreading Including RDF/XML parser Adheres to robots.txt protocol Flexible configuration/application with callbacks and hooks Link type specification to focus on certain parts of the Great Global Graph Heuristics to direct crawler to high quality sites (sort URIs in queue based on inlink count) Heuristics to direct crawler deeper into the web (cutting off no of URIs per pay-level domain)

    24.10.2011

  • 51

    Pay-Level-Domain Starvation

    24.10.2011

    [Hogan et al. 2011]

  • 52

    Swget [Fionda et al. 2011]

    Controlled exploration and retrieval of RDF sources Expanding paths (e.g., foaf:knows.foaf:name) specified via regular expressions User interface to tune parameters and view results More at http://swget.wordpress.com/ (and at the ISWC Poster Session!)

    24.10.2011

  • 53

    Conjunctive Queries

    foaf:knows ?x . ?x foaf:maker ?y . ?y rdfs:label ?z .

    24.10.2011

    Built on triple patterns containing variables (?, ?, ?), (s, ?, ?), (?, p, ?), (?, ?, o), (s, p, ?), (?, p, o), (s, p, o) Variables are bound during query evaluation Query evaluation results in a set of variable bindings

    ?x ?y ?z http://www.w3.org/People/Berners-Lee/card#i

    http://www.w3.org/DesignIssues/LinkedData.html

    Linked Data - Design Issues - W3C

  • 54

    Join Tree

    Traditional query evaluation: bind variables for triple patterns (based on indices), join results Often implemented via iterators with hasNext/getNext operations

    24.10.2011

  • 55

    Challenges for Link Traversal Evaluation

    Difficulty: interweave query processing with resource discovery (plan is known a-priori, sources are not) Dominant cost: network lookup and parsing But: QP has high complexity, may be slow on local storage, even slower over the web (orders of magnitude)

    24.10.2011

  • 56

    Iterator-based Implementation [Hartig et al. 2009]

    Traditional iterator-based query processor Feed URIs to lookup component If hasNext() returns false, feed IRIs into lookup component and defer evaluation of triple pattern Pro: paradigm fits current query processors (e.g., ARQ) Pro: (limited) multithreading via lookup component Con: potentially incomplete due to scheduling of lookups and variable bindings

    24.10.2011

  • 57

    Implementation with Symmetric Hash Joins [Ladwig et al. 2010]

    Operator is own thread Push triples into operator tree, bindings are stored in operator tree Joins carried out using Symmetric Hash Joins Pro: fully multithreaded Pro: avoids incompleteness due to scheduling lookups and deriving bindings Con: memory overhead Con: not all query patterns possible

    24.10.2011

  • 58

    Approximate Data Summaries [Harth et al. 2010]

    Combined description of schema-level and instance-level Use approximation to reduce index size (incurs false positives) Possible to use entire query for source selection Parallel lookups since sources can be determined for the entire query Can answer all triple patterns and combinations of triple patterns via join estimation

    24.10.2011

  • 59

    FUTURE WORK AND SUMMARY

    24.10.2011

  • 60

    Linked Data in the Smart Energy Grid [Wagner et al. 2010]

    Large, decentralised system Difficult to materialise data at central point Will have heavily coordinated parts, but should also be flexible Heavily regulated: privacy and security are of prime concern Resilience needed (no single point of failure)

  • 61

    Linked Data in the Internet of Things

    Services?

    Streams?

  • 62

    Summary

    The Linked Data Web is a large, decentralised and complex system built on simple principles

    identify resource via HTTP URIs provide RDF that links to other URIs upon lookup

    Current trend around Linked Data allows for a re-think of components in Semantic Web Layer Cake Data publishers and consumers coordinate little Web of Data grows rapidly and covers a large variety of domains Algorithms operating over a common access protocol and data model First commercial applications emerging

  • 63

    References I

    [Fionda et al. 2011] Valeria Fionda, Claudio Gutierrez and Giuseppe Pirró. “Semantically-driven recursive navigation and retrieval of data sources in the Web of Data”, Poster at ISWC 2011.

    [Harth et al. 2010] Andreas Harth, Katja Hose, Marcel Karnstedt, Axel Polleres, Kai-Uwe Sattler, Juergen Umbrich. "Data Summaries for On-Demand Queries over Linked Data". WWW 2010.

    [Hogan et al. 2011] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres and Stefan Decker. "Searching and Browsing Linked Data with SWSE: the Semantic Web Search Engine". In the Journal of Web Semantics (to appear).

    [Hartig et al. 2009] Olaf Hartig, Christian Bizer, and Johann-Christoph Freytag. “Executing SPARQL Queries over the Web of Linked Data”. ISWC 2009.

    24.10.2011

  • 64

    References II

    [Isele et al. 2010] Robert Isele and Jürgen Umbrich and Chris Bizer and Andreas Harth. “LDSpider: An open-source crawling framework for the Web of Linked Data”, Poster at ISWC 2010.

    [Ladwig 2010]. Günter Ladwig, Duc Thanh Tran. „Linked Data Query Processing Strategies“. ISWC 2010.

    [Schmedding 2011] Florian Schmedding. “Incremental SPARQL Evaluation for Query Answering on Linked Data”. COLD 2011.

    [Wagner et al. 2010] Andreas Wagner, Sebastian Speiser, Oliver Raabe, Andreas Harth. “Linked Data for a Privacy-aware Smart Grid” GI Jahrestagung 2010.

    24.10.2011

  • 65

    Attribution

    Slides from my SWT-2 lectures, WWW 2010 SILD tutorial and INFORMATIK 2011 tutorial Linking Open Data cloud diagrams, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/ Images of Berlin, Hegel and Dietrich via Wikipedia

    Andreas Harth, Denny Vrandecic - Datenintegration im

    65

    30.11.2012

    Slide Number 1OutlineIntroductionMotivationLinked Data Now!Linked Data on the WebLinked Data on the WebLinked Data on the WebLinked Data on the WebLinked Data on the WebLinked Data on the WebLinked Data on the WebLinked Data on the WebLinked Data on the WebTypes of Data in the Linking Open Data CloudScenario OverviewDBpediaBBC MusicVirtual International Authority File (VIAF)Linked Data PrinciplesSemantic TechnologiesLinked Data Principles1. Use URIs as Names for ThingsNames for Things2. Use HTTP URIsInformation Resources vs. Other ResourcesCorrespondence between thing-URI and source-URI („hash URIs“)Hypertext Transfer Protocol (HTTP)Correspondence between thing-URI and source-URI („slash URIs“)3. Provide Useful InformationResource Description FrameworkMerging Data with RDF4. Link to Other URIsEquivalences via owl:sameAsBenefits of Linked DataChallengesThe Pedantic Web GroupSystem ArchitecturesData Integration System ArchitectureSemantic Web ComponentsLinked Data: Minimal ComponentsArchitecture StylesBasic Application: Entity BrowsingAccessing Linked DataRDF Graph vs. Information Resource GraphInformation Resource Graph ExplosionLinked Data Access ContinuumQuerying Data Across Sources(Linked Data) CrawlingLDSpider [Isele et al. 2010]Pay-Level-Domain StarvationSwget [Fionda et al. 2011]Conjunctive QueriesJoin TreeChallenges for Link Traversal EvaluationIterator-based Implementation [Hartig et al. 2009]Implementation with Symmetric Hash Joins [Ladwig et al. 2010]Approximate Data Summaries [Harth et al. 2010]Future Work and SummaryLinked Data in the Smart Energy Grid [Wagner et al. 2010]Linked Data in the Internet of ThingsSummaryReferences IReferences IIAttribution