scalable extraction ofscalable extraction of implicit and explicit … · 2013-10-02 · scalable...

24
Scalable Extraction of Scalable Extraction of Implicit and Explicit Schema Information in Linked Open Data in Linked Open Data Ansgar Scherp Ansgar Scherp Data and Web Science, U Mannheim Matthias Konrath Thomas Gottron Matthias Konrath, Thomas Gottron Web Science and Technologies, U Koblenz

Upload: others

Post on 24-Feb-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Informationin Linked Open Datain Linked Open DataAnsgar ScherpAnsgar ScherpData and Web Science, U MannheimMatthias Konrath Thomas GottronMatthias Konrath, Thomas GottronWeb Science and Technologies, U Koblenz

Slide 1Ansgar Scherp – [email protected]

Page 2: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

Linked DataW it j t i th W b• We witness a major movement in the Web …

• Publishing and interlinking of data of different quality, purpose and source on the Web

• Technology + Social Phenomenon

World Wide Web Linked DataDocuments DataHyperlinks Typed LinksHTML RDFAddresses (URIs) Addresses (URIs)

Slide 2Ansgar Scherp – [email protected]

( ) ( )

Page 3: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

Relevance of Linked Data?

Slide 3Ansgar Scherp – [email protected]

Page 4: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

Linked Data: May ‘07 Sept. ‘11

Web 2.0

Media

Publications

eGovernment

Cross-Domain

GeographicLife

Sciences

Slide 4Ansgar Scherp – [email protected]< 31 Billion Triples Source: http://lod-cloud.net

Page 5: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

Linked Data: Based on Four Principles

1. Identification1. Identification2. Interlinkage3. Dereferencing4 Description4. Description

Slide 5Ansgar Scherp – [email protected]

Page 6: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

1. Use URIs for Identification

Matt Briggsgg

http://biglynx.co.uk/people/matt-briggs

Scott Miller

people/matt-briggs

http://biglynx.co.uk/people/scott-miller

Slide 6Ansgar Scherp – [email protected]. Gazen,http://www.flickr.com/photos/bayat/, CC-BY

people/scott miller

Page 7: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

2. Interlinking of Ressources

http://biglynx.co.uk/people/matt-briggs

http://biglynx.co.uk/people/scott miller

Slide 7Ansgar Scherp – [email protected]. Gazen,http://www.flickr.com/photos/bayat/, CC-BY

people/scott-miller

Page 8: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

3. Dereferencing of URIs

• Pretty easy to look up web documents

• How do we “look up” things of the real world?

http://biglynx.co.uk/people/matt-briggs

http://biglynx.co.uk/people/matt-briggs.rdf

Slide 8Ansgar Scherp – [email protected]

Page 9: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

4. Description of URIsfoaf:Person

dp:Birminghamrdf:type foaf:based near …

……

biglynx:matt-briggs

yp foaf:based_near …

ex:loc

foaf:knows _1:point

wgs84:wgs84:

longbiglynx:scott-miller

wgs84:lat

g

foaf:based near“-0.118”

dp:London

foaf:based_near“51.509”

Slide 9Ansgar Scherp – [email protected]

……

Page 10: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

Searching for Data Sources

Persons that are- Politicians and - Actors ?

Slide 10Ansgar Scherp – [email protected] Quelle: http://lod-cloud.net< 31 Milliarde Triples

Page 11: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

Searching for Data Sources (2)• No single federated query interface providedNo single federated query interface provided• Schema-based index to find data sources

Index

?Query ?SELECT ?xFROM …WHERE {

Slide 11Ansgar Scherp – [email protected]

WHERE {?x rdf:type ex:Actor .?x rdf:type ex:Politician . }

Page 12: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

IdeaSchema-based indexSc e a based de

Define families of graph patternsAssign instances to graph patternsAssign instances to graph patternsStore the source information (context URI)

ConstructionCo st uct oStream-based for scalabilityLittle loss of accuracyLittle loss of accuracy

Slide 12Ansgar Scherp – [email protected]

Page 13: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

Input Datan-Quadsn-Quads

<subject> <predicate> <object> <context>

Example:Example:<http://www.w3.org/People/Connolly/#me><http://www.w3.org/1999/02/22-rdf-syntax-ns#type>p g y yp<http://xmlns.com/foaf/0.1/Person><http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf>

w3p:

http://dig.csail.mit.edu/2008/webdav/timbl/foaf.rdf

w3p:#me

foaf:Person

Slide 13Ansgar Scherp – [email protected]

Page 14: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

Layer 1: RDF ClassesCAll instances of a C1

DS 3DS 2DS 1

All instances of a particular type

DS 3DS 2DS 1

SELECT ?xFROM

foaf:Person

FROM …WHERE {

?x rdfs:type foaf:Person .}

http://dig csail mit edu/2008/

}

timbl:card#i

foaf:Person

http://www.w3.org/People/Berners-Lee/card

http://dig.csail.mit.edu/2008/...

Slide 14Ansgar Scherp – [email protected]

Page 15: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

Layer 2: Type ClustersC1 C2All instances belonging 1 2All instances belonging

to exactly the same setof types

TC1

DS 3DS 2DS 1SELECT ?xFROM …

of types

foaf:Person

FROM …WHERE {

?x rdfs:type foaf:Person .?x rdfs:type pim:Male

pim:Male

?x rdfs:type pim:Male .}

tc4711

pim:

timbl: foaf: http://www.w3.org/People/Berners-Lee/card

pim:Male

Slide 15Ansgar Scherp – [email protected]

timbl:card#i Person

Page 16: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

Two instances areLayer 3: Equivalence Classes

C C CTwo instances are equivalent iff:

They are in the same TC

C1 C2

TC1

C3

TC2They are in the same TCThey have the same

ti

TC1 TC2

ppropertiesThe property targets are

EQC1

in the same TC DS 3DS 2DS 1

Similar to 1-bisimulation

Slide 16Ansgar Scherp – [email protected]

Similar to 1 bisimulation

Page 17: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

Layer 3: Equivalence ClassesSELECT ?x

foaf:PersonWHERE {

?x rdfs:type foaf:Person .?x rdfs:type pim:Male pim:Male foaf:PPD?x rdfs:type pim:Male .?x foaf:maker ?y .?y rdfs:type

tc4711 tc1234

eqc0815-maker-

foaf:PersonalProfileDocument .}

foaf:Person

pim:Male

foaf:PPD eqc0815

tc1234

foaf:maker

ti blhttp://www.w3.org/People/Berners-Lee/cardtimbl:

d

oa a e

Slide 17Ansgar Scherp – [email protected]

timbl:card#i

p g pcard

Page 18: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

Building the Schema and Index

C2C1 C3 CkRDF

classes…

consistsOf

TC1 TC2 TCType

clustershasEQClass p1 p2

TC1 TC2 TCm clusters…

hasDataSource

EQC1 EQC2 EQCn Equivalenceclasses

Class p1 p2…

DS1 DS2 DS3 DS4 DS5 DSxData

hasDataSource

Slide 18Ansgar Scherp – [email protected]

DS1 DS2 DS3 DS4 DS5 DSx sources

Page 19: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

Building the Index from a StreamStream of n-quads (coming from a LD crawler)Stream of n-quads (coming from a LD crawler)

… Q16, Q15, Q14, Q13, Q12, Q11, Q10, Q9, Q8, Q7, Q6, Q5, Q4, Q3, Q2, Q1

FiFoFiFo

4

3

1

6C3

C2 3

2

12

3

4

C2

C2

15C1

Slide 19Ansgar Scherp – [email protected]

• Linear runtime complexity wrt # of input triples

Page 20: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

Computing SchemEX: TimBL Data Set

• Analysis of a smaller data set• 11 M triples, TimBL’s FOAF profile11 M triples, TimBL s FOAF profile• LDspider with ~ 2k triples / sec

• Different cache sizes: 100, 1k, 10k, 50k, 100k, , , ,• Compared SchemEX with reference schema• Index queries on all Types TCs EQCs• Index queries on all Types, TCs, EQCs• Good precision/recall ratio at 50k+

Slide 20Ansgar Scherp – [email protected]

Page 21: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

Quality of Stream-based Index ConstructionConstruction

• Runtime increases hardly with window size• Memory consumption scales with window size

Slide 21Ansgar Scherp – [email protected]

• Commodity hardware (4GB RAM, single CPU)

Page 22: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

Computing SchemEX: Full BTC 2011 Data

Slide 22Ansgar Scherp – [email protected]

Cache size: 50 k

Page 23: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

Billion Triple Challenge 2011

Slide 23Ansgar Scherp – [email protected]

[JWS’12]

Page 24: Scalable Extraction ofScalable Extraction of Implicit and Explicit … · 2013-10-02 · Scalable Extraction ofScalable Extraction of Implicit and Explicit Schema Information in Linked

Did you mean?

Result Set Size

Ranked Retrieval

Result Snippets

R l t d Q i

Slide 24Ansgar Scherp – [email protected]

Related Queries