aidan's phd viva
Post on 08-May-2015
Copyright 2009 Digital Enterprise Research Institute. All rights reserved.
Digital Enterprise Research Institute www.deri.ie
Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora
Aidan Hogan
PhD Viva
Cold Open
Figure 1: Web of Data
explicit data
implicit data
Topic of thesis:
How can consumers tap into the implicit data?
PRELUDE
The Area…
The Problem…
The Hypothesis…
The Area…
…Linked Data / Linking Open Data
Bottom-up approach to the Semantic Web
Individual publishers should:
1. Use URIs to name things (not just documents)
2. Use HTTP URIs that can be looked up
3. Return information in a common structured data model (RDF)
4. Use external URIs in your data so as to link to related data
…the micro… Linked Data Principles
…the macro… A Web of Data
Images from: http://richard.cyganiak.de/2007/10/lod/; Cyganiak, Jentzsch
September 2010
August 2007
November 2007
February 2008
March 2008
September 2008
March 2009
July 2009
…so what’s The Problem?…
…heterogeneity
Take Query Answering…
SPARQL endpoints over Web data such as YARS2, Virtuoso, FactForge, etc.
Search engines such as SWSE, Sindice, Falcons, Swoogle, Watson, etc.
Take Query Answering…
Gimme webpages relating to Tim Berners-Lee
webpages relating to = foaf:page ; Tim Berners-Lee = timbl:i
timbl:i foaf:page ?pages .
Heterogeneity in terminology…
webpage: properties
foaf:page
foaf:homepage
foaf:isPrimaryTopicOf
foaf:weblog
doap:homepage
foaf:topic
foaf:primaryTopic
mo:musicBrainz
mo:myspace
…
= rdfs:subPropertyOf
= owl:inverseOf
Linked Data, RDFS and OWL: Linked Vocabularies
Image from http://blog.dbtune.org/public/.081005_lod_constellation_m.jpg; Giasson, Bergman
Heterogeneity in naming…
Tim Berners-Lee: URIs
…
timbl:i
dblp:100007
identica:45563
adv:timbl
fb:en.tim_berners-lee
db:Tim-Berners_Lee
= owl:sameAs
Returning to our Query…
Gimme webpages relating to Tim Berners-Lee
timbl:i foaf:page ?pages .
... 7 x 6 = 42 possible patterns
Properties (7): foaf:page, foaf:homepage, foaf:isPrimaryTopicOf, doap:homepage, foaf:topic, foaf:primaryTopic, mo:myspace
URIs (6): timbl:i, dblp:100007, identica:45563, adv:timbl, fb:en.tim_berners-lee, db:Tim-Berners_Lee
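To make the 7 x 6 count concrete, here is a minimal Python sketch that expands the single query pattern into every property/URI combination named on the slides. The function name `expand_patterns` and the choice to flip subject and object for the two inverse-direction properties are illustrative assumptions, not something the slides specify.

```python
from itertools import product

# The seven webpage-related properties and six URIs named on the slides.
PROPERTIES = [
    "foaf:page", "foaf:homepage", "foaf:isPrimaryTopicOf", "doap:homepage",
    "foaf:topic", "foaf:primaryTopic", "mo:myspace",
]
URIS = [
    "timbl:i", "dblp:100007", "identica:45563",
    "adv:timbl", "fb:en.tim_berners-lee", "db:Tim-Berners_Lee",
]
# Assumption: these two point from the page to the person, so the
# variable goes in the subject position instead of the object position.
INVERSE_DIRECTION = {"foaf:topic", "foaf:primaryTopic"}


def expand_patterns(uris, properties):
    """Return one triple pattern per (URI, property) combination."""
    patterns = []
    for uri, prop in product(uris, properties):
        if prop in INVERSE_DIRECTION:
            patterns.append(f"?pages {prop} {uri} .")
        else:
            patterns.append(f"{uri} {prop} ?pages .")
    return patterns


patterns = expand_patterns(URIS, PROPERTIES)
print(len(patterns))
```

A consumer without reasoning or consolidation would have to issue all of these patterns to get the answers that one pattern should return.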
…The Hypothesis?…
…we can use the OWL and RDFS inherent in Linked Data to attenuate the problem of heterogeneity for consumers
Scenario…
…take a static corpus crawled from Linked Data…
…about a billion triples or so…
…and tackle the problem(s) of heterogeneity
…(without domain-specific “cheats”).
Setup…
hardware …9 machines
…~6 years old… 4Gb RAM, 2.2GHz, Ethernet
Setup…
corpus …crawl (9 machines: 52.5 hr)
…took random seed URIs from Billion Triple Challenge 2009 dataset
…crawled ~4 million RDF/XML documents …from arbitrary domains (e.g., dbpedia.org)
– Only found 785 domains providing RDF/XML
…1.118 billion quadruples …947 million unique triples
Setup…
ranking (9 machines: 30.3 hr)
…applied PageRank over interlinked source docs.
– …source A links to source B if A uses a URI which “dereferences” (points) to B
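The source-level ranking can be sketched as follows. This is a plain power-iteration PageRank over a toy document graph; the damping factor of 0.85, the iteration count, the handling of documents without out-links, and all document and URI names are assumptions for illustration, since the slides do not state which variant was run.

```python
def build_links(uses, dereferences):
    """Source A links to source B if A uses a URI that dereferences to B."""
    links = {}
    for doc, uri in uses:
        target = dereferences.get(uri)
        if target is not None and target != doc:
            links.setdefault(doc, set()).add(target)
    return links


def pagerank(links, damping=0.85, iterations=50):
    """Power iteration; rank of documents without out-links is spread evenly."""
    docs = set(links) | {d for targets in links.values() for d in targets}
    n = len(docs)
    rank = {d: 1.0 / n for d in docs}
    for _ in range(iterations):
        dangling = sum(rank[d] for d in docs if not links.get(d))
        new_rank = {d: (1 - damping) / n + damping * dangling / n for d in docs}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical toy input: (document, URI used in that document).
uses = [
    ("ex:docA", "foaf:Person"),
    ("ex:docB", "foaf:knows"),
    ("ex:docB", "ex:a"),
]
dereferences = {
    "foaf:Person": "foaf:spec",
    "foaf:knows": "foaf:spec",
    "ex:a": "ex:docA",
}
ranks = pagerank(build_links(uses, dereferences))
```

In this toy graph the vocabulary document that both data documents point to ends up with the highest rank, which is the effect the later rank-based annotations rely on.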
Challenges…
…what (OWL) reasoning is feasible for Linked Data?
Linked Data Reasoning: Challenges
CORE
1. Reasoning…
2. Annotated Reasoning…
3. Consolidation…
1. Reasoning
High Level Approach…
…apply a subset of OWL 2 RL/RDF rules over the data
Web Reasoning: Forward Chaining!
Forward-chaining materialisation:
Avoid runtime expense of backward-chaining
– Users taught impatience by Google
Pre-compute answers for quick retrieval
Web-scale systems should be scalable!
– More data = more disk-space/machines
Scalable Authoritative OWL Reasoner
Our Approach
Our Approach…
INPUT:
• Flat file of triples (quads)
OUTPUT:
• Flat file of (partial) inferred triples (quads)
Scalable Reasoning: In-mem T-Box
Main optimisation: Store T-Box in memory
T-Box: (loosely) data describing classes and properties.
Aka. schemata/vocabularies/ontologies/terminologies. E.g.,
– foaf:topic owl:inverseOf foaf:page .
– sioc:UserAccount rdfs:subClassOf foaf:OnlineAccount .
Most commonly accessed data for reasoning
Quite small (~0.1% for our Linked Data corpus)
High selectivity (if you prefer)
A-Box: Lots
?s foaf:page ?o .
vs. T-Box: Few
foaf:page ?p ?o . + ?s ?p foaf:page .
Scalable Reasoning: Two Scans
Scan 1: Scan input data, separate T-Box statements, load T-Box statements into memory
Do T-Box level reasoning if required (semi-naïve)
Scan 2: Scan all on-disk data, join with in-memory T-Box.
Scalable Reasoning: No A-Box Joins
IN-MEM T-BOX:
foaf:homepage rdfs:range foaf:Document .
foaf:homepage rdfs:subPropertyOf foaf:page .
foaf:page owl:inverseOf foaf:topic .
ON-DISK A-BOX:
... ex:me foaf:homepage ex:hp . ...
ON-DISK OUTPUT:
... ex:hp rdf:type foaf:Document . ex:me foaf:page ex:hp . ex:hp foaf:topic ex:me . ...
Execution of three rules:
OWL 2 RL rule prp-inv1
?p1 owl:inverseOf ?p2 .
?x ?p1 ?y .
⇒ ?y ?p2 ?x .
OWL 2 RL rule prp-rng
?p rdfs:range ?c .
?x ?p ?y .
⇒ ?y a ?c .
OWL 2 RL rule prp-spo1
?p1 rdfs:subPropertyOf ?p2 .
?x ?p1 ?y .
⇒ ?x ?p2 ?y .
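The two scans and the three rules can be sketched in a few lines of Python. This is a simplified in-memory illustration, assuming triples are plain tuples of prefixed names; the function names `load_tbox`, `infer` and `reason` are hypothetical, and T-Box level reasoning and authoritative analysis are left out.

```python
RDF_TYPE = "rdf:type"


def load_tbox(triples):
    """Scan 1: index the terminological triples needed by the three rules."""
    tbox = {"range": {}, "sub": {}, "inv": {}}
    keys = {"rdfs:range": "range", "rdfs:subPropertyOf": "sub", "owl:inverseOf": "inv"}
    for s, p, o in triples:
        if p in keys:
            tbox[keys[p]].setdefault(s, set()).add(o)
    return tbox


def infer(triple, tbox):
    """Apply prp-rng, prp-spo1 and prp-inv1 to one triple, and then to
    whatever it produces. Each step joins one triple with the in-memory
    T-Box only, so no A-Box join is needed."""
    seen = {triple}
    todo = [triple]
    while todo:
        s, p, o = todo.pop()
        new = [(o, RDF_TYPE, c) for c in tbox["range"].get(p, ())]  # prp-rng
        new += [(s, q, o) for q in tbox["sub"].get(p, ())]          # prp-spo1
        new += [(o, q, s) for q in tbox["inv"].get(p, ())]          # prp-inv1
        for t in new:
            if t not in seen:
                seen.add(t)
                todo.append(t)
    seen.discard(triple)
    return seen


def reason(triples):
    """Two scans over the same data: build the T-Box, then stream the rest."""
    tbox = load_tbox(triples)      # scan 1
    for t in triples:              # scan 2
        yield from infer(t, tbox)


data = [
    ("foaf:homepage", "rdfs:range", "foaf:Document"),
    ("foaf:homepage", "rdfs:subPropertyOf", "foaf:page"),
    ("foaf:page", "owl:inverseOf", "foaf:topic"),
    ("ex:me", "foaf:homepage", "ex:hp"),
]
output = set(reason(data))
```

Because every inference depends on a single A-Box triple plus the T-Box, the second scan can process triples one at a time in any order, which is what makes the approach easy to distribute.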
Scalable Reasoning: A-Box joins?
However: some rules do require A-Box joins
?p a owl:TransitiveProperty . ?x ?p ?y . ?y ?p ?z .
⇒ ?x ?p ?z .
Difficult to engineer a scalable solution (which reaches a fixpoint) for Linked Data(?)
Can lead to quadratic inferences
A lot of useful reasoning still possible without A-Box joins…
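The quadratic blow-up is easy to see on a toy input. The sketch below computes a naive fixpoint of the transitivity rule for a single property over a chain of nodes; the chain data and the function name are made up for illustration.

```python
def transitive_closure(edges):
    """Naive fixpoint of  ?x ?p ?y . ?y ?p ?z . => ?x ?p ?z .
    for one transitive property, with edges given as (x, y) pairs."""
    closure = set(edges)
    while True:
        successors = {}
        for x, y in closure:
            successors.setdefault(x, set()).add(y)
        # Join the A-Box with itself: this is the step a single scan cannot do.
        new = {(x, z) for x, y in closure for z in successors.get(y, ())} - closure
        if not new:
            return closure
        closure |= new


chain = [(i, i + 1) for i in range(100)]   # 100 input triples over 101 nodes
closed = transitive_closure(chain)
print(len(chain), len(closed))             # 100 5050
```

A chain of n + 1 nodes yields n(n + 1)/2 triples at the fixpoint, so 100 input triples become 5050.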
Authoritative Reasoning
Consider source of T-Box (schemata) data
Class/property URIs dereference to their authoritative document
FOAF spec authoritative for foaf:Person ✓
MY spec not authoritative for foaf:Person ✘
Allow “extension” in authoritative documents
my:Person rdfs:subClassOf foaf:Person . (MY spec) ✓
BUT: Reduce obscure memberships
foaf:Person rdfs:subClassOf my:Person . (MY spec) ✘
ALSO: Protect specifications
foaf:knows a owl:SymmetricProperty . (MY spec) ✘
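The three examples above can be checked with a very small sketch. It assumes, as a simplification that happens to fit these three axioms, that an axiom is accepted only if the document serving it is the one that the axiom's subject term dereferences to. The `DEREFERENCES` table and document names are hypothetical stand-ins for real HTTP lookups.

```python
# Hypothetical lookup table standing in for HTTP dereferencing.
DEREFERENCES = {
    "foaf:Person": "FOAF spec",
    "foaf:knows": "FOAF spec",
    "my:Person": "MY spec",
}


def is_authoritative(axiom, source_doc, dereferences=DEREFERENCES):
    """Accept a T-Box axiom only if its source document is authoritative
    for the subject term (a simplification for the slide examples)."""
    subject, _predicate, _object = axiom
    return dereferences.get(subject) == source_doc


examples = [
    (("my:Person", "rdfs:subClassOf", "foaf:Person"), "MY spec"),
    (("foaf:Person", "rdfs:subClassOf", "my:Person"), "MY spec"),
    (("foaf:knows", "rdf:type", "owl:SymmetricProperty"), "MY spec"),
]
verdicts = [is_authoritative(axiom, doc) for axiom, doc in examples]
print(verdicts)   # [True, False, False]
```

The first axiom extends FOAF from the outside and is kept; the other two try to change what FOAF's own terms mean and are ignored.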
Survey of terminology: counts
Looked at use of RDFS and OWL in our corpus
1. rdfs:subClassOf ~307k axioms ~51k docs ✓
2. owl:equivalentClass ~23k axioms ~23k docs ✓
3. rdfs:domain ~16k axioms 623 docs ✓
4. rdfs:range ~14k axioms 717 docs ✓
5. owl:unionOf ~13k axioms 109 docs ✓
6. rdfs:subPropertyOf ~9k axioms 227 docs ✓
7. owl:inverseOf ~1k axioms 98 docs ✓
8. owl:disjointWith 917 axioms 60 docs ✘
9. owl:someValuesFrom 465 axioms 48 docs ✓
10. owl:intersectionOf 325 axioms 12 docs ✓/✘
…
...summary please?
Our “cheap rules” cover 99% of RDFS/OWL axioms in our corpus
82.3% of such axioms have an authoritative version
- 78.3% of all non-authoritative axioms come from one doc
- (without which, ~96% of axioms have auth. version)
9.1% of documents have non-authoritative axioms
Authoritative reasoning for cheap rules fully supports 90.6% of the “vocabulary documents”
Survey of terminology: counts
Survey of terminology: ranks
Looked at use of RDFS and OWL wrt. ranks of documents…
1. rdfs:subClassOf 0.295 ✓
2. rdfs:range 0.294 ✓
3. rdfs:domain 0.292 ✓
4. rdfs:subPropertyOf 0.090 ✓
5. owl:FunctionalProperty 0.063 ✘
6. owl:disjointWith 0.049 ✘
7. owl:inverseOf 0.047 ✓
8. owl:unionOf 0.035 ✓
9. owl:SymmetricProperty 0.033 ✓
10. owl:equivalentClass 0.021 ✓
11. owl:InverseFunctionalProperty 0.030 ✘
12. owl:equivalentProperty 0.030 ✓
13. owl:someValuesFrom 0.030 ✓/✘
...summary please?
Adding up the ranks of all vocabularies our rules fully support gives 77% of the total rank of all vocabularies
Adding up the ranks of all vocabularies our authoritative rules fully support gives 70% of the total rank of all vocabularies
The highest ranked document our rules do not fully support was 5th overall: SKOS
The highest ranked document with non-authoritative axioms was 7th overall: FOAF
Survey of terminology: ranks
...let’s stick to the simple rules
Scalable Distributed Reasoning
[Figure: five machines, each with a DIFF. A-BOX on disk (e.g., ... ex:me ex:presented ex:ThisTalk ...). Each machine runs EXTRACT T-BOX, the results go through COLLECT T-BOX, and every machine then holds the SAME T-BOX (ex:presented rdfs:domain foaf:Person ; ex:presented rdfs:range ex:Talk) and writes its own LOCAL OUTPUT (e.g., ... ex:me rdf:type ex:Awesome . ...).]
Reasoning Performance (1 machine)
Reasoning Performance: Distrib.
9 machines: Total 3.35 hours
Reasoning: Results
962 million unique/novel triples
947 million unique triples
2. Annotated Reasoning
Annotated Reasoning
Let’s try to track some meta-information during the reasoning process
Annotate input triples with information
Use annotated reasoning framework for transforming annotations on input triples into annotations on output triples
Each input triple is assigned the sum of the ranks of the documents in which it appears…
foaf:Person rdfs:subClassOf foaf:Agent 0.3 .
timbl:i rdf:type foaf:Person 0.04 .
aidan:me rdf:type foaf:Person 0.0001 .
Annotated Reasoning: ranks
During reasoning, inferences are assigned the rank of the least-trustworthy triple involved in their “proof”
foaf:Person rdfs:subClassOf foaf:Agent 0.3 .
timbl:i rdf:type foaf:Person 0.04 .
⇒ timbl:i rdf:type foaf:Agent 0.04 .
Annotated Reasoning
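A minimal sketch of how ranks could propagate through one application of the subclass rule: each inference takes the minimum rank of the triples in its proof. Keeping the maximum across alternative proofs of the same triple is an assumption made here for illustration, as are the function name and the dictionary layout.

```python
def annotated_subclass_inferences(tbox, abox):
    """tbox: {(sub_class, super_class): rank}
    abox: {(instance, class): rank}
    Returns {(instance, super_class): rank} for one rule application."""
    inferred = {}
    for (sub, sup), axiom_rank in tbox.items():
        for (instance, cls), fact_rank in abox.items():
            if cls == sub:
                # A proof is only as trustworthy as its weakest triple.
                rank = min(axiom_rank, fact_rank)
                key = (instance, sup)
                # Assumption: if a triple has several proofs, keep the best one.
                inferred[key] = max(rank, inferred.get(key, 0.0))
    return inferred


tbox = {("foaf:Person", "foaf:Agent"): 0.3}
abox = {
    ("timbl:i", "foaf:Person"): 0.04,
    ("aidan:me", "foaf:Person"): 0.0001,
}
inferred = annotated_subclass_inferences(tbox, abox)
```

With ranks attached like this, a threshold filter over `inferred` would give the rank-bounded materialisation described on the next slide.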
Why?
1. Can do top-k materialisation
Only give me inferences above a certain rank threshold
Only give me top-k inferences
2. Can fix inconsistencies in the data…
…aka. logical contradictions
…interpreting the rank values as denoting “trustworthy” data
foaf:Person owl:disjointWith foaf:Document .
Inconsistencies: aka. Contradictions
?c1 owl:disjointWith ?c2 .
?x rdf:type ?c1 .
?x rdf:type ?c2 .
⇒ false
foaf:Person owl:disjointWith foaf:Document .
ex:sleepygirl rdf:type foaf:Person .
ex:sleepygirl rdf:type foaf:Document .
⇒ false
Cannot compute…
Fixing inconsistencies
Considered two approaches:
1. Find the “consistency threshold” of the input + inferred data:
The largest rank such that all data above that rank are consistent
Unfortunately, the 22nd ranked document had an ill-typed literal, and so was inconsistent…
So we would keep the data of ~22 documents
And throw away the data of nearly four million
Time for Plan B:
2. Perform a “granular” repair of the data
Remove the weakest triple causing each contradiction
foaf:Person owl:disjointWith foaf:Document 0.3 .
ex:sleepygirl rdf:type foaf:Person 0.007 .
ex:sleepygirl rdf:type foaf:Document 0.002 .
Fixing inconsistencies
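The granular repair can be illustrated for the disjoint-class case. The sketch below finds each contradiction and marks its lowest-ranked triple for removal; it works on asserted triples only and ignores the question of which input triples an inferred triple came from. Names and data layout are assumptions.

```python
def repair(disjoint, types):
    """disjoint: {(c1, c2): rank} for  c1 owl:disjointWith c2
    types: {(x, c): rank} for  x rdf:type c
    Returns the set of triples to remove, one per contradiction."""
    removed = set()
    for (c1, c2), axiom_rank in disjoint.items():
        for (x, c), rank1 in types.items():
            if c != c1 or (x, c2) not in types:
                continue
            # The three triples involved in this contradiction.
            candidates = [
                (axiom_rank, (c1, "owl:disjointWith", c2)),
                (rank1, (x, "rdf:type", c1)),
                (types[(x, c2)], (x, "rdf:type", c2)),
            ]
            # Drop the weakest (lowest-ranked) one.
            removed.add(min(candidates)[1])
    return removed


disjoint = {("foaf:Person", "foaf:Document"): 0.3}
types = {
    ("ex:sleepygirl", "foaf:Person"): 0.007,
    ("ex:sleepygirl", "foaf:Document"): 0.002,
}
removed = repair(disjoint, types)
```

For the slide example, the triple typing ex:sleepygirl as a foaf:Document has the lowest rank (0.002) and is the one removed.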
~294k ill-typed datatypes
~7k members of disjoint classes
Inconsistencies found
Performance
9 machines
Annotated Reasoning: 14.6 hrs (vs. 3.35 hrs w/o annotations: need to do a distributed sort to remove non-optimal triples)
Detect/Extract Inconsistencies: 2.9 hrs
Diagnosis/Repair: 2.8 hrs
Total ~20.3 hours
3. Consolidation
Consolidation for Linked Data
Baseline Approach…
…use the explicit owl:sameAs relations given in the data…
Consolidation: Baseline
Scan the data and extract all owl:sameAs triples
timbl:i owl:sameAs identica:45563 .
dbpedia:Berners-Lee owl:sameAs identica:45563 .
Load into memory
Use a map to store equivalences:
timbl:i -> {timbl:i, identica:45563, dbpedia:Berners-Lee}
identica:45563 -> {timbl:i, identica:45563, dbpedia:Berners-Lee}
dbpedia:Berners-Lee -> {timbl:i, identica:45563, dbpedia:Berners-Lee}
For each set of equivalent identifiers, choose a canonical term
Consolidation: Baseline
timbl:i
identica:45563
dbpedia:Berners-Lee
Canonicalisation
Scan data a second time:
Rewrite identifiers to their canonical version
Skip predicates and values of rdf:type
Before:
timbl:i rdf:type foaf:Person .
identica:48404 foaf:knows identica:45563 .
dbpedia:Berners-Lee dpo:birthDate “1955-06-08”^^xsd:date .
After:
dbpedia:Berners-Lee rdf:type foaf:Person .
identica:48404 foaf:knows dbpedia:Berners-Lee .
dbpedia:Berners-Lee dpo:birthDate “1955-06-08”^^xsd:date .
timbl:i
identica:45563
dbpedia:Berners-Lee
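The baseline steps (collect owl:sameAs, pick a canonical term per equivalence set, rewrite the data) can be sketched with a small union-find. Choosing the lexicographically lowest identifier as the canonical term is an assumption for this sketch; the slides only say that a canonical term is chosen. All function names are hypothetical.

```python
def find(parent, term):
    """Return the root of term's equivalence set (with path halving)."""
    while parent.setdefault(term, term) != term:
        parent[term] = parent[parent[term]]
        term = parent[term]
    return term


def canonical_map(same_as_pairs):
    """Map every identifier in an owl:sameAs pair to its canonical term."""
    parent = {}
    for a, b in same_as_pairs:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            low, high = sorted((ra, rb))
            parent[high] = low   # assumption: lowest term is canonical
    return {term: find(parent, term) for term in list(parent)}


def canonicalise(triples, canon):
    """Second scan: rewrite subjects and objects, but leave predicates
    and the values of rdf:type untouched."""
    for s, p, o in triples:
        s = canon.get(s, s)
        if p != "rdf:type":
            o = canon.get(o, o)
        yield (s, p, o)


same_as = [
    ("timbl:i", "identica:45563"),
    ("dbpedia:Berners-Lee", "identica:45563"),
]
data = [
    ("timbl:i", "rdf:type", "foaf:Person"),
    ("identica:48404", "foaf:knows", "identica:45563"),
    ("dbpedia:Berners-Lee", "dpo:birthDate", '"1955-06-08"^^xsd:date'),
]
canon = canonical_map(same_as)
rewritten = list(canonicalise(data, canon))
```

On the slide data this reproduces the rewritten triples shown above, with dbpedia:Berners-Lee as the canonical term.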
Baseline Consolidation: Performance
9 machines
1. Extract owl:sameAs: 0.2 hr
2. Gather owl:sameAs: 0.1 hr
3. Canonicalise data: 0.7 hr
Total ~1.1 hours
Applied over raw input data
~12 million owl:sameAs triples
~2.2 million sets of equivalent identifiers
~5.8 million identifiers involved
~2.65 identifiers per set
~99.99% of terms were URIs
~6.25% of all URIs
Baseline Consolidation: Results
Extended Approach…
…use the owl:sameAs relations inferable through reasoning…
Infer owl:sameAs through reasoning (OWL 2 RL/RDF)
1. explicit owl:sameAs (again)
2. owl:InverseFunctionalProperty
3. owl:FunctionalProperty
4. owl:cardinality 1 / owl:maxCardinality 1
foaf:homepage a owl:InverseFunctionalProperty .
timbl:i foaf:homepage w3c:timblhomepage .
adv:timbl foaf:homepage w3c:timblhomepage .
⇒ timbl:i owl:sameAs adv:timbl .
…then apply consolidation as before
Extended Consolidation
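The inverse-functional-property case can be sketched by grouping subjects that share the same value for the same property. This is an in-memory illustration only (the following slides describe on-disk sorts and merge-joins for the real corpus); the function name and the choice to link every subject to the first one in its group are assumptions.

```python
from collections import defaultdict


def same_as_from_ifps(triples, ifps):
    """Subjects sharing a value for an inverse-functional property are
    inferred to be owl:sameAs each other."""
    groups = defaultdict(list)
    for s, p, o in triples:
        if p in ifps:
            groups[(p, o)].append(s)
    pairs = set()
    for subjects in groups.values():
        first = subjects[0]
        # Link everything to the first subject; consolidation closes the rest.
        for other in subjects[1:]:
            if other != first:
                pairs.add((first, other))
    return pairs


ifps = {"foaf:homepage"}
data = [
    ("timbl:i", "foaf:homepage", "w3c:timblhomepage"),
    ("adv:timbl", "foaf:homepage", "w3c:timblhomepage"),
    ("ex:other", "foaf:homepage", "ex:otherhomepage"),
]
pairs = same_as_from_ifps(data, ifps)
print(pairs)   # {('timbl:i', 'adv:timbl')}
```

Note that this needs a join between A-Box triples (two triples with the same property and value), which is why it cannot reuse the single-scan approach from the reasoning section.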
OWL 2 RL/RDF consolidation rules require A-Box joins!
Might not be able to fit owl:sameAs index in memory (4 Gb)!
⇒ Use on-disk batch-processing
Distributed sorts, scans and merge-joins
Derive owl:sameAs on-disk
Extended Consolidation: Performance
9 machines
1. Inferring owl:sameAs ~7.4 hr
2. Canonicalise data ~4.9 hr
Total ~12.3 hours (11x baseline)
~12 million explicit owl:sameAs triples (as before)
~8.7 million thru. owl:InverseFunctionalProperty
~106 thousand thru. owl:FunctionalProperty
none thru. owl:cardinality/owl:maxCardinality
~2.8 million sets of equivalent identifiers (1.31x baseline)
~14.86 million identifiers involved (2.58x baseline)
~5.8 million URIs (1.014x baseline)
Extended Consolidation: Results
CONCLUSION
timbl:i foaf:page ?pages .
timbl:i
identica:45563
dbpedia:Berners-Lee
dbpedia:Berners-Lee foaf:page ?pages .
Conclusions
Heterogeneity poses a significant problem for consuming Linked Data
1. Lightweight reasoning can go a long way
Simple/authoritative rules have reasonable coverage
2. Deceit/Noise ≠ End Of World
3. Inconsistency ≠ End Of World
Useful for finding noise in fact!
4. Explicit owl:sameAs vs. extended consolidation:
Extended consolidation mostly for consolidating blank-nodes from older FOAF exporters