
Post on 15-Jan-2016

Foundations of Semantic Web Databases

Claudio Gutierrez, Carlos Hurtado, Alberto O. Mendelzon


Recall: Semantic Web

The Web is a huge collection of varied, interconnected data that lacks semantics and is therefore understandable only by humans.

To allow anyone to say anything about anything.

The Semantic Web is based on the idea of adding machine-understandable semantics to web information via annotations, so that machines can perform more of the tedious work involved in finding, sharing and combining information on the web.


Recall: The Relational Model

The rows represent the things you are storing information about.

The columns represent the properties of those things.

The intersection gives the value of that property for that thing.


Recall: RDF

[Figure: the statement (book, title, JavaScript), labelled as (subject, property, value).]


Recall: RDF

Resource Description Framework (RDF).

The RDF model was designed with the following goals: simple data model, formal semantics and provable inference, extensible URI-based vocabulary, allowing anyone to make statements about any resource.

An RDF statement describes any resource that can have a URI through its properties, using a binary predicate that relates it to another resource.


Recall: RDF

RDF statement - (Subject, Predicate, Object):

( http://en.wikipedia.org/wiki/Dan_Brown, http://purl.org/dc/elements/1.1/publisher, "Wikipedia" )

Or in XML format:

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/Dan_Brown">
    <dc:publisher>Wikipedia</dc:publisher>
    …
  </rdf:Description>
</rdf:RDF>


Recall: Ontology and RDFS

RDF lacks the ability to express the relations between objects (e.g. Cat is an Animal, Book has an Author).

RDF Schema (also called the RDFS vocabulary) adds information about the classes and properties of resources and the relations between them.


Recall: RDF Schema

RDFS main constructs: Class, subClassOf, Property, subPropertyOf, Object, Predicate, Subject, Range, Domain, Type, etc.

A: (John, Class, Man)
B: (Man, subClassOf, Person)
C: (A, Subject, John)

Enables “Duck Typing”.

Reification


Recall: RDF Query Languages

Given data represented in RDF format, a query language (e.g. SPARQL) makes it possible to retrieve and manipulate the data.

As in other query languages, we would like to “filter” and reorganize the data. Although the data can be part of different DBs and represented in different formats, its semantics is represented with RDFS and ontologies, which are common to all of the data.


The Problem

[Figure: a query (?) and its answer (!) over a single RDF DB, then over several RDF DBs whose answers are compared (=, ≠) and united (∪).]


The Problems

Different representation of the data (no normal form) and redundancy elimination.

Equivalence (of DBs, queries and answers).

Entailment and containment of queries.

The impact of predefined semantics (RDFS vocabulary), blank nodes, reification and premises on queries.

Complexity issues.


Blank Nodes / Blank Resources

A blank node (or blank resource) is a resource in an RDF DB (or graph) that is not identified by a URI (Uniform Resource Identifier).

(John, knows, _:p1)
(_:p1, birthDate, 04-21)

“There exists some _:p1 who is known by John and whose date of birth is the 21st of April.”

Enables partial understanding when information is missing.

We will use letters N, X, Y, … to denote blank nodes.


RDF Graphs

UBL (Resources) is the union of U (URIs), B (Blank Nodes) and L (Literals).

A triple (Subj, Pred, Obj) is drawn as an edge labelled Pred from the node Subj to the node Obj.

An RDF graph G is a set of triples.


RDF Graphs

The universe of a graph G, universe(G), is the set of elements of UBL that occur in the triples of G.

The vocabulary of a graph G is the set of elements of UL that occur in the triples of G.

A graph is ground if it has no blank nodes.

The union of G1 and G2, denoted G1∪G2, is the union of their sets of triples.

The merge of G1 and G2, denoted G1+G2, is the union of their sets of triples after blank nodes have been renamed so that the two sets of blank nodes are disjoint. (Merge is safe.)
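The difference between union and merge can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: triples are plain tuples of strings, blank nodes are the strings that start with "_:", and the helper names (is_blank, blank_nodes, union, merge) are made up for this example.

```python
def is_blank(term):
    # Assumption: blank nodes are strings starting with "_:";
    # every other string is a URI or a literal.
    return term.startswith("_:")


def blank_nodes(graph):
    return {term for triple in graph for term in triple if is_blank(term)}


def union(g1, g2):
    # Plain set union: blank nodes with the same name are identified.
    return set(g1) | set(g2)


def merge(g1, g2):
    # Rename the blank nodes of g2 that clash with g1, then take the union.
    taken = blank_nodes(g1) | blank_nodes(g2)
    renaming = {}
    counter = 0
    for name in sorted(blank_nodes(g1) & blank_nodes(g2)):
        fresh = f"_:m{counter}"
        while fresh in taken:
            counter += 1
            fresh = f"_:m{counter}"
        taken.add(fresh)
        renaming[name] = fresh
    renamed = {tuple(renaming.get(t, t) for t in triple) for triple in g2}
    return set(g1) | renamed


g1 = {("_:X", "sc", "a")}
g2 = {("_:X", "sc", "c")}
print(len(blank_nodes(union(g1, g2))))  # 1: the two X nodes are the same node
print(len(blank_nodes(merge(g1, g2))))  # 2: the X of g2 was renamed
```

In the union the shared blank node ties the two triples together, while the merge keeps the two sources apart, which is why merge is the safe operation when the graphs come from unrelated sources.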


RDF Graphs

[Figure: example graphs G1 and G2 over the nodes a, c and the blank nodes X, Y with sc edges, shown next to their union G1 ∪ G2 and their merge G1 + G2.]


RDFS Vocabulary

The RDFS vocabulary describes properties, such as attributes of resources, and the relationships between them. It also makes it possible to make statements about statements (reification).

For a given triple N: (a, b, c) that occurs in http://...:

[Figure: node N with edges type to stat, subj to a, pred to b, obj to c, and occurs to http://...]


Maps

A map is a function μ: UBL→UBL that preserves URIs and literals, i.e. μ(u) = u for every u in UL.

μ is consistent with a graph G if μ(G) is an RDF graph; in that case μ(G) is an instance of G.

An instance of G is proper if it has fewer blank nodes than G.

Overloading the meaning of map, we write μ: G1→G2 if μ is a map such that μ(G1) is a subgraph of G2.
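As a small illustration of these definitions, the sketch below applies a map to a graph and checks whether it is a map from one graph into another. It assumes, beyond what the slide states, that triples are tuples of strings, that blank nodes start with "_:", and that a map is given as a dict on blank nodes only, so URIs and literals are left unchanged. The names apply_map and is_map_between are hypothetical.

```python
def is_blank(term):
    # Assumption: blank nodes are strings starting with "_:".
    return term.startswith("_:")


def apply_map(mu, graph):
    # mu is a dict on blank nodes; URIs, literals and unlisted blank nodes
    # are mapped to themselves.
    return {
        tuple(mu.get(t, t) if is_blank(t) else t for t in triple)
        for triple in graph
    }


def is_map_between(mu, g1, g2):
    # mu: G1 -> G2 holds when mu(G1) is a subgraph of G2.
    return apply_map(mu, g1) <= set(g2)


g1 = {("a", "p", "_:X"), ("_:X", "q", "_:Y")}
g2 = {("a", "p", "b"), ("b", "q", "c"), ("c", "r", "d")}
mu = {"_:X": "b", "_:Y": "c"}

print(apply_map(mu, g1) == {("a", "p", "b"), ("b", "q", "c")})  # True
print(is_map_between(mu, g1, g2))  # True
```

Here μ(G1) has no blank nodes left, so it is a proper instance of G1.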


Graph Isomorphism

Two graphs G1 and G2 are isomorphic, denoted G1≃G2, if there are maps μ1 and μ2 such that μ1(G1)=G2 and μ2(G2)=G1.
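For small graphs, isomorphism can be tested by brute force. The sketch below assumes that maps leave URIs and literals fixed; under that assumption the two maps of the definition amount to a one-to-one renaming of blank nodes, so it is enough to try every bijection between the blank nodes of the two graphs. The representation (tuples of strings, blank nodes starting with "_:") and the function names are choices made for this example only.

```python
from itertools import permutations


def is_blank(term):
    # Assumption: blank nodes are strings starting with "_:".
    return term.startswith("_:")


def blanks(graph):
    return sorted({t for triple in graph for t in triple if is_blank(t)})


def isomorphic(g1, g2):
    g1, g2 = set(g1), set(g2)
    b1, b2 = blanks(g1), blanks(g2)
    if len(g1) != len(g2) or len(b1) != len(b2):
        return False
    # Try every bijection from the blank nodes of g1 to those of g2.
    for image in permutations(b2):
        mu = dict(zip(b1, image))
        if {tuple(mu.get(t, t) for t in triple) for triple in g1} == g2:
            return True
    return False


g1 = {("_:X", "p", "a"), ("_:X", "q", "_:Y")}
g2 = {("_:N", "p", "a"), ("_:N", "q", "_:M")}
g3 = {("_:N", "p", "a"), ("_:M", "q", "_:N")}
print(isomorphic(g1, g2))  # True
print(isomorphic(g1, g3))  # False
```

The search is exponential in the number of blank nodes, so it is only meant to make the definition concrete.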


Graph Isomorphism

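[Figure: two isomorphic graphs, one on the vertices a, b, c, d, g, h, i, j and one on the vertices 1 to 8, related by the bijection ƒ below.]

ƒ(a) = 1, ƒ(b) = 6, ƒ(c) = 8, ƒ(d) = 3, ƒ(g) = 5, ƒ(h) = 2, ƒ(i) = 4, ƒ(j) = 7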


Lean Graphs

A graph G is lean, if there is no map μ such that μ(G) is a proper subgraph of G.

[Figure: two example graphs G1 and G2 over the nodes a, b and the blank nodes X, Y, with edges labelled p, q and r.]
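The definition of leanness can be checked directly on small graphs by trying every map. The following is a brute-force sketch, not an efficient procedure; it assumes triples are tuples of strings, blank nodes start with "_:", and maps move blank nodes only. The example graphs are made up for illustration and are not the G1 and G2 of the figure.

```python
from itertools import product


def is_blank(term):
    # Assumption: blank nodes are strings starting with "_:".
    return term.startswith("_:")


def is_lean(graph):
    graph = set(graph)
    terms = sorted({t for triple in graph for t in triple})
    blanks = [t for t in terms if is_blank(t)]
    # A map whose image stays inside the graph can only send blank nodes
    # to terms of the graph, so those are the only candidates.
    for image in product(terms, repeat=len(blanks)):
        mu = dict(zip(blanks, image))
        mapped = {tuple(mu.get(t, t) for t in triple) for triple in graph}
        if mapped < graph:  # proper subgraph
            return False
    return True


print(is_lean({("a", "p", "_:X"), ("a", "p", "b")}))    # False: X can be mapped to b
print(is_lean({("a", "p", "_:X"), ("_:X", "q", "b")}))  # True
```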


Core

Theorem: Each RDF graph G contains a unique (up to isomorphism) lean subgraph which is an instance of G. We will denote this unique subgraph by core(G).

Theorem: Deciding if G is lean is coNP-complete (reduction to tautology). Deciding if G’ ≃ core(G) is DP-complete.
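One simple way to obtain a core of a small graph is to keep replacing the graph by a proper subgraph that is one of its instances until none exists. The sketch below does this by exhaustive search, so its running time is exponential, in line with the hardness results above. The representation (tuples of strings, blank nodes starting with "_:") and the names proper_image and core are assumptions of this example.

```python
from itertools import product


def is_blank(term):
    # Assumption: blank nodes are strings starting with "_:".
    return term.startswith("_:")


def proper_image(graph):
    # Return some mu(graph) that is a proper subgraph of graph, or None.
    terms = sorted({t for triple in graph for t in triple})
    blanks = [t for t in terms if is_blank(t)]
    for image in product(terms, repeat=len(blanks)):
        mu = dict(zip(blanks, image))
        mapped = {tuple(mu.get(t, t) for t in triple) for triple in graph}
        if mapped < graph:
            return mapped
    return None


def core(graph):
    graph = set(graph)
    while True:
        smaller = proper_image(graph)
        if smaller is None:
            return graph  # no proper image left, so the graph is lean
        graph = smaller


g = {("a", "p", "_:X"), ("a", "p", "_:Y"), ("a", "p", "b")}
print(core(g))  # {('a', 'p', 'b')}
```

Each step composes one more map, so the result is still an instance of the original graph and a subgraph of it.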


Graph Interpretation

An interpretation I of an RDF graph G consists of:

1. A non-empty set of resources Res.

2. The literals, a subset Lit ⊆ Res.

3. A set of binary properties Prop ⊆ Res × Res.

4. A mapping from the vocabulary of G: U → Res ∪ Prop and L → Lit.


Entailment & Equivalence

An RDF graph G1 entails G2, denoted G1 |= G2, iff every interpretation over the vocabulary of G1∪G2 which satisfies G1 also satisfies G2.

We say that two graphs are equivalent, denoted G1≡G2, if G1 |= G2 and G2 |= G1.


Semantics of Simple RDF Graphs

A simple RDF graph is a graph that does not use vocabulary with a predefined semantics.

Theorem: A simple RDF graph G1 entails G2, denoted G1 |= G2, if and only if there is a map G2→G1.

A graph entails any of its subgraphs.

[Figure: an example of two graphs over a, b, c and the blank node X with edges p and q, related by |=.]
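The theorem turns simple entailment into a search for a map, which can be written down directly. The sketch below tries every assignment of the blank nodes of G2 to terms of G1; it assumes triples are tuples of strings, blank nodes start with "_:", and maps fix URIs and literals. The function names and the example graphs are illustrative only.

```python
from itertools import product


def is_blank(term):
    # Assumption: blank nodes are strings starting with "_:".
    return term.startswith("_:")


def find_map(source, target):
    # Search for a map mu with mu(source) contained in target.
    source, target = set(source), set(target)
    blanks = sorted({t for triple in source for t in triple if is_blank(t)})
    candidates = sorted({t for triple in target for t in triple})
    for image in product(candidates, repeat=len(blanks)):
        mu = dict(zip(blanks, image))
        if {tuple(mu.get(t, t) for t in triple) for triple in source} <= target:
            return mu
    return None


def simple_entails(g1, g2):
    # G1 |= G2 for simple graphs iff there is a map G2 -> G1.
    return find_map(g2, g1) is not None


g1 = {("a", "p", "b"), ("b", "q", "c")}
g2 = {("_:X", "p", "b"), ("b", "q", "c")}
print(simple_entails(g1, g2))  # True: X can be mapped to a
print(simple_entails(g2, g1))  # False: a is a URI and must stay a
```

The loop visits up to v^n assignments for v terms and n blank nodes, so the cost is driven by the number of blank nodes in G2.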


Semantics of Simple RDF Graphs

Theorem:

1. Deciding entailment of simple RDF graphs is NP-complete.

2. Deciding equivalence of simple RDF graphs is isomorphism-complete.

Both depend heavily on the number of blank nodes: they can be decided in O(v^n), where v is the number of nodes and n the number of blank nodes.

Theorem: If G is simple, then core(G) is the unique minimal graph equivalent to G.


Semantics of RDF Graphs with RDFS Vocabulary

The following deductive system is sound and complete (each rule is written premises / conclusion):

Group A (simple graphs)
1) G / G’, for a map μ: G’→G

Group B (sp)
2) (a, type, prop) / (a, sp, a)
3) (a, sp, b) (b, sp, c) / (a, sp, c)
4) (a, sp, b) (x, a, y) / (x, b, y)

Group C (sc)
5) (a, type, class) / (a, sc, a)
6) (a, sc, b) (b, sc, c) / (a, sc, c)
7) (a, sc, b) (x, type, a) / (x, type, b)

Group D (typing)
8) (a, dom, c) (x, a, y) / (x, type, c)
9) (a, range, d) (x, a, y) / (y, type, d)
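Rules 2) to 9) can be applied mechanically until nothing new is derived. The sketch below does exactly that and nothing more: it ignores rule 1), performs no well-formedness checks on the derived triples, and uses the short keywords type, prop, class, sp, sc, dom and range as plain strings. Rule 7) is read here with an sc premise, (a, sc, b), matching its place in the sc group; that reading is an assumption of this example. The function name saturate is hypothetical.

```python
def saturate(graph):
    # Apply rules 2) to 9) until a fixpoint is reached.
    g = set(graph)
    while True:
        new = set()
        for (s, p, o) in g:
            if p == "type" and o == "prop":
                new.add((s, "sp", s))                    # rule 2
            if p == "type" and o == "class":
                new.add((s, "sc", s))                    # rule 5
            for (s2, p2, o2) in g:
                if p == "sp" and p2 == "sp" and o == s2:
                    new.add((s, "sp", o2))               # rule 3
                if p == "sp" and p2 == s:
                    new.add((s2, o, o2))                 # rule 4
                if p == "sc" and p2 == "sc" and o == s2:
                    new.add((s, "sc", o2))               # rule 6
                if p == "sc" and p2 == "type" and o2 == s:
                    new.add((s2, "type", o))             # rule 7
                if p == "dom" and p2 == s:
                    new.add((s2, "type", o))             # rule 8
                if p == "range" and p2 == s:
                    new.add((o2, "type", o))             # rule 9
        if new <= g:
            return g
        g |= new


g = {("a", "sc", "b"), ("b", "sc", "c"), ("x", "type", "a")}
closed = saturate(g)
print(("a", "sc", "c") in closed)    # True, by rule 6
print(("x", "type", "c") in closed)  # True, by rules 6 and 7
```

The loop terminates because every derived triple is built from terms already present in the graph, so only finitely many triples can ever be added.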


Semantics of RDF Graphs with RDFS Vocabulary

Theorem: G1 |= G2 if and only if there is a sequence of rule applications that starts from G1 and ends with G2. Deciding this is NP-complete.

There may be no map G2→G1 even though G1 |= G2.

The idea is to “close” the graph with all possible triples.

[Figure: example graphs G1 and G2 over the nodes a, b, c, d and the blank node X, connected by sc edges.]


Closure

A closure of a graph G is a maximal set of triples G’ over universe(G) plus the RDFS vocabulary such that G’ contains G and is equivalent to it.

There could be more than one closure for a graph.

The closure may contain redundancies.

The problem of deciding if G’ is the closure of G is DP-complete.

[Figure: an example graph over the nodes a, b, c, d and the blank node X, with edges labelled p, q and r.]


Normal Form

A normal form of a graph G, denoted nf(G), is core(G’) for the closure G’ of G.

Theorem: Let G be an RDF graph:

1. The normal form nf(G) is unique.

2. G1 |= G2 if and only if nf(G2)→nf(G1).

3. G1≡G2 if and only if nf(G1)≃nf(G2).

The problem of deciding if G’ is the normal form of G is DP-complete.


Normal Form

[Figure: the graphs G1 and G2 over the nodes a, b, c, d and the blank node X with sc edges, shown next to their normal form nf(Gi).]

nf is not the most compact representation.


Query Language

The RDF database will be the RDF graph.

Let V be the set of variables, denoted by ?X, ?Y, …

The query form is Datalog-like, H ← B, where H and B are graphs that may contain variables, e.g.

(?X, ancestor, ?Y) ← (?X, ancestor, ?Z), (?Z, ancestor, ?Y)

The condition var(H)⊆var(B) avoids the presence of free variables in the head of the query.

Blank nodes in the body would play the same role as variables and are therefore unnecessary.


Query Language

A query can have a set of premises P and constraints C, so a query is a tuple (H, B, P, C).

The set of constraints C gives the user the possibility to discriminate between blank and ground nodes in the answer.

The premise P represents information the user supplies to the database in order to answer the query, e.g. to query incomplete information by supplying information that is not in the DB, or to add semantic information such as (son, sp, relative).


Answer to a Query

Let q = (H, B, P, C) be a query, D a database and V the set of variables.

A valuation is a function v: V→UBL. A valuation satisfies C, written v |= C, if v(x) is not a blank node for every variable x in C.

The pre-answer to q over D is the set of single answers v(H):

preans(q,D) = {v(H): v(B)⊆nf(D+P) and v |= C}
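The definition of preans can be read as a small backtracking search over valuations. The sketch below makes several simplifying assumptions that are not in the slides: variables are strings starting with "?", blank nodes start with "_:", the head contains no blank nodes, the constraints are given as a collection of variable names, and the data argument already stands for nf(D+P), which the caller is expected to compute separately. All function names are hypothetical.

```python
def is_var(term):
    # Assumption: variables are strings starting with "?".
    return term.startswith("?")


def is_blank(term):
    # Assumption: blank nodes are strings starting with "_:".
    return term.startswith("_:")


def match(pattern_term, data_term, v):
    # Bind a variable on first use, then require the same value again.
    if is_var(pattern_term):
        return v.setdefault(pattern_term, data_term) == data_term
    return pattern_term == data_term


def valuations(body, data, v=None):
    # Yield every valuation v (a dict) such that v(body) is contained in data.
    v = {} if v is None else v
    if not body:
        yield dict(v)
        return
    first, rest = body[0], body[1:]
    for triple in data:
        extended = dict(v)
        if all(match(p, t, extended) for p, t in zip(first, triple)):
            yield from valuations(rest, data, extended)


def pre_answers(head, body, data, constraints=()):
    # data plays the role of nf(D+P); each single answer v(H) is a frozenset.
    answers = set()
    for v in valuations(list(body), data):
        if any(is_blank(v.get(x, "")) for x in constraints):
            continue  # v does not satisfy C
        answers.add(frozenset(tuple(v.get(t, t) for t in tr) for tr in head))
    return answers


data = {("a", "ancestor", "b"), ("b", "ancestor", "c")}
head = [("?X", "ancestor", "?Y")]
body = [("?X", "ancestor", "?Z"), ("?Z", "ancestor", "?Y")]
print(pre_answers(head, body, data))
```

On this data the only valuation is ?X = a, ?Z = b, ?Y = c, so the pre-answer contains the single answer {(a, ancestor, c)}.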


Answer to a Query

Composing a complex query from simpler ones:

1. ans∪(q,D) is the union of all single answers (blank nodes play the role of bridges between two single answers).

2. ans+(q,D) is the merge of all single answers (renaming blank nodes to avoid name clashes). Useful when querying several sources.

Let q be a query:

1. If D’ |= D then ans(q,D’) |= ans(q,D).

2. For all D, ans∪(q,D) |= ans+(q,D) (the converse is not true).
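The two ways of combining single answers differ only in how blank nodes are treated, which a short sketch can show. It assumes single answers are sets of triples (tuples of strings) and blank nodes start with "_:"; the renaming scheme that appends the index of the single answer to each blank node is just one possible choice, and the function names are made up.

```python
def is_blank(term):
    # Assumption: blank nodes are strings starting with "_:".
    return term.startswith("_:")


def ans_union(single_answers):
    # Union: equal blank-node names in different single answers are shared.
    return set().union(*single_answers)


def ans_merge(single_answers):
    # Merge: tag every blank node with the index of its single answer,
    # so no blank node is shared between two single answers.
    merged = set()
    for i, answer in enumerate(sorted(single_answers, key=sorted)):
        for triple in answer:
            merged.add(tuple(f"{t}_{i}" if is_blank(t) else t for t in triple))
    return merged


def blank_nodes(graph):
    return {t for triple in graph for t in triple if is_blank(t)}


answers = [{("a", "p", "_:N")}, {("_:N", "q", "b")}]
print(len(blank_nodes(ans_union(answers))))  # 1: N bridges the two answers
print(len(blank_nodes(ans_merge(answers))))  # 2: the bridge is broken
```

In the union the shared blank node still says that a is linked by p to something that is linked by q to b, while the merge only says that each triple holds for some node, which illustrates why the union entails the merge but not the other way round.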


Reification

The ability to identify RDF statements.

By having a blank node in the head of the query, one can identify a statement:

(N, value, true), (N, type, stat), (N, subj, ?X), (N, pred, ?Y), (N, obj, ?Z) ← (?X, ?Y, ?Z)

This can cause an infinite DB: if i1: (a, b, c) is a valid statement, then so is i2: (i1, subj, a), and then the statement (i2, subj, i1), and so on.
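The growth described above can be made concrete with a small sketch that reifies a triple and then keeps reifying the subj triple of the previous reification. It uses the keywords type, stat, subj, pred and obj from the query above as plain strings; the blank-node names _:i1, _:i2, … and the function names are assumptions of this example, and the value triple is left out.

```python
def reify(triple, node):
    # Describe the statement `triple` by four triples about `node`.
    s, p, o = triple
    return {
        (node, "type", "stat"),
        (node, "subj", s),
        (node, "pred", p),
        (node, "obj", o),
    }


def reify_chain(triple, depth):
    # Reify the triple, then the (node, subj, ...) triple of each
    # reification, `depth` times in total.
    graph = set()
    current = triple
    for level in range(depth):
        node = f"_:i{level + 1}"
        graph |= reify(current, node)
        current = (node, "subj", current[0])
    return graph


print(len(reify_chain(("a", "b", "c"), 1)))  # 4
print(len(reify_chain(("a", "b", "c"), 3)))  # 12
```

Every level adds four new triples, so without a bound on the depth the process never stops.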


Query Containment

Exploring different notions of query containment.

In relational databases, set-theoretical inclusion of tuples captures this requirement.

Let q and q’ be queries. For all databases D:

1. q ⊆p q’ iff preans(q,D) ⊆ preans(q’,D) up to isomorphism.

2. q ⊆m q’ iff ans(q’,D) |= ans(q,D).

For queries q and q’, q ⊆p q’ entails q ⊆m q’. The converse is not true.

Theorem: Deciding each one of them is NP-complete.


Query Containment

For example:

H = B = (X, sc, Y), (Y, sc, Z)
H’ = B’ = (X, sc, Y), (Y, sc, Z), (X, sc, Z)

q’ ⊆m q and q ⊆m q’ are both true, but NOT q’ ⊆p q or q ⊆p q’.


Query Containment

Consider the queries q=(H,B,P,C) and q’=(H’,B’,P’,C’), and assume H,H’,B,B’, P, P’ are simple graphs.

Theorem: Then q⊆pq’ if and only if for each map μ on the variables of B, there is a substitution (of variables and blank nodes) Θμ such that:

1. Θμ(B’)⊆P’+(B−μ(B,P)), where μ(B,P) is the set of triples t of B such that μ(t)∈P.

2. Θμ(H’)=H.

3. Θμ(C’)⊆C.


Query Containment

Consider the queries q=(H,B,P,C) and q’=(H’,B’,P’,C’), and assume H,H’,B,B’, P, P’ are simple graphs.

Theorem: Then q⊆mq’ if and only if there are substitutions (of variables) Θ1,…, Θn such that:

1. Θj(B’)⊆nf(B).

2. ∪jΘj(H’)|=H.

3. Θj(C’)⊆C.


Complexity of Query Answering

The complexity of the evaluation problem of testing emptiness of the query answer set in two versions:

1. Query complexity version: For a fixed database D, given a query q, is q(D) non-empty? NP-complete.

2. Data complexity version: For a fixed query q, given a database D, is q(D) non-empty? Polynomial.

The size of the answer set is bounded by |D|^|q|.


Redundancy Elimination – In Graphs

A reduction of a graph G is a minimal graph Gr equivalent to G and contained in G.

Algorithm computing the reduction of a graph G:

1. G ← nf(G)

2. Apply reverse rules 7), 8), 9), 4), and 3) and 6) in this order until no longer applicable.

3. Apply any reverse rule in any order until no longer applicable.

Theorem: The problem of deciding if G’ is the reduction of G is DP-complete.


Redundancy Elimination – In Queries

Redundancy in query answers can be reduced by using lean query heads.

A lean query body is not always possible, and may cause answers to be missed.

Even having lean databases and queries with lean heads and bodies does not avoid redundancies. For example:

G1 is the answer to the query (?Z, p, ?U) ← (?Z, p, ?U) on G2.

[Figure: the graphs G1 and G2 over the nodes a, b and the blank nodes X, Y, with edges labelled p, q and r.]


Redundancy Elimination – In Queries

The naive approach to eliminate redundancy in answers is to compute:

(1) ans(q,D), and (2) a lean equivalent to ans(q,D).

Theorem: Given a lean database D and a query q, to decide whether ans∪(q,D) is lean is coNP-complete (in the size of D).

Theorem: Given a lean database D and a query q, to decide whether ans+(q,D) is lean can be done in polynomial time in the size of D.


Contributions

Normal form.

A formal definition of query language for RDF and its main features.

Query containment and processing.

Redundancy elimination.

From entailment to mapping between graphs.

Complexity issues.


References

Foundations of Semantic Web Databases – Claudio Gutierrez, Carlos Hurtado, Alberto O. Mendelzon (2004)

RDF Semantics – W3C Working Draft (2003)

Composing Web Services on the Semantic Web – Vadim Eisenberg

Special thanks to Google and Wikipedia.


Thank you!