pods2003
1
Computing Full Disjunctions
Yaron Kanza
Yehoshua Sagiv
The Selim and Rachel Benin
School of Engineering
and Computer Science
The Hebrew University of Jerusalem
2
Overview of the Talk
• OR-semantics and weak semantics for querying incomplete data
  – Complexity of query evaluation
• Full disjunctions as a special case of weak semantics
• Generalizing full disjunctions: the join constraints are not restricted to equality constraints
• Lower bounds for some related problems
3
Querying Incomplete Data Requires a Special Semantics
• Usually, answers to a query are complete assignments of database objects (or values) to the query variables
• Consequently, partial information is lost
• For example, dangling tuples are lost when joining several relations
  – The purpose of outerjoins and full disjunctions is to solve this problem, i.e., answers may be partial assignments (to some of the variables)
4
Querying Incomplete Semistructured Data
• In semistructured data, incompleteness of data is prevalent
• OR-semantics and weak semantics were introduced so that queries over semistructured data would return maximal answers rather than complete answers [Kanza, Nutt & Sagiv 1999]
5
In the Semistructured Data Model
• Both data and queries are labeled rooted directed graphs
• Query nodes are variables
• Database nodes are objects
• Matchings are assignments of database objects to query variables, such that
  – The database root is assigned to the query root, and
  – Labels are preserved
6
A Semistructured Database About Movies
[Figure: a rooted, edge-labeled graph with nodes 1 to 11. Edge labels: movie, actor, title, year, language, director, acted in, name, date of birth. Atomic values: Zelig, 1983, English, Antz, 1998, Woody Allen, 1/12/1935.]
7
A Query
[Figure: the query graph, with variables v1, v2, v3, w1, w2, w3, w4 and edge labels movie, actor, title, language, director, acted in, name, date of birth.]
Under complete semantics, the query returns actor-movie pairs such that the actor played in the movie and was also the director of the movie
8
A complete matching of the query variables to database objects
[Figure: the query drawn next to the movie database; the query variables are mapped to database objects 1, 2, 5, 6, 4, 10 and 11.]
9
Constraints on Complete Matchings
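• The root constraint is satisfied if the query root is mapped to the database root
  [Figure: the query root r is mapped to the database root 1.]
• A query edge is an edge constraint:
  – A query edge with a label l is satisfied if it is mapped to a database edge with the same label l
  [Figure: query nodes x and y, joined by an edge labeled l, are mapped to database objects 9 and 11, which are joined by an edge labeled l.]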
10
Suppose that Node 6 is missing
[Figure: the movie database with node 6 (the value English) and its language edge removed.]
11
An incomplete matching
[Figure: the query mapped to database objects 1, 2, 5, 4, 10 and 11; w2 is mapped to null.]
This matching is maximal
12
The Reachability Constraint on Partial Matchings
• A query node v that is mapped to a database object o satisfies the reachability constraint if there is a path from the query root to v, such that all edge constraints along this path are satisfied
[Figure: a query with nodes r, x, y, z, w and v and edge labels l1 to l6, shown next to a database rooted at object 1.]
13
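Weak Satisfaction of Edge Constraints
• An edge constraint is weakly satisfied if it is either
  – Satisfied (as defined earlier), or
  – One (or more) of its nodes is mapped to a null value
[Figure: example mappings of a query edge from x to y labeled l onto database objects 9 and 11, joined by an edge labeled l or m, including cases where one or both of x and y are mapped to null.]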
14
Weak Matchings
• A partial matching is a weak matching if
  – The root constraint is satisfied
  – The reachability constraint is satisfied by every query node that is mapped to a database node
  – Every edge constraint is weakly satisfied
15
A weak matching
[Figure: the movie database without node 6; the query is mapped to objects 1, 2, 5, 4, 10 and 11, and w2 is mapped to null.]
16
A Movie Database
Consider the case where the director edge is missing
[Figure: the movie database without node 6 and with the director edge removed.]
17
An incomplete matching that is not a weak matching
[Figure: the query mapped to objects 1, 2, 5, 4, 10 and 11, with w2 mapped to null, over the database that lacks the director edge.]
There is an edge that is not weakly satisfied
18
OR Matchings
• A partial matching is an OR matching if
  – The root constraint is satisfied
  – The reachability constraint is satisfied by every query node that is mapped to a database node
Unlike in a weak matching, an edge constraint in an OR matching does not have to be weakly satisfied
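The definitions of weak matchings and OR matchings can be contrasted in a short Python sketch. It assumes a representation that is not in the talk: edges are (source, label, target) triples, an assignment is a dict from query nodes to database objects, and None stands for null. The toy query and database are hypothetical and are not the movie example.

```python
from collections import deque

def satisfied(edge, assignment, db_edges):
    """A query edge (u, label, v) is satisfied when both end nodes are mapped
    to objects joined by a database edge with the same label."""
    u, label, v = edge
    a, b = assignment.get(u), assignment.get(v)
    return a is not None and b is not None and (a, label, b) in db_edges

def reachable(query_root, query_edges, assignment, db_edges):
    """Query nodes reachable from the root along satisfied edges only."""
    seen, queue = {query_root}, deque([query_root])
    while queue:
        u = queue.popleft()
        for edge in query_edges:
            if edge[0] == u and edge[2] not in seen and satisfied(edge, assignment, db_edges):
                seen.add(edge[2])
                queue.append(edge[2])
    return seen

def is_or_matching(query_root, db_root, query_edges, db_edges, assignment):
    if assignment.get(query_root) != db_root:  # root constraint
        return False
    ok = reachable(query_root, query_edges, assignment, db_edges)
    # reachability constraint for every node mapped to a database object
    return all(v in ok for v, obj in assignment.items() if obj is not None)

def is_weak_matching(query_root, db_root, query_edges, db_edges, assignment):
    if not is_or_matching(query_root, db_root, query_edges, db_edges, assignment):
        return False
    for u, label, v in query_edges:
        both_mapped = assignment.get(u) is not None and assignment.get(v) is not None
        # an edge with a null end node is weakly satisfied by definition
        if both_mapped and not satisfied((u, label, v), assignment, db_edges):
            return False
    return True

# Hypothetical toy data: a triangle query over a database without a c-edge.
query_edges = [("r", "a", "x"), ("r", "b", "y"), ("x", "c", "y")]
db_edges = {(1, "a", 2), (1, "b", 3)}
full = {"r": 1, "x": 2, "y": 3}
partial = {"r": 1, "x": 2, "y": None}
```

Here `full` is an OR matching but not a weak matching, because the edge from x to y has both end nodes mapped and is not satisfied, while `partial` is a weak matching.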
19
Maximal Matchings
• Matchings can be represented as tuples (where numbers are object ids)
  t1=(1, 5, 2, null) t2=(1, null, 2, null)
• A matching t1 subsumes a matching t2 if t1 can be obtained from t2 by replacing some nulls in t2 with non-null values
• A matching is maximal if no other matching subsumes it
• A query result consists only of maximal matchings
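Subsumption and maximality translate almost directly into code. The sketch below assumes matchings are Python tuples with None for null; the function names are illustrative and do not come from the talk.

```python
def subsumes(t1, t2):
    """True if t1 can be obtained from t2 by replacing some nulls (None) with values."""
    return (
        len(t1) == len(t2)
        and t1 != t2
        and all(b is None or a == b for a, b in zip(t1, t2))
    )

def maximal(matchings):
    """Drop duplicates, then drop every matching that another matching subsumes."""
    unique = list(dict.fromkeys(matchings))
    return [t for t in unique if not any(subsumes(s, t) for s in unique)]

t1 = (1, 5, 2, None)
t2 = (1, None, 2, None)
result = maximal([t1, t2, t2])
```

With the two tuples of the slide, t1 subsumes t2, so only t1 remains in `result`.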
20
More Examples
21
The Movie Database Before the Removals
[Figure: the full movie database, including node 6 and the director edge.]
22
A complete matching
[Figure: the query mapped to database objects 1, 2, 5, 6, 4, 10 and 11.]
It is also a maximal weak matching
It is also a maximal OR-matching
In the result, the actor must be both an actor in the movie and the director of the movie
23
A second maximal weak matching
[Figure: the query mapped to database objects 1, 3 and 8; the remaining four variables are mapped to null.]
In the result, if the actor and the movie are assigned non-null values, then the actor must be both an actor in the movie and the director of the movie
24
A maximal OR-matching
[Figure: the query mapped to database objects 1, 3, 8, 4, 10 and 11; one variable is mapped to null.]
It is not a weak matching
In the result, the actor either played in the movie, directed the movie, or is not related at all to the movie
25
Complexity of Evaluating Maximal Weak Matchings and Maximal OR Matchings
26
Data Complexity
• Under data complexity, the time complexity is a function of
  – the size of the database
27
Two Alternatives for Query Evaluation
• A naïve algorithm computes all matchings and then removes subsumed matchings
• A better algorithm avoids computing all matchings – ideally it only computes maximal matchings
• Under data complexity, both algorithms are polynomial time
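The naïve alternative can be sketched as follows. With q query variables and d database objects there are (d+1)^q candidate assignments, which is polynomial in d once q is treated as fixed. The predicate `is_matching` is a hypothetical stand-in for a real weak-matching or OR-matching test, and the toy predicate is invented for illustration only.

```python
from itertools import product

def subsumes(t1, t2):
    """t1 is obtained from t2 by replacing some nulls (None) with values."""
    return t1 != t2 and all(b is None or a == b for a, b in zip(t1, t2))

def naive_maximal_matchings(variables, objects, is_matching):
    """Enumerate every assignment of an object or null to each variable,
    keep the matchings, then remove the subsumed ones."""
    candidates = product([None, *objects], repeat=len(variables))
    matchings = [t for t in candidates if is_matching(dict(zip(variables, t)))]
    return [t for t in matchings if not any(subsumes(s, t) for s in matchings)]

def toy_is_matching(assignment):
    # Hypothetical rule: the root must map to object 1, x to null or object 2.
    return assignment["r"] == 1 and assignment["x"] in (None, 2)

result = naive_maximal_matchings(["r", "x"], [1, 2], toy_is_matching)
```

Of the nine candidate assignments, two pass the toy test, and the one with a null is then removed as subsumed.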
28
Input-Output Complexity
• Under input-output complexity, the time complexity is a function of
  – the size of the query,
  – the size of the database, and
  – the size of the result
29
A Naïve Algorithm vs. A Better Algorithm
• Under I-O complexity, a naïve algorithm is exponential
• Is there a better algorithm with polynomial-time I-O complexity?
  – The answer is positive for DAG queries [Kanza, Nutt & Sagiv 1999]
30
Cyclic Queries
Theorem: For a query Q and a database D, the set of all maximal weak matchings can be computed in O(q^3 d m^2) time, where q is the size of the query, d is the size of the database and m is the size of the result (computing all maximal OR matchings has the same complexity)
31
Full Disjunctions
What is the full disjunction of a set of relations?
How are full disjunctions related to queries with incomplete answers?
32
Movies
m-id | title | year | language
1 | Zelig | 1983 | English
2 | Antz | 1998 | English
3 | Armageddon | 1998 | English
4 | Fantasia | 1940 | English

Actors
a-id | name | date-of-birth
1 | Woody Allen | 1/12/1935
2 | Bruce Willis | 19/3/1955
3 | Julia Roberts | 28/10/1967

Acted-in
a-id | m-id | role
1 | 1 | Zelig
1 | 2 | Z
2 | 3 | Harry

Actors-that-Directed
a-id | m-id
1 | 1

The Full Disjunction of the Given Relations
m-id | title | year | language | a-id | name | date-of-birth | role
1 | Zelig | 1983 | English | 1 | Woody Allen | 1/12/1935 | Zelig
2 | Antz | 1998 | English | 1 | Woody Allen | 1/12/1935 | Z
3 | Armageddon | 1998 | English | 2 | Bruce Willis | 19/3/1955 | Harry
4 | Fantasia | 1940 | English | null | null | null | null
null | null | null | null | 3 | Julia Roberts | 28/10/1967 | null
33
The Full Disjunction of the Given Relations
[The Movies relation and the full disjunction are shown again, as on slide 32.]
The full disjunction does not include subsumed tuples
m-id | title | year | language | a-id | name | date-of-birth | role
1 | Zelig | 1983 | English | null | null | null | null
This tuple will not be in the full disjunction
34
[The four relations and their full disjunction are shown again, as on slide 32.]
m-id | title | year | language | a-id | name | date-of-birth | role
4 | Fantasia | 1940 | English | 3 | Julia Roberts | 28/10/1967 | null
The full disjunction does not include tuples that are based on a Cartesian product rather than a join
35
In the Full Disjunction of a Given Set of Relations:
Every tuple of the input is a part of at least one tuple of the output
Tuples are joined as in a natural join, padded with null values
The result includes only “maximal connected portions”
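These three properties can be checked with a brute-force sketch. It is exponential in the number of relations and is not the algorithm of the talk. It assumes that a set of tuples is connected when the tuples are linked through shared attribute names, and it uses a trimmed version of the movie relations (fewer attributes, no Actors-that-Directed) with None for null.

```python
from itertools import combinations, product

def subsumes(t1, t2):
    return t1 != t2 and all(b is None or a == b for a, b in zip(t1, t2))

def join_consistent(t1, t2):
    """The tuples agree on every attribute they share."""
    return all(t1[a] == t2[a] for a in t1.keys() & t2.keys())

def connected(tuples):
    """The tuples form a connected graph; an edge links tuples sharing an attribute."""
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(len(tuples)):
            if j not in seen and tuples[i].keys() & tuples[j].keys():
                seen.add(j)
                stack.append(j)
    return len(seen) == len(tuples)

def full_disjunction(relations, attributes):
    merged = set()
    # From each relation pick either one tuple or nothing (None).
    for choice in product(*[[None, *rel] for rel in relations]):
        chosen = [t for t in choice if t is not None]
        if not chosen or not connected(chosen):
            continue
        if not all(join_consistent(a, b) for a, b in combinations(chosen, 2)):
            continue
        row = {}
        for t in chosen:
            row.update(t)
        merged.add(tuple(row.get(a) for a in attributes))  # pad with None
    return {t for t in merged if not any(subsumes(s, t) for s in merged)}

movies = [{"m-id": 1, "title": "Zelig"}, {"m-id": 2, "title": "Antz"},
          {"m-id": 3, "title": "Armageddon"}, {"m-id": 4, "title": "Fantasia"}]
actors = [{"a-id": 1, "name": "Woody Allen"}, {"a-id": 2, "name": "Bruce Willis"},
          {"a-id": 3, "name": "Julia Roberts"}]
acted_in = [{"a-id": 1, "m-id": 1, "role": "Zelig"},
            {"a-id": 1, "m-id": 2, "role": "Z"},
            {"a-id": 2, "m-id": 3, "role": "Harry"}]
attributes = ["m-id", "title", "a-id", "name", "role"]
result = full_disjunction([movies, actors, acted_in], attributes)
```

The Fantasia and Julia Roberts tuples stay separate in `result`, since Movies and Actors share no attribute and no Acted-in tuple links them.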
36
Motivation for Full Disjunctions
• Full disjunctions were proposed by Galindo-Legaria as an alternative to outerjoins [SIGMOD’94]
• Rajaraman and Ullman suggested using full disjunctions for information integration [PODS’96]
37
Computing Full Disjunctions for γ-acyclic Relation Schemas
• Rajaraman and Ullman have shown how to evaluate the full disjunction by a sequence of natural outerjoins when the relation schemas are γ-acyclic
• Hence, the full disjunction can be computed in polynomial time, under input-output complexity, when the relation schemas are γ-acyclic
38
Weak Semantics Generalizes Full Disjunctions
• Relations can be converted into a semistructured database
• The full disjunction can be expressed as the union of several queries that are evaluated under weak semantics
39
Example
Creating The Database

Movies
m-id | title
1 | Zelig
2 | Antz
3 | Armageddon
4 | Fantasia

Actors
a-id | name
1 | Woody Allen
2 | Bruce Willis
3 | Julia Roberts

Acted-in
a-id | m-id | role
1 | 1 | Zelig
1 | 2 | Z
2 | 3 | Harry

• A node is created for each tuple
• Edges are added between connected tuples, in both directions
• A root is added, and edges are added from the root to every node
[Figure: the resulting graph with root r.]
We use colors instead of labels
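The construction of the database graph can be sketched in a few lines. Two points are assumptions rather than statements from the slides: "connected tuples" is read as tuples of different relations that share at least one attribute and agree on all shared attributes, and the colors are modeled by labeling each edge with the relation name of its target node.

```python
from itertools import permutations

def build_database_graph(relations):
    """relations maps a relation name to a list of tuples (dicts).
    Returns the tuple nodes and the labeled edges (source, label, target)."""
    nodes = [(name, i) for name, rows in relations.items() for i in range(len(rows))]
    # Root "r" has an edge to every tuple node.
    edges = {("r", name, (name, i)) for name, i in nodes}
    # permutations yields both orders of each pair, so edges go in both directions.
    for (n1, i1), (n2, i2) in permutations(nodes, 2):
        if n1 == n2:
            continue
        t1, t2 = relations[n1][i1], relations[n2][i2]
        shared = t1.keys() & t2.keys()
        if shared and all(t1[a] == t2[a] for a in shared):
            edges.add(((n1, i1), n2, (n2, i2)))
    return nodes, edges

relations = {
    "Movies": [{"m-id": 1, "title": "Zelig"}, {"m-id": 2, "title": "Antz"},
               {"m-id": 3, "title": "Armageddon"}, {"m-id": 4, "title": "Fantasia"}],
    "Actors": [{"a-id": 1, "name": "Woody Allen"}, {"a-id": 2, "name": "Bruce Willis"},
               {"a-id": 3, "name": "Julia Roberts"}],
    "Acted-in": [{"a-id": 1, "m-id": 1, "role": "Zelig"},
                 {"a-id": 1, "m-id": 2, "role": "Z"},
                 {"a-id": 2, "m-id": 3, "role": "Harry"}],
}
nodes, edges = build_database_graph(relations)
```

For the example relations this gives 10 tuple nodes, 10 root edges and 12 directed edges between tuples (each Acted-in tuple is linked to one movie and one actor, in both directions).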
40
Creating The Queries
Example
[The Movies, Actors and Acted-in relations are shown again.]
• A node is created for each relation schema
• Edges are added between connected schemas, in both directions
• The number of queries is equal to the number of schemas
• In each query, the root is connected to a different schema
[Figure: the queries over the nodes Movies, Actors and Acted-in, each with its own root r.]
41
Queries are Evaluated under Weak Semantics
Example
[Figure: the Movies, Actors and Acted-in relations, with the database graph and a query graph rooted at r.]
Result so far:
m-id | title | a-id | name | role
1 | Zelig | 1 | Woody Allen | Zelig
42
Queries are Evaluated under Weak Semantics
Example
[Figure: the relations, the database graph and a query graph, as on slide 41.]
Result so far:
m-id | title | a-id | name | role
1 | Zelig | 1 | Woody Allen | Zelig
2 | Antz | 1 | Woody Allen | Z
43
Queries are Evaluated under Weak Semantics
Example
[Figure: the relations, the database graph and a query graph, as on slide 41.]
Result so far:
m-id | title | a-id | name | role
1 | Zelig | 1 | Woody Allen | Zelig
2 | Antz | 1 | Woody Allen | Z
3 | Armageddon | 2 | Bruce Willis | Harry
44
Queries are Evaluated under Weak Semantics
Example
[Figure: the relations, the database graph and a query graph, as on slide 41.]
Result so far:
m-id | title | a-id | name | role
1 | Zelig | 1 | Woody Allen | Zelig
2 | Antz | 1 | Woody Allen | Z
3 | Armageddon | 2 | Bruce Willis | Harry
null | null | 3 | Julia Roberts | null
45
Queries are Evaluated under Weak Semantics
Example
[Figure: the relations, the database graph and a query graph, as on slide 41.]
Result so far:
m-id | title | a-id | name | role
1 | Zelig | 1 | Woody Allen | Zelig
2 | Antz | 1 | Woody Allen | Z
3 | Armageddon | 2 | Bruce Willis | Harry
null | null | 3 | Julia Roberts | null
46
Example
[Figure: the relations, the database graph and a query graph, as on slide 41.]
Final result:
m-id | title | a-id | name | role
1 | Zelig | 1 | Woody Allen | Zelig
2 | Antz | 1 | Woody Allen | Z
3 | Armageddon | 2 | Bruce Willis | Harry
null | null | 3 | Julia Roberts | null
4 | Fantasia | null | null | null
47
The Algorithm Computes Full Disjunctions in Polynomial Time Under Input-Output Complexity
Theorem: The full disjunction of relations r1, …, rn can be computed in O(n^5 s^2 f^2) time, where n is the number of relations, s is the total size of all the relations and f is the size of the result
48
Generalizing Full Disjunctions
• In a full disjunction, tuples are joined according to equality constraints as in a natural join (or equi-join)
• We can generalize full disjunctions to support constraints that are not merely equality among attributes
49
Example
Movies (m-id, title, year, language, location)
Actors (a-id, name, date-of-birth)
Acted-in (a-id, m-id, role)
Actors-that-Directed (a-id, m-id)
Historical-Events (name, date, description)
Historical-Sites (Country, State, City, Site)
The date of the historical event is a date in the year when the movie was released
The filming location is near the historical site
50
The General Idea
• A set of constraints specifies how tuples should be joined
• The queries and the database are constructed according to the given constraints
  – A pair of nodes is connected by an edge when it satisfies the corresponding constraint
• Queries are evaluated w.r.t. the database under weak semantics
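One way to picture a general constraint is as an arbitrary predicate over a pair of tuples that decides whether an edge is created. The sketch below is only an illustration: the helper names are invented, the two events are hypothetical placeholders, and the predicate encodes the "same year" constraint from the example slide.

```python
from datetime import date

def same_year(movie, event):
    """Constraint: the event's date falls in the movie's release year."""
    return movie["year"] == event["date"].year

def constraint_edges(left, right, constraint):
    """Index pairs (i, j) such that left[i] and right[j] satisfy the constraint."""
    return [(i, j) for i, l in enumerate(left)
                   for j, r in enumerate(right) if constraint(l, r)]

movies = [{"title": "Zelig", "year": 1983}, {"title": "Antz", "year": 1998}]
# Hypothetical events, invented only to exercise the predicate.
events = [{"name": "event A", "date": date(1983, 5, 1)},
          {"name": "event B", "date": date(1970, 1, 1)}]

edges = constraint_edges(movies, events, same_year)
```

Only the pair (Zelig, event A) satisfies the constraint, so a single edge would be added between those two nodes (in both directions, under the construction described earlier).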
51
Another Way of Generalizing Full Disjunctions: Use OR-Semantics
• Generate the queries and the database as before, but evaluate the queries under OR-semantics (rather than weak semantics)
• This relaxes the requirement that every pair of tuples should be join consistent
• Instead, a tuple of the full disjunction is only required to be generated by database tuples that form a connected subgraph, but need not be pairwise join consistent
52
Example
Employees (e-id, ename, city, dept-no)
Departments (dept-no, dname, building)
Located-in (building, city, street)
Employee: (007, James Bond, London, 6)
Department: (6, MI-6, 10)
Located-in: (10, Liverpool, King)
The Full Disjunction
e-id | ename | city | dept-no | dept-no | dname | building | building | city | street
007 | James Bond | London | 6 | 6 | MI-6 | 10 | null | null | null
null | null | null | null | 6 | MI-6 | 10 | 10 | Liverpool | King
53
Example
Employees (e-id, ename, city, dept-no)
Departments (dept-no, dname, building)
Located-in (building, city, street)
Employee: (007, James Bond, London, 6)
Department: (6, MI-6, 10)
Located-in: (10, Liverpool, King)
The Full Disjunction under OR-Semantics
e-id | ename | city | dept-no | dept-no | dname | building | building | city | street
007 | James Bond | London | 6 | 6 | MI-6 | 10 | 10 | Liverpool | King
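The difference between the two results comes down to pairwise join consistency, which a few lines of Python can make explicit. The dict representation and the helper name are assumptions made for illustration.

```python
def join_consistent(t1, t2):
    """The tuples agree on every attribute they share."""
    return all(t1[a] == t2[a] for a in t1.keys() & t2.keys())

employee = {"e-id": "007", "ename": "James Bond", "city": "London", "dept-no": 6}
department = {"dept-no": 6, "dname": "MI-6", "building": 10}
located_in = {"building": 10, "city": "Liverpool", "street": "King"}

pairs = {
    "employee-department": join_consistent(employee, department),
    "department-located_in": join_consistent(department, located_in),
    "employee-located_in": join_consistent(employee, located_in),
}
```

The employee and the location disagree on city (London vs. Liverpool), so the three tuples are connected through the department but are not pairwise join consistent. Under weak semantics they therefore yield two tuples, while OR-semantics combines them into one.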
54
Two Related Problems
The Projection Problem: Computing the projection of the full disjunction on a given set of attributes
The Restriction Problem: Computing only those tuples of the full disjunction that are non-null on a given set of attributes
The projection problem and the restriction problem cannot be solved in polynomial time (under input-output complexity) unless P=NP
55
Conclusion
• Cyclic queries can be computed in polynomial time (in the size of the query, the database and the result) under either OR-semantics or weak semantics
• A reduction of full-disjunction evaluation to query evaluation under weak semantics is described
• Using the reduction, full disjunctions can be computed in polynomial time (in the size of the relation schemas, the relations and the result)
56
Conclusion (continued)
• Full disjunctions can be generalized in two ways
  – By using OR-semantics instead of weak semantics
  – By joining tuples according to general constraints
• Generalized full disjunctions can be useful in the context of data integration from heterogeneous sources
• The projection problem and the restriction problem have polynomial-time algorithms (under input-output complexity) when the relations have γ-acyclic schemas, but not in the general case
57
Thank You
Questions?