computing full disjunctions


Page 1: Computing Full Disjunctions

Computing Full Disjunctions

Yaron Kanza, Yehoshua Sagiv

The Selim and Rachel Benin School of Engineering and Computer Science
The Hebrew University of Jerusalem

Page 2: Computing Full Disjunctions

Overview of the Talk

• OR-semantics and weak semantics for querying incomplete data
  – Complexity of query evaluation
• Full disjunctions as a special case of weak semantics
• Generalizing full disjunctions: the join constraints are not restricted to be equality constraints
• Lower bounds for some related problems

Page 3: Computing Full Disjunctions

Querying Incomplete Data Requires a Special Semantics

• Usually, answers to a query are complete assignments of database objects (or values) to the query variables
• Consequently, partial information is lost
• For example, dangling tuples are lost when joining several relations
  – The purpose of outerjoins and full disjunctions is to solve this problem, i.e., answers could be partial assignments (to some of the variables)
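A minimal sketch of the dangling-tuple problem described above. The two toy relations and the helper names `natural_join` and `left_outerjoin` are illustrative assumptions, not part of the talk; `None` stands in for a null value.

```python
# Hypothetical toy relations; None plays the role of a null value.
movies = [{"m_id": 1, "title": "Zelig"}, {"m_id": 4, "title": "Fantasia"}]
acted_in = [{"a_id": 1, "m_id": 1, "role": "Zelig"}]


def natural_join(r, s):
    """Join two lists of dict-tuples on all attributes they share."""
    out = []
    for t in r:
        for u in s:
            shared = t.keys() & u.keys()
            if all(t[a] == u[a] for a in shared):
                out.append({**t, **u})
    return out


def left_outerjoin(r, s, s_attrs):
    """Like natural_join, but keep dangling tuples of r, padded with None."""
    out = []
    for t in r:
        matches = natural_join([t], s)
        if matches:
            out.extend(matches)
        else:
            # Pad the attributes of s with nulls, then overlay the tuple of r.
            out.append({**{a: None for a in s_attrs}, **t})
    return out


joined = natural_join(movies, acted_in)  # Fantasia is dangling and disappears
padded = left_outerjoin(movies, acted_in, ["a_id", "m_id", "role"])  # Fantasia survives
```

In `joined` the Fantasia tuple is lost because no Acted-in tuple matches it; in `padded` it is kept as a partial answer with nulls for the actor and the role.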

Page 4: Computing Full Disjunctions


Querying Incomplete Semistructured Data

• In semistructured data, incompleteness of data is prevalent

• OR-semantics and weak semantics were introduced so that queries over semistructured data would return maximal answers rather than complete answers [Kanza, Nutt & Sagiv 1999]

Page 5: Computing Full Disjunctions

In the Semistructured Data Model

• Both data and queries are labeled rooted directed graphs
• Query nodes are variables
• Database nodes are objects
• Matchings are assignments of database objects to query variables, such that
  – The database root is assigned to the query root, and
  – Labels are preserved

Page 6: Computing Full Disjunctions

A Semistructured Database About Movies

[Figure: a rooted, edge-labeled graph with objects 1 to 11. Edge labels: movie, actor, title, year, language, director, name, date of birth, acted in. Atomic values: Zelig, 1983, Antz, 1998, English, Woody Allen, 1/12/1935.]

Page 7: Computing Full Disjunctions

A Query

[Figure: a query graph with variables v1, v2, v3, w1, w2, w3, w4. Edge labels: movie, actor, title, language, director, acted in, name, date of birth.]

Under complete semantics, the query returns actor-movie pairs such that the actor played in the movie and was also the director of the movie

Page 8: Computing Full Disjunctions

A complete matching of the query variables to database objects

[Figure: the movie database and the query side by side; the query variables are mapped to database objects 1, 2, 5, 6, 4, 10 and 11.]

Page 9: Computing Full Disjunctions

Constraints on Complete Matchings

• The root constraint is satisfied if the query root is mapped to the database root
• A query edge is an edge constraint:
  – A query edge with a label l is satisfied if it is mapped to a database edge with the same label l

[Figure: the query root r is mapped to the database root 1; a query edge between x and y with label l is mapped to a database edge between 9 and 11 with the same label l.]
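The root constraint and the edge constraints can be checked mechanically. The following is a sketch under stated assumptions: graphs are represented as sets of (source, label, target) triples, and the toy database, the toy query and the function name `is_complete_matching` are hypothetical, not taken from the talk.

```python
# Hypothetical toy graphs: edges are (source, label, target) triples.
db_root = 1
db_edges = {(1, "movie", 2), (1, "actor", 4), (4, "acted in", 2), (2, "title", 5)}
query_root = "r"
query_edges = {("r", "movie", "v1"), ("r", "actor", "w1"), ("w1", "acted in", "v1")}


def is_complete_matching(matching):
    """Check that every variable is assigned and all constraints are satisfied."""
    variables = {query_root} | {n for s, _, t in query_edges for n in (s, t)}
    # A complete matching assigns a database object to every query variable.
    if any(matching.get(v) is None for v in variables):
        return False
    # Root constraint: the query root is mapped to the database root.
    if matching[query_root] != db_root:
        return False
    # Edge constraints: each query edge is mapped to a database edge with the same label.
    return all((matching[s], label, matching[t]) in db_edges
               for s, label, t in query_edges)


good = {"r": 1, "v1": 2, "w1": 4}
bad = {"r": 1, "v1": 5, "w1": 4}  # (1, "movie", 5) is not a database edge
```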

Page 10: Computing Full Disjunctions

Suppose that Node 6 is missing

[Figure: the movie database, with node 6 (the value English, reached by the language edge) marked as removed.]

Page 11: Computing Full Disjunctions

An incomplete matching

[Figure: the movie database without node 6, next to the query; the query variables are mapped to database objects 1, 2, 5, 4, 10 and 11, and w2 is mapped to null.]

This matching is maximal

Page 12: Computing Full Disjunctions

The Reachability Constraint on Partial Matchings

• A query node v that is mapped to a database object o satisfies the reachability constraint if there is a path from the query root to v, such that all edge constraints along this path are satisfied

[Figure: a query with root r, nodes x, y, z, w, v and edge labels l1 to l6, next to a database with root 1; two paths from r to v are shown, one along edges labeled l2, l4, l6 and one along edges labeled l1, l3, l5, together with their images in the database.]

Page 13: Computing Full Disjunctions

Weak Satisfaction of Edge Constraints

• An edge constraint is weakly satisfied if it is either
  – Satisfied (as defined earlier), or
  – One (or more) of its nodes is mapped to a null value

[Figure: examples of a query edge between x and y with label l, mapped to database nodes 9 and 11 or to null values, with database edges labeled l or m.]

Page 14: Computing Full Disjunctions

Weak Matchings

• A partial matching is a weak matching if
  – The root constraint is satisfied
  – The reachability constraint is satisfied by every query node that is mapped to a database node
  – Every edge constraint is weakly satisfied

Page 15: Computing Full Disjunctions

A weak matching

[Figure: the movie database without node 6, next to the query; the query variables are mapped to database objects 1, 2, 5, 4, 10 and 11, and w2 is mapped to null.]

Page 16: Computing Full Disjunctions

A Movie Database

Consider the case where the director edge is missing

[Figure: the movie database without node 6, with the director edge marked as removed.]

Page 17: Computing Full Disjunctions

An incomplete matching that is not a weak matching

[Figure: the movie database without the director edge, next to the query; the query variables are mapped to database objects 1, 2, 5, 4, 10 and 11, and w2 is mapped to null.]

There is an edge that is not weakly satisfied

Page 18: Computing Full Disjunctions

OR Matchings

• A partial matching is an OR matching if
  – The root constraint is satisfied
  – The reachability constraint is satisfied by every query node that is mapped to a database node

Unlike a weak matching, an OR matching does not require every edge constraint to be weakly satisfied
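The definitions of weak matchings and OR matchings can be compared side by side in code. This is only a sketch: it assumes graphs given as sets of (source, label, target) triples, uses `None` for null, and works on a hypothetical toy instance in which the database has no director edge; none of the function names come from the talk.

```python
def satisfied(edge, m, db_edges):
    """Both endpoints are mapped, and the image is a database edge with the same label."""
    s, label, t = edge
    return (m.get(s) is not None and m.get(t) is not None
            and (m[s], label, m[t]) in db_edges)


def weakly_satisfied(edge, m, db_edges):
    """Satisfied, or at least one endpoint is mapped to null."""
    s, _, t = edge
    return m.get(s) is None or m.get(t) is None or satisfied(edge, m, db_edges)


def reachable(m, q_root, q_edges, db_edges):
    """Query nodes reachable from the query root along satisfied edges."""
    seen, frontier = {q_root}, [q_root]
    while frontier:
        v = frontier.pop()
        for e in q_edges:
            if e[0] == v and e[2] not in seen and satisfied(e, m, db_edges):
                seen.add(e[2])
                frontier.append(e[2])
    return seen


def is_or_matching(m, q_root, q_edges, db_root, db_edges):
    if m.get(q_root) != db_root:  # root constraint
        return False
    mapped = {v for v, o in m.items() if o is not None}
    return mapped <= reachable(m, q_root, q_edges, db_edges)  # reachability


def is_weak_matching(m, q_root, q_edges, db_root, db_edges):
    return (is_or_matching(m, q_root, q_edges, db_root, db_edges)
            and all(weakly_satisfied(e, m, db_edges) for e in q_edges))


# Hypothetical toy instance: the database has no "director" edge.
db_edges = {(1, "movie", 2), (1, "actor", 4), (4, "acted in", 2)}
q_edges = {("r", "movie", "v"), ("r", "actor", "w"),
           ("w", "acted in", "v"), ("w", "director", "v")}
m_full = {"r": 1, "v": 2, "w": 4}        # director edge is violated
m_partial = {"r": 1, "v": 2, "w": None}  # the actor variable is null
```

On this instance `m_full` is an OR matching but not a weak matching, because the director edge has two non-null endpoints and no matching database edge, while `m_partial` is both.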

Page 19: Computing Full Disjunctions

Maximal Matchings

• Matchings can be represented as tuples (where numbers are object ids)
• A matching t1 subsumes a matching t2 if t1 can be obtained from t2 by replacing some nulls in t2 with non-null values
• A matching is maximal if no other matching subsumes it
• A query result consists only of maximal matchings

t1 = (1, 5, 2, null)    t2 = (1, null, 2, null)
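Subsumption and the removal of subsumed matchings can be sketched directly on the tuple representation. The function names `subsumes` and `maximal` are illustrative assumptions, and `None` stands for null.

```python
def subsumes(t1, t2):
    """True if t1 is obtained from t2 by replacing at least one null with a value."""
    return (len(t1) == len(t2) and t1 != t2
            and all(b is None or a == b for a, b in zip(t1, t2)))


def maximal(matchings):
    """Keep only the matchings that no other matching subsumes."""
    ms = set(matchings)
    return {t for t in ms if not any(subsumes(u, t) for u in ms)}


t1 = (1, 5, 2, None)
t2 = (1, None, 2, None)
result = maximal([t1, t2])  # t2 is dropped because t1 subsumes it
```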

Page 20: Computing Full Disjunctions


More Examples

Page 21: Computing Full Disjunctions

The Movie Database Before the Removals

[Figure: the full movie database with objects 1 to 11, including node 6 (English) and the director edge.]

Page 22: Computing Full Disjunctions

A complete matching

[Figure: the full movie database next to the query; the query variables are mapped to database objects 1, 2, 5, 6, 4, 10 and 11.]

It is also a maximal weak matching

It is also a maximal OR-matching

In the result, the actor must be both an actor in the movie and the director of the movie

Page 23: Computing Full Disjunctions

A second maximal weak matching

[Figure: the full movie database next to the query; query variables are mapped to database objects 1, 8 and 3, and the remaining four variables are mapped to null.]

In the result, if the actor and the movie are assigned non-null values, then the actor must be both an actor in the movie and the director of the movie

Page 24: Computing Full Disjunctions

A maximal OR-matching

[Figure: the full movie database next to the query; query variables are mapped to database objects 1, 8, 3, 4, 10 and 11, and one variable is mapped to null.]

It is not a weak matching

In the result, the actor either played in the movie, directed the movie, or is not related at all to the movie

Page 25: Computing Full Disjunctions

Complexity of Evaluating Maximal Weak Matchings and Maximal OR Matchings

Page 26: Computing Full Disjunctions

Data Complexity

• Under data complexity, the time complexity is a function of
  – the size of the database

Page 27: Computing Full Disjunctions

Two Alternatives for Query Evaluation

• A naïve algorithm computes all matchings and then removes subsumed matchings

• A better algorithm avoids computing all matchings – ideally it only computes maximal matchings

• Under data complexity, both algorithms are polynomial time

Page 28: Computing Full Disjunctions

Input-Output Complexity

• Under input-output complexity, the time complexity is a function of
  – the size of the query,
  – the size of the database, and
  – the size of the result

Page 29: Computing Full Disjunctions

A Naïve Algorithm vs. A Better Algorithm

• Under I-O complexity, a naïve algorithm is exponential
• Is there a better algorithm with a polynomial time I-O complexity?
  – The answer is positive for DAG queries [Kanza, Nutt & Sagiv 1999]

Page 30: Computing Full Disjunctions

Cyclic Queries

Theorem: For a query Q and a database D, the set of all maximal weak matchings can be computed in $O(q^3 d m^2)$ time, where q is the size of the query, d is the size of the database and m is the size of the result (computing all maximal OR matchings has the same complexity)

Page 31: Computing Full Disjunctions

Full Disjunctions

What is the full disjunction of a set of relations?

How are full disjunctions related to queries with incomplete answers?

Page 32: Computing Full Disjunctions

Movies

m-id | title | year | language
1 | Zelig | 1983 | English
2 | Antz | 1998 | English
3 | Armageddon | 1998 | English
4 | Fantasia | 1940 | English

Actors

a-id | name | date-of-birth
1 | Woody Allen | 1/12/1935
2 | Bruce Willis | 19/3/1955
3 | Julia Roberts | 28/10/1967

Acted-in

a-id | m-id | role
1 | 1 | Zelig
1 | 2 | Z
2 | 3 | Harry

Actors-that-Directed

a-id | m-id
1 | 1

The Full Disjunction of the Given Relations

m-id | title | year | language | a-id | name | date-of-birth | role
1 | Zelig | 1983 | English | 1 | Woody Allen | 1/12/1935 | Zelig
2 | Antz | 1998 | English | 1 | Woody Allen | 1/12/1935 | Z
3 | Armageddon | 1998 | English | 2 | Bruce Willis | 19/3/1955 | Harry
4 | Fantasia | 1940 | English | null | null | null | null
null | null | null | null | 3 | Julia Roberts | 28/10/1967 | null

Page 33: Computing Full Disjunctions

The Full Disjunction of the Given Relations

m-id | title | year | language | a-id | name | date-of-birth | role
1 | Zelig | 1983 | English | 1 | Woody Allen | 1/12/1935 | Zelig
2 | Antz | 1998 | English | 1 | Woody Allen | 1/12/1935 | Z
3 | Armageddon | 1998 | English | 2 | Bruce Willis | 19/3/1955 | Harry
4 | Fantasia | 1940 | English | null | null | null | null
null | null | null | null | 3 | Julia Roberts | 28/10/1967 | null

The full disjunction does not include subsumed tuples

This tuple, obtained from the first tuple of Movies alone, will not be in the full disjunction:

m-id | title | year | language | a-id | name | date-of-birth | role
1 | Zelig | 1983 | English | null | null | null | null

Page 34: Computing Full Disjunctions

The Full Disjunction of the Given Relations

[The relations Movies, Actors, Acted-in and Actors-that-Directed and their full disjunction, as on Page 32.]

The full disjunction does not include tuples that are based on Cartesian Product rather than join. For example, this tuple is not in the full disjunction:

m-id | title | year | language | a-id | name | date-of-birth | role
4 | Fantasia | 1940 | English | 3 | Julia Roberts | 28/10/1967 | null

Page 35: Computing Full Disjunctions

In the Full Disjunction of a Given Set of Relations:

Every tuple of the input is a part of at least one tuple of the output

Tuples are joined as in a natural join, padded with null values

The result includes only “maximal connected portions”
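The three properties above suggest a brute-force way to compute a full disjunction on small inputs. The sketch below is a naive, exponential enumeration and not the polynomial algorithm of the talk. It assumes that a set of tuples (at most one per relation) is joined only if the tuples agree pairwise on shared attributes and are connected through shared attributes. It uses the reduced relations Movies (m-id, title), Actors (a-id, name) and Acted-in (a-id, m-id, role); all function names are hypothetical.

```python
from itertools import product

relations = {
    "Movies": [{"m_id": 1, "title": "Zelig"}, {"m_id": 2, "title": "Antz"},
               {"m_id": 3, "title": "Armageddon"}, {"m_id": 4, "title": "Fantasia"}],
    "Actors": [{"a_id": 1, "name": "Woody Allen"}, {"a_id": 2, "name": "Bruce Willis"},
               {"a_id": 3, "name": "Julia Roberts"}],
    "Acted-in": [{"a_id": 1, "m_id": 1, "role": "Zelig"},
                 {"a_id": 1, "m_id": 2, "role": "Z"},
                 {"a_id": 2, "m_id": 3, "role": "Harry"}],
}
ATTRS = ["m_id", "title", "a_id", "name", "role"]


def consistent(t, u):
    """Join consistency: the tuples agree on every attribute they share."""
    return all(t[a] == u[a] for a in t.keys() & u.keys())


def connected(tuples):
    """The tuples form a connected graph, where an edge means 'shares an attribute'."""
    if not tuples:
        return False
    seen, frontier = {0}, [0]
    while frontier:
        i = frontier.pop()
        for j, u in enumerate(tuples):
            if j not in seen and tuples[i].keys() & u.keys():
                seen.add(j)
                frontier.append(j)
    return len(seen) == len(tuples)


def subsumes(t1, t2):
    return t1 != t2 and all(b is None or a == b for a, b in zip(t1, t2))


def full_disjunction(relations, attrs):
    candidates = set()
    # Choose at most one tuple from each relation (None means "no tuple").
    options = [[None] + rel for rel in relations.values()]
    for choice in product(*options):
        chosen = [t for t in choice if t is not None]
        if not connected(chosen):
            continue  # rules out Cartesian-product combinations
        if not all(consistent(t, u) for t in chosen for u in chosen):
            continue
        merged = {}
        for t in chosen:
            merged.update(t)
        candidates.add(tuple(merged.get(a) for a in attrs))  # pad with None
    # Keep only maximal tuples.
    return {t for t in candidates if not any(subsumes(u, t) for u in candidates)}


fd = full_disjunction(relations, ATTRS)
```

The enumeration visits every combination of tuples, so it is only meant to make the definition concrete on toy data.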

Page 36: Computing Full Disjunctions

Motivation for Full Disjunctions

• Full disjunctions were proposed by Galindo-Legaria as an alternative to outerjoins [SIGMOD’94]
• Rajaraman and Ullman suggested using full disjunctions for information integration [PODS’96]

Page 37: Computing Full Disjunctions

Computing Full Disjunctions for γ-acyclic Relation Schemas

• Rajaraman and Ullman have shown how to evaluate the full disjunction by a sequence of natural outerjoins when the relation schemas are γ-acyclic

• Hence, the full disjunction can be computed in polynomial time, under input-output complexity, when the relation schemas are γ-acyclic

Page 38: Computing Full Disjunctions

Weak Semantics Generalizes Full Disjunctions

• Relations can be converted into a semistructured database

• The full disjunction can be expressed as the union of several queries that are evaluated under weak semantics

Page 39: Computing Full Disjunctions

Creating The Database

Example

Movies

m-id | title
1 | Zelig
2 | Antz
3 | Armageddon
4 | Fantasia

Actors

a-id | name
1 | Woody Allen
2 | Bruce Willis
3 | Julia Roberts

Acted-in

a-id | m-id | role
1 | 1 | Zelig
1 | 2 | Z
2 | 3 | Harry

• A node is created for each tuple
• Edges are added between connected tuples, in both directions
• A root r is added, and edges are added from the root to every node

[Figure: the resulting database graph.] We use colors instead of labels
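The construction above can be sketched as follows. Two points are assumptions rather than statements from the slides: "connected tuples" is read as tuples of different relations that share at least one attribute and agree on all shared attributes, and each edge is labeled with the relation name of its target node (the slides use colors). The function name `build_database_graph` is hypothetical.

```python
relations = {
    "Movies": [{"m_id": 1, "title": "Zelig"}, {"m_id": 2, "title": "Antz"},
               {"m_id": 3, "title": "Armageddon"}, {"m_id": 4, "title": "Fantasia"}],
    "Actors": [{"a_id": 1, "name": "Woody Allen"}, {"a_id": 2, "name": "Bruce Willis"},
               {"a_id": 3, "name": "Julia Roberts"}],
    "Acted-in": [{"a_id": 1, "m_id": 1, "role": "Zelig"},
                 {"a_id": 1, "m_id": 2, "role": "Z"},
                 {"a_id": 2, "m_id": 3, "role": "Harry"}],
}


def build_database_graph(relations):
    """Return (nodes, edges); edges are (source, label, target) triples."""
    # One node per tuple, identified by (relation name, position).
    nodes = [(name, i) for name, rel in relations.items() for i in range(len(rel))]

    def tup(node):
        return relations[node[0]][node[1]]

    edges = set()
    for n in nodes:
        edges.add(("r", n[0], n))  # root edge to every node
    for n1 in nodes:
        for n2 in nodes:
            if n1[0] == n2[0]:
                continue  # only tuples of different relations are linked
            shared = tup(n1).keys() & tup(n2).keys()
            if shared and all(tup(n1)[a] == tup(n2)[a] for a in shared):
                edges.add((n1, n2[0], n2))  # the loop adds both directions
    return nodes, edges


nodes, edges = build_database_graph(relations)
```

On this input the Fantasia and Julia Roberts nodes are linked only to the root, which is why they later surface as tuples padded with nulls.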

Page 40: Computing Full Disjunctions

Creating The Queries

Example (the relations Movies, Actors and Acted-in, as on Page 39)

• A node is created for each relation schema
• Edges are added between connected schemas, in both directions
• The number of queries is equal to the number of schemas
• In each query, the root is connected to a different schema

[Figure: query graphs over the nodes Movies, Actors and Acted-in, each with a root r.]
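The query construction can be sketched in the same style. It assumes that two schemas are "connected" when they share an attribute and, as an illustrative choice, labels each edge with the name of its target schema; `build_queries` is a hypothetical name.

```python
schemas = {
    "Movies": {"m_id", "title"},
    "Actors": {"a_id", "name"},
    "Acted-in": {"a_id", "m_id", "role"},
}


def build_queries(schemas):
    """Return one query per schema; each query is a set of (source, label, target) edges."""
    # Edges between schemas that share an attribute, in both directions.
    schema_edges = {(s, t, t) for s in schemas for t in schemas
                    if s != t and schemas[s] & schemas[t]}
    # In each query, the root r is connected to a different schema.
    return {start: schema_edges | {("r", start, start)} for start in schemas}


queries = build_queries(schemas)
```

Movies and Actors share no attribute, so no edge connects them directly; they are linked only through Acted-in.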

Page 41: Computing Full Disjunctions

Queries are Evaluated under Weak Semantics

Example (the relations Movies, Actors and Acted-in, as on Page 39, and the query graph over Movies, Actors and Acted-in with root r)

Result tuples shown so far:

m-id | title | a-id | name | role
1 | Zelig | 1 | Woody Allen | Zelig

Page 42: Computing Full Disjunctions

Queries are Evaluated under Weak Semantics

Example (the relations Movies, Actors and Acted-in, as on Page 39, and the query graph over Movies, Actors and Acted-in with root r)

Result tuples shown so far:

m-id | title | a-id | name | role
1 | Zelig | 1 | Woody Allen | Zelig
2 | Antz | 1 | Woody Allen | Z

Page 43: Computing Full Disjunctions

Queries are Evaluated under Weak Semantics

Example (the relations Movies, Actors and Acted-in, as on Page 39, and the query graph over Movies, Actors and Acted-in with root r)

Result tuples shown so far:

m-id | title | a-id | name | role
1 | Zelig | 1 | Woody Allen | Zelig
2 | Antz | 1 | Woody Allen | Z
3 | Armageddon | 2 | Bruce Willis | Harry

Page 44: Computing Full Disjunctions

Queries are Evaluated under Weak Semantics

Example (the relations Movies, Actors and Acted-in, as on Page 39, and the query graph over Movies, Actors and Acted-in with root r)

Result tuples shown so far:

m-id | title | a-id | name | role
1 | Zelig | 1 | Woody Allen | Zelig
2 | Antz | 1 | Woody Allen | Z
3 | Armageddon | 2 | Bruce Willis | Harry
null | null | 3 | Julia Roberts | null

Page 45: Computing Full Disjunctions

Queries are Evaluated under Weak Semantics

Example (the relations Movies, Actors and Acted-in, as on Page 39, and the query graph over Movies, Actors and Acted-in with root r)

Result tuples shown so far:

m-id | title | a-id | name | role
1 | Zelig | 1 | Woody Allen | Zelig
2 | Antz | 1 | Woody Allen | Z
3 | Armageddon | 2 | Bruce Willis | Harry
null | null | 3 | Julia Roberts | null

Page 46: Computing Full Disjunctions

Queries are Evaluated under Weak Semantics

Example (the relations Movies, Actors and Acted-in, as on Page 39, and the query graph over Movies, Actors and Acted-in with root r)

Result tuples shown so far:

m-id | title | a-id | name | role
1 | Zelig | 1 | Woody Allen | Zelig
2 | Antz | 1 | Woody Allen | Z
3 | Armageddon | 2 | Bruce Willis | Harry
null | null | 3 | Julia Roberts | null
4 | Fantasia | null | null | null

Page 47: Computing Full Disjunctions

The Algorithm Computes Full Disjunctions in Polynomial Time Under Input-Output Complexity

Theorem: The full disjunction of relations r1, …, rn can be computed in $O(n^5 s^2 f^2)$ time, where n is the number of relations, s is the total size of all the relations and f is the size of the result

Page 48: Computing Full Disjunctions


Generalizing Full Disjunctions

• In a full disjunction, tuples are joined according to equality constraints as in a natural join (or equi-join)

• We can generalize full disjunctions to support constraints that are not merely equality among attributes

Page 49: Computing Full Disjunctions

Example

Movies (m-id, title, year, language, location)
Actors (a-id, name, date-of-birth)
Acted-in (a-id, m-id, role)
Actors-that-Directed (a-id, m-id)

Historical-Events (name, date, description)
Historical-Sites (Country, State, City, Site)

Join constraints:

• The date of the historical event is a date in the year when the movie was released
• The filming location is near the historical site

Page 50: Computing Full Disjunctions

The General Idea

• A set of constraints specifies how tuples should be joined
• The queries and the database are constructed according to the given constraints
  – A pair of nodes is connected by an edge when it satisfies the corresponding constraint

• Queries are evaluated w.r.t. the database under weak semantics
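A general join constraint can be modeled as a predicate over a pair of tuples. The sketch below is an illustration only: the sample tuples, the `same_year` predicate and the `edges_for` helper are hypothetical, since the talk gives just the schemas and an informal description of the constraints.

```python
from datetime import date

# Hypothetical tuples; the talk gives only the schemas and the constraints.
movies = [{"m_id": 1, "title": "Zelig", "year": 1983}]
events = [{"name": "Event A", "date": date(1983, 5, 1)},
          {"name": "Event B", "date": date(1990, 1, 1)}]


def same_year(movie, event):
    """Constraint: the event's date falls in the year the movie was released."""
    return event["date"].year == movie["year"]


def edges_for(left, right, constraint):
    """Index pairs (i, j) of tuples that satisfy the constraint."""
    return {(i, j)
            for i, t in enumerate(left)
            for j, u in enumerate(right)
            if constraint(t, u)}


pairs = edges_for(movies, events, same_year)
```

Each resulting pair would become an edge (in both directions) of the database graph, exactly where an equality constraint would have produced one in the ordinary full disjunction.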

Page 51: Computing Full Disjunctions

Another Way of Generalizing Full Disjunctions: Use OR-Semantics

• Generate the queries and the database as before, but the queries are evaluated under OR-semantics (rather than weak semantics)

• This relaxes the requirement that every pair of tuples should be join consistent

• Instead, a tuple of the full disjunction is only required to be generated by database tuples that form a connected subgraph, but need not be pairwise join consistent

Page 52: Computing Full Disjunctions

Example

Employees (e-id, ename, city, dept-no)
Departments (dept-no, dname, building)
Located-in (building, city, street)

Employee: (007, James Bond, London, 6)
Department: (6, MI-6, 10)
Located-in: (10, Liverpool, King)

The Full Disjunction

e-id | ename | city | dept-no | dept-no | dname | building | building | city | street
007 | James Bond | London | 6 | 6 | MI-6 | 10 | null | null | null
null | null | null | null | 6 | MI-6 | 10 | 10 | Liverpool | King

Page 53: Computing Full Disjunctions

Example

Employees (e-id, ename, city, dept-no)
Departments (dept-no, dname, building)
Located-in (building, city, street)

Employee: (007, James Bond, London, 6)
Department: (6, MI-6, 10)
Located-in: (10, Liverpool, King)

The Full Disjunction under OR-Semantics

e-id | ename | city | dept-no | dept-no | dname | building | building | city | street
007 | James Bond | London | 6 | 6 | MI-6 | 10 | 10 | Liverpool | King
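The difference between the two results of this example comes down to pairwise join consistency versus connectivity. The following sketch checks both properties for the three tuples of the example; the function names are hypothetical, and "linked" is assumed to mean sharing an attribute and agreeing on all shared attributes.

```python
employee = {"e_id": "007", "ename": "James Bond", "city": "London", "dept_no": 6}
department = {"dept_no": 6, "dname": "MI-6", "building": 10}
located_in = {"building": 10, "city": "Liverpool", "street": "King"}


def consistent(t, u):
    """The tuples agree on every attribute they share."""
    return all(t[a] == u[a] for a in t.keys() & u.keys())


def linked(t, u):
    """The tuples share at least one attribute and agree on all shared attributes."""
    return bool(t.keys() & u.keys()) and consistent(t, u)


def pairwise_consistent(tuples):
    return all(consistent(t, u) for t in tuples for u in tuples)


def connected_by_links(tuples):
    """The tuples form a connected graph whose edges are the linked pairs."""
    seen, frontier = {0}, [0]
    while frontier:
        i = frontier.pop()
        for j, u in enumerate(tuples):
            if j not in seen and linked(tuples[i], u):
                seen.add(j)
                frontier.append(j)
    return len(seen) == len(tuples)


all_three = [employee, department, located_in]
```

The employee and the location disagree on city (London versus Liverpool), so the three tuples are not pairwise consistent and yield two separate tuples in the ordinary full disjunction. They are still connected through the department tuple, which is enough under OR-semantics to combine them into one tuple.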

Page 54: Computing Full Disjunctions

Two Related Problems

The Projection Problem: Computing the projection of the full disjunction on a given set of attributes

The Restriction Problem: Computing only those tuples of the full disjunction that are non-null on a given set of attributes

The projection problem and the restriction problem cannot be solved in polynomial time (under input-output complexity) unless P=NP

Page 55: Computing Full Disjunctions


Conclusion

• Cyclic queries can be computed in polynomial time (in the size of the query, the database and the result) under either OR-semantics or weak semantics

• A reduction of full-disjunction evaluation to query evaluation under weak semantics is described

• Using the reduction, full disjunctions can be computed in polynomial time (in the size of the relation schemas, the relations and the result)

Page 56: Computing Full Disjunctions

Conclusion (continued)

• Full disjunctions can be generalized in two ways
  – By using OR-semantics instead of weak semantics
  – By joining tuples according to general constraints

• Generalized full disjunctions can be useful in the context of data integration from heterogeneous sources

• The projection problem and the restriction problem have polynomial-time algorithms (under input-output complexity) when the relations have γ-acyclic schemas, but not in the general case

Page 57: Computing Full Disjunctions

Thank You

Questions?