
Theophanis Tsandilas

A thesis submitted in conformity with the requirements for the degree of Master

Graduate Department of Computer Science, University of Toronto

Copyright © 2001 by Theophanis Tsandilas

National Library of Canada

Acquisitions and Bibliographic Services

395 Wellington Street, Ottawa ON K1A 0N4, Canada

The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

Abstract

Rewriting Queries over XML Views

Theophanis Tsandilas

Master

Graduate Department of Computer Science

University of Toronto

2001

XML is widely used as the standard format of data representation and data exchange on the Web, and database research has recently focused on exploiting its experience in traditional databases to handle similar issues in the context of the new data format. This thesis deals with the problem of query composition in the context of XML-QL, which is one of the most popular XML query languages. More specifically, we study how a user query defined over one or more views can be rewritten so that the rewritten query refers only to source data. This problem has been extensively studied in the context of relational and object-oriented databases, but several issues have not yet been examined for XML query languages.

The thesis presents a simple rewriting algorithm which is based on a matching procedure between properties of the user query and properties of the view query. The purpose of this procedure is to derive the bindings of the user-query variables without materializing the view. The underlying data model is object-based and unordered. Special issues that we study include multiple view references, equality conditions between variables in the user query, and element grouping in the view.

To my parents

Panagiotis and Konstantina

for their love and continuous support.

First, I would like to thank my advisor Renée Miller for her help and supervision during the two years of my master's. Although our interests did not coincide, our co-operation gave me many valuable lessons. I also want to thank Alberto Mendelzon for his helpful comments on my thesis.

Secondly, I feel grateful to my closest friends in Toronto, and mainly to the small Greek community of the Department of Computer Science, who hosted and directed me in my first steps in Canada, shared with me funny experiences as roommates, worked with me in course assignments and projects, drank and danced with me in the bars and clubs of Toronto, sang with me in the cold nights of the winter, and, to conclude, offered me two of the most memorable years of my life. In particular, I want to thank my precious Bowen for her intensive devotion during the last three months of my master's.

Above all, I want to thank my family in Greece and mainly my parents, who have never stopped encouraging and supporting me all these years of my studies. I prefer thinking that the company of my little brother Alexandros, who is always trying to email me their news from Greece, has helped them to forget my long absence.

Contents

1 Introduction
  1.1 Preliminaries
  1.2 Motivation
  1.3 Related work and scope of the thesis
  1.4 Organization of the thesis

2 Terminology
  2.1 XML-QL
    2.1.1 Data Model
    2.1.2 Syntax
    2.1.3 Semantics
    2.1.4 Splitting pattern expressions
    2.1.5 Object Identifiers
    2.1.6 Value equality
  2.2 Views
  2.3 Query Rewriting
    2.3.1 Constructor and Pattern trees
    2.3.2 Matching pattern trees to constructor trees

3 Query composition: Base case
  3.1 Formalizing the problem
    3.1.1 Assumptions
  3.2 Matching the user query against the view
    3.2.1 The matching algorithm
    3.2.2 Deriving the variable bindings from the matchings
  3.3 The rewriting algorithm
    3.3.1 Detailed algorithm
    3.3.2 Justification of the algorithm
  3.4 Regular path expressions
  3.5 Multiple pattern expressions in the user query
  3.6 Handling variable equality in the user query

4 Introducing explicit OID definitions
  4.1 Splitting pattern trees
  4.2 Matching grouped contents

5 Conclusions and future work

A XML-QL Grammar

Bibliography

Chapter 1

Introduction

1.1 Preliminaries

The notion of views is essential in traditional database systems. Views have been studied extensively in the context of the relational and object database models, but the relevant research is still immature in the context of XML [Con98].

The recent emergence of XML as the standard exchange format between applications on the Internet has important implications for the database community. Classical data models are not adequate to describe XML data. The introduction of a data model also requires the development of new storage mechanisms and new query languages, implying a series of new issues such as indexing, query optimization, and database design.

In this new world of XML, the importance of views is even more crucial than in standard database applications. XML is only a syntax and cannot uniquely describe a certain portion of data. Different authors may use totally different XML structures or names to describe the same data. XML views will play a significant role in translating information between sources and applications that use different schemata and ontologies. They will also provide the means to integrate multiple heterogeneous sources. Having in mind that large volumes of data are stored using traditional data models, e.g., relational, views may also be involved in migrating data from these models to XML.

Issues that have already been studied include XML data models and query languages [SLR98, FSW99], XML algebras [BT99, BMR99, FSW00, BFS00], and some preliminary work on query optimization [MW99]. The most popular XML query languages that have emerged are: XML-QL [DFF+98, DFF+99], Lorel [GMW99], XQL [RLS98], YATL [CDSS98], UnQL [BFS00], XQuery [CCF+01], and more recently Quilt [Don00]. The language that is used in this thesis is XML-QL. Existing work on XML views involves issues like view definition [BLP+98, LPVV99], maintenance of materialized views [LD00], active views [MSA+99, Abi99], and XML views over relational data [MF00, Bar99].

1.2 Motivation

This thesis studies the problem of query rewriting in the context of XML views. More specifically, given an XML query (we call it a user query) that is defined over one or more virtual (non-materialized) XML views, we find an equivalent query (we call it a rewritten query) that is defined over the source data. The term "query rewriting" also refers to the reverse problem: given a query defined over a database and a set of views over the same database, find equivalent rewritten queries defined over this set of views. Query composition is another term that is used to describe the problem that we study, since the user query is composed with the view query in order to produce a single query.

Query rewriting is just one of the methods of evaluating queries over views. Other ways to process queries over a virtual view are: (i) materialize the view on demand; (ii) rewrite the view so that only the portion of the view that is relevant to the query is materialized; and (iii) translate the query and the view into expressions of a compositional algebra and apply the composition and any possible optimizations at the algebraic level. The second method is clearly more efficient than the first one. The query rewriting approach is even more efficient, since view rewriting requires the partial materialization of the view and the execution of the original query over this view. However, we should mention that query rewriting can lead to huge rewritten queries whose optimization may be difficult or even infeasible [AGM+97].

The third method is similar to the query rewriting approach, but composition is based on algebraic transformations rather than transformations at the level of the query language. Mostly due to the lack of a group-by operator in the relational algebra to express grouping, early composition and unnesting techniques were based on transformations over SQL queries instead of applying algebraic transformations. The correctness and completeness of such transformations can be more clearly and strictly proved using a formal algebra¹. Later approaches used algebraic techniques to perform query composition and unnesting optimizations, mainly in the world of object-oriented databases [CM93, Feg98]. However, due to limitations in existing optimizers, query rewriting algorithms can be more efficient than algebraic transformations that are based on these optimizers. Having in mind that query optimization is still in its infancy in the context of XML and that optimizers for XML query languages are "under construction", the development of query rewriting techniques is significant.

Example 1.1 Consider the view that is expressed by the following XML-QL query:

V: WHERE <Library>
           <book>
             <title>$t</>
             <author>$a</>
             <publisher>Wise Reader</>
           </>
         </> IN "library.xml"
   CONSTRUCT <author>
               <name>$a</>
               <book>$t</>
             </>

¹ Transformations over SQL resulted in a number of bugs which were finally corrected [GW87].

The above statement creates the view of the authors and their books that are published by "Wise Reader". The following query asks for the book titles that are written by "Marcel Proust":

Q: WHERE <view>
           <author>
             <name>Marcel Proust</>
             <book>$t</>
           </>
         </> IN V
   CONSTRUCT <title>$t</>

According to the full materialization approach, we have to compute the view's result and then apply the query over this result. The drawback of this approach is that the query is only interested in the books of a specific author, so the computation of the whole view query is redundant.

The view rewriting approach suggests the transformation of the view and the insertion of the condition $a = "Marcel Proust" in the view definition (we can also replace the variable $a by the string "Marcel Proust"). The drawback mentioned above is eliminated, but we still need to evaluate two separate queries.

Finally, the query rewriting approach suggests the transformation of the query as follows:

Q': WHERE <Library>
            <book>
              <title>$t</>
              <author>Marcel Proust</>
              <publisher>Wise Reader</>
            </>
          </> IN "library.xml"
    CONSTRUCT <title>$t</>

The query is directly applied on the source data, so the desired result is produced by a single query.
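The three approaches of Example 1.1 can be compared on a toy document. The following is only a minimal Python sketch, not an XML-QL evaluator: the sample library, the function names, and the use of `xml.etree.ElementTree` are assumptions made for illustration, and the rewritten query is assumed to keep the view's publisher condition ("Wise Reader").

```python
import xml.etree.ElementTree as ET

# Hypothetical source document; titles and publishers are illustrative only.
LIBRARY = """
<Library>
  <book><title>Swann's Way</title><author>Marcel Proust</author><publisher>Wise Reader</publisher></book>
  <book><title>The Idiot</title><author>Fyodor Dostoyevski</author><publisher>Wise Reader</publisher></book>
  <book><title>Time Regained</title><author>Marcel Proust</author><publisher>Other Press</publisher></book>
</Library>
"""

def view(library):
    """Materialize V: one <author> element per book published by Wise Reader."""
    root = ET.Element("view")
    for book in library.findall("book"):
        if book.findtext("publisher") == "Wise Reader":
            author = ET.SubElement(root, "author")
            ET.SubElement(author, "name").text = book.findtext("author")
            ET.SubElement(author, "book").text = book.findtext("title")
    return root

def query_over_view(view_root):
    """User query Q: titles under <author> elements named Marcel Proust."""
    return {a.findtext("book") for a in view_root.findall("author")
            if a.findtext("name") == "Marcel Proust"}

def rewritten_query(library):
    """Rewritten query: the same answer computed directly on the source."""
    return {b.findtext("title") for b in library.findall("book")
            if b.findtext("author") == "Marcel Proust"
            and b.findtext("publisher") == "Wise Reader"}

library = ET.fromstring(LIBRARY.strip())
composed = query_over_view(view(library))   # full materialization
direct = rewritten_query(library)           # query rewriting
```

Both `composed` and `direct` contain the same titles, but the second path never builds the intermediate view document.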

Query rewriting becomes more crucial when there are many levels of views, i.e., views defined over other views. In this case, instead of evaluating a potentially large number of queries returning redundant portions of data, we compute a single query that produces only the desired data.

1.3 Related work and scope of the thesis

Rewriting queries defined over views can be seen as a special case of the general problem of query unnesting, which is extensively studied in the context of relational [Kim82, GW87] and object-oriented databases [CM93, Feg98].

In the relational world, query composition is simply performed by merging the FROM and WHERE clauses of the corresponding queries, as long as views are defined as simple SPJ queries. However, when a view is defined as SELECT DISTINCT, transformations to move or to pull up DISTINCT should preserve the number of duplicates correctly [PHH92]. The situation is also complex when aggregates are present in the view [Mur92]. Aggregates are not examined in this thesis.

The problem is similar in the world of object-oriented databases. Queries defined over views correspond to the case of nesting in the FROM clause, when no predicate dependency exists between the inner and the outer block of a nested query. Such nested expressions can be transformed into simple join expressions [CM93].

There are several issues that make the problem different in the context of XML and XML query languages.

1. The schema of XML data is not always known. As a result, the schema of the view is only partially known.

2. More than one way of matching the user-query variables with the view variables may be possible, since an element name can appear multiple times even under the same parent element.

3. XML query languages allow the use of tag variables and regular path expressions, which make the problem of matching between the variables of the view and the query harder.

4. Skolem functions are used to group data.

5. In the XML world, order is often important. The existence of index operators makes the problem even harder. This thesis does not address the order problem.

Query composition in the context of semistructured data was first addressed in MSL [PAGM96], a datalog-like query language. A similar approach is presented by Papakonstantinou and Vassalos [PV99a] for TSL, which is also a datalog-based query language for semistructured data. The unification algorithms that the above frameworks propose are analogous to our matching algorithm. The problem is also examined by Buneman et al. [BFS00] in the context of UnQL [BDHS96], a query language and algebra based on structural recursion. The query composition is performed at the algebraic level. However, the underlying data model is value-based as opposed to the object-based model that we study. In the XML-QL data model, XML data are represented as labeled graphs, whereas in the UnQL data model, XML data are represented as sets of label/value pairs.

The composition algorithm presented in the SilkRoute framework [MF00] is the closest to our approach. Views are expressed as RXL (Relational to XML Transformation Language) queries, which inherit the SELECT and FROM clauses from SQL and the CONSTRUCT clause from XML-QL. In other words, RXL queries transform relational data to XML data. User queries are expressed as XML-QL queries. We can enumerate several differences between our approach and SilkRoute.

1. Object identifiers are not always explicitly defined in the constructor of a view. In several cases, these identifiers should be present in the constructor of the rewritten query in order to ensure that the correct object-elements are produced. The way we derive them is different in XML-QL than in RXL queries.

2. A variable in the CONSTRUCT clause of an RXL view can only be mapped to a variable appearing in the WHERE clause of the XML-QL query, since it can only represent atomic values (like variables in SQL queries). In our case, a variable in the constructor of the view can also be mapped to a whole pattern subexpression, since it can represent complex non-atomic values. This difference is reflected in the rewriting algorithm.

3. We investigate how equality conditions between variables in the user query are handled.

4. We present a different approach to handle grouping by Skolem functions in the view, which is based on splitting the pattern expressions of the user query. We also indicate that a grouping performed by a Skolem function should sometimes be expressed by means of a query in order to perform a correct variable substitution during the rewriting process.

1.4 Organization of the thesis

The thesis is organized as follows.

• In Chapter 2, we describe XML-QL and introduce the terms that are used in the next chapters.

• In Chapter 3, we present our query composition approach when no explicitly defined object identifiers exist in the CONSTRUCT clause of the view.

• In Chapter 4, we introduce explicitly defined object identifiers in the user query and suggest a solution to the problem.

• Finally, in Chapter 5, we give a conclusion and present the main issues for future work.

Chapter 2

Terminology

2.1 XML-QL

XML-QL is a declarative XML query language that can extract data from existing XML documents and construct new XML documents. An XML-QL query consists of three parts: (i) a pattern part, which matches elements in the input document and binds variables; (ii) a filter part, which provides conditions for the bound variables; and (iii) a constructor part, which specifies the result in terms of the bound variables. The pattern and the filter part can include multiple pattern and filter expressions, respectively. Patterns and filters appear in the WHERE clause, while the constructor appears in the CONSTRUCT clause of the query.

Example 2.1 The following query returns all the titles of books written by "Dostoyevski", published after 1990:

Q: WHERE <Library>                      //pattern
           <book published=$par>
             <title>$t</>
             <author>Dostoyevski</>
           </>
         </> IN "library.xml",
         $par > 1990
   CONSTRUCT

Patterns and constructors have the structure of normal XML data, but they can also contain variables. Patterns can also include regular path expressions, which can be used to specify element paths of arbitrary depth, and tag variables, which can be considered as a way of querying the schema of the data.

Example 2.2 The following query returns all the authors and their writings. The schema of the data is not known. We do not know what kind of writings exist, e.g., whether $w is bound to a book or an article element, and at what depth under Library they appear. By means of regular path expressions and tag variables, we pose a query that does not require the knowledge of the exact schema.

Q: WHERE <Library.*>
           <$w>
             <title>$t</>
             <author|writer>$a</>
           </>
         </> IN "library.xml"
   CONSTRUCT <author ID=AuthorID($a)>
               <name>$a</>
               <title>$t</>
             </>

In the above example, the results are grouped by the name of the authors. This is achieved by using the Skolem function AuthorID, which generates a new identifier for each distinct value of $a. According to the data model that we use and describe in the next paragraph, duplicates are eliminated. Therefore, only a single name element will appear under each produced author element. Skolem functions are powerful and the rewriting algorithm is based on their use.
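The grouping effect of a Skolem function can be illustrated outside XML-QL. The sketch below is an assumption-laden Python analogy: the `Skolem` class, the identifier format, and the sample bindings are hypothetical, and children are kept in a set to mimic the duplicate elimination of the unordered data model.

```python
from itertools import count

class Skolem:
    """Returns the same identifier whenever it is called with the same arguments."""
    def __init__(self, name):
        self.name = name
        self._ids = {}
        self._next = count(1)

    def __call__(self, *args):
        if args not in self._ids:
            self._ids[args] = f"{self.name}#{next(self._next)}"
        return self._ids[args]

author_id = Skolem("AuthorID")

# Hypothetical variable bindings ($a, $t) produced by the WHERE clause.
bindings = [
    ("Proust", "Swann's Way"),
    ("Proust", "Time Regained"),
    ("Dostoyevski", "The Idiot"),
]

# One <author> element per distinct OID; children form a set, so the
# repeated <name> child of the same author collapses into one.
result = {}
for a, t in bindings:
    children = result.setdefault(author_id(a), set())
    children.add(("name", a))
    children.add(("title", t))
```

Here the two Proust bindings share one identifier, so they contribute to a single author element with one name child and two title children.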

2.1.1 Data Model

XML-QL supports two different data models: an unordered and an ordered data model.

The unordered data model considers that an XML document is a graph in which each node is represented by a unique object identifier (OID). A graph has a unique root node and is labeled as follows: graph edges are labeled with element tags, nodes are labeled with sets of attribute-value pairs, and leaves are labeled with string values. Graph edges correspond to element tags, while nodes correspond to element contents. Since every element is denoted by a tag which contains a single label, it may not be clear how a certain node can be pointed to by two different edges. An element can refer to other elements by means of IDREF attributes. The value of an IDREF attribute is an OID. XML-QL blurs the distinction between IDREF attributes and elements, and therefore labeled edges represent both element tags and IDREF attribute names. Since a certain element can be referenced by several other elements, a graph node can be pointed to by more than one edge. Attributes are associated with nodes. Two nodes can be connected by several edges, but with the following restriction: a node cannot have two outgoing edges with the same labels and the same values, i.e., between any two nodes there can be at most one edge with a given label, and a node cannot have two leaf children with the same label and the same string value. This condition means that XML documents denote sets.
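The edge restriction of the unordered model can be pictured with a small sketch. This is only an illustration under stated assumptions: the `DocumentGraph` class is hypothetical, internal nodes are identified by integer OIDs, and leaf children are represented directly by their string values.

```python
class DocumentGraph:
    """Unordered document graph: edges are (source OID, label, target) triples."""

    def __init__(self):
        # A set enforces the restriction: the same source, label, and
        # target (OID or leaf string value) can occur at most once.
        self.edges = set()

    def add_edge(self, source, label, target):
        self.edges.add((source, label, target))

    def children(self, source, label):
        return {t for s, l, t in self.edges if s == source and l == label}

g = DocumentGraph()
g.add_edge(1, "author", "Proust")
g.add_edge(1, "author", "Proust")        # duplicate leaf child, eliminated
g.add_edge(1, "author", "Dostoyevski")   # same label, different value, kept
g.add_edge(1, "book", 2)                 # edge to an internal node with OID 2
```

In the ordered model described next, the edge collection would instead be a list, so that repeated triples and their positions are retained.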

In the ordered data model, the corresponding document graph contains a totally ordered set of nodes. The order of the nodes corresponds to the natural order of elements in the document. Given a total order on nodes, we can enforce a local order on the outgoing edges of each node. In the ordered model, the above restriction is not retained, so arbitrarily many edges with the same source, same edge label, and same destination value can appear. According to this model, duplicates are not eliminated and XML documents denote bags.

In our study, the underlying model is unordered, so rewritten queries do not necessarily preserve the right order in the output.

2.1.2 Syntax

As XML-QL is still evolving, its syntax varies in different publications [DFF+98, Fer99]. In Appendix A, we present the XML-QL syntax that we consider. XML-QL queries can also be expressed as functions. For simplicity, we consider only functions that have no arguments:

function GreekCities() {
  WHERE <city>
          <name> $c </>
          <population> $p </>
          <country> Greece </>
        </> IN "cities.xml"
  CONSTRUCT <city>
              <name> $c </>
              <inhabitants> $p </>
            </>
}

2.1.3 Semantics

We continue with our XML-QL semantics definition. Consider an XML-QL query WHERE P CONSTRUCT C and the set of variables x_1, ..., x_k that are bound by one or more conditions that appear in P. Let G be a graph that corresponds to an XML document. We construct a table R(x_1, ..., x_k) with one column for each variable. Each row corresponds to a binding of the variables that satisfies all conditions in P. We should mention that a certain row is produced when each pattern expression in P matches a certain XML subgraph of the source data. More information about how pattern expressions match source XML data can be found in the XML-QL proposal [DFF+98]. Two different rows always correspond to two different matched XML subgraphs for at least one pattern expression of P.

In each row, a variable can be bound either to an OID that corresponds to an internal non-leaf node in the graph, or to a string value (leaf node, attribute value, or tag label).

Let C(x_1, ..., x_l) be the template contained in the CONSTRUCT clause, depending on a subset of the variables x_1, ..., x_k. If x_1^i, ..., x_l^i are the bindings in row i of the table R, where i = 1..n, then, for that row, we construct the XML fragment C_i := C(x_1^i, ..., x_l^i). Each C_i is called a tuple. The query's answer contains all the produced tuples, without taking into consideration any order.
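The two-step semantics above (build the binding table R, then instantiate the template once per row) can be sketched in a few lines. This is a hedged Python illustration for the GreekCities function only: the source list, the population figures, and the function names `bind` and `construct` are all hypothetical, and XML fragments are modeled as nested tuples.

```python
# Hypothetical source elements; the population figures are placeholders.
source = [
    {"name": "Athens", "population": 100, "country": "Greece"},
    {"name": "Patras", "population": 20, "country": "Greece"},
    {"name": "Toronto", "population": 300, "country": "Canada"},
]

def bind(elements):
    """Table R($c, $p): one row per element that satisfies the pattern."""
    return [{"c": e["name"], "p": e["population"]}
            for e in elements if e["country"] == "Greece"]

def construct(row):
    """Template C($c, $p) instantiated for one row of R."""
    return ("city", (("name", row["c"]), ("inhabitants", row["p"])))

R = bind(source)

# The answer is the collection of tuples C_i, with no order implied,
# so a set is used here.
answer = {construct(row) for row in R}
```

Each row of `R` yields exactly one tuple, which mirrors the definition C_i := C(x_1^i, ..., x_l^i).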

The above description does not clarify how a variable could be bound to the content of an XML fragment which contains both elements and string values. For instance, consider that a variable var is bound to the content of the following XML fragment:

<city>
  Athens
  <country> Greece </country>
</city>

It is not clear whether the edge city points to a leaf node or to an internal node. In such cases, extra edges labeled as CDATA encapsulate the string values [DFF+99]:

<city>
  <CDATA> Athens </CDATA>
  <country><CDATA> Greece </CDATA></country>
</city>

It is now clear that the variable var is bound to the OID of the internal node that is pointed to by the edge city.

2.1.4 Splitting pattern expressions

Consider the following pattern expression:

WHERE E_1 ... E_n IN Source

where E_1, ..., E_n are distinct XML-QL patterns. The pattern expression is equivalent to the following set of pattern expressions:

WHERE E_1 IN Source, ..., E_n IN Source

2.1.5 Object Identifiers

Each element appearing in the CONSTRUCT part of a query includes an OID which is probably not explicitly defined. However, it is implicitly determined by a Skolem function. For instance, consider the function GreekCities() in Subsection 2.1.2. For each binding of the query's variables with the source data, a distinct city element is produced. So a new identifier is constructed for each distinct binding, i.e., we should define Skolem functions that produce a unique value for each row of the table R, which contains the variable bindings. In Subsection 2.1.3, we mentioned that two different rows in R are produced when at least one pattern expression of P matches two different XML subgraphs of the source data. Two subgraphs are different if they contain at least two nodes with different values. By different values, we mean either different OIDs or unequal string values in the case of leaf nodes. We should mention that two graphs can be different even if they contain identical leaf nodes. Therefore, in order to ensure that a different result is produced for each row in R: (i) a different identifier variable is assigned to each element that appears in P, and (ii) a unique OID is created for each element in the constructor of the query by means of Skolem functions defined over the above variables. If id_1, ..., id_p are the identifier variables that are assigned to the elements that appear in the pattern part of a query, each element <e_i> in the constructor part can be written as <e_i ID = f_i(id_1, ..., id_p)>. The pattern part may contain several pattern expressions. It is conceivable that when the OID of an element is explicitly defined, no additional transformation is applied.

According to the above description, the function GreekCities() is written as follows:

V: function GreekCities() {
     WHERE <city ID=$i1>
             <name ID=$i2> $c </>
             <population ID=$i3> $p </>
             <country ID=$i4> Greece </>
           </> IN "cities.xml"
     CONSTRUCT <city ID=f1($i1,$i2,$i3,$i4)>

In the case of RXL queries [MF00], the implied OID definitions are derived by using the keys of the relations that are involved in the FROM clause of the query.

The problem changes when the pattern part of the user query involves regular path expressions. A regular path expression does not require a specific schema for the source data. Since the schema of the data that can be matched by the pattern part is not known, it is not possible to derive the implicit OIDs of the elements in the constructor of the query. In Chapter 3, we see that, due to this fact, there are cases where no rewritten query can be constructed when regular path expressions appear in the pattern part of the view.

2.1.6 Value equality

In the case of leaf nodes, two nodes are equal if they correspond to equal string values. In the case of internal nodes, the nodes are equal if they correspond to the same OID.

Consider the following query, which returns publications that are written by more than one author:

WHERE <publications>
        <author>
          <name> $n1 </>
          <publication> $p1 </>
        </>
        <author>
          <name> $n2 </>
          <publication> $p2 </>
        </>
      </> IN "publications.xml",
      $n1 != $n2, $p1 = $p2
CONSTRUCT <author>
            <name> $n1 </>
            <publication> $p1 </>
          </>

If the publication elements in the source data contain atomic values, then the comparison between $p1 and $p2 is based on the string values to which they are bound. Otherwise, the OIDs to which the two variables are bound should be identical, i.e., $p1 and $p2 should correspond to the same publication object. Content equality is not adequate.

We can observe that the publications that are bound to the variable $p1 are reproduced twice in the result. Since each publication element that is created has a unique OID (no Skolem functions are declared), each binding results in two different nodes, referenced by two different publication edges. So now, these nodes refer to two different publications.
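The equality rule of this subsection (string comparison for leaves, OID identity for internal nodes) can be sketched as follows. The `Node` class and the `xmlql_equal` helper are hypothetical names introduced for illustration; the treatment of a leaf compared with an internal node (never equal unless it is the same node) is an assumption, since the text does not discuss that case.

```python
from itertools import count

_oids = count(1)

class Node:
    """A graph node: a leaf carries a string value, an internal node carries children."""
    def __init__(self, value=None, children=()):
        self.oid = next(_oids)          # every node receives a fresh OID
        self.value = value
        self.children = list(children)

    def is_leaf(self):
        return self.value is not None

def xmlql_equal(a, b):
    """Leaves compare by string value; all other cases compare by OID."""
    if a.is_leaf() and b.is_leaf():
        return a.value == b.value
    return a.oid == b.oid

# Two atomic publications with the same string value are equal.
leaf1, leaf2 = Node("On Views"), Node("On Views")

# Two internal publications with identical content are different objects.
pub1 = Node(children=[Node("On Views")])
pub2 = Node(children=[Node("On Views")])
```

Under this sketch `xmlql_equal(pub1, pub2)` is false although the contents coincide, which is the sense in which content equality is not adequate.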

2.2 Views

We do not provide any special view definition syntax. Views are defined as normal XML-QL functions without arguments and schema restrictions for the output. References to views are expressed as function calls. However, we restrict our problem to function calls that appear after IN, and not in the interior of elements. In addition, we do not study view definitions with constructors that contain nested sub-queries.

A well-formed XML document should have a single root element tag. The XML-QL implementation [Fer99] adds a tag named XML to the result of a query. We can assume that the result of a view query has a root element named view. A query over a view should take into consideration this external element.


2.3 Query Rewriting

Consider a view defined as an XML-QL query V. V takes as input an XML document XD_1 and returns an XML document XD_2 = V(XD_1). Moreover, consider a user query Q, which takes as input XD_2 and returns an XML document XD_3. We are required to construct an XML-QL query Q', which takes as input the XML document XD_1 and returns XD_3, i.e., Q'(XD_1) = Q(XD_2) = Q(V(XD_1)) = (Q ∘ V)(XD_1).

We give a brief description of the intuition behind the rewriting process that we follow. The structure of an XML-QL constructor is very similar to the structure of XML documents. When no explicitly defined OIDs and nested sub-queries appear in the constructor of V, its schema is similar to the schema of the individual tuples that are produced by the query. In other words, the element tags and attributes that appear in the constructor of V also appear in each tuple of the resulting XML document XD_2. The idea is to match the user-query pattern with the constructor of V instead of matching it with XD_2. Since a constructor, as opposed to an XML document, also contains variables, the result of the matching procedure is a set of mappings between variables, constants and element tags. The matching and composition algorithms are presented in Chapter 3.
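To make the intuition concrete, the following Python sketch matches a user-query pattern against a view constructor and collects the resulting mappings. It is a deliberately simplified illustration and not the matching algorithm of Chapter 3: trees are modeled as hypothetical `(tag, children)` pairs, leaves as strings, variables as strings starting with `$`, and attributes, tag variables, and OIDs are ignored.

```python
from itertools import product

def is_var(x):
    return isinstance(x, str) and x.startswith("$")

def match(p, c):
    """Return every mapping (pattern item -> constructor item) under which p matches c."""
    if isinstance(p, str):                       # pattern leaf
        if isinstance(c, str):                   # constructor leaf
            ok = is_var(p) or is_var(c) or p == c
            return [{p: c}] if ok else []
        return [{p: c}] if is_var(p) else []     # variable bound to a whole subtree
    if isinstance(c, str):
        return []
    (ptag, pkids), (ctag, ckids) = p, c
    if ptag != ctag:
        return []
    options = []
    for pk in pkids:                             # each pattern child needs a partner
        alternatives = [m for ck in ckids for m in match(pk, ck)]
        if not alternatives:
            return []
        options.append(alternatives)
    results = []
    for combo in product(*options):              # one choice per pattern child
        merged = {}
        for m in combo:
            merged.update(m)
        results.append(merged)
    return results

# Pattern of the user query and constructor of the view from Example 1.1,
# with the user-query variable renamed to $u to keep the two sides apart.
pattern = ("author", [("name", ["Marcel Proust"]), ("book", ["$u"])])
constructor = ("author", [("name", ["$a"]), ("book", ["$t"])])
mappings = match(pattern, constructor)
```

The single mapping found pairs the constant "Marcel Proust" with the view variable $a and the user variable $u with $t, which corresponds to the condition and substitution used when the query of Example 1.1 is rewritten. More than one mapping is returned when an element name occurs several times under the same parent.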

Explicitly defined OIDs and sub-query blocks in the constructor of a query may result in aggregated XML data, so the structure of the produced tuples may be different from the structure of the constructor. As mentioned in the previous section, sub-query blocks in the constructor of the view are not studied in this thesis. Explicitly defined OIDs in the view definition are studied in Chapter 4. Finally, we also assume that no regular path expressions appear in the pattern part of the view and the user query.

2.3.1 Constructor and Pattern trees

We consider that all the variables, as well as the Skolem function declarations that appear in a query, are stored in a symbol table. We denote as T_Q the symbol table that contains all the variables and Skolem function declarations that appear in the body of a query Q. For each variable, we store its name, while for each Skolem function, we store its name and its list of parameters. Parameters may be either constants or pointers to symbols in T_Q.

We represent constructors by trees instead of graphs, which facilitates the matching procedure. The internal nodes of a constructor tree represent elements, while the leaf nodes represent non-tag variables and constants (variables and constants that are not included in an element tag) that appear in a constructor. Handling the tag elements as terminal symbols and ignoring the end tags, a constructor tree is similar to the syntax tree of the corresponding constructor.

A constructor tree can be expressed using the following grammar:

     1: Node          ::= LeafNode | InternalNode
     2: LeafNode      ::= LeafVariable | LeafConstant
     3: InternalNode  ::= (TagLabel, Children, AttributesSet)
     4: TagLabel      ::= TagVariable | ElementName
     5: Children      ::= Node | Children, Node
     6: AttributesSet ::= {OID, Attributes}
     7: OID           ::= (ID, <Skolem>)
     8: Attributes    ::= Attribute | Attributes, Attribute
     9: Attribute     ::= (AttrName, AttrValue)
    10: AttrValue     ::= AttrVariable | AttrConstant
    11: LeafVariable  ::= <VAR>
    12: TagVariable   ::= <VAR>
    13: AttrVariable  ::= <VAR>
    14: ElementName   ::= <IDENTIFIER>
    15: AttrName      ::= <IDENTIFIER>
    16: LeafConstant  ::= <STRING>
    17: AttrConstant  ::= <STRING>

The terminal symbols <VAR> and <Skolem> represent references to variables and Skolem function definitions, respectively, in the symbol table. A node is a descendant of another node that represents an element eParent if it represents an element, a variable or a constant which is nested under eParent. The root node represents the root element that is produced by the query and is named view if Q is a view definition. In the sequel, internal nodes are named after their label (TagLabel), and leaf nodes are named after the variable or the string value that they represent.

A graphical representation of a constructor tree is displayed in Figure 2.1.

[Figure 2.1: A query and the corresponding constructor tree]

We consider two sets of primitive functions that can be used to access and handle tree nodes:

Symbol(b): boolean  Returns true if b can reduce to the non-terminal symbol Symbol of the grammar presented above. For instance, if node is a leaf node that represents a variable, the function calls LeafNode(node) and LeafVariable(node) return true, while the function call LeafConstant(node) returns false.

getSymbol(b): Symbol  Returns the Symbol part of b. For instance, if node is an internal node with the form (author, {child_1, child_2}, {(ID, f(x_1))}), then the function call getTagLabel(node) returns author, and the function call getOID(getAttributesSet(node)) returns (ID, f(x_1)).
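One possible in-memory representation of constructor-tree nodes and of the two families of primitive functions is sketched below in Python. The class names Var, Skolem and InternalNode, the field layout, and the choice to store the OID directly on the node (instead of inside the attribute set, as the grammar does) are assumptions made for brevity and are not taken from the thesis.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass(frozen=True)
class Var:
    """Reference to a variable in the symbol table, e.g. $x1."""
    name: str


@dataclass(frozen=True)
class Skolem:
    """Skolem function used as an OID, e.g. f(x1)."""
    name: str
    params: tuple = ()


@dataclass
class InternalNode:
    tag: Union[str, Var]                          # TagLabel: ElementName or TagVariable
    children: list = field(default_factory=list)  # Children
    oid: Optional[Skolem] = None                  # the (ID, <Skolem>) entry
    attrs: dict = field(default_factory=dict)     # AttrName -> str or Var


# Leaf nodes are represented directly: a Var is a LeafVariable,
# a plain string is a LeafConstant.
def leaf_variable(node) -> bool:
    return isinstance(node, Var)


def leaf_constant(node) -> bool:
    return isinstance(node, str)


def leaf_node(node) -> bool:
    return leaf_variable(node) or leaf_constant(node)


def internal_node(node) -> bool:
    return isinstance(node, InternalNode)


def get_tag_label(node: InternalNode):
    return node.tag


def get_oid(node: InternalNode):
    return node.oid


# The internal node (author, {child1, child2}, {(ID, f(x1))}) from the text.
author = InternalNode("author", ["child1", "child2"], Skolem("f", (Var("x1"),)))
```

With this layout, get_tag_label(author) yields the label author and get_oid(author) yields the Skolem term f(x1), mirroring the two calls in the text.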

Given a node of a constructor tree, we can reproduce the constructor segment that corresponds to the subtree rooted at this node. A constructor segment is defined by the non-terminal symbol Contents in the XML-QL grammar presented in Appendix A.

A pattern tree PT_{Q,i} is defined similarly and represents the i-th pattern expression of the pattern part of Q. Skolem functions do not appear in pattern trees. OID attributes may also be omitted. So, the lines 7-8 of the grammar presented above can be simply replaced by the following line:

     7: AttributesSet ::= '{' Attributes '}'

The OID attribute is handled as any other attribute. We should mention that leaf nodes in a pattern tree cannot have sibling nodes, since patterns with the form <e> $var1 $var2 </> or <e> $var const </> are not valid.

We assume that the root node of a pattern tree always has a single child. Pattern expressions of the form:

are always translated into:

After this transformation, applying a pattern expression over the view is equivalent to applying the pattern expression over each distinct view tuple.

2.3.2 Matching pattern trees to constructor trees

The procedure of matching a pattern part of a query to the constructor of another query is analogous to matching the pattern part of a query to an XML document. A certain pattern can match an XML document in several different ways, resulting in different variable bindings. Similarly, a pattern can match a constructor in several different ways. Each of these matchings involves several independent mappings that involve variables, constants, element labels, constructor segments and patterns.

The matching procedure between a pattern expression and a constructor is performed by means of the corresponding pattern and constructor trees. Nodes of the pattern tree are matched to nodes of the constructor tree by comparing their properties, e.g., the label names and the attribute values. Since pattern and constructor trees also involve variables, the comparison may also involve variables. Since variables can be bound to any value, the comparison between two variables or between a variable and a constant always evaluates to true.

Consequently, an internal node of a pattern tree can match an internal node of the constructor tree if the comparison between their labels, attribute names and attribute values evaluates to true. All the attributes of the pattern node should be matched, but it is not necessary to match all the attributes of the constructor node. For instance, the pattern node that represents <$t continent="Europe"> can match the constructor node that represents <country ID=f($i) continent=$c>.
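The attribute rule above can be sketched in a few lines of Python. The representation (attributes as a dict from names to values, variables as Var objects) and the function names are assumptions made for this illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """An attribute variable such as $c."""
    name: str


def values_match(a, b) -> bool:
    """Two values match if they are equal or at least one is a variable."""
    return isinstance(a, Var) or isinstance(b, Var) or a == b


def attributes_match(pattern_attrs: dict, constructor_attrs: dict) -> bool:
    """Every pattern attribute must be present in the constructor node with a
    matching value; extra constructor attributes are simply ignored."""
    return all(
        name in constructor_attrs and values_match(value, constructor_attrs[name])
        for name, value in pattern_attrs.items()
    )


# <$t continent="Europe"> against <country ID=... continent=$c>
pattern = {"continent": "Europe"}
constructor = {"ID": "f($i)", "continent": Var("c")}
print(attributes_match(pattern, constructor))  # True
```

The check is deliberately asymmetric: the ID attribute of the constructor node has no counterpart in the pattern, which does not prevent the match.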

The comparison between constant leaf nodes is based on their string values. However, a leaf node may consist of a variable. Such a variable can be bound not only to atomic values, but also to OIDs representing whole XML subgraphs. In order to handle this case, we consider that a variable in a leaf node can be matched to an internal node or a collection of sibling nodes.

Not every pair of nodes is compared. A pattern node can match a constructor node only if their parent nodes can be matched. In other words, in order to match two nodes, we should match all their ancestors.

Matches between components of the constructor and the pattern tree are represented by triples that obey the following syntax:

     1: Match ::= (TagVariable, TagVariable, 'null')
     2:         | (TagVariable, ElementName, 'null')
     3:         | (ElementName, TagVariable, 'null')
     4:         | (AttrVariable, AttrVariable, 'null')
     5:         | (AttrVariable, AttrConstant, 'null')
     6:         | (AttrConstant, AttrVariable, 'null')
     7:         | (LeafNode, LeafNode, <Skolem>)
     8:         | (LeafVariable, InternalNode, <Skolem>)
     9:         | (InternalNode, LeafVariable, 'null')
    10:         | (LeafVariable, SiblingNodes, <Skolem>)
    11:         | (SiblingNodes, LeafVariable, 'null')
    12: SiblingNodes ::= Node SiblingNodes | Node Node

A match can involve tag labels, attribute values and nodes that are matched during the matching procedure. When two internal nodes are matched, the produced matches represent their matched attributes and labels that involve variables.

The first argument of a match represents the matched component of the pattern tree, while the second argument represents the matched component of the constructor tree. As shown above, leaf nodes that represent variables can match collections of sibling nodes. If the leaf node belongs to the pattern tree, the meaning of such a match is that the corresponding variable should be bound to the OID of an XML subgraph whose structure is defined by the constructor segment that the matched nodes represent. This OID is represented by the third argument of the triple. It is used when the variable participates in equality filters and the equality should be based on the OID. If the leaf node belongs to the constructor tree, the meaning is that the pattern represented by the matched nodes should be applied on the corresponding variable.

In general, a match represents a set of bindings of a certain subset of the user-query variables.

Example 2.3 Consider the following pair of view and user query:

    V: function View() {
         WHERE
           <a ID = $i1>
             <$t1 ID = $i2> $x1 </>
             <$t2 ID = $i3> $x2 </>
           </> IN "source.xml",
         CONSTRUCT
           <b ID = f1($i1,$i2,$i3) a="Hi">
             <$t ID = f2($i1,$i2,$i3)> $x1 </>
             <b2 ID = f3($i1,$i2,$i3)> $x2 </>
           </>
       }

    Q: WHERE
         <view>
           <b a=$v>
             <b1> $y1 </>
             <b2><b3> $y2 </></>
           </>
         </> IN View()
       CONSTRUCT
         <result>
           <d1> $y1 </>
           <d2> $y2 </>
         </>

The following set of matches can be identified: ($v, 'Hi', null), ('b1', $t, null), ('b2', $t, ...

The purpose of the matching procedure between a constructor and a pattern tree is to match the whole pattern tree, but not necessarily the whole constructor tree. Moreover, the same node of a constructor tree can be matched by several nodes of the pattern tree, but each node of the pattern tree should be matched only once.

Several different matchings can exist between a constructor and a pattern tree, i.e., the pattern tree can match the constructor tree in several different ways. Each matching is represented by the corresponding set of matches. Each set of matches should obey the following restrictions: (i) there should be a single match for each tag or attribute variable of the pattern tree; and (ii) each leaf node of the pattern tree or exactly one of its ancestors should appear in a single match.

Considering the matches presented in Example 2.3, two possible matchings can be identified.

1. The first contains the matches ($v, 'Hi', null), ('b1', $t, null), ($y1, $x1, f2($i1,$i2,$i3)) and (b3, $x2, f3($i1,$i2,$i3)).

2. The second contains the matches ($v, 'Hi', null), ('b1', $t, null), ('b2', $t, null), ($y1, $x1, f2($i1,$i2,$i3)) and (b3, $x1, f2($i1,$i2,$i3)).

The matching procedure is presented in detail in Chapter 3.

Chapter 3

Query composition - Base case

In this chapter, we present our solution to the problem of composing queries that are defined over XML views. Explicit OID definitions are not studied in this chapter. They are introduced in Chapter 4.

3.1 Formalizing the problem

Consider a view V(db) which is defined by a query V(db):

    WHERE     P_V(db, x_1, ..., x_n), F_V(x_1, ..., x_n)                        (3.1)
    CONSTRUCT C_V(x_1, ..., x_n)

where P_V is the pattern part, F_V is the filter part, C_V is the constructor, and X = {x_1, ..., x_n} is the set of variables that appear in V. The query is applied on the XML document db.

Consider the user query Q(V) that is defined over V:

    WHERE     P_Q(V, y_1, ..., y_m), F_Q(y_1, ..., y_m)                         (3.2)
    CONSTRUCT C_Q(y_1, ..., y_m)

where P_Q, F_Q and C_Q are defined over the variables Y = {y_1, ..., y_m}, and X ∩ Y = ∅.¹

Our problem is the construction of a query Q'(db) which is defined over the source data and returns the same result as Q. Q' has the following format:

    WHERE     P_Q'(db, x_1, ..., x_n), F_Q'(x_1, ..., x_n, y_1, ..., y_m)       (3.3)
    CONSTRUCT C_Q'(x_1, ..., x_n, y_1, ..., y_m)

The rewriting process involves two sequential steps. During the first step, we discover the matchings between P_Q and C_V. These matchings are used in the second step to produce the rewritten query Q'. The first step is presented in Section 3.2, and the second one is presented in Section 3.3.

3.1.1 Assumptions

We assume that P_Q contains only a single pattern expression, which is the one that refers to V. We introduce multiple pattern expressions in P_Q in Section 3.5. We also assume that there are no equality conditions between variables in {y_1, ..., y_m}. This means that there are no filters y_i = y_k in F_Q, and furthermore, each y_i appears only once in P_Q. Equality conditions between variables are studied in Section 3.6. Finally, we assume that neither P_V nor P_Q contains regular path expressions. In Section 3.4, we justify this assumption.

3.2 Matching the user query against the view

3.2.1 The matching algorithm

Let PT_Q be the pattern tree that represents the pattern part P_Q of the user query Q, and CT_V be the constructor tree that represents the constructor C_V of the view V. Algorithm 3.1 describes how the matchings between PT_Q and CT_V are discovered. The algorithm consists of the function findMatchings(node_p, node_c), which performs the matching procedure recursively. It takes as arguments a node node_c from CT_V and a node node_p from PT_Q and returns all the matchings between the subtrees that are defined by these two nodes. We use the functions that are defined in Section 2.3 to access and handle nodes.

¹ If X ∩ Y ≠ ∅, we can always find a renaming X' of the set of variables in X, such that X' ∩ Y = ∅.

Lines 4-9 check whether the nodes represent constants. If they both represent equal constants, a single matching exists that contains only the match between the two nodes. Lines 12-14 check whether the tag labels of the two nodes are identical or at least one of them is a tag variable. If neither of these situations holds, no matching exists. Lines 17-20 check whether the attributes of the elements that correspond to the nodes match. The function matchAttr(node_p, node_c) returns false if and only if (i) both node_p and node_c represent internal nodes, and (ii) there is at least one attribute in node_p which is either not included in the attribute list of node_c or whose corresponding attribute values do not match. Two attribute values match if either their values are equal or at least one of them is a variable. The function findAttributeMatches(node_p, node_c) returns the matches that contain the matched pairs of attribute values, when at least one value in each pair is a variable. Lines 27-30 check whether node_p has a single child which represents a variable. If this is the case, one single matching exists containing the match between this child and the whole collection of subnodes of node_c.
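The label test of Lines 12-14 and the behaviour described for findAttributeMatches() can be sketched as follows. This is a minimal Python illustration under assumed representations (labels and attribute values are plain strings or Var objects, matches are 3-tuples); the function names are hypothetical transliterations, not code from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A tag or attribute variable such as $t or $v."""
    name: str


def is_var(x) -> bool:
    return isinstance(x, Var)


def labels_match(label_p, label_c) -> bool:
    """Lines 12-14: the labels are identical, or at least one is a tag variable."""
    return is_var(label_p) or is_var(label_c) or label_p == label_c


def find_attribute_matches(attrs_p: dict, attrs_c: dict) -> list:
    """Return one (pattern value, constructor value, None) triple per matched
    attribute pair in which at least one value is a variable.

    Assumes the attribute check has already succeeded, so every attribute
    name of the pattern node is present in the constructor node."""
    return [
        (value_p, attrs_c[name], None)
        for name, value_p in attrs_p.items()
        if is_var(value_p) or is_var(attrs_c[name])
    ]


# Pattern node <b a=$v> against a constructor node <b ID=... a="Hi">
matches = find_attribute_matches({"a": Var("v")}, {"ID": "f1", "a": "Hi"})
```

Pairs of equal constants produce no match triple, since they place no constraint on any variable.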

Lines 34-43 handle the case when more than one node is under node_p. None of these nodes can be a variable. In this case, we consider each child child_c of node_c independently and find the matchings between child_c and the children of node_p. After all the combinations of children pairs have been identified, a number of different sets of matchings are discovered and stored in the list matchings[]. Each set of matchings corresponds to a different pair of children nodes of node_p and node_c. The function merge(matchings[]) results in a single set of matchings S by combining the matchings in matchings[] as follows. Each set of matchings corresponds to a single child or a subset of children of node_p. We consider all the combinations combination_i of matchings sets S_ij, such that each combination_i corresponds to disjoint children subsets, and every child of node_p appears in such a subset. The matchings sets S_ij that correspond to a certain combination_i are combined, resulting in a single set S_i. More specifically, each matching in S_i is produced by considering one matching from each S_ij and merging them. All the possible combinations between matchings in each S_ij are considered. Following this process, we ensure that every leaf node in the subtree under node_p or one of its ancestors appears in a single match of each matching in S_i. Finally, all the sets S_i are merged into a single set S that contains all the matchings.
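The combination step performed by merge() can be sketched in Python as follows. The input format is an assumption made for this sketch: instead of the two-dimensional list matchings[], the function receives a flat list of (subset, matchings) pairs, where subset is the set of pattern-child indices covered by that entry and matchings is its non-empty list of matchings. Entries for which no matching exists are assumed to have been dropped beforehand. The sample data is loosely modelled on the children of the b elements in Example 2.3.

```python
from itertools import product


def merge(options, n_children):
    """Combine per-child matchings into matchings for the whole node.

    options    -- list of (frozenset of pattern-child indices, list of matchings)
    n_children -- number of children of the pattern node
    A matching is a list of match triples."""
    results = []

    def cover(remaining, chosen):
        if not remaining:
            # One combination found: take one matching from each chosen set
            # (Cartesian product) and concatenate their matches.
            for combo in product(*chosen):
                results.append([match for matching in combo for match in matching])
            return
        # Always extend the combination with an entry that covers the smallest
        # uncovered child, so that each combination is enumerated only once.
        first = min(remaining)
        for subset, matchings in options:
            if first in subset and subset <= remaining:
                cover(remaining - subset, chosen + [matchings])

    cover(frozenset(range(n_children)), [])
    return results


# Pattern children: 0 = <b1>, 1 = <b2>.
options = [
    (frozenset({0}), [[("b1", "$t", None), ("$y1", "$x1", "f2")]]),
    (frozenset({1}), [[("b2", "$t", None), ("b3", "$x1", "f2")]]),
    (frozenset({1}), [[("b3", "$x2", "f3")]]),
]
merged = merge(options, 2)
print(len(merged))  # 2
```

Requiring the chosen subsets to be disjoint and to cover every pattern child is what guarantees that each pattern leaf, or one of its ancestors, appears in exactly one match of every merged matching.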

If at least one of the tag labels of node_p or node_c is a variable, the match between the two labels is added to each individual matching by calling the function addMatch() (Lines 47-50). Finally, Lines 53-54 add the attribute matches to each matching by calling the function addMatches().

Algorithm 3.1

Input: root node of CT_V, root node of PT_Q

Output: the set of all the matchings between CT_V and PT_Q

     1: function Matchings findMatchings(node_p, node_c) {
     2:
     3:   % If both nodes represent constants
     4:   if (LeafConstant(node_p) or LeafConstant(node_c))
     5:     if (LeafConstant(node_p) and LeafConstant(node_c) and node_p = node_c)
     6:       return new Matchings(
     7:         new Matching(
     8:           new Match(node_p, node_c, null)));
     9:     else return null;
    10:
    11:   % If the tag labels do not match
    12:   if (!tagVariable(getTagLabel(node_p)) and !tagVariable(getTagLabel(node_c))
    13:       and getTagLabel(node_p) != getTagLabel(node_c))
    14:     return null;
    15:
    16:   % Find matches between the attribute values
    17:   if (!matchAttr(getAttributesSet(node_p), getAttributesSet(node_c)))
    18:     return null;
    19:   else attributeMatches ← findAttributeMatches(
    20:     getAttributesSet(node_p), getAttributesSet(node_c));
    21:
    22:   % Get the children of the nodes
    23:   children_p ← getChildren(node_p);
    24:   children_c ← getChildren(node_c);
    25:
    26:   % If node_p has a single child that represents a variable
    27:   if LeafVariable(children_p)
    28:     return new Matchings(
    29:       new Matching(
    30:         new Match(children_p, children_c, getOID(getAttributes(node_c)))));
    31:
    32:   % If node_p has multiple children, find the matchings
    33:   % that correspond to all the possible pairs of subtrees and merge them
    34:   for each child_c[i] in children_c
    35:     if !LeafVariable(child_c[i])
    36:       for each child_p[j] in children_p
    37:         matchings[i, j] ← findMatchings(child_p[j], child_c[i]);
    38:     else
    39:       for each subchildren_p[j] in children_p
    40:         matchings[i, j] ← new Matchings(
    41:           new Matching(
    42:             new Match(subchildren_p[j], child_c[i], null)));
    43:   mergedMatchings ← merge(matchings[]);
    44:
    45:   % If the tag label of at least one node is a variable,
    46:   % include the corresponding match
    47:   if (tagVariable(getTagLabel(node_p)) or tagVariable(getTagLabel(node_c)))
    48:     for each matching in mergedMatchings
    49:       addMatch(matching,
    50:         new Match(getTagLabel(node_p), getTagLabel(node_c), null));
    51:
    52:   % Include the attribute matches
    53:   for each matching in mergedMatchings
    54:     addMatches(matching, attributeMatches);

3.2.2 Deriving the variable bindings from the matchings

As mentioned in Section 2.3, the purpose of the matching procedure is to discover how the variables of the user query are mapped to the source XML document, without materializing the view. In this subsection, we show that all the bindings that are produced when the pattern part of the user query is applied over the view can be derived from the matchings that the matching algorithm produces. We also show that no additional bindings are derived from the matchings. In Section 3.3, we use this fact to prove that the rewriting algorithm produces the correct composition query.

Let X^i = {x_1^i, ..., x_n^i} be a binding of the set X of the view variables, and let t_i be the corresponding tuple that is produced by V. The structure of V's constructor C_V(X) defines the structure of the produced tuples. Therefore t_i = C_V(X^i). Instead of matching the pattern part P_Q(Y) of the query with each tuple C_V(X^i), the matching algorithm matches P_Q(Y) with C_V(X) by following a similar procedure.

The matching algorithm does not bind variables to specific constants or XML graphs. It finds matches involving components of the view constructor (constants, variables and constructor segments), which include variables that are not yet bound to specific values. However, we should ensure that the matching algorithm produces matchings that derive all the actual bindings of the user-query variables, if the filter part of the user query is ignored. In addition, the matching algorithm should not produce matchings that derive bindings that cannot be produced by P_Q(Y).

Consider a certain binding of the set of variables Y, when applying P_Q(Y) over the output of V. This binding results after matching P_Q(Y) to a certain tuple t_i produced by V. Let X^i be the binding of V's variables that results in t_i. Each variable y_j ∈ Y is bound to C_{V,j}(X^i), where C_{V,j}(X^i) is a constant or an XML subgraph constructed by the component C_{V,j}(X) of the constructor of V. A binding between y_j and a subgraph is represented by the OID of the node that parents the subgraph.

When P_Q is applied over C_V by means of the rewriting algorithm, there is a matching where each variable y_j is matched to C_{V,j}(X), which evaluates to C_{V,j}(X^i) when the set of variables X is bound to X^i. If y_j is not a tag or attribute variable, the algorithm also records the OID definition of the parent element of C_{V,j}(X) as the third argument of the match triple. This definition represents the OIDs to which y_j is actually bound and is used when y_j is involved in equality filters with other variables. Such filters are introduced in Section 3.6.

The previous situation is not always possible. The binding X^i may involve non-atomic values, i.e., a variable x_k ∈ X may be bound to a subgraph G. Assume that when P_Q is applied over t_i, a variable y_j is bound to a subgraph G_j of G. Even if G_j does not depend on the binding X^i of the set of variables X, we can denote it as C_{V,j}(X^i). The matching algorithm cannot directly represent this binding, since C_{V,j}(X^i) is not available without materializing the view. Instead, it records a match between a sub-pattern P_k(Y) and x_k. Applying P_k(Y) on x_k, y_j is bound to C_{V,j}(X^i) when x_k is bound to G. We conclude that, given a binding of the set of the user-query variables, there is a matching produced by the matching algorithm that represents it as described above.

Consider now a matching M returned by the matching algorithm. Assume that each variable y_j, where j = 1..q, is matched to a component C_{V,j}(X) of the constructor of V. Moreover, there are matches in which a variable x_k ∈ X is matched to a pattern P_k(Y) or to a constant c_k. M represents a set of bindings of the variables Y with the following properties. Given a binding X^i of V's variables, each variable y_j, where j = 1..q, is bound to C_{V,j}(X^i), while the binding of each y_j, where j = q+1..m, is derived by applying a certain pattern P_k(Y) on x_k^i, where x_k^i ∈ X^i. A match between a variable x_k and a constant c_k is translated as follows. The bindings are derived only if x_k^i = c_k, where x_k^i ∈ X^i. It is clear that when P_Q is applied over the tuple C_V(X^i) that is generated by V, all these bindings are produced.
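The way a matching M determines the bindings of the user-query variables for a given view binding X^i can be sketched in Python for the atomic case. The representation is an assumption made for this sketch: variables are strings that start with "$", every other string is a constant, and a matching is a list of (pattern component, constructor component) pairs. Matches between a sub-pattern and a view variable, as well as the third argument of the triples, are left out.

```python
def is_var(term: str) -> bool:
    return term.startswith("$")


def derive_binding(matching, view_binding):
    """Return the binding of the user-query variables that the matching yields
    under view_binding, or None if a constant condition fails."""
    binding = {}
    for pattern_term, constructor_term in matching:
        # Evaluate the constructor component C_{V,j}(X) under the binding X^i.
        if is_var(constructor_term):
            value = view_binding[constructor_term]
        else:
            value = constructor_term
        if is_var(pattern_term):
            # y_j is bound to C_{V,j}(X^i).
            binding[pattern_term] = value
        elif value != pattern_term:
            # A match between x_k and a constant c_k requires x_k^i = c_k.
            return None
    return binding


matching = [("100", "$x1"), ("$y1", "$x1")]
print(derive_binding(matching, {"$x1": "100"}))  # {'$y1': '100'}
print(derive_binding(matching, {"$x1": "7"}))    # None
```

The second call returns None because the condition $x1 = 100 fails, so this matching contributes no binding for that view tuple.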

3.3 The rewriting algorithm

3.3.1 Detailed algorithm

A formal description of the rewriting procedure is presented by Algorithm 3.2. The algorithm first calls the findMatchings() function to discover the matchings between the pattern tree of the user query and the constructor tree of the view query. For each matching, it produces a different sub-query block. More specifically, if {M_1, ..., M_k} is the set of the discovered matchings, the final rewritten query has the form {Q'_1} ... {Q'_k}, where Q'_i is the sub-query that corresponds to M_i. Since equality between variables in the user query is not handled by the algorithm, the third argument of the match triples is not used.

Algorithm 3.2

Input: a view definition V, and a user query Q defined over V

Output: the composition query Q' = Q ∘ V

     1: function Query compose(Q, V) {
     2:
     3:   % Rename the variables in V so that they do not conflict
     4:   % with the variable names of Q
     5:   renameVariables(V, Q);
     6:
     7:   % Get the parts of Q and V that will form the corresponding parts
     8:   % of the rewritten query
     9:   P_V ← getPatternPart(V);
    10:   C_Q ← getConstructorPart(Q);
    11:   F_V ← getFilterPart(V);
    12:   F_Q ← getFilterPart(Q);
    13:
    14:   % Initialize the rewritten query Q'
    15:   Q' ← new Query();
    16:
    17:   % Find the matchings between V's constructor tree and Q's pattern tree
    18:   matchings ← findMatchings(getPatternTree(Q), getConstructorTree(V));
    19:
    20:   % For each matching, create a separate sub-query
    21:   for each matching in matchings {
    22:
    23:     % Initialize the pattern, filter and constructor parts of the sub-query
    24:     P_SQ ← P_V;
    25:     F_SQ ← mergeFilters(F_Q, F_V);
    26:     C_SQ ← C_Q;
    27:
    28:     % If a variable in V is matched to a constant
    29:     for each (m_p, m_c, id) in matching {
    30:
    31:       if (LeafConstant(m_p) or AttrConstant(m_p) or ElementName(m_p))
    32:         addFilter(F_SQ, m_c = m_p);
    33:       % If the matched component that corresponds to Q is a variable
    34:       if (LeafVariable(m_p) or AttrVariable(m_p) or TagVariable(m_p)) {
    35:         substitute(C_SQ, m_p, m_c);
    36:         % If the variable cannot be substituted in the filter part, the matching is rejected
    37:         if !substitute(F_SQ, m_p, m_c) goto 47;
    38:
    39:       }
    40:       % If a variable in V is matched to internal nodes of Q's pattern tree
    41:       if (SiblingNodes(m_p) or InternalNode(m_p))
    42:         addPattern(P_SQ, constructPattern(m_p) IN m_c);
    43:     }
    44:     % Construct the sub-query and add it to Q'
    45:     SQ ← new Query(P_SQ, F_SQ, C_SQ);
    46:     addSubquery(Q', SQ);
    47:   }
    48:   return Q';
    49: }

We give an explanation of the new functions that appear in the algorithm. The function rename(query_1, query_2) in Line 5 renames the variables in query_1, so that they do not conflict with the variable names in query_2. The functions getX(query), where X is a query part (Lines 9-12), return the corresponding part of query, while the functions getConstructorTree(query) and getPatternTree(query) return the constructor and pattern tree of query. The function mergeFilters(filter_1, filter_2) (Line 25) produces a filter that contains both filters filter_1 and filter_2. The function substitute(queryPart, var, component) (Lines 35, 37) replaces all the appearances of variable var in queryPart by the constant, variable, element label or constructor segment that corresponds to component. If component represents an internal node, i.e., InternalNode(component) is true, or a collection of sibling nodes, i.e., SiblingNodes(component) is true, var is substituted by the constructor segments that correspond to the constructor subtrees parented by these nodes. Since each subtree results in a different segment, var is replaced by the concatenation of all the individual segments. A constructor segment cannot participate in a filter. Therefore, if the first argument of substitute() is a filter, and the third argument is an internal node or a collection of nodes, the function returns false, and the corresponding matching is rejected (Line 37).

The function addPattern(p, expr) adds a new pattern expression expr into the pattern part p of a query (Line 42). The translation of a pattern tree tree into a pattern expression is performed by the function constructPattern(tree). Finally, the function addSubquery(query, subquery) adds subquery to query as a sub-query block.
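The per-match dispatch of Algorithm 3.2 (Lines 29-42) can be sketched in Python as a function that only classifies the matches of one matching. The representation is an assumption made for this sketch: variables are strings that start with "$", other strings are constants or element names, and tuples stand for internal nodes, collections of sibling nodes or constructor segments. The function plan_subquery and the parameter filter_vars (the user-query variables that occur in F_Q) are hypothetical and do not appear in the thesis.

```python
def is_var(term) -> bool:
    return isinstance(term, str) and term.startswith("$")


def is_segment(term) -> bool:
    """Tuples stand for internal nodes, sibling collections or segments."""
    return isinstance(term, tuple)


def plan_subquery(matching, filter_vars):
    """Classify the (m_p, m_c) pairs of one matching.

    Returns (filters, substitutions, patterns), or None when the matching is
    rejected because a constructor segment would have to appear in a filter."""
    filters, substitutions, patterns = [], {}, []
    for m_p, m_c in matching:
        if is_var(m_p):
            if is_segment(m_c) and m_p in filter_vars:
                return None                  # a segment cannot appear in a filter
            substitutions[m_p] = m_c         # substitute m_p by m_c
        elif is_segment(m_p):
            patterns.append((m_p, m_c))      # add the pattern "m_p IN m_c"
        else:
            filters.append((m_c, m_p))       # add the filter "m_c = m_p"
    return filters, substitutions, patterns


matching = [
    ("100", "$x1"),                 # constant against a view variable
    (("c5", "$y1"), "$x2"),         # sub-pattern against a view variable
    ("$y2", ("c4", "Hello")),       # user variable against a constructor segment
]
plan = plan_subquery(matching, filter_vars={"$y1"})
```

Here plan holds one filter ($x1 = 100), one substitution for $y2 and one added pattern on $x2. Calling plan_subquery with filter_vars={"$y2"} instead returns None, which corresponds to the rejection at Line 37.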

Example 3.1 Consider the following pair of a view and a query:

    V: function View() {
         WHERE
           <a ID = $i1>
             <b1 ID = $i3> $x1 </>
             <b2 ID = $i4> $x2 </>
           </> </> IN "source.xml"
         CONSTRUCT
           <c ID = f1($i1,$i2,$i3,$i4)>
             <c1 ID = f2($i1,$i2,$i3,$i4)> $x1 </>

The implicit OIDs have already been included in the element tags of the view's constructor. The matching algorithm discovers the different matchings between the pattern tree of Q and the constructor tree of V. The first matching contains the matches (100, $x1, -), (c5, $x2, -) and ($y2, c4, -), and the second matching contains the matches (100, $x1, -), ($y1, $x1, -) and ($y2, c4, -). The third argument of each match is ignored. Each matching corresponds to a different sub-query. The first sub-query involves the following transformations.

(i) The filter $x1 = 100 is added in the filter part of the sub-query.

(ii) The variable $y2 is substituted by the constructor segment

(iii) The pattern expression <c5> $y1 </> IN $x2 is added to the pattern part of the sub-query.

The second sub-query also involves the first and the second transformations. In addition, $y1 is substituted by $x1.

The query that is produced is presented below:

    Q': { WHERE
            <a ID = $i1>
              <b1 ID = $i3> $x1 </>
              <b2 ID = $i4> $x2 </>
            </> </> IN "source.xml",
            <c5> $y1 </> IN $x2,
            $y1 < 100, $x1 = 100
          CONSTRUCT
            <result>
              <d1> $y1 </>
              <d2>
                <c4 ID = f1($i1,$i2,$i3,$i4)>
                  <c6 ID = f2($i1,$i2,$i3,$i4)> Hello </>
                </>
              </>
            </>
        }
        { WHERE
            <a ID = $i1>
              <b1 ID = $i3> $x1 </>
              <b2 ID = $i4> $x2 </>
            </> </> IN "source.xml",
            $x1 < 100, $x1 = 100
          CONSTRUCT
            <result>
              <d1> $x1 </>
              <d2>
                <c4 ID = f1($i1,$i2,$i3,$i4)>
                  <c6 ID = f2($i1,$i2,$i3,$i4)> Hello </>
                </>
              </>
            </>
        }

At this point, we can explain why implicit OIDs are included in the element tags of the view's constructor as described in Subsection 2.1.5. The XML-QL data model is object based, and XML nodes represent distinct objects which are identified by unique object identifiers. When the OID of a tag element in the constructor of a query is not explicitly defined, then, for each binding of the query's variables, one distinct such element is constructed containing a unique OID.

Consider the following view:

    V: WHERE ... CONSTRUCT

When V is applied over an XML document, for each binding of its variables, a different e element is constructed. Now, consider the following query:

    Q: WHERE ... $y ... IN V
       CONSTRUCT
         ... $y ...
         ... $y ...

Assume that when the pattern tree of Q is matched to the constructor tree of V, $y is matched to the node that represents e. If OIDs are ignored, the rewritten query is as follows:

    Q': WHERE ... CONSTRUCT
          ... <e>...</> ...
          ... <e>...</> ...

The above rewritten query constructs two different e elements, i.e., elements with different OIDs, for each binding of the query variables. This is not the output that Q generates. V produces a single e element for each binding. When Q is applied over the resulting document, and $y is bound to the subgraph that includes an element e, a single element e is created. So, the rewritten query should assign the same OID to the two e elements that correspond to the same binding of $y. This is achieved by including the implicit OID definitions in the element tags of V's constructor and preserving them during the matching procedure.

Figure 3.1 exhibits how ignoring implicit OID definitions can lead to incorrect rewritten queries for the view and user query presented in Example 3.1.

3.3.2 Justification of the algorithm

We show that the rewritten query Q' which is produced by Algorithm 3.2 is equivalent to the composition of the user query Q(Y) and the view V(X). In order to simplify our presentation, we assume that no filter conditions appear in Q.

Let P_V(X), F_V(X) and C_V(X) be the pattern, filter and constructor part of V, and P_Q(Y) and C_Q(Y) be the pattern and constructor part of Q. We first apply the matching algorithm and produce a set of matchings. Consider a matching M in this set. For this matching, the rewriting algorithm produces a query block Q'_M, which includes: (i) the pattern expression P_V(X) of V; (ii) the filter part F_V(X) of V; (iii) a set of pattern expressions P_1(Y), ..., P_r(Y); (iv) a set of filters F'(X) = {(x_1 = c_1), ..., (x_s = c_s)}, where x_1, ..., x_s ∈ X; and (v) the constructor C_Q(C_{V,1}(X), ..., C_{V,q}(X), y_{q+1}, ..., y_m). A pattern expression P_j(Y) corresponds to a match between a variable x_j ∈ X and a pattern, i.e., an internal node or a collection of nodes of the corresponding pattern tree. A condition x_j = c_j corresponds to a match between a variable x_j ∈ X and a constant c_j. Finally, a substitution C_{V,j}(X) in C_Q corresponds to a match between a variable y_j ∈ Y and a component C_{V,j}(X) of the constructor C_V(X).

[Figure 3.1: Importance of implicit OID definitions. Panel (b): output when ignoring implicit OID definitions. Panel (c): correct output.]

Assume that V is applied on an XML document db. Let X^i be a binding of the set of variables X, and let t_i = C_V(X^i) be the corresponding tuple that is produced by V. Assume that Q is applied on t_i and a set of bindings is derived. Consider such a binding Y^i, where each variable y_j ∈ Y is bound to C'_{V,j}(X^i). The resulting tuple is t'_i = C_Q(C'_{V,1}(X^i), ..., C'_{V,m}(X^i)). As discussed in Subsection 3.2.2, every such binding can be derived from a matching that the matching algorithm produces. Assume that Y^i is derived from M. This means that C'_{V,j} ≡ C_{V,j}, where j ∈ {1, ..., q}. It also means that there is a pattern expression P_u(Y) which binds y_j to C'_{V,j}(X^i), where u ∈ {1, ..., r} and j ∈ {q+1, ..., m}. It also means that F'(X^i) evaluates to true. Consequently, when Q'_M is applied on db, there is a binding X^i ∪ {C'_{V,q+1}(X^i), ..., C'_{V,m}(X^i)} of its set of variables which results in the tuple C_Q(C_{V,1}(X^i), ..., C_{V,q}(X^i), C'_{V,q+1}(X^i), ..., C'_{V,m}(X^i)) = t'_i. So, the tuple t'_i is also produced by Q'_M, and therefore Q' ⊇ Q.

Assume now that Q' is applied on db and generates a tuple t'_i = C_Q(C_{V,1}(X^i), ..., C_{V,m}(X^i)), which corresponds to a binding X^i ∪ {y_{q+1}, ..., y_m} of its set of variables. Let Q'_M be the sub-query of Q' that generates t'_i. Moreover, let M be the matching that corresponds to Q'_M. As shown in Subsection 3.2.2, the matching algorithm does not produce matchings that derive invalid bindings of the user-query variables, i.e., bindings that cannot be derived when applying the pattern part of the user query on the view. Thus, there is a binding of the set of variables Y such that y_j is bound to C_{V,j}(X^i). The tuple that Q generates for this binding is C_Q(C_{V,1}(X^i), ..., C_{V,m}(X^i)) = t'_i. Therefore, Q ⊇ Q', and Q' ≡ Q.

3.4 Regular path expressions

The introduction of regular path expressions in the user query does not require any changes in the rewriting algorithm. However, the matching procedure is more complex than the one presented in Algorithm 3.1. We do not provide a solution here.

Regular path expressions can also appear in a view definition. In Subsection 2.13, we concluded that when a query involves regular path expressions we cannot derive the implicit OIDs in the query's constructor. In addition, in the previous section, we showed that the rewriting can produce incorrect rewritten queries when the OID definitions are not included in the element tags of the view's constructor. For this reason, we assume that view queries either explicitly define all the OIDs¹ in their constructor or do not contain regular path expressions in their pattern part.

3.5 Multiple pattern expressions in the user query

Our rewriting algorithm considers that a single pattern expression appears in the user query. In this section we present how multiple pattern expressions in the user query can be handled.

Consider that a user query Q(Y) = Q(Y_1, Y_2) contains two pattern expressions PE_1(Y_1) and PE_2(Y_2), where Y = Y_1 ∪ Y_2 is the set of variables that appear in Q. Since no equality conditions between variables exist, we conclude that (i) Y_1 ∩ Y_2 = ∅, and (ii) the filter part F_Q(Y) of Q can be split into two different sets of filter conditions F_1(Y_1) and F_2(Y_2).

Assume that PE_1 is defined over a view V_1(db_1, X_1), where db_1 denotes an XML document, and X_1 is a set of variables such that X_1 ∩ Y = ∅. Also assume that PE_2 is defined over an XML document db_2. Our goal is to construct a query Q'(db_1, db_2, X_1, Y_1, Y_2) ≡ Q(V_1, db_2, Y_1, Y_2).

Let X_1^i be a binding of the variables in X_1 when V_1 is applied on db_1. Similarly, let Y_2^i be a binding of the variables in Y_2 when Q is applied on db_2. For each pair X_1^i, Y_2^i, a set of tuples C_Q(X_1^i, Y_2^i) is produced by Q, where C_Q is the constructor of Q.

We transform Q into a query Q_{V_1} by removing PE_2(Y_2) and F_2(Y_2). Any variable y ∈ Y_2 which appears in the constructor of the new query is handled as a constant. For each binding of the set of variables X_1, Q_{V_1} produces a set of tuples C_Q(X_1^i, Y_2).

¹ Explicitly defined OIDs are introduced in Chapter 4.


Let Q'_{V_1}(db_1, X_1, Y_1, Y_2) be the query that is produced when applying the rewriting algorithm on Q_{V_1}. Consider the following query:

Q': WHERE
      PE_2(db_2, Y_2), F_2(Y_2)
    CONSTRUCT
      Q'_{V_1}(db_1, X_1, Y_1, Y_2)

The above query produces a set of tuples C_Q(X_1^i, Y_2^i) for each pair X_1^i, Y_2^i of bindings of the variable sets X_1 and Y_2, respectively. So, we conclude that Q' ≡ Q.

The situation is similar if we assume that PE_2 is defined over a view V_2(db_2, X_2), where X_2 ∩ X_1 ∩ Y = ∅. Following the procedure described above, we produce the following query:

Then, we apply the rewriting algorithm to Q', and produce the query

The nesting of the generated query Q'' can be eliminated. Assume that Q'_{V_1} is composed of the sub-queries Q'_{V_1,1}, ..., Q'_{V_1,q} and Q'' is composed of the sub-queries Q''_1, ..., Q''_r. The CONSTRUCT clause of each Q''_i consists of the union of all the Q'_{V_1,j} sub-queries. So each Q''_i can be split into q sub-queries. The jth such sub-query contains the patterns and filters of the WHERE clause of both Q''_i and Q'_{V_1,j}, as well as the constructor of Q'_{V_1,j}. Consequently, the resulting query consists of the union of q × r sub-queries. Each sub-query corresponds to a certain pair of matchings (M_1, M_2), where M_1 is a matching between PE_1 and V_1's constructor, while M_2 is a matching between PE_2 and V_2's constructor.
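The unnesting step can be sketched in Python as a cross product over the two unions. The dictionary layout of a query block ("where" and "construct" keys) and the function name unnest_union are assumptions made for illustration only; they are not part of Algorithm 3.2.

```python
from itertools import product


def unnest_union(outer_blocks, inner_blocks):
    """Flatten r outer blocks whose CONSTRUCT nests a union of q inner blocks.

    Each resulting block carries the WHERE conditions of both blocks and the
    constructor of the inner block, giving q * r flat sub-queries.
    """
    flat = []
    for outer, inner in product(outer_blocks, inner_blocks):
        flat.append({
            "where": outer["where"] + inner["where"],
            "construct": inner["construct"],
        })
    return flat


# r = 2 outer sub-queries (matchings of PE_2), q = 3 inner ones (matchings of PE_1).
outer = [{"where": ["PE2 via M2_1"]}, {"where": ["PE2 via M2_2"]}]
inner = [
    {"where": ["PE1 via M1_1"], "construct": "C1"},
    {"where": ["PE1 via M1_2"], "construct": "C2"},
    {"where": ["PE1 via M1_3"], "construct": "C3"},
]
flat = unnest_union(outer, inner)
```

Each element of flat corresponds to one pair of matchings (M_1, M_2).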

The above solution can be generalized for user queries that contain more than two pattern expressions. Pattern expressions can also be defined over variables. We do not provide a solution for this case.

3.6 Handling variable equality in the user query

In the previous sections, we assumed that no equality conditions between variables exist in the user query. Equality can be expressed either by including equality conditions in the filter part, e.g., $x = $y, or by placing the same variable in different positions in the pattern part of a query. If the same variable appears more than once in the pattern part of the user query, we can use additional variables and declare their equality by filters. For instance, consider the following pattern expression, which matches people whose name is identical to their surname:

WHERE <view>
        <person> <name> $y </> <surname> $y </> </>
      </> IN V

This pattern expression can be rewritten as follows:

Let v_1 and v_2 be two variables in a user query Q which is defined over a set of views {V_1, ..., V_l}. Assume that a filter v_1 = v_2 appears in the filter part of Q. The variables v_1 and v_2 may appear either in the same or in different pattern expressions of Q.


Let Q' be the rewritten query that is produced when the filter v_1 = v_2 is omitted. As shown in the previous section, Q' consists of the union of several sub-queries, and each such sub-query corresponds to a set M of matchings between the pattern expressions of Q and the corresponding views. Let Q'_M be the sub-query that corresponds to M. M represents a set of bindings of the variables that appear in Q, which produce a certain set of tuples when the filter v_1 = v_2 is omitted. When the above filter is included, Q produces a subset of these tuples. In order to produce the same subset of tuples with the rewritten query, we should also include the appropriate filter in the filter part of Q'_M.

The filter v_1 = v_2 can be directly included in the filter part of Q'_M only if both v_1 and v_2 have not been substituted by the rewriting algorithm.

Assume that only v_1 has been substituted by a component C_i(X_i) of the constructor of a view V_i. If C_i(X_i) is a constant c, then we just add the filter v_2 = c to Q'_M. If C_i(X_i) is a constructor segment which is not a variable, the filter v_1 = v_2 always returns false for the corresponding bindings of v_1 and v_2. The reason is that v_2 can only be bound to the OID of a source element, and this OID cannot be equal to the OID of an element produced by V_i. In this case, the matching M is rejected, i.e., the sub-query Q'_M is removed. Finally, if C_i(X_i) is a variable x ∈ X_i, special care is required. The fact that x is bound to an OID does not indicate that v_1 is bound to the same OID. To be more precise, v_1 is bound to the OID of an element that V_i produces. In this case the filter v_1 = v_2 returns false, even if the equality v_2 = x is true. However, x can also be bound

to a constant. In order to handle this case, we introduce the external function atomic(v), which returns true if and only if v is bound to an atomic value. The filter that we add to Q'_M is atomic(x) AND v_2 = x.

Assume now that both v_1 and v_2 have been substituted by C_i(X_i) and C_j(X_j), respectively, where C_i(X_i) is a component of V_i's constructor, and C_j(X_j) is a component of V_j's constructor. The following six cases can be identified.

Case 1: C_i(X_i) ≡ C_j(X_j) ≡ c, where c is a constant. No filter is added to Q'_M, since v_1 = v_2 always evaluates to true for the set of bindings that correspond to M.

Case 2: C_i(X_i) is a constant c, and C_j(X_j) is a variable x ∈ X_j. The filter x = c is added to Q'_M. The case where C_i(X_i) is a variable and C_j(X_j) is a constant is symmetric.

Case 3: C_i(X_i) ≡ f(x_1, ..., x_n), C_j(X_j) ≡ f(x'_1, ..., x'_n), and i = j, i.e., both the components are Skolem function declarations of the same Skolem function, and appear in the same view. The filters x_1 = x'_1, ..., x_n = x'_n are added to Q'_M.

Case 4: Both C_i(X_i) and C_j(X_j) are constructor segments of the same view, i.e., i = j, and they are not variables. Moreover, the OID definitions that appear as arguments in the corresponding matches of M are id_i ≡ f(x_1, ..., x_n) and id_j ≡ f(x'_1, ..., x'_n), respectively. As discussed in Subsection 3.2.2, these OID definitions represent the actual bindings of v_1 and v_2. Thus, the filters x_1 = x'_1, ..., x_n = x'_n are added to Q'_M.

Case 5: C_i(X_i) ≡ x and C_j(X_j) ≡ x', where x and x' are variables. We do not know whether x and x' are bound to atomic values or to OIDs. Consequently, we do not know whether the equality between v_1 and v_2 involves constants or OIDs. For this reason, we use again the external function atomic(). If id_i ≡ f(x_1, ..., x_n), id_j ≡ f(x'_1, ..., x'_n) and i = j, where id_i and id_j are defined as in Case 4, we add the filter

(atomic(x) AND x = x') OR (NOT(atomic(x)) AND x_1 = x'_1, ..., x_n = x'_n)

Otherwise, since v_1 and v_2 cannot be bound to the same OID, we add the filter

atomic(x) AND x = x'

Case 6: None of the previous cases is valid, e.g., C_i(X_i) is a constant and C_j(X_j) is a Skolem function. In this case, the filter v_1 = v_2 evaluates to false for every binding that corresponds to M. The sub-query Q'_M is removed.
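The six cases can be read as a dispatch on the kinds of the two substituted components. The following Python sketch is one possible encoding, not the thesis implementation: the classes Const, Var, Skolem and Segment, the REJECT sentinel and the function equality_filter are hypothetical names, filters are produced as plain strings, and OID definitions are assumed to be available as (function name, arguments) pairs.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

OidDef = Tuple[str, Tuple[str, ...]]  # (Skolem function name, argument variables)


@dataclass(frozen=True)
class Const:
    value: str


@dataclass(frozen=True)
class Var:
    name: str
    oid_def: Optional[OidDef] = None  # OID definition recorded in the match, if any


@dataclass(frozen=True)
class Skolem:
    name: str
    args: Tuple[str, ...]


@dataclass(frozen=True)
class Segment:
    oid_def: OidDef


REJECT = "REJECT"  # sentinel: the sub-query is removed


def _pairwise(xs, ys):
    return " AND ".join(f"{x} = {y}" for x, y in zip(xs, ys))


def _same_function(d1, d2):
    return (d1 is not None and d2 is not None
            and d1[0] == d2[0] and len(d1[1]) == len(d2[1]))


def equality_filter(ci, cj, same_view):
    """Return None (no filter), a filter string, or REJECT for v1 = v2."""
    if isinstance(ci, Const) and isinstance(cj, Const):
        return None if ci.value == cj.value else REJECT           # Case 1, else Case 6
    if isinstance(ci, Const) and isinstance(cj, Var):
        return f"{cj.name} = {ci.value}"                           # Case 2
    if isinstance(ci, Var) and isinstance(cj, Const):
        return f"{ci.name} = {cj.value}"                           # Case 2, symmetric
    if isinstance(ci, Skolem) and isinstance(cj, Skolem):
        if same_view and _same_function((ci.name, ci.args), (cj.name, cj.args)):
            return _pairwise(ci.args, cj.args)                     # Case 3
        return REJECT
    if isinstance(ci, Segment) and isinstance(cj, Segment):
        if same_view and _same_function(ci.oid_def, cj.oid_def):
            return _pairwise(ci.oid_def[1], cj.oid_def[1])         # Case 4
        return REJECT
    if isinstance(ci, Var) and isinstance(cj, Var):
        atomic = f"atomic({ci.name}) AND {ci.name} = {cj.name}"
        if same_view and _same_function(ci.oid_def, cj.oid_def):
            oids = _pairwise(ci.oid_def[1], cj.oid_def[1])
            return f"({atomic}) OR (NOT(atomic({ci.name})) AND {oids})"  # Case 5
        return atomic                                              # Case 5, otherwise
    return REJECT                                                  # Case 6
```

For two view variables without usable OID definitions, the sketch returns a filter of the form atomic(x) AND x = x', which is the second filter of Case 5.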

Example 3.2 We consider the following pair of view and user query:

V: function View() {
     WHERE <a>
             <a1> $x1 </> <a2> $x2 </>
           </> IN "source.xml"
     CONSTRUCT
             <b1> $x1 </> <b2> $x2 </>
           </>
   }

The nodes $y1 and $y2 of the pattern tree of Q match the nodes $x1 and $x2, respectively, of the constructor tree of V, and are required to be equal. The matching corresponds to Case 5. The variables can only be bound to string values, since the OIDs of b1 and b2 are always different. The rewritten query is presented below:

Q': WHERE
      <a> <a1> $x1 </> <a2> $x2 </>
      </> IN "source.xml", atomic($x1) AND $x1=$x2
    CONSTRUCT <result> $x1 </>

It is clear that filters that involve inequality between variables of the user query are handled similarly.

Chapter 4

Introducing explicit OID definitions

The rewriting algorithm that we presented in the previous chapter assumes that no explicitly defined OIDs exist in the constructor of the view. In this chapter, we show how explicit OID definitions can be handled without changing the core of Algorithm 3.2. We assume that the same Skolem function does not define the OID of more than one element. In other words, element merging is not handled.

Explicit OID definitions by means of Skolem functions are used to control how the result of a query is grouped. For instance, consider the view query:

function Classes() {
  WHERE <couple>
          <boy> $b </> <girl> $g </> <age> $a </>
        </> IN "couples.xml"
  CONSTRUCT <class ID = f($a)>
              <boy> $b </> <girl> $g </>
            </>
}

This query matches couples of boys and girls and groups them by their age. Each couple contains a boy and a girl of the same age. Each class element of the resulting document contains all the boys and girls of a certain age. This means that the structure of the XML document that the query produces does not strictly follow the structure of the query's constructor.
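The grouping performed by the Skolem function can be mimicked outside XML-QL. The Python sketch below is only an illustration: the tuple layout of a couple, the dictionary representation of a class element and the sample names are assumptions.

```python
def classes_view(couples):
    """Group (boy, girl, age) rows under one class element per Skolem term f(age)."""
    classes = {}
    for boy, girl, age in couples:
        oid = ("f", age)  # the OID f($a): equal ages yield the same class element
        cls = classes.setdefault(oid, {"boy": [], "girl": []})
        cls["boy"].append(boy)
        cls["girl"].append(girl)
    return classes


couples = [("Al", "Bea", 7), ("Cy", "Di", 7), ("Ed", "Fay", 8)]
view = classes_view(couples)
```

Three source couples produce only two class elements here, and the class for age 7 holds two boys and two girls, so the output does not mirror the one-boy, one-girl shape of the constructor.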

4.1 Splitting pattern trees

Consider the following user query, which is defined over the view Classes:

Q: WHERE <view>
           <class> <boy> $x </> <girl> $y </> </>
         </> IN Classes()
   CONSTRUCT <couple>
               <boy> $x </> <girl> $y </>
             </>

The above query creates couples between boys and girls from the set of children that appear in each class. All the possible combinations of couples between boys and girls of the same age are produced. If we apply Algorithm 3.2 to the above view and user query, the resulting rewritten query is:

Q': WHERE <couple>
            <boy> $b </> <girl> $g </> <age> $a </>
          </> IN "couples.xml"
    CONSTRUCT <couple>
                <boy> $b </> <girl> $g </>
              </>

It is clear that the above query does not produce the same result. The problem originates from the fact that a certain class element in the view can contain more than one boy and girl element. Unfortunately, the matching algorithm considers that the structure of each class strictly follows the structure of the constructor of the view, which involves a single boy and a single girl element.

However, if we split the pattern tree of Q so that the boy and the girl node can match the constructor tree independently, the problem is resolved. We transform Q as follows:

Q: WHERE <view>
           <class ID=$i1> <boy> $x </> </>
         </> IN Classes(),
         <view>
           <class ID=$i2> <girl> $y </> </>
         </> IN Classes(), $i1 = $i2
   CONSTRUCT <couple>
               <boy> $x </> <girl> $y </>
             </>

The transformed query is equivalent to the original one, since the filter $i1 = $i2 ensures that the two pattern expressions match boy and girl elements that belong to the same class. These two pattern expressions are not affected by the grouping under the class element of the view, since they can match the corresponding boy and girl elements no matter if these elements appear under the same or a different class element. Therefore, we can apply the rewriting algorithm independently for the two pattern expressions as described in Section 3.5 and receive a rewritten query which produces the correct output:

      ...
      <age> $a' </> </> IN "couples.xml", $a = $a'
CONSTRUCT <couple>
            <boy> $b </> <girl> $g' </>
          </>

The filter $a = $a' results from the filter $i1 = $i2, since $i1 is matched to f($a) and $i2 is matched to f($a').
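The difference between the two rewritings can be checked on a small data set. The Python sketch below compares three hypothetical functions: compose materialises the view and applies Q, naive_rewriting mirrors the query obtained without splitting, and split_rewriting mirrors the self-join on age obtained after splitting. Results are modelled as sets of (boy, girl) pairs, which ignores OIDs and duplicates; that simplification is an assumption of the sketch.

```python
def compose(couples):
    """Q applied over the materialised view: pair every boy and girl of a class."""
    classes = {}
    for boy, girl, age in couples:
        boys, girls = classes.setdefault(age, ([], []))  # class OID f(age), keyed by age
        boys.append(boy)
        girls.append(girl)
    return {(b, g) for boys, girls in classes.values() for b in boys for g in girls}


def naive_rewriting(couples):
    """Rewriting without splitting: one output couple per source couple."""
    return {(boy, girl) for boy, girl, _age in couples}


def split_rewriting(couples):
    """Rewriting after splitting: self-join of couples.xml on equal age."""
    return {
        (boy, girl2)
        for boy, _girl, age in couples
        for _boy2, girl2, age2 in couples
        if age == age2
    }


couples = [("Al", "Bea", 7), ("Cy", "Di", 7), ("Ed", "Fay", 8)]
```

On this input the split rewriting agrees with the composition, while the naive rewriting misses the pairs (Al, Di) and (Cy, Bea).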

In general, the pattern tree that corresponds to a user query is split into a set of ordered lists, whose number is equal to the number of the leaf nodes. Each ordered list consists of the nodes of a different path of the tree. The first node in each list is the root node, while the last node in the list is a leaf node of the tree. Before splitting the tree, all the nodes are assigned a variable as their OID attribute. So, the OID of nodes in different lists that correspond to the same tree node is represented by a common variable.
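A minimal Python sketch of this splitting step follows. The tree encoding (label, children), the variable naming scheme $i1, $i2, ... and the function names are assumptions chosen for illustration.

```python
from itertools import count


def assign_oids(tree, counter=None):
    """Attach an OID variable to every node of a (label, children) tree, in preorder."""
    counter = counter if counter is not None else count(1)
    label, children = tree
    oid = f"$i{next(counter)}"
    return (label, oid, [assign_oids(child, counter) for child in children])


def split_paths(node):
    """Return one ordered list of (label, oid) pairs per root-to-leaf path."""
    label, oid, children = node
    if not children:
        return [[(label, oid)]]
    return [[(label, oid)] + path for child in children for path in split_paths(child)]


pattern_tree = ("view", [("class", [("boy", []), ("girl", [])])])
paths = split_paths(assign_oids(pattern_tree))
```

Because the OID variables are assigned before splitting, the view and class nodes carry the same variables in both lists, which is what ties the resulting pattern expressions to the same class element.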

4.2 Matching grouped contents

When explicit OID definitions appear in the constructor of the view query, the substitute() function in Line 35 of Algorithm 3.2 should be adjusted to handle grouping.

Consider the following query:

Q: WHERE <view>
           <class> $c </>
         </> IN Classes(),
         <teacher> $t </> in "teachers.xml"
   CONSTRUCT <class>
               <teacher> $t </> $c
             </>

This query assigns teachers to classes. All the possible combinations between teachers and classes are produced. If we apply the matching algorithm to the above query, $c is matched to the collection of the nodes that represent the boy and girl element of the view's constructor. These elements are grouped under the class element by the Skolem function that defines its OID. According to the rewriting algorithm, $c is substituted by the constructor segment that corresponds to the matched nodes. Since the parent of these nodes does not participate in the match, the grouping cannot be expressed by a Skolem function.

In general, a match between a variable in the pattern tree of a user query and a node or a collection of nodes in the constructor tree of a view query needs a different manipulation when the third argument of the corresponding triple involves a Skolem function of an explicit OID definition. More precisely, the variable cannot just be replaced by the constructor segment that the matched nodes represent, since the grouping information should also be included.

We explain how the problem is resolved. A grouping performed by means of a Skolem function can also be derived by using a query block. Consider the following query:

WHERE P(x_1, ..., x_n) CONSTRUCT

The grouping that the Skolem function f performs can be expressed by a query block as follows:

WHERE P(x_1, ..., x_n) CONSTRUCT

where X' = {x'_1, ..., x'_n} is produced after renaming the set of variables X = {x_1, ..., x_n}. The grouping is expressed by means of a nested query, which scans the input document and places the data that correspond to the same values of x_k under the same e element.
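The equivalence between the two formulations can be sketched in Python. Both functions below are hypothetical: the first groups in a single pass by a Skolem-style key, the second rescans the input inside an outer loop, in the manner of a nested query block. Rows, keys and contents are arbitrary Python values chosen for the illustration.

```python
def group_by_skolem(rows, key, content):
    """Single pass: rows with the same Skolem key land under the same element."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(content(row))
    return groups


def group_by_nested_scan(rows, key, content):
    """Outer block picks a key; the nested block rescans the input for that key."""
    groups = {}
    for outer in rows:
        k = key(outer)
        if k in groups:
            continue  # this group was already built by an earlier outer row
        groups[k] = [content(inner) for inner in rows if key(inner) == k]
    return groups


rows = [("Al", 7), ("Bea", 7), ("Ed", 8)]
by_skolem = group_by_skolem(rows, key=lambda r: r[1], content=lambda r: r[0])
by_nested = group_by_nested_scan(rows, key=lambda r: r[1], content=lambda r: r[0])
```

Both calls produce the same groups; the nested form pays for it with a rescan of the input, which corresponds to the renamed copy of the pattern P in the sub-query.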

Assume that the above query defines a view and consider a user query that is applied over this view. If there is now a match between a variable of the user query and the collection of nodes that represent the constructor segment c(x_1, ..., x_n), the variable is substituted by the sub-query block:

The parameters {x_k, ..., x_{k+l}} of the Skolem function are derived from the third argument of the triple that represents the match.

If the rewritten query involves several such substitutions, the resulting sub-query blocks are independent, i.e., there is no nesting between them. Therefore, the same renaming X' can be applied to all the sub-queries. Since the pattern part P(x'_1, ..., x'_n) is common in all these sub-queries, it can be replaced by a single P(x'_1, ..., x'_n) in the WHERE clause of the main query. Before this replacement, we explicitly define the OIDs in all the element tags of the query's constructor in order to ensure that the mappings that correspond to P(x'_1, ..., x'_n) only affect data that participate in groupings.

In our example, after following the procedure presented above, the following rewritten query is produced:

Q': WHERE <couple>
            <boy> $b </> <girl> $g </> <age> $a </>
          </> IN "couples.xml",
          <teacher> $t </> in "teachers.xml"
    CONSTRUCT <class>
                <teacher> $t </> {
                  WHERE <couple ID = $i1'>
                          <boy ID = $i2'> $b' </> <girl ID = $i3'> $g' </> <age ID = $i4'> $a' </>
                        </> IN "couples.xml", $a'=$a
                  CONSTRUCT <boy ID = g1($i1',$i2',$i3',$i4')> $b' </> <girl ID = g2($i1',$i2',$i3',$i4')> $g' </>
                }
              </>

We can move the pattern part of the sub-query to the outermost level, after including all the implicit OID definitions:

Q': WHERE <couple ID = $i1>
            <boy ID = $i2> $b </> <girl ID = $i3> $g </> <age ID = $i4> $a </>
          </> IN "couples.xml",
          <teacher ID = $i5> $t </> in "teachers.xml",
          <couple ID = $i1'>
            <boy ID = $i2'> $b' </> <girl ID = $i3'> $g' </> <age ID = $i4'> $a' </>
          </> IN "couples.xml"
    CONSTRUCT <class ID = h1($i1,$i2,$i3,$i4,$i5)>
                <teacher ID = h2($i1,$i2,$i3,$i4,$i5)> $t </> {
                  WHERE $a'=$a
                  CONSTRUCT <boy ID = g1($i1',$i2',$i3',$i4')> $b' </> <girl ID = g2($i1',$i2',$i3',$i4')> $g' </>
                }
              </>

Chapter 5

Conclusions and future work

The problem addressed by this thesis is query composition by means of query rewriting in the context of XML-QL. More precisely, we studied how a user query defined over one or more views can be rewritten so that the rewritten query only refers to source data. User queries and views are expressed as XML-QL queries.

We presented a simple rewriting algorithm which is based on a matching procedure between the pattern part of the user query and the constructor part of the view query. The matching procedure results in sets of matchings which involve components of the above parts of the view and the user query. Matchings represent the bindings of the user-query variables when the user query is applied over the view. These bindings can be derived directly from the bindings of the view-query variables, without materializing the view. In order to facilitate the matching procedure, pattern expressions and constructors are represented by tree structures. The XML-QL data model is object based, i.e., XML elements are assigned a unique OID. In order to produce a correct rewriting, the elements in the constructor of the view query should be assigned an OID by means of Skolem functions. We showed how OIDs are derived when they are not explicitly defined.

The rewriting algorithm does not handle multiple pattern expressions and equality conditions between variables in the user query, explicit Skolem function definitions and nested queries in the constructor of the view query, as well as regular path expressions. We showed that multiple pattern expressions can be handled by applying the algorithm in multiple steps. Each step performs the rewriting for a different pattern expression. We also demonstrated how the algorithm is enhanced to handle equality conditions between user-query variables. Finally, we presented how grouping introduced by explicit Skolem function definitions can be addressed without changing the algorithm. More specifically, we proposed a technique of splitting the pattern expressions of the user query, so that the produced pattern expressions are not affected by the grouping. We also identified cases in which a grouping performed by a Skolem function should be expressed by means of a nested query in the rewritten query.

This thesis did not present a complete solution for explicit Skolem function definitions. We assumed that the same Skolem function cannot appear in more than one element tag, so element merging is not allowed. We are working on how constructor trees should be extended in order to represent element merging. Nested queries in the constructor of the view query could also be handled by enhancing the properties of constructor trees.

In order to simplify the matching procedure, we did not study regular path expressions in the user query. The problem is similar to matching an XML-QL pattern with regular path expressions to an XML document. Techniques used by existing XML-QL query engines [Fer99] can be adapted to handle this problem. Regular path expressions in the view query cannot be handled when the OIDs in the constructor of the query are not explicitly defined. In this case, a rewritten query cannot always be produced. This also means that XML-QL is not closed under composition. In general, when a query language involves regular path expressions and the underlying model is object based, it is not closed under composition [BLP+98].

We considered that the schema of the source data is unknown. Information about the schema of the source data, e.g., by means of DTDs or XML Schema, can change the situation in some special cases. For instance, regular path expressions in the pattern part of a query can be eliminated if the schema strictly restricts the nesting depth of the data. Future work will concentrate on how the schema of the data can be exploited in order to overcome the above limitations and produce optimized rewritten queries. Previous work in this direction has been presented in the context of XMAS [LPVV99, PV99b].
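As a hedged illustration of the depth-bound remark, a Kleene-star step can be replaced by a finite set of fixed paths once a maximum nesting depth is known. The Python sketch below uses hypothetical function names and represents a path as a list of tag names; it does not reflect how any particular XML-QL engine or schema language exposes depth information.

```python
def expand_star(tag, max_depth):
    """Enumerate the fixed paths that tag* can match when nesting is at most max_depth."""
    return [[tag] * depth for depth in range(max_depth + 1)]


def eliminate_star(prefix, tag, suffix, max_depth):
    """Rewrite prefix.tag*.suffix into a list of paths without regular operators."""
    return [prefix + middle + suffix for middle in expand_star(tag, max_depth)]


paths = eliminate_star(["doc"], "section", ["title"], 2)
```

Each resulting path is an ordinary pattern, so the restrictions that apply to regular path expressions in views no longer arise for it.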

Our study was restricted to the unordered data model. Order makes the problem harder. Investigating under which conditions and how rewriting can be performed when order is considered is a subject of future work. Aggregation is another issue not addressed by this thesis. Relevant work on relational [Kim81, GW87, Mur92] and object-oriented models [CM94] can be extended to handle aggregated views in the context of XML.

Appendix A

XML-QL Grammar

XML-QL grammar for views and user queries

XML-QL            ::= (Function | Query) <EOF>
Function          ::= 'FUNCTION' <FUN-ID> '(' ')' '{' Query '}'
Element           ::= StartTag Contents EndTag
StartTag          ::= '<' (<ID>|<VAR>) ('ID' '=' SkolemID)? Attribute* '>'
SkolemID          ::= <ID> '(' <VAR> (',' <VAR>)* ')'
Attribute         ::= <ID> '=' ('"' <STRING> '"' | <VAR>)
EndTag            ::= '<' '/' <ID>? '>'
Literal           ::= <STRING>
Query             ::= Where Construct ('{' Query '}')*
Where             ::= 'WHERE' Condition (',' Condition)*
Construct         ::= 'CONSTRUCT' Contents
Contents          ::= (<VAR> | Literal | Element | Query)+
Condition         ::= Pattern BindingAs* 'IN' DataSource | Predicate
Pattern           ::= StartTagPattern Pattern* EndTag
StartTagPattern   ::= '<' RegularExpression Attribute* '>'
RegularExpression ::= RegularExpression '*' |
                      RegularExpression '+' |
                      RegularExpression '.' RegularExpression |
                      RegularExpression '|' RegularExpression |
                      (<VAR> | <ID> | <UNDER>)
BindingAs         ::= 'ELEMENT_AS' <VAR> | 'CONTENT_AS' <VAR>
Predicate         ::= Predicate <OR> Predicate |
                      Predicate <AND> Predicate |
                      <NOT> Predicate |
                      '(' Predicate ')' |
                      Expression OpRel Expression
Expression        ::= <VAR> | <CONSTANT>
OpRel             ::= '<' | '<=' | '>' | '>=' | '=' | '!='
DataSource        ::= <VAR> | <URI> | <FUN-ID> '(' ')'

Bibliography

[Abi99] S. Abiteboul. On Views and XML. In PODS '99, Proceedings of the eighteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pages 1-9. ACM Press, 1999. Invited talk.

[AGM+97] S. Abiteboul, R. Goldman, J. McHugh, V. Vassalos, and Y. Zhuge. Views for semistructured data. In Workshop on Management of Semistructured Data, Tucson, Arizona, 1997.

[Bar99] C. Baru. XViews: XML views of relational schemas. In Proceedings of the 10th International Workshop on Database and Expert Systems Applications, pages 700-705, Florence, Italy, September 1999.

[BDHS96] Peter Buneman, Susan Davidson, Gerd Hillebrand, and Dan Suciu. A query language and optimization techniques for unstructured data. SIGMOD Record (ACM Special Interest Group on Management of Data), 25(2):505-516, 1996.

[BFS00] Peter Buneman, Mary Fernandez, and Dan Suciu. UnQL: A Query Language and Algebra for Semistructured Data Based on Structural Recursion. VLDB Journal, 9(1):76-110, 2000.

[BLP+98] C. Baru, B. Ludaescher, Y. Papakonstantinou, P. Velikhov, and V. Vianu. Features and Requirements for an XML View Definition Language: Lessons from XML Information Mediation. Position paper in W3C's Query Language Workshop, 1998.

[BMR99] D. Beech, A. Malhotra, and M. Rys. A formal data model and algebra for XML. Communication to the W3C, September 1999.

[BT99] C. Beeri and Y. Tzaban. SAL: An algebra for semistructured data and XML. In Informal Proceedings of Workshop on The Web and Databases, ACM SIGMOD, June 1999.

[CCF+01] D. Chamberlin, J. Clark, D. Florescu, J. Robie, J. Simeon, and M. Stefanescu. XQuery 1.0: An XML Query Language. June 2001. W3C Working Draft.

[CDSS98] S. Cluet, C. Delobel, J. Siméon, and K. Smaga. Your mediators need data conversion! In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'98), volume 27,2 of ACM SIGMOD Record, pages 177-188, New York, June 1-4 1998. ACM Press.

[CM93] Sophie Cluet and Guido Moerkotte. Nested queries in object bases. In Catriel Beeri, Atsushi Ohori, and Dennis Shasha, editors, Database Programming Languages (DBPL-4), Proceedings of the Fourth International Workshop on Database Programming Languages - Object Models and Languages, Manhattan, New York City, USA, 30 August - 1 September 1993, Workshops in Computing, pages 226-242. Springer, 1993.

[CM95] S. Cluet and G. Moerkotte. Classification and optimization of nested queries in object bases. Technical Report 95-6, RWTH Aachen, 1995.

[Con98] World Wide Web Consortium. Extensible Markup Language (XML). http://www.w3.org/TR/REC-xml, February 1998.

[DFF+98] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. XML-QL: A query language for XML. Submission to the World Wide Web Consortium, August 1998. http://www.w3.org/TR/NOTE-xml-ql.

[DFF+99] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. A query language for XML. In Proceedings of the 8th International World Wide Web Conference, pages 77-91. Elsevier Science, May 1999.

Don Chamberlin, Daniela Florescu, and Jonathan Robie. Quilt: an XML query language for heterogeneous data sources. In Proceedings of WebDB, Dallas, TX, May 2000.

[Feg98] L. Fegaras. Query unnesting in object-oriented databases. In Laura Haas and Ashutosh Tiwary, editors, Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data: June 1-4, 1998, Seattle, Washington, USA, volume 27(2) of SIGMOD Record (ACM Special Interest Group on Management of Data), pages 49-60, New York, NY 10036, USA, 1998. ACM Press.

[Fer99] M. Fernandez. XML-QL: A Query Language for XML. User's Guide Version 0.9. http://www.research.att.com/~mff/xmlql/doc/, 1999.

[FSW+99] M. Fernandez, J. Siméon, P. Wadler, S. Cluet, A. Deutsch, D. Florescu, A. Levy, D. Maier, J. McHugh, J. Robie, D. Suciu, and J. Widom. XML query languages: Experiences and exemplars. http://www-db.research.bell-labs.com/user/simeon/xquery.ps, 1999.

[FSW00] Mary Fernandez, Jerome Simeon, and Philip Wadler. A Data Model and Algebra for XML Query. Technical Report, unpublished manuscript.

[Ghf W99j R. Goidman. J. McHugh, and J. Widom. From Semistructured Data to XML:

Migrating the Lore Data %[ode1 and Query Language. In Proceedings of the

2nd International IVorkshop o n the CVeb and Databases (CVebDB '99). pages

2-5-30. P hiladephia. Pensylvania, June 1999.

[GWS7] Richard A. Ganski and Harry K. T. Wong. Optimization of nested SQL

qtieries revisi ted. In Cmeshwar Dayal and Irv Traiger. editors. Proceedings of

rlssociation /or Computing .CIachinery Speciul Interest Group on Manage ment

of Data 1987 annual conference. San Francisco. .Va y 27-29, 1987. pages 23-

33. Yew York. Nk- 10036. LS.4. 1987. .\C'.CI Press.

[Kim81] W. Kim. On optimizing SQL-like nested queries. Technical Report RJ 3063, IBM, San Jose, CA, 1981.

[LD00] H. Liefke and S. Davidson. View Maintenance for Hierarchical Semistructured Data. In International Conference on Data Warehousing and Knowledge Discovery, 2000.

[LPVV99] B. Ludaescher, Y. Papakonstantinou, P. Velikhov, and V. Vianu. View definition and DTD Inference for XML. In Workshop on Query Processing for Semi-Structured Data and Non-Standard Data Formats (in conjunction with ICDT '99), Jerusalem, Israel, 1999.

[MF00] M. Fernandez, W-C. Tan, and D. Suciu. SilkRoute: Trading between Relations and XML. In Proceedings of the 9th International WWW Conference, Amsterdam, May 2000.

[MSA+99] J. C. Mamou, C. Souza, S. Abiteboul, V. Aguilera, A. Ailleret, B. Amann, S. Cluet, B. Hills, F. Hubert, A. Marian, L. Mignet, B. Tessier, A. M. Vercoustre, and T. Milo. XML repository and Active Views Demonstration. In Proceedings of 25th International Conference on Very Large Databases (VLDB '99), September 1999. Demonstration.

M. Muralikrishna. Improved unnesting algorithms for join aggregate SQL queries. In Proceedings of the 18th Conference on Very Large Databases, Morgan Kaufman pubs. (Los Altos CA), Vancouver, August 1992.

J. McHugh and J. Widom. Query Optimization for XML. In Proceedings of the Twenty-Fifth International Conference on Very Large Databases (VLDB '99), September 1999.

[PAGM96] Y. Papakonstantinou, S. Abiteboul, and H. Garcia-Molina. Object Fusion in Mediator Systems. In T. M. Vijayaraman et al., editors, Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB '96), pages 413-424, Los Altos, CA 94022, USA, 1996. Morgan Kaufmann Publishers.

[PHH92] Hamid Pirahesh, Joseph M. Hellerstein, and Waqar Hasan. Extensible/rule based query rewrite optimization in Starburst. In Michael Stonebraker, editor, Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data, San Diego, California, June 2-5, 1992, volume 21(2) of SIGMOD Record (ACM Special Interest Group on Management of Data), pages 39-43, New York, NY 10036, USA, 1992. ACM Press.

Y. Papakonstantinou and V. Vassalos. Query Rewriting for Semistructured Data. In Alex Delis, Christos Faloutsos, and Shahram Ghandeharizadeh, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD-99), volume 28(2) of SIGMOD Record, pages 455-466, New York, June 1-3, 1999. ACM Press.

Y. Papakonstantinou and P. Velikhov. Enhancing Semistructured Data Mediators with Document Type Definitions. In Proceedings of the 15th International Conference on Data Engineering. IEEE Computer Society Press, March 23-26, 1999.

[RLS98] J. Robie, J. Lapp, and D. Schach. XML Query Language (XQL). http://www.w3.org/TandS/QL/QL98/pp/xql.html, 1998.

[SLR98] D. Schach, J. Lapp, and J. Robie. Querying and transforming XML. In Proceedings of The W3C Query Languages Workshop, Boston, 3-4 December 1998. http://www.w3.org/TandS/QL/QL98/.
