Theophanis Tsandilas
A thesis submit ted in conformity with the recpirements for the degree of Master
Graduate Department of Cornputer Science Lniversity of Toronto
Copyright @ 2001 by Theophanis Tsandilas
National Library 191 ofCanada Bibliothèque nationale du Canada
Acquisitions and Acquisitions ef Bibliographie Services services bibliographiques
395 Wellington Street 395, rue Wellington OttawaON K1AON4 ûttawa ON K1A ûN4 Canada Canada
The author has granted a non- exclusive licsnce allowing the National Library of Canada to reproduce, loan, distribute or seil copies of this thesis in microform, paper or electronic formats.
L'auteur a accordé une licence non exclusive permettant a la Bibliothèque nationale dii Canada de reproduire, prêter, distribuer ou vendre des copies de cette thèse sous la fome de microfiche/film, de reproduction sur papier ou sur format électronique.
The author retains ownership of the L'auteur conserve la propriété du copyxight in this thesis. Neither the droit d'auteur qui protège cette thèse. thesis nor substantial extracts fiom it Ni la thèse ni des extraits substantiels may be printed or otherwise de celle-ci ne doivent être UnpRmés reproduced without the author's ou autrement reproduits sans son permission. autnn'satim
Abst ract
Rewi t ing Queries over ?(hl L Views
Theophanis Tsandilas
LLaster
Graduate Department of Cornputer Science
University of Toronto
200 1
S M L i s widely used as the standard format of data representation and data eschange
on the Web. and database research has recently focused on esploiting its experience in
traditional databases to handle similar issues in the context of the new data format. This
thesis deals with the problern of query composition in the context of ShIL-QL. which is
one of the most popular S4IL query languages. !dore specifically. ive study how a user
query defined over one or more views can be rewritten. so t hat the rewritten query only
refers to source data. This problem is entensively studied in the context of relational
and object-oriented databases. but several issues are not yet exarnined in SSIL query
languages .
The thesis presents a simple rewriting algorithm which is based on a matching proce-
dure between properties of the user query and properties of the view query. The purpoae
of this procedure is to derive the bindings of the user-query variables without materializ-
ing the view. The underlying data mode1 is object-based and unordered. Special issues
t hat we study include multiple view references. equality conditions between variables in
the user query. and element grouping in the view.
To my parents
Panagiotis and Iionstant ina
for t heir love and continuous suppor t .
First. I would like to thank my advisor Renée Miller for her help and supervision during
the two years of my master. Although our interests did not coincide. our CO-operation
gave me many valuable lessons. 1 also want to tharik Alberto Mendelzon for his helpful
comments on my t hesis.
Secondly. 1 feel grateful to niy closest friends in Toronto. and rnainly to the smail
Greek community of the department of Computer Science - who hosted and directed
me in iny first steps in Canada. shared with me funny experiences as roomrnates. worked
with me in course assignmenmts and projects. drank and danced with me in the bars
and clubs of Toronto. sang with me in the cold nights of t he winter. and to conclude.
offered me two of the most mernorable years of my He. In particular, I want to t hank my
precious Bowen for her intensive devotion during the last three months of my master.
Above all. 1 want to thank rny family in Greece and mainly rny parents who have
never stopped encouraging and supporting me al1 these years of my studies. I prefer
t h i ~ k i n g t h a t the Company of' my little brother Alexandros. who is always trying to email
me their news from Greece. has helped them to forget rny long absence.
Contents
1 Introduction
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Prelirninaries
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Motivation
. . . . . . . . . . . . . . . . . . . . 1.3 Related work and scope of the thesis
. . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Organization of the thesis
2 Terminology 8
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 .YkfL.QL Y
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1.1 Data Slodel 9
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1.2 Syntax L I
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1.3 Semantics 11
. . . . . . . . . . . . . . . . . . . . . 2.1.4 Spli tting pattern expressions 19
. . . . . . . . . . . . . . . . . . . . . . . . . . . 2 . 5 Object Identifiers 1 3
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1.6 Value equality 14
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Views 1.7
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3 Query Rewriting 16
. . . . . . . . . . . . . . . . . . . . 2.3.1 Constructor and Pattern trees 17
. . . . . . . . . . . . . 2.3.2 Matching pattern trees t o constructor trees 20
3 Query composition . Base case 24
. . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 Formalizing the problem '21
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.1 ..\ ssumptions 25
3.2 Matching t he user query against the view . . . . . . . . . . . . . . . . . . 25
. . . . . . . . . . . . . . . . . . . . . . . 3.2.1 The rnatching algorithm 25
3.2.2 Deriving t h e variable bindings from the matchings . . . . . . . . . 2S
. . . . . . . . . . . . . . . . . . . . . . . . . . . 3.9 The rewriting algorithm 30
. . . . . . . . . . . . . . . . . . . . . . . . . . 3.3.1 Detailed algori t hm 30
. . . . . . . . . . . . . . . . . . . . 3.3.- Justification of the algorithm 36
. . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4 Regular pat h expressions 38
3.3 Multiple pattern expressions in the user query . . . . . . . . . . . . . . . 39
3.6 Handling variable equality in the user query . . . . . . . . . . . . . . . . 41
4 Introducing explicit OID definitions 46
4.1 Splitting pattern trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 SIatching grouped contents . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5 Conclusions and future work 53
A XML-QL Grammar 56
Bibliography
Chapter 1
Introduction
1.1 Preliminaries
The notion of views is essential in traditional database systems. They have been studied
extensively in the context of the relational and object database models. but the relevant
research is still immature in the context of S M L [ConSS].
The recent ernergence of SSIL as the standard exchange format between applications
on the Internet has important implications for the database community. Classical data
models are not adequate to describe SkIL data. The introduction of a data mode1 also
requires the development of new storage mechanisms and new query languages implying
a series of new issues such as indexing. query optimization. and database design.
In this new world of S M L . the importance of views is even more crucial than in
standard database applications. X'rIL is only a syntax and cannot uniquely describe a
certain portion of data. Different authors may use totally different ShIL structures or
names to describe the same data. SSIL views will play a significant role in translating
information between sources and applications t hat use different schemata and ontologies.
They will also provide the rneans to integrate multiple heterogeneous sources. Having in
mind that large volumes of data are stored using traditional data models. e-g.. relational.
views may aIso be invoIved in migrating data froni these modets to SXL.
Issues that have been already studied include SbIL data models and query languages
[SLRgS, FSWSS], XML algebras [BT99. BhlR99. FSCVOO. BFSOO]. and some prelirninary
work on query optirniaation [SI\V99]. The most popular SSIL query ianguages that
have emerged are: SM L-QL [DFF+9Y. DFF+99]. Lorel [GhICV99]. S Q L [RLSSS]. YATL
[CDSS9S]. CnQL [BFSOO]. SQuery [CCF'Ol]. and more recently Quilt [DonOO]. The
language that is used in this thesis is SSIL-QL. Existing work on U I L views involves
issues like vierv definition [B LP+9S. LPC'V991. maintenance of materialized views [LDOO].
active views [MS.\+ 99. AbiSI)]. and S M L viervs over relational da ta (4IF00. Bar991.
1.2 Motivation
This thesis studies the problem of query rewriting in the context of S M L viervs. More
specifically. given an XM L query (we cal1 i t a u s e r que y) t hat is defined over one or more
virtual (non-materialized) S ' r IL vietvs. we find an equivalent query ( we cal! it a relcn'tten
query) that is defined over the source data. The term -query rewritingn also refers to
the reverse problem: given a query defined over a database and a set of viervs over the
same database. find equivalent rewritten queries defined over this set of views. Query
composition is another term t hat is used to describe the problern that ive study. since the
user query is composed rvith the view query in order to produce a single query.
Query rewriting is just one of the methods of evaluating qiieries over vieivs. Other
ways to process queries over a virtual view are: (i) materialize the view on demand:
(ii) rewrite the view so only the portion of the view that is relevant t o the query is
materialized: and (iii) translate the query and the view into expressions of a compositiona1
algebra and apply the composition and any possible optimizations a t the algebraic Level.
The second method is clearly more efficient than the first one. The query rewriting
approach is even more efficient, since view rewriting requires the partial materialization
of the view and the execution of the originai query over this view. However. we should
ment ion that query rewriting can lead to huge rewrit ten queries whose opt imization may
be difficult or even infeasible [.-\GM+97].
The third method is similar to the query rewriting approach. but composition is
based on algebraic transformations rat her t han transformations at the level of the query
language. Mostly due to the lack of a group-by operator in the relational algebra to
express groiiping. early composition and unnest ing techniques were based on transforma-
tions over SQL queries instead of applying algebraic transformations. The correctness
and completeness of such transformations can be more clearly and strict ly proved using
a forma1 algebra'. Later approaches used algebraic techniques to perform que- com-
position and unnest ing optirnizations. mainly in the world of object-oriented databases
[CM93. Feg981. However. due to limitations in existing optimizers. query rewriting al-
gorithms can be more efficient than algebraic transformations t hat are based on these
optimizers. Having in mind that query optimization is still at its infancy in the context
of S M L and that optimizers for S M L query languages are -under construction". the
development of query rewriting techniques is significant .
Example 1.1 Consider the aiem that is expreûsed b y the follou~ing S.IIL-QL query:
V: UHERE <Library>
<book> <t i t l e>$ t< /> <author>$a</> <publisher>Uise Reader</>
</> </> IN "library-xml"
CONSTRUCT <author>
<name>$a</> <book>$t</>
</>
The a b o ~ ~ e statement creates the uiew of the authors and their books that are published by
'~ransformations over SQL resulted in a number of bugs which were finally correct& [GWdï].
Vtfse Reader". The foUo wing query asks for the book ii€[es lhat are wn*€fen by *Marcel
Proust ":
Q: WHERE <view>
<author> <name>Marcel Proust</> <book>$t~/>
</> </> I N V
CONSTRUCT <t i t l e>$ t< />
..lccording to the full rnaterialiiation approach. ae have to compute the cieu*> result and
then apply the query ocer this result. The drawback of this approach is that the query is
only interested in the books of a sppecijic author. J O the computation of the the whole cieu?
query is redundant.
The view rewriting approach suggests the transformation of the oieu? and the insertion
of the condition $a = "Marcel Proust' in the riew definition (me can also replace the
variable Sa b y the string "Marcel Proust"). The dradack mentioned aboce is elin~inated.
but we still need tu ecaluate tu70 separate queries.
Finally. the query rewriting approach suggests the transformation o f the query as
fol10 ws:
Q: WHERE <Library>
<book> <t i t l e>$t</> <author>Marcel Proust</> <publisher>Addison-Wesley</>
</> IN " library .ml2 ' CONSTRUCT
<t i t l e>$ t< />
The query is directly applied on the source data. so the desired rrsult is produced by a
single que y.
Query rewriting becomes more cruciaI when there are rnany levets of views. Le.. views
defined over other views. In this case. instead of evaluating a potential big number of
queries returning redundant portions of data. we compute a single query t hat produces
only the desired data.
1.3 Related work and scope of the thesis
Rewriting queries defined over views can be seen as a special case of the general problem of
query unnesting. which is extensively studied in the context of relational [ l i imsl . GCVSr]
and object-oriented databases [CàI93. Feg981.
In the relational world. query composition is simply performed by merging the FROLI
and WHERE clauses of' the corresponding querics. as long as views are defined as simple
SPJ queries. However. wheri a view is defineci as SELECT DISTI.VCT. transformations
to move or to pull up DISTIiWT should preserve the nurnber of duplicates correctly
[PHHS-1. The situation is also compiex when aggregates are present in the view [SIurSZ].
Aggregates are not examined by t his t hesis.
The problem is similar in the world of object-oriented databases. Queries defined
over views correspond to the case of nesting in the FR0:I.I clause. when no predicate
dependency exists between the inner and the outer block of a nested query. Such nested
expressions can be t ranslormed into simple join expressions [CM9-7j.
There are several issues that make the problem different in the context of S M L and
X M L query languages.
1. T h e schema of XML data is not alivays known. -As a result of this. the schema of
the view is only partialIy knotvn.
2. More than one way of matching the user-query variables with the view variables
may be possible? since an element name can appear multiple times even under the
same parent eiement .
3. XML query tanguages aliow the use of tag variables and regufar path expressions
which make the problem of matching between the variables of the view and the
query harder.
4. Skolem functions are used to group data.
5. In the ShIL world. order is often important. The existence of' index operators
makes the problem even harder. This thesis does not address the order problem.
Query composition in the context of semistructured da ta is first addressed in SISL
[Pr\GlI96]. a datalog-like query language. :\ similar approach is presented by Papakon-
stantinou and Vassalos [PVSYa] for TSL. which is also a datalog-based query language
for semistructured data. The unification algorit hms t hat the above frarneworks propose
are analogous to Our matching algorit hm. The prohlem is also examined by Buneman et
al. [BFSOO] in the context of C'nQL (BDHS961. a que- language and algebra based on
structural recursion. The query composition is performed on the algebraic level. Hoiv-
ever, the underlying data moclel is value-based as opposed to the objcct-based model t hat
ive study. In the UIL-QL data model S M L data are represented as labeled graphs.
whereas in the VnQL data model. S M L data are represented as sets of label/value pairs.
The composition algorit hm presented in the SilkRoute framework [LIFOO] is the clos-
est to our approach. Views are expressed as R S L (Relational t o SSIL Transformation
Language) queries. which inherit the SELECT and FROM clauses from SQL and the
CO WSTRCCT clause from S3IL-QL. In ot her words. RSL queries transform relational
data to XML data. User queries are expressed as .YSIL-QL queries. LVe can enurnerate
several differences between our approach and SilkRoute.
1. Object identifiers are not always explicitly defined in the constructor of a vieiv. In
several cases, these identifiers should be present in the constructor of the rewritten
query in order to ensure that the correct object-elements are produced. The way
we derive them is different in XML-QL than in RKL queries.
2. .A variabIe in the CONSTRUCT clause of on RSL view can only be mapped to a
variable appearing in the WHERE clause of the XàIL-QL query. since it can only
represent atomic values (like variables in SQL queries). In our case. a variable in
the constructor of the view can also be mapped to a whole pattern subexpression.
since it can represent complex non-atomic values. This difference is reîlected in the
rewriting algorit hm.
3. We investigate how equality conditions between variables in the user query are
handled.
4. We present a different approach to handle grouping by Skolem functions in the
view. which is based on splitting the pattern expressions of the user query. We
also indicate that a grouping performed by a Skolem function should sometimes be
expressed by means of a query in order to perform a correct variable substitution
during the rewriting process.
1.4 Organization of the thesis
The thesis is organized as folloms.
r ln Chapter 2, we describe S I I L - Q L and introduce the terms that are used in the
next chapters.
r In Chapter 3. we present our query composition approach. when no explicitly de-
fined object identifiers exist in the COYSTRUCT clause of the view.
r In Chapter 4. we introduce explicitly defined object identifiers in the user query
and suggest a solution to the problem.
a Finally? in Chapter 5. we give a conclusion and present the main issues for future
work.
Chapter 2
Terminology
XML-QL
XML-QL is a declarative SSIL query language that can extract data from existing SSIL
documents and construct new XML documents. .An SML-QL query consists of three
parts: ( i ) a pattern part. which matches elements in the input document and binds
variables; (ii) a filter part. which provides conditions for the bound variables: and (iii)
a constructor part. which specifies the result in terms of the bound variables. The
pattern and the filter part can include multiple pattern and filter expressions. respect ively.
Patterns and filters appear in the WHERE clause. while t he constructor appears in the
CO-JSTRECT clause of t h e query.
Exarnple 2.1 The /o[lowing query returns a[[ the titles of 6oob written by -Dostoye~s!sX.i".
published a/ter 1990:
Q: WHERE CLibrarp //pattern
<book published = $par> ~ t i t l e > $ t < / > <author>Dostoyevski(/>
</> </> I N " l i b r a r y . x m l J ' , $ p a r > 1990
CONSTRUCT
Patterns and constructors have the structure of normal S5IL data. but they can also
contain variables. Patterns can also include regular path expressions. which can be used
t o specif~ clement paths of arbitrary depth and tag variables. ivhicli can be coniidered
as a way of query ing the schema of data.
Example 2.2 The following query returns d l the authors and their icritings. The schema
of the data is not knoun. Cté do riot know lchat kind of icritings erist. e.g.. vhether
is bound to a book or an article element. and in uvhat depth undcr Library they appear.
By means of regular path expressions and tag variables, uve pose a querp thnt does not
require the knowledge O[ the exact schema.
Q: WHERE CLibrary . *>
c$w> <t i t l e>$t< /> <author 1 writer>$aC/>
</> </> IN ''1ibrary.xml"
CONSTRUCT <author ID=AuthorID ($a) >
<name>$a</> <t i t l e>$t< />
</>
In the above example. the results are grouped by the name of the authors. This is
achieved by using the Skolem junction .4uthorID. which generates a new identifier for
each distinct =lue of Sa. According to t h e data model that we use and describe in the
next paragraph, duplicates are eliminated. Therefore. only a single n a m e element will
appear under each produced author element. Skolem functions are powerful and the
rewriting algorithm is based on their use.
2.1.1 Data Mode1
XIVIL-QL supports two difTerent data models: an unordered and an ordered data model.
The unordered data model considers that an XbfL document is a graph. in mhicb
each node is represented by a unique object identifier (OID). .A graph has a unique
root node and is labeled as follows: graph edges are labeled with element tags. nodes
are labeled with sets of attribute-value pairs. and leaves are labeled wit h string values.
Ciraph edges correspond to element tags. while nodes correspond to element contents.
Since every element is denoted by a tag which contains a single label. it may not be clear
how a certain node can be pointed by two different edges. An element can refer to other
elements by means of IDREF attributes. The value of an IDREF attribute is a n OID.
SSIL-Q L blurs the distinction between IDREF attributes and clements. and therefore.
labeled edges represent both element tags and IDREF attribute names. Since a certain
element can be referenced by several other elements. a graph node can be pointed by
more than one edge. Attributes are associated to nodes. Two nodes can be connected
by several edges. but with the following restriction: a node cannot have two outgoing
edges wit h the same labels and the sarne values. i.e.. betiveen any two nodes there can
be at most one edge wit h a given label. and a node cannot have two leaf children wit h
the same label and the same string value. This condition means that S l f L documents
denote sets.
In the ordered data model. the corresponding document graph contains a totally
ordered set of nodes. The order of the nodes corresponds to the natural order of elements
in the document. Given a total order on nodes. we can enforce a local order on the
outgoing edges of each node. In the ordered model. the above restriction is not retained.
so arbitrarily many edges with the sarne source, same edge label. and same destination
d u e can appear. According to this model. duplicates are not eliminated and .YML
documents denote bags.
In our study! the underlying model is unordered. so rewritten queries do not neces-
sarily preserve the right order in the output.
-4s XML-QL is stiil evolvingo its syntax varies in different publications [DFFf 98. Fer991.
In Appendix .\, we present the IILIL-QL syntax that we consider. S S I L - Q L queries can
also be espressed as functions. For simplicity. we consider only functions that have no
arguments:
function GreekCities O{ WHERE
<ci ty> <name> $c </> <population> $p </> <country> Greece </>
</> IN 'ccities.xml'' CONSTRUCT
ccity> <name> $c </> <inhabitants> $p </>
</>
2.1.3 Semantics
CVe continue with our XSIL-QL semantics defini tion. Consider an S l I L - Q L query M.-HERE
P COILTSTRUCT C and the set of variables XI. .... x k t hat are bounded by one or more
conditions that appear in P. Let G be a graph that corresponds to an ShIL document.
We construct a table R(sl, .... xk) with one column for each variable. Each row corre-
sponds to a binding of the variables t hat sat isfies al1 condit ions in P. CVe should mention
that a certain row is produced when each pattern expression in P matches a certain
?(ML subgraph of the source data. More information about how pattern expressions
match source '<ML data can be found in the X M L - Q L proposa1 [DFF'SS]. Two different
rows always correspond to tivo different matched X!d L subgraphs for at least one pattern
expression of P.
In each row. a variable can be bound either to an OID that corresponds to internai
non-leaf node in the graph, to a string Value (leaf node. attribute value or tag label).
Let C(rl, .... xi) be the template contained in t h e COKSTRUCT clause Jepending on
a subset of the variables rio .... xk. If x i . .... xi are the bindiogs in row i of the table R.
where i = I..n, then. for that row. ive construct the S M L fragment Ci := C'(xi. .... xi).
Each Ci is cailed tup le . The query's ansiver contains al1 the produced ttiples. without
taking into consideration any order.
The above description does not clarify how a variable could be bound to the content of
an SSIL fragment which contains bot h elements and string values. For instance consider
that a variable var is bound to the content of the following SM L fragment:
<city> Athens <country> Greece </country>
< /c i t y>
It is not clear whether the edge cit y points to a leaf node or te an internal node. In
such cases. extra edges Iabeled as CD.4T.4 encapsulate the string values [DFFf 991:
<city> <CDATA> Athens </CDATA> <country><CDATA> Greece </CDATA></country>
< /c i t y>
It is now clear that the variable uar is bound to the OID of the internal node that is
pointed by the edge n' ty.
2.1.4 Splitting pattern expressions
Consider t h e followiog pat, tern expression:
CVHERE El ... En [IV Source
where El! .... En are distinct SML-QL patterns. The pattern expression is equivalent to
the following set of pattern expressions:
bt'H E RE El IiV Sourcet .... En I N Source
2.1.5 Object Identifiers
Each element appearing in the CONSTRUCT part of a query includes an OID which
is probably not explicitly defined. However. it is implicitly determined by a Skolem
function. For instance. consider the function GreekCities() in Subsection 2.1.2. For
each binding of the query's variables with the source data. a distinct c i t y element is
produced. So a new identifier is constructed for each distinct binding. i-e.. rve should
define Skolem functions that produce a unique value for each roiv of the table R. mhich
contains the variable bindings. In Subsection 2.1.3. we mentioned that tivo different
rows in R are produced when at least one pattern expression of P matches two different
S M L subgraphs of the source data. Tivo subgraphs are different if they contain at least
two nodes with different values. By diffrrent values. we mean eit her different OIDs or
unequal string values in the case of leaf nodes. Lire should mention t hat two graphs can
b e different even if t hey contain identical Ieaf nodes. Therefore. in order to ensure t hat a
different result is produced for each row in R: ( i ) a different identifier variable is assigned
to each element that appears in P. and ( i i ) a unique OID is created for each element
in the constructor of t he query by means of Skolem functions defined over the above
variables. If id l , .... id, are the identifier variables t hat are assigned to the elements t hat
appear in the pattern part of a querv. each element < ei > in the constructor part can
b e ivritten as < f i I D = fi(idi. .... idp) >. The pattern part may contain several pattern
expressions. It is conceivable t hat when the O ID of an element is explicit ly defined. no
addit ional transformation is applied.
.\ccording to the above description. the function GreekCit ies( ) is written as follows:
V: function GreekCities()( WHERE
<ci ty ID = $il> <name ID = $i2> $c </> <population ID = $i3> $p </> <country ID = $i4> Greece </>
</> IN t'cities.wl*', CONSTRUCT
<c i ty ID = fl($il,$i2,$i3,$i4)>
In the case of R S L queries (4IF001. the implied OID definitions are derived by using
the keys of the relations that are involved in the FROSI clause of the query.
The problem changes when the pattern part of the user query involves regular pat h
expressions. -4 regular pat h expression does not require a specific schema for the source
data. Since the schema of the da ta that can be matched by the pattern part is not
known, it is not possible to derive the implicit OIDs of the elements in the const ructor of
the query. In Chapter 3 , we see t ha t due to this fact. there are cases where no rewritten
query can he constructed. when regular path expressions appear in the pattern part of
the vieiv.
2.1.6 Value equality
In the case of leaf nodes. two nodes are equal if they correspond t o equal string values.
In the case of interna1 nodes. the nodes are equal i f they correspond to the same OID.
Consider the following query which returns publications t hat are ivritten by more than
one authors:
WHERE cpublicat ions>
<author> <name> $nl </> <publication> $pl </>
</> <author>
<name> $n2 </> <publication> $p2 </>
</> </> IN ' 'publications .mi3 , $nl ! =$n2, $pl=$pS
CONSTRUCT <author>
<name> $nl </> <publication> $pl </>
If the pablication elements in the source data contain atornic values. then the corn-
parison between Spl ancl $ p l is based on the string values to which they are bound.
Otherwise. the OIDs to which the two variables are bound should be identical. i.e.. $ p l
and $p2 shoiild correspond to the same publication object. Content equality is not ade-
quate.
CVe can observe that the publications that are bound to the variable $pi are repro-
duced twice in the result. Since each publication element that is created has a unique
OID (no Skolem functions are declared). each binding results in two different nodes. ref-
erenced by two different publication edges. So now. these nodes refer to two different
publications.
2.2 Views
We do not provide any special view defiriition syntax. Views are defined as normal XSIL-
Q L funct ions rvit hout arguments and schema restrictions for the out put. References to
views are expressed as function calls. However. we rest rict our problem to function calls
that appear after 'IN'. and not in the interior of elements. In addition. we do not study
view definitions with constructors t hat contain nested sub-queries.
.A well-formed XSIL document should have a single root element tag. The XML-QL's
implernentation (Fer991 adds a tag named .WfL to the result of a query. We can assume
that the result of a view query has a root element narned cieui. -4 query over a view
should take into considerat ion t his external element.
CHAPTER 2. TERMINOLOGY
2.3 Query Rewriting
Consider a view defined as an S M L - Q L query I V . C w takes as input an S M L document
.Y Dl and returns an S M L document .YD2 = L'( .'r'D1 ). hIoreowr. consider a user query
Q. which takes as input S D 2 and returns an SbiL dociirnent .YD3. We are required
to construct an S!dL-QL querp Q'. which takes as input the ShIL document .l'Dl and
returns S D 3 . i.e.. Q1(SDl) = Q ( S D 2 ) = Q(C'( . (Di ) ) = (Q o L')(.l'Dl).
CVe give a brief description of the intuition behind the rewriting procas that ive
follow. The structure of an XML-QL ronstructor is very similar to the structure of
S.CIL documents. When no explicitly defined OLDs and nested sub-queries appear in the
constructor of 1,'. its schema is similar to the schema of the individual tuples that are
produced by the query. In other words. the element tags and attributes that appear in
the constructor of C' also appear in each tuple of the resulting SSLL document .Y D2. The
idea is to match the user-query pattern mith the constructor of C' instead of matching it
with X D z . Since a constructor as opposed to an S M L document aiso contains variables.
the result of the matching procedure is a set of mappings between variables. constants
and element tags. The matching and composition algorithms are presented in Chapter
3.
Explicitly defined OIDs and sub-query blocks in the constructor of a query may result
in aggregated XML data. so the structure of the produced tuples may be different frorn
the structure of the const ructor. -4s mentioned in the previous section. sub-query blocks
in the constructor of the view are not studied in this thesis. Explicitly defined OIDs in
the view definition are studied in Chapter 4. Finally. we also assume that no regular
path expressions appear in the pattern part of the view and the t w r query.
2.3.1 Constructor and Pattern trees
We consider that ail the variables. as weil as the Skolem function declarations t hat appear
in a query are stored in a syrnbol table. ive denote as Tq the symbol table that contains
al1 the variables and Skolem function declarations t hat appear in the body of a query Q.
For each variable. we store its name. while for each Skolem function. we store its name.
and its list of parameters. Parameters rnay be either constants or pointers to symbols in
We represent const ructors by trees instead of graphs. which facilitates the rnatching
procedure. The interna1 nodes of a constructor tree represent elernents. while the leaf
nodes represent non-tag variables and constants i variables and constants t hat are not
included in a n element tag) that appear in a constructor. Handling the tag elements as
terminal symbols and ignoring the end tags. a constructor tree is similar to the syntax
t ree of the corresponding const ructor.
A const ructor t ree can be expressed using the following grarnmar:
1 : Node : : = Leaf Node 1 IntemalNode 2: LeafNode : : = LeafVariable I Leaf Constant 3 : InternalNode : : = (TagLabel , Children, Attributesset) 4: TagLabel : : = TagVariable 1 ElernentName 5: Children ::= Node I Children, Node
6 : Attributesset : : = {OID, Attributes) 7 : OID : := (ID, <Skolem>) 8 : Attributes : := Attribute 1 Attributes, Attribute 9 : Attribute : := (AttrETame, AttrValue)
10 : AttrValue : := AttrVariable 1 AttrConstant
11 : LeafVariable : : = <VAR> 12: TagVariable ::= <VAFD 13: AttrVariable : : = <VAR> 14 : ElementName : : = <IDENTIFIER> 15: AttrName ::= <IDENTIFIER, 16: LeafConstant ::= <STRING> 17: AttrConstant ::= <STRING>
The terminal symbols < V.4R > and < Skolem > represent references to variables
and Skolem funct ion definit ions. respectively, in the symbol table. A node is descendent
of another node that represents an erement eParent if it represents an etement. a variable
or a constant which is nested under eParent. The root node represents the root element
that is produced by the query and is named as view if & is a view definition. In the
sequel. interna1 nodes are named after their label (TagLabel). and leaf nodes are named
after the variable or the string value that they represent.
A graphical representation of a constructor tree is displayed in Figure 2.1.
view
l($i 1 ,.... Si4)),(age,'3OV) )
name ((ID, f2($i 1,...,Si4))} tion {(ID. f3(Si l....,Si4))}
tie{(ID. f4(% l.....$i4))}
Figure 2.1: .\ query and the corresponding constructor tree
W e consider two sets of primitive functions that can be used to access and handle
t ree nodes:
Symbol(b):boolean Return t n i e if 6 can reduce to the non-terminal symbol Symbol of
the grammar presented above. For instance, if node is a leaf node that represents a
variable, the function caIIs Lea f.Vode(node) and Leaf tar iaMe(node) return Irue,
while the function call LeafConstant(node) returns false.
getSyrnbol(b):Symbl Return the Symbol part of 6. For instance. if node is an interna1
node with the form (author. {chi ld i . child2). { ( I D . j ( x l ) ) } ) . then the function call
getTagLabrl(node) returns author. and the function call g e l 0 I D(get.-ltt ributesSet(node))
returns ( I D . f(q)).
Given a node of a constriictor trce. we can reproduce the conslructor segment that
corresponds t o the subtree rooted by this nocie. A constructor segment is defined by the
non-terminal symbol Contents in the S M L - Q L grammar presented in Appendix .\.
.A pattern tree PTqsi is defined similarly and represents the i t h pattern expression of
the pattern part of Q. Skolem functions do not appear in pattern trees. OID attributes
may also be omi tted. So. the lines 7-8 of the grarnmar presented above can be simply
replaced by the following line:
7: Attributesset : := '(' Attributes
The OID at t r ibute is handled as any other attribute. We should mention that Ieaf nodes
in a pattern tree cannot have sibling nodes. since patterns rvith the form ce> Srart $carl
</> or ce> $var const </> are not vatid.
We assume that the root node of a pattern tree has always a single child. Pattern
expressions of the form:
are always t ranslated into:
After this transformation. applying a pattern expression over the view is equivalent
ta applying the pattern expression over each distinct view t uple.
2.3.2 Matching pattern trees to constructor trees
The procedure of matching a pattern part of a query to the constructor of another qciery
is analogous to matching the pattern part of a query to an ?(ML document. A certain
pattern can match an XhIL document in several different ways. resulting in different
variable binrlings. Similarly. a pattern can match a const ructor in several different ways.
Each of these matchings involves several independent mappings t hat involve variables.
constants. element labels. const ructor segments and patterns.
The matching procediire between a pattern expression and a constructor is performed
by means of the corresponding pattern and constructor tree. Nodes of the pattern tree
are matched to nodes of the constructor tree by comparing their properties. e.g.. the
label names and the attribute values. Since pattern and constructor trees also involve
variables, the comparison may also involve variables. Since variables can be bound to
any value. the cornparison between two variables or between a variable and a constant
always evaluates to true.
Consequently, an internal node of a pattern tree can match an internal node of the
pattern tree if t h e comparison between their labels. attribute names and a t t ribute values
evaluates to true. .\II the attributes of the pattern node should be matched. but it is not
necessary to match al1 the attributes of the constructor node. For instance. the pattern
node that represents <$t continent=' 'Europe >> can match the constructor node that
represents <country ID* (SI) cont inent=$c>.
The cornparison betiveen constant leaf nodes is based on their string values. Homever.
a leaf node rnay consist of a variable. Such a variable can be bound not only to atomic
values. but also to OIDs representing whole ?(>IL subgraphs. In ordor to handle this
case, we consider that a variable in a leaf node can be matched to an internal node or a
collection of si bling nodes.
Yot any pair of nodes is compared. .4 pattern node can match a constrtictor node
only if their parent nodes can be matched. In other words. in order to match two nodes.
we should match al1 their ancestors.
Matches between components of the const ructor and the pattern t ree are represented
by triples that obey the following syntau:
1: Mat ch : : = (TagVariable , TagVariabl e , nul1 ' ) 2: 1 (TagVariable , EleaentName , 'nul1 ' ) 3: 1 (El ement Name, TagVariable , ) nul 1 )
1 (AttrVariable, AttrVariable, 'nullJ ) I (AttrVariable , AttrConstant , 'null ) 1 (AttrConstant , AttrVariable, 'null' )
1 (Leaf Node, Leaf Node, <Skolem>) 1 (Leaf Variable, InternalNode, <Skolem>) 1 ( InternalNode, LeafVariable , 'null) 1 (Leaf Variable, SiblingNodes , <Skolem>) 1 (SiblingNodes , LeafVariable, 'nul1 )
12: SiblingNodes ::= Node SiblingNodes I Node Node
A match cari involve tag labels. attribute values and nodes that are matched during
the matching procedure. When two internai nodes are matched. the produced matches
represent their matched attributes and labels t hat involve variables.
The first argument of a match represents the matched component of the pattern
tree. while the second argument represents the matched component of the constructor
tree. As shown above. leaf nodes that represent variables can match coIlections of sibling
nodes. If the leaf node belongs to the pattern tree, the meaning of such a match is
that the corresponding variable should be bound to the OID of an NML subgraph ivhose
structure is defioed by the constructor segment that the matched nodes represent. This
OID is represented by the third argument of the triple. It is used when the variable
participates in equality filters and the equality should be based on the OID. If t h e leaf
node belongs to t h e constructor tree. the meaning is t hat the pattern represented by the
matched nodes should be applied on the corresponding variable.
In general. a match represents a set of bindings of a certain subset of the user-query
variables.
Example 2.3 Consider the foliou.ing pair of oielc and user querg:
V: function View()( QHERE
<a ID = $il> <$tl ID = $i2> $XI </> <$t2 ID = $i3> $x2 </>
</> IN " source. xml > ' , CONSTRUCT
cb ID = fl($il,$i2,$i3) a='Hi>> c$t ID = fS($il,$iZ,$i3)> $xi </> Cb2 ID = f3($il,$i2,$i3)> $x2 </>
</> > Q : WHERE
<pieu> <b a=$v>
<bl> $y1 </> <b2><b3> $y2 </></>
</> </> IN View()
CONSTRUCT <result>
<di> $y1 </> Cd2> $y2 </>
</>
The f o h w i n g set of matches can be identified: (Sv. 'Hi'. null). ( '61 *, St. n u i f ) . ('63'. St,
The purpose of the matchiog procedure between a constructor and a pattern tree is to
match the whole pattern tree? but not necessarily the whole constructor tree. Moreover.
the same node OF a constructor t ree can be rnatched by several nodes of the pattern tree,
but each node of t h e pattern tree should be matched only once.
Several different matchings can exist between a constructor and a pattern tree. i.e..
the pattern tree can match the constructor tree in several different ways. Each matching
is represented by the corresponding set of matches. Each set of matches should obey
the following restrictions: ( i ) there should be a single match for each tag or attribute
variable of t h e pattern tree; and ( i i ) each leaf node of the pattern tree or exactly one of
its ancestors should appear in a single match.
Considering the matches presented in Example 2.3. two possible rnatchings can be
identified.
1. T h e first contains the matches (Sv. 'Hi'. null). ( 'big. Ot. null). (Sul. 5x1. fl(Sil.Sil.8i3))
and (b3, Sx2. C3($iL.$i9.%i3)).
2. T h e second contains the matches (Sv. 'Hi'. null). ('bl'. St. null). ('b2'. $ t . null).
(Sy 1. 3x1. E(8i l ..Si'Z..SiI)) and (b3. Sxl . f?($il . l i Z . $ i 3 ) ) .
The matching procedure is prescnted in detail in Chapter 3.
Chapter 3
Query composition - Base case
In this chapter. ive present our solution to the problem of composing queries that are
defined over ShIL views. Erplicit OID definitions are not studied in this chapter. They
are introduced in Chapter 4.
3.1 Formalizing the problem
Consider a view V ( d b ) which is defined by a query I,'(db):
WHERE fi. (db. x 1. . .., t,) F\?( t t , .... x,) (3.1)
CONSTRUCT Ctr(x l . .... x,)
ivhere PL^ is the pattern part. FI is t h e filter part. CI. is the constructor. and .Y =
{ x l . .... 2,) is the set of variables that appear in te'. The query is applied o n the S J I L
document db.
Consider the user query Q( V ) t hat is defioed over I":
WHERE PQ (v. YI. .... ym) FQ (YI! - - s a Y,) (3.2)
CONSTRUCT Cq(yi. .... y,)
where Pq. FQ and Cq a r e defined over the variables Y = (yl. ... . y,). and .Y n Y = OL.
Our problern is the construction of a query Qf(db) which is defined over the source
da t a and returns the same result as Q. Q' has the follorving format:
WHERE PQt(db, XI, .... zn) FQ~ (X 1. -... xn. yl, .... ym) (3.3)
CONSTRUCT CQ)(rl . .... x.. 91. .... y,,,)
The rewriting process involves trvo sequential steps. During the first step. ive discover
t h e matchings between PQ and Ci.. These matchings are used in the second step to
produce the rewrit ten query Q'. The first step is presented in Section 3.2. and the second
one is presented in Section 3 . 3 .
3.1.1 Assumptions
CVe assume tha t PQ contains only a single pattern expression. which is the one that refers
t o V . W e introduce multiple pattern expressions in PQ in Section 3 .5 . ive also assume
t hat t here are no equality conditions between variables in {yL. .... y,}. This means that
there are no filters yi = yk in FQ. and furthermore. each appears only once in P4.
Equality conditions between variables are studied in Section 3.6. Finally. we assume that
both Pb* and PQ do not contain regular pat h expressions. In Section 3.4. ive justify this
assurnption.
3.2 Matching the user query against the view
3.2.1 The matching algorit hm
Let PT4 be t h e pattern tree that represents the pattern part PQ of the user que- Q.
a n d CTV be t h e constructor tree t hat represents the constructor Cv of the view P..
'If -Y n Y # 0, we can always find a renarning ,Y' of the set of variables in .Y, such that .Y' 17 Y = 0.
CHAPTER 3. QUERY CO?vIPOSITION - ~ A S E CASE 26
Algorithm 3.1 describes horv the matchings betwveen PTq and CTV are discovered.
The algorit hm consists of the funct ion findiCfatchings(node,, node, ). which performs the
matching procedure recursively. It takes as arguments a node node, froni CTv and a
node node, from PTq and returns al1 the matchings betwveen the subtrees that are defined
by these two nodes. CVe use the functions that are defined in Section 2.3 to access and
handle nodes.
Lines 4-9 check whet her the nodes represent constants. If t hey bot h represent q u a 1
constants. a single matching esists that contains only the match between the two nodes.
Lines 12-14 check ivhet her the tag labels of the tivo nodes are identical or at least one of
them is a tag variable. If none of these situations holds. no matching exists. Lines 17-20
checks whether the attributes of the elements that correspond to the nodes match. The
function matchrlttr(node,. node,) returns false if and only if ( i ) both node, and node,
represent interna1 nodes? and ( i i ) there is at least one attrihute in node, which is either
not included in the attribute list of node, or the corresponding attribute values do not
match. Trvo att ri bute values match if eit her their values are equal or at least one of t hem
is a variable. The funct ion ~nd..lttn'bute.Clatches(node,. node,) ret urns the matches t hat
contain the rnatched pairs of attribute values. when at least one value in each pair is a
variable. Lines 27-30 check whether node, has a single child which represents a variable.
If this is the case? one single rnatching exists containing the match between this child
and the whole collection of subnodes of node,.
Lines 34-43 handle the case when more than one nodes are under node,. Yone of
these nodes can be a variable. In this case, ive consider each child child, of node,
independently and find the rnatchings between child, and the children of node,. Alter
al1 the combinations of childreo pairs have b e n identified, a number of different sets
of matchings are discovered and stored in the list matchings(l. Each set of rnatchings
corresponds to a different pair of children oodes of node, and node,. The function
merge(rnatchingslfl results in a single set of matchings S by combining the matchings in
CHAPTER 3. QUERY COhIPOSITION - BASE CASE 27
matchingsu as follows. Each set of matchings corresponds to a single child or a subset
of children of node,. CVe consider al1 the combinations combination; of matchings sets
Si,, such that each combination; corresponds to disjoint children subsets. and every child
of n o d e , appears in such a subset. The matchings sets Sij that correspond to a certain
combination; are combined resulting in a single set Si. More specifically. each matching
in Si is produced by considering one matching From each Si] and merging them. Al1
the possible combinations between matchings in each Sij are considered. Following this
process rve ensure that every leaf node in the subtree under node, or one of its ancestors
appears in a single match of each matching in Si. Finally. al1 the sets Si are merged into
a single set S t hat contains al1 the matchings.
If at least one of the tag labels of node , or node, is a variable. the match between
the two labels is added to each individual matching by calling t h e function addMatch()
(Lines 47-50). F ina l l . Lines 53-54 add the attribute matches to each matching by calling
the function add;Clatches().
Algorithm 3.1
Input: root node ofCTv. root node of PTq
Output: the set of al1 the matchings betmeen C'Ti. and PTQ
1: function Matchings j?ndMatchings(node,, node,){ 9 - - - 3: !% if 60th nodes represent constants 4: if ( Lea fConstant(node$ or Leu fConstant(node,)) 5: if (LeafConstant(node J and LeafConstant(node,) and node, = node,) 6: return new Matchings( 7: new Matching( 8: new Match(node,. node,, null))); 9: else return null; 1 O: I I : % If the tag labels do not match ! 2: if (!tag V a n a ble (ge t TagLabel(node,)) and !tag Variable (ge t Tag La beL(node,)) 13: and getTagLabd(node$!=getTagLabel(nodep)) 14: return nuU; 15: 16: % Find matches between the attribute calues 17: if (!match.-lttr(getAttributesSet(nodep), getrlttributesSet(nodeP))) 18: return null:
19: eIse atf rihufe!t[afches + fidAttrz'buteMatches( 30: getAttributesSet(node,!, get.4ttributesSet(nodeP)): 21: 22: % Cet the children of the nodes 3 : children, t (ge tChi ld~n(node~) ) ) ; 24: children, t (getChildren(node,))); 25: 26: 7Z If nodc, hos a singlc child that rrprrsrnts a rariablc 27: if LeajPOn'able (ch ildren,) YS: return new .thtchings( 29: new .Watch ing( 30: new .I.Totch(children,, children,. getO~D(get.4ttributes(nodep)))): 31: 3>. % I/ node, has multiple children. find the matchings 33: % that correspond to al1 the possible pairs of subtrves and mcrge them 3 for each child,[i] in children, 35: i f !Leu f buriable (child,[i]) 36: for each child,[i] in children, 3 7: rnatchings[i. j] t find.\fatchings(childp~]. child,[i]): 38: eise 39: for each subchildren,~] in children, 4 O: rnatchings[i, j ] t new .Clatchings( 41: new Ma!ch ing ( 4 2: new .Cfatch(subchildren,li], child,[i], null): 49: mergedMatchings t rnerge(mntchings[): 44: 45: % l''the tag label of at least one node is variable, 46: % include the corresponding match 4 7: i f (tag Variable (get Tag La bel(node,)) or tag Varia ble(get TagLa bel(nodep))) 48: for each matching in merged.llatchings 49: addMatch(matching, 50: new .Watch(get TagLabe[Inode,). get TagLabd(node,j. null)): 51: 52: % lnclude the attribute matches 53: for each matching in mergedMatchings 5.j : add.\iatches(matching , attrïbuLe.tfakhes):
3.2.2 Deriving the variable bindings fkom the matchings
.As rnentioned in Section 2.3. the purpose of the matching procedure is to discover how
the set of variables of the user query are mapped to the source ?<ML document. ivithout
materializing the view. In this subsection, we show that al1 the bindings that are produced
CHAPTER 3. QUERY COMPOSITION - BASE CASE 29
when the patter part of the user qnery is apptied over the vim can be derived from the
matchings that the matching algorithm produces. CVe also show that no additional
bindings are derived from the matchings. In Section 3.3. we use this Fact to prove t hat
the rewriting algori t hm produces the correct composition query.
Let .Y' = {r;. .... xk} be a binding of the set -1' of the view variables. and let t ; be
the corresponding tuple that is produced by Le. The structure of l e ' s constructor CI-(.Y)
defines the structure of the produced tuples. Therefore t , = CIp(.Yt ). Instead of matching
the pattern part Pq(Y) of the query w i t h each tuple CV(X1). the matching algorithm
matches PQ(Y) ivith Cip(.Y) by folloiving a similar procedure.
The matching algorit hm does not bind variables to specific constants or S S I L graphs.
It finds matches involving components of t h e view constructor (constants. variables and
constructor segments), which include variables that are not yet bound ro specific values.
However. we should ensure t hat the matching algorithm produces matchings t hat derive
al1 the actual bindings of the user-query variables. if the filter part of the user query is
ignored. In addition. the matching aigorithm should not produce matchings that derive
bindings that cannot be produced by Pq(Y).
Consider a certain binding of the set of variables Y. when applying Pq(Y) over the
output of b'. This binding results after matching Pp(Y) to a certain tuple ti produced
by V . Let .Yi be the binding of V's variables that results in t; . Each variable y, E 1.
is bound to CKj ( -L i ) . where CKj( -Y t ) is a constant or an S M L subgraph constructed by
the component CVVj(.Y) of the constructor of CI. .A binding between y, and a subgraph
is represented by the OID of the node that parents the subgraph.
When PQ is applied over CI. by means of the rewriting algorit hm. t here is a rnatching,
where each =riable y, is matched to Cv,;(S). which evaluates to Cv.j(-Yi) when the set
of variables X is boiind to .Yi. If y, is not a tag or attribute variable. the algorithm
also records the OID definition of the parent element of CI,(.Y) as the t hird argument
of the match triple. This definition represents the OIDs to which gj is actually bound
CHAPTER 3. QUERY COMPOSITION - BASE C.4SE 30
and is used when is invotved in equality fiiters with other variables. Such fiiters are
introduced in Section 3.6.
The previous situation is not always possible. The binding .Y1 may involve non-atomic
values. i.e.. a variable xi E .Y may be bound to a subgraph G. Assume that when Po is
applied over ti, a variable y, is bound to a subgraph G; of G. Even if G, does not depend
on the binding .Y1 of the set of variables S. we can denote it as Ct.,,(.Y1). The matching
algorithm cannot directly represent this binding. since CiL(.Yt) is not available without
materiaiizing the view. Instead. it records a match between a sub-pat tern Pk(YP) and x k .
Applying Pk(1'.) on rk. y j is bound to CI, ( .Y ' ) . when xi, is bound to G. We conclude.
that given a binding of the set of the user query variables. there is a matching produced
by the matching algorit hm t hat represents it as descri bed above.
Consider now a matchiiig -11 returned by the matching algorit hm. Assurne t hat each
variable y j , where j = 1 ..q. is rnatched to a component CV,,(.Y ) of the constructor of I'.
Moreover, there are matches in which a variable x k E -Y is matched to a pattern Pk(lp)
or to a constant y. .CI represents a set of bindings of the variables 1. with the following
properties. Ciiven a binding S' of V's variables. each variable y,. where j = l . .q . is bound
to Cv,,(.Yi). while the binding of each y,. where j = q + l..m. is derived by applying
a certain pattern A(?-) on xi. where xi E -Y'. .-\ match between a variable xk and a
constant C>C is translated as follows. The bindings are derived only if xi = cc. where
r; E .Yt. It is clear that when Pg is apptied over the tuple C v ( S t ) that is generated by
V . al1 these bindings are produced.
3.3 The rewriting algorithm
3.3.1 Detailed algorithm
.A formal description of the rewriting procedure is presented by Algorithm 3.2. The
algorit hm first calls the find:Cfatchings() funct ion to discover the mat chings between the
CHAPTER 3. QUERY COMPOSITION - BASE CASE 32
pattern tree of t h e user query and the constructor tree of the view query. For each
rnatching. it produces a different sub-query block. More specifically. if { M 1 7 ..., :II,} is
the set of the discovered matchings. the final rewrit ten query has the lorm {Q;) ... {Q',).
where Q: is the sub-query that corresponds to .Cli. Since equality between variables in
the user query is not haridled by the algorithm. the third argument of the match triples
is not used.
Algorithm 3.2
Input: a uiew dejînition V . and a user quenJ Q dejîned ocer 1.'
Output: the composition que ry Q' = Q O L,'
1: function Query cornpose(Q, C'){ .? - v .
9: % Renarne the variables i n V so that they do not conflict 4: % with the variables names of Q 5: rename I.Briables(V,Q); 6: 7: $b Get the parts of Q and V thnt will form the corresponding parts 8: % 01 the rewritten q u e y
fi. t getPatternPart(V); Cp t getCons~ructorPart(Qj: Ft* t getFilterPart(V); Fq c getFilterPart (Q):
Initialite the remritten q u e y Q' Q' c new Query();
Find the matchings betureen V *s constrirctor tree and Q 's pattern trre matchfngs t find.tiafchings(ge&PatternTrre(Q). getConstructorTrre(t.'));
For each matching. create a separate sub-q,uery for each matching in matchings {
Initialize the pattern, Jilter and constructor parts of the su&-query PSQ PL'; FsQ e meqieFifters(FQ,Fv): CSQ CQ;
I / a r;ariable in V is matchcd to a constant for each (m,, rn,, id) in matching {
if (LeafConstant(m,) or ;IttrConstant(rn,) or Elernent.Varne(rn,)) addFilter(FsQ .m, = m,);
CHAPTER 3. QUERY COMPOSITION - BASE CASE
33: % l-j the matched component that corresponds to Q is a vanable 34: if (Leaf Variable(m,) or .At tr Variable(m,) or Tag Variable(mp)){ 35: ~ ~ b ~ t i t u t e ( C ~ ~ , r n ~ , r n ~ ) : 36: % If the uariable cannot be substituted in the filter part, the matching is rej~cted 3 7: if !substitute(FsQ ,m,,n,) goto 4 7; 38: 39:
1
40: 25 If a variable in Cf ia ~ n o t c h d to internai raod~s ofQ 5 pattcrn trcc 4 1: if (SiblirigiVdes(m,j or In temalNode (m,)) -f 2: addPattern(PsQ, construct Pnttern(mp) I:V m,); 5 3: 44: % Construct the sub-query and add it to Q' 45: S Q t new Q u e r y ( P s g . F s Q . C d : 56: adclSubquery (Qt ,SQ): 57: 1 48: return QI: 49: 1
W e give an explanation of the new functions that appear in the algori t hm. The func-
tion rename(queryl,query2) in Line 5 renames the variables in query,. so that th- do
not conflict with the variable names in queryz. The functions get.\'(query). where .Y is
a query part (Lines 9-19). return the corresponding part of query . while the functions
getConstructor Tree(query) and get Patte ni Tree ( q u e r d return the constructor and pat-
tern tree of query . The function mergeFil ters(/ i l teri . f i l t e r z ) (Line 25) produces a filter
that contains both filters f i l t e r l and f i l t e r2 . The function substi tute(queryPart , c a r ,
component) (Lines 35, 37) replaces al1 the appearances of variable car in queryPar t
by the constant, variable. element label o r constructor segment t hat corresponds to
component. If component represents an internal node. Le.. InternaliVode(cornponenlJ
is true, or a collection of sibling nodes, i.e.. Sibling:Vodes(compment) is true. car is
substituted by the constructor segments that correspond to the constructor subtrees par-
ented by t hese nodes. Since each subtree results in a different segment. var is repiaced
by the concatenation of al1 the individual segments. A constructor segment cannot par-
ticipate in a filter. Therefore. if the first argument of substitute() is a filter. and the third
argument is an internal node or a collection of nodes, the Function returns false. and the
corresponding matching is rejected (Line 37).
The function addPa€lern(p, exprj adds a new pattern expression erpr into the pat-
tern part p of a query ( Line 11). The translation of a pattern tree t ree into a pattern
pattern is performed by the function addPat tern( t ree . pattern). FinaIIy. the function
addSubguery(query, subquery) adds subquery to query as a sub-query bloçk.
Example 3.1 Consider the jollowing pair o j a ciew and a query:
V: function ViewO( WHERE
<a ID = $il> <b ID = $i2>
cb1 ID = $i3> $xl C/> <b2 ID = $i4> $x2 </>
</> </> IN 'csource.xmlJJ
CONSTRUCT cc ID = fl($il,$i2,$i3,$i4)>
<ci ID = f2($il,$i2,$i3,$i4)> $xi </>
CHAPTER 3. QUERY COMPOSITION - BASE CASE 34
The Nnplicil OrDs haue already Qeen included in the e h m e n l fags of t h e uiew's con-
structor. The matching algorithm discouers the diflerent rnatchings betmeen the pat-
tern tree of Q and the constructor tree O/ V . The first matching contains the matches
(100. $1 1. -). (CS. lx?. -) and ($y?. c4. -). and the second rnatching contains the matches
( lU0. $xi, -). (Syl . S x l . -) and ( 5 ~ 2 . c4. -1. The third argument o j each match is cgnored.
Each matching corresponds to a diflerent sub-query. The jirst sub-query incolocs the fof-
fouing transf'ormations.
(il The f l ter $11 = 100 is ndded in the filter part of the sub-query.
(i i) The cariable $y:! is substituted by the constructor segment
(iii) The patteni erpression < c5 > $91 < / > I:V Or2 is added to the pattern part of
the su6-query.
The second sub-query also incolves the jirst and the second transformations. In nddi-
tion, Syl is substituted by Sxl.
The q u e q that is produced is presented bclo w:
Q' : < WHERE
<a ID = $il> <b ID = $i2>
<bl ID = $i3> $xi </> Cb2 ID = $i4> $x2 </>
</> </> ZN c'source.xmly~, <CS> $y1 </> IN $x2, $y1 < 100, $xl = 100
CONSTRUCT <result>
<dl> $y1 </> <d2>
(c4 ID = fl($il,$i2,$i3,$i4)> <c6 ID = f2($il,$iS,$i3,$i4)> Hello </>
</> </>
</> >
€WHERE <a ID = $il>
CHAPTER 3. QUERY COMPOSITION - BASE CASE
<b ID = $i2> <b1 ID = $i3> 100 </> cb2 ID = $i4> $x2 </>
</> </> IN "source.xml>', $xi < 100, $xl = 100
CONSTRUCT Cresult>
<dl> $xl </> <d2>
Cc4 ID = fl($il,$i2,$i3,$i4)> <c6 ID = f2($il,$i2,$i3,$i4)> Hello </>
</> </>
</>
.At this point, we can explain why implicit OIDs are included in the element tags
of the view's constructor as described in Subsection 2.1.5. The S M L - Q L data mode1 is
object based. and SSIL nodes represent distinct objects which are identified by unique
object identifiers. When the OID of a tag element in the constructor of a query is
not explicitly defined. then. for each binding of the query's variables. one distinct such
element is constructed containing a unique OID.
Consider the following view:
V : WHERE . . . CONSTRUCT
When V is applied over a n SbIL document. for each binding of its variables. a different
e element is constructed. Yow. consider the following query:
q: WHERE ... ... $y ... IN V
CONSTRUCT ... $y S . .
... $y ...
.\ssurne that when t h e pattern tree of Q is matched to the constructor tree of C'. Sy
is matched to the node t hat represents e. If OIDs are ignored, the rewrit ten query is as
CHAPTER 3. QUERY COMPOSITION - BASE CASE
Q J : WHERE ... CONSTRUCT
... <o...</> ...
... < e x . . < / > ...
The above rewritten query constructs two different e elements. i.e.. elements with different
OIDs. for each binding of the query variables. This is not the otitput that Q generates.
I.,' produces a single e element for each binding. CVhen Q is applied over the resulting
document. and $y is bound to the subgraph that includes an element e. a single element
e is created. So. the rewritten query should assign the sarne OID to two e elements
that correspond to the sarne binding of $y. This is achieved by including the implicit
OID definitions in the element tags of Cm's constructor and preserving them during the
matching procedure.
Figure 3.1 exhibits how ignoring implicit OID definitions can lead t o incorrect rewrit-
ten queries for the view and user query presented in Example 3.1.
3.3.2 Justification of the algorithm
CVe show that the rervritten query Qr which is produced by .-'ilgorithm 3.2 is equivalent
to the composition of the user query Q(Y) and the vierv V ( S ) . In order to simplify our
presentation. we assume that no filter conditions appear in Q.
Let PLr(l(). FIr(.Y) and C'\.(.Y) be the pattern. filter and constructor part of \'.
and PQ(Y) and CP(Y) be the pattern and constructor part of Q. We first apply the
matching algorithm and produce a set of matchings. Coosider a matching .\I in this set.
For this rnatching, the rewriting algorithm produces a query block QI, which includes:
( i ) the pattern expression P v ( S ) of V : ( i i ) the filter part Fv(.Y) of V ; (iii) a set of pattern
expressions Pl (Y). .... P,(Ie): (iv) a set of filten Fr(-Y) = {(xl = c l ) . .... (x, = c, ) ) . where
X I : ..., x, E .Y: and (v) t he constructor C P ( C ~ , l ( X ) . .... Cv,,(-Y). .... y,). -4 pattern
expression Pj (Y) corresponds to a match between a variable x, E .Y and a pattern. i.e..
CHAPTER 3. QUERY COMPOSITION - BASE CASE
output when ignoring implicit OïD definitions (b)
Hello
correct output (c)
Figure 3.1: Importance of implicit OID definitions
an interna1 node or a collection of nodes of the corresponding pattern t ree. .4 condition
xj = cj corresponds to a match between a variable x j E -Y and a constant c j . Finally.
a substitution CI,(X) in CQ corresponds to a match between a variable yj E Y and a
component Cv,j(S) of the constructor Cv(.Y).
Assume that V is applied on an XhIL document db. Let -Yi be a binding of the
set of variables X. and let t i = eV(-Y' ) be the corresponding tuple that is produced by
V. Assume that Q is applied on t i and a set of bindings is derived. Consider such a
binding Y', where each variable yj E Y is bound to CI:j(Si). The resulting tuple is
1% = c,(C;*,(Xi), .... c;;,(xi)). As discussed in Subsection 3.2.2. every stich binding can be derived from a matching
t hat the matching algorit hm produces. Assume t hat Y" is derived from .LI. This means
that Cb a Ch. where j E {L .... 'q}. It also means that there is a pattern expression
P u ( I v ) which binds yj to C';,,,(.Y1). where ZL E {l . .... r} and j E {q + 1 ..... m}. It also
means that F ' ( S i ) evaluates to true. Consequently. ivhen QIIl is applied on db. there is
a binding S'U {C[:,+,(.Yi). .... C'[.,,(SI)} of its set of variables which results in the tuple
C Q ( C V , ~ ( S ' ) ' ..., Cr:,(.Yi). C';:,+,(.Y1). .... C;:,(.Y1)) = t ; . So. Qif C Q. and therefore.
Q' G Q-
Assume now that Q' is applied on db and generates a tuple t : = CQ (CL:i(.Y1). .... C \ T , ~ ( - Y ~ ) ) .
which corresponds to a binding .Y' u ....y,} of its set of variables. Let QI,! be the
sub-query of Q' t hat generates t : . Moreover. let .CI be the matching t hat corresponds to
QI,. .As showed in Subsection 3.2.2. the matching algorithm does not produce match-
ings that derive invalid bindings of the user-que- variables. i.e.. bindings that cannot
be derived. when applying the pattern part of the user query on the view. Thus. there
is a binding of the set of variables Y. such that yJ is bound to CVKj(.Y'). The tuple that
Q generates for this binding is c ~ ( c ~ ~ ( s ' ) . .... Cv, , (S i ) ) = t : . Therefore. Q C Q'. and
Q' = Q.
3.4 Regular path expressions
The introduction of regular path expressions in the user query does not require any
changes in the rewriting algorithm. Hoivever, the matching procedure is more cornplex
than the one presented in Algorithm 3.1. W e do not provide a solution here.
Regular path expressions can also appear in a view definition. In Subsection 2.13.
we concluded that when a query involves regular path expressions we cannot derive the
implicit OIDs in the query7s constructor. In addition, in the previous section. we showed
that the rewriting can produce incorrect rewritten queries. when the OID definitions are
not included in the elernent tags of the view's constructor. For this reason. ive assume
that view queries either explicitly define al1 the OIDs' in their constructor or do not
contain regular pat h expressions in t heir pattern part.
3.5 Multiple pattern expressions in the user query
Our rewriting algorithm considers that a single pattern expression appears in the user
query. In this section rve present how multiple pattern expressions in the user query can
be handled.
Consider that a user query Q(Iv) = Q ( Y i . 1,;) contains two pattern expressions
P E i ( ) ; i ) and PE2() . . ; ) . where Y* = YL u Y ; is the set of variables that appcar in Q.
Since no equali ty conditions between variables exist, ive conclude that ( i ) 1.'; ri 1.; = 0.
and ( i i ) the filter part FQ(Is') of Q can be split into two different sets of filter conditions
F I ( ) ; ) and F2(Y;l).
Assume that PEI is defined over a view I.;(dbl. .Yl). where dbl denotes an SSIL docu-
ment, and .Yl is a set of variables such that .Yl n Y = 0. Also assume that PE2 is defined
over an XhI L document db2. Our goal is to construct a query Q1(dbi. ch2. .Yi. Y;, 13) =
Q ( K db2. Y ; , k;).
Let X i be a binding of the variables in SI when the 1.; is applied on dbi . Similarly.
let &' be a binding of the variables in i$ mhen Q is applied on dbz For each pair -Y;.
a set of tuples CQ(.Yf, I.;') is produced by Q1 where CQ is the constructor of Q.
CVe transform Q into a query Qv, by rernovinq P E2(&) and Fz(l . ; ) . Any variable
y f & tvhich appean in the constructor of the new query is handled as a constant. For
each binding of the set of variables -Yr. Qv, produces a set of tuples Cq(Xi . k;).
'Explicitly defined OiDs are introduced in Chapter 4.
CHAPTER 3. QUERY COMPOSITION - BASE CASE 40
Let Q;, ( d l , -Yl, K. &) be the query that is produced when applying the rewriting
algorithm on Qvl. Consider the following query:
Q': WHERE
PE2(db2, Y>). F2 (12) CO3STRC'CT
QI; (dbk. .Y1, Yi. Y?)
The above query produces a set of t uples CQ(Sl. k2) for each pair S;. 1 2 of bindings of
the variable sets .Yk and Y i . respectively. So. we conclude that Q' = Q.
The situation is similar if ive assume that PE2 is defined over a view C.i(db2. .Y2).
where -Y2 n .YL n 1.. = 0. Following the procedure described above. ive produce the
following query:
Then. we apply the rewriting algorithm to Q'. and produce the query
The nesting of the generated query QI' can be eliminated. .-\ssurne that Q;.; is com-
posed of the sub-queries QbI ,,. .... Qtr1,, and Q" is composed of the sub-queries Q','. .... Qr.
The COXSTRUCT clause of each Qt consists of the union of al1 the QI; ,, sub-qüeries.
So each Q i can be split into q sub-queries. The jth such sub-query contains the pat-
terns and filters of the WHERE clause of both Qi and QIi,,. as well as the constructor
of Qbl,j. Consequently, the resulting query consists of the union of q x r sub-queries.
Each sub-query corresponds to a certain pair of matchings ( M l . Mz). where .\Il is match-
ing between PEI and V17s constructor. while .Cf2 is a matching between PE2 and V;'s
constructor.
CHAPTER 3. QUERY COMPOSITION - BASE CASE 41
The above soIution can be generalked for user queries that contain more thaii two
pattern expressions. Pattern expressions can also be defined over variables. CVe do not
provide a solution for t his case.
3.6 Handling variable equality in the user qi iery
In the previous sections. we assumed that no equality conditions between variables exist
in the user query. Equality can be expressed either by including equality conditions in
the filter part. e.g., iSx = $y. or by placing the same variable in different positions in the
pattern part of a query. If the same variable appears more than once in the pattern part
of the user query. we can use additional variables and derlare t heir equality by filters. For
instance. consider the following pattern expression. which matches people whose name is
identicai to their surname:
WHERE <vieu>
<person> <name> $y </> <surname> $y </>
</> </> IN V
This pattern expression can be rewritten as follows:
Let vl and r-.? be two variables in a user query Q which is defined over a set of views
(ki . ... Pi}. Assume that a filter V I = ~2 appears in the filter part of Q. The variables cl
may appear either in the same or in different pattern expressions of Q.
CHAPTER 3. QUERY CObtPOSlTION - BASE CASE 42
Let Qr be the rervritten query that is produced, when the fiIter ul = y is omitted. As
shorvn in the previous paragraph. Q' consists of the union of several sub-queries. and each
such sub-query corresponds to a set .II of matchings between the pattern expressions of
Q and the corresponding views. Let Q;, be the sub-query that corresponds to J I . .\I
represents a set of bindings of the variables that appear in Q. which produce a certain
set of tuples. when the filter ~ 7 1 = cz is omitted. When the above filter is included. Q
produces a subset of these tuples. In order to produce the same subset of tuples with
the reivritten que- we should also include the appropriate filter in the filter part of Q:Lr.
The filter o l = c2 can be directly incliided in the filter part of Q',,. only i f both c l and
~2 have not been substituted by the rewriting algorithm.
Assume that only cl has been substituted by a cornponent C';(.Y, ) of the constructor
of a view C-i. If Ci(-Y,) is a constant c' then we just add the filter PZ = c to Q'lI. If'
Ci(.Yi) is a constructor segment. rvhich is not a variable. the filter r.1 = cl always returns
false for the corresponding bindings of and u2. T h e reason is t hat c l can only be
bound to the OID of a source element. and this OID cannot be equal to the OID of an
element produced by L i . In this case. the binding .\I is rejected. i.e.. the sub-query Q!tr
is removed. Finally. if Ci(.Yi) is a variable x E .Yi, special care is required. The fact t hat
x is bound to an OID does not indicate that cl is bound to the same OID. To be more
precise, vl is bound to the OID of an element that I.; produces. In this case the filter
v l = L.* returns false. even if the equality u2 = x is true. Horvever. x can also be bound
to a constant. In order t o handle t his case, we introduce the externat function a f o m i c ( c )
which returns t rue if and only if u is bound to an atomic value. T h e filter that me add
Assume now that both v l and a2 have been substituted by C',(Xi) and C',(.Yj). re-
spectively, where Ci(-Yi) is a component of F o s constructor. and C,(S,) is a component
of 4's constructor. The follotving six cases can be identified.
Case 1: Ci(Si) a C j ( X , ) c. where c is a constant. No fiIter is added to Qyb,. since
cl = u2 alivays evaluates t o true for the set of bindings that correspond to M.
Case 2: C ' , ( S i ) is a constant c. and C j ( S j ) is a variable x E .YJ. The filter x = c is
added to Q:l,. The case where C,(S,) is a variable and C',(.Y;) is a constant is
symmet ric.
Case 3: C l ( S i ) m f ( X I . .... x.) . C',(.Yj) i f(r:. .... 2:). and i = j. i.e.. both the compo-
nents are Skolem function declarations of the same Skolem function. and appear in
the sarne view. The filters xi = x',, .... s, = xi are added to QI,.
Case 4: Bot h C' i (S t ) and C,(S,) are const ructor segments of the same vieiv. ive.. i = j .
and t hey a re not variables. Moreover. the OID defini t ions t hat appear as arguments
in the corresponding matches of .Cf are idi /(q. .... r,) and id, a f (xi. .... x:). respect ively. -4s discussed in Subsect ion 3.2.2. t hese O ID defini t ions represent the
actual bindings of cl and ~ 2 . Thus. the filters xi = xi. .... x, = r: are aclded to
Qlw
Case 5: C',(-Yi) x and C',(.YJ) x'. rvhere x and r' are variables. We do not know
whether x and x' are bound to atomic values or t o OIDs. Consequently. we do
not know whether the equality betiveen v l and t.2 involves constants or OIDs. For
this reason, ive use again the txternal function atomic() . If id, i f (x i . .... x,).
id, r !(ri. .... x:) and i = j . rvhere idi and idj are defined as in Case 1. ive add the
filter
(atornic(cl) A S D x = 2') OR (YOT(atomic(cl)) . W D xl = y[. .... 2, = y,)
Otherwise, since cl and s cannot be bound to the same OID, we add the filter
atomic(cl) JLXD x = x'
CHAPTER 3. QUERY COMPOSITION - BASE CASE 4.2
Case 6: None of the previous cases is valid, e.g., C(>ii) is a constant and Cl (.Yl ) is a
Skolem function. In t his case, the filter vl = vz rvaluates to false for every binding
that corresponds to M. The sub-query Q:![ is removed.
Example 3.2 W e consider the following pair of ciew and user query :
V: function View()( WHERE <a>
Cal> $xl </> <a2> $x2 </>
</> IN tcsource.xml'' CONSTRUCT
<b> <bl> $xl </> <b2> $x2 </>
</> >
The nodes $y1 and $92 O/ the pattern tree O/ Q match the nodes 82 1 and Sx2. respec-
t iaely, of the conctruckor tree 011,' and are reguired to be e q d . The maiching conrsponds
to Case 5. T h e variables can only be bound to string values. since the OlDs of b l and
b l are always different. The rerritten query is presented belom:
Q': WHERE
<a> <ai> $xl </> <a2> $x2 </>
</> IN "source.xm12', atomic ($xl ) AND $x1=$x2,
CONSTRUCT Cresult> $xl </>
CHAPTER 3. QUERY COMPOSITION - BASE CASE 4 5
It is dear that fiiters that invoive inequality between variables of t h e user query are
handled sirnilarly.
Chapter 4
Introducing explicit OID definitions
The rewriting algorithm that we presented in the previous chapter assumes that no
explicit ly defined OIDs exist in the constructor of the view. In this chapter. we show
how explicit OID definitions can be handled without changing the core of Algorithni 3.2.
CVe assume that the same Skolem function does not define the OID of more than one
element. In ot her words. element rnerging is not handled.
Explicit OID definitions by means of Skolem functions are used to control how the
result of a query is grouped. For instance. consider the view query:
function Classes0 { WHERE
<couple> <boy> $b </> <girl> $g </> Cage> $a </>
</> IN clcouples.xml" CONSTRUCT
<class ID = f ($a)> <boy> $b </> <girl> $g c/>
</>
This query matches couples of boys and girls and groups them by their age. Each
couple contains a boy and a giri of the same age. Each class element of t h e resulting
document contains ail the boys and girls of a certain age. This means t hat the structure
CHAPTER 4. INTRODUCING EXPLICIT OID DEFINITIONS 41
of the XML document that the query produces does not strictly folIorv the structure of
the query's constructor.
4.1 Splitting pattern trees
Consider the following user query. rvhich is defined over the view Classes:
Q: WHERE <view>
Cclass> <boy> $x c/> <girl> $y cl>
cl> </> IN Classes0
CONSTRUCT <couple>
<80y> $x </> <girl> $y </>
</>
The above query creates couples betrveen boys and girls from the set of children that
appear in each class. .-\Il the possible combinations of couples between boys and girls of
the same age is produced. If we apply Algorit hm :3.2 to the above view and user query
the resulting rewritten query is:
Q : WHERE <couple>
< b o p $b </> <gi r l> $g </> cage> $a </>
</> IN ' couples .xm12 CONSTRUCT
<couple> <boy> $b </> <girl> $g </>
</>
It is clear t hat the above query does not produce the same result. The problem originates
from the fact that a certain couple element in the view can contain more t han one boy
CHAPTER 4. INTRODL~CING EXPLICIT OID DEFINITIONS 45
and girl eiement . Unfort unately. the matching algorit hm considers t hat the structure of
each couple strictly follows the structure of the constructor of the view. which involves a
single boy and a single girl element.
However. i f we split the pattern t ree of Q so t hat the boy and the girl node can match
the constructor t ree independently. the problem is resoived. CVe transform Q as follows:
Q: WHERE <view>
Cclass ID=$il> <boy> $x </>
</> </> IN Classes0 , <viesr>
<class ID=$i2> <girl> $y </>
</> </> IN Classes(), $il = $i2
CONSTRUCT <couple>
<boy> $x </> <girl> $y </>
</>
The transformed query is identical to the original one. since the filter $il = Si2 ensures
that the tcvo pattern expressions match boy and girl elements that belong to the same
class. These trvo pattern expressions are not affected by the grouping under t h e class
element of the view. since they can match the corresponding boy and girl elements no
matter if t hese elements appear under the same or a different class element. Therefore.
we can apply the rewriting algorit hm independently for the two pattern expressions as
described in Section 3.5 and receive a rewritten query which produces the correct out put:
CHAPTER 4. INTRODUCING EXPLICIT OID DEFINITIONS
Cage) $ay </> </> IN 'ccouples.xml'', $a = $a'
CONSTRUCT <couple>
<boy> $b </> <girl> $g2 </>
</> >
The filter $a = Sa' results from the filter $il = QiZ. since Si 1 is matched to I ( $ a ) and
$2 is matched to f (Sa').
In general. the pattern tree that corresponcls to a user query is split into a set of
ordered lists. whose number is equal t o the number of the leaf nodes. Each ordered list
consists of the nodes of a different path of the tree. The first node in each list is the root
node, while the last node in the list is a leaf nocle of t h e tree. Before splitting the tree.
al1 the nodes are assigned a variable as their OID attribute. So. the OID of nodes in
different lists t hat correspond to the same t ree node is represented by a cornmon variable.
4.2 Matching grouped contents
When explicit OID definitions appear in the constructor of the vieiv query. the substitute(}
function in Line 35 of Algorithm 3.2 shoiild b e adjusted to handle grouping.
Consider the following query: Q: UHERE
Cview> <class> $c </>
C/> IN Classes(), Cteacher> $t </> in ' teachers. xml '
CONSTRUCT Cclass>
<teacher> $t </> $C
c/>
This query assigns teachers to classes. A11 the possible combinat ions between teachers
and classes are produced. If we apply the matching algorithm to the above query. c is
CHAPTER 4. INTRODUCING EXPLICIT OiD DEFINITIONS 50
matched to the coIIection of the nodes that represent the boy and girl elernent of t h e
view's constructor. These elements are grouped under the class element by the Skolem
function that defines its OID. According to the rewriting algorithm. c is substituted by
the constructor segment that corresponds to the matched nodes. Since the parent of
these nodes does not participate in the match. the grouping cannot be expressed by a
Skolem funct ion.
In general. a match betiveen a variable in the pattern tree of a user query and a node or
a collection of nodes in the constructor tree of a viem query needs a different manipulation.
when the t hird argument of the corresponding triple involves a Skolem function of an
explicit OID definition. klore precisely. the variable cannot just be replaced by the
constructor segment that the matched nodes represent. since the groupi ng information
should also be included.
We eeplain how the problern is resolved. A grouping performed by means of a Skolem
function can also be derived by using a query block. Consider the following query:
WHERE P(xi. ..., x,) COXSTRCïCT
The grouping t hat the Skolem funct ion f performs can be expressed by a query block as
follows:
WHERE P(xl, .... tn) CONSTRUCT
CHAPTER 4. INTRODUCING EXPLiCIT OID DEFINITIONS
where .Y' = {xi. ... . x',) is produced after renaming the set of variables S = {x ,. ... . x,}.
The grouping is expressed by rneans of a nested query. rvhich scans the input document
and places the d a t a that correspond to the same values of xk under the sarne e element.
Assume that the above query defines a view and consider a user qiiery that is applied
over this view. If now. there is a match between a variable of t h e user query and the
collection of nodes that represent the constructor segment c(xt . .... x,). the variable is
subst ituted by the sub-query block:
The parameters {xk. . ... xi+,} of the Skolem funct ion are derived from the t hird argument
of the triple that represents the match.
If the rewri t t en query involves several such substi tut ions. t h e result ing sub-query
blocks are independent, i.e.. there is no nesting between them. Therefore. the same
riinarning -Y' can be applied to al1 the sub-queries. Since the pattern part P ( r ; . .... x',)
is common in al1 these sub-queries. it can be replaced by a single P(r', . .... xn) in the
WHERE clause of t he main query. Before this replacement. ive explicit ly define the OIDs
in al1 the element tags of t h e query's constructor in order to ensure t hat the mappings
that correspond to P ( x i . .... x',) only affect data that participate i n groupings.
In our example. after follorving the procedure presented above. the follorving rewritten
query is produced:
Q': WHERE <couple>
<boy> $b </>
CHAPTER 4. INTRODUCING EXPLICIT OrD DEFINITIONS
<girl> $g </> cage> $a C/>
</> IN "couple~.wil~~, <teacher> $t </> in ' ' teachers . xml ' >
CONSTRUCT <class>
cteacher> $t </> <
WHERE <couple ID = $ilt>
<boy ID = $i2'> $b8 </> <girl ID = $i3'> $g' </> cage ID = $i4'> $at </>
</> IN ' ' couples. m l , $aJ=$a
CONSTRUCT <boy ID = gl($ilt ,$i2' ,$i3' ,$i4')> $bt </> <girl ID = g2($ilJ ,$i2' ,$i3' ,$i4')> $g8 </>
> c/>
We can rnove the pattern part of t h e sub-query to the outermost level. after including
al1 the implicit OID definitions:
Q' : WHERE <couple ID = $il>
<boy ID = $i2> $b </> <girl ID = $i3> $g </> cage ID = $i4> $a </>
</> IN "couples.xml> ' , (teacher ID = $i5> $t </> in " teachers. xmlJ > , Ccouple ID = $il8>
<boy ID = $i2'> $b8 C / >
<girl ID = $i3'> $gt c/> cage ID = $i4'> $at </>
</> IN ''couples.xml~, CONSTRUCT
cclass ID = hl($il,$i2,$i3,$i4,$i5)> Cteacher ID = h2($ii, $i2,$i3 ,$i4,$i5)> $t c/> €
WHERE $at=$a CONSTRUCT
<boy ID = gl($il',$i2',$i3',$i4')> $b8 </> <girl ID = g?($il'.$i2',$i3' ,$i4')> $g> </>
1 </>
Chapter 5
Conclusions and future work
The problem addressed by this thesis is query composition by means of query rewriting
in the context of XML-QL. More precisely we studied horv a user query defined over one
or more views can be rewritten so that the rewritten query only refers to source data.
Cser queries and views are expressed as S M L - Q L queries.
CVe presented a simple rewriting aigorithm which is based on a matching procedure
between the pattern part of the user query and the constructor part of t h e vietv query.
The matching procedure results in sets of matchings which involve components of the
above parts of the view and the user query. Matchings represent the bindings of the
user-qtiery variables ivhen the user query is applied over the vierv. These bindings can be
derived direct ly from the bindings of the view-query variables. ivi t hout materializing the
vieiv. In order to facilitate the matching procedure. pattern expressions and constructors
are represented by tree structures. The S!vIL-QL data mode1 is object based. i.e., S M L
elemeots are assigned a unique OID. In order to produce a correct rewriting. the elements
in the constructor of the view query should be assigned a n OID by means of Skolem
functions. We showed how OIDs are derived when they are not explicitly defined.
The rewriting algorit hm does not handle multiple pattern expressions and equality
condit ions bet ween variables in the user query, explicit Skolem function defioit ions and
nested queries in the constructor of the vierv query, as weII as regurar path expressions.
CVe showed that multiple pattern expressions can be handled by applying the algori thm
in multiple steps. Each step perlorms the rewriting for a different pattern expression. W e
also demonstrated how the algorithm is enhanced to handle equality conditions between
user-query variables. Finally. we presented how grouping introduced by explicit Skolem
function definitions can be addressed wit hout changing the algorit hm. More specifically.
we proposed a technique of splitting the pattern expressions of the user query. so t hat
the produced pattern expressions are not affected by the grouping. CVe aIso identified
cases in which a grouping performed by a Skolem function should be expressed by means
of a nested query in the rewritten query.
This t hesis did not present a complete solut ion for explicit Skolem funct ion definit ions.
CVe assumed that the same Skolem function cannot appear in more than one element tag.
so element merging is not allowed. CVe are working on how constructor trees should be
extended in order to represent element merging. Nested queries in the constructor of the
view query could also be handled by enhancing the properties of construc tor trees.
In order to simpliiy the matching procedure. we did not study regular path expressions
in the user query. The problem is similar to matching an '(!VIL-QL pattern wit h regular
path erpressions to an XML document. Techniques used by existing SSIL-QL query
engines [Fer991 can be adapted to handle this problem. Regular path expressions in
the view query cannot be handled. when the OIDs in the constructor of the query are
not explicitly defined. in this case. a rewritten query cannot be alrvays produced. This
also means that XML-QL is oot closed under composition. In general. when a query
language involves regular path expressions and the underlying mode1 is object based, it
is not closed under composition [B LP+SS].
We considered that the schema of the source data is unknown. Information about
the schema of the source data. e.g., by means of DTDs or SML Schema? can change the
situation in some special cases. For instance, regular path expressions in the pattern part
CHAPTER 5 . CONCLUSIONS A.ID FUTURE WORK 55
of a query can be eiiminated if the schema strict ty rest ricts the nesting dept h of the data.
Future ivork will conîent rate on how the schema of the data can be exploi ted in order to
overcome the above limitations and produce optimized rewritten qiieries. Previous work
on this direction has been presented in the context of S4I.4S [LPVVW. PV99bI.
Our study was restricted in the iinordered data model. Order makes the problem
harder. It is a subject of future work the investigation of under ivhich conditions and
horv rewriting can be performed when order is considered. Aggregation is another issue
not addressed by this thesis. Relevant work on relational (IiimSi. GWSF. .CIur92] and
object-oriented models [CSI94] can be extended to handle aggregated views in the context
of SML.
Appendix A
XML-QL Grammar
XML-QL grammar for views and user queries XML-QL : : = (Function 1 Query) <EOF>
Function ::= 'FüNCTION' <FUN-ID> '0 '1 ' '0 Query
'P Element ::= StartTag Contents EndTag
S t W T a g ::= '<>(<ID>I<VAR>) ('ID' '= ' SkolemID)? Attribute* '> ' SkolemID ::= <ID> '0 <VAR> 0,' <VAR>)* '1' Attribute : : = <ID> '=' ( "" <STRING> ' Ml 1 <VAR> )
EndTag ::= '<' / <ID>? '> ' Literal ::= <STRING> Query : : = Where Construct ( > { ' Query ' 3 ' ) * Where ::= )WHEREf Condition (', ' Condition)*
Construct ::= 'CONSTRUCT' Contents Contents : := (<VAR> 1 Literal 1 Element 1 Query)+
Condition ::= Pattern BindingAs* 'IN' DataSource Predicate Pattern ::= StartTagPattern Pattern* EndTag
StartTagPattern ::= '<' RegularExpression Attribute* '> ' RegularExpression : : = RegularExpression ' * ' 1
RegularExpression '+' 1 RegularExpression ).' RegularExpression 1 Regularkpression ' 1 ' RegularExpression 1 (<VAR> 1 <ID> 1 <UNDER>)
BindingAs ::= 'ELEMENT,AS3 <VAR> 1 'CONTENT-AS' <VAR> Predicate ::= Predicate <OR> Predicate 1
Predicate <AND> Predicate 1 CNOT> Predicate i ' ( ' Predicate ' ) ' 1 Expression OpRel Expression
Expression ::= <VAR, 1 <CONSTANT> O p R d : := > < * 1 I<=S 1 8 ) ) 1 ' > = a 1 '=' 1 > ! = J
DataSource ::= <VAR> 1 <URI> 1 <FUN-ID> ' ( ' '1'
Bibliography
[.4bi99] S. Abideboul. On Views and XML. In PODS '99. Proccedings of the
e ighteenth -4 C.bI SIG.4 CT-SIGi\IOD-SIG.4 RT s yrnposiu m on Pririciples of
database sytems. pages 1-9. ACXI press. 1999. Invited talk.
[r\GM+ST] S. Abiteboul. R. Goldman. J. SlcHugh. V. Vassalos. and Y. Zhuge. Views for
semistructured data. In Workshop on .\fanagement of Sernistructured Data.
Tucson. Arizona. 1997.
[Bar991
[BDHS96]
[BFSOO]
[BLPC98]
C. Baru. SViews: SML views of relational schemas. In Proceedings of the
10th International CCorkshop on Database and Expert Systerns iIpplications.
pages 700-70.5. Florence. Italy. September 1999.
Peter Buneman. Susan Davidson. Gerd Hillebrand. and Dan Suciu. -4
query language and opt imizat ion techniques for unst ruct ured data. SlG-llOD
Record (-4 C.\I Special Interest Gro u p on .Clanagem ent O/ Data). 23(2):505-
516. 1996.
Peter Buneman. Mary Fernandez. and Dan Suciu. CnQL: -4 Query Language
and Algebra for Sernistructured Data Based on Structural Recursion. C'LDB
Journal. 9(1):T6-110. 2000.
C. Baru, B. Ludaescher, Y. Papakonstantinou, P. Velikhov. and V. k'ianu.
Feat ures and Requirements for an XML View Defini t ion Language: Lessons
[B XI R99]
(BTS91
[CC F+O 11
[CDSSSS]
19 31
[C Mg51
[Con981
from SML Information hfediation. Position paper in W C ' s Query Language
Workshop. 1998.
D. Beech. A. hIalhotra. and SI. Rys. -1 formal data mode1 and algebra for
ShIL. Communication to the LV3C. September 1999.
C. Beeri and Y. Tzabari. S.IL: -An algebra for semistructured data and SML.
In Informal Proceedings of CCorkshop on Thhe 1Veb and Databases. A C.11 SIG-
.VOL). June 1999.
D. Chamberlin, J . Clark. D. Florescu. J. Robie. J . Simeon. and 11. Stefanescu.
SQuery 1.0: A n S.CIL Query Language. June 2001. LV3C Working Draft.
S. Cluet. C. Delobel. J. Siméon. and fi. Smaga. Your mediators need data
conversion! In Proceedings of the -4 C.\I SIG.MOD Inte mationai Con ference on
.Vanagement of Data (SIGUOD'98), volume 27.2 of .-IC.\I SIGLCIOD Record.
pages 177-1SS. Yew York. June 1-4 1998. ACSI Press.
Sophie Cluet and Guido Sloerkotte. Nested queries in object bases. In Catriel
Beeri. Atsushi Ohori. and Dennis Shasha. editors, Database Programming
Languages (DBPL-4). Proceedings of the Fourth Inlernationnl CC'orlrshop on
Database Programming Languages - Object 5lodels and Languages. .Manhat-
tan. ,Vew York City, USA, 30 ..lugzist - 1 Septernber 199.7. iVorkshops in
Computing, pages 226-242. Springer, 1993.
S. Cluet and G. Sloerkotte. Classification and opt imization of nested queries
in object bases. Technical Report 95-6. RCVTH Aachen. 199.5.
World Wide CVeb Consort iurn. Extensible Markup Language (SM L ).
htt p://~vww.\vJ.org/TR/REC-xml. February 1998.
[FS Wf 991
[FS WOO]
A. Deutch. M. Fernandez. D. FIorescu. A. Levy. and D. Suciu. XML-QL: A
query language for S41L. Subrnission to the World CVide Web Consortium.
-4ugust 199s. http://rvww.w3.org/TR/NOTE-xml-ql.
-4. Deutch. M. Fernandez. D. Florescu. A. Levy. and D. Suciu. .A query
language for ShIL. In Proceedings of the 8th lrrternational Ciorld Wïde W e b
Conjerence. pages 77-9 1. Elsevier Science. hlay 1999.
Don Chamberlin and Daniela Florescu and Jonathan Robie. Quilt: an S M L
query language for heterogeneous data sources. In Proceedings of ' iébDB.
Dallas. T S . May 2000.
L. Fegaras. Query unnesting in object-oriented databases. In Laura Haas
and Ashutosh Tiwary. edi tors. Proceedings of the 1998 -4 C.11 SIC;.IIOD In-
ternational Conference on .\fanagement of Data: June 1-4. 1998. Seattle.
Washington. LiS.4. volume X ( 2 ) of SIC:\IOD Record (-4 C . i Special Interest
Group on .\fanagement of Data). pages 49-60. Sew Iork. NY 10036. CS..\.
1998. ACM Press.
SI. Fernandez. S l 1 L-QL : .A Query Language for SML. User's Guide Version
0.9. http://rvww.research.att.com/ mff/xmlql/doc/. 1999.
11. Fernandez. J . Siméon. P. CVadIer. S. Cluet. A. Deutsch. D. Flo-
rescu. A. Levy. D. Slaier. J . SIcHugh. J. Robie. D. Suciu. and
J. Widom. KM L query languages: Experiences and exemplars. http://www-
dbxesearch. belllabs.com/user/simeon/.uqps 1999.
Mary Fernandez. Jerome Simeon. and Philip Wadler. -4 Data Mode1 and
Algebra for XàI L Query. Technical Report. iVum be r Unpublished manuscript.
[Ghf W99j R. Goidman. J. McHugh, and J. Widom. From Semistructured Data to XML:
Migrating the Lore Data %[ode1 and Query Language. In Proceedings of the
2nd International IVorkshop o n the CVeb and Databases (CVebDB '99). pages
2-5-30. P hiladephia. Pensylvania, June 1999.
[GWS7] Richard A. Ganski and Harry K. T. Wong. Optimization of nested SQL
qtieries revisi ted. In Cmeshwar Dayal and Irv Traiger. editors. Proceedings of
rlssociation /or Computing .CIachinery Speciul Interest Group on Manage ment
of Data 1987 annual conference. San Francisco. .Va y 27-29, 1987. pages 23-
33. Yew York. Nk- 10036. LS.4. 1987. .\C'.CI Press.
[IiimY 11 W. Kim. On optimizing SQL-like nested queries. Technical Report R J 3063.
IBM, San Jose. C.A. 1981.
[LDOO] H. Lieke and S. Davidson. View Maintenance for Hierarchical Semist ructured
Data. In International Conference on Data LVarehousing and Iinoudedge
Discovery. 2000.
[LPVV99] B. Ludaescher. Y. Papakonstantinou. P. Velikhov. and V. Vianii. Vierv def-
inition and DTD Inference for ! M L . In Workshop on Quwy Processing for
Semi-Structurd Data and .Von-Standard Data Formats (in conjunction udh
[CD T '99). .Jerusalem. Israel. 1999.
[MFOO] D. Suciu 41. Fernandez. W-C. Tan. SiIkRoute: Trading between Relations and
.Y31 L. In Proceedings o/the 9th International WWCV Con ference. -4 msterdam.
May 2000.
[MS..\+99] J. C. Mamou. C. Souza. S. A bideboul. V. .iguilera. .A. .Ailleret. B. Amann.
S. Cluet. B. Hills. F. Hubert. -4. Marian. L. Mignet. B. Tessier. -4. 'LI. Ver-
coustre, and T. Milo. XML repository and Jlctive Vieivs Demonstration.
In Proceedings of 25th Infernafionat Conference /or COry Large Databases
( I.Z DE? *99), Sep tember 1999. Demonst rat ion.
M. Muralikrishna. Improved unnest ing algorithms for join aggregate SQL
queries. In Proceedings of the 18th Conference on Ikry Large Databases.
.klorgan Iiaufman pubs. (Los rlltos C.4). Cbncouoer. August 1992.
J . .LIcHugh and J. Widom. Query Optimization for SML. In Proeeed-
ings of the Tuenty-Filth International Conference on I éry Large Databases
(I.ZDB.99). September 1999.
[P.\Gi*I96] Y. Papakonstantinou. S. Abiteboul. and H. Garcia-hlolina. Object Fusion in
Mediator Systems. In T. SI. Vijayaraman et al.. editors. Proceedings of the
E n d Internationul Conference on l è r y Large Data Bnscs (1208'96). pages
4 13-424. Los .Utos. CA 94022. US.\. 1996. Morgan Kaufmann Pu blishers.
[P HHSI] Hamid Pirahesh. Joseph 11. Hellerstein. and Waqar Hasan. Extensi ble/ruIe
based query rewrite optimization in Starburst. In hIichae1 Stonebraker. ed-
itor. Proceedings of the 1992 rlCM SIGMOD International Conference on
ibfunngement of Data. San Diego, California, June 2-5. 1992 volume 21(2)
of SIGMOD Record (A CM Special Interest Croup on Management of Data).
pages 39-43. Sew York. XY 10036. CSA. 1992. ACM Press.
Y. Papakonstant inou and V. Vassalos. Query Rewriting for Semistructured
Data. In Alex Delis, Chris tos Faloutsos, and Shahram G handeharizadeh, edi-
tors. Proceedings ojthe ACM SIGMOD Idemational Conference on :Cfanage-
ment of Data (SIGMod-991, volume ZY .2 of SIG.WD Record. pages 45.5-466.
Xew York. June 1-3 1999. .ICM Press.
Y. Papakonstatinou and P. Velikhov. Enhancing Semistructured Data Me-
diators with Document Type Definitions. In Proceedings of the 15th Inter-
nalionaf Conference on Dafa Engineering. TEEE Corn pu ter Society Press?
March 3 - 2 6 1999.
[RLSSS] J . Robie. .J. Lapp. and D. Schach. ?(ML Query Language (SQL) .
http://www.w3.org/TandS/QL/QL9S/pp/xql.html. 199s.
[SLRSS] D. Schach. J . Lapp. and J. Robie. Querying and transforming U I L . In
Proceedings of The N:3C Q u e r y Languages CCorkshop. Boston. 3-4 Dccember
1998. http://mrvr\r.rv9.org/TandS/QLSY/.