QUERYING AUTONOMOUS, HETEROGENEOUS
INFORMATION SOURCES
a dissertation
submitted to the department of computer science
and the committee on graduate studies
of stanford university
in partial fulfillment of the requirements
for the degree of
doctor of philosophy
By
Vasilios Antoniou Vassalos
September 2000
© Copyright 2000 by Vasilios Antoniou Vassalos
All Rights Reserved
I certify that I have read this dissertation and that in my
opinion it is fully adequate, in scope and quality, as a disser-
tation for the degree of Doctor of Philosophy.
Jeffrey D. Ullman (Principal Advisor)
I certify that I have read this dissertation and that in my
opinion it is fully adequate, in scope and quality, as a disser-
tation for the degree of Doctor of Philosophy.
Hector Garcia-Molina
I certify that I have read this dissertation and that in my
opinion it is fully adequate, in scope and quality, as a disser-
tation for the degree of Doctor of Philosophy.
Yannis Papakonstantinou
Approved for the University Committee on Graduate Studies:
Abstract
A wide variety of information sources are available both in internal networks of organiza-
tions and on the Web. These sources are autonomous, have different and limited query
capabilities, and usually contain heterogeneous data that only have partial, flexible, or im-
plicit structure, i.e., that are semistructured (e.g., XML, bibliographic, or genomic data).
Enabling users to query in an integrated manner the wealth of information in these sources
is a crucial requirement for increasing the usefulness of the Web as an information resource
and for enabling electronic commerce.
An effective system for online integration of such sources must perform two main tasks
efficiently in response to a user query: first, devise a query plan that locates and
retrieves the relevant pieces of information from the sources, by submitting to the sources
localized queries that respect the sources’ query capabilities. Then, combine the pieces of
information to produce a unified answer. This thesis develops powerful query processing
techniques and architectures for information integration and studies some of the tradeoffs
between the generality of the language and the efficiency of query processing in such an
integration system.
The thesis adopts a powerful framework for the construction of an online integration
system, proposed by the TSIMMIS project at Stanford University and inspired by the
formal, declarative underpinnings of modern database and knowledge base management
systems. In this framework, the core of the integration system is a query processor called
a mediator, which implements the integrated query processing algorithms. The details of an
integration scenario, including the integrated views that describe the way source information
is to be combined, and the contents and query capabilities of the sources, are specified
declaratively, in a high-level specification language.
The thesis studies logical languages for the specification of integrated views and the
description of query capabilities. Both relational and semistructured languages, with and
without recursion, are studied from the point of view of expressive power and efficiency.
The thesis presents sound and complete algorithms that solve the key problem of generat-
ing query plans that respect query capabilities described in these powerful languages (the
capability-based rewriting problem). In particular, the first algorithm solving this problem
for a semistructured language is presented.
Nunc est bibendum
Horace, Ode 37
Dedicated to my parents, Antonios and Angeliki, and my sister Leda
Acknowledgments
First and foremost, I would like to thank my advisor, Jeffrey Ullman, for his guidance,
advice and support. His door was always open, and his comments on any issue were always
insightful. As indispensable as his mentoring was to me in research, his guidance and advice
on all other aspects of academic and professional life were just as important. I am grateful.
I would also like to thank Yannis Papakonstantinou. Yannis and I have collaborated
extensively; this thesis would not have been possible without him. I thank him for his
insight, his enthusiasm, the generous application of his critical skills to our papers (and to
this thesis), and for being a friend. I hope our current endeavors are even more successful
in their own way.
I thank Hector Garcia-Molina for his leadership and his support of the TSIMMIS project
in general and my work in particular. His impeccable taste in research topics and his ability
to home in on the flaws and merits of an idea quickly and explain them incisively have been
an inspiration.
Serge Abiteboul’s brilliant comments and his research style greatly influenced both my
own research and my attitude towards research. I thank him for sharing his enthusiasm for
good research and his distaste for sloppy, lazy research. I also thank him for selling me a
trouble-free car.
I would like to thank the members of my reading and oral defense committees, Hector,
Yannis, Gio Wiederhold, and Kincho Law. I would also like to thank all the faculty members
of the Stanford database group, Jeff, Hector, Gio and Jennifer Widom, for creating an
exciting environment for research and for leading the best database group in the world.
The Stanford database group is full of amazingly intelligent, exciting and friendly people.
It has been a pleasure to live and work in such a high-energy milieu, and for that I want
to thank every member of the group, and particularly Shiva, Sergey, Wilburt, Jason, Roy,
Svetlozar and Tom. Special thanks go to Junghoo and Calvin for interesting discussions in
our office, and for putting up with me yelling on the phone occasionally.
The Stanford Computer Science Department, first located in MJH, then in the Gates
building, has provided the backdrop for a big part of my life for the past few years. Thanks
to Suresh, Ashish, Piotr, Aris and Donald for making it memorable.
I also want to thank my friends Suresh, Yiannis Kontoyiannis, Menelaos, Panos, Yiannis
Orginos and Christos for all the wonderful memories and the great times we had.
This thesis is the culmination of a long educational journey. My parents guided my first
steps and cultivated in me the pursuit of excellence and love of learning that carried me
through that journey. They let me move to the other end of the earth to continue that
journey at a time that was difficult for both of them, and that my departure made more
difficult. My parents and my sister have always offered me their unwavering support and
love. I cannot thank them enough.
Generous support for my work by the NSF, DARPA, the Air Force, the Bodossakis
Foundation and the L. Voudouri Foundation is gratefully acknowledged.
Contents
Abstract iv
Acknowledgments vii
1 Introduction 1
1.1 Information integration and the challenges of autonomy and heterogeneity . 1
1.2 Semistructured data model . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Equivalence of OEM databases . . . . . . . . . . . . . . . . . . . . . 5
1.2.2 OEM and XML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Information Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 Modeling and using source query capabilities . . . . . . . . . . . . . . . . . 10
1.4.1 Problem definitions: CBR and query expressibility . . . . . . . . . . 11
1.4.2 CBR and query rewriting using views . . . . . . . . . . . . . . . . . 11
1.5 TSIMMIS for information integration . . . . . . . . . . . . . . . . . . . . . 12
1.6 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 Related Work 15
3 DSL: A Language for Semistructured Data 20
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 The DAG Specification Language . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.1 Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.2 Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.3 Syntactic restrictions on DSL rules . . . . . . . . . . . . . . . . . . . 25
3.2.4 Expressive power and complexity . . . . . . . . . . . . . . . . . . . . 28
3.2.5 Normal forms of DSL rules . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Query composition for DSL . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Equivalence of DSL queries . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.1 Mappings and containment mappings . . . . . . . . . . . . . . . . . 39
3.4.2 Extending the chase for set variables . . . . . . . . . . . . . . . . . . 40
3.4.3 Deciding DSL query equivalence . . . . . . . . . . . . . . . . . . . . 42
3.5 DSL and other semistructured languages . . . . . . . . . . . . . . . . . . . . 44
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 Query Rewriting for Semistructured Data 46
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 DSL Query Rewriting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.1 Rewriting of Queries with a Single Path Condition . . . . . . . . . . 48
4.3 Using structural constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3.1 General case of query rewriting . . . . . . . . . . . . . . . . . . . . . 53
4.3.2 Completeness and Complexity . . . . . . . . . . . . . . . . . . . . . 55
4.4 Capability-based rewriting in the TSIMMIS mediator . . . . . . . . . . . . 57
4.4.1 Query Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4.2 Physical Plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4.3 Source Capabilities Description: Templates . . . . . . . . . . . . . . 59
4.4.4 Capability Based Plan Generation . . . . . . . . . . . . . . . . . . . 60
4.4.5 Rewriting algorithm and capability-based plan generation . . . . . . 62
4.5 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5 The Capability Description Language p-Datalog 64
5.1 The p-Datalog Source Description Language . . . . . . . . . . . . . . . . . . 65
5.1.1 Formal description of p-Datalog. . . . . . . . . . . . . . . . . . . . . 69
5.2 Deciding query expressibility with p-Datalog descriptions . . . . . . . . . . 71
5.2.1 Expressibility and translation . . . . . . . . . . . . . . . . . . . . . . 80
5.3 Answering Queries Using p-Datalog Descriptions . . . . . . . . . . . . . . . 80
5.3.1 CBR with binding requirements . . . . . . . . . . . . . . . . . . . . 83
5.4 An interesting and more efficient class of p-Datalog descriptions . . . . . . . 86
5.4.1 Lattice Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.4.2 QED and Ploop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.5 Expressive Power of p-Datalog . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.6.1 Describing binding requirements in p-Datalog . . . . . . . . . . . . . 93
5.7 Conclusions and open problems . . . . . . . . . . . . . . . . . . . . . . . . . 95
6 The Capability Description Language RQDL 97
6.1 The RQDL description Language . . . . . . . . . . . . . . . . . . . . . . . . 98
6.1.1 Using RQDL for query description . . . . . . . . . . . . . . . . . . . 99
6.1.2 Semantics of RQDL . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.2 RQDL and mediator capabilities . . . . . . . . . . . . . . . . . . . . . . . . 104
6.3 Reducing RQDL to p-Datalog with function symbols . . . . . . . . . . . . . 106
6.3.1 Reduction of a database to standard schema database . . . . . . . . 106
6.3.2 Reduction of queries to standard schema queries . . . . . . . . . . . 108
6.3.3 Reduction of RQDL programs to Datalog programs over the standard
schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.4 QED and CBR for RQDL descriptions . . . . . . . . . . . . . . . . . . . . . 114
6.4.1 The query expressibility problem for RQDL . . . . . . . . . . . . . . 114
6.4.2 The CBR problem for RQDL . . . . . . . . . . . . . . . . . . . . . . 122
6.5 Conclusions and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . 124
A Enabling Integration: TSIMMIS Wrappers 126
A.1 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
A.2 Implemented Wrappers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
B Sort program 132
Bibliography 133
List of Tables
4.1 Matcher Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
List of Figures
1.1 Example OEM objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Text representation of OEM objects . . . . . . . . . . . . . . . . . . . . . . 4
1.3 XML representation of data of Figure 1.2 . . . . . . . . . . . . . . . . . . . 7
1.4 A common integration architecture . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 Declarative specification of wrappers and mediators . . . . . . . . . . . . . 12
1.6 Mediator architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1 Result of (Q2) on database of Figure 1.1 . . . . . . . . . . . . . . . . . . . . 23
3.2 Body graph examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Head graph of (P5) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 OEM database and result for Example 3.2.5 . . . . . . . . . . . . . . . . . . 28
4.1 DSL query rewriting algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2 TSIMMIS CBR architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.1 Algorithm QED . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.2 Algorithm QED-T . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.3 Supporting set lattice for fact f for a database of size 5 . . . . . . . . . . . 88
5.4 Supporting sets and least common ancestor . . . . . . . . . . . . . . . . . . 89
6.1 Default rules for generation of attr tuples . . . . . . . . . . . . . . . . . . . 112
6.2 Extended facts produced by Algorithm QED-T for Example 6.4.7 . . . . . . 125
A.1 Wrapper architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
A.2 Wrapper components and procedure calls . . . . . . . . . . . . . . . . . . . 128
B.1 A logic program implementing selection sort . . . . . . . . . . . . . . . . . . 132
Chapter 1
Introduction
1.1 Information integration and the challenges of autonomy
and heterogeneity
Information today (2000) resides on a variety of information sources that are increasingly
interconnected. File systems, databases, document retrieval systems, workflow systems,
ERP (enterprise resource planning) systems, data warehouses, and other sources of valuable
information are accessible inside corporate intranets. Moreover, they are also becoming
increasingly available to the “outside world,” through extranets or the Internet. Being able
to access and make sense of the available information is a significant challenge: organizations
(and individuals) represent, maintain, and export the information using a variety of formats,
data models, interfaces and semantics.
In order to use the information productively, it is important to get integrated access to
it, i.e., to be able to request information (that may be found split among various sources),
and get a consistent, integrated response, regardless of which information sources store the
information in the answer and how they export it.
An instance of this problem is the problem of integrating relational or object-oriented
databases. The database community has identified and worked on this problem since
the 1980s, making substantial progress on database integration techniques [A+91; Gup89;
LMR90; T+90]. But that line of work made a number of assumptions — namely fixed
schemas, unrestricted access to the information and, in general, control over the informa-
tion sources — that are increasingly often invalid: data is often residing in autonomous
CHAPTER 1. INTRODUCTION 2
sources containing heterogeneous information. Moreover, these information sources have
different and often limited query capabilities.
The challenges of autonomy of sources and heterogeneity of information can be addressed
by using a more flexible data model and a nonintrusive, online integration architecture that
relies on interpreted, declarative specifications to guide the integration task. Using a more
flexible, semistructured data model makes it possible to deal with data that do not conform
to a rigid schema, or whose schema is not fully known in advance or evolves rapidly. Using a
nonintrusive, online integration architecture respects the autonomy of the information sources,
lowers the development time, and provides users with always correct, always up-to-date
answers. These points are briefly discussed in the next sections.
1.2 Semistructured data model
Much of today’s electronically stored information does not conform to traditional relational
or object oriented data models. Several applications store their data in nonstandard data
formats, in legacy systems, or in structured documents such as HTML or SGML. These data often
have irregular structure: some objects may have missing attributes and others may have
multiple occurrences of the same attribute. Moreover, as explained above, traditional data
models are not well-suited to the task of integrating heterogeneous data sources: often these
sources belong to external organizations or partners not under the application’s control;
even if the data is internally modelled as object-oriented data, their structure is often
only partially known, and may change without notice. Finally, data among these sources
are usually syntactically heterogeneous: the same attribute may have different types in
different objects, and semantically related information may be represented differently in
various objects. Data like the above, characterized by the presence of some structure but the
absence of a rigid, known schema, have recently been called semistructured data [PGMW95].
Important, desirable properties for a semistructured data model include that the data are
self-describing1 (because of the frequent absence of an a priori schema), that the model is
not strongly typed, that it supports nesting and is not in first normal form [Ull89].
A powerful and intuitive way to model semistructured data that satisfies the above
requirements is to represent them as labeled graphs. The database is then self-describing
in that the schema of a database instance is moved into the graph, in the form of labels
1 In the sense that it can be parsed without reference to an external schema.
Figure 1.1: Example OEM objects
<&1,DBPL,set,{&2,&3}>
<&2,Book,set,{&4,&5,&6,&7}>
<&4,Title,string,‘Materialized Views’>
<&5,ISBN,integer,999>
<&6,Keyword,string,‘Relational’>
<&7,Author,string,‘A. Gupta’>
<&3,Article,set,{&7,&8,&9}>
<&8,Title,string,‘Constraint Checking’>
<&9,Conference,set,{&10,&11,&12}>
<&10,Name,string,‘SIGMOD’>
<&11,Year,integer,1993>
<&12,Location,string,‘Washington, DC’>
Figure 1.2: Text representation of OEM objects
attached to the graph nodes or edges. Moreover, graph edges naturally model object-
subobject relationships.
The most popular semistructured data model, which is also the semistructured data
model used in this thesis, is the Object Exchange Model (OEM), proposed by the TSIMMIS
project and originally described in [PGMW95].
In the OEM data model, the data are represented as a rooted graph with labeled nodes2
that have unique object ids. Figure 1.1 is an example of bibliographic data modeled as an
OEM graph. A textual representation of the same data is shown in Figure 1.2.
Each OEM object consists of an object-id (e.g., &2), a label that explains its meaning
(e.g., Title), a type (e.g., string), and a value.3 Labels are strings that are meaningful
to applications or end-users. Labels may have different meanings at different information
sources.
Objects can be either atomic or complex (set objects). The value of an atomic object is
of the specified atomic type (e.g., ‘SIGMOD’). In the rest of the thesis, we assume that the
2 In a later version of the model, used in the LORE database system, labels are attached to edges. This approach leads to only minor differences in the description of information and in the corresponding query and view definition languages. The techniques and algorithms described in this thesis apply with little change to the LORE version of the data model.
3 Note that, for simplicity, type information has been omitted from Figure 1.1.
type of all atomic objects is string and we omit type information from objects. The value
of a complex object is a set of objects. Notice that this definition is inherently recursive,
since the value of an object is part of the object.
Thus in the OEM data graph the nodes are the objects and the edges denote the object-
subobject relationships. The leaf nodes have an associated atomic value. The set of objects
pointed to by the outgoing edges from o is the value of o; because of the recursive nature
of this definition, the value of o is essentially the OEM subgraph rooted at o.4 The OEM
graph has roots, i.e., distinguished, top-level objects with all other objects accessible from
them.
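As a concrete illustration, the object structure just described can be sketched in a few lines of Python. This is our own hypothetical representation, not one prescribed by the thesis; the class and field names are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class OEMObject:
    """One OEM object: an object id, a label, and a value that is
    either atomic (a string, per the convention above) or a set of
    subobjects. Type tags are omitted, as in the rest of the thesis."""
    oid: str                               # e.g., "&4"
    label: str                             # e.g., "Title"
    value: Union[str, List["OEMObject"]]   # atomic value or subobjects

    def is_atomic(self) -> bool:
        return isinstance(self.value, str)

# Part of the Book object of Figure 1.2, built bottom-up:
title = OEMObject("&4", "Title", "Materialized Views")
isbn = OEMObject("&5", "ISBN", "999")
book = OEMObject("&2", "Book", [title, isbn])
```

Note how the value of the complex object `book` is the collection of its subobjects, mirroring the recursive definition: the value of an object is essentially the subgraph rooted at it.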
The object ids are typically atomic data. Formally, they are terms from the Herbrand
universe [End72] consisting of
• a set of atomic data, which includes but is not necessarily confined to, the atomic
data appearing as labels and values, like &10 and Smith, and
• an arbitrary set of freely interpreted function symbols. For example, f(&10, ashish)
is a possible object id, and the function symbol f “defines” the term. The function
symbols are interpreted “freely”, in the sense that two terms are considered equal only
if they are syntactically identical.
Object ids may be symbols with no particular meaning, or they may carry semantic
meaning. For example, if the object is a Web page, then it is typically a good idea to
have the URL be the object id. Furthermore, meaningful term object ids can facilitate the
integration tasks, since they carry information about how they were created. Thus a term
object id can describe how an integrated object was constructed. That information can be
useful to both human users and the query processor. We will discuss this issue more in
Chapter 3.
Note that OEM objects are self-describing, in that semantic and structural information
about objects is encoded in the labels, the semantic object ids and the object-subobject
relationships. Note also that OEM poses no restrictions on the labels of subobjects. For
example, some Article object can have a single Title object, others may not have any
Title, and others may have multiple Titles. In this way we allow Articles to have irreg-
ular structures while at the same time regularities in the structure are implicitly reflected
4 Excluding o itself.
in the OEM object. Indeed, even if Articles have a regular structure we still gain by using
OEM in integration, since Articles from different information sources will often have differ-
ent, albeit regular, structures: the regularities can be reflected in the integrated structure,
that also accommodates the structural differences gracefully.
Even though OEM can model data that can naturally be represented as an arbitrary
graph, in many applications data is naturally represented as a directed acyclic graph, or as
a tree. When the underlying graph is a tree, object ids can be omitted from the textual
representation without loss of information.
1.2.1 Equivalence of OEM databases
Two OEM databases D1 and D2 are equivalent if they are isomorphic, i.e., if there exists a
bijection θ between the object ids in the two databases such that for every pair (o, θ(o))
of object ids, with o ∈ D1 and θ(o) ∈ D2, the two objects identified by o and θ(o): (i) have
the same label l; (ii) both of them have an atomic value or both of them have a set value;
(iii) if they are atomic objects, they have the same atomic value v; and (iv) if they are set
objects, they have isomorphic sets of subobjects.
Expressed differently, two OEM databases are equivalent if they are identical up to
object id renaming.
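For tree-shaped OEM databases, this definition can be checked directly: since equivalence is identity up to object-id renaming, it suffices to compare labels and values while ignoring object ids. The following Python sketch is ours, not from the thesis, and handles only the tree case; for arbitrary OEM graphs the check amounts to graph isomorphism.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class OEMObject:
    oid: str
    label: str
    value: Union[str, List["OEMObject"]]   # atomic value or subobjects

def canonical(o: OEMObject):
    # Canonical form drops object ids; subobject sets are compared
    # order-insensitively by sorting canonical forms (key=repr avoids
    # comparing atomic and complex forms directly).
    if isinstance(o.value, str):
        return (o.label, o.value)
    return (o.label, tuple(sorted((canonical(c) for c in o.value), key=repr)))

def equivalent(d1: OEMObject, d2: OEMObject) -> bool:
    # Conditions (i)-(iv) hold exactly when the canonical forms coincide.
    return canonical(d1) == canonical(d2)

a = OEMObject("&1", "Book", [OEMObject("&2", "Title", "Views")])
b = OEMObject("&9", "Book", [OEMObject("&7", "Title", "Views")])
assert equivalent(a, b)      # same structure, different object ids
```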
1.2.2 OEM and XML
Other semistructured data models that have been proposed [Suc98; BDHS96] are very
similar to OEM. Recently, the Extensible Markup Language (XML) [BPSM] has emerged as
the new, lightweight standard for the description, exchange and integration of information
on the Web.5 XML data are self-describing and bear a striking similarity to OEM data —
as well as other semistructured models — as is illustrated in Figure 1.3, which presents the
OEM data of Figure 1.2 in XML syntax.
There are a few differences between XML and OEM; most of them stem from the
document-oriented nature of SGML [SGM], which is the standard XML is derived from.
A detailed comparison of the (mostly superficial) differences between OEM/semistructured
models and XML can be found in [Suc98]. The following differences are most relevant to
5 In [MFDG98], John Bosak, a leader in the XML initiative, mentions information-integration applications as a major motivation for the lightweight XML standard.
<DBPL id="&1">
  <Book id="&2">
    <Title id="&4">Materialized Views</Title>
    <ISBN id="&5">999</ISBN>
    <Keyword id="&6">Relational</Keyword>
    <Author id="&7">A. Gupta</Author>
  </Book>
  <Article id="&3" idref="&7">
    <Title id="&8">Constraint Checking</Title>
    <Conference id="&9">
      <Name id="&10">SIGMOD</Name>
      <Year id="&11">1993</Year>
      <Location id="&12">Washington, DC</Location>
    </Conference>
  </Article>
</DBPL>
Figure 1.3: XML representation of data of Figure 1.2
mediation applications:
Ordering The subobjects of an XML object (called element) are always ordered, while
OEM objects are unordered. Ordering brings determinism to the textual represen-
tation of objects or, for that matter, to the representation of an object transmitted
over a network. Furthermore, it allows the natural modeling of lists. However, query
languages with order semantics have higher complexity and more complicated semantics;
it is unnecessary to incur these costs when the order is semantically meaningless.
Hence, it is likely that an XML version that combines support for lists and sets will
emerge. Many important query processing and optimization problems in information
integration are still open when a data model and language with order semantics is
used.
Subelements versus Referenced Elements OEM supports only one relationship across
elements: an object x may point to an object y. Thus OEM models both an object-
subobject and a reference relationship with directed edges between objects. In con-
trast, in XML (as well as in object-oriented models [AHV95]) there is a distinction
between element y being a subelement of x and x referring to an element y. The graph
representation of an XML document thus potentially involves two kinds of edges: the
“subelement” ones and the “reference” ones.
This distinction may be important from an information modeling point of view. In
addition this distinction provides some benefits for the query processing and storing
of semistructured objects. For example, the edges that correspond to the subelement
relationship are usually more numerous than the “reference” edges, and the subele-
ment edges form a tree. A query processor may be able to exploit the tree structure.
Similarly, a storage manager can exploit the natural nesting of the tree objects.
Nonstring Data XML does not allow nontextual data as values of atomic objects. In-
stead, binary data are linked as external entities. Separating textual from binary data
adds complexity that seems to be useful only in the context of a document model —
as opposed to a data model such as OEM.
Schema XML provides for flexible schema definition languages to capture the islands of
structure existing in semistructured information. OEM can use similar languages for
the same purpose, even though originally it was designed as a schemaless data model.
The benefits and uses of available schema information for semistructured data are
an important topic for information integration that is not addressed in this thesis —
we only discuss briefly one of the uses of flexible schema information to information
integration in Section 4.3. For some recent work on the topic, see [PV00; MS99;
MSV00].
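The OEM-to-XML correspondence of Figure 1.3 can be mechanized for tree-shaped data: each object becomes an element named after its label, with the object id carried in an id attribute. The following is our illustrative sketch, not part of the thesis; it deliberately ignores references, ordering, and nonstring data — precisely the differences discussed above.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class OEMObject:
    oid: str
    label: str
    value: Union[str, List["OEMObject"]]

def to_xml(o: OEMObject, indent: int = 0) -> str:
    # One element per object; the label names the element and the
    # object id becomes an "id" attribute, as in Figure 1.3.
    pad = "  " * indent
    if isinstance(o.value, str):
        return f'{pad}<{o.label} id="{o.oid}">{o.value}</{o.label}>'
    inner = "\n".join(to_xml(c, indent + 1) for c in o.value)
    return f'{pad}<{o.label} id="{o.oid}">\n{inner}\n{pad}</{o.label}>'

conf = OEMObject("&9", "Conference", [OEMObject("&10", "Name", "SIGMOD")])
print(to_xml(conf))
# <Conference id="&9">
#   <Name id="&10">SIGMOD</Name>
# </Conference>
```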
1.3 Information Integration
Information integration systems provide integrated query access to the sources via the
mediator/wrapper architecture of Figure 1.4.
Figure 1.4: A common integration architecture
Conceptually, an integration system must perform three main tasks in response to a
user query:
• identify and locate the pieces of data in the information sources that make up the
answer to the user query,
• create and execute a query plan for correctly and efficiently retrieving the data from
the sources, and
• construct the answer to the user query by piecing together appropriately and manip-
ulating the returned data.
These tasks are performed by the mediator. The mediator is a distributed query process-
ing, optimization and execution engine for distributed information residing on autonomous,
heterogeneous sources. The mediator combines, integrates, and refines source data, provid-
ing applications with a “cleaner” integrated view of the source information. For example,
a web car-shopping mediator provides access to a set of dealers’ pages and even to other
car-shopping mediators. Users accessing the mediator would see a single collection of on-
sale cars, with, for example, duplicates removed,6 format discrepancies resolved, and cars
ranked according to some criterion such as having the current best offers appear first. The
same or different mediators can provide different integrated views of source information,
geared towards different integration uses or applications, either focusing on different parts
of the source information (e.g., new, Japanese sedans with at least one review), or per-
forming different kinds of integration (e.g., summarization versus information fusion versus
schema/structure normalization).
The integration system uses a uniform data representation (most appropriately, a se-
mistructured one, for the reasons explained in Section 1.2) and a common query language.
Wrappers present a logical view of the data of each source represented into the common data
model. The wrappers also accept queries in the common query language on the exported
logical view. When the wrappers receive a query, they translate it into one or more source-
specific queries or commands that are issued to the source. They also translate the source
result into the common data model. The wrappers abstract away the implementation and
interface details of information sources from the rest of the integration system. Applications
can access data directly through wrappers, although most applications will typically access
them through mediators.
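A wrapper's two translation duties — queries in, results out — can be caricatured as follows. This is a toy sketch under our own assumptions: the URL, parameters, and result shape are invented, not drawn from any real source or from the thesis.

```python
from urllib.parse import urlencode

def translate_query(attribute: str, value: str) -> str:
    # Common-language selection query -> native query: here the
    # "native" interface is a hypothetical web search form.
    return "http://source.example/search?" + urlencode({attribute: value})

def translate_result(rows):
    # Native result (flat tuples) -> objects in the common data model.
    return [{"label": "Book", "value": {"title": t, "year": y}}
            for (t, y) in rows]

url = translate_query("year", "1997")
answer = translate_result([("Materialized Views", 1997)])
```

The point of the sketch is the boundary it draws: everything above these two functions speaks the common model and language; everything below them is source-specific.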
1.4 Modeling and using source query capabilities
In the architecture of Figure 1.4, the mediator decomposes incoming client queries, which
refer to the integrated view presented by the mediator and are expressed in some common
query language, into new common-language queries that refer to data found in the individual
6 Duplicates could have been introduced because multiple sites may be advertising the same car.
sources and are sent to the wrappers. The information sources connected to the wrappers
are accessible through interfaces with varying query capabilities; the queries emitted
by the mediator must conform to these capabilities, since otherwise the wrappers cannot
translate them into native queries and commands. Let us use an example
to illustrate the query processing steps followed by the mediator.
Consider a bibliographic mediator that combines the data of multiple bibliographic
sources into a single “union” view. The user query requests all “SIGMOD 97” publications.
The mediator decomposes the user query into multiple “SIGMOD 97” queries, of which
each is source-specific, i.e., it refers to one source only. To do the decomposition correctly
and efficiently, the mediator must figure out how to extract the necessary information from
the sources using their query capabilities. This is the Capability-Based Rewriting (CBR)
problem. In our example, if one source only supports selection queries on “year”, solving
the CBR the mediator will decide that a query that retrieves the “97” publications will be
sent to this source. The rest, i.e., filtering for “SIGMOD,” will be done at the mediator.
After such decisions are made and the mediator has formulated a query plan that respects the
query capabilities of the sources, each query is sent to a wrapper, where it is translated into
the native query language of the corresponding source. Then the individual query results
are collected, the information is filtered appropriately and consolidated by the mediator,
and the combined result is presented to the user.
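The decomposition step just described can be sketched in Python. This is only a toy illustration under invented assumptions — the record fields, the `source_query` interface, and the function names are all hypothetical, not TSIMMIS APIs:

```python
# A source that supports only selection on "year": the mediator pushes
# the year condition down, then applies the unsupported "SIGMOD" filter
# itself. All names and record shapes are invented for illustration.

def source_query(records, year):
    # The only query form this source supports: selection on "year".
    return [r for r in records if r["year"] == year]

def mediator_answer(sources, conference, year):
    result = []
    for records in sources:
        retrieved = source_query(records, year)   # supported: pushed down
        # Unsupported filtering happens at the mediator.
        result.extend(r for r in retrieved if r["conference"] == conference)
    return result

src1 = [{"title": "A", "conference": "SIGMOD", "year": 97},
        {"title": "B", "conference": "VLDB", "year": 97}]
src2 = [{"title": "C", "conference": "SIGMOD", "year": 96}]
answer = mediator_answer([src1, src2], "SIGMOD", 97)
```

The plan embodied by `mediator_answer` is exactly a capability-respecting decomposition: each source receives only a query it can answer, and the residual work stays at the mediator.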
In order to be able to perform capability-based rewriting, the mediator needs formal
descriptions of the query capabilities of the information sources. A capability-based rewriter
takes as input these descriptions and the query, and it infers query plans for retrieving the
required data that are compatible with the source query capabilities. Solving the CBR
typically produces more than one candidate plan for the query.
The wrappers also need descriptions of the source capabilities in order to translate
the supported common-language queries into queries and commands understood by the
source interface. Conceptually, descriptions are associated with actions that perform the
translation, in the same style as Yacc [ASU87].
1.4.1 Problem definitions: CBR and query expressibility
Source query capabilities can be described in terms of the queries that the source supports: a
capability description is a finite encoding of the set of queries (in some query language) that
the source can answer.7 Therefore, the semantics of the description is a set of queries. A
query is described by the capability description if it belongs to that set. A query is expressible
at the information source, or, equivalently, is expressible by the capability description, if it
is equivalent to a query described by the source’s capability description.
The CBR problem can then be formulated as follows: Given a description of the sources’
capabilities, how can we answer a query using only queries expressible at the sources? The
(related, but simpler) query expressibility problem is as follows: Given a description of the
source’s capabilities and a query, is the query expressible at the source?
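In the degenerate case where the set of supported queries is finite and fully enumerated (see footnote 7), expressibility reduces to membership modulo query equivalence. The sketch below ignores equivalence reasoning and tests plain membership; the (attribute, operator) encoding of queries is an invented simplification:

```python
# Toy sketch of query expressibility for a finitely enumerated capability
# description: the description is the set of supported query forms, and a
# query is expressible iff it matches one of them. Real descriptions must
# also account for query equivalence, which this sketch omits.

description = {("year", "="), ("author", "=")}   # supported selection queries

def expressible(query):
    return query in description

supported = expressible(("year", "="))           # a described query
unsupported = expressible(("conference", "="))   # not described
```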
1.4.2 CBR and query rewriting using views
The CBR problem is strongly related to the problem of query rewriting using views [Lev;
Ull97], defined as follows: Given a query q accessing some database8 D and a set of views
V = {V1, . . . , Vn} over D, find rewriting queries. A rewriting query of q given V is a query
that accesses at least one view of V and returns the same result as q (for any D). If the
rewriting query uses views only (i.e., it does not directly access the database D), then it is
called a total rewriting query.
If the query capabilities of the information source are modelled as a set of queries, or
equivalently views, over the source contents, then the CBR problem is indeed a restatement
of the problem of query rewriting using views in the context of information integration.
Notice that the problem of answering queries using views [GM99; AD98] is related, but
different:9 Given a query q, a set of views and the view extensions, find the set of tuples
t such that t is in the answer to q for all the databases that are consistent with the view
extensions (i.e., the certain answers to q).
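The distinction can be made concrete with a brute-force sketch of certain answers over a tiny domain. The example is invented, and the exhaustive enumeration is exponential — it serves only to make the definition executable:

```python
from itertools import combinations

# Certain answers by brute force: enumerate every database over a tiny
# domain that is consistent with the given view extension, and keep the
# query answers common to all of them.

domain = [1, 2]

def all_databases():
    pairs = [(a, b) for a in domain for b in domain]
    for n in range(len(pairs) + 1):
        for db in combinations(pairs, n):
            yield set(db)

def view(db):       # V(x) :- R(x, y)
    return {a for (a, b) in db}

def query(db):      # q(x, y) :- R(x, y)
    return set(db)

view_extension = {1}
consistent = [db for db in all_databases() if view(db) == view_extension]
certain = set.intersection(*(query(db) for db in consistent))
# V(1) holds in every consistent database, yet no single R-tuple does,
# so the certain answers to q are empty.
```

The consistent databases here are {R(1,1)}, {R(1,2)}, and {R(1,1), R(1,2)}; their query answers share no tuple, which is precisely why answering queries using views differs from finding an equivalent rewriting.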
1.5 TSIMMIS for information integration
The information integration system developed in the TSIMMIS project follows the general
architecture of Figure 1.4. In good engineering tradition, the project emphasized the au-
tomation of wrappers’ and mediators’ development. In particular, we developed tools that
7For finite sets of supported queries, the source capabilities can be described by fully enumerating them.
8The database may be distributed over multiple sites.
9The distinction between these two problems is drawn in [CGLV99; CGLV00].
Figure 1.5: Declarative specification of wrappers and mediators
allow the implementation of wrappers and mediators from high level specifications of their
functionality, as shown in Figure 1.5.
One can develop a mediator by providing a declarative specification of the integrated
view, expressed in a semistructured language that is the common query and view definition
language of the integration system, to the generic mediator specification interpreter that
was developed in TSIMMIS. Similarly, one develops a wrapper by providing a wrapper
specification to the generic wrapper generator. The wrapper specification shows how queries
expressed in the common query language are translated into queries expressed in the native
query language of the underlying sources.
The TSIMMIS data model is OEM and it also uses a semistructured query and view
definition language. During run time, when the mediator receives a client query, it composes
the query with the mediator view definition. Then the mediator creates a plan that sends
queries to the wrappers. These queries are translated by the wrappers into native queries on
the underlying systems. The wrappers translate the source results into the OEM model and
they ship them to the mediator, where the plan combines them into the client query result.
Detailed end-to-end presentations of query processing in TSIMMIS can be found in [GM+97;
GMPVY; Pap97].
Figure 1.6: Mediator architecture
The TSIMMIS system uses parametrized views to describe source query capabilities.
The Capabilities-Based Rewriter uses the source capability descriptions to adapt to the
query capabilities of the sources (see Figure 1.6). The rewriting algorithm employed by the
CBR module of the mediator is discussed in Chapter 4.
Finally, a cost optimizer provides cost estimates. The TSIMMIS approach is based on
a loose coupling of the CBR with the optimizer. Systems and algorithms where a CBR
module and the optimizer are tightly coupled are described in [HKWY97] and [PGH98].
We are not concerned in this thesis with estimating the cost of the plans. Relevant work
can be found in [ACPS96; DKS92].
1.6 Thesis Overview
Chapter 2 discusses related work in the area of information integration. Chapter 3 discusses
the semistructured language DSL (DAG Specification Language) and presents query con-
tainment and query composition algorithms for it. DSL is a variant of the view and query
language used in the TSIMMIS system, and a DSL-based template language is also used
by TSIMMIS as a source capability description language. Chapter 4 presents a rewriting
algorithm for DSL, which is the first rewriting algorithm for a semistructured language. It
also presents an efficient rewriting heuristic, based on the general-purpose algorithm, that is
used in the TSIMMIS system. Chapter 5 discusses the query capability description language
p-Datalog (including expressibility results) and presents solutions to the CBR problem and
the query expressibility problem for p-Datalog and p-Datalog variants. Chapter 6 presents
algorithms for the CBR and query expressibility problems for the more powerful capabil-
ity description language RQDL. It also presents a reduction of RQDL to p-Datalog with
function symbols. Finally, Appendix A presents the TSIMMIS wrapper architecture.
Chapter 2
Related Work
This chapter summarizes related work on integrated querying of autonomous, heterogeneous
sources. Notice that detailed comparisons between more specific topics addressed in this
thesis and related work are found in the related work sections of the corresponding chapters.
Earlier work on database integration [A+91; K+93; BLN86; LMR90; T+90; Gup89;
FLNS88] focused on the integration of well-structured databases, with fixed schemas, that
support powerful query languages. A significant amount of work was devoted to the offline
integration of the database schemas, in order to produce a schema for the integrated
database. As described in the introduction, the assumptions made in much of this work are
increasingly invalid. This thesis focuses on technologies for integrating heterogeneous and
autonomous information sources.
Recently, a new generation of systems has focused on the integration of sources that
may not necessarily be structured databases. We briefly describe them here.
The TSIMMIS project has recently focused on various query optimization and modelling
issues in integrated querying, in addition to the work described in this thesis and earlier work
in semistructured models and languages and query processing [Pap97]. In particular, the
problem of automatic computation of mediator capabilities has been studied in [YLGMU99]
(compare with Section 6.2). Integrated querying over sources supporting disjunction is
studied in [GMLY99], while [YLUGM99] study the problem of choosing efficient integrated
query plans, and propose provably efficient heuristics. Finally, [AGMPY98] addresses the
issue of optimizing large fusion queries.
HERMES [S+] attempts to solve the integration problem by a mediator specification
language where literals explicitly specify the parameterized calls that are sent to the sources.
CHAPTER 2. RELATED WORK 16
Unfortunately, the HERMES solution reduces the interface between the integration system
and the sources to a limited set of explicitly listed parameterized calls.
Garlic [C+95; HKWY96; ROH99] focuses on integrating heterogeneous databases or
multimedia data stores that are autonomous but cooperative. That means that their data
have a fixed, known schema, either relational or object-oriented; that they provide a minimum
of database-like functionality, such as support for general-purpose or media-specific query
operators; and that they provide access to a wealth of metadata, such as the schemas, query
plans, and cost models for their operators. In that sense, Garlic represents the continuation
of research on federated databases [FGL+98]. The focus of Garlic is on effective and
efficient integrated query optimization, making the best use of the metadata and taking
advantage of the media-specific operators at the sources. Our work assumes a looser inte-
gration scenario: information sources are not assumed to be cooperative, nor are the source
data assumed to be strongly typed. Correspondingly, we place considerable emphasis on
describing source contents and capabilities using flexible descriptions that assume minimal
knowledge about source structure and content. The problem of discovering feasible query
plans, a necessary first step before query optimization, receives more attention in our work,
making our work complementary to Garlic's. Finally, the Garlic wrapper architecture is very similar
to the TSIMMIS wrapper architecture (see Appendix A) [RS97]. Mapping queries and op-
erators from the Garlic data model and language to the native data model language in the
wrappers is a much simpler operation, though, given that the Garlic data model is essentially
ODL [AHV95] and the mediated data sources are also databases.
The Tukwila data integration system [IFF+99] introduces the concept of adaptive query
optimization and execution: interleaving planning and execution with partial optimization
allows Tukwila to recover from query planning decisions based on inaccurate estimates.
Moreover, Tukwila proposes a new adaptive operator for join, the doubly pipelined join,
that is better suited to the requirements of integrated querying over autonomous informa-
tion sources. As with Garlic, the Tukwila work is complementary to ours: the feasible
query plans we produce can immediately benefit from more efficient and intelligent query
engines. Moreover, the Tukwila system was originally designed to process only relational
data.1
The MIX project [LPV00] follows in the footsteps of TSIMMIS, placing emphasis on
1Tukwila is currently (May 2000) being extended to process XML data [Tuk].
query processing in the mediator. The MIX mediator exports a virtual view XML document
into which the client navigates. Client navigation commands are translated online into
sequences of navigation commands and/or queries on the sources/wrappers. MIX also provides a
query-by-example user interface that is driven by the DTD of the virtual view [MP00].
In the SIMS project [ACHK93] and the follow-on project Ariadne [AAB+98] the under-
lying assumption is that for each application there is a unifying domain model that provides
a single ontology for the application. These projects focus on providing powerful knowledge
representation techniques (ontologies, description logics) to create expressive domain mod-
els for the application (see also [CGL+98b]). Each source model is described in terms of the
unifying domain model (the so-called local-as-view approach). Because of the power and
complexity of the knowledge representation techniques used, the use of powerful query lan-
guages makes the problem of integrated query planning intractable. Thus SIMS and Ariadne
have focused on limited query languages (the join operator is not supported), as well as on
query planning heuristics rooted in the AI planning literature [AK97]. Other recent work in
query planning heuristics inspired by the AI planning literature includes [KW96; FW97;
AKL97; FLM99]. In contrast, in our work the emphasis is on sound and complete query
processing algorithms, and on description languages that, while expressive, are tractable.
The Information Manifold [LRO96] and Infomaster [GKD97] projects follow the same
local-as-view approach as SIMS and Ariadne. Both projects use expressive yet tractable
description languages for the universal domain model: a simple description logic for the
Information Manifold and Datalog for Infomaster.2 The Information Manifold also ad-
dresses the issue of dealing with sources with limited capabilities by modeling the source
capabilities using capability records. At the core of the Infomaster query planner is a novel,
computationally cheap algorithm for rewriting queries using views [DG97; DL97]. An in-
teresting comparison of the assumptions and relative strengths of the Information Manifold
and TSIMMIS can be found in [Ull97].
The Distributed Information Search Component DISCO [TRV98] describes the capa-
bilities of the sources using context-free grammars appropriately augmented with actions.
DISCO enumerates plans initially ignoring limited wrapper capabilities. It then checks
the queries that appear in the plans against the wrapper grammars and rejects the plans
containing unsupported queries.
2Infomaster initially used KIF [G+92], a very general knowledge representation language.
UnQL (Unstructured Query Language) [BDHS96] was among the first languages and
systems (together with TSIMMIS) for semistructured data. UnQL uses a graph-based data
model very similar to OEM and XML. UnQL is a functional language that uses a structural
recursion paradigm. The emphasis in UnQL was on developing mathematical constructs for
querying semistructured data. The structural recursion paradigm of UnQL is very similar
to XSL (XML stylesheet language) [Adl].
The FLORID system [LHL+98] is a deductive, object-oriented system for managing and
integrating semistructured data, based on F-logic [KL89]. Semistructured data are modeled
using an object-oriented data model and queried using F-logic, an object-oriented logic
language that supports object identity through Skolem functions, a concept that proved
useful for semistructured languages for integration (see also Chapter 3). FLORID was used
to define and query the structure and content of Web sites declaratively.
The Strudel system [FFK+98] has applied concepts from information integration to
the task of building complex Web sites that serve information derived from multiple data
sources. Strudel separates site content from structure: a Web site is a declaratively-defined
site graph over the semistructured data graph of the contents of the information sources. If
we only have access to the information through the Web site(s), queries asked over the data
graph need to be rewritten as queries over the Web site structure and contents. The Web site
definitions are just view definitions over the data graph. The system is declaratively defined
using the StruQL language [FFLS97]. StruQL has also been a vehicle for the investigation
of theoretical problems related to semistructured languages, such as query containment in
the presence of regular path expressions and intensional3 constraint checking.
Tiramisu [ALW99] is a follow-on project to Strudel that separates the implementation
of the site from the design of the site, and supports a top-down development of the site.
Tiramisu allows quick integration with external implementation tools that create web con-
tent. Tiramisu also allows a web master to graphically view the web site as a graph of web
content connected together by hypertext and inclusion links.
The WEAVE system [FLSY99] extends the Strudel work by allowing compile-time pre-
computation and other optimizations to improve querying and navigating the site graph.
The COIN system [SBGJ+97] focuses primarily on issues of semantic integration, namely
how to use limited contextual information to automatically identify and resolve differences
3Using only the constraints and the view definitions.
in measurement units, terminology or schema definitions. These issues are orthogonal to
the issues discussed in this thesis.
More optimization issues There has been some preliminary work on utilizing source
overlap and redundancy information for query optimization in a mediation environment
[FKL97; VP98; DL99]. [FLMS99] offers an analysis of the search space for query opti-
mization in the presence of sources with limited capabilities and describes a heuristic for
capability-sensitive query optimization as well as experimental evidence of its performance.
Theoretical work A lot of theoretical work has been done on the problems of rewriting
queries using views and answering queries using views, and the related problems of query
containment in the presence of views, mainly for relational languages [LMSS95; LRU96;
RSU95; VP00; VP97; DG97; DL97; AD98; Qia96; MLF00; PL00; GM99], for description
and higher-order logics [CGL98a; CGL99; BLR97], and recently also for languages with
regular expressions [CGLV99; CGLV00]. As we discussed in Chapter 1, these problems are
at the core of query planning algorithms for integrated querying systems. Some of this work
is discussed in more detail in Chapters 3, 4, 5 and 6. The first paper on rewriting queries
using views for a semistructured language is [PV99]. A comprehensive survey on this topic
is [Lev].
The importance and relevance of research in technology for integrated querying of au-
tonomous, heterogeneous sources is highlighted by the arrival of commercial products in-
corporating results of this research, as well as of companies further developing this tech-
nology. Products include IBM’s DataJoiner [GL94], the DataLinks enhancements to IBM’s
DB2 UDB [PWDN99], and Microsoft's OLE DB [Bla96]. Companies include Junglee
[GHR98], Mergent [Mer] and Cadabra [Cad], based on technology developed in the Info-
master project, Nimble [Nim], based on research in the Tukwila project, Fetch Technologies
[Fet], founded around technology developed in the SIMS and Ariadne projects at ISI, and
Enosys Markets [Eno], based on technology developed in the TSIMMIS and MIX projects.
Chapter 3
DSL: A Language for
Semistructured Data
3.1 Introduction
In this chapter we present DSL (DAG Specification Language), a language for semistruc-
tured data. DSL uses the OEM data model and is an object-oriented, rule-based query
and view definition language in the spirit of Datalog [Ull89]. DSL is a variant of the Me-
diator Specification Language [PGMU96] and, like MSL, it is especially well-suited to the
definition of integrated views over semistructured, heterogeneous sources; the definition of
such views is an essential component of our integrated querying architecture, as explained
in Chapter 1.
The distinguishing characteristics of DSL are its ability to manipulate semistructured
data and its effectiveness in information fusion. In particular, DSL has most of the features
identified as important for semistructured languages in recent literature (e.g., in [FSW99]),
such as support for paths, object nesting, label variables, and restructuring capabilities. As
we will see shortly, DSL rule syntax is similar to XML. Finally, DSL supports object identity
through the definition of semantic object-ids using Skolem functions [End72], in the spirit of
F-Logic [KL89] and ILOG [HY90]. Semantic object-ids enable fusion of heterogeneous query
results, as explained in [PAGM96]. Together, these characteristics make DSL a candidate
for an XML query language, like Lorel [AQM+97] or XML-QL [DFF+].
The purpose of this chapter is not to present a new semistructured query language,
since DSL is a variant of MSL. Its purpose is to describe the formal semantics of DSL
CHAPTER 3. DSL: A LANGUAGE FOR SEMISTRUCTURED DATA 21
and to identify the syntactic and semantic tradeoffs that had to be made (compared to
other semistructured languages) to enable us to discover solutions to the problems of query
composition and query equivalence for DSL, without significantly affecting the power of
DSL. We also present these solutions, namely query composition and query equivalence
algorithms for DSL.
We start by presenting the syntax and semantics of DSL. We present query composition
and query equivalence algorithms for DSL in Sections 3.3 and 3.4 respectively. Finally, in
Section 3.5, we discuss other semistructured languages.
3.2 The DAG Specification Language
Queries over OEM data need to impose conditions on semistructured data graphs, as well
as to output new OEM data as results. Therefore, a query language over OEM data needs
to have an object-selection component as well as a transformation component that allows
creation of new OEM objects and graphs. To express queries and view definitions over
OEM data, we use the DAG Specification Language (DSL).
3.2.1 Syntax
A DSL rule is a rule of the form head :- body in the style of Datalog [Ull88]. Intuitively, the
head describes the result objects in the answer graph, whereas the body is a conjunction of
one or more conditions that must be satisfied by the source objects. The head and the body
conditions are based on object patterns of the form <object-id label value>. If the object id is
omitted from a rule head object pattern, it is assumed to be a unique constant. If the object
id is omitted from a body object pattern, it is assumed to be a fresh variable. The object-id
is a term: a variable, an atomic constant, or a function symbol followed by a list of terms.
The label can be a variable or a constant and the value can be either a variable, an atomic
constant, or a set value pattern that contains zero or more object patterns. Section 3.2.3
discusses the syntactic restrictions placed on DSL rules.
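The structure of object patterns can be sketched as follows. The Python classes are an invented representation for illustration — they are not part of DSL or TSIMMIS:

```python
from dataclasses import dataclass

# A sketch of the abstract syntax of DSL object patterns
# <object-id label value>. A value is a variable, an atomic constant,
# or a list of nested object patterns; an object id is a variable, a
# constant, or (in rule heads) a Skolem term.

@dataclass
class Var:
    name: str

@dataclass
class ObjectPattern:
    oid: object      # variable, constant, or Skolem term
    label: object    # variable or constant
    value: object    # variable, atomic constant, or list of ObjectPattern

# The pattern <C conference {<N name "SIGMOD">}> from the examples:
pattern = ObjectPattern(
    oid=Var("C"),
    label="conference",
    value=[ObjectPattern(Var("N"), "name", "SIGMOD")],
)
```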
For example, the following query returns information about SIGMOD conferences held
after 1992.1
1The information about the unit of measure for year is not present in the query. Specification and integration of the semantics of data is of crucial importance for information integration, but it is an issue orthogonal to the issues we are examining. For more information on semantic interoperability, see [SBGJ+97].
(Q1) <cnf(C) sigmod {<f(X) Y Z>}> :-
<D dblp {<A article {<C conference {<N name "SIGMOD"> <year W> <X Y Z>}>}>}>
AND (W > 1992)
The head of the query consists of one object pattern, whereas the body of the query is
a conjunction of one or more
1. object patterns, tagged with their originating information source (such as db), and
2. external or built-in predicates applied to variables or constants, such as Y > 1992,
or isatomic(A), where isatomic is a predicate for testing if variable A has an atomic
value.
A DSL program is a collection of DSL rules. DSL queries and DSL view definitions are
DSL programs. If a query (or a view definition) happens to consist of a single rule, then
query (or view definition) will be used interchangeably with rule. Moreover, we will use
view and view definition interchangeably in the remainder of this thesis.
3.2.2 Semantics
DSL rules have minimal model semantics. We illustrate the semantics with the following
example, which is a simplification of (Q1).
(Q2) <cnf(C) sigmod {<f(X) Y Z>}> :-
<D dblp {<A article {<C conference {<N name "SIGMOD"> <X Y Z>}>}>}>
The semantics of the above query are as follows:
if there is a tuple of bindings d, a, c, n, x, y, and z for the variables D, A, C, N, X, Y,
and Z such that
- the data source contains a top-level (root) object with label dblp identified by d,
- the d object has an article subobject with object id a,
- the a object has a conference subobject with object id c
  (the object a may also have subobjects other than the c object),
- the c object has a name subobject with value "SIGMOD" and object id n,
- the c object has a y subobject with value z and object id x,
Figure 3.1: Result of (Q2) on database of Figure 1.1
then the query result has
- a sigmod object whose object-id is the term cnf(c), and
- the object with object id cnf(c) has a y subobject with value z and object id f(x)
  (the object cnf(c) may have subobjects other than y,
  because the result of another rule may “fuse” more subobjects into the object cnf(c)).
Note that z could be a subgraph of the data in the source. The answer to the query is
a graph consisting of new objects with fresh, unique object ids and the structure denoted
by the query head. The bindings of the variables “fill in” this explicitly created structure.
The result of applying (Q2) to the database of Figure 1.1 is shown in Figure 3.1.
Formally, for an OEM database D, let PD be the set of all subgraphs2 of D, O be the
set of all object ids in D, and C be the set of all labels and atomic values. Let VO be the
set of all object id variables3 and VC be the set of all other (label and value) variables in
the body of the rule, with VO ∩ VC = ∅. Let V = VO ∪ VC be the set of all variables. The
meaning of the rule body is the set of assignments θ : V → O ∪ C ∪ PD that satisfy all
conditions in the body. Each assignment maps object id variables to O, label variables to
C, and value variables to C ∪ PD.
The meaning of the rule head is defined as follows. We create and label the new
nodes of the answer graph, by instantiating the object id and label fields of the query
head, and we make the objects resulting from the instantiation of the top-level object pat-
tern of the query the roots of the answer graph. In particular, for each object pattern
<f(X1, . . . , Xm) L V> in the query head, and for each assignment θ above, create a new
object with object id f(θ(X1), . . . , θ(Xm)), label θ(L) and value θ(V ). If instead of V , the
object pattern above has {objpattern1, . . . , objpatternn}, the value of the created object is
{θ(objpattern1), . . . , θ(objpatternn)}.
Notice that when two assignments produce the same term as the object id of two objects,
one object is created, and the values of the two objects are “fused,” which means that the
set of outgoing edges from the “fused” object is the union of the sets of outgoing edges from
2Remember that the value of a set object is the OEM subgraph rooted at that object.
3Object id variables are variables appearing in the object id field of object patterns in the bodies of rules.
the two objects. Moreover, the incoming edges to any of the two objects will point to the
one fused object.
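Fusion by shared object ids can be sketched as follows. The edge representation below is an invented simplification (an edge as a (label, value) pair), not the OEM implementation:

```python
from collections import defaultdict

# Object "fusion": head instantiations that produce the same Skolem-term
# object id yield a single result object whose set of outgoing edges is
# the union of the edges produced by each instantiation.

instantiations = [
    ("cnf(c1)", ("year", 1997)),
    ("cnf(c1)", ("location", "Tucson")),   # same oid: fused with the above
    ("cnf(c2)", ("year", 1998)),
]

objects = defaultdict(set)
for oid, edge in instantiations:
    objects[oid].add(edge)
# objects["cnf(c1)"] now carries both edges; cnf(c2) remains separate.
```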
Space of function symbols and object ids Since object ids need to be unique, and
to avoid accidental “fusion” between objects in the query result and the OEM database,
function symbols appearing in the head of a DSL query belong to a different sort4 than
function symbols appearing in the body of the query or in the input database.
A direct consequence of using fresh function symbols in the query result is shown in
Lemma 3.2.1 below. Let us first give the definition of a template of a term.
Definition: For a term or atom T, the template of T, denoted temp(T), is the character
string obtained from T by replacing the i-th variable occurrence in T by the new variable
Vi. For example, temp(f(X, g(X, Y))) = f(V1, g(V2, V3)). □
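The template computation can be sketched on a string representation of terms. Treating capitalized identifiers as variables is an assumption borrowed from Datalog convention, and the regex-based parsing is an invented shortcut:

```python
import re

# Compute the template of a term: replace the i-th variable occurrence
# (here, any capitalized identifier) with the fresh variable Vi.

def template(term: str) -> str:
    counter = [0]
    def fresh(_match):
        counter[0] += 1
        return f"V{counter[0]}"
    return re.sub(r"\b[A-Z]\w*\b", fresh, term)
```

For instance, `template("f(X, g(X, Y))")` yields `"f(V1, g(V2, V3))"`, matching the definition above.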
Lemma 3.2.1 Two terms T1, T2 appearing in the head of a DSL query can be unified
[GN88] by a satisfying variable assignment θ only if they have the same template.
Proof: If T1 and T2 do not have the same template, then, when parsing them left to
right, we will eventually reach a position where, in one of them we encounter some variable
X, and in the other one we encounter some term, for example (without loss of generality)
f(Y1, . . . , Yn). The variable X can only be unified with f(Y1, . . . , Yn) with a unifier that
maps X to an appropriate f-term. But, as we have explained, the space of function symbols,
such as f, used in the head of the query is disjoint from the space of function symbols in
the database; therefore a satisfying assignment θ cannot map X to an f-term. □
Safe DSL rules A DSL rule is safe if every variable appearing in the rule head also
appears in the rule body. Thus, the same simple syntactic condition that is used by [Ull88]
to define safety of conjunctive queries can be used to define safety in DSL. In the remainder
of this thesis we discuss only safe DSL rules.
3.2.3 Syntactic restrictions on DSL rules
DSL rules have to obey some simple syntactic restrictions. In particular, cyclical object
conditions are not allowed in the query body, and object id invention is limited to con-
structing DAGs of OEM objects. In what follows, we first describe formally the restriction
4In other words, they are “picked” from a disjoint set of symbols.
(a) Body graph of (R3)
(b) Body graph of (R4)
Figure 3.2: Body graph examples
imposed on query bodies and then the restriction on object id invention.
Restricting rule conditions
In order to define the restriction on query bodies, we first define the body graph bgraph(R)
of a rule R.
Definition: For DSL rule R, the body graph of R, denoted bgraph(R), is a labeled graph
(N,E), where N is the set of object ids appearing in the object patterns in the body of R,
each node labeled by the label in its corresponding object pattern, and E is the set of edges
naturally defined by the nesting of object patterns in the body of R. □
Example 3.2.2 The body graph of the following DSL rule (R3) is shown in Figure 3.2(a).
(R3) <bk(T) book {<f(X) Y Z>}> :-
<D dblp {<B book {<T author A> <X Y {<B book Z>}>}>}>
AND <R report {<T author {<N fn F>}>}>
□
We can now state formally the restriction imposed on the bodies of rules: every legal
DSL rule R has an acyclic body graph. Checking acyclicity of bgraph(R) is linear in the size
of R. Given the above restriction, rule (R3) above is not legal DSL. In contrast, rule (R4)
is legal DSL because its body graph is acyclic, as shown in Figure 3.2(b).
(R4) <bk(T) book {<f(X) Y Z>}> :-
<D dblp {<B book {<T author A> <X Y Z>}>}> AND <R report {<T author {<N fn F>}>}>
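The acyclicity check on a body graph can be sketched as a depth-first cycle detection over the nesting edges, which runs in time linear in the size of the rule. The adjacency-dict encoding and the node names below are illustrative assumptions, not the dissertation's data structures.

```python
# Sketch: linear-time acyclicity test for a body graph, encoded as an
# adjacency dict mapping each object id to the ids nested inside it.

def is_acyclic(graph):
    state = {}                            # node -> "active" (on stack) | "done"
    def visit(n):
        if state.get(n) == "done":
            return True
        if state.get(n) == "active":      # back edge: a cycle exists
            return False
        state[n] = "active"
        ok = all(visit(m) for m in graph.get(n, []))
        state[n] = "done"
        return ok
    return all(visit(n) for n in list(graph))

# bgraph of (R4): D -> B -> {T, X}, R -> T -> N; no cycle.
bgraph_r4 = {"D": ["B"], "B": ["T", "X"], "R": ["T"], "T": ["N"]}
# bgraph of (R3): the nested <B book Z> pattern adds the edge X -> B,
# closing the cycle B -> X -> B, so (R3) is not legal DSL.
bgraph_r3 = {"D": ["B"], "B": ["T", "X"], "X": ["B"], "R": ["T"], "T": ["N"]}
```

Running the test on the two graphs classifies (R4) as legal and (R3) as illegal, matching Figure 3.2.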
Figure 3.3: Head graph of (P5)
Restricting object id invention
In general, a DSL query constructs answer objects that are restructurings of source data.
Using terms in the object id field of the head object pattern allows semantic object id
invention; the function symbols used are known in the literature as Skolem functions. The
mechanism of inventing object ids using Skolem functions has been proposed in [Mai86;
HY90; KL89]. This object id invention mechanism is very powerful; namely, it allows the
construction of arbitrary graphs in the query output.
The unbridled power of object id invention significantly complicates the composition of
DSL queries, as will be explained in Section 3.3. Moreover, the full power of the object
id invention mechanism is not essential for most applications. That is why DSL imposes
a simple syntactic restriction on invented object ids. In brief, we require that the use of
Skolem terms does not create cycles in the query result; in other words, we require that the
structure created by the query head is acyclic. The formal definition follows. Let us first
give an additional definition to facilitate the rest of the presentation.
Definition: For DSL program P , the head graph of P , denoted hgraph(P ), is a graph
(N,E), where N is the set of templates for the object id terms appearing in the heads of
rules of P , and E is the set of edges naturally defined by the nesting of object patterns in
the heads of rules of P . 2
Example 3.2.3 The head graph of the following DSL program (P5) is shown in Figure 3.3.
(P5) <bk(T) book {<f(X) Y Z>}> :- <dblp {<book {<title T> <X Y Z>}>}>
<f(X) review {<g(A) reviewer C>}> :-
<nyrb {<X review {<A writtenby C>}>}>
2
We can now state formally the restriction on the use of Skolem functions to invent object
ids: For every legal DSL program P , the head graph of P is acyclic. Checking acyclicity of
hgraph(P ) is linear in the size of the heads of the rules of P . Given this restriction, we can
show the following:
Figure 3.4: OEM database and result for Example 3.2.5. (a) OEM database D; (b) (Q6)(D).
Theorem 3.2.4 A DSL query constructs answer objects that form a DAG.
Proof: A DSL query Q constructs new objects through object id invention. New edges are
constructed when the query head specifies that a newly constructed object o1 is a subobject
of constructed object o2. A cycle can be created only when newly constructed objects have
the same object id and therefore are fused. The only objects that can be fused are those
whose object ids have the same term templates (from Lemma 3.2.1). Therefore, a cycle in
the constructed answer objects of Q corresponds to a cycle in hgraph(Q). Since hgraph(Q)
is acyclic, the constructed objects in the result of Q cannot have cycles, which proves the
theorem. 2
We refer to the result of a DSL query as an answer DAG. Notice that the result of
Q can include cycles, since it can include copied subgraphs from the input database (cf.
Section 3.2.2), as the following example shows.
Example 3.2.5 The result of the following query on the OEM database D of Figure 3.4(a)
is given in Figure 3.4(b).
(Q6) <f(X) new Z> :- <a {<X b Z>}>
2
The generalization of the semantics as defined above to a DSL program (i.e., a collection
of DSL rules) is straightforward.5
3.2.4 Expressive power and complexity
DSL is more expressive than conjunctive queries. In particular, the distinctive features
of DSL, namely function symbols and copying semantics for value variables, give DSL
5Notice that DSL does not support recursion.
the power of Datalog with limited forms of recursion. Specifically, DSL queries can be
expressed with linear Datalog programs, that is, Datalog with only linear recursion [AHV95;
Ull89], as the following theorem proves.
Theorem 3.2.6 DSL is strictly less expressive than StruQL [FFLS97] and linear Datalog.
Proof: Let L-DATALOG be the set of queries expressible by linear Datalog pro-
grams, and let TC-DATALOG be the set of queries expressible by Datalog programs
with only transitive closure recursive rules. StruQL is a semistructured language described
in [FFLS97], where it is also stated to be equivalent in expressive power to TC-DATALOG.
It is shown in [CM90] that TC-DATALOG = L-DATALOG. StruQL allows regular path expressions in
the query body. Regular path expressions are not expressible in DSL. To see this, notice
that DSL rule bodies can only enforce conditions on objects appearing at a fixed depth in
the input database, whereas, using regular path expressions, conditions can be applied to
objects at arbitrary depth. On the other hand, StruQL includes all the constructs found in
DSL, including, importantly, Skolem functions for defining object ids. 2
We know from [CM90] that L-DATALOG ⊂ QNLOGSPACE. An immediate conse-
quence of the theorem above is the following:
Corollary 3.2.7 DSL queries are in QNLOGSPACE.
3.2.5 Normal forms of DSL rules
To simplify the presentation of query composition in Section 3.3 and query rewriting in the
next chapter, we define normal form rules and full normal form programs. Every DSL rule
and program can be easily converted into normal form or full normal form, hence the focus
on normal forms does not limit the power of the language. First, let us define single path
object conditions:
Definition: A single path object condition is an object condition in which all the set-
valued value fields contain at most one object pattern. 2
We next define the notion of correspondence between a complete path and a single path
object condition in a rule or body graph.6 We make the definition through the following
example.
6A complete path is a path from a root to a leaf in the graph.
Example 3.2.8 Let us consider again DSL rule (R4), whose body graph is shown in Fig-
ure 3.2(b). The path
(D, dblp) → (B, book) → (X, Y)
corresponds to object condition
<D dblp {<B book {<X Y Z>}>}>
in the body of (R4) and vice versa.
Let us also consider again (P5), whose head graph appears in Figure 3.3. If we consider
the first rule of this program separately, its head graph contains only the complete path
bk(V) → f(V). This path corresponds to the object condition
<bk(T) book {<f(X) Y Z>}>
in the head of the rule. The path bk(V) → f(V) → g(V) in the head graph of (P5)
corresponds to the object condition
<bk(T) book {<f(X) review {<g(A) reviewer C>}>}>
Notice that in this case the object condition does not appear in the head of any of the rules
of (P5); instead, it is created by unifying an object condition from the first rule of (P5)
with an object condition from the second rule. The intuition is that the path
bk(V) → f(V) → g(V)
is created through fusion of objects returned by the first rule of (P5) with objects returned
by the second rule. 2
Finally, let us define the notion of correspondence between a complete path in a head
graph and a rule, again through an example.
Example 3.2.9 Let us consider the first rule of (P5), whose head graph contains only one
complete path, bk(V) → f(V). The rule corresponding to this path is
<bk(T) book {<f(X) Y Z>}> :- <dblp {<book {<title T> <X Y Z>}>}>
If we consider now both rules of (P5), as explained in Example 3.2.8, the path
bk(V) → f(V) → g(V)
corresponds to the object condition
<bk(T) book {<f(X) review {<g(A) reviewer C>}>}>
which is created by appropriately unifying object conditions in the heads of the two rules.
The rule corresponding to this path is
<bk(T) book {<f(X) review {<g(A) reviewer C>}>}> :-
<D dblp {<B book {<title T>}>}> AND <D dblp {<B book {<X Y Z>}>}> AND <nyrb {<X review {<A writtenby C>}>}>
that is created as follows:
• The head of the rule is the object condition corresponding to the path.
• The body of the rule is the conjunction of the bodies of the rules whose heads “con-
tribute” to the creation of the object condition that corresponds to the path.7
2
Definition: A normal form DSL rule is a DSL rule whose body is a conjunction of
single path object conditions. Additionally, a normal form rule with just one condition in
its body is called a single path rule. 2
The query (Q2) can be easily transformed into the following normal form query:
(Q7) <cnf(C) sigmod {<f(X) Y Z>}> :-
<D dblp {<A article {<C conference {<N name "SIGMOD">}>}>}> AND <D dblp {<A article {<C conference {<X Y Z>}>}>}>
Normalization makes all paths present in the body graph of a rule into separate object
conditions. In particular, to transform a query Q into normal form, it suffices to replace
the query body of Q with a conjunction of the single path object conditions corresponding
to the complete paths of bgraph(Q).
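The path-enumeration step of normalization can be sketched as a walk over the nested object patterns; the (id, label, children) tuple encoding below is an assumption made for illustration.

```python
# Sketch: enumerate the complete (root-to-leaf) paths of a rule body,
# one per single path object condition of the normalized rule.
# Patterns are (object_id, label, children) tuples; children == [] at leaves.

def complete_paths(pattern):
    oid, label, children = pattern
    if not children:                      # leaf: a path of length one
        return [[(oid, label)]]
    return [[(oid, label)] + rest
            for child in children
            for rest in complete_paths(child)]

# A body graph shaped like that of (Q7):
# dblp -> article -> conference -> {name, <X Y>}
body = ("D", "dblp",
        [("A", "article",
          [("C", "conference",
            [("N", "name", []), ("X", "Y", [])])])])
paths = complete_paths(body)   # two complete paths, hence two conditions
```

Each returned path corresponds to one single path object condition in the conjunction that replaces the original body.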
7Variables in the bodies may need to be appropriately renamed to avoid accidental variable equalization.
Definition: A full normal form DSL program is a DSL program P consisting of normal
form DSL rules, such that
• All set-valued value fields in the heads of the rules contain at most one object pattern,
and
• Every path present in the head graph hgraph(P ) of the program is present in the
head of some rule of P .
2
The following example explains in detail the translation of a DSL program into full
normal form.
Example 3.2.10 The DSL program (P5), which is repeated below, can be transformed
into the full normal form program (P8) as follows.
• Consider the first rule of (P5) alone. Its head graph has only one complete path,
bk(V) → f(V), which corresponds to rule
<bk(T) book {<f(X) Y Z>}> :- <dblp {<book {<title T> <X Y Z>}>}>
therefore that rule (which is the first rule in (P5)) is added to (P8).
• Consider the second rule of (P5) alone and similarly add it to (P8).
• Consider both rules of (P5) together. The head graph again has only one complete
path, bk(V) → f(V) → g(V) that, as explained in Example 3.2.9, corresponds to rule
<bk(T) book {<f(X) review {<g(A) reviewer C>}>}> :-
<D dblp {<B book {<title T>}>}> AND <D dblp {<B book {<X Y Z>}>}> AND <nyrb {<X review {<A writtenby C>}>}>
We add this rule to (P8).
(P5) <bk(T) book {<f(X) Y Z>}> :- <dblp {<book {<title T> <X Y Z>}>}>
<f(X) review {<g(A) reviewer C>}> :-
<nyrb {<X review {<A writtenby C>}>}>
(P8) <bk(T) book {<f(X) Y Z>}> :-
<D dblp {<B book {<title T>}>}> AND <D dblp {<B book {<X Y Z>}>}>
<f(X) review {<g(A) reviewer C>}> :-
<nyrb {<X review {<A writtenby C>}>}>
<bk(T) book {<f(X) review {<g(A) reviewer C>}>}> :-
<D dblp {<B book {<title T>}>}> AND <D dblp {<B book {<X Y Z>}>}> AND <nyrb {<X review {<A writtenby C>}>}>
In the case of programs consisting of n rules, all subsets of rules of size k = 1, . . . , n need
to be considered.8 2
Because of object fusion, databases that are answers to DSL queries can contain paths
that are not explicitly present in the rule heads of the program (but are present in the head
graph). Full normalization makes these paths explicit by adding rules to the program that
create these paths. Therefore, full normalization extends normalization of rules to programs,
by making all paths in the head graph of a program into separate object conditions.
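The naive full-normalization driver enumerates every nonempty subset of rules, as in Example 3.2.10. The sketch below shows only this enumeration; the unification of rule heads for each subset is elided, and the rule values are placeholders.

```python
# Sketch: the subset enumeration behind naive full normalization.
# For a program of n rules, all 2^n - 1 nonempty subsets are considered.
from itertools import combinations

def rule_subsets(rules):
    for k in range(1, len(rules) + 1):
        for subset in combinations(rules, k):
            yield subset

p5 = ["rule 1 of (P5)", "rule 2 of (P5)"]
subsets = list(rule_subsets(p5))
# three subsets, matching the three bullets of Example 3.2.10
```

The exponential number of subsets is what the footnoted optimization prunes, by skipping subsets whose head graph adds no new complete path.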
3.3 Query composition for DSL
The composition of DSL queries Q and V is a query Qc = V ◦ Q, such that for any OEM
database D, Qc(D) = Q(V (D)). Query composition is accomplished by resolving each
condition in the body of Q with the head of V in all possible ways, using unification (which
generalizes [GN88; Ull88]).9 Query composition is easily generalized to multiple queries
V1, . . . , Vn.
Let us look at the following two detailed examples:
Example 3.3.1 Let us consider the following query:
(Q9) <f(P) ans {<g(D) m V>}> :- <P p {<A l V>}> AND <Q q {<D m V>}>
8A simple optimization of the naive full normalization algorithm would consider only subsets of rules whose head graph includes complete paths not considered already.
9Unification and resolution for DSL is the same as for MSL and is presented formally in Section 5.4.2 of [Pap97].
and two views
(V1) <h(A′,B) p {<i(A′) l V′> <j(B) Y "abc">}> :-
<X label1 {<A′ label2 V′> <B Y W> <C label3 T>}>
(V2) <k(F) L E> :- <G l {<F L E>}>
The reduction of (V1) to full normal form gives
(V3) <h(A′,B) p {<i(A′) l V′> }> :-
<X label1 {<A′ label2 V′>}> AND <X label1 {<B Y W>}> AND <X label1 {<C label3 T>}>
<h(A′,B) p {<j(B) Y "abc">}> :-
<X label1 {<A′ label2 V′>}> AND <X label1 {<B Y W>}> AND <X label1 {<C label3 T>}>
There exist two unifiers for the first condition of (Q9) and the heads of (V3):
θ1 = [P ↦ h(A′, B), A ↦ i(A′), V ↦ V′]
θ2 = [P ↦ h(A′, B), A ↦ j(B), Y ↦ l, V ↦ "abc"]
There exists one unifier for the first condition of (Q9) and the head of (V2):
θ3 = [P ↦ k(F), L ↦ p, E ↦ {<A l V>}]
The existence of three unifiers means that the result of resolving the first condition of
(Q9) with the views gives three DSL queries:
(Q10) <f(h(A′,B)) ans {<g(D) m V′>}> :-
<X label1 {<A′ label2 V′>}> AND <X label1 {<B Y W>}> AND <X label1 {<C label3 T>}> AND <Q q {<D m V′>}>
(Q11) <f(h(A′,B)) ans {<g(D) m "abc">}> :-
<X label1 {<A′ label2 V′>}> AND <X label1 {<B l W>}> AND <X label1 {<C label3 T>}> AND <Q q {<D m "abc">}>
(Q12) <k(F) ans {<g(D) m V>}> :-
<G l {<F p {<A l V>}>}> AND <Q q {<D m V>}>
For the second condition of each one of (Q10,Q11,Q12), there exists one unifier with (V2):
θ4 = [Q ↦ k(F), L ↦ q, E ↦ {<D m V′>}]
θ5 = [Q ↦ k(F), L ↦ q, E ↦ {<D m "abc">}]
θ6 = [Q ↦ k(F), L ↦ q, E ↦ {<D m V>}]
respectively. Therefore, Qc consists of 3 DSL rules:
(Q13) <f(h(A′,B)) ans {<g(D) m V′>}> :-
<X label1 {<A′ label2 V′> <B Y W> <C label3 T>}> AND <G l {<F q {<D m V′>}>}>
(Q14) <f(h(A′,B)) ans {<g(D) m "abc">}> :-
<X label1 {<A′ label2 V′> <B l W> <C label3 T>}> AND <G l {<F q {<D m "abc">}>}>
(Q15) <k(F) ans {<g(D) m V>}> :-
<G l {<F p {<A l V>}>}> AND <G l {<F q {<D m V>}>}>
Of those, (Q14) is subsumed by (Q13), since every object constructed by (Q14) is also
constructed by (Q13). So Qc consists of (Q13) and (Q15). 2
Example 3.3.2 Consider the following view definition:
(V4.1) <trep(RN1) tr {<TID title T>}> :-
<Ro1 r {<RNo1 rn RN1> <TID title T>}>
(V4.2) <tp(FN,LN) pr {<n(FN,LN) name N>
<w(FN,LN) work {<trep(A) tr {<Sid subject S>}>}>}> :-
<P person {<FN1 fn FN> <LN1 ln LN>
<A article {<Sid subject S>}>}> AND mergename(N,FN,LN)
View rules (V4.1) and (V4.2) contribute information to tr and pr objects. The full
normal form reduction of the view is given below:
(V5.1) <trep(RN1) tr {<TID title T>}> :-
<Ro1 r {<RNo1 rn RN1>}> AND <Ro1 r {<TID title T>}>
(V5.2) <tp(FN,LN) pr {<n(FN,LN) name N>}> :-
<P person {<FN1 fn FN>}> AND <P person {<LN1 ln LN>}> AND <P person {<A article {<Sid subject S>}>}> AND mergename(N,FN,LN)
(V5.3) <tp(FN,LN) pr {<w(FN,LN) work {<trep(A) tr {<Sid subject S>}>}>}> :-
<P person {<FN1 fn FN>}> AND <P person {<LN1 ln LN>}> AND <P person {<A article {<Sid subject S>}>}> AND mergename(N,FN,LN)
(V5.4) <tp(FN,LN) pr {<w(FN,LN) work {<trep(A) tr {<TID title T>}>}>}> :-
<P person {<FN1 fn FN>}> AND <P person {<LN1 ln LN>}> AND <P person {<A article {<Sid subject S>}>}> AND mergename(N,FN,LN)
AND <Ro1 r {<RNo1 rn A> <TID title T>}>
Let us now consider the query (Q16) that asks for all titles of John Smith’s reports.
(Q16) <f(P1,TID1) titles {<g(TID1) title T1>}> :-
<P1 pr {<N1 name "John Smith">}> AND
<P1 pr {<W work {<TR1 tr {<TID1 title T1>}>}>}>
The following unifier is produced for the name condition of (Q16) and the head of (V5.2):
θ1 = [P1 ↦ tp(FN, LN), N1 ↦ n(FN, LN), N ↦ "John Smith"]
Applying θ1 to the query and the view, we produce
(Q17) <f(tp(FN,LN),TID1) titles {<TID1 title T1>}> :-
<P person {<FN1 fn FN>}> AND <P person {<LN1 ln LN>}> AND <P person {<A article {<Sid subject S>}>}> AND mergename("John Smith",FN,LN)
AND <tp(FN,LN) pr {<W work {<TR1 tr {<TID1 title T1>}>}>}>
The following unifier is produced for the second source condition (on pr) of (Q17) and
the head of (V5.4):
θ2 = [FN ↦ FN, LN ↦ LN, W ↦ w(FN, LN), TR1 ↦ trep(A), TID1 ↦ TID, T1 ↦ T]
After renaming the variables in (V5.4) and applying θ2 to the query and the view, we
produce:
(Q18) <f(tp(FN,LN),TID) titles {<TID title T>}> :-
<P person {<FN1 fn FN>}> AND <P person {<LN1 ln LN>}> AND <P person {<A article {<Sid subject S>}>}> AND mergename("John Smith",FN,LN)
AND <P′ person {<FN1′ fn FN>}> AND <P′ person {<LN1′ ln LN>}> AND <P′ person {<A′ article {<Sid′ subject S′>}>}> AND mergename(N,FN,LN)
AND <Ro1 r {<RNo1 rn A> <TID title T>}>
Removing obviously redundant conditions gives:
(Q19) <f(tp(FN,LN),TID) titles {<TID title T>}> :-
<P person {<FN1 fn FN>}> AND <P person {<LN1 ln LN>}> AND <P person {<A article {<Sid subject S>}>}> AND mergename("John Smith",FN,LN)
AND <Ro1 r {<RNo1 rn A> <TID title T>}>
2
Notice that in DSL there can be multiple mgus (most general unifiers). The practical
consequence is that the result of V ◦ Q can be a DSL program consisting of multiple rules.
In particular, Qc could consist of an exponential number of rules (each of at most polynomial
length). This observation gives us the following theorem.
Theorem 3.3.3 (Composition Complexity) Query composition in DSL is in EXP-
TIME.
Notice that the order of resolving query conditions with view heads does not matter. Also
notice that query composition “implements” view dereferencing: it transforms a query that
refers to the object patterns in a view head to a query that refers to the objects in the
information source(s) that the view is defined over.
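The composition step can be sketched as a driver that resolves each body condition of Q against every view head and emits one composed rule per combination of unifiers. Here unify() is only a stand-in for the DSL unification of [Pap97], and the string encodings of conditions, heads, and bodies are assumptions.

```python
# Sketch: composition driver. Each condition of Q is resolved with the
# view heads in all possible ways; the composed program Qc has one rule
# per combination of chosen unifiers, which is the source of the
# exponential worst case behind Theorem 3.3.3.
from itertools import product

def compose(query_conditions, view_rules, unify):
    options = []
    for cond in query_conditions:
        # every (view body, unifier) pair whose head unifies with cond
        opts = [(body, theta)
                for head, body in view_rules
                for theta in unify(cond, head)]
        if not opts:
            return []       # a condition that matches no view head
        options.append(opts)
    return list(product(*options))

# Toy unification: a head unifies with an identical condition (assumption).
views = [("p", "body of (V3) rule 1"), ("p", "body of (V3) rule 2"),
         ("q", "body of (V2)")]
unify = lambda cond, head: [{}] if cond == head else []
composed = compose(["p", "q"], views, unify)   # 2 x 1 = 2 composed rules
```

As in Example 3.3.1, the number of composed rules is the product of the unifier counts per condition, and the order in which conditions are resolved does not affect the result.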
Query composition for MSL
A query composition algorithm is presented in Chapter 5 of [Pap97] as the Query Decom-
position algorithm for MSL, that takes as input a query over views and the view definitions
over some information sources, and produces a datamerge program over the information
sources that is equivalent to the original query. The algorithm as presented is incorrect:
it misses possible unifiers, as MSL programs are not in full normal form. In particular,
unifier θ2 in Example 3.3.2 above would be missed. Moreover, MSL programs can construct
cyclical answer graphs, which also defeat the unification algorithm (and consequently the
composition algorithm).
3.4 Equivalence of DSL queries
Two queries Q1, Q2 are equivalent if and only if for all OEM databases D, their results
Q1(D) and Q2(D) are equivalent. In this section, we develop a compile-time test of equiv-
alence of DSL queries, based on a simple extension of containment mappings [CM77].
Deciding query equivalence of DSL queries syntactically is complicated by the distinguishing
characteristics of a semistructured language, such as the restructuring capabilities of the
query language, support for value and label variables, and semantic object id invention.
The fact that semantic object ids complicate query processing and optimization was already
observed in [HY90] for intensional queries in the relational model. Because of DSL’s
restrictions on object id invention, presented in Section 3.2.3, we are able
to come up with a simple query equivalence algorithm that is based on extending the
machinery used in the equivalence test for unions of conjunctive queries [Ull89; CM77;
SY80]. Namely, the algorithm is based on discovering containment mappings between
queries. We extend mappings and containment mappings in Section 3.4.1.
Moreover, object identity introduces key dependencies from the object id to the label and
value. The chase technique [Ull89] is used to deal with these dependencies. The technique
is extended to deal with the case of value variables that can bind to sets of object patterns.
In Section 3.4.2, we present our extension to the chase for the case of key dependencies on
object ids. The extension applies to any functional dependency with value variables on the
right hand side.
In order to decide the equivalence of chased DSL queries, we need to make sure that
all the components (i.e., nodes and edges) of the result graphs are the same. To do that,
we decompose a normal form DSL query into graph component queries that correspond to
the components of the result graph: edges, nodes and root, i.e., top-level, objects.10 In
particular, every DSL query Q is decomposed into three types of finer-grain rules:
• one top rule corresponding to the top level condition of the head of Q (this query
corresponds to the root of the OEM graph constructed by the head of Q)
• as many member rules as there are object-subobject relationships in the head of Q
(these queries correspond to the edges of the OEM graph constructed by the head of
Q) and
• as many object rules as there are object conditions in the query head of Q (corre-
sponding to the objects of the OEM graph constructed by the head of Q and describing
their labels and values).
The decomposition is illustrated by the following example. Note that top, member and
object rule heads depart from DSL syntax: they are simply relational predicates.
Example 3.4.1 Consider the query (Q20); it consists of two rules, and in order to make
the graph component query decomposition clearer, we have let both rules have the same
body. Notice that the first rule of (Q20) creates an l object for every a object and an n′
object for every c object, while the second rule creates an l′ object for every a and an n
object for every c.
(Q20) <l(X) l {<f(Y) m {<n′(Z) n′ V>}>}> :- <X a {<Y b {<Z c V>}>}>
<l′(X) l′ {<f(Y) m {<n(Z) n V>}>}> :- <X a {<Y b {<Z c V>}>}>
The following rules are the graph component queries of (Q20).
top(l(X)) :- <X a {<Y b {<Z c V>}>}>
top(l′(X)) :- <X a {<Y b {<Z c V>}>}>
member(l(X),f(Y)) :- <X a {<Y b {<Z c V>}>}>
member(l′(X),f(Y)) :- <X a {<Y b {<Z c V>}>}>
member(f(Y),n′(Z)) :- <X a {<Y b {<Z c V>}>}>
member(f(Y),n(Z)) :- <X a {<Y b {<Z c V>}>}>
10Recall that OEM graphs are rooted.
object(l(X),l,set) :- <X a {<Y b {<Z c V>}>}>
object(l′(X),l′,set) :- <X a {<Y b {<Z c V>}>}>
object(f(Y),m,set) :- <X a {<Y b {<Z c V>}>}>
object(n′(Z),n′,V) :- <X a {<Y b {<Z c V>}>}>
object(n(Z),n,V) :- <X a {<Y b {<Z c V>}>}>
2
Decomposition into graph component queries takes time linear in the size of the heads
of the DSL rules.
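The decomposition can be sketched as a single walk over a rule head. The nested (oid template, label, value) encoding and the tuple form of the generated facts are assumptions for illustration; the rule body shared by all component queries is elided.

```python
# Sketch: decompose a rule head into its graph component queries.
# A head is (oid_template, label, value), where value is either a list
# of subpatterns (a set value) or an atomic value variable.

def decompose(head):
    top = [("top", head[0])]              # root of the constructed graph
    members, objects = [], []
    def walk(pattern):
        oid, label, value = pattern
        if isinstance(value, list):       # set value: one edge per subobject
            objects.append(("object", oid, label, "set"))
            for child in value:
                members.append(("member", oid, child[0]))
                walk(child)
        else:                             # atomic value
            objects.append(("object", oid, label, value))
    walk(head)
    return top, members, objects

# head of the first rule of (Q20): <l(X) l {<f(Y) m {<n'(Z) n' V>}>}>
head = ("l(X)", "l", [("f(Y)", "m", [("n'(Z)", "n'", "V")])])
top, members, objects = decompose(head)
```

For this head the sketch yields one top rule, two member rules, and three object rules, matching the first rule's share of the component queries in Example 3.4.1.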
3.4.1 Mappings and containment mappings
In this section, we define mappings and containment mappings for DSL rules and graph
component queries.
Definition: A mapping h assigns to each variable appearing in a DSL rule either a
variable, a constant, a term, or a set of object patterns, and maps each function symbol
appearing in the head of a rule to a function symbol. The mapping extends naturally to
terms and object patterns, with h being the identity mapping on all constants, predicate
symbols and function symbols appearing in the body of a rule. 2
Definition: Valid Application of a Mapping on an OEM Object Pattern The
result of applying a mapping h on an OEM object pattern is a pattern where every variable
V is replaced by h(V ) and every function symbol f is replaced by h(f). The mapping is
applicable to the object pattern if (i) the resulting pattern has valid OEM syntax, i.e., set
patterns do not appear in object-id or label positions, and (ii) it is compatible with key
dependencies imposed by the object-id’s. 2
Definition: Containment mapping Let R1, R2 be two graph component queries or
normal form DSL rules:
R1 : H : − < object pattern1 > AND . . . AND < object patternn >
R2 : I : − < object pattern′1 > AND . . . AND < object pattern′k >
A mapping h is said to be a containment mapping if h turns R2 into R1; that is, if
h(I) = H, and for every i there exists a j such that
h(< object pattern′i >) =< object patternj >
2
3.4.2 Extending the chase for set variables
Object identity introduces a functional dependency in OEM: a key dependency from the
object id to the label and value. Moreover, structural constraints, such as those described
by a DTD [Gol90], introduce functional dependencies, as we will see in Section 4.3. We use
the chase technique [Ull88] to deal with these dependencies. The technique is extended for
the case of value variables, which can bind to sets of object patterns. In the rest of the
section, we present our extension to the chase for the case of key dependencies on object id.
The extension applies in general to any functional dependency with value variables in the
right hand side. Recall that DSL queries are not allowed to contain cyclic object patterns.
This is necessary for the described simple extension to the chase to terminate.
Chase extension for dependency on object id Let o1, o2 be object patterns in the
body of a query q with the same term in the object id field.
• If o1 and o2 have L1, V1 and L2, V2 in their label and value field respectively, then we
replace all occurrences of L2, V2 in q with L1, V1 respectively.
• If o1 has object patterns {oi, . . . , oj} in its value field and o2 has V2, then replace all
occurrences of V2 in q with {<X Y Z>}, where X, Y, Z are variables not appearing in
q.
• If o1 has {oi, . . . , oj} in its value field and o2 has {ck, . . . , cm}, replace the value fields
of both o1 and o2 with {oi, . . . , oj , ck, . . . , cm}.
• If one of o1, o2 has a constant in one of the fields, and the other has a variable, replace
all occurrences of that variable in q with the constant.
• If both o1 and o2 have constants in one of the fields, then, if the constants are different,
halt with an error (this query cannot be chased to an equivalent query satisfying the
object id key dependency). If the constants are the same, do nothing for this field.
• If o2 is identical to o1, drop o2 from q.
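Two of the rules above, merging a variable with a variable and a variable with a constant when two patterns share an object id term, can be sketched as follows. The flat (oid, label, value) encoding and the convention that capitalized strings are variables are assumptions for illustration.

```python
# Sketch: chase step for the object-id key dependency, covering the
# variable/variable and variable/constant cases on label and value fields.

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def chase_pair(p1, p2, body):
    """p1, p2 share an object id; equate their fields across the body."""
    assert p1[0] == p2[0]
    subst = {}
    for a, b in ((p1[1], p2[1]), (p1[2], p2[2])):
        if a == b:
            continue
        if is_var(b):
            subst[b] = a          # replace occurrences of b with a
        elif is_var(a):
            subst[a] = b
        else:                     # two different constants: chase fails
            raise ValueError("unsatisfiable object-id dependency")
    return [tuple(subst.get(t, t) for t in pat) for pat in body]

# Two conditions on the same object id U, in the spirit of (Q25).
body = [("U", "Y2", "stanford"), ("U", "university", "Z2")]
chased = chase_pair(body[0], body[1], body)
# both patterns become ("U", "university", "stanford"); the duplicate
# would then be dropped by the last chase rule above.
```

This mirrors Example 3.4.9, where the chase equates Y′′ with university and Z′′ with stanford before the duplicate condition is removed.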
In order to “chase” functional dependencies that do not involve value variables, we
can use the “regular” chase rule [Ull88]. It is easy to see that our extension to the chase
terminates. It is also easy to see that all terminal chasing sequences [AHV95] are the same.11
Example 3.4.2 Consider query (Q21):
(Q21) <f(P) stan student V> :- <P person {<U university stanford>}> AND <P person V>
Using the key dependency on object id, we infer that V is a set variable and thus
chase (Q21) to (Q22). Notice how the value variable is transformed into a set pattern
by the chase.
(Q22) <f(P) stan student {<X Y Z>}> :-
<P person {<U university stanford>}> AND <P person {<X Y Z>}>
2
Notice also that all OEM databases satisfy the object id functional dependency by
definition. That means that the following theorem holds.
Theorem 3.4.3 Let Q be a DSL query and chase(Q) be a terminal chasing sequence
of Q. Also let the only dependency be the functional dependency on object id. Then
Q ≡ chase(Q).
A direct consequence of Theorem 3.4.3 is the following.
Corollary 3.4.4 In the presence of only object id dependencies, two DSL queries are equiv-
alent if and only if their chased counterparts are equivalent.12
The next subsection presents the syntactic test for DSL query equivalence.
11A terminal chasing sequence is the query that results from successive applications of the chase rules until no more applications are possible.
12Theorem 3.4.3 and Corollary 3.4.4 also hold in the presence of arbitrary functional dependencies. The proof is analogous to the relational case ([AHV95], pp. 173-177).
3.4.3 Deciding DSL query equivalence
Before presenting the condition for equivalence of DSL queries, let us discuss equivalence of
graph component queries. The condition for equivalence for each of the three types of graph
component queries is a simple generalization of the condition for equivalence of relational
conjunctive queries with object ids [CM77; HY91].13
Theorem 3.4.5 Two (top, member, object) queries Q1 and Q2 are equivalent if and only
if there exists a containment mapping from Q1 to Q2 and another containment mapping
from Q2 to Q1.
The condition for equivalence of sets of graph component queries is then easily derived:
Theorem 3.4.6 Two sets S1 = {P1, . . . , Pn} and S2 = {T1, . . . , Tm} of graph component
queries are equivalent if and only if for each Pi there exists an equivalent Tj , and for each
Ti there exists an equivalent Pj .
Theorem 3.4.6 is a generalization of the containment theorem for unions of relational
conjunctive queries [SY80; HY91]. The proof is analogous. The condition for DSL equiva-
lence then follows:
Theorem 3.4.7 (DSL query equivalence) Two DSL queries are equivalent if and only
if their decompositions into graph component queries are equivalent.
Proof: The proof for the ONLY IF direction is straightforward. For the IF direction,
if the graph component decompositions are equivalent, then for every OEM database D, the
result databases will have the same objects (because of the equivalence of the object rules)
and the same root objects (because of the equivalence of the top rules). They will also
have the same member tuples, which means that they will have the same object-subobject
relationships. This concludes the proof. 2
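At the set level, the test of Theorem 3.4.6 can be sketched directly. Here equivalent() is a stand-in for the two-way containment mapping test of Theorem 3.4.5, which is not implemented; the toy version below uses syntactic identity, and the component-query strings are assumptions.

```python
# Sketch: equivalence of two sets of graph component queries
# (Theorem 3.4.6): every member of each set must have an equivalent
# member in the other set.

def sets_equivalent(s1, s2, equivalent):
    return (all(any(equivalent(p, t) for t in s2) for p in s1) and
            all(any(equivalent(t, p) for p in s1) for t in s2))

# Toy equivalence: syntactic identity (a stand-in for the containment
# mapping test on component queries).
equiv = lambda a, b: a == b
s1 = ["top(l(X))", "member(l(X),f(Y))", "object(f(Y),m,set)"]
s2 = ["object(f(Y),m,set)", "top(l(X))", "member(l(X),f(Y))"]
```

With the real component-level test plugged in, this is exactly the check that classifies (Q20) and (Q23) in Example 3.4.8 as equivalent.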
Example 3.4.8 Consider again query (Q20) and query (Q23) given below. Notice that
(Q20) and (Q23) have the same query body. Also notice that query (Q23), though it has
different path conditions in its rule heads from query (Q20), creates exactly the same result.
The intuition is that the different query heads create different “parts” of the same answer
graph for (Q20) and (Q23).
13Our notion of query equivalence is a variant of the notion of exposed equivalence in [HY91].
(Q23) <l(X) l {<f(Y) m {<n(Z) n V>}>}> :- <X a {<Y b {<Z c V>}>}>
<l′(X) l′ {<f(Y) m {<n′(Z) n′ V>}>}> :- <X a {<Y b {<Z c V>}>}>
The graph component queries for (Q23) are shown below:
top(l(X)) :- <X a {<Y b {<Z c V>}>}>
top(l′(X)) :- <X a {<Y b {<Z c V>}>}>
member(l(X),f(Y)) :- <X a {<Y b {<Z c V>}>}>
member(l′(X),f(Y)) :- <X a {<Y b {<Z c V>}>}>
member(f(Y),n′(Z)) :- <X a {<Y b {<Z c V>}>}>
member(f(Y),n(Z)) :- <X a {<Y b {<Z c V>}>}>
object(l(X),l,set) :- <X a {<Y b {<Z c V>}>}>
object(l′(X),l′,set) :- <X a {<Y b {<Z c V>}>}>
object(f(Y),m,set) :- <X a {<Y b {<Z c V>}>}>
object(n′(Z),n′,V) :- <X a {<Y b {<Z c V>}>}>
object(n(Z),n,V) :- <X a {<Y b {<Z c V>}>}>
It is obvious that (Q20) ≡ (Q23). 2
Example 3.4.9 Consider the following two queries:
(Q24) <f(P) stan student {<X Y Z>}> :-
<P person {<U university stanford>}> AND <P person {<X Y Z>}>
(Q25) <f(P) stan student {<X Y Z>}> :-
<P person {<X Y Z′>}> AND <P person {<X Y′ Z>}> AND <P person {<U university Z′′>}> AND <P person {<U Y′′ stanford>}>
Notice that unless we chase (Q25), there is no mapping from the body of the query (Q24)
to the body of (Q25). By chasing (Q25), we infer that Y ≡ Y′, Z ≡ Z′, Y′′ ≡ university,
and Z′′ ≡ stanford, which turns (Q25) into (Q26):
(Q26) <f(P) stan student {<X Y Z>}> :- <P person {<X Y Z>}> AND
<P person {<U university stanford>}>
Queries (Q24) and (Q26) are obviously equivalent. 2
3.5 DSL and other semistructured languages
DSL is a variant of the Mediator Specification Language (MSL) [PGMU96]. MSL supports
recursion and does not impose any restrictions on the query conditions or the invented object
ids. The result is that the query equivalence and composition problems for MSL become
unnecessarily complex: there are no known composition and query equivalence algorithms
for MSL.
StruQL [FFLS97] is a logic-based semistructured language very similar to DSL and
MSL. As mentioned in Section 3.2.4, its distinguishing characteristic compared to DSL is
support for regular path expressions in the query body. The query composition problem
does not have a general solution for StruQL, i.e., for StruQL rules Q, V , there does not
always exist a StruQL query Qc = V ◦Q, such that for any database D, Qc(D) = Q(V (D))
[FFLS97]. The containment and equivalence problems for a subset of StruQL14 are studied
in [FLS98].
XML-QL [DFF+] is a query language for XML whose syntax is inspired by SQL
and Lorel and whose semantics follow those of StruQL and MSL.
Lorel [AQM+97] is an OQL-based semistructured language that is the query language
of Lore [MAG+97], the first semistructured database management system, which was de-
veloped at Stanford. Lorel uses OEM as its data model. Optimization techniques for Lorel
are described in [MW99].
XMAS is a semistructured language for XML that has been proposed by the MIX
project [LPV00]. XMAS does not support the notion of object identity, and can perform
restructuring through a groupby operator. Comparing the restructuring power of an explicit
groupby operator versus object id invention, it was already observed in [AK89] that object-id-based
set formation can replace explicit grouping operators.
The reverse holds in a limited fashion: Suciu and Vianu show in [MSV00] that the explicit
groupby operator in XMAS has the same restructuring power as Skolem-based object id
invention in DSL and XML-QL, under some additional constraint on the form of invented
object ids.
The restructuring capabilities of XMAS are a subset of those of DSL, as XMAS can
only construct trees in the query output, whereas DSL can construct DAGs.
Quilt [CRF00] is a proposed XML query language that incorporates the best features
14The subset considered is conjunctive StruQL without restructuring capabilities in the head.
of other XML and semistructured query languages, including the ones mentioned above.
Quilt includes all important constructs for querying XML data and returning XML results,
except a grouping operator or Skolem functions for object invention. Instead, complex XML
results can be constructed using nested queries.
3.6 Conclusion
DSL has features essential for querying and integrating semistructured data, namely the
ability to query and copy arbitrarily nested, schemaless data, the ability to restructure such
data through the use of semantic object ids, and the ability to query the “structure” of
the data through the use of label variables. These features make it a strong choice for
information integration, as well as a good basis for an XML query language.
For a semistructured language to be useful, query optimization techniques must be
available. DSL allows flattening of nested queries through query composition. In the next
chapter, an algorithm for query rewriting using views is presented for DSL that uses the
algorithms for query composition and query equivalence developed in this chapter.
Chapter 4
Query Rewriting for
Semistructured Data
4.1 Introduction
As we explained in Chapter 1, the capability-based rewriting problem and the related prob-
lem of query rewriting using views are at the core of integration systems. Moreover, as
described in Chapter 1, integration systems benefit by using a semistructured data model
and a semistructured query language. In this chapter, we present a query rewriting al-
gorithm for DSL that is the first rewriting algorithm for a semistructured language. The
algorithm builds on the theory developed in the previous chapter. The rewriting algorithm
is at the core of the CBR module of a mediator, as shown in Figure 1.6. Moreover, it has
various applications beyond information integration:
Rewriting in semistructured repositories A rewriting algorithm can be used to an-
swer queries using materialized views and cached queries of repositories for semistruc-
tured data, such as Lore [MAG+97].
For example, if a cached query result contains all “SIGMOD” publications, our rewrit-
ing algorithm can create a rewriting query where “SIGMOD 97” publications are ob-
tained by filtering the cached query for “1997” publications. The rewriting algorithm
only needs the query and the cached query statements; it does not need to examine
the source data. The cached queries play in this case the role of views.1
1Given the autonomy of the bibliographic sources and the mediator, the rewriting query may deliver a
Materialized views and cached queries were the main original motivation for rela-
tional query rewriting [YL87], and they are as important for semistructured or XML
databases.
Web site management and structured Web search Recent work [FFLS97] has ap-
plied concepts from information integration to the task of building complex Web sites
that serve information derived from multiple data sources. In this scenario, a Web site
is a declaratively-defined site graph over the semistructured data graph of the contents
of the information sources. If we only have access to the information through the Web
site(s), queries asked over the data graph need to be rewritten as queries over the Web
site structure and contents. The Web site definitions are just view definitions over
the data graph; the necessary query rewriting can thus be handled by the rewriting
algorithm.
Our algorithm solves the problem of rewriting a query Q using views, described in
Section 1.4.2, by outputting a finite set of rewriting queries, i.e., queries equivalent to Q
that have at least one condition referring to one of the views.
The algorithm operates in two steps: First it generates candidate rewriting queries by
discovering mappings from the views to the original query Q. Then it keeps the candi-
date rewriting queries which are equivalent to the original query by performing equivalence
checks.
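The two-step structure just described can be summarized as a generate-and-test skeleton. The helper functions are parameters here, since their actual definitions (mapping discovery, candidate construction, composition, equivalence testing) are the subject of Chapter 3 and the rest of this chapter; this is an outline of the control flow, not the implementation.

```python
def rewrite(query, views, find_mappings, build_candidate, compose, equivalent):
    """Generate-and-test skeleton: Step 1 builds candidate rewritings from
    view-to-query mappings; Step 2 keeps those whose expansion is
    equivalent to the original query."""
    rewritings = []
    for view in views:
        for mapping in find_mappings(view, query):
            candidate = build_candidate(query, view, mapping)  # Step 1
            expanded = compose(candidate, views)               # Step 2, part 1
            if equivalent(expanded, query):                    # Step 2, part 2
                rewritings.append(candidate)
    return rewritings
```

The expensive work is hidden in the helpers; the skeleton only fixes the order in which they are applied.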
Testing the equivalence of the candidate rewriting query with the original query is
accomplished in two steps. First, the algorithm composes the rewriting query Qr and the
views to obtain an expanded query Qc that is equivalent to the rewriting query but does
not refer to the views any more. Then Qc is tested for equivalence with the original query
Q.
The rewriting algorithm is extended to make use of structural constraints on the source
data. In particular, we consider constraints that can easily be expressed by standards such
as XML DTDs or XML Schemas. The existence of such constraints allows us to find
rewritings in cases where, in the absence of constraints, the algorithm would fail.
The reader may wonder whether, given a reduction of semistructured data to relations
such as the one presented in [Pap97], the DSL rewriting problem can be fully reduced to
stale result to the user. This result may still be very useful to the user. Furthermore, if an update-propagation
system is in place, it can account for the “deltas” between the cache and the sources [ZGMHW95]. In this
thesis we will not deal any further with these consistency issues. Instead we focus on the rewriting algorithm.
the well-understood relational conjunctive query rewriting problem. The answer is negative
because DSL queries cannot be reduced to conjunctive relational queries. DSL queries and
views are reducible to Datalog with function symbols and with a limited form of recursion,
as follows from the MSL-to-Datalog reduction presented in [Pap97]. There are no known
applicable results for rewriting of Datalog programs using Datalog views.
Content Section 4.2 restates the rewriting problem for DSL, presents the rewriting
algorithm, proves its correctness and discusses its complexity. Section 4.3 in particular
extends the algorithm to take into account the existence of structural constraints on the
semistructured data that are given in the form of a Document Type Definition (DTD).
Section 4.4 describes the capability-based rewriting heuristic used by the CBR module of
the TSIMMIS mediator. Finally, Section 4.5 discusses related work and Section 4.6 offers
some concluding remarks.
4.2 DSL Query Rewriting
Given a DSL query Q over an OEM database D and conjunctive views V = V1, . . . , Vn over
D, the rewriting problem is to find a DSL query Q′ such that (i) Q′ refers to at least one
of V1, . . . , Vn and (ii) Q is equivalent to Q′.
We call Q′ the rewriting query. In general, there may be more than one rewriting query.
4.2.1 Rewriting of Queries with a Single Path Condition
We informally present an algorithm which decides whether a single-rule, normal form query
Q, having one single path condition in its body, can be rewritten using a single-rule, nor-
mal form view V . This algorithm, though a special case of the complete rewriting algo-
rithm, illustrates the basic steps of our technique. The general algorithm is presented in
Section 4.3.1. It is proven sound and complete for DSL and its complexity is studied in
Section 4.3.2.
Step 1: Find Candidate Queries We first find mappings from the view to the condition
and then we develop a candidate query for each mapping.
Step 1A: Find Mappings Find, if one exists, a mapping from the body of V to the
body of Q. Notice that there can be at most one mapping from the body of V
to the one single path condition in the body of Q. However, in the general case
(Section 4.3.1) we may have multiple mappings. If a mapping exists, then we can
be sure that, if there is a variable binding that satisfies the body of Q, then there
is also a binding that satisfies the body of V . Hence mappings are a necessary
condition for the relevance of the view to the query condition. Furthermore, the
mapping indicates which conditions of Q do not appear in V ; these conditions
will have to be checked by the rewriting query.
Example 4.2.1 Consider the view (V6), which restructures the person objects
into objects that “group” their labels in property subobjects, and their values
in value subobjects. Notice that (V6) “loses” information in the sense that it
only shows the labels and values that appear in the source data, but the label-
value correspondence has disappeared. Queries such as (Q27), that ask whether
the value leland appears in the source, can be answered using the view (V6),
because they do not need information on the label-value correspondence. The
example below shows how our algorithm finds a rewriting query for (Q27).
(V6) <g(P′) person {<pp(P′,Y′) property Y′> <h(X′) value Z′>}> :-
<P′ person {<X′ Y′ Z′>}>
(Q27) <f(P) stanford yes> :- <P person {<X Y leland>}>
The only mapping from the body of (V6) to the body of (Q27) is (M1). Intu-
itively, (M1) indicates that the condition Z′ = leland must be enforced on the
view in order to get objects relevant to the query.
(M1) [ P′ ↦ P, X′ ↦ X, Y′ ↦ Y, Z′ ↦ leland ]
2
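Step 1A amounts to searching for a substitution from view terms to query terms. Below is a minimal sketch, under a hypothetical tuple encoding of our own: a path condition is an (oid, label, value) triple, values may themselves be nested triples, variables start with an uppercase letter, and primed variables are written in ASCII (e.g. P1 for P′).

```python
def is_var(term):
    """Hypothetical convention: variables start with an uppercase letter."""
    return isinstance(term, str) and term[0].isupper()

def match_term(vt, qt, theta):
    """Extend substitution theta so that theta(vt) equals qt, else None."""
    if is_var(vt):
        if vt in theta:
            return theta if theta[vt] == qt else None
        return {**theta, vt: qt}
    if isinstance(vt, tuple) and isinstance(qt, tuple):
        return match_pattern(vt, qt, theta)
    return theta if vt == qt else None

def match_pattern(view_cond, query_cond, theta=None):
    """Map a view path condition (oid, label, value) onto a query condition."""
    theta = {} if theta is None else theta
    for vt, qt in zip(view_cond, query_cond):
        theta = match_term(vt, qt, theta)
        if theta is None:
            return None
    return theta
```

Applied to the bodies of (V6) and (Q27), match_pattern returns the analogue of (M1); for (Q29), the variable standing for Z′ simply binds to the whole nested triple, as in (M4).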
Step 1B: Generate Candidate Query Apply the mapping to V , resulting in an
“instantiation” of V , namely V ′. Then build the rewriting query Q′ as follows:
The head of Q′ is identical to the head of Q. The body of Q′ is the head of V ′.
Example 4.2.1 continued The only candidate rewriting query (Q28) is cre-
ated from the head of (Q27) and the result of applying (M1) to the head of
(V6).
(Q28) <f(P) stanford yes> :-
<g(P) person {<pp(P,Y) property Y> <h(X) value leland>}>
Step 2: Test Correctness of Candidate Query Check whether the composition of V
and Q′, denoted by V ◦Q′, is equivalent to Q. Step 2 is accomplished in two sub-steps:
Step 2A: Computation of Composition We compute V ◦Q′ using the compo-
sition algorithm in Section 3.3.
Step 2B: Testing Equivalence of V ◦Q′ and Q Equivalence testing is done as
described in Section 3.4.
Example 4.2.1 continued We test whether (Q28) is a valid rewriting query by
first transforming it into the normal form (Q28)n, then composing it with (V6), and
finally comparing the resulting query (V6)◦(Q28)n to (Q27). Indeed, (V6)◦(Q28)n is
equivalent to (Q27) because (i) the containment mapping (M2) maps (V6)◦(Q28)n to
(Q27) and (ii) the containment mapping (M3) maps (Q27) to (V6)◦(Q28)n.2
(Q28)n <f(P) stanford yes> :- <g(P) person {<pp(P,Y) property Y>}>
AND <g(P) person {<h(X) value leland>}>
(V6)◦(Q28)n <f(P) stanford yes> :- <P person {<X′ Y Z′>}> AND
<P person {<X′′ Y′′ leland>}>
(M2) [ P ↦ P, X′ ↦ X, Y ↦ Y, Z′ ↦ leland, X′′ ↦ X, Y′′ ↦ Y ]
(M3) [ P ↦ P, X ↦ X′′, Y ↦ Y′′ ]
Let us look at a second example:
Example 4.2.2 Consider the query (Q29) and the view (V6). It is clear that Z′ must bind
to set values that contain a <Z last stanford> subobject. The algorithm captures this
intuition by finding the mapping (M4) from the body of (V6) to the body of (Q29). Notice
that Z′ is mapped to {<Z last stanford>}.
(Q29) <f(P) stanford yes> :- <P person {<X Y {<Z last stanford>}>}>
(M4) [ P′ ↦ P, X′ ↦ X, Y′ ↦ Y, Z′ ↦ {<Z last stanford>} ]
(Q30) <f(P) stanford yes> :- <g(P) person {<pp(P,Y) property Y>
<h(X) value {<Z last stanford>}>}>
2We skip the step of graph component decomposition because of the simplicity of the example.
(Q30) is the candidate query created from the head of (Q29) and the result of applying
(M4) to the head of (V6). 2
Mappings are necessary but not sufficient for the existence of a rewriting query, as the
following example illustrates. That is why a containment test is needed, as in Step 2B of
the algorithm.
Example 4.2.3 Consider query (Q31) and view (V6).
(Q31) <f(P) stanford yes> :- <P person {<X name {<Z last stanford>}>}>
Intuitively, there is no rewriting query for (Q31) because the view “loses” the correspon-
dence between labels and values. Hence, if the database contains a name attribute and a
value v containing the <last stanford> subobject, it is impossible for the rewriting query
to discover whether there is a name object with value v or name and v appear in different
objects of the data source. Notice that despite the non-existence of a rewriting query there
is the mapping (M5). Based on this mapping, the algorithm derives the candidate rewriting
query (Q32). However, the composition of the candidate rewriting query with the view
results in the query (Q33) which is not equivalent to the original query (Q31). Notice that
name is the label of the object X′ while <last stanford> is a subobject of another object
X′′.
(M5) [ P′ ↦ P, X′ ↦ X, Y′ ↦ name, Z′ ↦ {<Z last stanford>} ]
(Q32) <f(P) stanford yes> :- <g(P) person {<pp(P,Y) property name>
<h(X) value {<Z last stanford>}>}>
(Q33) <f(P) stanford yes> :- <P person {<X′ name Z′>}> AND
<P person {<X′′ Y′′ {<Z last stanford>}>}>
2
Section 4.3 discusses how the algorithm can exploit structural constraints, such as DTDs,
on source data.
4.3 Using structural constraints
Semistructured data is often accompanied by constraints that partially define the structure
of objects. Such structural constraints can be expressed as a DTD [Gol90], a DataGuide
[GW97], or an XML Schema [TBMM; BM]. For instance, we could know that the data in
the previous examples conform to the following DTD:3
<!ELEMENT person (name, phone, address*)>
<!ELEMENT name (last, first, middle?, alias?)>
<!ELEMENT alias (last, first)>
<!ELEMENT address CDATA>
<!ELEMENT phone CDATA>
<!ELEMENT last CDATA>
<!ELEMENT first CDATA>
<!ELEMENT middle CDATA>
This DTD describes in a flexible way the structure of the source data. For example,
it specifies that each object labeled person has exactly one name subobject, exactly one
phone subobject, and zero or more address subobjects. It also specifies that the phone
and address subobjects are atomic.
and address are atomic. Given such a DTD, we can infer information in the form of
dependencies between labels or object ids, that will allow the rewriting algorithm to discover
rewritings in cases where it would have otherwise failed.
Example 4.3.1 Given the above DTD, we can infer automatically that, in a source that
conforms to the DTD, the only subobject of a person object with a last subobject is a
name object. Therefore, if we look at (Q33) in Example 4.2.3, Y′′ has to be name. Moreover,
there exists a “labeled” functional dependency from object id P with label person to object id
X with label name, since according to the DTD a person object has exactly one name subobject.
This implies that X′′ has to be X′ (by application of the chase rule). Therefore (Q33) can
be rewritten as
(Q34) <f(P) stanford yes> :- <P person {<X′ name Z′>}> AND
<P person {<X′ name {<Z last stanford>}>}>
It is obvious that (Q34) is equivalent to (Q31), and therefore a valid rewriting query.
2
As illustrated in the previous example, we identify two cases where information can easily
be inferred from a structural description, such as a DTD:
3Since OEM does not support order, we ignore the order in the DTD description as well.
• (label inference) Given a “path expression” of labels a.?.c, if the structural constraint
specifies that the only subobject of an a object with a c subobject is a b subobject,
we can infer that ? = b.
• (functional dependency) If the structural constraint specifies that objects labeled a
have only one subobject labeled b, we can infer the functional dependency between
object id variables Xa → Yb.
The rewriting algorithm takes advantage of this information by performing label in-
ference and the chase on the query, the views and the candidate queries, as illustrated in
Example 4.3.1. It is straightforward to show that applying label inference and the chase
always terminates in time polynomial in the length of the queries and the constraints
description. Moreover, it is easy to show that label inference and the chase do not affect the
soundness of the rewriting algorithm.
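Both inference rules can be read off a DTD mechanically. Below is a sketch based on the example DTD above, using a dictionary encoding of our own: dtd maps an element label to the set of child labels its content model allows, and unique lists the (parent, child) pairs, among those used in the example, where at most one such child may occur.

```python
# Derived by hand from the example DTD; the encoding is ours, not DSL's.
dtd = {
    "person": {"name", "phone", "address"},
    "name": {"last", "first", "middle", "alias"},
    "alias": {"last", "first"},
}
unique = {("person", "name"), ("person", "phone"),
          ("name", "last"), ("name", "first")}

def infer_label(dtd, a, c):
    """Label inference for a.?.c: return b if b is the only child label of a
    whose elements may contain a c subobject; otherwise None."""
    candidates = [b for b in sorted(dtd.get(a, set()))
                  if c in dtd.get(b, set())]
    return candidates[0] if len(candidates) == 1 else None

def fd_holds(unique, a, b):
    """Labeled functional dependency X_a -> Y_b: an a object has at most
    one b subobject, so two such subobjects' oids can be chased together."""
    return (a, b) in unique
```

Here infer_label(dtd, "person", "last") returns "name", which is the inference applied to (Q33) in Example 4.3.1, and fd_holds(unique, "person", "name") licenses the chase step that equates X′′ with X′.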
In the presence of structural constraints, there are clearly more opportunities for query
simplification and query rewriting. Identifying these opportunities is an open research
problem.
The next section presents the general algorithm for query rewriting.
4.3.1 General case of query rewriting
We now treat the general case of the query rewriting problem, with any number of views
in V and any number of conditions in the body of the (single-rule) query Q. For the sake
of simplicity, the following example uses a view set with only one view. The method used
generalizes trivially to view sets of any size; the algorithm described in Figure 4.1 covers
the general case.
Example 4.3.2 Consider the following view (V7). Notice that the semantic object ids of
property and value objects retain information about the object that originally had that
property and value. Then consider query (Q35).
(V7) <view(P′) person {<pp(X′) property Y′> <val(X′) value Z′>}> :-
<P′ person {<X′ Y′ Z′>}>
(Q35) <f(P) stan student {<X Y Z>}> :- <P person {<X Y Z>}>
AND <P person {<U university stanford>}>
Intuitively, (Q35) can be answered using only (V7) as follows: First use (V7) to find the
P’s that have a university subobject with value stanford. The mapping (M6) from the
body of (V7) to the first condition of (Q35) implies that this is possible. Then for every P
that qualifies, pick all its subobjects <X Y Z>. Mapping (M7) from the body of (V7) to the
second condition of (Q35) implies that this is also possible. Then, the head of the rewriting
query (Q36) is the head of (Q35) and the body of (Q36) is the conjunction of θ6(head(V7))
and θ7(head(V7)).
(M6) θ6 = [P′ ↦ P, X′ ↦ U, Y′ ↦ university, Z′ ↦ stanford]
(M7) θ7 = [P′ ↦ P, X′ ↦ X, Y′ ↦ Y, Z′ ↦ Z]
(Q36) <f(P) stan student {<X Y Z>}> :-
<view(P) person {<pp(X) property Y> <val(X) value Z>}> AND
<view(P) person {<pp(U) property university>
<val(U) value stanford>}>
Let us now check whether (Q36) is a valid rewriting query. Performing the check means
transforming (Q36) and (V7) into full normal form and checking whether
(Q37) = (V7) ◦ (Q36)n
is equivalent to (Q35).
(Q37) <f(P) stan student {<X Y Z>}> :-
<P person {<X Y Z′>}> AND <P person {<X Y′ Z>}> AND
<P person {<U university Z′′>}> AND
<P person {<U Y′′ stanford>}>
Notice that unless we make use of the key dependency Oid → Label,Value there is no
mapping from the body of the query (Q35) to the body of (Q37). By chasing (Q37), we
infer that Y ≡ Y′, Z ≡ Z′, Y′′ ≡ university, and Z′′ ≡ stanford. 2
We now give the algorithm for the general case of the query rewriting problem. In what
follows, the bodies of the query Q and the views in V are converted into normal form and
label inference and the chase are applied before we apply the algorithm.
Notice that the above algorithm constructs and tests all candidate queries (in Step
1B). The efficiency of the algorithm can be substantially improved with the use of simple
heuristics. A particularly effective heuristic is the following:
Algorithm 4.3.3
Input: A DSL query Q with k single path conditions in the body
and a set of DSL views V = {V1, . . . , Vn}.
Output: A set of rewriting queries.
Step 1A: Find the mappings θij from the body of each Vi ∈ V to the body of Q.
Step 1B: Construct candidate rewriting queries Q′ as follows:
• head(Q′) is head(Q)
• body(Q′) is any conjunction of l conditions, 1 ≤ l ≤ k, where each
condition is either a view “instantiation” θij(head(Vi)) or a condition of Q.
If the resulting query is unsafe, then continue with the next candidate.
Step 1C: Perform label inference and chase Q′.
Step 2: Test whether each constructed Q′ is correct.
• Construct the composition Q′(V1, . . . , Vn) of Q′ with V1, . . . , Vn.
• Perform label inference and chase Q′(V1, . . . , Vn).
• If Q′(V1, . . . , Vn) is equivalent to Q, include Q′ in the output;
else continue with the next candidate.
2
Figure 4.1: DSL query rewriting algorithm
• Keep track of which conditions of the query body each instantiated view θij(head(Vi))
maps into. These are the conditions that are “covered” by θij(head(Vi)).
• Only construct candidate queries Q′ such that the views and conditions in the body
of Q′ “cover” all the conditions in the body of Q.
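The heuristic can be sketched as a pruned enumeration (our sketch, not the TSIMMIS code): candidate bodies whose instantiated views do not jointly cover all query conditions are discarded before the expensive composition and equivalence test of Step 2.

```python
from itertools import combinations

def covering_candidates(coverage, k):
    """Yield only the subsets of instantiated views whose covered query
    conditions together equal all k conditions of the query body.
    `coverage` maps an instantiated-view name to a set of condition indices."""
    names = sorted(coverage)
    everything = set(range(k))
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            covered = set().union(*(coverage[n] for n in combo))
            if covered == everything:
                yield combo
```

For (Q35) and (V7), the instantiation under θ6 covers the second condition and the one under θ7 covers the first, so the only surviving candidate uses both instantiations, exactly as in (Q36).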
A variation of the above heuristic is implemented in the capability-based rewriting mod-
ule of the TSIMMIS system, as explained in more detail in Section 4.4.
4.3.2 Completeness and Complexity
The soundness of the algorithm in Figure 4.1 is established by its second step, which checks
the correctness of the rewriting. We will show that the algorithm is complete, i.e., that it
always finds a rewriting query if one exists. For this, we assume that there are no structural
constraints, and therefore no functional dependencies except the key dependencies on object
id.
To prove the completeness of the algorithm, we first observe that if there is no mapping
from a view body to the query body, then the view is not “relevant” to the query.
Lemma 4.3.4 Let Q and V be DSL queries. There is a rewriting query Q′ of Q using view
V only if there is a mapping from the body of V to the body of Q.
Moreover, we can bound both the number of conditions and the variables appearing in
the rewriting.
Lemma 4.3.5 Let Q be a DSL query and V be a set of DSL views. If there exists a
rewriting of Q using V, then there exists such a rewriting consisting of at most k view
heads, where k is the number of single path conditions in the body of the query.4
Lemma 4.3.6 If there exists a rewriting of query Q using the set of views V, then there
exists a rewriting of Q using V that does not use variables that do not appear in Q.
The above lemmata directly extend the theory of relational query rewriting, presented
in [LMSS95], to DSL.
The following lemma justifies why completeness is not compromised by only constructing
rewriting queries Q′ that have a head identical to the head of the query Q. Notice that this
issue is particular to semistructured and nested data models; it is trivial in the relational
model, where it is easy to see that Q′ must have a head identical, up to variable renaming,
to the head of Q.
Lemma 4.3.7 If there exists a valid rewriting query Q′′ such that head(Q′′) is not the same
as head(Q), then there exists a valid rewriting query Q′ such that head(Q′) = head(Q).
To see that Lemma 4.3.7 holds, notice that if there exists such a query Q′′, then we can
always apply our rewriting algorithm to it, to derive a query Q′ equivalent to Q′′ (and
therefore to Q) whose head is identical to the head of Q.
Theorem 4.3.8 The rewriting algorithm of Figure 4.1 is sound and complete.
Proof: The algorithm is obviously sound, because its last step is a correctness test. It is
complete because it exhaustively searches the space of possible candidate rewriting queries,
as defined by the above lemmata (i.e., it generates all the candidate rewriting queries in
that space). 2
4Notice that, since view heads do not have to be single path, the number of single paths in the rewriting can be greater than k.
Complexity of DSL rewriting
The algorithm described in Section 4.3.1 takes exponential time. First, Step 1 can generate
a number of mappings that is exponential in the size of the view bodies. Then Step 2 can
generate an exponential number of candidate rewritings, each of polynomial length. Finally,
the construction of Q′(V1, . . . , Vn) using the query composition algorithm takes exponential
time. Checking the equivalence of each of these queries (that are of polynomial length) to
the original query takes time that is non-deterministic polynomial. Therefore overall the
complexity of the DSL view rewriting algorithm is exponential time.
4.4 Capability-based rewriting in the TSIMMIS mediator
In this section, we describe the implemented capability-based rewriting and plan generation
module of the TSIMMIS mediator.5 In particular, we show how the mediator translates
the user queries into a set of relevant source queries in Sections 4.4.1 and 4.4.2. We also
explain in Section 4.4.3 how the source capabilities are described by using query templates.
The capability-based plan generation algorithm implemented in the TSIMMIS mediator is
a heuristic based on the rewriting algorithm presented in the previous sections. We discuss
its limitations in Section 4.4.5.
4.4.1 Query Translation
As explained in Chapter 1, the mediator encodes the relationship between the integrated
views and the source views with a set of view definitions. Specifically, it uses DSL to define
integrated views. For example, the integrated view (V8) is defined as follows:
(V8) <paper {<title T> <author A> <abs B> <conf C>}> :-
<entry {<title T> <author A> <abs B>}>@s1,
<entry {<title T> <conf C>}>@s2
The above view is essentially a join of the views exported by s1 and s2, with title
being the join attribute.
Suppose the user wants to find the title and abstract of each paper written by ‘Smith’
in ‘SIGMOD-97’. The user formulates the following query, based on the user view paper:
5The design of the described rewriting and plan generation algorithm was done with Ramana Yerneni and Chen Li. The implementation in the TSIMMIS mediator was done by Ramana Yerneni and Chen Li.
(Q38) <ans {<title T> <abs B>}> :-
<paper {<title T> <author "Smith"> <abs B> <conf "SIGMOD-97">}>
When the user query arrives at the mediator, the mediator uses the view definitions to
translate the query on the user views into a logical plan [PGMU96] (i.e., a set of DSL rules
that refer to the source views instead of the integrated views). The following is the logical
plan for the example user query:
(P1) <ans {<title T> <abs B>}> :-
<entry {<title T> <author "Smith"> <abs B>}>@s1,
<entry {<title T> <conf "SIGMOD-97">}>@s2
We refer to the source queries specified on the right hand sides of the logical query plan
rules as conditions. In the logical plan above, there is only one rule with two conditions.
The rule states that answers to the user query can be computed by sending two source
queries. The first one, to s1, gets the title and abstract of each entry (for a paper)
corresponding to ‘Smith’, while the second one, to s2, gets the title of each paper in
conference ‘SIGMOD-97’. From the results of the two source queries, the bindings for
variables T and B are obtained to construct the answers to the user query.
4.4.2 Physical Plans
The logical query plans in TSIMMIS do not specify the order in which the conditions are
processed (e.g., the order in which source queries are sent to the sources). This is done in
the physical plans generated in the subsequent stages of the TSIMMIS mediator.
Three possible physical plans for the logical plan of the example user query are:
• P1: Send query <entry {<title T> <author "Smith"> <abs B>}> to s1;6 send
query <entry {<title T> <conf "SIGMOD-97">}> to s2; join the results of these
source queries on the title attribute.
• P2: Send query <entry {<title T> <author "Smith"> <abs B>}> to s1; for each
returned title, send query <entry {<title T> <conf "SIGMOD-97">}> to s2, with
T bound.
6Strictly speaking, source queries are DSL rules. For simplicity, we just show the tail of the source query rule.
• P3: Send <entry {<title T> <conf "SIGMOD-97">}> to s2; for each returned title,
send query <entry {<title T> <author "Smith"> <abs B>}> to s1, with T bound.
4.4.3 Source Capabilities Description: Templates
In order to describe the capabilities of sources, the TSIMMIS system uses templates to
represent sets of queries that can be processed by each source (see also Chapter 1 and
Appendix A). Suppose s1 and s2 have the following templates.
T11: X :- X:<entry {<title $T> <author A> <abs B>}>@s1
T21: X :- X:<entry {<title T> <conf $C>}>@s2
T22: X :- X:<entry {<title $T> <conf C>}>@s2
The first template T11 says that source s1 can return all the information it has about a
paper given its title. T21 says that s2 can return all the information it has about papers
given the conference. T22 says that s2 can also return the information about a paper given
its title. Assume that these are the only templates for s1 and s2. That is, s1 and s2
cannot answer any other kinds of queries. The pattern variable X in this example binds
to the contents of the whole object pattern. The use of pattern variables is essentially a
shortcut.
TSIMMIS templates are DSL views of a limited form that also allow the use of placeholders
to express the binding patterns required by the query interface of a source. The templates
have a significant limitation: query heads can consist of only one pattern variable. There-
fore, a template, viewed as a DSL view, returns one condition and allows no projections or
restructuring of the result.
Given the above capabilities, P1 is not feasible since s1 cannot answer the query
<entry {<title T> <author "Smith"> <abs B>}>
because the title value is not specified. P2 is also infeasible for the same reason. Only
P3 is feasible, as the mediator first gets the title of each paper in ‘SIGMOD-97’ from s2
and uses this title value to get the corresponding abstract information from s1 and check
that the author is indeed ‘Smith’. Notice that the queries to s1 are now feasible because
they specify the title values.
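The feasibility reasoning above reduces to a simple check over sequential plans (a sketch of ours; the parallel join of P1 is not modeled): each source query must find every variable in its binding requirement already bound by queries issued earlier in the sequence.

```python
def feasible(sequence, required, exported):
    """Check a sequential physical plan: every source query's required
    variables must be bound by queries that appear earlier in the sequence.
    `required[q]` and `exported[q]` are sets of variable names."""
    bound = set()
    for q in sequence:
        if not required[q] <= bound:
            return False          # a binding requirement is unmet
        bound |= exported[q]      # results of q bind its exported variables
    return True
```

With the s1 query (template T11) requiring T bound and the s2 query (template T21) requiring nothing, sending s1's query first is infeasible (the situation of P2), while s2 first then s1 is feasible (plan P3).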
Source Query   Source Template   Condition Processed   Binding Requirement
M1             T11               C1                    T
M2             T21               C2                    None
M3             T22               C2                    T
Table 4.1: Matcher Result
In the next section, we show how the plan generation process in TSIMMIS takes into
account the source capabilities in producing feasible query plans.
4.4.4 Capability Based Plan Generation
Figure 4.2: TSIMMIS CBR architecture
A block diagram of the TSIMMIS mediator is shown in Figure 1.6. The logical plan
generated by the query decomposer is passed on to the plan generator module, which
computes a feasible physical plan for the query. As shown in Figure 4.2, it accomplishes this in
three stages.
three stages.
Matcher
The first step in the plan generation process is to find all the templates that represent
source queries that can process parts of the logical plan. Some of these templates have
requirements indicating the list of variables that need to be bound. To illustrate, consider
the logical plan of Section 4.4.1. Let the two conditions of the logical plan be denoted C1
and C2. There is one template to process C1, with the requirement that variable T be bound.
There are two templates that can process C2: one with T bound and another without any
binding requirements.
Table 4.1 describes the result of the Matcher. It has a row for each source query that
processes some conditions of the logical plan. For instance, the first row M1 in the table
indicates that s1 can process C1 by using template T11 with the requirement that variable
T be bound.
Sequencer
The second step in the plan generation process is to piece together the source queries for
processing the conditions of the logical plan in order to construct feasible plans. Here, what
matters is not just the specific source queries chosen to cover all the conditions of the logical
plan but also the sequence of processing these queries.
The Sequencer uses the table output by the Matcher to find the set of feasible sequences
of source queries. Each query in a feasible sequence has the property that the variables
in its binding requirements are exported by the source queries that appear earlier in the
sequence. For instance, in our example logical plan, the Sequencer finds that the only
feasible sequence is < M2, M1 >, with source query M1 being parameterized (variables
bound) from the result of the source query M2. Sequence < M1, M2 >, though it can
also process all the conditions, is not feasible because M1’s binding requirements cannot be
satisfied. Other sequences like < M2, M3 > cannot process all the conditions of the logical
plan.
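The Sequencer's feasibility check can be sketched as follows. The encoding is ours: each Matcher row of Table 4.1 is hand-coded as a record with the condition it covers, the variables its binding requirements need, and the variables it exports; we also assume, as in the example, that a feasible plan uses exactly one source query per condition.

```python
# Illustrative sketch of the Sequencer step (field names are ours).
# A sequence is feasible if it covers each condition with exactly one query
# and every query's binding requirements are exported by earlier queries.
from itertools import permutations

def feasible_sequences(rows, conditions):
    feasible = []
    for k in range(1, len(rows) + 1):
        for seq in permutations(rows, k):
            covered = [r["covers"] for r in seq]
            # one source query per condition, covering all conditions
            if len(set(covered)) != len(covered) or set(covered) != conditions:
                continue
            bound = set()
            for r in seq:
                if not r["requires"] <= bound:
                    break                 # a binding requirement is unmet
                bound |= r["exports"]
            else:
                feasible.append([r["name"] for r in seq])
    return feasible

rows = [
    {"name": "M1", "covers": "C1", "requires": {"T"}, "exports": {"T", "A", "B"}},
    {"name": "M2", "covers": "C2", "requires": set(), "exports": {"T"}},
    {"name": "M3", "covers": "C2", "requires": {"T"}, "exports": {"T", "C"}},
]

print(feasible_sequences(rows, {"C1", "C2"}))   # → [['M2', 'M1']]
```

As in the text, only the sequence <M2, M1> survives: M1 needs T bound, which only M2 can provide, and <M2, M3> does not cover C1.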
Optimizer
Having found the feasible sequences of source queries, the third step of the plan generation
process is to optimize over the set of corresponding feasible plans and choose the most
efficient among these. The Optimizer uses standard optimization techniques to pick the
best feasible plan and translates it into a physical plan. In our example case, there is only
one feasible sequence of queries < M2, M1 > and this leads to the physical plan P3 of
Section 4.4.2.
4.4.5 Rewriting algorithm and capability-based plan generation
The plan generation algorithm described in this section is a heuristic based on the general
purpose rewriting algorithm described in this chapter. Essentially, the algorithm exploits
the simplicity of the templates to construct feasible candidate queries faster. The plan-
generation algorithm is sound, meaning the generated plans are indeed equivalent to the
user queries, but not complete, that is, the algorithm could fail to find a plan when one
exists (the rewriting algorithm on the other hand as discussed in Section 4.3.2 is complete).
In particular, the plan generation algorithm treats each object condition appearing on
the right hand side of a user query as an atomic condition: the algorithm tries to find (in
the Matcher) one or more templates that “cover” the condition. Thus, the algorithm can
miss a plan where two conditions can be covered together by one template. For example, if
the logical plan (P1) above were instead expressed as (P2) below, then the Matcher would
not be able to discover any matching templates.
(P2) <ans {<title T> <abs B>}> :-
<E entry {<title T> <author "Smith">}>@s1 AND
<E entry {<title T> <abs B>}>@s1 AND
<entry {<title T> <conf "SIGMOD-97">}>@s2
4.5 Related work
There is little work on the problem of rewriting semistructured queries using views [FLS98;
CGLV99; CGLV00]. In [FLS98], the related problem of query containment in StruQL is
addressed. The paper deals with queries and views containing “wildcards” and regular
path expressions, but it does not deal with the restructuring capabilities of the StruQL
language. Recently, Calvanese et al. [CGLV99; CGLV00] proposed an elegant solution to
the problem of rewriting a regular expression in terms of other regular expressions. The
problem is closely related to the problem of rewriting semistructured queries using views,
but the solution is applicable to a narrow class of queries and views, the ones that consist
of only one regular path expression and return its “endpoints.”
The problem of query rewriting for conjunctive relational views is discussed, among
others, in [LMSS95; DL97] and for recursive queries (but not recursive views) in [DG97].
The problem of query equivalence for relational languages with object ids has been studied
in [HY91]. Our notion of query equivalence corresponds roughly, in the terminology of
[HY91], to exposed equivalence.
Our work is also related to the problem of object oriented query rewriting. Previous
work on the problem of containment and equivalence of object oriented queries [Cha92;
LR96] relies on the existence of a static class hierarchy. The problem of containment of
queries on complex objects has been addressed recently in [LS97].
Finally, there has been some recent work on using structural information about a semi-
structured source (such as graph schemas [BDFS97] or DTDs) in query processing [FS98;
PV00; MS99; MSV00].
4.6 Conclusions
We describe an algorithm that given a semistructured query q expressed in DSL and a
set of semistructured views V, finds rewriting queries, i.e., queries that access the views
and are equivalent to q. Our algorithm is based on appropriately generalizing containment
mappings, the chase, and unification. The first step uses containment mappings to produce
candidate rewriting queries. The second step composes each candidate rewriting query with
the views and checks whether the composition is equivalent to the original query.
Moreover, we extend the algorithm to use structural constraints to discover rewritings
in cases where, in the absence of constraints, there would be no rewritings.
It is an open problem to extend our algorithm to semistructured languages with regular
path expressions, like StruQL or Lorel [AQM+97].
Chapter 5
The Capability Description
Language p-Datalog
In the previous chapter we studied the capability-based rewriting problem in the context of
the semistructured data model. We have also shown how the salient features of the semi-
structured model and of semistructured languages affect the solution of the CBR problem.
In this chapter and in Chapter 6 we solve the CBR problem for relational data. Using
a simpler data model allows us to focus our attention on the CBR and related problems,
in particular the query expressibility problem introduced in Chapter 1, in the context of
more powerful capability description languages. We propose the use of Datalog variants
as more powerful description languages. We define and study the CBR and query express-
ibility problem for these description languages. Moreover, we present some results on the
expressiveness of these languages.
We focus on sources that support conjunctive queries, i.e., their capabilities are a subset
of the set CQ of all conjunctive queries ([AHV95]). The topics discussed in this chapter are
as follows:
• We introduce the description language p-Datalog. We formally define the semantics
of p-Datalog as a capability description language, and present complete and efficient
algorithms to (i) decide whether a query is described by a p-Datalog description (the
query expressibility problem) and (ii) decide whether a query can be answered by
combining supported queries (the CBR problem). The CBR algorithm runs in
nondeterministic exponential time in the size of the query and the description, a
substantial improvement over the state-of-the-art algorithm described in [LRU99],
which is doubly exponential.
• We extend the CBR algorithms for descriptions with binding requirements. We also
compare the expressive power of our proposed description language with an alternative
description of binding requirements: Datalog queries with binding patterns [RSU95].
• We study the expressive power of p-Datalog. Our main result is that p-Datalog
cannot describe the query capabilities of certain powerful sources: there is no Datalog
program that can describe all conjunctive queries over a given schema. Indeed, there
is no program that describes even all boolean conjunctive queries over the schema.
• We identify an important class of descriptions, Ploop descriptions, covering sources
such as document retrieval systems, lookup catalogs, and object repositories, and we
show that the complexity of the CBR problem for Ploop descriptions is significantly
lower than the complexity for the general case.
The next section introduces the p-Datalog description language. Section 5.2 describes
the query expressibility decision procedure. Section 5.3 describes the CBR algorithm for p-
Datalog. Section 5.4 studies a useful large class of descriptions, for which the CBR problem
has lower computational complexity. Section 5.5 discusses expressive power issues. The
chapter concludes with Section 5.6 that discusses the related work, including the relationship
between the proposed description language and Datalog queries annotated with binding
patterns.
5.1 The p-Datalog Source Description Language
It is well known that the most popular real-life query languages, such as SPJ queries [AHV95]
and Web-based query forms, are equivalent to conjunctive queries. A Datalog program is a
natural encoding of many sets of conjunctive queries: the set is described by the expansions
of the Datalog program. First, we describe informally a Datalog-based source description
language and illustrate it with examples. A formal definition follows in the next subsection.
In the simple case, when we deal with a weak information source, the source can be
described using a set of parameterized queries. Parameters, called tokens in this thesis,
specify that some constant is expected in some fixed position in the query [PGGMU95;
PGH96; LRU96; LRO96]. Without loss of generality, we assume the existence of a desig-
nated predicate ans that is the head of all the parametrized queries of the description.
Example 5.1.1 Consider a bibliographic information source that provides information
about books. This source exports a predicate
books(isbn, author, title, publisher, year, pages)
The source also exports “indexes”:
author index(author name, isbn),
publisher index(publisher, isbn),
title index(title word, isbn)
Conceptually, the tuple (X, Y ) is in author index if the string X resembles the actual
name of an author and Y is the ISBN of a book by that author. Similarly, (X, Y ) is in
title index if X is a word of the actual title and Y is the ISBN of a book with word X in
the title. The following parameterized queries describe the wrapper that answers queries
specifying an author, a title or a publisher.
ans(Id, Aut, Titl, Pub, Yr, Pg) ← books(Id, Aut, Titl, Pub, Yr, Pg),
author index($c, Id)
ans(Id, Aut, Titl, Pub, Yr, Pg) ← books(Id, Aut, Titl, Pub, Yr, Pg), title index($c, Id)
ans(Id, Aut, Titl, Pub, Yr, Pg) ← books(Id, Aut, Titl, Pub, Yr, Pg),
publisher index($c, Id)
where $c denotes a token. The query
ans(Id, Aut, Titl, Pub, Yr, Pg) ← books(Id, Aut, Titl, Pub, Yr, Pg),
author index(“Smith”, Id)
can be answered by that source, because it is derived from the first parameterized query by
replacing $c by the constant “Smith”. 2
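The token-substitution check of Example 5.1.1 can be sketched as follows. The encoding is ours: queries are lists of atoms, tokens are strings starting with "$", and we assume for simplicity that the concrete query uses the same variable names and subgoal order as the parameterized query (full matching would also handle variable renaming and reordering).

```python
# Sketch: is a concrete conjunctive query an instantiation of a parameterized
# query?  Every token $c must be replaced consistently by one constant;
# everything else must match positionally.
def instantiates(parameterized, concrete):
    """Both queries are lists of atoms: (predicate, tuple-of-terms)."""
    if len(parameterized) != len(concrete):
        return False
    binding = {}
    for (p, pargs), (q, qargs) in zip(parameterized, concrete):
        if p != q or len(pargs) != len(qargs):
            return False
        for a, b in zip(pargs, qargs):
            if a.startswith("$"):               # token: bind it to a constant
                if binding.setdefault(a, b) != b:
                    return False                # token reused inconsistently
            elif a != b:                        # variable/constant must match
                return False
    return True

# the first parameterized query of Example 5.1.1 (underscores are ours)
view = [("books", ("Id", "Aut", "Titl", "Pub", "Yr", "Pg")),
        ("author_index", ("$c", "Id"))]
query = [("books", ("Id", "Aut", "Titl", "Pub", "Yr", "Pg")),
         ("author_index", ('"Smith"', "Id"))]
assert instantiates(view, query)
```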
In the previous example, the source is described by parameterized conjunctive queries. Note
that if, for instance, the source accepts queries where values for any combination of the three
indexes are specified, we would have to write 2³ = 8 parameterized conjunctive queries. The
next example uses IDB predicates (i.e., predicates that are defined using source predicates
and other IDB predicates) to describe the abilities of such a source more succinctly. Finally,
Example 5.1.3 uses recursive rules to describe a source that accepts an infinite set of query
patterns.
Example 5.1.2 Consider the bibliographical source of the previous example. Assume that
the source can answer queries that specify any combination of the three indexes. The
p-Datalog program that describes this source is the following:
(D1) (R1) ans(Id, Aut, Titl, Pub, Yr, Pg) ← books(Id, Aut, Titl, Pub, Yr, Pg),
ind1(Id), ind2(Id), ind3(Id)
(R2) ind1(Id) ← title index($c, Id)
(R3) ind1(Id) ← ε
(R4) ind2(Id) ← author index($c, Id)
(R5) ind2(Id) ← ε
(R6) ind3(Id) ← publisher index($c, Id)
(R7) ind3(Id) ← ε
ε denotes an empty body, i.e., an ε–rule has an empty expansion. Notice that ε–rules are
unsafe [Ull89]. In general, p-Datalog rules can be unsafe, but that is not a problem under
our semantics, defined in Section 5.1.1. Note also that the number of rules is only polyno-
mial in the number of the available indexes, whereas the number of possible expansions is
exponential.
The query
ans(Id, Aut, Titl, Pub, Yr, Pg) ← books(Id, Aut, Titl, Pub, Yr, Pg),
author index(“Smith”, Id)
can be answered by that source, because it is derived by expanding rule (R1) using
rules (R3), (R4) and (R7), and by replacing $c by the constant “Smith”. We can easily
modify the description to require that at least one index is used. 2
In general, a p-Datalog program describes all the queries that are expansions of an
ans-rule of the program. In particular, p-Datalog rules that have the ans predicate in
the head can be expanded into a possibly infinite set of conjunctive queries. Among the
expansions generated, some will only refer to source predicates.1 We call these expansions
1We stated that source predicates are the EDB predicates of our descriptions.
terminal expansions. A p-Datalog program can have unsafe terminal expansions. We say
that the p-Datalog program describes the set of conjunctive queries that are its safe terminal
expansions (see formal definitions in the next subsection).
Example 5.1.3 Consider again the bibliographical source of Example 5.1.1. Assume that
there is an abstract index abstract index(abstract word, Id) that indexes books based on
words contained in their abstracts. Consider a source that accepts queries on books given
one or more words from their abstracts. The following p-Datalog program can be used to
describe this source.
(D2) ans(Id, Aut, Titl, Pub, Yr, Pg) ← books(Id, Aut, Titl, Pub, Yr, Pg), ind(Id)
ind(Id) ← abstract index($c, Id)
ind(Id) ← ind(Id), abstract index($c, Id)
The source describes the following infinite family of conjunctive queries:
ans(I, A, T, P, Y, Pg) ← books(I, A, T, P, Y, Pg), abstract index(c1, I)
ans(I, A, T, P, Y, Pg) ←books(I, A, T, P, Y, Pg), abstract index(c1, I),
abstract index(c2, I)
etc.
which agrees with our conceptual description of the source given above. 2
As another example of a recursive source description, we can think of a transportation
company, such as FedEx, that has an information source that contains information about
flights. The source is capable of answering certain queries about flights. In particular,
assume that the source can answer only whether there exists a flight between cities A and
B that makes exactly n stopovers. We can model such a source with the following p-Datalog
program:
(D3) exists(A,B, 1) ← flight(A,B)
exists(A,B, n)← flight(A,C), exists(C,B, n− 1)
Notice that the content of the source is essentially the relation flight, whereas the
p-Datalog program above describes the query capabilities of the source.
5.1.1 Formal description of p-Datalog.
We assume familiarity with Datalog, e.g., [Ull89; AHV95]. Besides the constant and variable
sorts, we use a third disjoint set of symbols, the set of tokens.
Definition: p-Datalog Program Syntax A parametrized Datalog rule or p-Datalog rule
is an expression of the form
p(u)← p1(u1), . . . , pn(un)
where p, p1, p2, . . . , pn are relation names, and u, u1, u2, . . . , un are tuples of constants, vari-
ables and tokens of appropriate arities. A p-Datalog program is a finite set of p-Datalog
rules. 2
Tokens are variables that have to be instantiated to form a query. We now formalize
the semantics of p-Datalog as a source description language.
Definition: Set of Queries Described/Expressible by a p-Datalog Program Let
P be a p-Datalog program with a particular IDB predicate ans. The set of expansions EP
of P is the smallest set of rules such that:
• Each rule of P that has ans as the head predicate is in EP ;
• If r1: p ← q1, . . . , qn is in EP , r2: r ← s1, . . . , sm is in P (assume their variables and
tokens are renamed, so that they don’t have variables or tokens in common) and a
substitution θ is the most general unifier of some qi and r then the resolvent
θp ← θq1, . . . , θqi−1, θs1, . . . , θsm, θqi+1, . . . , θqn
of r1 with r2 using θ is in EP .
The set of safe terminal expansions TP of P is the subset of all expansions e ∈ EP whose
bodies contain only EDB predicates and that are safe [Ull89]. The set of queries described by P
is the set of all rules ρ(r), where r ∈ TP and ρ assigns arbitrary constants to all tokens in
r. The set of queries expressible by P is the set of all queries that are equivalent to some
query described by P . 2
Unification extends to tokens in a straightforward manner: a token can be unified with
another token, yielding a token. When unified to a variable, it also yields a token. When
unified to a constant, it yields the constant. The above definitions can easily be extended
to accommodate more than one “designated” predicate (like ans).
In the context of the above description semantics, we will use the terms p-Datalog
program and description interchangeably.
Informally, we observe that expansions are generated in a grammar-like fashion, by
using Datalog rules as productions for their head predicates and treating IDB predicates as
“nonterminals” [ASU87]. Resolution is a generalization of non-terminal expansion; rules of
context-free grammars can be thought of simply as Datalog rules with 0 arguments.
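Using the recursive description (D2) of Example 5.1.3, this grammar-like generation of expansions can be sketched as follows. The encoding and the depth bound are ours: IDB predicates play the role of nonterminals, each rule is a production, and the depth bound keeps the (infinite) expansion set finite for the demonstration.

```python
# Grammar-style expansion sketch: expand the leftmost IDB subgoal with each
# of its rules, and yield bodies that consist only of EDB (source) subgoals.
RULES = {  # head predicate -> list of bodies; an atom is (pred, is_idb)
    "ans": [[("books", False), ("ind", True)]],
    "ind": [[("abstract_index", False)],
            [("ind", True), ("abstract_index", False)]],
}

def expansions(body, depth):
    idb = next((i for i, (_, is_idb) in enumerate(body) if is_idb), None)
    if idb is None:                   # terminal expansion: only EDB subgoals
        yield [p for p, _ in body]
        return
    if depth == 0:
        return
    pred = body[idb][0]
    for production in RULES[pred]:    # resolve like a nonterminal rewrite
        yield from expansions(body[:idb] + production + body[idb + 1:],
                              depth - 1)

for e in expansions(RULES["ans"][0], depth=3):
    print(e)   # bodies with 1, 2 and 3 abstract_index subgoals
```

This mirrors the infinite family of queries listed for (D2): each extra resolution step with the recursive ind rule adds one more abstract_index subgoal.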
Rectification: For deciding expressibility as well as for solving the CBR problem, the
following rectified form of p-Datalog rules simplifies the algorithms. We assume the following
conditions are satisfied:
• No variable appears twice in subgoals of the query body. Instead, multiple occurrences
of the same variable are handled by using distinct variables and making equalities
explicit with the use of the equality predicate equal.
• No variable appears twice in the head of the query. Again, equalities are made explicit
with use of the predicate equal.
• No constants or tokens appear among the ordinary2 subgoals. Instead, every constant
c or token $c is replaced by a unique variable C, and an equality subgoal equal(C, c)
or equal(C, $c) is added to equate the variable to the constant.
• No variables appear only in an equal subgoal of a query.
Example 5.1.4 Consider the query
(Q39) ans(X, X, Z)← r(X, Y, Z), p(a, Y )
which contains a join between the second columns of r and p, a selection on the first column
of p, and the same variable in two columns of ans. Its rectified equivalent is
(Q40) ans(X1, X, Z)← r(X, Y, Z), p(A, Y1), equal(X, X1), equal(Y, Y1), equal(A, a)
2
2We refer to the EDB and IDB relations and their facts as ordinary, to distinguish them from facts of
the equal relation.
Notice that we treat the equal subgoal not as a built-in predicate, but as a source
predicate. We call rules that obey these conditions rectified rules and the process that
transforms any rule to a rectified rule rectification. We call the inverse procedure (that
would give us query (Q39) from query (Q40)) de-rectification.
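Rectification along the lines of Example 5.1.4 can be sketched as follows. The encoding and the fresh-variable naming scheme are ours; we assume variables start with an uppercase letter, while constants and tokens do not.

```python
# Sketch of rectification: repeated variable occurrences get fresh names tied
# together with `equal`, and constants/tokens in ordinary subgoals are pulled
# out into `equal` subgoals, as in the step from (Q39) to (Q40).
def rectify(head_pred, head_args, body):
    equalities, counter = [], [0]

    def fresh(base):
        counter[0] += 1
        return f"{base}_{counter[0]}"

    def fix(term, seen):
        if not term[0].isupper():          # constant or token $c
            v = fresh("C")
            equalities.append(("equal", (v, term)))
            return v
        if term in seen:                   # repeated variable occurrence
            v = fresh(term)
            equalities.append(("equal", (term, v)))
            return v
        seen.add(term)
        return term

    seen_body = set()
    new_body = [(p, tuple(fix(t, seen_body) for t in args))
                for p, args in body]
    seen_head = set()                      # head duplicates handled separately
    new_head = (head_pred, tuple(fix(t, seen_head) for t in head_args))
    return new_head, new_body + equalities

# rectify (Q39): ans(X, X, Z) <- r(X, Y, Z), p(a, Y)
head, body = rectify("ans", ("X", "X", "Z"),
                     [("r", ("X", "Y", "Z")), ("p", ("a", "Y"))])
print(head)
print(body)
```

Up to the choice of fresh names, the result matches (Q40): the repeated X and Y each get a fresh copy equated by `equal`, and the constant a is replaced by a variable equated to it.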
In Sections 5.2 and 5.3 we provide algorithms for deciding whether a query is expressible
by a description and for solving the CBR problem.
5.2 Deciding query expressibility with p-Datalog descriptions
In this section we present an algorithm for query expressibility of p-Datalog descriptions.
In so doing, we develop the techniques that will allow us in the next section to give an
elegant and improved solution to the problem of answering queries using an infinite set of
views described by a p-Datalog program.
Our algorithm, the Query Expressibility Decision algorithm, is an extension of the
classic algorithm for deciding query containment in a Datalog program that appears in
[RSUV89] (also see [Ull89]). The algorithm takes as input a conjunctive query and a p-
Datalog description and it tries to identify one expansion of the p-Datalog program that is
equivalent to our query. We next illustrate the workings of the algorithm with an example.
Example 5.2.1 Let us revisit the bibliographic source of previous examples. Assume
that the source contains a table books(isbn, author, publisher), a word index on titles,
title index(title word, isbn) and an author index au index(au name, isbn). Also assume
that the query capabilities of the source are described by the following p-Datalog program:
(D4) ans(A,P ) ← books(Id,A, P ), ind1(Id1), ind2(Id2), equal(Id, Id1), equal(Id, Id2)
ind1(Id) ← title index(V, Id), equal(V, $c)
ind1(Id) ← ε
ind2(Id) ← au index(V, Id), equal(V, $c)
ind2(Id) ← ε
Let us consider the query
(Q41) ans(X, Y )← books(Id,X, Y ), title index(“Zen”, Id), au index(“Smith”, Id)
First we produce its rectified equivalent
(Q42) ans(X, Y )← books(Id,X, Y ), title index(V1, Id1), au index(V2, Id2),
equal(V1, “Zen”), equal(V2, “Smith”), equal(Id, Id1), equal(Id, Id2)
Apparently the above query is expressible by the description. Intuitively, our algorithm
discovers expressibility by “matching” the p-Datalog program rules with the subgoals. In
particular, the “matching” is done as follows: first we create a DB containing a “frozen
fact” for every subgoal of the query. Frozen facts are derived by turning the variables into
unique constants which will be denoted with a bar.
Moreover, we want to capture all the information carried by equal subgoals into the
DB. If, for example, subgoals equal(X, Y ), equal(X, Z) exist in the query, we will generate
“frozen” facts for all implicit equalities as well, i.e., equal(Y, X), equal(Y, Z) etc. In the
interests of space and clarity, we will write equal(X, Y, Z) to mean that all the previously
mentioned facts are in the DB. The DB for our running example is then
books(id, x, y), title index(v1, id1), au index(v2, id2), equal(id, id1, id2),
equal(v1, “Zen”), equal(v2, “Smith”)
We then evaluate the p-Datalog program (D4) on the DB, deriving more facts for the IDB’s.3
In addition, we keep track of the set of frozen facts, called the supporting set, that we use
for deriving each fact. Below is the set of facts and supporting sets derived by a particular
3Notice that, because of the existence of unsafe rules in the p-Datalog program, evaluating the p-Datalog
program will result in the production of non-ground facts. The semantics of these facts are that they are
true for all values of their variables.
evaluation of the Datalog program.4
< ind1(Id), {} >
< ind2(Id), {} >
(1) < ans(x, y), {books(id, x, y), equal(id, id1)} >
< ind1(id1), {title index(v1, id1), equal(v1, “Zen”)} >
< ind2(id2), {au index(v2, id2), equal(v2, “Smith”)} >
(2) < ans(x, y), {books(id, x, y), title index(v1, id1), equal(v1, “Zen”),
au index(v2, id2), equal(v2, “Smith”), equal(id, id1, id2)} >
Every ans fact that is identical to the frozen head of the client query “corresponds” to a
query that contains the client query. Furthermore, we can derive the containing query from
the <fact, supporting set> pair by translating “frozen” facts back into subgoals. In our
running example, the two containing queries5 correspond to (1) and (2). If the supporting
set is identical to the DB that we started with (modulo redundant equality subgoals) then
the corresponding query is equivalent to the client query. Indeed, the query corresponding
to (2) is
ans(X, Y ) ← books(Id, X, Y ), title index(“Zen”, Id), au index(“Smith”, Id)
which is equivalent (actually identical) to our given query. 2
Algorithm QED starts by mapping the subgoals of the given query into “frozen” facts,
such that every variable maps to a unique constant, thus creating the canonical database
[RSUV89; Ull89] of the query, and then evaluates the p-Datalog program on it, trying to
produce the “frozen” head of the query. Moreover, it keeps track of the different ways
to produce the same fact; that is achieved by “annotating” each produced fact f with its
supporting facts, i.e., the facts of the canonical DB that were used in that derivation of f .
4Notice that the supporting set for fact (1) below contains equal(id, id1). Fact (1) is produced from rule
ans(A, P )← books(Id, A, P ), ind1(Id1), ind2(Id2), equal(Id, Id1), equal(Id, Id2)
and facts books(id, x, y), equal(id, id1) and ind1(Id), ind2(Id). As we explained earlier, the semantics of
ind1(Id) is that it stands for ind1(x) for every x, and the same for ind2(Id). Setting x to id1 produces the
given supporting set.
5Algorithm QED uses pruning to eliminate (1) from the output.
We next formalize the notion of the canonical database. A formal definition of support-
ing facts follows.
Definition: Canonical DB of Query Q Let Q : H ← G1, . . . , Gk, . . . , E1, . . . , Em be a
rectified conjunctive query, where G1, . . . , Gk are the ordinary subgoals and E1, . . . , Em are
the equality subgoals. Select a mapping τ that assigns to every variable X of Q a unique
“frozen” constant τ(X) = x and is the identity mapping on constants and predicate names.
This way we construct k “frozen” ordinary facts: τ(G1), . . . , τ(Gk). We also construct m
“frozen” facts of the EDB predicate equal: τ(E1), . . . , τ(Em). These m facts constitute
an instance of the equal relation. We create additional equal facts so that we get the
smallest set of equal facts that includes this instance and is an equivalence relation. All the
constructed facts constitute the canonical DB of query Q. 2
Notice that this DB contains two “kinds” of constants: “regular” constants and frozen
constants.
Example 5.2.2 Consider the rectified query:
(Q43) ans(Y ) ← p(X, X1), q(X2, Y, Z), equal(X, X1), equal(X1, X2),
equal(Z, c)
The canonical DB produced by this query is
p(x, x1), q(x2, y, z), equal(x, x1), equal(x1, x2), equal(z, c), equal(x, x), equal(x1, x1),
equal(x2, x2), equal(x, x2), equal(x2, x), equal(x1, x), equal(x2, x1), equal(c, z),
equal(z, z), equal(c, c)
2
Shorthand notation: Before we proceed, let us formalize the shorthand notation intro-
duced in Example 5.2.1. It is obvious that if the equal facts form an equivalence relation, the
constants and frozen constants appearing in equal facts are divided in equivalence classes.
Let us look at the canonical DB of some query Q. If variables X1, . . . , Xk appearing
in the canonical DB belong to the same equivalence class, we replace all equal facts in-
volving X1, . . . , Xk by equal(X1, . . . , Xk). For example, equal(X1, X2, X3) “stands for” all
equal(Xi, Xj), 1 ≤ i, j ≤ 3.
The canonical DB produced by query (Q43) above can be written as
p(x, x1), q(x2, y, z), equal(z, c), equal(x, x1, x2)
It is easy to see that
equal(Y1, . . . , Yl) is a subset of equal(X1, . . . , Xm) iff ∀i ≤ l, Yi ∈ {X1, . . . , Xm}
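The construction of the canonical DB, including closing the equal facts into an equivalence relation, can be sketched as follows for query (Q43). The encoding is ours, and it assumes the usual convention that variables start with an uppercase letter while constants do not.

```python
# Sketch: build the canonical DB of a rectified query by "freezing" variables
# to unique constants and closing the equal facts reflexively, symmetrically
# and transitively.
from itertools import product

def canonical_db(body):
    def freeze(t):                     # variable -> frozen constant
        return t.lower() if t[0].isupper() else t

    ordinary = {(p, tuple(freeze(a) for a in args))
                for p, args in body if p != "equal"}
    pairs = {tuple(freeze(a) for a in args)
             for p, args in body if p == "equal"}

    # reflexive-symmetric-transitive closure of the equal pairs
    terms = {t for pr in pairs for t in pr}
    closure = {(t, t) for t in terms} | pairs | {(b, a) for a, b in pairs}
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return ordinary | {("equal", pr) for pr in closure}

db = canonical_db([("p", ("X", "X1")), ("q", ("X2", "Y", "Z")),
                   ("equal", ("X", "X1")), ("equal", ("X1", "X2")),
                   ("equal", ("Z", "c"))])
# 13 equal facts: the 9 pairs of {x, x1, x2} plus the 4 pairs of {z, c}
print(sorted(f for f in db if f[0] == "equal"))
```

The 13 equal facts produced match the canonical DB listed for (Q43) above; the shorthand equal(x, x1, x2) stands for the 9 pairs of the first equivalence class.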
Definition: Supporting Set of Fact Let h be an ordinary fact produced by an appli-
cation of the p-Datalog rule
r : H ← G1, . . . , Gk, E1, . . . , Em
of a p-Datalog description P on a database DB that consists of a canonical database
CDB and other facts, and let µ be a mapping from the rule into the database such that
µ(Gi), µ(Ej) ∈ DB and h = µ(H). The set Sh of supporting facts of h, or supporting set of
h, with respect to P , is the smallest set such that
• if µ(Gi) ∈ CDB, then µ(Gi) ∈ Sh,
• if µ(Gi) ∉ CDB and S′ is the set of supporting facts of µ(Gi), then S′ ⊆ Sh,
• if E is the set of all µ(Ei) ∈ Sh, then the smallest set of equality facts that includes
E and is an equivalence relation is included in Sh.
2
Let us notice that Sh is the set of leaves of a proof tree [Ull89] for h. We can further
annotate the produced fact with the “id” of the rule used in its production, thus generating
the whole proof tree for this fact.
Example 5.2.3 We can apply the rule
ans(X1, Z1)← author(X1, Z1), publisher(Z2,W ), equal(Z1, Z2), equal(W, $w)
on the following canonical DB
author(a, b), author(a, a), publisher(d, f), publisher(g, h), equal(b, d), equal(a, g),
equal(f , “PrenticeHall”)
to produce fact ans(a, b). The supporting set S is
{author(a, b), publisher(d, f), equal(b, d), equal(f , “PrenticeHall”)}
2
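The rule application of Example 5.2.3 can be sketched as follows. The encoding and function names are ours, and the sketch handles the base case where all matched body atoms are canonical facts, so the supporting set is just the set of matched facts; matched derived IDB facts would contribute their own supporting sets instead, per the definition above.

```python
# Sketch: match a rule body against the canonical DB and collect, for each
# derived head fact, the canonical facts used (the leaves of its proof tree).
def matches(body, db, subst=None):
    """Yield substitutions (variable/token -> constant) that map every body
    atom onto a fact in db. Variables start uppercase; tokens start with $."""
    subst = dict(subst or {})
    if not body:
        yield subst
        return
    (pred, args), rest = body[0], body[1:]
    for fp, fargs in db:
        if fp != pred or len(fargs) != len(args):
            continue
        s = dict(subst)
        ok = True
        for a, b in zip(args, fargs):
            if a[0].isupper() or a.startswith("$"):   # variable or token
                if s.setdefault(a, b) != b:
                    ok = False
                    break
            elif a != b:                              # constant must match
                ok = False
                break
        if ok:
            yield from matches(rest, db, s)

def supporting_sets(head, body, db):
    for s in matches(body, db):
        fact = (head[0], tuple(s.get(a, a) for a in head[1]))
        used = {(p, tuple(s.get(a, a) for a in args)) for p, args in body}
        yield fact, used

db = {("author", ("a", "b")), ("author", ("a", "a")),
      ("publisher", ("d", "f")), ("publisher", ("g", "h")),
      ("equal", ("b", "d")), ("equal", ("a", "g")),
      ("equal", ("f", "PrenticeHall"))}
rule_head = ("ans", ("X1", "Z1"))
rule_body = [("author", ("X1", "Z1")), ("publisher", ("Z2", "W")),
             ("equal", ("Z1", "Z2")), ("equal", ("W", "$w"))]
for fact, used in supporting_sets(rule_head, rule_body, db):
    print(fact, sorted(used))
```

Only ans(a, b) is derivable, and its supporting set is exactly the set S given in Example 5.2.3.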
We next define the notions of extended facts and extended canonical DB:
Definition: Extended Facts and Extended Canonical DB An extended fact is a pair
of the form < h,Sh >, where h is a fact and Sh is the supporting set for h, with respect to
some description P . Let Q be a rectified conjunctive query. The extended canonical DB of
Q is a database of extended facts < h,Sh >, such that every h belongs in the canonical DB
of Q. 2
Referring to Example 5.2.3, the extended fact “associated” with our production of ans(a, b)
is
< ans(a, b), {author(a, b), publisher(d, f), equal(b, d), equal(f , “PrenticeHall”)} >
We now introduce the notion of the corresponding query for a fact, that makes our intuition
about the supporting set explicit.
Definition: Corresponding Query Let < h,Sh > be an extended fact of the DB.
Then, for every fact gi ∈ Sh, we can define a mapping ρ that is the identity on constants
and predicate names and maps every frozen constant to the variable which it came from.
It is easy to see that this mapping is well-formed. Moreover, it maps Sh into a query
body and the fact h into a query head. The query Q:ρ(h) ← ρ(g1), . . . , ρ(gk) is called the
corresponding query for extended fact < h,Sh >. 2
Intuitively, the corresponding query is an instantiated expansion of the rules of the
description that can prove h and it uses only source and equality predicates.
Algorithm QED produces a set of candidate queries: these are the corresponding queries
for the produced extended facts. Candidate queries are described by the p-Datalog descrip-
tion; they are the only “interesting” expansions, in that they could be equivalent to the
given query. As we will show later, each candidate query has an important property: its
projection over the empty list of attributes contains the projection over the empty list of
attributes of the given query Q. Said otherwise, the body of a candidate query contains the
body of the given query. That means that if there exists a candidate query whose head is
identical to the head of Q, then obviously this is a containing query for Q with respect to P.
Moreover, Q is expressible by P iff one of the candidate queries in the set is equivalent to
Q.
CHAPTER 5. THE CAPABILITY DESCRIPTION LANGUAGE P-DATALOG 77
The algorithm is presented in detail in Figure 5.1. Notice that the algorithm only gen-
erates maximal supporting sets for each produced fact. Therefore, the produced candidate
queries are in a sense “minimal.” We will formalize that notion later in this section.
Algorithm 5.2.4
Input: Minimized [Ull89] (non-rectified) conjunctive query Q of the form
    H ← G1, G2, . . . , Gk, where H is of the form ans(X1, . . . , Xn).
    (Non-rectified) p-Datalog description P.
Output: A set of candidate queries.
Method:
    Rectify P and Q.
    Construct the extended canonical DB of Q.
    Apply the rules of P to the facts in the DB to generate all possible
    extended facts, using bottom-up evaluation [Ull89] modified in the
    following ways:
    % items 1 and 2 guarantee the generation of extended facts
    % with maximal supporting sets
    1. Populate IDB relations with extended facts, i.e., if fact h is produced
       by a rule, compute Sh and then enter <h, Sh> in the database iff
       • <h, Sh> is not already in the database, and
       • no <h, S'h> where Sh ⊆ S'h is present in the DB.
    2. When a new fact <h, Sh> is added to the DB, delete from the DB all
       facts of the form <h, S'h>, where S'h ⊂ Sh.
    3. If a rule is unsafe, i.e., some distinguished variables do not appear
       in the rule body, simply leave those variables in the produced fact.
    In the end:
    4. If <h, Sh> is an extended fact, h is an ans fact, and h contains
       variables, delete the extended fact.
    5. De-rectify the resulting extended facts, and the query Q.
    6. Create the corresponding queries of the extended facts.
2
The treatment of unsafe rules is the same as in generalized magic sets [Ull89].
Figure 5.1: Algorithm QED
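Steps 1 and 2, the maximal-supporting-set maintenance at the heart of QED, can be sketched as follows. The dictionary representation of the extended database is an assumption made for illustration, not the dissertation's implementation.

```python
def insert_extended_fact(db, h, support):
    # db maps each fact h to the list of its stored supporting sets; we enforce
    # that no stored supporting set for h is a subset of another (maximality).
    support = frozenset(support)
    existing = db.setdefault(h, [])
    # Step 1: skip the insertion if an equal or larger supporting set exists.
    if any(support <= s for s in existing):
        return False
    # Step 2: drop every stored supporting set strictly contained in the new one.
    db[h] = [s for s in existing if not s < support] + [support]
    return True

db = {}
insert_extended_fact(db, "ans(a,b)", {"author(a,b)", "equal(b,d)"})
insert_extended_fact(db, "ans(a,b)", {"author(a,b)"})   # subsumed: rejected
insert_extended_fact(db, "ans(a,b)",
                     {"author(a,b)", "equal(b,d)", "publisher(d,f)"})
# only the largest supporting set survives for ans(a,b)
```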
We proceed to give results on the correctness and running time of the algorithm. Before
that, let us just demonstrate with an example why rectification is necessary.
Example 5.2.5 To illustrate why rectification is necessary in identifying the candidate
queries, let us consider the query
ans(X)← p(X, c)
and the p-Datalog description6
ans(A)← p(A,B)
Evaluating the description on the canonical DB {p(x, c)} (without rectification) would
produce the extended fact < ans(x), {p(x, c)} >. The corresponding query is
ans(X)← p(X, c)
which is not a correct candidate query, because it is not expressible (by Definition 5.1.1) by
the given description. If on the other hand we use rectification, we get the canonical DB
{p(x, y), equal(y, c)}. Evaluating the description on it, we get the candidate query
ans(X)← p(X, Y )
which is a containing query for our given query (but not equivalent). 2
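The rectification step of the example can be sketched in Python. The conventions below (uppercase strings are variables, fresh variables are named V0, V1, . . .) are assumptions of this sketch; the dissertation's rectified queries use names such as A and Y1 instead.

```python
from itertools import count

def rectify(body):
    # body: list of (pred, arg1, ...). Every constant, and every repeated
    # occurrence of a variable, is replaced by a fresh variable, with an
    # equal subgoal recording the constraint.
    fresh = ("V%d" % i for i in count())
    seen, out, equalities = set(), [], []
    for pred, *args in body:
        new_args = []
        for a in args:
            is_var = isinstance(a, str) and a[:1].isupper()
            if is_var and a not in seen:
                seen.add(a)
                new_args.append(a)
            else:  # a constant, or a variable that already occurred
                v = next(fresh)
                new_args.append(v)
                equalities.append(("equal", v, a))
        out.append((pred,) + tuple(new_args))
    return out + equalities

# ans(X) <- p(c, Y), p(Y, X) rectifies, as in Example 5.2.5, to
# ans(X) <- p(V0, Y), p(V1, X), equal(V0, c), equal(V1, Y)
rectified = rectify([("p", "c", "Y"), ("p", "Y", "X")])
```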
Now we are ready to state some formal results about algorithm QED. We ultimately
state and prove its correctness (i.e., that it solves the expressibility problem) and establish
its computational complexity.
Lemma 5.2.6 Algorithm QED produces extended facts with maximal supporting sets.
By maximal, we mean that if <h, Sh> and <h, S'h> are two distinct extended facts for the same
fact h, it cannot be that Sh ⊆ S'h or that S'h ⊆ Sh. Lemma 5.2.6 thus follows directly
from steps 1 and 2 of Algorithm 5.2.4.
Theorem 5.2.7 Soundness and Completeness of Set of Candidate Queries Let Q
be a query, P be a p-Datalog description and {Qi} be the set of candidate queries that is
the result of algorithm QED on Q and P . Then the following are true:
1. For all i, πφQ ⊆ πφQi.
2. For all i the identity mapping can map the body of Qi to the body of Q.
3. If R is a query described by P and is not in {Qi} then
• πφR does not contain πφQ or
6 This is obviously the description of a source with a very simple query interface.
• there exists an i such that the heads of R and Qi are identical and Qi ⊆ R.
Moreover, the identity mapping µ is a containment mapping from R to Qi.
4. If R is a query described by P and is not in {Qi}, R ≡ Q only if there exists i such
that Qi ≡ Q.
Proof: (2) is derived directly from the Algorithm and (1) is a direct consequence of the
existence of the mapping. (4) is a direct consequence of (1) and (3). For (3): Algorithm
QED is exhaustive, i.e., it generates all “relevant” (in the sense of (1)) candidate queries,
with the exception of those that are pruned due to Lemma 5.2.6. So let R : HeadR ← BodyR
be “relevant” and not in the candidate set. Then, for the extended fact7 <HeadR, BodyR>,
BodyR is not a maximal supporting set. That means that there exists an extended fact
F : <HeadR, S> such that BodyR ⊆ S. It is then clear from the definition of a corresponding
query that the corresponding query QF to F is contained in R, and that the mapping from
R to QF is the identity. 2
Theorem 5.2.7 says that, for any described query R that is not in the candidate set,
either R is not equivalent to Q, or there already exists a “smaller” query Qi in the candidate
set that still “contains” Q. In the above sense, the candidate set contains “minimal” queries.
Moreover, it says that queries not in the candidate set are not “interesting”: even if R ≡ Q,
there is always a query Qi in the candidate set that is also equivalent to Q.
Algorithm QED produces output that allows us to correctly decide query expressibility.
To that effect, we prove the following:
Lemma 5.2.8 Expressibility Criterion Q is expressible by P iff the set of supporting
facts for some extended fact <h, Sh> of the frozen head h of Q is identical8 to the canonical
DB for Q.
Proof:
IF: It is obvious from the way the “corresponding” query is defined that if DB = Sh,
then the corresponding query is equivalent to Q.
ONLY IF: The output of algorithm QED contains candidate queries for which Theo-
rem 5.2.7 holds, i.e., there is no expansion that is a “tighter fit” to the given query than
the queries in the output. If, for every Sh, there exists some fact in the canonical DB that
7 Where Head, Body are “frozen.”
8 After de-rectification of both.
is not in Sh, then the corresponding query cannot be equivalent to Q: Q is minimized, and
minimization is unique up to isomorphism, so all subgoals (i.e., all facts in the canonical
DB) are necessary for equivalence. 2
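The criterion of Lemma 5.2.8 amounts to a simple comparison, sketched below; representing facts as strings is an assumption of this illustration.

```python
def expressible(canonical_db, head_supports):
    # head_supports: the supporting sets QED recorded for the frozen head of Q
    # (after de-rectification of both sides, per the lemma).
    target = frozenset(canonical_db)
    return any(frozenset(s) == target for s in head_supports)

# In Example 5.2.5 the only supporting set misses equal(y, c), so the query
# is merely contained in a described query, not expressible:
canonical = {"p(x,y)", "equal(y,c)"}
only_contained = expressible(canonical, [{"p(x,y)"}])                 # False
exact_match = expressible(canonical, [{"p(x,y)", "equal(y,c)"}])      # True
```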
The number of extended facts that can be generated per “real” fact is equal to the
number of different maximal supporting sets for the fact, and can thus be exponential in the
size of the canonical DB. The number of facts is exponential in the size of the description,
so we have the following:
Theorem 5.2.9 Algorithm QED produces an answer in time exponential in the size of the
description and the size of the query.
Notice that the problem of query containment in Datalog is reducible to the problem of
query expressibility described here. Query containment in Datalog is EXPTIME-complete
[RSUV89]. Hence we have the following:
Theorem 5.2.10 Query expressibility is EXPTIME-complete.
Therefore, Algorithm 5.2.4 meets the theoretical lower bound.
5.2.1 Expressibility and translation
Let us consider the case of a wrapper that receives a query. It is easy to see that we
could extend Algorithm 5.2.4 so that it annotates each fact not only with its supporting
set, but also with its proof tree. The wrapper can then use the proof tree to perform the
actual translation of the user query into source-specific queries and commands, by applying
the translating actions that are associated with each rule of the description [PGGMU95;
JBHM+97].
5.3 Answering Queries Using p-Datalog Descriptions
Mediators are faced with a tougher problem than wrappers, as explained in Chapter 1:
Given the descriptions for one or more wrappers, the mediator has to answer the user
query by sending to the wrappers only queries expressible by the wrapper descriptions and
subsequently combine the answers to produce the answer to the given query. This is the
Capabilities-Based Rewriting (CBR) problem. The mediator considers rewritings of the
user query that are conjunctive rules, as described below.
Definition: Rewriting of Query Given a conjunctive query Q and a set of queries
{Q1, . . . , Qn}, of the form
ans i ← body i , i = 1, . . . , n
a rewriting of Q using {Qi} is a rule Q′ of the form
ans ← ans1, . . . , ansn, optional equalities
such that Q′ ≡ Q. 2
As we have said in previous sections, a source description defines the (possibly infinite)
set of conjunctive queries answerable by the source. So, the CBR problem is equivalent
to the problem of answering the user query using an infinite set of views described by a
Datalog program [LRU96].
Our CBR algorithm proceeds in two steps. The first step uses Algorithm 5.2.4 to
generate a finite set of expansions (see Figure 5.1). The second step uses an algorithm for
answering queries using views [LMSS95; Qia96] to combine some of these expansions to
answer the query. We prove that if we can answer the query using any combination of
expressible queries, then we can answer it using a combination of expansions in our finite
set. In [LRU99], a solution is presented
for the problem whose complexity is doubly exponential in the size of the query and the
description. The solution is based on “signatures” for the expansions of the description,
that divide the queries that are expressible by the description into equivalence classes. We
will show that our solution is non-deterministic exponential in the size of the query and the
description. Moreover, the correctness proof of our solution is simpler and more intuitive.
Given a user query Q and a wrapper description P in p-Datalog, Algorithm QED pro-
duces all9 the candidate queries of Q with respect to P . We can show that there is at most
an exponential number of those:
Lemma 5.3.1 The output of Algorithm 5.2.4 contains at most an exponential number of
queries, whose length is at most linear in the size of the given user query.
Moreover, we can prove that these are the only queries expressible10 by P that are
“relevant” in answering Q.
9 Modulo variable renaming.
10 The corresponding queries Qi, which are the output of Algorithm 5.2.4, actually are described by P.
Theorem 5.3.2 (CBR) Assume we have a query Q and a p-Datalog description P without
tokens, and let {Qi} be the result of applying Algorithm 5.2.4 on Q and P . There exists a
rewriting Q′ of Q, such that Q′ ≡ Q, using any {Qj |Qj is expressible by P} if and only if
there exists a rewriting Q′′ , such that Q′′ ≡ Q, using only {Qi}.
Proof: The if direction is trivial. For the only if: for every Qj used in the rewriting, it
must be that πφ(Q) ⊆ πφ(Qj) [LMSS95]. Since Qj is expressible by P , Qj could be a
candidate query. But {Qi} contains all the “interesting” candidate queries of Q with
respect to P by Theorem 5.2.7.
Therefore, for any Qj , either Qj ∈ {Qi} or there exists some “corresponding” Qi such
that Qi ⊆ Qj , and the containment mapping from Qj to Qi is the identity mapping. Let
Q′ : Qj1, . . . , Qjk, . . . , Qjm be the rewritten query. If we replace each Qjk with its
“corresponding” Qik identified above, then Q′′ : Qi1, . . . , Qim is also equivalent to Q. In proof:
• There exists a containment mapping from Q′′ to Q. In particular, the identity mapping
is a containment mapping from Q′′ to Q.
• There exists a containment mapping from Q to Q′ and from Q′ to Q′′, and therefore
also from Q to Q′′.
Therefore, by the containment mapping theorem [CM77], Q′′ and Q are equivalent. 2
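The containment-mapping test [CM77] invoked above can be made concrete with a brute-force sketch; the tuple encoding of conjunctive queries and the uppercase-strings-are-variables convention are assumptions of this illustration, and the exhaustive search reflects the NP nature of the test, not an efficient algorithm.

```python
from itertools import product

def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def containment_mapping_exists(q_from, q_to):
    # True iff some symbol mapping (identity on constants) sends q_from's head
    # to q_to's head and every body subgoal of q_from to a body subgoal of
    # q_to; by [CM77] this certifies q_to ⊆ q_from.
    head_f, body_f = q_from
    head_t, body_t = q_to
    vars_f = sorted({t for atom in [head_f] + body_f
                     for t in atom[1:] if is_var(t)})
    terms_t = sorted({t for atom in [head_t] + body_t for t in atom[1:]})
    for images in product(terms_t, repeat=len(vars_f)):
        m = dict(zip(vars_f, images))
        def apply(atom):
            return (atom[0],) + tuple(m.get(t, t) for t in atom[1:])
        if apply(head_f) == head_t and all(apply(g) in body_t for g in body_f):
            return True
    return False

# The candidate and given query of Example 5.2.5: the candidate contains Q,
# but not vice versa.
q_candidate = (("ans", "X"), [("p", "X", "Y")])
q_given = (("ans", "X"), [("p", "X", "c")])
contains = containment_mapping_exists(q_candidate, q_given)       # True
contained_by = containment_mapping_exists(q_given, q_candidate)   # False
```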
Since Algorithm 5.2.4 computes the candidate queries, all that remains to solve the
rewriting problem is an algorithm that combines some of the candidate queries into a
rewriting of the given query. The problem of finding an equivalent rewriting of a
query using a finite number of views is known to be NP-complete in the size of the query
and the view set [LMSS95] and there are known algorithms for solving it in the absence of
tokens [LMSS95; Qia96]. Hence, the total computational complexity of our CBR scheme in
the worst case is
• First stage (QED): Exponential in the size of the query and the description.
• Second stage (answering queries using views): NP in the size of its input. The size of
the input is the cardinality of the candidate set times the size of the largest candidate.
Since the QED algorithm has output of exponential size, the second stage dominates and
the total complexity of the algorithm in the worst case is nondeterministic exponential. In
particular, the cardinality of the candidate set is exponential in the arity of the head of the
candidate queries and, more importantly, in the size of the canonical database. (See also
Section 5.4.2.)
5.3.1 CBR with binding requirements
The discussion in the previous section ignores the presence of tokens. To handle tokens in
the p-Datalog description, we need to modify both steps of our CBR scheme. Let us discuss
what changes are necessary.
To correctly solve the CBR problem in the presence of binding requirements, we first
modify the QED algorithm. Let us consider an example that will show that algorithm
QED, if used unchanged, is inadequate for the solution of the CBR problem with binding
patterns.
Example 5.3.3 Let the “target” query be
(Q44) ans(X)← p(c, Y ), p(Y, X)
and let the description be
(D5) v(X)← p($c,X)
The rectified query is
(Q45) ans(X)← p(A, Y ), p(Y1, X), equal(A, c), equal(Y, Y1)
The rectified p-Datalog description of the source is
(D6) ans(W )← p(B,W ), equal(B, $c)
Algorithm QED produces the following candidate query (after de-rectification):
(Q46) ans(Y )← p(c, Y )
There is no rewriting of (Q44) using only (Q46) that is equivalent to (Q44). But there is a
way to answer (Q44) using our p-Datalog description. To see that, let us rewrite the query
and the view to make the binding patterns explicit:
(Q′44) ansfb(X, A) ← p(A, Y ), p(Y, X)
(D′5) vfb(X, A) ← p(A,X)
Then we can rewrite (Q44) as follows:
(Q′′44) ansfb(X, A)← vfb(Y, A), vfb(X, Y )
This rewriting respects the binding requirements of the views, is processed by passing Y
bindings, and is equivalent to the target query. 2
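The sideways information passing that makes (Q′′44) processable can be sketched as a greedy check that the view atoms admit an evaluation order respecting their binding patterns. The (pattern, args) encoding is an assumption of this sketch; actual algorithms such as AnsBind [RSU95] do considerably more.

```python
def executable(atoms, initially_bound):
    # atoms: list of (binding_pattern, args), e.g. ("fb", ("Y", "A")); returns
    # an evaluation order in which every "b" position is bound before the atom
    # is evaluated, or None if no such order exists.
    bound, remaining, order = set(initially_bound), list(atoms), []
    while remaining:
        ready = next((a for a in remaining
                      if all(v in bound
                             for p, v in zip(a[0], a[1]) if p == "b")),
                     None)
        if ready is None:
            return None            # some binding requirement can never be met
        remaining.remove(ready)
        order.append(ready)
        bound.update(ready[1])     # all its variables are now bound
    return order

# (Q''44): ans_fb(X, A) <- v_fb(Y, A), v_fb(X, Y), with A bound by the caller;
# the first atom binds Y, which then feeds the second atom.
plan = executable([("fb", ("Y", "A")), ("fb", ("X", "Y"))], {"A"})
stuck = executable([("fb", ("X", "Y"))], set())   # Y is never bound
```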
Algorithm 5.3.4
Input: Minimized [Ull89] (non-rectified) conjunctive query Q of the form
    H ← G1, G2, . . . , Gk, where H is of the form ans(X1, . . . , Xn).
    (Non-rectified) p-Datalog description P.
Output: A set of candidate queries with binding patterns.
Method:
    Rectify P and Q.
    Construct the extended canonical DB of Q.
    Replace tokens in P with variables. Annotate rules with binding information.
    Apply the rules of P to the facts in the DB to generate all possible
    extended facts, using bottom-up evaluation [Ull89] modified in the
    following ways:
    1. Populate IDB relations with extended facts, i.e., if fact h is produced
       by a rule, compute Sh and then enter <h, Sh> in the database iff
       • <h, Sh> is not already in the database, and
       • no <h, S'h> where Sh ⊆ S'h is present in the DB.
    2. When a new fact <h, Sh> is added to the DB, delete from the DB all
       facts of the form <h, S'h>, where S'h ⊂ Sh.
    3. If a rule is unsafe, i.e., some distinguished variables do not appear
       in the rule body, simply leave those variables in the produced fact.
    4. Update the bound-variable annotation for the extended fact: a variable
       gets an annotation when it binds to an already annotated variable.
    In the end:
    5. If <h, Sh> is an extended fact, h is an ans fact, and h contains
       variables, delete the extended fact.
    6. De-rectify the resulting extended facts, and the query Q.
    7. Create the corresponding queries of the extended facts.
       Use the binding information to construct their binding patterns.
2
The treatment of unsafe rules is the same as in generalized magic sets [Ull89].
Figure 5.2: Algorithm QED-T
Therefore, we need to modify algorithm QED. The necessary change consists essentially
of a pre-processing step: replace tokens in the p-Datalog description with variables,
but maintain as an extra annotation the information that these variables need to be bound.
In particular, that information can be attached to each extended fact as an extra annotation.
The modified algorithm QED-T is presented in detail in Figure 5.2.
Applying that modification to the previous example, (D5) becomes
(D′′5) v(W )← p(B,W )
where B needs to be bound. Algorithm QED-T on this input produces two candidate
queries:
(Q47) ans(X)← p(Y1, X)
where Y1 needs to be bound, and
(Q48) ans(Y )← p(A, Y )
where A needs to be bound. Finally, QED-T uses the binding information to turn the
candidate queries into queries with binding patterns. So, (Q47),(Q48) turn into
(Q49) ansfb(X, Y1)← p(Y1, X)
and
(Q50) ansfb(Y, A)← p(A, Y )
Queries (Q49),(Q50) together with (Q44) are the input to the second stage of our CBR
scheme, which per Section 5.3 is an algorithm for answering queries using views. The algo-
rithms [LMSS95; Qia96] proposed in the previous section do not deal properly with tokens.
As we have mentioned in Section 5.1, tokens describe binding requirements. Therefore, we
need to take into account the binding requirements of candidate queries. [RSU95] studies
the problem of answering queries using views with binding requirements. The authors use
binding patterns to describe binding requirements. They show that the problem is NP-
complete and they also describe an algorithm for it. The algorithm takes as input a finite
set of conjunctive views with binding patterns and a “target” query with a binding pattern
and rewrites the query using the views in a way that respects the view binding patterns.
Example 5.3.3 is an example of query rewriting using views with binding patterns.
We use this algorithm, henceforth referred to as the AnsBind algorithm, for the second
part of our CBR scheme, that is, to find a rewriting of the user query using the candidate
queries. Using (Q49), (Q50) and (Q44) as input to AnsBind, we obtain the correct rewriting
of (Q44) that is shown in Example 5.3.3.
Theorem 5.3.5 (CBR-tokens) Assume we have a query Q and a p-Datalog description
P with tokens, and let {Qi} be the result of applying Algorithm 5.3.4 on Q and P . There
exists a rewriting Q′ ≡ Q, using any {Qj |Qj is expressible by P} if and only if there exists
a rewriting Q′′ ≡ Q, using only {Qi}.
Proof: The only difference from QED is that QED-T is “missing” some candidate queries
by ignoring tokens. But it is easy to see that any candidate query we are thus “missing”
is identical to one of the queries in the candidate set of QED-T modulo equality subgoals.
Moreover, if there is a rewriting of a query using some candidate Qi with some binding
pattern, then there is also a rewriting of the query using Qi without a binding pattern. The
theorem then follows. 2
The solution for the CBR problem with binding requirements is also non-deterministic
exponential.
5.4 An Interesting and More Efficient Class of p-Datalog Descriptions
We identify an interesting class of p-Datalog descriptions with a simple syntactic characteri-
zation, for which the CBR algorithm of Section 5.3 is much more efficient. In particular, for
this class of descriptions the output of the QED algorithm is exponential only in the arity
of the candidate query head, and does not depend on the size of the canonical database.
Hence, the second stage of the CBR scheme is more efficient, since it receives smaller input.
Overall, the CBR scheme for this class is non-deterministic exponential in the arity of the
head predicate.
Definition: A p-Datalog description P belongs in Ploop if and only if
• P contains only one IDB predicate
• If p is the IDB predicate and
R : p(X1, . . . , Xn)← pred1(A1, . . . , Am), . . . , p(Y1, . . . , Yn), . . . , predk(B1, . . . , Bl)
is any recursive rule where p appears, Yi is actually Xi for all i.
2
Descriptions in Ploop therefore consist of simple loops and exit rules.
Example 5.4.1 Let us repeat the description of the source of Example 5.1.3. The source
accepts queries on books given one or more words from their abstracts, assuming there exists
an abstract index abstract index(abstract word, Id) The following p-Datalog program is
used to describe this source.
(D7)
ans(Id,Aut, T itl, Pub, Y r, Pg) ← books(Id,Aut, T itl, Pub, Y r, Pg), ind(Id)
ind(Id) ← abstract index($c, Id)
ind(Id) ← ind(Id), abstract index($c, Id)
The above description clearly belongs in Ploop.11 2
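The syntactic test for membership in Ploop can be sketched directly from the definition. The rule encoding is an assumption of this sketch, as is the choice (consistent with (D7)) of not counting the distinguished ans predicate as the IDB predicate of the definition.

```python
def in_ploop(rules):
    # rules: list of (head, body); atoms are (pred, arg1, ...). The answer
    # predicate "ans" is not counted as the IDB predicate here, matching (D7).
    idb = {h[0] for h, _ in rules if h[0] != "ans"}
    if len(idb) > 1:
        return False                  # more than one IDB predicate
    for head, body in rules:
        for atom in body:
            if atom[0] in idb and head[0] in idb and atom[1:] != head[1:]:
                return False          # recursive subgoal must repeat the head args
    return True

# (D7) from Example 5.4.1, with the token $c written as a plain constant "c":
d7 = [(("ans", "Id", "Aut", "Titl", "Pub", "Yr", "Pg"),
       [("books", "Id", "Aut", "Titl", "Pub", "Yr", "Pg"), ("ind", "Id")]),
      (("ind", "Id"), [("abstract_index", "c", "Id")]),
      (("ind", "Id"), [("ind", "Id"), ("abstract_index", "c", "Id")])]
d7_ok = in_ploop(d7)                                           # True
# A rule ind(X) <- ind(Y), e(Y, X) would violate the simple-loop requirement:
bad_ok = in_ploop([(("ind", "X"), [("ind", "Y"), ("e", "Y", "X")])])  # False
```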
We use lattices to help explain why the output of QED on descriptions in Ploop does not
depend on the size of the canonical database but depends only on the arity of the ans
facts. The next subsection is a short reminder about lattices.
5.4.1 Lattice Framework
Let us consider the subset relation ⊆ between sets.
We denote a lattice with set of elements (supporting sets in this section) L and the
subset relation ⊆ by 〈L,⊆〉. For elements a and b of a lattice 〈L,⊆〉, a ⊂ b means that
a ⊆ b and a ≠ b.
The ancestors and descendants of an element of a lattice 〈L,⊆〉, are defined as follows:
ancestor(a) = {b | a ⊆ b}
descendant(a) = {b | b ⊆ a}
Note that every element of the lattice is its own descendant and its own ancestor. The
immediate proper ancestors of a given element a belong to a set we shall call next(a).
Formally,
next(a) = {b | a ⊂ b, ¬∃c : a ⊂ c ⊂ b}
11 The description also happens to be monadic [AHV95]. Descriptions in Ploop in general don’t have to be.
Figure 5.3: Supporting set lattice for fact f for a database of size 5
Figure 5.4: Supporting sets and least common ancestor
In the terminology of lattices, with ⊂ being the ordering relation, a is covered by the
elements of next(a) [DP90].
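In the powerset lattice used here, the covering relation has a particularly simple form: b covers a exactly when b is a plus one element. The following sketch (an illustration, not part of the development) computes next(a) for the subset lattice over a 5-element database.

```python
def next_elems(a, universe):
    # Immediate proper ancestors of a in the powerset lattice <2^universe, ⊆>:
    # b covers a exactly when b = a plus a single new element.
    a = frozenset(a)
    return {a | {x} for x in universe if x not in a}

universe = {1, 2, 3, 4, 5}
n12 = next_elems({1, 2}, universe)      # the three sets covering {1, 2}
top = next_elems(universe, universe)    # the top element has no ancestors above
```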
It is common to represent a lattice by a lattice diagram, a graph where the lattice
elements are nodes and there is an edge from a below to b above if and only if b is in
next(a) (i.e., if and only if a is covered by b). Thus, for any two lattice elements x and y,
the lattice diagram has a path downward from y to x if and only if x ⊆ y.
Figure 5.3 shows the lattice diagram for the possible supporting sets of a fact f for
a database of size 5. The next subsection discusses the size of the output of the QED
algorithm for the Ploop class of p-Datalog descriptions.
5.4.2 QED and Ploop
The cardinality of the candidate set produced by QED can in general be exponential in the
size of the canonical database. Figure 5.3 illustrates graphically why the number of
supporting sets for a fact f , even supporting sets of a fixed size, can be exponential.
Therefore, the number of candidate queries can also be exponential in the size of the
canonical database.
For descriptions in Ploop, let us make the following crucial observation: Let Si and Sj
be two supporting sets for fact f that are produced by algorithm QED with a description
P that is in Ploop. Let S be their least common ancestor, as in Figure 5.4. Then, S is also
produced by QED for f . Since QED only keeps extended facts with maximal supporting
sets, the extended fact < f,S > will be kept for f , and it will replace the extended facts
< f,Si > and < f,Sj >.
Thus, it is easy to see that only one extended fact per fact f will ultimately be kept, and
therefore just one candidate query per fact. Hence, the output of the QED algorithm for Ploop,
and thus the complexity of the second stage of the CBR scheme, is only exponential in the
arity of the head of the candidate queries, and not in the size of the canonical database.
The importance of the class lies in the fact that we have observed that it is expressive
enough to describe a large number of common sources, such as document retrieval systems
and Web-based sources.
5.5 Expressive Power of p-Datalog
We have illustrated the use of p-Datalog programs as a source description language. In
this section, we explore some limits of its description capabilities. It should be noted that
although we focus here on the description of conjunctive queries, similar results hold when
negation and disjunction are introduced.
Clearly, there are sets of conjunctive queries that cannot be described by any p-Datalog
description. Moreover:
Lemma 5.5.1 There exist recursive sets of conjunctive queries that are not expressible by
any p-Datalog description.
Proof: As we have seen in the previous section, the decision procedure for the description
semantics of p-Datalog runs in exponential time. Therefore, any recursive set of conjunctive
queries whose membership problem requires superexponential time is not expressible by any
p-Datalog description. 2
However, the practical question is whether there are recursive sets of conjunctive queries,
that correspond to “real” sources and cannot be expressed by p-Datalog programs. We show
next that some common sources (intuitively the “powerful” ones) exhibit this behavior.
Before we prove this result, we demonstrate the expressive abilities and limitations of p-
Datalog.
Let us start with an observation: For every p-Datalog description program P , the arity
of the result is exactly the arity of the ans predicate. This restriction is somewhat artificial,
since we can define descriptions with more than one “answer” predicate. However, even
in that case, a given program would still bound the arities of answers. Furthermore, a
more serious bound is the number of variables that occur in any one of the rules of the
program. We will see that this bound imposes severe restrictions on the queries that
can be expressed.
But first, if we bound the number of variables, we can show the following:
Theorem 5.5.2 Let k be some integer. Let p1, . . . , pm be the EDB predicates of a database.
There exists a p-Datalog program P that describes all conjunctive queries with at most k
variables on this database.
Proof: We show the construction for k = 3 and for the case where p1, . . . , pm are each
predicates of arity two. The program that can describe all conjunctive queries is the fol-
lowing:
(D8) ans3(Xi, Xj , Xl) ← temp(X1, X2, X3),∀i, j, l ≤ 3
ans2(Xi, Xj) ← ans3(Xi, Xj , Xl),∀i, j, l ≤ 3
ans1(Xi) ← ans2(Xi, Xj),∀i, j ≤ 3
ans0() ← ans1(X)
temp(X1, X2, X3) ← pl(Xi, Xj), temp(X1, X2, X3), ∀l ≤ m, ∀i, j ≤ 3
temp(X1, X2, X3) ← pl(Xi, $c), temp(X1, X2, X3), ∀l ≤ m, ∀i ≤ 3
temp(X1, X2, X3) ← pl($c,Xj), temp(X1, X2, X3), ∀l ≤ m, ∀j ≤ 3
temp(X1, X2, X3) ← pl($c1, $c2), temp(X1, X2, X3), ∀l ≤ m
temp(X1, X2, X3) ← ε
where X1, X2, X3 are distinct variables. It is easy to see that a similar construction can
provide the program that describes all conjunctive queries for k > 3 and larger arities. 2
As mentioned above, a fixed p-Datalog program bounds the arity of the results, but
this bound is not the only cause of limitation. Even if we focus on arity-0 results, i.e.,
queries that answer yes or no and do not provide data, p-Datalog is limited. The limitation
is related to the number of variables. Let FOk be the set of sentences of first order logic
[AHV95] with at most k variables. Note that the same variable can be “reused” as much
as needed using quantification. The following relates the queries described by a p-Datalog
program to formulas expressible in first-order logic with a bounded number of variables.
It states that although one such query may use an arbitrary number of variables, with
appropriate “reuse” only a bounded number of variables suffice.
Lemma 5.5.3 Let P be a p-Datalog program and k the maximum number of variables
occurring in a rule of P . Then for each Q expressible by P , Q is equivalent to a query in
FOk.
Proof: Let x1, . . . , xk be the variables appearing in the rules of description P . Also, let
Q′ : ans(u1)← p1(u2), p2(u3), . . . , pn(un)
be in descr(P ) such that Q ≡ Q′. We will show that Q′ is equivalent to a first order sentence
with only k variables.
The proof is by induction on the number of resolution steps used to construct a rule. If
Q′ is a rule of P , then the claim is true. Otherwise, when doing a step of the resolution, let
qi be the literal that is unified with some rule head. Then the variables not used in qi can
be reused, existentially quantified, for the extra variables in the rule. To illustrate, let the
rules be as follows:
(1) p(X, Y )← q(Y, Z), r(Z,X, W )
(2) q(X, Y )← s(X, Y, Z,W, A,B)
The maximum number of variables in the rules is 6. The result of resolving rule (1) with
(2) is the following rule:
p(X, Y )← s(Y, Z, Z ′,W ′, A′, B′), r(Z,X, W )
The logical sentence associated with this rule is
∀X, Y (∃Z,Z ′,W ′, A′, B′,W (s(Y, Z, Z ′,W ′, A′, B′) ∧ r(Z,X, W ))→ p(X, Y ))
which is equivalent to
∀X, Y (∃Z,W (r(Z,X, W ) ∧ ∃X, W,A′, B′(s(Y, Z, X,W, A′, B′)))→ p(X, Y ))
The last statement above uses exactly 6 variables. 2
The limitation on the number of variables of the program prohibits the description of the
set of all conjunctive queries over a schema — a set that is supported by common powerful
sources.
Theorem 5.5.4 Let the database schema S have a relation of arity at least two. For every
p-Datalog description P over S, there exists a boolean query Q over S, such that Q is not
expressible by P . (So, in particular, there is no p-Datalog description that could describe
a source that can answer all conjunctive queries, even if we fix the arity of the answer.)
In order to prove the theorem, we first need to prove the following lemma:
Lemma 5.5.5 There exists a boolean conjunctive query with k variables that is not ex-
pressible as a conjunctive query with k − 1 variables.
Proof: Let us consider the following query Q:
ans() ← G(x1, x2), . . . , G(x1, xk), . . . ,
        G(xi, x1), . . . , G(xi, xi−1), G(xi, xi+1), . . . , G(xi, xk), . . . ,
        G(xk, x1), . . . , G(xk, xk−1)
This query asks whether G has a self-loop or a clique of size k, and it is in FOk. Q cannot
be expressed by an FOk−1 formula, as can be shown by playing a pebble game [Imm82]
with k−1 pairs of pebbles on the following two structures: G1, a k-clique without self-loop,
and G2, a (k − 1)-clique without self-loop. These two structures are indistinguishable in
the (k − 1)-pebble game. Therefore, because of Lemma 5.5.3, Q cannot be expressed by a
conjunctive query with k − 1 variables. 2
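The two structures of the pebble-game argument can be checked directly for the boolean clique query. The brute-force homomorphism test below is an illustration under the usual semantics of boolean conjunctive queries, not part of the proof.

```python
from itertools import product

def clique_graph(n):
    # Complete directed graph on n nodes, without self-loops.
    return {(i, j) for i in range(n) for j in range(n) if i != j}

def clique_query_holds(k, edges):
    # The boolean query Q: is there a homomorphism from the k-clique pattern
    # into the graph? (Variables may map to the same node, but then a
    # self-loop would be required, so it holds iff G has a self-loop or a
    # k-clique.)
    nodes = sorted({v for e in edges for v in e})
    return any(all((a[i], a[j]) in edges
                   for i in range(k) for j in range(k) if i != j)
               for a in product(nodes, repeat=k))

k = 4
on_g1 = clique_query_holds(k, clique_graph(k))      # G1, the k-clique: holds
on_g2 = clique_query_holds(k, clique_graph(k - 1))  # G2, the (k-1)-clique: fails
```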
Now we are ready to prove Theorem 5.5.4.
Proof: Let S (without loss of generality) contain the binary predicate G. Suppose such
a description P exists, and let k be the maximum number of variables in a rule of P . Then
each conjunctive query expressible by P is in FOk by Lemma 5.5.3. But, by Lemma 5.5.5,
the boolean query for the (k + 1)-clique without self-loops is not in FOk, and hence it is
not expressible by P , a contradiction. 2
Theorem 5.5.4 points out a rather serious limitation of p-Datalog descriptions.
5.6 Related Work
Many projects have dealt with data integration of structured sources (e.g., [LMR90; A+91;
HM93; K+93; T+90].) These projects ignored the problem of the different and limited
query capabilities of information sources, which is important for integration systems that
deal with heterogeneous sources. In Chapter 2 we discussed the approaches taken by a
newer generation of projects. In what follows, we discuss some theoretical work in this
area.
Papakonstantinou et al. [PGGMU95] suggested a grammar-like approach for describing
query capabilities and Levy et al. [LRU96] used Datalog with tokens for the same purpose.
These works focus on showing how we can compute a query Q given a capabilities
description P . The algorithm presented in [PGGMU95] only applies to specific classes of
descriptions. [LRO96] proposes using capability records for source capability description.
Capability records are strictly less expressive than p-Datalog descriptions. More recently,
Florescu et al. [FLMS99] discussed performance and implementation issues for plan gener-
ation in the presence of access limitations on data modelled by binding patterns.
The following subsection discusses the use of tokens for the description of binding re-
quirements and compares that approach to the use of binding patterns [RSU95; Ull89].
5.6.1 Describing binding requirements in p-Datalog
As we have already noted, sources often can answer only queries that have specific binding
requirements. As mentioned in Section 5.1, we use tokens to specify that some constant
is expected in some fixed position in the query, i.e., to implicitly define the binding
requirements of described queries. In contrast, [RSU95] uses explicit enumeration of
accepted binding patterns [Ull89] for each described query to achieve the same goal.
Example 5.6.1 Let us consider the following p-Datalog rule:
(Q51) ans(X, Y ) ← p(X, Z, $c1), q(Y, Z, $c2, W )
(Q51) describes a join query that requires two bindings, one for the third argument of
relation p and one for the third argument of relation q. Using the notation of [RSU95], also
used in [Ull89], we could write (Q51) above as follows:
(Q52) ans^ffbb(X, Y, A, B) ← p(X, Z, A), q(Y, Z, B, W )
This query describes the same binding requirements as (Q51). 2
Explicitly specifying accepted binding patterns as in (Q52) presents a number of prob-
lems. In particular, it obscures the distinction between variable and constant in the rule.
This complicates answering the query expressibility question. Moreover, and more im-
portantly, explicit specification of binding patterns does not generalize in the presence of
recursion. When query capabilities are described with a p-Datalog program, it is not even
possible to enumerate all possible binding patterns: the description encodes a possibly
infinite number of described queries that have different bound variables.
On the other hand, using tokens allowed us to naturally extend the description of binding
requirements to the case of p-Datalog programs. The difference is made clearer by the
following example.
Example 5.6.2 Let us revisit Example 5.1.3, which describes a particular bibliographic
source. The p-Datalog description for that source is the following (variable names have
been changed):
ans(I, A, T, P, Y, Pg) ← books(I, A, T, P, Y, Pg), ind(I)
ind(I) ← abstract index($c, I)
ind(I) ← ind(I), abstract index($c, I)
As we saw in Example 5.1.3, the source describes the following infinite family of con-
junctive queries:
ans(I, A, T, P, Y, Pg) ← books(I, A, T, P, Y, Pg), abstract index(c1, I)
ans(I, A, T, P, Y, Pg) ← books(I, A, T, P, Y, Pg), abstract index(c1, I),
abstract index(c2, I)
etc.
The queries in this family have an increasing number of bound variables, so their binding
patterns would look like this:
ffffffb,
ffffffbb,
etc.
The use of tokens allows us to describe the binding requirements succinctly. 2
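The contrast can be made concrete with a small sketch (our own illustration, not part of the dissertation): the recursive token-based description finitely encodes an infinite family of queries, and therefore an infinite family of binding patterns.

```python
def described_query(n):
    """The n-th query described by the recursive description (n >= 1
    abstract_index subgoals), written in plain-text Datalog syntax."""
    head = "ans(I, A, T, P, Y, Pg)"
    body = ["books(I, A, T, P, Y, Pg)"]
    body += [f"abstract_index(c{i}, I)" for i in range(1, n + 1)]
    return head + " <- " + ", ".join(body)

def binding_pattern(n):
    """Six free head variables plus one bound constant per abstract_index
    subgoal: ffffffb, ffffffbb, ..."""
    return "f" * 6 + "b" * n

for n in (1, 2, 3):
    print(binding_pattern(n), described_query(n))
```

No finite list of adorned predicates can capture every member of this family, which is exactly the limitation of explicit binding-pattern enumeration noted above.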
5.7 Conclusions and open problems
In this chapter we discussed the problems of (i) describing the query capabilities of sources
and (ii) using the descriptions for mediation. We discussed these problems for a capability-
description language that is a Datalog variant, called p-Datalog. We provided algorithms
for solving (i) the expressibility and (ii) the CBR problems.
The first algorithm decides whether a given query is equivalent to one of the queries
described by a p-Datalog program. Within an integration system such as TSIMMIS, a
variant of this algorithm is run by the wrapper to perform the translation of queries issued
by the mediator or the user to queries and commands in the sources’ native language.
In particular, as explained in Appendix A, each rule of the capability description can be
associated with a translating action, in the spirit of Yacc. Algorithm 5.1 can be extended
to annotate facts not only with their supporting set, but also with their proof tree. The
wrapper then uses the algorithm to match a submitted query with a query described by the
capability description, and uses the proof tree generated by the algorithm to perform the
translation of the query into source-specific queries and commands, by applying to it the
appropriate translating actions.
The second algorithm is run by the CBR module of the mediator, and it determines whether
a given query can be computed using queries that are expressible by a p-Datalog program.
The output of the algorithm is a logical query plan that can be further optimized by a
query optimizer, as shown in Figure 1.6. This conceptually clean separation of the CBR
module from the rest of the query processing modules is challenged in environments where
the number of logical query plans may become huge. When we need to optimize a multiway
join query where the source data can be joined in a large number of permutations, it is
infeasible to have the CBR module generate all possible logical plans and consequently have
the optimizer cost them and pick the least expensive. Instead, the CBR module has to aid
the optimizer by pruning the space of plans it generates. In [PGH98], the authors describe
an extension of a CBR algorithm with System R’s dynamic programming pruning technique
[SAC+79]. A more aggressive approach has been followed by the Garlic project [ROH99],
where the capabilities-based rewriting is part of the cost-based query optimizer. Capabilities
are encoded as rewriting rules that describe the subplans (instead of the queries) that are
understood by a wrapper. The interaction of CBR with query optimization, especially
in the context of adaptive query optimization [Lev00], presents fertile ground for further
research.
This chapter also studies the expressive power of p-Datalog. We showed that p-Datalog
is more powerful than using conjunctive queries with binding patterns, but we also reached
the important negative result that p-Datalog cannot describe the query capabilities of
certain powerful sources. In particular, we showed that there is no p-Datalog program that
can describe all conjunctive queries over a given schema. Indeed, there is no program that
describes all boolean conjunctive queries over the schema. A direct consequence of our
result is that p-Datalog cannot model a full-fledged relational DBMS.
We have focused exclusively on conjunctive queries. It is an interesting open problem to
extend this work to non-conjunctive queries, i.e., queries involving aggregates and negation.
Chapter 6
The Capability Description
Language RQDL
The limitations of p-Datalog for describing the query capabilities of powerful sources make it
useful to explore the use of more powerful capability-description languages. In this chapter
we study RQDL, a novel language proposed recently [PGH96] for capability description
in the absence of detailed schema information. RQDL extends p-Datalog by allowing the
representation of attribute lists of arbitrary length through attribute vectors. In particular,
in this chapter
• We formally describe and extend RQDL, and prove that it is more powerful than
p-Datalog.
• We provide an algorithm that allows us to build networks of mediators by exporting
the query capabilities of a mediator in terms of the capabilities of its underlying
sources — as described by the TSIMMIS architecture (Figures 1.5 and 1.6). The
algorithm takes as input descriptions of the query capabilities of sources and outputs
a description of all queries supported by a mediator that accesses these sources.
• We provide a reduction of RQDL descriptions into p-Datalog augmented with function
symbols of a specific form. The reduction has important practical and theoretical
value. From a practical point of view, it reduces the CBR and query expressibility
problems for RQDL to the corresponding problems for p-Datalog, thus giving complete
algorithms that are applicable to all RQDL descriptions. From a theoretical point of
CHAPTER 6. THE CAPABILITY DESCRIPTION LANGUAGE RQDL 97
view, it clarifies the difference in expressive power between RQDL and p-Datalog. We
present the reduction, as well as the algorithms for the complete RQDL language.
Section 6.1 introduces RQDL. Section 6.2 discusses the use of RQDL to describe media-
tor capabilities accurately. Section 6.3 describes the reduction of RQDL to p-Datalog with
function symbols and Section 6.4 describes the query expressibility and CBR algorithms for
RQDL.
6.1 The RQDL Description Language
Given the limitations of p-Datalog for the description of powerful information sources, we
propose the use of a more powerful query description language. RQDL (Relational Query
Description Language) is a Datalog-based rule language used for the description of query
capabilities. It was first proposed in [PGH96] and used for describing query capabilities
of information sources. [PGH96] shows its advantages over Datalog when it is used for
descriptions that are not schema-specific, i.e., descriptions that do not refer to specific
relations or arities in the schema of the particular source. In this way the descriptions
are more concise and they gracefully handle schema evolution.
In this chapter we present a formal specification of extended-RQDL, which provably
allows us to describe large sets of queries. For example, we can prove that extended-RQDL
(from now on, simply RQDL), unlike p-Datalog, can describe the set of all conjunctive
queries. Furthermore, we reduce RQDL
descriptions to terminating p-Datalog programs with function symbols. Consequently, the
decision on whether a given conjunctive query is expressed by an RQDL description is
reduced to deciding expressibility of the query by the resulting p-Datalog program.
Note that the reduction of RQDL to Datalog with function symbols is important because
• It reduces the comparison between the expressive power of p-Datalog and RQDL to
a comparison between Datalog and Datalog with function symbols.
• It reduces the decision procedure for expressibility to Algorithm 5.2.4. This reduction
allows us to give a complete solution to the CBR problem for RQDL.
Subsections 6.1.1 and 6.1.2 demonstrate the use of RQDL for the description of source
capabilities and define the syntax and semantics of RQDL. Section 6.3 describes the reduc-
tion of RQDL descriptions to p-Datalog programs with function symbols and Section 6.4
proceeds to give algorithms for query expressibility by RQDL description and for the CBR
problem for RQDL descriptions.
6.1.1 Using RQDL for query description
To support schema-independent descriptions, RQDL allows the use of predicate tokens in
place of the relation names. Furthermore, to allow tables of arbitrary arity and column
names, RQDL provides special variables called vector variables, or simply vectors, that
match with sets of relation attributes that appear in a query. Vectors can “stand for”
arbitrarily large sets of attributes. It is this property that eventually allows the description
of large, interesting sets of conjunctive queries (like the set of all conjunctive queries).
Example 6.1.1 illustrates RQDL’s ability to describe source capabilities without referring
to a specific schema. Example 6.1.2 demonstrates an RQDL program that describes all
conjunctive queries over any schema. Subsection 6.1.2 describes the formal syntax and
semantics of RQDL. Before we go ahead with the examples, let us introduce some notation.
Named Attributes in Conjunctive Queries: For notational convenience, we slightly
modify the query syntax so that we can refer to the components of tuples by attribute
names instead of column numbers. For example, consider the relation book with schema
book(title, isbn). We will write book subgoals by explicitly mentioning the attribute names;
instead of writing
ans()← book(X, Z), equal(X, DataMarts)
we will write
ans()← book(title : X, isbn : Z), equal(X, DataMarts)
We will be using named attributes in the rest of this chapter. Every predicate will then
have a set of named attributes (and not a list of attributes). The connection of this scheme
to SQL syntax is evident.
Example 6.1.1 Consider a source that accepts queries that refer to exactly one relation
and pose exactly one selection condition over the source schema.
ans() ← $r(→V ), item(→V , $a, X ′), equal(X ′, $c)
(Predicate tokens belong to the same sort as tokens; see Chapter 5.)
In the above RQDL description, $r is a predicate token, “standing in for” any predicate
name, →V is an attribute vector, and item is a metapredicate describing an element of →V .
The description (note that both the RQDL descriptions and the queries are rectified)
describes, among others, the query
ans()← books(title : X, isbn : Z), equal(X, DataMarts)
because, intuitively, we can map the predicate token $r to relation books, →V to the set
of attribute-variable pairs {title : X, isbn : Z}, X ′ to X, and $c to DataMarts. The
metapredicate item(→V , $a, X ′) declares that the variable X ′ maps to one of the variables
in the set of attribute-variable pairs that →V is mapped to, i.e., X ′ maps to one of the
variables of the subgoal $r. The token $a maps to the attribute name of the variable X ′
in the mapping of →V . $a can map to any of the attribute names and hence X ′ can map
to either X or Z.
RQDL descriptions do not have to be completely schema-independent. For example, let
us assume that we can put a selection condition only on the title attribute of the relation.
Then we modify the above RQDL description as follows:
ans() ← $r(→V ), item(→V , title, X ′), equal(X ′, $c)
The replacement of $a by title forces the selection condition to refer to the title attribute
only. 2
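Checking whether this one-rule description describes a given single-relation selection query amounts to finding a mapping for $r, →V , $a, X ′, and $c. A minimal sketch, using our own (hypothetical) query encoding:

```python
def match_single_selection(query):
    """query = (relation, {attr: var}, (sel_var, const)).
    Return the token/vector mappings under which the description
    ans() <- $r(V), item(V, $a, X'), equal(X', $c) describes the query,
    or None if no mapping exists."""
    rel, attrs, (sel_var, const) = query
    for attr, var in attrs.items():
        if var == sel_var:  # item(V, $a, X') must pick the selected attribute
            return {"$r": rel, "V": attrs, "$a": attr, "X'": var, "$c": const}
    return None  # the selection variable is not among the relation's attributes

q = ("books", {"title": "X", "isbn": "Z"}, ("X", "DataMarts"))
m = match_single_selection(q)
print(m["$r"], m["$a"], m["$c"])  # books title DataMarts
```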
Next we present the RQDL description PCQ that describes all conjunctive queries over
any schema.
Example 6.1.2
(i) ans(→V1) ← cond(→V ), →V1 ⊆ →V
(ii) cond(→V ) ← $p(→V1), cond(→V2), →V = →V1 ∪ →V2
(iii) cond(→V ) ← item(→V , $a, X), equal(X, $c), cond(→V )
(iv) cond(→V ) ← item(→V , $a1, X1), item(→V , $a2, X2), equal(X1, X2), cond(→V )
(v) cond(→V ) ← $p(→V )
The description above describes any rectified conjunctive query (without arithmetic). The
description works conceptually as an evaluator of an SQL query: the predicate cond
captures the “Cartesian product” of the FROM clause, constructed by rules (ii) and (v);
rules (iii) and (iv) apply one selection condition and one join condition, respectively; and
rule (i) describes any projection. The union metapredicate in rule (ii) creates the attribute
vector of the “augmented” condition.
2
6.1.2 Semantics of RQDL
An RQDL description is a finite set of RQDL rules. The description semantics of RQDL is
a generalization of the description semantics of p-Datalog, to account for the existence of
vectors and metapredicates. We start by defining an expansion of an RQDL description.
Definition: Let P be an RQDL description with a particular IDB predicate ans. The set
of expansions EP of P is the smallest set of rules such that:
• each rule of P that has ans as the head predicate is in EP ;
• if r1 : p ← q1, . . . , qn is in EP , r2 : r ← s1, . . . , sm is in P , and a substitution θ is the
most general unifier of some qi and r, then the resolvent
θp ← θq1, . . . , θqi−1, θs1, . . . , θsm, θqi+1, . . . , θqn
of r1 with r2 using θ is in EP .
2
Unification: Unification extends to vectors in the following way:
1. a vector can unify with another vector, yielding a vector;
2. a vector can unify with a set consisting of attribute-variable pairs, yielding that set;
for example, p(→V ) can unify with p(attr1 : X, attr2 : Y ), yielding p(attr1 : X, attr2 : Y ).
Metapredicates: There are three metapredicates, and their argument lists have to be of
specific types. We define
union(→V , →V1, →V2) to mean →V = →V1 ∪ →V2,
where →V is a vector and →V1, →V2 can be vectors or sets of attribute-variable pairs. We
also define
item(→V , $a, X) to mean →V [$a] = X,
and
item(→V , a, X) to mean →V [a] = X,
which means that the variable X belongs to the set of attribute-variable pairs that →V
maps to, with attribute name $a (or a). Here a is a constant, $a is a token, and X is a
variable; →V can be a vector or a set of attribute-variable pairs. Finally, we define
subset(→V , →V1) to mean →V ⊆ →V1,
where →V and →V1 can be vectors or sets of attribute-variable pairs, and →V can only
appear in the head of the rule (in addition to the subset subgoal). The intuition behind
subset is that it allows us to do arbitrary projections. In any RQDL program P , subset
can only appear in rules whose head predicate does not appear in the body of any rule
of P .
We call a metapredicate that does not contain any vectors ground.
Safety: Metapredicates must observe some binding pattern constraints. In particular, all
vectors that appear in metapredicates must be safe as defined below:
• If a vector appears in an EDB or IDB subgoal then it is safe.
• If a vector →V appears in a subgoal union(→V , →V1, →V2), and →V1 and →V2 are safe,
then →V is also safe.
• If a vector →V appears in a subgoal subset(→V , →V1) and →V1 is safe, then →V is also
safe.
Following the definition of description semantics of Section 5.1, we now define the de-
scription semantics of RQDL.
Definition: Set of Queries Described/Expressible by an RQDL Program Let P
be an RQDL description. The set of terminal expansions TP of P is the set of all expansions
e ∈ EP that contain only EDB predicates or predicate tokens in the body. A valid terminal
expansion is a terminal expansion in which all ground metapredicates evaluate to true.
The set of instantiated terminal expansions IP of RQDL description P is the set of all
(rectified) conjunctive queries τ(r), where r belongs to the set of terminal expansions of P ,
and τ is a mapping of the RQDL rule r to a conjunctive query that:
1. maps every token $c to a constant c (note that we consider relation names to be of
constant type),
2. maps every vector →V to a set of attribute-variable pairs {(a1 : X1), . . . , (an : Xn)}
through a mapping σ such that
(a) after we replace every predicate subgoal p(→V ) with p(a1 : X1, . . . , an : Xn), no
variable appears in more than one predicate subgoal,
(b) for every subgoal of the form union(→V , →V1, →V2), σ(→V ) = σ(→V1) ∪ σ(→V2),
(c) for every subgoal of the form item(→V , a, X), σ(→V ) includes a pair (a : X),
(d) for every subgoal of the form item(→V , $a, X), σ(→V ) includes a pair (a : X), for
some a,
(e) for every subgoal of the form subset(→V , →V1), σ maps →V to a subset of σ(→V1),
3. and drops all metapredicate subgoals.
The set of described queries of an RQDL description P with “designated” predicate ans
(when ans is understood), is the set of safe instantiated terminal expansions of P . 2
Example 6.1.3 Let us refer to the RQDL description PCQ of Example 6.1.2. The RQDL
rule
R : ans(→V ′) ← $p1(→V1), $p2(→V2), union(→V , →V1, →V2), item(→V , $a1, X1),
item(→V , $a2, X2), equal(X1, X2), subset(→V ′, →V )
is a terminal expansion of that RQDL description. In particular, this rule is derived from the
RQDL description PCQ by using rules (i), (iv), (ii) and (v) in that order. The conjunctive
query
Ri : ans(a1 : X, a2 : Y )← p(a1 : X, b : Z), q(a2 : Y, c : Z ′), equal(Z,Z ′)
is an instantiated terminal expansion of the RQDL description, since it is an instantiation
of rule R. In particular,
• $p1, $p2 map to predicate names p, q respectively.
• $a1, $a2 map to attribute names b, c respectively.
• →V1 maps to (a1 : X, b : Z), →V2 maps to (a2 : Y, c : Z ′), and →V maps necessarily to
their union, namely to (a1 : X, a2 : Y, b : Z, c : Z ′).
• X1, X2 map to Z, Z ′ respectively.
• →V ′ maps to (a1 : X, a2 : Y ).
All metapredicate subgoals are dropped. 2
If Q is a conjunctive query with head predicate ans and P is an RQDL description, we
say that Q is expressible by P , if there exists Q′ described by P , such that Q ≡ Q′.
Referring to Example 6.1.3, query
Q : ans(a1 : A, a2 : B)← p(a1 : A, b : Z), q(a2 : B, c : Z ′), q(a2 : W, c : U), equal(Z,Z ′)
is expressible by the description PCQ, since it is equivalent to Ri.
Note here that RQDL can be easily extended (e.g., allowing not only tokens but also
variables in place of predicate names) to describe the capabilities of information sources
that understand and can process higher order logics, for example sources that understand
HiLog [CKW93] or F-Logic [KL89]. We do not pursue this issue further in this thesis.
The next section explains how to use RQDL to describe the capabilities of networks of
mediators.
6.2 RQDL and mediator capabilities
Let us revisit the mediation architecture of Figure 1.4. In a dynamic environment such as the
Internet, or the intranet of a big organisation, when integrating information, we would like
to be able to leverage existing integration infrastructure [Wie92]. Specifically, if a mediator
exists that offers an integrated view of some information we want to access, we would like to
be able to use that mediator, instead of accessing each one of the sources it integrates. Using
a mediator as an integrated information source means creating a network of mediators, as
in Figure 1.4. Using a mediator as a “source” to another mediator also means that we must
be able to describe the mediator capabilities. As explained in the introduction, mediators
often have query processing capabilities that allow them to “handle” every conjunctive
query over the data that they integrate.
Given the expressiveness results of Section 5.5, p-Datalog cannot describe the capabilities
of such a mediator. However, RQDL is powerful enough for that task. Let us consider a
mediator M that integrates sources S1, . . . , Sn and let the descriptions of these sources be
D1, . . . , Dn. Also, assume that each wrapper understands one answer predicate, and let
these be ans1, . . . , ansn. Then, the RQDL program DM that describes the capabilities of
the mediator is the following:
ans(→V1) ← cond(→V ), subset(→V1, →V )
cond(→V ) ← choose(→V1), cond(→V2), union(→V , →V1, →V2)
cond(→V ) ← item(→V , $a, X), equal(X, $c), cond(→V )
cond(→V ) ← item(→V , $a1, X1), item(→V , $a2, X2), equal(X1, X2), cond(→V )
cond(→V ) ← choose(→V )
choose(→V ) ← ans1(→V )
...
choose(→V ) ← ansn(→V )
D1
...
Dn
The similarity of this description to PCQ of Example 6.1.2 is evident. DM describes all
conjunctive queries that the mediator can answer, that is, any conjunctive query that
combines results from queries accepted by the sources the mediator integrates; hence the
inclusion of D1, . . . , Dn in DM . Given D1, . . . , Dn, the description DM can obviously be
generated automatically.
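Since DM depends on the sources only through the answer predicates and the description texts, generating it is a matter of instantiating the template above. A rough sketch, writing V for →V in a plain-text rule syntax of our own:

```python
def mediator_description(source_descriptions):
    """source_descriptions: list of (ans_predicate, description_text) pairs.
    Returns the text of D_M: the fixed template rules, one choose rule per
    source, and the source descriptions themselves appended at the end."""
    rules = [
        "ans(V1) <- cond(V), subset(V1, V)",
        "cond(V) <- choose(V1), cond(V2), union(V, V1, V2)",
        "cond(V) <- item(V, $a, X), equal(X, $c), cond(V)",
        "cond(V) <- item(V, $a1, X1), item(V, $a2, X2), equal(X1, X2), cond(V)",
        "cond(V) <- choose(V)",
    ]
    rules += [f"choose(V) <- {ans}(V)" for ans, _ in source_descriptions]
    rules += [text for _, text in source_descriptions]
    return "\n".join(rules)

print(mediator_description([("ans1", "D1 rules ..."), ("ans2", "D2 rules ...")]))
```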
Next we will discuss an efficient algorithm for deciding whether a query is expressible
by an RQDL description. The algorithm is based on a reduction of both the query and
the description into a simple standard schema that facilitates reasoning about relations and
attribute names.
6.3 Reducing RQDL to p-Datalog with function symbols
Deciding whether a query is expressible by an RQDL description requires “matching” the
RQDL description with the query. This matching is a challenging problem because vectors
have to match with nonatomic entities, i.e., sets of variables.
We present an algorithm that addresses this problem by reducing query expressibility
by RQDL descriptions to the problem of query expressibility by p-Datalog with function
symbols, i.e., we reduce the RQDL description into a corresponding description in p-Datalog
with function symbols. The reduction is based on the idea that every database DB can be
reduced to an equivalent database DB′ such that the attribute names and relation names
of DB appear in the data (and not the schema) of DB′. We call DB′ a standard schema
database. We then rewrite the query so that it refers to the schema of DB′ (i.e., the
standard schema) and we also rewrite the description into a p-Datalog description with
function symbols which refers to the standard schema as well.
Subsection 6.3.1 presents the conceptual reduction of a database into a standard schema
database. Subsection 6.3.2 presents the rewriting of queries and Subsection 6.3.3 presents
the rewriting of RQDL descriptions. Each of the subsections starts with one or two examples
and continues with a formal definition of the reduction, which can be skipped at the first
reading.
6.3.1 Reduction of a database to standard schema database
In order to reason with the relation names and attribute names of the queries, we conceptu-
ally reduce the original database into a standard schema database where the relation names
and the attribute names appear as data and hence can be manipulated without the need
for higher order syntax. First we present a reduction example and then we formally define
the reduction of a database into its standard schema counterpart.
Example 6.3.1 Consider the following database DB with schema
b(au, isbn) and f(subj , isbn).
b
au isbn
Smith 123
Jones 345
f
subj isbn
Logic 123
Theology 345
The corresponding standard schema database DB′ consists of two relations
tuple(table name, tuple id) and attr(tuple id , attr name, value) which are common to all
standard schema databases. In the running example DB′ is
tuple
table name tuple id
b b(au,Smith,isbn,123)
b b(au,Jones,isbn,345)
f f(subj,Logic,isbn,123)
f f(subj,Theology,isbn,345)
attr
tuple id attr name value
b(au,Smith,isbn,123) au Smith
b(au,Smith,isbn,123) isbn 123
b(au,Jones,isbn,345) au Jones
b(au,Jones,isbn,345) isbn 345
f(subj,Logic,isbn,123) subj Logic
f(subj,Logic,isbn,123) isbn 123
f(subj,Theology,isbn,345) subj Theology
f(subj,Theology,isbn,345) isbn 345
Notice above how we created mechanically one tuple id for each tuple of the original
database. 2
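The reduction illustrated in the example is entirely mechanical. A short sketch, using our own dictionary encoding of relations (attribute order is preserved, so tuple ids come out exactly as above):

```python
def to_standard_schema(db):
    """db: {relation: [ {attr: value} ]} -> (tuple_rel, attr_rel), the two
    relations of the standard schema database."""
    tuple_rel, attr_rel = [], []
    for rel, rows in db.items():
        for row in rows:
            # Build the tuple id mechanically, e.g. "b(au,Smith,isbn,123)".
            tid = rel + "(" + ",".join(f"{a},{v}" for a, v in row.items()) + ")"
            tuple_rel.append((rel, tid))
            for a, v in row.items():
                attr_rel.append((tid, a, v))
    return tuple_rel, attr_rel

db = {"b": [{"au": "Smith", "isbn": "123"}],
      "f": [{"subj": "Logic", "isbn": "123"}]}
t, a = to_standard_schema(db)
print(t)  # [('b', 'b(au,Smith,isbn,123)'), ('f', 'f(subj,Logic,isbn,123)')]
print(a)
```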
Definition: Given a database DB, we say that the standard schema database correspond-
ing to DB is the smallest database DB′ such that
1. its schema is tuple(table name, tuple id) and attr(tuple id , attr name, value), and
2. for every tuple t(a1 : v1, . . . , an : vn) in DB, there is a tuple tuple(t, t(a1, v1, . . . , an, vn))
in DB′ and for every attribute ai, i = 1, . . . , n there is also a tuple
attr(t(a1, v1, . . . , an, vn), ai, vi) in DB′.
2
6.3.2 Reduction of queries to standard schema queries
The RQDL expressibility algorithm first reduces a given conjunctive query Q over some
database DB into a corresponding query Q′ over the standard schema database DB′. The
reduction is correct in the following sense: the result of asking query Q′ on DB′ is equivalent,
modulo tuple-id naming, to the reduction into standard schema of the result of Q on DB.
To illustrate the query reduction, let us consider the following examples. We first con-
sider a boolean query Q over the schema of Example 6.3.1.
ans()← b(au : X, isbn : S1), f(subj : A, isbn : S2), equal(S1, S2), equal(A, Theology)
Query Q is reduced into the following query Q′:
tuple(ans, ans()) ← tuple(b, B), tuple(f, F ), attr(B, isbn, S1), attr(F, isbn, S2),
equal(S1, S2), attr(F, subj, A), equal(A, Theology)
Notice that for every ordinary subgoal we introduce a tuple subgoal and create mechani-
cally a tuple id. For every attribute we introduce an attr subgoal. The tuple id for the
result relation ans is simply ans() because the result relation has no attributes. When the
query head has attributes, a single conjunctive query is reduced to a nonrecursive Datalog
program. For example, consider the following query that returns the authors and ISBNs of
books if their subject is Theology.
ans(au : X, isbn : S1) ← b(au : X, isbn : S1), f(subj : A, isbn : S2), equal(S1, S2),
equal(A, Theology)
This query is reduced to the following program Q′ where the first rule defines the tuple part
of the standard schema answer and the last two rules describe the attr part.
tuple(ans, ans(au, X, isbn, S1)) ← tuple(b, B), tuple(f, F ), attr(B, isbn, S1),
attr(F, isbn, S2), equal(S1, S2), attr(B, au, X),
attr(F, subj, A), equal(A, Theology)
attr(ans(au, X, isbn, S1), au, X) ← tuple(b, B), tuple(f, F ), attr(B, isbn, S1),
attr(F, isbn, S2), equal(S1, S2), attr(B, au, X),
attr(F, subj, A), equal(A, Theology)
attr(ans(au, X, isbn, S1), isbn, S1) ← tuple(b, B), tuple(f, F ), attr(B, isbn, S1),
attr(F, isbn, S2), equal(S1, S2), attr(B, au, X),
attr(F, subj, A), equal(A, Theology)
In general, the reduction is accomplished by the following procedure:
Procedure 6.3.2 (Reduction) If Q’s head is ans(a1 : V1, . . . , an : Vn), generate a program
with n + 1 rules such that
1. One rule has head tuple(ans, ans(a1, V1, . . . , an, Vn)),
2. For every attribute ai, i = 1, . . . , n there is a rule with head
attr(ans(a1, V1, . . . , an, Vn), ai, Vi), and
3. All rules have the same body which is constructed by the following steps:
(a) For every subgoal of Q of the form r(a1 : X1, . . . , am : Xm), invent and associate
to it a unique variable T . The variables such as T bind to tuple id’s of the
standard schema database and hence we call them tuple id variables.
(b) Include in the standard schema query body the subgoal tuple(r, T ).
(c) For every attribute ai, i = 1, . . . ,m include in the standard schema query the
subgoal attr(T, ai, Xi).
(d) Add to the body all equality subgoals of the original query.
2
Xi can be a variable, a token, or a constant. It is easy to see that, under a few obvious
constraints, the inverse reduction also exists.
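For the boolean case (a head with no attributes, so a single rule), Procedure 6.3.2 can be sketched as follows; the query encoding is our own, and tuple-id variables are invented as T0, T1, . . . :

```python
def reduce_boolean_query(subgoals, equalities):
    """subgoals: [(relation, {attr: var})]; equalities: [(lhs, rhs)].
    Returns the standard-schema rule for a boolean query ans()."""
    body = []
    for i, (rel, attrs) in enumerate(subgoals):
        t = f"T{i}"                       # invent a unique tuple-id variable
        body.append(f"tuple({rel}, {t})")
        for a, x in attrs.items():        # one attr subgoal per attribute
            body.append(f"attr({t}, {a}, {x})")
    body += [f"equal({l}, {r})" for l, r in equalities]
    return "tuple(ans, ans()) <- " + ", ".join(body)

q = reduce_boolean_query(
    [("b", {"isbn": "S1"}), ("f", {"subj": "A", "isbn": "S2"})],
    [("S1", "S2"), ("A", "Theology")],
)
print(q)
```

When the head has n attributes, the same body would be emitted n + 1 times, once under the tuple head and once per attr head, as in the procedure above.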
Next we show how we reduce RQDL descriptions into p-Datalog descriptions over stan-
dard schema databases.
6.3.3 Reduction of RQDL programs to Datalog programs over the standard schema
In the previous sections we showed how schema information, i.e., relation and attribute
names, becomes data in standard schema databases. Based on this idea, we will reduce
RQDL descriptions to p-Datalog descriptions that do not use higher order features such
as metapredicates and vectors. In particular, we “reduce” vectors to tuple identifiers.
Intuitively, if a vector matches with the arguments of a subgoal, then the tuple identifier
associated with this subgoal is enough for finding all the attr-variable pairs that the vector
will match to. Otherwise, if a vector →V is the result of a union of two other vectors →V1
and →V2, then we associate with it a new constructed tuple id, the function u(T1, T2),
where T1 and T2 are the tuple ids that correspond to →V1 and →V2. As we will see later,
the reduction carefully produces a program which terminates despite the use of the u
function.
Example 6.3.3 Let us first consider a simple but interesting one-rule description:
ans(→V ) ← $p(→V ), item(→V , name, X)
This RQDL rule describes all selection-projection queries that refer to any schema over one
relation, with the constraint that the schema of the relation contains an attribute “name.”
This description reduces to the following p-Datalog description:
tuple(ans, ans(T )) ← tuple($p, T ), attr(T1, name, X), equal(T, T1)
attr(ans(T ), $a, X) ← tuple(ans, ans(T )), attr(T1, $a, X), equal(T, T1)
The vector variable →V is reduced to the variable T , which matches with a tuple id. The
metapredicate item(→V , name, X) is reduced to the predicate attr(T, name, X). 2
Example 6.3.4 The description of Example 6.1.2 describes all boolean conjunctive queries.
It reduces into the following p-Datalog description (with function symbols):
tuple(ans, ans(T )) ← tuple(cond , cond(T )) (1)
attr(ans(T ), $a,X) ← attr(T1, $a,X), tuple(ans, ans(T2)), equal(T, T1, T2)
tuple(cond , cond(T )) ← tuple($p, T1), tuple(cond , cond(T2)), valid(T, T1, T2)
attr(cond(T ), $a,X) ← attr(T1, $a,X), tuple(cond , cond(T2)), equal(T, T1, T2)
tuple(cond , cond(T )) ← attr(T ′, $a,X), equal(X, $c), tuple(cond , cond(T )),
equal(T ′, T )
tuple(cond , cond(T )) ← attr(T1, $a1, X1), attr(T2, $a2, X2), equal(X1, X2),
tuple(cond , cond(T )), equal(T, T1, T2)
tuple(cond , cond(T )) ← tuple($p, T )
and subset flag(1) = 1.
The reduction of each rule is independent of the reduction of other rules. Notice that the
metapredicate subset in the first rule was reduced into a subset flag set on the rule. In
the third rule, notice that we reduced →V to T, which is "produced" by the predicate valid,
given T1 and T2. The predicate valid is defined by the simple rule

valid(T, T1, T2) ← sort(T, u(T1, T2)) (6.1)

and the rules for sort, which are given in Appendix B. Sort is a standard list-sorting routine
that takes a list (in the form of an arbitrary u-term) as input, sorts it, and returns the sorted
list (in the form of a right-deep u-term).
The predicate valid constructs a valid, new tuple id for the vector union that has
"associated" with it all the attributes associated with the union of →V1 and →V2. The new
tuple id is uniquely determined by the tuple ids of →V1 and →V2. In particular, it is the
right-deep ordered binary tree whose leaves are the tuple ids in the unioned vectors. For the
union of the attribute lists of three relations with attribute vectors →V1, →V2 and →V3, which
reduce to tuple ids T1, T2 and T3 respectively (without loss of generality, we assume that
T1 < T2 < T3), the tuple id generated would be u(T1, u(T2, T3)); no other u-term of
"length 3" is produced by sort and, consequently, valid. For another example,
valid(T, u(t2, u(t3, t4)), u(t3, t5)) will bind T (by calling sort) to the sorted, right-deep
u-term with leaves t2, . . . , t5, that is, u(t2, u(t3, u(t4, t5))). T will be bound to the same
u-term by valid(T, u(t3, u(t4, t5)), u(t2, t5)) also.
The above definition guarantees that the number of valid tuples generated from executing
the valid rules on any canonical database is bounded, and determined by the number of
distinct tuple ids appearing in the database.
Finally, the description also has to include the rules of Figure 6.1, which make sure that
all attributes of tuples with ids T1 and T2 are also attributes of tuples with id T, constructed
from T1 and T2. (We did not need to include these rules in Example 6.3.3.)

attr(T, $a,X) ← attr(T1, $a,X), valid(T, T1, T2)
attr(T, $a,X) ← attr(T2, $a,X), valid(T, T1, T2)

Figure 6.1: Default rules for generation of attr tuples
2
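The canonical tuple-id construction performed by valid and sort can be captured in a few lines. The following is a minimal Python sketch (not the Datalog sort program of Appendix B): it flattens a u-term to its set of leaf tuple ids, sorts them, and rebuilds the unique right-deep u-term, so that any order of unions yields the same id.

```python
from functools import reduce

def leaves(term):
    """Collect the leaf tuple ids of an arbitrary u-term.
    A u-term is either a plain id or a nested tuple ('u', left, right)."""
    if isinstance(term, tuple):
        return leaves(term[1]) | leaves(term[2])
    return {term}

def sort_uterm(term):
    """Return the canonical form: a right-deep u-term over the
    sorted, duplicate-free leaves of the input."""
    ids = sorted(leaves(term))
    if len(ids) == 1:
        return ids[0]
    # Fold the sorted ids into a right-deep binary tree.
    return reduce(lambda right, tid: ("u", tid, right), reversed(ids[:-1]), ids[-1])

def valid(t1, t2):
    """valid(T, T1, T2): bind T to the unique tuple id for the union."""
    return sort_uterm(("u", t1, t2))
```

For instance, valid applied to u(t2, u(t3, t4)) and u(t3, t5) yields u(t2, u(t3, u(t4, t5))), as does valid applied to u(t3, u(t4, t5)) and u(t2, t5), mirroring the example above.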
Formally, an RQDL description P is reduced to a p-Datalog description P ′ by reducing
each rule r of the description into p-Datalog with functions as follows:
Procedure 6.3.5 (Reduction)
1. If r has a subgoal of the form union(→V, →V1, →V2), include in the reduction the valid
rule valid(T, T1, T2) ← sort(T, u(T1, T2)) and the rules for sort, as well as the rules
of Figure 6.1.
2. Reduce predicates that do not involve vectors as described in Section 6.3.2.
3. For each subgoal of the form r(→V), where r is not a recursive predicate, include in the
reduced rule a subgoal tuple(r, T), where T is the reduction of →V.
4. For each subgoal of the form r(→V), where r is a recursive predicate, include in the
reduced rule a subgoal tuple(r, r(T)).
5. For each subgoal of the form item(→V, a, X), where a is a token or a constant, include
in the reduced rule the subgoal attr(T, a, X), where T is the reduction of →V.
6. For each subgoal of the form union(→V, →V1, →V2), replace in the reduced rule all
instances of →V with T and include the subgoal valid(T, T1, T2), where T1 and T2 are
the reductions of →V1 and →V2.
7. For each subgoal of the form subset(→V1, →V), let T1 be the reduction of →V1 and T be
the reduction of →V. Replace T1 by T in the rule where subset appears, set the subset
flag for the rule to 1 (see below), and drop the subset subgoal.
8. If the head is of the form p(→V), then reduce it to tuple(p, p(T)). Moreover, add the
following rule to the reduction:

attr(p(T), $a,X) ← attr(T1, $a,X), tuple(p, p(T2)), equal(T, T1, T2)
9. If the head of r is of the form p(attr-var set), then follow Procedure 6.3.2 to generate
all the p-Datalog rules that r reduces to.
2
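To make the shape of the reduction concrete, here is a small Python sketch of steps 3, 5, and 8 for a single non-recursive rule, using a hypothetical tuple-based encoding of subgoals. It omits rectification, recursion, union and subset handling, and the extra attr rule that step 8 adds; the naming convention for tuple-id variables is an assumption for illustration.

```python
def tid(vec):
    # Hypothetical naming convention: vector V reduces to tuple-id
    # variable T, V1 to T1, V2 to T2, and so on.
    return "T" + vec[1:]

def reduce_rule(head_pred, head_vec, subgoals):
    """Sketch of Procedure 6.3.5, steps 3, 5 and 8, for one rule.
    Subgoals are encoded as ('rel', name, vec) or ('item', vec, attr, var)."""
    body = []
    for g in subgoals:
        if g[0] == "rel":          # step 3: r(V) -> tuple(r, T)
            _, name, vec = g
            body.append(("tuple", name, tid(vec)))
        elif g[0] == "item":       # step 5: item(V, a, X) -> attr(T, a, X)
            _, vec, a, x = g
            body.append(("attr", tid(vec), a, x))
    # step 8: head p(V) -> tuple(p, p(T))
    head = ("tuple", head_pred, f"{head_pred}({tid(head_vec)})")
    return head, body
```

On the rule of Example 6.3.3 this produces tuple(ans, ans(T)) ← tuple($p, T), attr(T, name, X), i.e., the first reduced rule before rectification introduces the equal subgoals.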
The use of the subset flag of the head of a rule will be explained in Section 6.4.1. The
intuition behind it is as follows: Assume the existence of a subgoal subset(→V1, →V) in rule r.
As we have said earlier, →V1 must appear in the rule head, so let the head of r be p(→V1).
Also, →V must appear in an ordinary subgoal, e.g., q(→V). The subset subgoal means that the
RQDL rule r describes all conjunctive queries whose head attribute set is any projection
of the attribute set of relation q. In the reduction, we replace T1 (the reduction of →V1) by
T (the reduction of →V), effectively saying that the attribute set of p must be the same as
the attribute set of q. That is why we set a flag on the rule, the subset flag: it reminds us,
when deciding query expressibility and query description, to also consider as described those
conjunctive queries that include projections on q.
Theorem 6.3.6 Let P be an RQDL description and P ′ its reduction in p-Datalog with
functions. Let also DB be a canonical standard schema database of a query Q. Then P ′
applied on DB terminates.
Proof: It suffices to see that the generation of u-terms cannot fall into an infinite loop.
For every two tuple ids in the canonical database, a new u-term is generated by a call to
valid, which in turn calls sort. By a simple inductive argument on the number of calls to
sort, it follows that for every two tuple ids in the canonical database, a unique u-term is
generated. Consequently, if n tuple ids are present in the canonical database, at most
exponentially many u-terms (in n) can be generated. 2
The next section explains the semantics of p-Datalog with functions, and shows how to solve
the CBR problem for RQDL using the algorithms developed for p-Datalog in Sections 5.2
and 5.3.
6.4 QED and CBR for RQDL descriptions
The reduction presented in the previous section allows us to formulate a solution to the
expressibility problem and CBR problems for RQDL descriptions.
In particular, we show in the next section that we can use QED with small changes for p-
Datalog with function symbols; we prove that the modified QED is sound and complete over
the fragment of p-Datalog with function symbols that is generated by the RQDL reduction.
In the remaining sections, we will denote that fragment of p-Datalog with function symbols
by p-Datalogf .
The result of applying the QED algorithm to a query Q and an RQDL description P is a
finite set of expansions of P , i.e., a finite set of queries (over the original schema) described
by P . We show in Section 6.4.2 that these expansions are the only relevant ones in trying
to answer Q using queries expressible by P . Therefore, as in Section 5.3, the CBR problem
for RQDL is immediately reduced to the problem of rewriting a conjunctive query using a
finite set of conjunctive views.
6.4.1 The query expressibility problem for RQDL
We first illustrate QED for RQDL with an example. Notice that there are now two “desig-
nated” predicates, the predicates tuple and attr.
Example 6.4.1 Consider the query Q: ans(a : X) ← books(au : X, titl : Y ) and the
description

ans(a : X) ← $r(au : X, titl : Y )
ans(b : Y ) ← $r(au : X, titl : Y )
The reduction of the query is
tuple(ans, ans(a,X)) ← tuple(books, T0), attr(T1, au,X), attr(T2, titl, Y ), equal(T0, T1),
equal(T0, T2)
attr(ans(a,X), a,X) ← tuple(books, T0), attr(T1, au,X), attr(T2, titl, Y ), equal(T0, T1),
equal(T0, T2)
The canonical DB is
tuple(books, t0), attr(t1, au, x), attr(t2, titl, y), equal(t0, t1, t2)
The reduction of the description (after rectification) is
tuple(ans, ans(a,X)) ← tuple($r, T ), attr(T1, au,X), attr(T2, titl, Y ), equal(T, T1),
equal(T, T2)
attr(ans(a,X), a, X) ← tuple($r, T ), attr(T1, au,X), attr(T2, titl, Y ), equal(T, T1),
equal(T, T2)
tuple(ans, ans(b, Y )) ← tuple($r, T ), attr(T1, au,X), attr(T2, titl, Y ), equal(T, T1),
equal(T, T2)
attr(ans(b, Y ), b, Y ) ← tuple($r, T ), attr(T1, au,X), attr(T2, titl, Y ), equal(T, T1),
equal(T, T2)
Notice that we didn’t include the rules of Figure 6.1 or the valid rules in the reduced
description, since the original description didn’t contain any metapredicates.
If we run Algorithm 5.2.4 on the canonical DB, the following extended facts are
produced:
(1) < tuple(ans, ans(a, x)), {tuple(books, t0), attr(t1, au, x), attr(t2, titl, y), equal(t0, t1, t2)} >
(2) < attr(ans(a, x), a, x), {tuple(books, t0), attr(t1, au, x), attr(t2, titl, y), equal(t0, t1, t2)} >
< tuple(ans, ans(b, y)), {tuple(books, t0), attr(t1, au, x), attr(t2, titl, y), equal(t0, t1, t2)} >
< attr(ans(b, y), b, y), {tuple(books, t0), attr(t1, au, x), attr(t2, titl, y), equal(t0, t1, t2)} >
The output of the algorithm includes extended facts with the same tuple id. We “group”
together the extended facts with the same tuple id. We notice that the group consisting of
the extended facts (1) and (2) corresponds exactly to the two conjunctive queries that are
the reduction of Q. Therefore Q is expressible by our description. 2
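The grouping step at the end of the example is a plain bucketing of extended facts by their query id. The following is a minimal sketch, assuming each fact is encoded as a (predicate, argument-tuple) pair; the encoding itself is an illustrative assumption.

```python
from collections import defaultdict

def query_id(fact):
    # A tuple(ans, qid) fact carries the query id in its second
    # argument; an attr(qid, a, x) fact carries it in its first.
    pred, args = fact
    return args[1] if pred == "tuple" else args[0]

def group_by_id(extended_facts):
    """Group extended (fact, supporting-set) pairs by query id."""
    groups = defaultdict(list)
    for fact, support in extended_facts:
        groups[query_id(fact)].append((fact, support))
    return dict(groups)
```

On the four extended facts above, this yields two groups, one per query id: the group for ans(a, x), containing facts (1) and (2), and the group for ans(b, y).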
Before presenting the theorem that states the condition for RQDL expressibility, let us
make the following important observations:
• The only combination of function symbols and recursion is for the invention of tuple
ids (u-terms) for unioned attribute vectors.
• The valid program produces the same tuple id for the union of attribute vectors
→V1, . . . , →Vn, regardless of the ordering of the unions, merely by ordering the tuple ids
of the unioned attribute vectors.
• Lemmata 5.2.6 and 5.2.8 and Theorem 5.2.7 hold for p-Datalogf .
• Let Q be a conjunctive query and let {Qi | 1 ≤ i ≤ n} be the set of standard schema
queries it reduces to. Let Hi be the heads of those queries. As we pointed out
in Section 6.3.2, all Qi have the same body. Moreover, for Q1, H1 is of the form
tuple(ans, T ), where T is a term that denotes a tuple id, and for all Qi, i ≠ 1, Hi is
of the form attr(T, ci, Xi) for the same T. We call T the query id. In reference to the
previous example, the query id is ans(a,X).
Theorem 6.4.2 Let Q be a conjunctive query and {Qi | 1 ≤ i ≤ n} be its standard schema
reduction. Q is expressible by an RQDL description P without the subset metapredicate if
and only if there exists a maximal set {Q′i | 1 ≤ i ≤ n} of queries described by the reduced
description P′, where all Q′i have the same id, such that Q′i ≡ Qi for all 1 ≤ i ≤ n.
("Maximal" means that {Q′i} includes all described queries with that same query id.)
Referring again to Example 6.4.1, the maximal set {Q′i} is the set of queries corresponding
to extended facts (1) and (2).
Note that the exact “value” of tuple ids is not important: their use is to identify com-
ponents (i.e., attributes) of the same relation. Therefore, we say that a reduced query Q in
p-Datalogf is expressible by a reduced p-Datalogf description P if and only if there exists
Q′ equivalent to Q up to tuple-id naming that is described by P .
Proof: The above theorem is easy to see in the case where the RQDL description contains
no vectors. When the RQDL description contains vectors, the intuition is as follows: Let
Q be a conjunctive query, and let {Qi|1 ≤ i ≤ n} be the set of standard schema queries
it reduces to. Also let P be the RQDL description and Pred be the reduced p-Datalogf
description.
For the ONLY IF direction: The reduction directly maps the RQDL rules to rules
"producing" tuple subgoals, so it ensures that if Q is expressible by P , then Q1 is expressible
by Pred. Because of the expressibility of Q1 and of rules 8 and 1 in the reduction, all {Qi}
are also expressible.
The IF direction follows from the definition of RQDL expressibility, Procedure 6.3.5,
and the fact that all Q′i have the same query id. 2
Because of Theorems 6.3.6 and 6.4.2, we can use Algorithm QED (see Section 5.2) to
answer the expressibility question in RQDL. QED generates all possible extended facts for
tuple and attr. We then check whether (i) all and only the necessary “frozen” tuple and attr
facts are produced and have the same id, and (ii) their corresponding queries are equivalent
to the Qi’s.
For the algorithm to work properly, a change needs to be made to the definition of the
supporting set of a fact. (We could have the same effect by correspondingly changing the
RQDL to p-Datalogf reduction procedure.) Due to the reduction introduced in Sections 6.3.2
and 6.3.3, there is an implicit connection between a fact tuple(const1, T ) and facts
attr(T, const2, X), i.e., between the tuple fact and the attribute facts that are created by the
reduction. We make that connection explicit by modifying the definition of "supporting set"
as follows:
Definition: Supporting Set (Modified) Let h be an ordinary fact produced by an
application of the p-Datalogf rule

r : H ← G1, . . . , Gk, E1, . . . , Em

of a (reduced) p-Datalogf description P on a database DB that consists of a canonical
database CDB and other facts, and let µ be a mapping from the rule into DB such that
µ(Gi), µ(Ej) ∈ DB and h = µ(H). The set Sh of supporting facts of h, or supporting set of
h, with respect to P , is the smallest set such that
• if µ(Gi) ∈ CDB, then µ(Gi) ∈ Sh,
• if µ(Gi) ∉ CDB and S′ is the set of supporting facts of µ(Gi), then S′ ⊆ Sh,
• if tuple(c, t) ∈ Sh (where c can be a frozen or regular constant, and t a frozen or
regular constant or a ground term), then for all c′, x, if attr(t, c′, x) is in the canonical
DB, then attr(t, c′, x) ∈ Sh,
• if E is the set of all µ(Ei) ∈ Sh, then the smallest set of equality facts that includes
E and is an equivalence relation is included in Sh.
2
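The augmentation required by the third bullet can be sketched as a simple closure pass. Facts are encoded here as flat tuples such as ('tuple', c, t) and ('attr', t, a, x); the equality facts are assumed to have been pre-closed into equivalence classes of tuple ids, which stands in for the fourth bullet. Both encoding choices are assumptions for illustration.

```python
def close_support(support, canonical_db, equal_classes):
    """For every supported tuple(c, t) fact, pull into the supporting set
    every canonical attr fact whose tuple id is equal to t."""
    closed = set(support)
    for fact in list(closed):
        if fact[0] == "tuple":
            t = fact[2]
            # Ids equal to t (including t itself if it has no class).
            cls = equal_classes.get(t, {t})
            closed |= {f for f in canonical_db
                       if f[0] == "attr" and f[1] in cls}
    return closed
```

On Example 6.4.4 below, closing the support of tuple(ans, ans(t0)) pulls in attr(t2, subj, y), which is exactly the augmentation that produces extended fact (1) there.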
Modifications in the presence of subset subgoals: We have already explained that a
subset(→V, →V′) subgoal present in a rule r is reduced into a subset flag attached to r. Let
P be an RQDL description and Pred be its p-Datalogf reduction. Let Q be an expansion
of Pred (cf. Section 5.1.1). We say that the subset flag attaches to Q if Q is the result of
resolving rule r of Pred with rule s and the subset flag is attached to r. Then Theorem 6.4.2
can be restated more generally as follows:
Theorem 6.4.3 Let Q be a conjunctive query and {Qi | 1 ≤ i ≤ n} be its standard schema
reduction. Q is expressible by an RQDL description P if and only if there exists a maximal
set {Q′i | 1 ≤ i ≤ m} of queries described by the reduced description P′, where all Q′i have
the same id, such that Q′i ≡ Qi for all 1 ≤ i ≤ n. It is n ≤ m if the subset flag is attached
to Q1 and n = m otherwise. Maximal means that {Q′i} includes all described queries with
that same query id.

During execution of the QED algorithm, whenever a tuple fact is generated from a
rule that has the subset flag attached, a subset flag annotation is set on its tuple id. That
annotation is used after the execution is complete, together with Theorem 6.4.3, to determine
expressibility.
Let us consider the following example.
Example 6.4.4 If our RQDL description is

ans(→V) ← p(→V), item(→V, au, X)

as in Example 6.3.3, then the query Q : ans(au : X) ← p(au : X, subj : Y ) is not expressible
by our description. The reduction of the description is
tuple(ans, ans(T )) ← tuple(p, T ), attr(T1, au, X), equal(T, T1)
attr(ans(T ), $a,X) ← attr(T1, $a,X), tuple(ans, ans(T1)), equal(T, T1)
and the reduction of the query (i.e., the set {Qi}) is
tuple(ans, ans(au,X)) ← tuple(p, T ), attr(T, au, X), attr(T, subj, Y )
attr(ans(au,X), au,X) ← tuple(p, T ), attr(T, au, X), attr(T, subj, Y )
The canonical DB is then
tuple(p, t0), attr(t1, au, x), attr(t2, subj, y), equal(t0, t1, t2)
The extended facts produced by Algorithm 5.2.4, taking into account the modification of
the definition of supporting sets introduced above, are
(1) < tuple(ans, ans(t0)), {tuple(p, t0), attr(t1, au, x), attr(t2, subj, y), equal(t0, t1, t2)} >
(2) < attr(ans(t0), au, x), {tuple(p, t0), attr(t1, au, x), attr(t2, subj, y), equal(t0, t1, t2)} >
(3) < attr(ans(t0), subj, y), {tuple(p, t0), attr(t1, au, x), attr(t2, subj, y), equal(t0, t1, t2)} >
Let us look in more detail at how extended fact (1) was produced. Applying the
first rule of the p-Datalogf program generates
< tuple(ans, ans(t0)), {tuple(p, t0), attr(t1, au, x), equal(t0, t1)} >
The second rule of the program consequently fires and gives
< attr(ans(t0), au, x), {tuple(p, t0), attr(t1, au, x), equal(t0, t1)} >
and
< attr(ans(t0), subj, y), {tuple(p, t0), attr(t2, subj, y), equal(t0, t2)} >
Then, according to the modified definition of supporting set, we need to augment the
supporting set of tuple(ans, ans(t0)) to include attr(t2, subj, y), thus obtaining extended fact
(1). Performing the augmentation step cannot take more than an exponential amount of time.
Even though both standard-schema queries of the reduction are expressible by our reduced
description, the original query, as pointed out, is not expressible by the RQDL description.
That is because the only maximal set of described queries produced (consisting
of the queries corresponding to (1), (2) and (3)) is larger than the set of reduced queries.
On the other hand, if the description were

ans(→V) ← p(→V1), item(→V, au, X), subset(→V, →V1)
then Q is described by the modified description. The reduction of the description would
be exactly the same, but we would set the subset flag on the rule. Then, following The-
orem 6.4.3, the algorithm would decide correctly that Q is described by the modified de-
scription. 2
Let us consider a more complicated example of QED.
Example 6.4.5 The following source can accept queries that perform a join between relation
q and any other relation, over any set of attributes. The description of this source is a
simplification of description PCQ of Example 6.1.2.
ans(→V) ← cond(→V)
cond(→V) ← q(→V1), union(→V, →V1, →V2), cond(→V2)
cond(→V) ← item(→V, $a1, X1), item(→V, $a2, X2), equal(X1, X2), cond(→V)
cond(→V) ← $r(→V)
The reduction of the description, after rectification, is
tuple(ans, ans(T )) ← tuple(cond , cond(T ))
attr(ans(T ), $a,X) ← attr(T1, $a,X), tuple(ans, ans(T2)), equal(T, T1, T2)
tuple(cond , cond(T )) ← tuple(q, T1), tuple(cond , cond(T2)), valid(T, T3, T4),
equal(T1, T3), equal(T2, T4)
attr(cond(T ), $a,X) ← attr(T1, $a,X), tuple(cond , cond(T2)), equal(T, T1, T2)
tuple(cond , cond(T )) ← attr(T1, $a1, X1), attr(T2, $a2, X2), equal(X1, X2),
tuple(cond , cond(T )), equal(T, T1, T2)
tuple(cond , cond(T )) ← tuple($r, T )
attr(T, $a,X) ← attr(T1, $a,X), valid(T, T2, T3), equal(T1, T2)
attr(T, $a,X) ← attr(T1, $a,X), valid(T, T2, T3), equal(T1, T3)
plus the valid rules.
The user query submitted to the source is the following:
ans(au : X, ln : X, subj : Z)← q(au : X, subj : Z), s(ln : X)
(where ln stands for last name) which produces the extended canonical DB
tuple(q, t0), attr(t1, au, x), attr(t2, subj, z), tuple(s, t3), attr(t4, ln, x1), equal(t0, t1, t2),
equal(t3, t4), equal(x, x1)
The standard schema reduction of the user query is
tuple(ans, ans(au,X, ln,X, subj, Z)) ← tuple(q, Q), tuple(s, S), attr(Q1, au, X),
attr(Q2, subj, Z), attr(S1, ln,X1), equal(S, S1),
equal(X, X1), equal(Q,Q1, Q2)
attr(ans(au,X, ln, X, subj, Z), au, X) ← tuple(q, Q), tuple(s, S), attr(Q1, au, X),
attr(Q2, subj, Z), attr(S1, ln,X1), equal(S, S1),
equal(X, X1), equal(Q,Q1, Q2)
attr(ans(au,X, ln, X, subj, Z), ln,X) ← tuple(q, Q), tuple(s, S), attr(Q1, au, X),
attr(Q2, subj, Z), attr(S1, ln,X1), equal(S, S1),
equal(X, X1), equal(Q,Q1, Q2)
attr(ans(au,X, ln, X, subj, Z), subj, Z) ← tuple(q, Q), tuple(s, S), attr(Q1, au, X),
attr(Q2, subj, Z), attr(S1, ln,X1), equal(S, S1),
equal(X, X1), equal(Q,Q1, Q2)
Running Algorithm 5.2.4 on the canonical DB produces the following extended facts
after augmentation (only some of the extended facts produced are shown):
< valid(u(t0, t3), t0, t3), {tuple(q, t0), attr(t1, au, x), attr(t2, subj, z), tuple(s, t3),
attr(t4, ln, x1), equal(t0, t1, t2)} >
< tuple(cond, cond(t0)), {tuple(q, t0), attr(t1, au, x), attr(t2, subj, z), equal(t0, t1, t2)} >
< tuple(cond, cond(t3)), {tuple(s, t3), attr(t4, ln, x1), equal(t3, t4)} >
< tuple(cond, cond(u(t0, t3))), {tuple(q, t0), attr(t1, au, x), attr(t2, subj, z), tuple(s, t3),
attr(t4, ln, x1), equal(t0, t1, t2), equal(t3, t4)} >
< tuple(cond, cond(u(t0, t3))), {tuple(q, t0), attr(t1, au, x), attr(t2, subj, z), tuple(s, t3),
attr(t4, ln, x1), equal(t0, t1, t2), equal(x, x1), equal(t3, t4)} >
(1) < tuple(ans, ans(u(t0, t3))), {tuple(q, t0), attr(t1, au, x), attr(t2, subj, z), tuple(s, t3),
attr(t4, ln, x1), equal(t0, t1, t2), equal(t3, t4), equal(x, x1)} >
< attr(u(t0, t3), au, x), {attr(t1, au, x), equal(t0, t1)} >
< attr(u(t0, t3), subj, z), {attr(t2, subj, z), equal(t0, t2)} >
< attr(u(t0, t3), ln, x1), {attr(t4, ln, x1), equal(t3, t4)} >
(2) < attr(ans(u(t0, t3)), au, x), {tuple(q, t0), attr(t1, au, x), attr(t2, subj, z), tuple(s, t3),
attr(t4, ln, x1), equal(t0, t1, t2), equal(t3, t4), equal(x, x1)} >
(3) < attr(ans(u(t0, t3)), ln, x1), {tuple(q, t0), attr(t1, au, x), attr(t2, subj, z), tuple(s, t3),
attr(t4, ln, x1), equal(t0, t1, t2), equal(t3, t4), equal(x, x1)} >
(4) < attr(ans(u(t0, t3)), subj, z), {tuple(q, t0), attr(t1, au, x), attr(t2, subj, z), tuple(s, t3),
attr(t4, ln, x1), equal(t0, t1, t2), equal(t3, t4), equal(x, x1)} >
The maximal set of described queries with query id u(t0, t3) (corresponding to (1), (2), (3)
and (4)) is equal to the set of the standard schema queries that are the reduction of the user
query. Therefore, the user query is expressible by our RQDL description, by Theorem 6.4.2.
2
6.4.2 The CBR problem for RQDL
We solve the CBR problem for a given query and a given RQDL description in two steps:
• We generate the set of relevant described queries from the output of Algorithm 5.2.4,
by “gluing” together the tuple and attr subgoals that have the same supporting set.
In other words, we create the corresponding standard schema queries for the extended
facts and then do the inverse reduction on the sets of those that have the same id and
body (thus ending up with queries on the original schema). These are the relevant
queries of the description with respect to the given query.
• Given the original query and the relevant queries (or views) that are expressible by the
given RQDL description, we can apply an appropriate algorithm for rewriting queries
using views, e.g., [Qia96; LMSS95] or [RSU95], to that problem.
The algorithm is correct because of the following theorem.
Theorem 6.4.6 (RQDL-CBR) Assume we have a query Q and an RQDL description P ,
and let {Qi} be the result of applying Algorithm 5.2.4 on Q and P . There exists a rewriting
Q′ of Q, such that Q′ ≡ Q, using any {Qj | Qj is expressible by P} if and only if there exists
a rewriting Q′′, such that Q′′ ≡ Q, using only {Qi}.

The proof follows directly from the proof of Theorem 5.3.2. Therefore, solving the query
expressibility problem for RQDL immediately reduces the CBR problem to the familiar
problem of answering the given query using a finite set of conjunctive views. The complexity
of the whole procedure is nondeterministic exponential in the input size.
Notice that, in the presence of subset subgoals in the RQDL description, the
QED algorithm produces candidate queries that may have the subset flag annotation set.
In principle, these annotations can be ignored for the solution of the CBR problem, since
we assume that the mediator has the capability to do projections locally (i.e., projections
can always be handled by the final rewriting at the mediator). Finally, it should be obvious
that the discussion of Subsection 5.3.1 about binding requirements holds for RQDL as well.
Example 6.4.7 We consider a source that expects a selection condition on attribute au or
on attribute subj, but not both. The RQDL description for this source is
ans(→V) ← $r(→V), item(→V, au, $c)
ans(→V) ← $r(→V), item(→V, subj, $c)
The description reduces to
tuple(ans, ans(T )) ← tuple($r, T ), attr(T1, au, X), equal(T, T1)
tuple(ans, ans(T )) ← tuple($r, T ), attr(T1, subj, X), equal(T, T1)
attr(ans(T ), $a,X) ← attr(T1, $a,X), tuple(ans, ans(T2)), equal(T, T1, T2)
Let the user query be
Q : ans(subj : X, au : Y, isbn : Z) ← books(subj : X, au : Y, isbn : Z), equal(X, Logic),
equal(Y, Smith)
It is obvious that Q can be answered with a combination of queries expressible by the
description: first send the selection condition on au, then the one on subj, and finally
intersect the two results. Q reduces to
tuple(ans, ans(subj,X, au, Y, isbn, Z)) ← tuple(books, T ), attr(T, au, X), attr(T, subj, Y ),
attr(T, isbn, Z), equal(X, Logic), equal(Y, Smith)
attr(ans(subj,X, au, Y, isbn, Z), subj, X) ← tuple(books, T ), attr(T, au, X), attr(T, subj, Y ),
attr(T, isbn, Z), equal(X, Logic), equal(Y, Smith)
attr(ans(subj,X, au, Y, isbn, Z), au, Y ) ← tuple(books, T ), attr(T, au, X), attr(T, subj, Y ),
attr(T, isbn, Z), equal(X, Logic), equal(Y, Smith)
attr(ans(subj,X, au, Y, isbn, Z), isbn, Z) ← tuple(books, T ), attr(T, au, X), attr(T, subj, Y ),
attr(T, isbn, Z), equal(X, Logic), equal(Y, Smith)
The canonical DB is then (for brevity, we do not perform full rectification):
tuple(books, t), attr(t, subj, x), attr(t, au, y), attr(t, isbn, z), equal(x, Logic), equal(y, Smith)
The extended facts generated by algorithm QED-T are shown in Figure 6.2 (only the
extended facts of interest are shown):
The result (after the inverse reduction) is two candidate conjunctive queries, with bind-
ing information:
C1 : ansbff (subj : X, au : Y, isbn : Z) ← books(subj : X, au : Y, isbn : Z)
and
C2 : ansfbf (subj : X, au : Y, isbn : Z) ← books(subj : X, au : Y, isbn : Z)

Using Q and C1, C2 as input to algorithm AnsBind, we get the expected answer. 2
6.5 Conclusions and Related Work
We described and extended RQDL, a capability description language that is provably more
expressive than p-Datalog. The extra power is mainly a result of vector variables, which can
match sets of attributes of arbitrary length. The existence of vector variables makes a direct,
brute-force implementation of query expressibility and CBR algorithms for RQDL very hard.
In [PGH96], a brute-force approach is proposed for a query expressibility algorithm; it tries to
< tuple(ans, ans(t)), {tuple(books, t), attr(t, subj, x), attr(t, au, y), attr(t, isbn, z),equal(x, Logic)} >
< tuple(ans, ans(t)), {tuple(books, t), attr(t, subj, x), attr(t, au, y), attr(t, isbn, z),equal(y, Smith)} >
< attr(ans(t), subj, x), {tuple(books, t), attr(t, subj, x), attr(t, au, y), attr(t, isbn, z),equal(x, Logic)} >
< attr(ans(t), au, y), {tuple(books, t), attr(t, subj, x), attr(t, au, y), attr(t, isbn, z),equal(x, Logic)} >
< attr(ans(t), isbn, z), {tuple(books, t), attr(t, subj, x), attr(t, au, y), attr(t, isbn, z),equal(x, Logic)} >
< attr(ans(t), subj, x), {tuple(books, t), attr(t, subj, x), attr(t, au, y), attr(t, isbn, z),equal(y, Smith)} >
< attr(ans(t), au, y), {tuple(books, t), attr(t, subj, x), attr(t, au, y), attr(t, isbn, z),equal(y, Smith)} >
< attr(ans(t), isbn, z), {tuple(books, t), attr(t, subj, x), attr(t, au, y), attr(t, isbn, z),equal(y, Smith)} >
Figure 6.2: Extended facts produced by Algorithm QED-T for Example 6.4.7
generate instantiated terminal expansions bottom up, so that vectors match with sets during
the derivation. This approach soon leads to complicated problems that force [PGH96] to
restrict the applicability of matching algorithms to a subset of RQDL descriptions. Con-
sequently, the query expressibility algorithm proposed in [PGH96] is not applicable to all
RQDL descriptions.
We provide a reduction of RQDL descriptions into p-Datalog augmented with function
symbols that construct unique tuple ids, similar to the invention of semantic object ids
in DSL (see Chapter 3) and the ideas in [Mai86; KL89]. Using this reduction, we provide
complete algorithms for solving the expressibility and CBR problems. Moreover, we demon-
strate how to automatically derive an RQDL description of the capabilities of a mediator,
given the descriptions of the capabilities of the sources it accesses.
Appendix A
Enabling Integration: TSIMMIS
Wrappers
In order to access information from a variety of heterogeneous information sources, TSIMMIS
must translate queries and data from one data model into another. This functionality
is provided by source wrappers [Lew91; Wel] that convert queries into one or more
commands/queries understandable by the underlying source and transform the native results
into a format understood by the application. Anyone who has built a wrapper (and as part
of the TSIMMIS project we developed hard-coded wrappers for a variety of sources, such as
the Sybase DBMS, WWW pages, and legacy systems like Folio) can attest that wrapper
development is a laborious task. In situations where it is important or desirable to gain
access to new sources quickly, this is a major drawback. However, we have also observed
that only a relatively small part of the code deals with the specific access details of the
source. The rest of the code is either common among wrappers or implements query and
data transformations that could be expressed in a high-level, declarative fashion.
Based on these observations, I built a toolkit for the rapid implementation of wrappers.
The toolkit contains a library of commonly used functions, such as for receiving
queries from the application and packaging results. It also contains a facility for translating
queries into source-specific commands, and for translating results into a model useful to
the application. The philosophy behind the "template-based" translation methodology
is as follows. The wrapper implementor specifies a set of templates (rules), written in a
high-level, declarative language, that describe the queries accepted by the wrapper. If an
application query matches a template, an implementor-provided action associated with the
template is executed to provide the native query for the underlying source. The native query
is not necessarily a string of a well-structured query language, e.g., SQL. In general, it is
any program used to access and retrieve information from the underlying source. When
the source returns the results of the query, the wrapper transforms the answer objects
which are represented in the data model of the source into a representation that is used
by the application. Using this toolkit one can quickly design a simple wrapper with a few
templates that cover some of the desired functionality, probably the one that is most urgently
needed. However, templates can be added gradually as more functionality is required later
on. In addition, the libraries in the toolkit provide the ability to perform post-processing
on the result if required, e.g., the incoming query did not match exactly with any of the
templates. In a sense, this post-processing capability allows the wrapper builder to enhance
the usefulness of the source by adding query capabilities to the wrapper that are not natively
supported.
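The core of template-based translation can be sketched in a few lines. The sketch below is illustrative only: regular expressions stand in for the toolkit's DSL templates, and the names `Template` and `translate` are hypothetical, not the toolkit's actual interfaces.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Template:
    pattern: str   # stands in for a DSL template; here a plain regex
    action: str    # parametrized native query with $1, $2, ... placeholders

def translate(query: str, templates: list) -> Optional[str]:
    """Return the native query produced by the first matching template."""
    for t in templates:
        m = re.fullmatch(t.pattern, query)
        if m:
            native = t.action
            # Substitute the bindings captured by the match into the action.
            for i, val in enumerate(m.groups(), start=1):
                native = native.replace(f"${i}", val)
            return native
    return None  # no template matched; reject, or fall back to post-processing

templates = [Template(r"papers by (\w+)",
                      "select * from book where author = '$1'")]
print(translate("papers by Jones", templates))
# select * from book where author = 'Jones'
```

A query that matches no template yields `None`, the case the post-processing facility described above is designed to soften.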
Figure A.1: Wrapper architecture
Another important use of wrappers is in extending the query capabilities of a source.
For instance, some sources may not be capable of answering queries that have multiple
predicates. In such cases, it is necessary to pose a native query to such a source using only
Figure A.2: Wrapper components and procedure calls
predicates that the source is able to handle. The rest of the predicates will be automatically
separated from the query and handled locally through a filter query. When the wrapper
receives the results, a post-processing engine applies the filter query. This engine supports
a set of built-in predicates based on the comparison operators =, ≠, <, >, etc. In addition,
the engine can support more complex predicates that can be specified as part of the filter
query. The post-processing engine is common to wrappers of all sources and is part of the
wrapper toolkit. It is simply a lightweight version of the mediator query engine shown
in Figure 1.6. As we noted for mediators, the post-processing engine gives the wrapper the
ability to handle a much larger class of queries than those that exactly match the templates
it had been given.
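As a concrete illustration, the built-in comparison predicates of such a post-processing engine amount to filtering the answer objects. The following sketch assumes answer objects are simple dictionaries; the engine's real interfaces are not shown here.

```python
import operator

# Built-in comparison predicates (=, ≠, <, >, ...) of the filter engine.
BUILTINS = {"=": operator.eq, "!=": operator.ne,
            "<": operator.lt, "<=": operator.le,
            ">": operator.gt, ">=": operator.ge}

def apply_filter(objects, field, op, value):
    """Keep only the answer objects whose `field` satisfies the predicate."""
    pred = BUILTINS[op]
    return [o for o in objects if field in o and pred(o[field], value)]

answers = [{"title": "Old paper", "year": 1979},
           {"title": "New paper", "year": 1990}]
print(apply_filter(answers, "year", "<", 1984))
# [{'title': 'Old paper', 'year': 1979}]
```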
Figure A.1 shows an overview of the wrapper architecture as currently implemented
in our tsimmis testbed. The rounded components are provided by the toolkit; the rectangular
native component is source-specific and must be written by the implementor. The driver
component controls the translation process and invokes the following services of the toolkit:
the parser that parses the templates as well as the incoming queries into internal data
structures, the matcher that matches a query against the set of templates and creates a
filter query for post-processing if necessary, the native component that submits the generated
action string and receives the native result, and the engine that transforms and packages the
result and applies a post-processing filter if one has been created by the matcher. Figure A.2
shows in detail the procedure calls made between the various components. It also shows
that the native component is well-encapsulated, consisting simply of four procedures that
need to be implemented by the wrapper implementor, and a native template, describing the
structure of the native results. We now use a concrete example to describe the sequence of
events that occur at the wrapper during the translation of a query and its result. Queries
are formulated using a simple extension of DSL, and the wrapper turns the native results
into OEM objects.
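The call sequence just described can be summarized operationally as follows. The component interfaces (`parse`, `match`, `submit`, `transform`, `postprocess`) are illustrative names, not the toolkit's real procedure signatures.

```python
from types import SimpleNamespace

def driver(query, toolkit, native):
    """Illustrative wrapper driver: parser -> matcher -> native -> engine."""
    parsed = toolkit.parse(query)          # parser: query -> internal form
    match = toolkit.match(parsed)          # matcher: action + optional filter
    if match is None:
        raise ValueError("no template matches the query")
    raw = native.submit(match["action"])   # native component: run native query
    result = toolkit.transform(raw)        # engine: native result -> OEM-like
    if match["filter"] is not None:        # engine: apply post-processing filter
        result = toolkit.postprocess(result, match["filter"])
    return result

# Trivial stubs that exercise the call sequence end to end.
toolkit = SimpleNamespace(
    parse=lambda q: q,
    match=lambda q: {"action": "select * from book", "filter": None},
    transform=lambda raw: [{"title": t} for t in raw],
    postprocess=lambda res, f: [o for o in res if f(o)],
)
native = SimpleNamespace(submit=lambda sql: ["Paper A", "Paper B"])
print(driver("all books", toolkit, native))
# [{'title': 'Paper A'}, {'title': 'Paper B'}]
```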
A.1 An Example
Let us assume that the source is a relational database containing bibliographic information
about papers and books. Suppose that the user is interested in all papers authored by
“Jones” and published before 1984. The corresponding DSL query is:
(Q53)  P :- P : <&O book {<year Y> <author "Jones">}> AND lt(Y, 1984)
The predicate lt(Y,1984) specifies that the < comparison operator be used. The
pattern variable P in this example binds to the contents of the whole object pattern. The
use of pattern variables is essentially a shortcut.
Upon receipt by the wrapper, the query is sent to the driver component which invokes
the parser. After the query is successfully parsed, the driver invokes the matcher to match
the query against a set of template rules. These rules describe the queries that are accepted
by the wrapper and are expressed in a simple extension of DSL with pattern variables and
tokens (see Chapter 5). Associated with each rule is an action string that describes the
corresponding native query. In our scenario, the action string is a parametrized SQL query.
In order to give an example of postprocessing using a filter query, we have chosen not to
include any predicates on year in the templates. That way, we are essentially acting as
if the source does not support the < predicate on year. In a “production” version of the
wrapper it is usually beneficial to make use of all the natively supported query facilities in
order to maximize efficiency (see discussion on query capabilities in Chapter 1).
Here is an example of a template (without the associated action) that matches the above
query:
(D9)  B :- B : <&I book {<author $X>}>
The above template matches the given query because the substitutions B ← P , I ← O,
$X ← “Jones” transform the template into a DSL expression that is contained in the given
query. Note that we could have designed a template that matches the input query exactly,
had we decided to let the source execute the year predicate.
Using the following associated action
// $$ = "select * from book
         where author = " $X //
and the substitution $X ← “Jones”, the matcher produces the following native SQL query:
select *
from book
where author = "Jones"
The driver then invokes the query processing part of the native component which sub-
mits the native query to the source. When the result is returned, the driver invokes the
query engine to perform the necessary post-processing: the wrapper must remove from the
answer all publications that were not published before 1984 (since the original query asked
only for publications before 1984). This is done by applying the following DSL filter query to
the result:
B :- B : <book {<year Y>}> AND lt(Y, 1984)
Specifically, the post-processing engine takes each answer object in the native query
result, extracts the year field of the object, and checks whether it is less than 1984. If so, the object
is included in the result constructed by the engine. After the post-processing, the engine
creates an OEM answer object containing the desired publications. Finally, the driver
component returns the OEM result to the application that issued the query.
A.2 Implemented Wrappers
The toolkit has been used to wrap the following four different types of sources containing
bibliographic data in heterogeneous formats.
1. A University-owned legacy system called folio, which is accessible through an inter-
active front-end (called inspec).
2. A Sybase relational DBMS, which is accessible through SQL.
3. A collection of UNIX files, which are accessible through a Perl script file.
4. A World-Wide Web source which is accessible through a Python script file.
Although the four sources support different access methods, the wrappers hide all
source-specific details from the application/end-user by exporting a common interface to
the underlying data, independently of where and how it is stored. By adding new templates
or modifying existing ones, it is easy to quickly extend the query capabilities of a wrapper,
as well as the structure of the resulting answers, without writing a single line of code.
Appendix B
Sort program
Sort is an almost-standard list-sorting routine that takes a list (in the form of an arbitrary
u-term) as input, turns it into a right-deep u-term, sorts it, deletes duplicates, and
returns the sorted list (in the form of a right-deep u-term). The sorting algorithm used is
selection sort [CLR90]. The rules to merge two right-deep u-terms are omitted.
sort(T, u(T1, T2)) ← uniquesort(T, U), rd(U, u(T1, T2))
rd(T, T) ← tuple(N, T)
rd(U, u(T1, T2)) ← rd(U1, T1), rd(U2, T2), merge(U, U1, U2)
uniquesort(T, u(T1, u(T1, T))) ← selsort(u(T1, u(T1, T)), U)
uniquesort(T, T) ← selsort(T, T), tuple(N, T)
selsort(T, T) ← tuple(N, T)
selsort(u(Min, T), u(T1, T2)) ← findMin(Min, Rest, u(T1, T2)), selsort(T, Rest)
findMin(T1, T2, u(T1, T2)) ← tuple(N1, T1), tuple(N2, T2), T1 < T2
findMin(Min, u(T1, Rest), u(T1, T2)) ← findMin(Min, Rest, T2), tuple(N, T1), T1 > Min
Figure B.1: A logic program implementing selection sort
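For readers who prefer an operational reading of the rules, the same computation (selection sort followed by duplicate elimination) can be sketched in Python. This is an illustration of the algorithm only, not a translation of the u-term representation.

```python
def selection_sort(items):
    """Selection sort, mirroring the selsort/findMin rules operationally."""
    items = list(items)
    for i in range(len(items)):
        # findMin: locate the minimum of the still-unsorted suffix.
        m = min(range(i, len(items)), key=items.__getitem__)
        items[i], items[m] = items[m], items[i]
    return items

def unique_sort(items):
    """Sort and delete duplicates, as the uniquesort rules do."""
    ordered = selection_sort(items)
    out = []
    for x in ordered:
        if not out or out[-1] != x:  # duplicates are adjacent after sorting
            out.append(x)
    return out

print(unique_sort([3, 1, 3, 2]))
# [1, 2, 3]
```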
Bibliography
[A+91] R. Ahmed et al. The Pegasus heterogeneous multidatabase system. IEEE Computer,
24:19–27, 1991.
[AAB+98] J. L. Ambite, N. Ashish, G. Barish, C. A. Knoblock, S. Minton, P. J. Modi, Ion
Muslea, A. Philpot, and S. Tejada. ARIADNE: A system for constructing mediators
for internet sources. In Proc. SIGMOD Conf., pages 561–563, 1998.
[ACHK93] Y. Arens, C.Y. Chee, C.-N. Hsu, and C.A. Knoblock. Retrieving and integrating
data from multiple information sources. Intl Journal of Intelligent and Cooperative
Information Systems, 2:127–158, June 1993.
[ACPS96] S. Adali, S. C. Candan, Y. Papakonstantinou, and V. S. Subrahmanian. Query caching
and optimization in distributed mediator systems. In Proc. SIGMOD, pages 137–48,
1996.
[AD98] S. Abiteboul and O. Duschka. Complexity of answering queries using views. In Proc.
PODS Conf., 1998.
[Adl] S. Adler et al. Extensible Stylesheet Language (XSL) 1.0. W3C Work-
ing Draft. Available at http://www.w3.org/TR/xsl. More information at
http://www.w3.org/Style/XSL.
[AGMPY98] S. Abiteboul, H. Garcia-Molina, Y. Papakonstantinou, and R. Yerneni. Fusion query
optimization. In Proc. EDBT Conf., pages 57–71, 1998.
[AHV95] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley,
1995.
[AK89] S. Abiteboul and P.C. Kanellakis. Object identity as a query language primitive. In
Proc. ACM SIGMOD Conference, pages 159–73, Portland, OR, May 1989.
[AK97] J. L. Ambite and C. A. Knoblock. Planning by rewriting: Efficiently generating high-
quality plans. In Proc. AAAI Conf., pages 706–713, 1997.
[AKL97] N. Ashish, C. A. Knoblock, and A. Levy. Information gathering plans with sensing
actions. In Fourth European Conference on Planning, 1997.
[ALW99] C. R. Anderson, A. Y. Levy, and D. S. Weld. Declarative web site management with
tiramisu. In Informal Proc. WebDB Workshop, pages 19–24, 1999.
[AQM+97] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel query
language for semistructured data. International Journal on Digital Libraries, 1(1):68–
88, April 1997.
[ASU87] A. Aho, R. Sethi, and J.D. Ullman. Compilers: Principles, Techniques, and Tools.
Addison-Wesley, 1987.
[BDFS97] P. Buneman, S. Davidson, M. Fernandez, and D. Suciu. Adding structure to unstruc-
tured data. In Proc. ICDT Conf., 1997.
[BDHS96] P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu. A query language and opti-
mization techniques for unstructured data. In Proc. ACM SIGMOD, 1996.
[Bla96] J. Blakeley. Data access for the masses through OLE DB. In Proc. ACM SIGMOD
Conf., pages 161–72, 1996.
[BLN86] C. Batini, M. Lenzerini, and S. B. Navathe. A comparative analysis of methodologies
for database schema integration. ACM Computing Surveys, 18:323–364, 1986.
[BLR97] C. Beeri, A. Y. Levy, and M.-C. Rousset. Rewriting queries using views in description
logics. In Proc. PODS Conf., pages 99–108, 1997.
[BM] P. V. Biron and A. Malhotra. XML Schema Part 2: Datatypes. W3C Working Draft.
Latest version available at http://www.w3.org/TR/xmlschema-2/.
[BPSM] T. Bray, J. Paoli, and C. Sperberg-McQueen. Extensible Markup Language (XML) 1.0.
W3C Recommendation. Latest version available at http://www.w3.org/TR/REC-xml.
[C+95] M.J. Carey et al. Towards heterogeneous multimedia information systems: The Garlic
approach. In Proc. RIDE-DOM Workshop, pages 124–31, 1995.
[Cad] Cadabra.com. At http://www.cadabra.com.
[CGL98a] D. Calvanese, G. De Giacomo, and M. Lenzerini. On the decidability of query con-
tainment under constraints. In Proc. PODS Conf., pages 149–158, 1998.
[CGL+98b] D. Calvanese, G. De Giacomo, M. Lenzerini, D. Nardi, and R. Rosati. Description
logic framework for information integration. In Proc. of the 6th Int. Conf. on the
Principles of Knowledge Representation and Reasoning (KR’98), pages 2–13, 1998.
[CGL99] D. Calvanese, G. De Giacomo, and M. Lenzerini. Answering queries using views in
description logics. In Proc. of the 6th Int. Workshop on Knowledge Representation
meets Databases (KRDB’99), pages 6–10, 1999.
[CGLV99] D. Calvanese, G. De Giacomo, M. Lenzerini, and M. Vardi. Rewriting of regular
expressions and regular path queries. In Proc. PODS Conf., 1999.
[CGLV00] D. Calvanese, G. De Giacomo, M. Lenzerini, and M. Y. Vardi. Query processing using
views for regular path queries with inverse. In Proc. PODS Conf., 2000.
[Cha92] E.P. Chan. Containment and minimization of positive conjunctive queries in OODB’s.
In Proc. PODS Conf., 1992.
[CKW93] W. Chen, M. Kifer, and D.S. Warren. HiLog: a foundation for higher-order logic
programming. Journal of Logic Programming, 15:187–230, February 1993.
[CLR90] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. McGraw
Hill, 1990.
[CM77] A.K. Chandra and P.M. Merlin. Optimal implementation of conjunctive queries in
relational databases. In Proceedings of the Ninth Annual ACM Symposium on Theory
of Computing, pages 77–90, 1977.
[CM90] M. Consens and A. Mendelzon. GraphLog: a visual formalism for real life recursion.
In Proc. PODS Conf., pages 404–416, 1990.
[CRF00] D. Chamberlin, J. Robie, and D. Florescu. Quilt: An XML query language for het-
erogeneous data sources. In Proc. SIGMOD WebDB Workshop, 2000.
[DFF+] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. XML-QL:
A query language for XML. Submission to W3C. Latest version available at
http://www.w3.org/TR/NOTE-xml-ql.
[DG97] O. Duschka and M. Genesereth. Answering queries using recursive views. In Proc.
PODS Conf., 1997.
[DKS92] W. Du, R. Krishnamurthy, and M.-C. Shan. Query optimization in heterogeneous
DBMS. In Proc. VLDB Conference, pages 277–91, Vancouver, Canada, August 1992.
[DL97] O. Duschka and A. Levy. Recursive plans for information gathering. In Proceedings
of the Fifteenth International Joint Conference on Artificial Intelligence, 1997.
[DL99] A. Doan and A. Levy. Efficiently ordering query plans for data integration. In Proc.
AAAI Conf., pages 67–73, 1999.
[DP90] B.A. Davey and H. A. Priestley. Introduction to lattices and order. Cambridge Math-
ematical Textbooks, 1990.
[End72] H. Enderton. A Mathematical Introduction to Logic. Academic Press, 1972.
[Eno] Enosys Markets, Inc. At http://www.enosysmarkets.com.
[Fet] Fetch Technologies. At http://www.fetch.com.
[FFK+98] M. Fernandez, D. Florescu, J. Kang, A. Levy, and D. Suciu. Catching the boat with
Strudel: Experiences with a web-site management system. In Proc. SIGMOD Conf.,
1998.
[FFLS97] M. Fernandez, D. Florescu, A. Levy, and D. Suciu. A query language and processor
for a web-site management system. In Workshop on Management of Semistructured
Data, ACM SIGMOD Conf., 1997.
[FGL+98] P. Fankhauser, G. Gardarin, M. Lopez, J. Muoz, and A. Tomasic. Experiences in
federated databases: From IRO-DB to MIRO-Web. In Proc. VLDB Conf., pages
655–658, 1998.
[FKL97] D. Florescu, D. Koller, and A. Levy. Using probabilistic information in data integra-
tion. In Proc. VLDB Conf., 1997.
[FLM99] M. Friedman, A. Levy, and T. Millstein. Navigational plans for data integration. In
Proc. AAAI Conf., pages 67–73, 1999.
[FLMS99] D. Florescu, A. Levy, I. Manolescu, and D. Suciu. Query optimization in the presence
of limited access patterns. In Proc. SIGMOD Conf., pages 311–322, 1999.
[FLNS88] P. Fankhauser, W. Litwin, E.J. Neuhold, and M. Schrefl. Global view definition and
multidatabase languages: two approaches to database integration. In Research into
Networks and Distributed Applications. European Teleinformatics Conf., pages 1069–
1082, Vienna, Austria, April 1988.
[FLS98] D. Florescu, A. Levy, and D. Suciu. Query containment for conjunctive queries with
regular expressions. In Proc. PODS Conf., 1998.
[FLSY99] D. Florescu, A. Y. Levy, D. Suciu, and K. Yagoub. Optimization of run-time man-
agement of data intensive web-sites. In Proc. VLDB Conf., pages 627–638, 1999.
[FS98] M. Fernandez and D. Suciu. Optimizing regular path expressions using graph schemas.
In Proc. ICDE Conf., 1998.
[FSW99] M. Fernandez, J. Simeon, and P. Wadler. XML query languages: Experiences and
exemplars, 1999. With contributions from S. Cluet et al. Available from
http://www-db.research.bell-labs.com/user/simeon/xquery.html.
[FW97] M. Friedman and D. S. Weld. Efficiently executing information-gathering plans. In
Proc. IJCAI Conf., 1997.
[G+92] M.R. Genesereth et al. Knowledge Interchange Format. Version 3.0. Reference Man-
ual. Technical Report Logic-92-1, Stanford University, 1992. Also available by URL
http://logic.stanford.edu/kif.html.
[GHR98] A. Gupta, V. Harinarayan, and A. Rajaraman. Virtual database technology. In Proc.
ICDE Conf., pages 297–301, 1998.
[GKD97] M. R. Genesereth, A. M. Keller, and O. Duschka. Infomaster: An information inte-
gration system. In Proc. SIGMOD Conf., 1997.
[GL94] P. Gupta and E. Lin. DataJoiner: a practical approach to multidatabase access. In
Proc. PDIS Conf., page 264, 1994.
[GM+97] H. Garcia-Molina et al. The TSIMMIS approach to mediation: data models and
languages. Journal of Intelligent Information Systems, 8:117–132, 1997.
[GM99] G. Grahne and A. O. Mendelzon. Tableau techniques for querying information sources
through global schemas. In Proc. ICDT Conf., pages 332–347, 1999.
[GMLY99] H. Garcia-Molina, W. Labio, and R. Yerneni. Capability-sensitive query processing
on internet sources. In Proc. ICDE Conf., pages 50–59, 1999.
[GMPVY] H. Garcia-Molina, Y. Papakonstantinou, V. Vassalos, and R. Yerneni. A TSIMMIS
retrospective. Working paper. Draft available from
http://www.stern.nyu.edu/~vassalos/retro-draft.ps.
[GN88] M.R. Genesereth and N.J. Nilsson. Logical Foundations of Artificial Intelligence.
Morgan Kaufmann, 1988.
[Gol90] C. Goldfarb. The SGML Handbook. Oxford University Press, 1990.
[Gup89] A. Gupta. Integration of Information Systems: Bridging Heterogeneous Databases.
IEEE Press, 1989.
[GW97] R. Goldman and J. Widom. Dataguides: Enabling query formulation and optimization
in semistructured databases. In Proc. VLDB Conf., 1997.
[HKWY96] Laura Haas, Donald Kossman, Edward Wimmers, and Jun Yang. An optimizer for
heterogeneous systems with non-standard data and search capabilities. Special Issue
on Query Processing for Non-Standard Data, IEEE Data Engineering Bulletin, 19:37–
43, December 1996.
[HKWY97] L. Haas, D. Kossman, E. Wimmers, and J. Yang. Optimizing queries across diverse
data sources. In Proc. VLDB, 1997.
[HM93] J. Hammer and D. McLeod. An approach to resolving semantic heterogeneity in a
federation of autonomous, heterogeneous database systems. Intl Journal of Intelligent
and Cooperative Information Systems, 2:51–83, 1993.
[HY90] R. Hull and M. Yoshikawa. ILOG: Declarative creation and manipulation of object
identifiers. In Proc. VLDB Conference, pages 455–68, Brisbane, Australia, August
1990.
[HY91] R. Hull and M. Yoshikawa. On the equivalence of data restructurings involving object
identifiers. In Proc. PODS Conference, 1991.
[IFF+99] Z. G. Ives, D. Florescu, M. Friedman, A. Y. Levy, and D. S. Weld. An adaptive query
execution system for data integration. In Proc. SIGMOD Conf., pages 299–310, 1999.
[Imm82] N. Immerman. Upper and lower bounds for first-order expressibility. Journal of
Computer and System Sciences, 25(1):76–98, August 1982.
[JBHM+97] J. Hammer, M. Breunig, H. Garcia-Molina, S. Nestorov, V. Vassalos, and R. Yerneni.
Template-based wrappers in the TSIMMIS system. In Proc. ACM SIGMOD, pages
532–535, 1997.
[K+93] W. Kim et al. On resolving schematic heterogeneity in multidatabase systems. Dis-
tributed And Parallel Databases, 1:251–279, 1993.
[KL89] M. Kifer and G. Lausen. F-logic: a higher-order language for reasoning about objects,
inheritance, and scheme. In Proc. ACM SIGMOD Conf., pages 134–46, Portland, OR,
June 1989.
[KW96] C. T. Kwok and D. S. Weld. Planning to gather information. In Proc. AAAI Conf.,
1996.
[Lev] A. Levy. Answering queries using views: a survey. Available from
www.cs.washington.edu/homes/alon/site/files/view-survey.ps.
[Lev00] Special issue on adaptive query processing. Bulletin of the Technical Committee on
Data Engineering, 23(2), June 2000.
[Lew91] J.W. Lewis. Wrappers: integrating utilities and services for the DICE architecture.
In Proceedings of the Second National Symposium on Concurrent Engineering, pages
445–457, 1991.
[LHL+98] B. Ludäscher, R. Himmeröder, G. Lausen, W. May, and C. Schlepphorst. Managing
semistructured data with FLORID: A deductive object-oriented perspective.
Information Systems, 23(8):589–613, 1998.
[LMR90] W. Litwin, L. Mark, and N. Roussopoulos. Interoperability of multiple autonomous
databases. ACM Computing Surveys, 22:267–293, 1990.
[LMSS95] A. Levy, A. Mendelzon, Y. Sagiv, and D. Srivastava. Answering queries using views.
In Proc. PODS Conf., pages 95–104, 1995.
[LPV00] B. Ludascher, Y. Papakonstantinou, and P. Velikhov. Navigation-driven evaluation of
virtual mediated views. In Proc. EDBT Conf., 2000.
[LR96] A. Levy and M.-C. Rousset. CARIN: a representation language integrating rules and
description logics. In Proceedings of the European Conference on Artificial Intelligence,
Budapest, Hungary, 1996.
[LRO96] A. Levy, A. Rajaraman, and J. Ordille. Querying heterogeneous information sources
using source descriptions. In Proc. VLDB, pages 251–262, 1996.
[LRU96] A. Levy, A. Rajaraman, and J. Ullman. Answering queries using limited external
processors. In Proc. PODS, pages 227–37, 1996.
[LRU99] A. Levy, A. Rajaraman, and J. Ullman. Answering queries using limited external
processors. Journal of Computer and System Sciences, 58(1):69–82, February 1999.
[LS97] A. Levy and D. Suciu. Deciding containment for queries with complex objects. In
Proc. PODS Conf., 1997.
[MAG+97] J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: A database
management system for semistructured data. SIGMOD Record, 26(3):54–66, 1997.
[Mai86] D. Maier. A logic for objects. In J. Minker, editor, Preprints of Workshop on Founda-
tions of Deductive Database and Logic Programming, Washington, DC, USA, August
1986.
[Mer] Mergent Systems. At http://www.mergent.com.
[MFDG98] S. Mace, U. Flohr, R. Dobson, and T. Graham. Weaving a better Web. BYTE
Magazine, March 1998. Cover Story.
[MLF00] T. D. Millstein, A. Y. Levy, and M. Friedman. Query containment for data integration
systems. In Proc. PODS Conf., 2000.
[MP00] K. Munroe and Y. Papakonstantinou. BBQ: A visual interface for browsing and
querying XML. In Proc. Visual Database Systems, 2000.
[MS99] T. Milo and D. Suciu. Type inference for queries on semistructured data. In Proc.
PODS Conf., pages 215–226, 1999.
[MSV00] T. Milo, D. Suciu, and V. Vianu. Typechecking for XML transformers. In Proc. PODS
Conf., 2000.
[MW99] J. McHugh and J. Widom. Query optimization for an XML query language. In Proc.
SIGMOD Conf., 1999.
[Nim] Nimble.com. At http://www.nimble.com.
[PAGM96] Y. Papakonstantinou, S. Abiteboul, and H. Garcia-Molina. Object fusion in mediator
systems. In Proc. VLDB Conf., 1996.
[Pap97] Y. Papakonstantinou. Query processing in heterogeneous information sources.
Technical report, Stanford University Thesis, 1997. Available from
www-cse.ucsd.edu/~yannis/papers/.
[PGGMU95] Y. Papakonstantinou, A. Gupta, H. Garcia-Molina, and J. Ullman. A query translation
scheme for the rapid implementation of wrappers. In Proc. DOOD Conf., pages 161–
86, 1995.
[PGH96] Y. Papakonstantinou, A. Gupta, and L. Haas. Capabilities-based query rewriting in
mediator systems. In Proc. PDIS Conf., 1996.
[PGH98] Y. Papakonstantinou, A. Gupta, and L. Haas. Capabilities-based query rewriting in
mediator systems. Distributed and Parallel Databases, 6:73–110, 1998.
[PGMU96] Y. Papakonstantinou, H. Garcia-Molina, and J. Ullman. Medmaker: A mediation
system based on declarative specifications. In Proc. ICDE Conf., pages 132–41, 1996.
[PGMW95] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. Object exchange across
heterogeneous information sources. In Proc. ICDE Conf., pages 251–60, 1995.
[PL00] R. Pottinger and A. Levy. A scalable algorithm for answering queries using views. In
Proc. VLDB Conf., 2000.
[PV99] Y. Papakonstantinou and V. Vassalos. Query rewriting for semistructured data. In
Proc. SIGMOD Conf., pages 455–466, 1999.
[PV00] Y. Papakonstantinou and V. Vianu. DTD inference for views of XML data. In Proc.
PODS Conf., 2000.
[PWDN99] M. Papiani, J. Wason, A. Dunlop, and D. Nicole. A distributed scientific data archive
using the Web, XML and SQL/MED. SIGMOD Record, 28(3), 1999.
[Qia96] Xiaolei Qian. Query folding. In Proc. ICDE, pages 48–55, 1996.
[ROH99] M. Tork Roth, F. Ozcan, and L. M. Haas. Cost models DO matter: Providing cost
information for diverse data sources in a federated system. In Proc. VLDB Conf.,
pages 599–610, 1999.
[RS97] M. Tork Roth and P. Schwarz. Don’t Scrap It, Wrap It! An architecture for legacy
data sources. In Proc. VLDB Conf., pages 266–275, 1997.
[RSU95] A. Rajaraman, Y. Sagiv, and J. Ullman. Answering queries using templates with
binding patterns. In Proc. PODS Conf., pages 105–112, 1995.
[RSUV89] R. Ramakrishnan, Y. Sagiv, J.D. Ullman, and M.Y. Vardi. Proof tree transformations
and their applications. In Proc. PODS Conf, pages 172–182, 1989.
[S+] V.S. Subrahmanian et al. HERMES: A heterogeneous reasoning and mediator system.
Available at http://www.cs.umd.edu/projects/hermes/overview/paper.
[SAC+79] P. G. Selinger, M. Astrahan, D. Chamberlin, R. A. Lorie, and T. G. Price. Access
path selection in a relational database management system. In Proc. SIGMOD Conf.,
pages 23–34, 1979.
[SBGJ+97] S. Bressan, C. Goh, K. Fynn, M. Jakobisiak, K. Hussein, H. Kon, T. Lee, S. Madnick,
T. Pena, J. Qu, A. Shum, and M. Siegel. The context interchange mediator prototype.
In Proc. SIGMOD Conf., pages 525–527, 1997.
[SGM] Overview of SGML resources. At http://www.w3.org/MarkUp/SGML/.
[Suc98] D. Suciu. Semistructured data and XML. In Proc. FODO Conf., 1998.
[SY80] Y. Sagiv and M. Yannakakis. Equivalences among relational expressions with the union
and difference operators. JACM, 27:633–55, 1980.
[T+90] G. Thomas et al. Heterogeneous distributed database systems for production use.
ACM Computing Surveys, 22:237–266, 1990.
[TBMM] H. S. Thompson, D. Beech, M. Maloney, and N. Mendelsohn. XML
Schema Part 1: Structures. W3C Working Draft. Latest version available at
http://www.w3.org/TR/xmlschema-1/.
[TRV98] A. Tomasic, L. Raschid, and P. Valduriez. Scaling access to heterogeneous data sources
with DISCO. Transactions on Knowledge and Data Engineering, 10(5):808–823, 1998.
[Tuk] Tukwila data integration system. At
http://data.cs.washington.edu/integration/tukwila.
[Ull88] J.D. Ullman. Principles of Database and Knowledge-Base Systems, Vol. I & II. Com-
puter Science Press, New York, NY, 1988.
[Ull89] J.D. Ullman. Principles of Database and Knowledge-Base Systems, Vol. II: The New
Technologies. Computer Science Press, New York, NY, 1989.
[Ull97] J.D. Ullman. Information integration using logical views. In Proc. ICDT Conf., pages
19–40, 1997.
[VP97] V. Vassalos and Y. Papakonstantinou. Describing and Using Query Capabilities of
Heterogeneous Sources. In Proc. VLDB Conf., pages 256–266, 1997.
[VP98] V. Vassalos and Y. Papakonstantinou. Using knowledge of redundancy for query
optimization in mediators. In Proceedings of the AAAI’98 Workshop on AI and
Information Integration, 1998. Available from www.stern.nyu.edu/~vassalos/publications/.
[VP00] V. Vassalos and Y. Papakonstantinou. Expressive capabilities description languages
and query rewriting algorithms. Journal of Logic Programming, 43(1):75–122, 2000.
[Wel] D. Wells. Wrappers survey. Available from http://www.objs.com/survey/wrap.htm.
[Wie92] G. Wiederhold. Mediators in the architecture of future information systems. IEEE
Computer, 25:38–49, 1992.
[YL87] H. Z. Yang and P.-A. Larson. Query transformation for PSJ-queries. In Proc. VLDB
Conf., pages 245–254, 1987.
[YLGMU99] R. Yerneni, C. Li, H. Garcia-Molina, and J. D. Ullman. Computing capabilities of
mediators. In Proc. SIGMOD Conf., pages 443–454, 1999.
[YLUGM99] R. Yerneni, C. Li, J. D. Ullman, and H. Garcia-Molina. Optimizing large join queries
in mediation systems. In Proc. ICDT Conf., pages 348–364, 1999.
[ZGMHW95] Y. Zhuge, H. Garcia-Molina, J. Hammer, and J. Widom. View maintenance in a
warehousing environment. In Proc. SIGMOD Conference, pages 316–327, 1995.