Database Sampling with Functional
Dependencies
Jesus Bisbal, Jane Grimson
Department of Computer Science, Trinity College Dublin, Dublin 2, Ireland
Abstract
During the development of information systems there is a need to prototype the
database which the applications will use when in operation. A prototype database
can be built by sampling data from an existing database. Including relevant semantic
information when extracting a sample from a database is considered invaluable to
support the development of data-intensive applications. Functional dependencies
are an example of semantic information that could be considered when sampling a
database.
This paper investigates how a database relation can be sampled so that the result-
ing sample satisfies precisely a given set of functional dependencies (and its logical
consequences), i.e. is an Armstrong relation.
Key words: Information System Development, Legacy Migration, Database
Sampling, Integrity Constraints, Database Prototyping, Armstrong Relation,
Functional Dependencies
1 Introduction
During the development of an information system there is a need to construct
a prototype of the operational database. Such a database can support several
stages of the development process:
Validation of database and/or application requirements. Different de-
signs can be checked for completeness and correctness [17];
User training. Applications will need a database on which to operate with
during training;
Testing. As needed in, for example, a system’s functional test [4], performance evaluation [11], and back-to-back testing [19].
Most of the existing approaches to building prototype databases use synthetic
data values to populate the resulting database. Although using synthetic data
may be the only alternative, the lack of data values relevant to the application
domain limits the usefulness of the prototype database [5,6]. Increasingly,
however, real (existing) data is actually available. It has been reported that
from 60% to 85% of the total software development costs in industry can be
attributed to maintenance [21]. Thus it is reasonable to expect that in over 60%
of projects there will be a database from which real data can be extracted. This
paper proposes that this data should be used to populate a prototype database.
The resulting database is called a Sample Database. A database can be sampled
according to many different criteria, such as, for example, random selection, or
selecting data items in such a way that the resulting sample satisfies a set of
integrity constraints. Functional dependencies are an example of such integrity
constraints, which have been widely investigated in the context of relational
databases [3,10,1]. They capture intuitive relationships between data items,
and play a central role in the design of databases [16]. Sampling a database in
such a way that the resulting sample satisfies a set of functional dependencies
will produce a sample with appropriate semantic information to support the
database design process, as outlined above. Since it will also contain domain
relevant data values this sample can be expected to support user training and
testing better than when using synthetic values.
This research was initially motivated in the context of Legacy Information
Systems Migration [8], where the need to identify a representative sample
(in terms of semantic information and data values) of an existing operational
database is essential to the success of any migration project [22]. A sample of
the migrating database can be used to thoroughly test the mapping between
legacy and target schemes; additionally, a migrated sample can support the
development of the target information system, in the stages identified above.
Using the entire legacy database for development and testing purposes only
may not be cost-effective. This paper investigates the use of functional de-
pendencies in the context of Database Sampling. This same problem has also
been investigated considering other types of integrity constraints, as will be
briefly outlined in Section 2.
The remainder of this paper is organised as follows. The next Section reviews
existing research in the area of database prototyping and database sampling.
Required notation and terminology are introduced in Section 3. Section 4 de-
fines the objectives when sampling with functional dependencies, and describes
a concrete database sampling example which will be used in later Sections.
Two alternative sampling algorithms are analysed in Section 5. The goal of
Sampling with functional dependencies, as defined in Section 4, is generalised
in Section 6. The final Section summarises the paper and gives a number of
future directions for this research.
2 Related Work
Existing research into database prototyping methods has focused mainly on
populating the database using synthetic values, i.e. not necessarily related to
the database application domain. The objective is to produce a database satis-
fying a predefined set of integrity constraints, which represent domain-relevant
semantics for the data. Noble [18] describes one such method, which considers
referential integrity constraints and functional dependencies, but populates
the database mainly with synthetic values, although the user can also enter a
list of values to be used as the domain for an attribute. A different database
prototyping approach is presented in [24]. This method firstly checks for the
consistency of an Extended Entity-Relationship (EER) schema defined for
the database being prototyped, based on cardinality constraints only. Once
the design has been proved consistent, a database according to this schema
is generated. To guide the generation process, a so-called general Dependency
Graph is created from the EER diagram. This graph represents the set of ref-
erential integrity constraints that must hold in the database and which is used
to define a partial order between the entities of the EER diagram. The syn-
thetic data generation process populates the database entities following this
partial order. Lohr-Richter [13] further develops the data generation step used
by this method, and recognises the need for additional information in order
to generate relevant data for the application domain, without addressing how
this is to be achieved. Tucherman [20] presents a database design tool which
prototypes a database also based on an Entity-Relationship Schema. The tool
automatically maps this design into a normalised relational schema. Special
attention is paid in this contribution to what are called restrict/propagate
rules, that is, to enforcing referential integrity constraints. When an operation
is going to violate one such constraint, the system can be instructed to either
block (restrict) the operation so that such violation does not occur or to prop-
agate it to the associated tables by deleting or inserting tuples as required. No
explicit reference is made as to how the resulting database should be popu-
lated for prototyping, although it identifies the possibility of interacting with
the user when new insertions are required, as the user may be the only source
for appropriate domain-relevant data values.
An approach more closely related to the work presented here was reported by
Mannila in [14]. This paper described a mechanism for efficiently populating
a database relation so that it satisfies exactly a predefined set of functional
dependencies (and thus being an Armstrong relation as will be defined in
Section 3). Such a relation can be used to assist the database designer in
identifying the appropriate set of functional dependencies which the database
must satisfy. Since this relation satisfies all the required dependencies and only
those, it can be seen as an alternative representation for the dependencies
themselves. A relation generated using this method is expected to expose
missing or undesirable functional dependencies, and the designer can use it
to iteratively refine the database design until it contains only the required
dependencies. Special attention is paid in [14] to the size of the relation being
generated. A designer might not be able to identify missing or undesirable
dependencies if he was presented with a relation containing a large number of
tuples. For this reason, the goal is to produce a relation that, satisfying exactly
a given set of functional dependencies, contains the smallest possible number
of tuples. Refer to [16] for a more extended treatment of this approach.
All database prototyping approaches described above populate the database
using data values not related to the application domain. This limits their ap-
plicability, especially in cases where data values from such prototype databases
must be presented to an end user, as is common in software prototyping for
requirements elicitation. A user who is unfamiliar with the data values pre-
sented, may not fully understand their semantics and therefore may not be
able to comment on required modifications, enhancements or missing func-
tionality.
This paper investigates how a prototype database can be built by sampling
from an operational database and thus using domain-relevant data. Limited
work has been reported to date in the area of Database Sampling resulting
in a database which satisfies a predefined set of integrity constraints. Bisbal
et al. [9] described a mechanism to represent a wide range of integrity con-
traints (including cardinality constraints, inclusion dependencies, hierarchical
relationships and a generalisation of referential integrity constraints) specially
well suited for sampling purposes. In a related paper [7] the database sampling
process was analysed from a more abstract point of view, without focusing on
any particular type of integrity constraint, and thus identifying the issues
that must always be addressed when consistently sampling from a database.
The authors [5,6] have extensively discussed the motivations and application
areas of database prototyping and database sampling, and reviewed existing
research on this area. The design of a database sampling tool which implements
the work presented here has been reported in [5,7].
The work presented here has similar objectives to those described in [14]. The
aim is also to produce a small database relation that satisfies a predefined set of
functional dependencies. The difference, however, is that the resulting relation
must be a subset of a relation in an existing operational database. Using such
a relation as the source of data in software prototyping, an end user would
be familiar with the data values presented. Also, a database designer, when
maintaining an existing database, could identify dependencies that no longer
should be satisfied or dependencies that should be added as a consequence of
a change in the environment in which the database operates. Using the entire
operational database relation may not be cost-effective in some applications
of database prototyping, as for example in legacy migration. Also, as outlined
above, the size of the entire relation is likely to be too large to be manageable
by the designer.
Note that the need for a sample of a database arises in a context significantly
different from that which motivated the database prototyping approaches de-
scribed earlier. Methods that produce synthetically generated databases are
needed when developing completely new applications. Currently, there is an
increasing need to develop or extend applications based on existing software
and data (enhancements, migration, etc.) which motivates the need to address
the traditional problem but with an additional constraint, the availability of
existing applications and related data. Refer to [8] for a more extended dis-
cussion on the challenges posed by existing (legacy) applications.
3 Definitions and Notation
This section defines the terminology and notation based on standard defini-
tions [23,1] that will be used throughout this paper. Examples to illustrate
these definitions will be given in later Sections, where they will be used.
Assume there is a set of attributes U , each with an associated domain. A rela-
tion schema R over U is a subset of U . A database schema D = {R1, . . . , Rn}
is a set of relation schemas. A relation r over the relation schema R is a set
of R-tuples, where an R-tuple is a mapping from the attributes of R to their
domains.
Attributes in U are denoted with uppercase letters A, B, C, . . . . The notation
will be simplified when referring to sets of attributes, e.g. {A, B}, by removing
the curly brackets, e.g. AB, which should not cause confusion as it will be clear
from the context. R-tuples are denoted by lowercase letters, possibly with
subscripts, t1, t2, . . . . The relation r to which an R-tuple belongs will always
be clear from the context. The value of an attribute, A ∈ U , for a particular
R-tuple, t1, is denoted t1(A). This notation is extended to sets of attributes
so that for Y ⊆ U , t1(Y ) denotes a tuple, u1, over Y with u1(A) = t1(A) for
all attributes A ∈ Y .
A functional dependency can be defined as follows.
Definition 3.1 If U is a set of attributes, then a Functional Dependency over
U is an expression of the form Y → Z, where Y, Z ⊆ U . A relation r over U
satisfies Y → Z, denoted r |= Y → Z, if for each pair of tuples in r, t1 and t2,
t1(Y ) = t2(Y ) implies t1(Z) = t2(Z). If Σ is a set of functional dependencies,
then r |= Σ means that all dependencies in Σ are satisfied in r.
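As an illustrative sketch of Definition 3.1 (the function and variable names here are mine, not the paper's), the satisfaction check for a functional dependency can be written in a few lines of Python, modelling a relation as a list of attribute-to-value dicts:

```python
# Definition 3.1 as executable code: r |= Y -> Z iff every pair of tuples
# that agrees on Y also agrees on Z.
from itertools import combinations

def satisfies_fd(r, Y, Z):
    for t1, t2 in combinations(r, 2):
        if all(t1[a] == t2[a] for a in Y) and any(t1[a] != t2[a] for a in Z):
            return False
    return True

r = [
    {"A": "a1", "B": "b1", "C": "c1"},
    {"A": "a1", "B": "b2", "C": "c2"},
]
print(satisfies_fd(r, {"B"}, {"C"}))  # True: no two tuples share a B-value
print(satisfies_fd(r, {"A"}, {"B"}))  # False: same A, different B
```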
Individual functional dependencies are denoted with lowercase Greek letters,
possibly with subscripts, σ, γ1, γ2, . . . . Sets of functional dependencies are denoted using uppercase Greek letters, possibly with subscripts, Σ, Γ1, Γ2, . . . .
Definition 3.2 Let Σ and Γ be sets of functional dependencies over U . Then
Σ implies Γ, denoted Σ |= Γ, iff for all relations r over U , r |= Σ implies
r |= Γ.
Definition 3.3 Let Σ be a set of functional dependencies over U . The closure
of Σ, denoted Σ∗, is defined as Σ∗ = {Y → Z|Y Z ⊆ U and Σ |= Y → Z}.
Σ∗ represents the set of all functional dependencies that are logical conse-
quences of Σ. This concept allows for the definition of what are called Arm-
strong Relations, a concept initially introduced in [2], although the term itself
was coined by Fagin in [10].
Definition 3.4 A database relation r is an Armstrong relation for a set of
functional dependencies Σ iff r satisfies all functional dependencies in Σ∗ and
no other functional dependency. It is minimal if every Armstrong Relation for
Σ has at least as many tuples as r.
It should be noted that a set of functional dependencies and an Armstrong
relation for it are dual representations of the same information [15]. During the
design and maintenance of a database, incorrect constraints are more easily
identified from a list of dependencies, whereas missing dependencies are better
revealed from an example of the relation. This justifies the use of Armstrong
relations for design purposes.
The construction of an Armstrong relation presented here relies on the closure
of a set of attributes under a set of functional dependencies, called fd-closure,
which is defined next [3,1].
Definition 3.5 Given a set Σ of functional dependencies over U and an attribute set X ⊆ U , the fd-closure of X under Σ, denoted (X , Σ)∗,U or simply X ∗ if Σ and U are understood from the context, is the set X ∗ = {A ∈ U | Σ |= X → A}.
Intuitively, X ∗ represents the set of all attributes that are determined by X
according to Σ. Some sets of attributes, referred to as saturated [2] (or closed
sets in [3]), do not determine any other attributes but themselves.
Definition 3.6 A set of attributes X over U is saturated with respect to a
set of functional dependencies Σ over U iff X ∗ = X .
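The fd-closure of Definition 3.5 and the saturated sets of Definition 3.6 can be computed by the standard fixpoint iteration over Σ. The following Python sketch (names are mine) does so; following the paper's convention, U itself is excluded from the enumeration:

```python
# fd-closure X* (Definition 3.5) and saturated sets (Definition 3.6).
from itertools import combinations

def fd_closure(X, fds):
    """X*: closure of attribute set X under fds, a list of (lhs, rhs) frozensets."""
    closed = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If lhs is already determined by X, everything in rhs is too.
            if lhs <= closed and not rhs <= closed:
                closed |= rhs
                changed = True
    return frozenset(closed)

def saturated_sets(U, fds):
    """All X with X* = X; sizes 0..|U|-1, so U itself never appears."""
    subsets = (frozenset(c) for n in range(len(U))
               for c in combinations(sorted(U), n))
    return [X for X in subsets if fd_closure(X, fds) == X]

sigma = [(frozenset("B"), frozenset("C")), (frozenset("AC"), frozenset("B"))]
for X in saturated_sets({"A", "B", "C"}, sigma):
    print(sorted(X))
# prints [], ['A'], ['C'], ['B', 'C'] - the sets Ø, A, C, BC
```

Run on the Σ of the running example below, this reproduces the four saturated sets the paper lists in Section 5.1.1.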
Following the terminology of [3], we define the concept of two tuples agreeing
exactly on a set of attributes as follows.
Definition 3.7 Let X be a set of attributes, X ⊂ U . A pair of tuples t1 and t2 over U agree exactly on X iff ∀A ∈ X , t1(A) = t2(A) and ∀B ∉ X , t1(B) ≠ t2(B).
This concept is next extended to entire relations.
Definition 3.8 A database relation r over U satisfies X , with X ⊂ U , iff
there exist two different tuples in r that agree exactly on X .
This last definition is central to the purposes of this paper. As will be shown
by Theorem 5.1, if relation r satisfies precisely all saturated sets then it is an
Armstrong relation for the specified set of functional dependencies. Therefore,
the goal is to develop an algorithm that selects a set of tuples from a relation
in such a way that the resulting relation satisfies all saturated sets.
4 Sampling with Functional Dependencies
This Section defines the objective of the process described in this paper, that
is, what is involved when sampling a relation according to a given set of
functional dependencies, and also introduces an example.
4.1 Defining the Problem
Given a database relation r over U , and a set of functional dependencies Σ
satisfied by r, the goal is to extract a sample of this relation which is a small 1
Armstrong relation for Σ.
The exposition given here will firstly assume that r is an Armstrong relation
for Σ. Then, the same problem will be addressed in Section 6 considering a
subset Γ1 and a superset Γ2 of Σ, with Γ1 ⊆ Σ ⊆ Γ2.
4.2 Running Example
The following is an example that will be used in this paper to illustrate the
concepts being developed.
Consider the set of functional dependencies Σ = {B → C, AC → B}, and
the database relation r over U = {A, B, C} shown in Table 1. It can be seen
how tuples t1, t2, t3 and t4 in this relation form a minimal Armstrong relation
for Σ. The entire relation r is also an Armstrong relation for Σ (adding tuples
t5 and t6 does not invalidate any functional dependency in Σ). It follows that
1 The sense in which it is small will be addressed in Section 5.1.2.
Table 1. Running Example - Armstrong relation for Σ = {B → C, AC → B}

          A    B    C
     t1   a1   b1   c1
     t2   a1   b2   c2
     t3   a2   b3   c1
     t4   a2   b2   c2
     t5   a4   b3   c1
     t6   a3   b1   c1
this is an example of a relation from which a sample can be extracted so that
the set of functional dependencies it satisfies is the same.
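The claim that tuples t1 to t4 form an Armstrong relation for Σ can be checked mechanically: for every candidate dependency X → A over U , satisfaction in the sample must coincide with derivability from Σ (i.e. A ∈ X ∗). A Python sketch of this check (helper names are mine):

```python
# Verify that the sample {t1, t2, t3, t4} of Table 1 satisfies exactly
# the dependencies in the closure of Σ = {B -> C, AC -> B}.
from itertools import combinations

def satisfies_fd(r, X, A):
    return all(t1[A] == t2[A]
               for t1, t2 in combinations(r, 2)
               if all(t1[a] == t2[a] for a in X))

def fd_closure(X, fds):
    closed = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closed and not rhs <= closed:
                closed |= rhs
                changed = True
    return closed

U = ["A", "B", "C"]
sigma = [({"B"}, {"C"}), ({"A", "C"}, {"B"})]
sample = [
    {"A": "a1", "B": "b1", "C": "c1"},  # t1
    {"A": "a1", "B": "b2", "C": "c2"},  # t2
    {"A": "a2", "B": "b3", "C": "c1"},  # t3
    {"A": "a2", "B": "b2", "C": "c2"},  # t4
]
armstrong = all(
    satisfies_fd(sample, set(X), A) == (A in fd_closure(set(X), sigma))
    for n in range(len(U) + 1)
    for X in combinations(U, n)
    for A in U
)
print(armstrong)  # True: the sample is an Armstrong relation for Σ
```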
5 Sampling Algorithms
This Section describes the two algorithms that are used in this paper to select
a set of tuples from a relation so that the resulting sample is an Armstrong
relation for a predefined set of functional dependencies. The first algorithm
extracts a minimal size sample. The second algorithm selects tuples randomly.
Although the size of the sample resulting when using the second algorithm can
be considerably larger than using the first method, the efficiency improvement
can be very significant.
5.1 Sampling Method (I) - Minimal Sample Size
In order to guarantee the minimal possible size for the resulting sample, the
algorithm presented here relies on constructing what is called in this paper
an Agreements Table. This table represents the interactions between tuples in
the input relation r regarding the set of functional dependencies each pair of
tuples, if included in the sample, invalidates (see below for an example). This
information is used to guide the sampling process 2 . The construction of the
Agreements Table is based on the concept of a set of attributes being saturated
(Definition 3.6), as described below.
5.1.1 Agreements Table
The fd-closure is used in an Agreements Table to compute all sets of attributes
X ⊂ U such that X ∗ = X , that is, all saturated sets. An Agreements Table
gathers, for each tuple in the input database relation, which other tuples can
be used with that one to agree exactly (Definition 3.7) on each saturated set.
As an example, using Σ = {B → C, AC → B} as in Section 4.2, the saturated
sets of attributes are {A, C, BC, Ø} 3 . An Agreements Table has a column for
each such set. Table 2 shows the Agreements Table for the Running Example
of Section 4.2. For example, tuple t2 does not agree exactly on C∗ with any
other tuple. Even though t2 and t4 do have the same value for attribute C,
they also have the same value for B, and a different value for A, for this
reason they agree exactly on {B, C}∗, not on C∗. As another example, it can
be seen how tuple t1 can only use tuple t2 to agree exactly on A∗. This follows
from the fact that in relation r of Section 4.2 we have t1(A) = t2(A), but
t1(B) ≠ t2(B) and t1(C) ≠ t2(C), as required by Definition 3.7. The meaning
of this relationship between t1 and t2 is that if both tuples are included in
2 All database sampling strategies require some information structure to guide the sampling process [7].
3 To simplify the exposition, this paper is considering all saturated sets. It has been shown [2,3] that it suffices to consider only the set of so-called intersection generators of this set. The algorithm presented below is valid in either case. By convention, U is the intersection of the empty collection of sets, thus it is not a generator (although it is saturated). For this reason it has not been included here. In the relation example used in this paper, the set of generators and the set of all saturated sets are the same except for U.
Table 2. Agreements Table for the Running Example

          A∗          C∗           {BC}∗     Ø∗              No. X ∗
     t1   {t2}        {t3, t5}     {t6}      {t4}            4
     t2   {t1}                     {t4}      {t3, t6, t5}    3
     t3   {t4}        {t1, t6}     {t5}      {t2}            4
     t4   {t3}                     {t2}      {t1, t6, t5}    3
     t5               {t1, t6}     {t3}      {t2, t4}        3
     t6               {t3, t5}     {t1}      {t2, t4}        3
the sample, no functional dependency with only attribute ‘A’ in the left-hand
side will be satisfied. The algorithm described below exploits this fact in order
to build a sample of the input relation that violates (i.e. does not satisfy) all
functional dependencies not in Σ∗.
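The construction of such a table can be sketched directly from Definition 3.7. The following Python fragment (with the saturated sets of the running example hard-coded; all names are mine) computes, for each tuple, its exact-agreement partners per saturated set:

```python
# Build the Agreements Table of the running example: for each tuple and
# each saturated set X, record the other tuples agreeing exactly on X.
def agree_exactly(t1, t2, X, U):
    return (all(t1[a] == t2[a] for a in X) and
            all(t1[a] != t2[a] for a in U - X))

U = {"A", "B", "C"}
saturated = [frozenset(), frozenset("A"), frozenset("C"), frozenset("BC")]
r = {
    "t1": {"A": "a1", "B": "b1", "C": "c1"},
    "t2": {"A": "a1", "B": "b2", "C": "c2"},
    "t3": {"A": "a2", "B": "b3", "C": "c1"},
    "t4": {"A": "a2", "B": "b2", "C": "c2"},
    "t5": {"A": "a4", "B": "b3", "C": "c1"},
    "t6": {"A": "a3", "B": "b1", "C": "c1"},
}
table = {
    name: {X: sorted(other for other, u in r.items()
                     if other != name and agree_exactly(t, u, X, U))
           for X in saturated}
    for name, t in r.items()
}
print(table["t1"][frozenset("A")])  # ['t2']
print(table["t2"][frozenset("C")])  # []
```

The two printed cells match Table 2: t1 agrees exactly on A∗ only with t2, and t2 has no exact-agreement partner on C∗.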
The rationale behind an Agreements Table is that if a sample is selected in
such a way that it satisfies all saturated attributes sets then all functional
dependencies in Σ∗, and only those, hold in the resulting sample. This is
formalised next.
Theorem 5.1 Let r be a database relation and Σ a set of functional depen-
dencies, both over U . If relation r satisfies precisely all sets of attributes X
with X ∗ = X , then r is an Armstrong relation for Σ.
Proof. 4 Firstly, to prove that all functional dependencies in Σ∗ hold in r,
let σ be a functional dependency σ ∈ Σ∗. Without loss of generality, assume
that σ is of the form X → A, with X ⊂ U and A ∈ U . Note that Z = X∗
is saturated. By hypothesis any two distinct tuples must agree exactly on a
4 This proof has been included here for completeness only. Refer to [3, p. 37] or [2, p. 582] for alternative proofs.
saturated set. If t1 and t2 agree exactly on X, they must agree exactly on X∗.
Since A ∈ X∗, t1 and t2 agree also on A. It follows that r |= σ, as required.
To prove that only the dependencies in Σ∗ hold in r, take a functional dependency X → A ∉ Σ∗. This implies A ∉ X∗. By hypothesis there is a pair of tuples, t1 and t2, that agree exactly on X∗, therefore t1(X) = t2(X) and t1(A) ≠ t2(A). So, by definition r ⊭ X → A. This proves that r is an Armstrong relation for Σ. □
In addition to one column for each saturated set, another column, termed ’No.
X ∗’ (Number X ∗), is shown in the Agreements Table of Table 2. It represents
the number of different sets X that could be satisfied by selecting this tuple
(i.e. number of non-empty columns in each row). This column is used to select
between several possible tuples that could be inserted into the sample, in order
for it to be of minimal size. Selecting tuples with the highest value for ‘No.
X ∗’ leads to the minimum possible additional tuples required to satisfy all
saturated sets.
5.1.2 Minimal Size Sampling Algorithm
The above discussion has shown how, using an Agreements Table, the result-
ing sample (1) is an Armstrong relation, and (2) is expected to be small. This
solves the problem as stated in Section 4.1. The algorithm presented in this
Section extracts a sample of minimal size. It is only minimal according to the
information available in the Agreements Table. Tuples are selected according
to their values for column ’No. X ∗’ in this table. These values represent the
number of saturated sets to which each particular tuple is directly related (in
the sense of being able to satisfy them), as opposed to considering all satu-
rated sets to which each tuple is transitively related. These two alternatives
can be identified with selecting tuples using either local information or global
information, respectively. The algorithm that follows would not change if the
second alternative was followed; however, the complexity (in terms of execu-
tion time) of building an Agreements Table would be significantly higher.
Fig 1 shows an algorithm that implements the sampling process described
above. The initial call for this recursive algorithm should be
MinimalSizeSample(r, Σ, Sample)
where r is the relation being sampled, Σ is the set of functional dependen-
cies being considered, and Sample contains the resulting sample relation. In
addition to a sample of relation r, this algorithm also returns whether sam-
pling was successfully performed. (The situation in which it is not possible to
extract an Armstrong relation for Σ from the input relation r is analysed in
Section 6.)
The algorithm initially selects a tuple, tj, with the highest value for column
‘No. X ∗’ according to the Agreements Table (if several tuples have the same
value, it can select randomly between them). This tuple is then included in the
sample. After that a recursive procedure, MinimalSizeSamplerec, is used to
select tuples which, taken together with the tuple inserted last, agree exactly on
as many saturated sets as possible. Using this last inserted tuple, LastTuple,
function SelectTuple_local returns another tuple, ti, that agrees exactly
with LastTuple on one (or more) saturated set, X . The information stored
in the Agreements Table is used to select such a tuple. This information is
updated after each insertion in order to record that tuple ti and the newly
satisfied set (or sets) X are not to be considered in future selections. Then
/* Main function to extract an Armstrong relation for Σ from relation r */
Function MinimalSizeSample(input r : relation, Σ : Set of fd,
                           output Sample : relation) : bool
  /* AgrT.SetOfX is the family of sets X with X ∗ = X still to be satisfied.
     tj is a single tuple from the input relation. */
  ConstructAgreementsTable(r, Σ, AgrT);
  while (AgrT.SetOfX ≠ Ø) do
    tj := SelectTuple(AgrT);            /* Global selection criterion */
    if (tj = ⟨⟩) then                   /* Consistent sampling not possible: */
      return false;                     /*   no Armstrong relation */
    InsertTuple(Sample, tj);            /* Add to the Sample */
    UpdateAgreementsTable(AgrT, Sample, tj);
                                        /* tj is not to be considered again;
                                           some new sets X are now satisfied */
    MinimalSizeSamplerec(AgrT, Sample, tj);
  end while
  return true;
end Function

/* Recursive procedure used by the main function. */
Procedure MinimalSizeSamplerec(input/output AgrT : AgreementsTable,
                               input/output Sample : relation,
                               input LastTuple : tuple)
  /* LastTuple is the last tuple inserted in the sample before making this
     recursive call. ti is a single tuple from the input relation. */
  while (AgrT.SetOfX ≠ Ø) do            /* Satisfy all X with X ∗ = X */
    ti := SelectTuple_local(AgrT, LastTuple);  /* Local selection criterion */
    if (ti = ⟨⟩) then return;           /* No tuple selected */
    InsertTuple(Sample, ti);            /* Add to the Sample */
    UpdateAgreementsTable(AgrT, Sample, ti);
                                        /* ti is not to be considered again;
                                           some new sets X are now satisfied */
    MinimalSizeSamplerec(AgrT, Sample, ti);    /* Keep consistency */
  end while
end Procedure

Fig. 1. Algorithm for Minimal Size Sampling
MinimalSizeSamplerec is called again using the newly inserted tuple.
The chain of recursive calls terminates when either all saturated sets have been
satisfied, or no more sets can be satisfied using tuple tj. In the former case
the algorithm will terminate and the current sample is an Armstrong relation.
In the latter, a new tuple tj will be selected and the process described above
will be repeated. If no such tuple can be found, it means that there are some
saturated sets that cannot be satisfied, and therefore an Armstrong relation cannot be extracted (see Section 6).
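The idea behind Fig. 1 can be sketched with a simplified, iterative greedy variant (this is not the authors' exact recursive procedure; it only illustrates the same invariant: add tuples until every saturated set is satisfied, preferring choices that cover more outstanding sets):

```python
# Greedy sketch of minimal-size sampling: repeatedly pick the pair of
# tuples that agrees exactly on the most still-unsatisfied saturated sets.
from itertools import combinations

def agree_exactly(t1, t2, X, U):
    """Definition 3.7: agree on every attribute in X and on no other."""
    return (all(t1[a] == t2[a] for a in X) and
            all(t1[a] != t2[a] for a in U - X))

def sample_armstrong(r, saturated, U):
    """r maps tuple names to attribute dicts; returns (success, sample names)."""
    sample, todo = set(), set(saturated)
    while todo:
        best, best_cover, best_key = None, set(), (-1, -1)
        for n1, n2 in combinations(r, 2):
            cover = {X for X in todo if agree_exactly(r[n1], r[n2], X, U)}
            # Prefer pairs satisfying more outstanding sets, then pairs
            # reusing tuples already in the sample (to keep it small).
            key = (len(cover), len({n1, n2} & sample))
            if key > best_key:
                best, best_cover, best_key = (n1, n2), cover, key
        if not best_cover:
            return False, sample   # some saturated set cannot be satisfied
        sample |= set(best)
        todo -= best_cover
    return True, sample

U = {"A", "B", "C"}
saturated = [frozenset(), frozenset("A"), frozenset("C"), frozenset("BC")]
r = {
    "t1": {"A": "a1", "B": "b1", "C": "c1"},
    "t2": {"A": "a1", "B": "b2", "C": "c2"},
    "t3": {"A": "a2", "B": "b3", "C": "c1"},
    "t4": {"A": "a2", "B": "b2", "C": "c2"},
    "t5": {"A": "a4", "B": "b3", "C": "c1"},
    "t6": {"A": "a3", "B": "b1", "C": "c1"},
}
ok, names = sample_armstrong(r, saturated, U)
print(ok, sorted(names))
```

On the running example this sketch also extracts a four-tuple sample; note that a minimal Armstrong relation need not be unique, so the greedy choice may return a different four-tuple sample than the trace above.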
5.1.3 Sampling Example
Applying the algorithm of Figure 1 to the Running Example of Section 4.2
and the Agreements Table shown in Table 2 leads to the following scenario.
(1) According to the global selection criteria (function SelectTuple), two tu-
ples can initially be selected since they have the highest value for ‘No. X ∗’:
t1 and t3. Assume that tuple t1 is (randomly) chosen. Now, the recursive
procedure is called using MinimalSizeSamplerec(AgrT, Sample, t1).
(a) Using the Agreements Table it can be seen how t2 must be selected
in order to satisfy A∗ (function SelectTuple_local).
Now MinimalSizeSamplerec(AgrT, Sample, t2) is called.
(i) Set C∗ cannot be satisfied using t2. However, according to the
Agreements Table (row t2), t4 must be selected so that {BC}∗
is satisfied. Additionally, due to the interactions between t1 (al-
ready in the sample) and t4 (just inserted), set Ø∗ is also satis-
fied.
MinimalSizeSamplerec(AgrT, Sample, t4) is called.
(A) The only saturated set that remains to be satisfied is C∗.
However, t4 does not agree exactly with any tuple on C∗.
So this recursive call terminates.
(ii) For the same reason t2 cannot be used. This recursive call also
terminates.
(b) In contrast, there are two tuples that agree exactly on C∗ with t1,
they are t3 and t5. Function SelectTuple_local, using the values
of column ‘No. X ∗’ in Table 2, would select t3 to be inserted into the sample. MinimalSizeSamplerec(AgrT, Sample, t3) is called.
(i) All X ∗ (saturated sets) are satisfied, so there is nothing for this
recursive call to do.
(c) All X ∗ are satisfied, so this recursive call terminates.
(2) All X ∗ are satisfied, function MinimalSizeSample terminates.
The resulting sample (tuples t1, t2, t4, and t3) is exactly the minimal Armstrong
relation identified in Section 4.2.
5.2 Sampling Method (II) - Random Sampling
An alternative method for sampling a database is to select the set of tuples
randomly. If such an approach is followed, the sample size cannot be assured to
be minimal, but the efficiency improvement (both in performance and memory
usage) will be significant, as discussed below.
Following this approach, a list of saturated sets of attributes still to be satis-
fied by the current sample is maintained. Whenever a new tuple is randomly
selected, the algorithm checks whether any new saturated set is satisfied. If
so, this tuple becomes part of the sample and the saturated set now satisfied
is deleted from the list. If not, a new tuple is sampled. This process continues
until the list of saturated sets of attributes to be satisfied is empty.
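The random variant can be sketched as follows (a minimal illustration, with names and the retry bound of my own choosing; the paper does not prescribe a concrete implementation):

```python
# Random sampling sketch: draw tuples at random and retain a draw only if,
# paired with an already retained tuple, it satisfies a saturated set still
# on the to-do list.
import random

def agree_exactly(t1, t2, X, U):
    return (all(t1[a] == t2[a] for a in X) and
            all(t1[a] != t2[a] for a in U - X))

def random_sample(r, saturated, U, seed=0, max_draws=10_000):
    rng = random.Random(seed)
    names = list(r)
    sample = [rng.choice(names)]      # seed the sample with one random tuple
    todo = set(saturated)
    for _ in range(max_draws):
        if not todo:
            return True, sample
        name = rng.choice(names)
        newly = {X for X in todo
                 if any(agree_exactly(r[name], r[s], X, U)
                        for s in sample if s != name)}
        if newly:                     # keep the tuple only if it helps
            if name not in sample:
                sample.append(name)
            todo -= newly
    return not todo, sample

U = {"A", "B", "C"}
saturated = [frozenset(), frozenset("A"), frozenset("C"), frozenset("BC")]
r = {
    "t1": {"A": "a1", "B": "b1", "C": "c1"},
    "t2": {"A": "a1", "B": "b2", "C": "c2"},
    "t3": {"A": "a2", "B": "b3", "C": "c1"},
    "t4": {"A": "a2", "B": "b2", "C": "c2"},
    "t5": {"A": "a4", "B": "b3", "C": "c1"},
    "t6": {"A": "a3", "B": "b1", "C": "c1"},
}
ok, names = random_sample(r, saturated, U)
print(ok, len(names))   # succeeds; sample size depends on the random draws
```

Unlike the greedy method, the sample size here varies with the random order in which tuples happen to be drawn, which is exactly the trade-off discussed above.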
The two approaches, Minimal Size Sample and Random Sample Selection,
must be compared. It is expected that in those cases where the minimal size
sample is much smaller than the initial table, random sampling will, on av-
erage, not perform better than the minimal size sampling algorithm as the
resulting sample will be significantly larger than necessary.
5.3 Complexity of Sampling Algorithms
This Section analyses the worst-case complexity 5 of the algorithms given in
Section 5.1.2 and 5.2. The standard O-Notation [12] will be used.
The minimal size sampling algorithm consists of three phases: (1) Computing
all saturated 6 sets of attributes; (2) Constructing the Agreements Table; (3)
Extracting a minimal size sample using the algorithm shown in Figure 1.
Let n be the number of attributes in U , p the number of tuples in the database relation,
and q the number of functional dependencies in Σ; that is n = |U|, p = |r|,
and q = |Σ|.
(1) Computing X∗ can be done in time linear in the size of Σ and X [1], and thus checking whether X is saturated has complexity O (q + n).
Beeri et al. [3] showed that there exist functional dependency sets for which any Armstrong relation has size exponential in the size of Σ. Therefore any algorithm must face this worst-case complexity. However, this
5 Set operations are ignored for simplicity.
6 Recall from Section 5.1 that it suffices to consider only the intersection generators.
is only the case for highly unnormalised relations, which should be rare. In the case of relations in Boyce-Codd Normal Form [16], the number of saturated sets is bounded by a polynomial in n and q [14]. The degree of this polynomial depends on the number of keys in the relation. Refer to this polynomial as pol(n, q). Therefore this first phase has complexity O (pol(n, q)(q + n)).
(2) For each tuple t1 and each saturated set X, compute which tuples t2 agree with t1 exactly on X. Since |X| ≤ n, this phase has complexity O (pol(n, q)p²n). It can be improved, however, by sorting the relation on X before each test [15], which results in complexity O (pol(n, q)np log p).
(3) The algorithm of Figure 1 has complexity O (pol(n, q)p), ignoring the construction of the Agreements Table, which was analysed in (2).
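The agreement test of phase (2) — finding pairs of tuples that agree exactly on a saturated set X — can be sketched as follows. The sketch groups tuples by their projection on X via hashing rather than the sorting of [15]; the function name and the dictionary-based tuple representation are illustrative assumptions.

```python
from collections import defaultdict

def pairs_agreeing_exactly_on(relation, X, attrs):
    """Return all tuple pairs that agree on every attribute of X and
    disagree on every attribute outside X."""
    buckets = defaultdict(list)
    for t in relation:  # group tuples by their projection on X
        buckets[tuple(t[a] for a in sorted(X))].append(t)
    outside = set(attrs) - set(X)
    pairs = []
    for group in buckets.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                t1, t2 = group[i], group[j]
                if all(t1[a] != t2[a] for a in outside):
                    pairs.append((t1, t2))
    return pairs
```

Only tuples within the same bucket need the exact-agreement check, which avoids comparing all p² pairs directly.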
Therefore the entire algorithm to extract a Minimal Size Sample, if the relation
is normalised, has complexity
O (pol(n, q)np log p) .
This result is comparable to the complexity of known algorithms to build
Armstrong relations using synthetic data [14].
Following a similar analysis it can be seen how the algorithm for Random
Sampling with functional dependencies has complexity
O (pol(n, q)pn) .
6 Sampling with Subsets or Supersets of Σ
The previous Section assumed that the input relation r was an Armstrong
relation for Σ, and the goal was to extract a minimal Armstrong relation.
This Section relaxes this assumption in both directions: first by considering a subset of Σ, the whole set of functional dependencies satisfied by r, and then by considering a superset.
If a subset Γ1 of Σ, with Γ∗1 ⊂ Σ∗, is considered, then sampling is not possible. That is, no sample of a relation which satisfies exactly Σ∗ will satisfy exactly Γ∗1. This is proved next.
Theorem 6.1 Let r be an Armstrong relation for a set of functional dependencies Σ. There cannot be a sample s of relation r which is an Armstrong relation for a proper subset Γ1 of Σ, with Γ∗1 ⊂ Σ∗.
Proof. Let σ be a functional dependency such that σ ∈ Σ∗ but σ /∈ Γ∗1. Since r is an Armstrong relation for Σ, every pair of tuples in r satisfies σ. Therefore, a sample s of r cannot violate σ either. Since s |= σ for some σ /∈ Γ∗1, s cannot be an Armstrong relation for Γ1, as required. □
Consider now sampling with supersets of Σ. If the aim is also to extract an Armstrong relation for a given superset of Σ, then this may not always be possible. To see this, assume that a sample satisfying precisely Γ2 = Σ ∪ {A → C} is to be extracted from a relation containing only tuples {t1, t2, t3, t4} of Table 1. It can be seen that one tuple from {t1, t2} and another from {t3, t4} must be left out of the sample, in order to satisfy the new functional dependency {A → C}. However, the resulting sample (e.g. {t1, t4} or {t2, t3}) satisfies
more functional dependencies than only those in Γ∗2 (e.g. {B → A, C → A}),
thus it cannot be an Armstrong relation for Γ2.
Therefore, in order to sample with a superset of Σ, it is necessary to weaken
the objective of sampling. Instead of requiring the resulting sample to be an
Armstrong relation, the goal is now only to ensure it satisfies all specified
functional dependencies (even if it may also satisfy others).
Theorem 6.2 Let r be an Armstrong relation for a set of functional depen-
dencies Σ over U . There exists a sample s of r which satisfies a superset Γ2
of Σ, Σ∗ ⊂ Γ∗2.
Proof. Let γ1 be a functional dependency such that γ1 ∈ Γ2 and γ1 /∈ Σ∗.
Without loss of generality assume γ1 is of the form Y → A, with Y ⊂ U and
A ∈ U . Initially, let sample s be the entire relation r. For any pair of tuples
t1 and t2 in s with t1(Y ) = t2(Y ) and t1(A) ≠ t2(A), remove t2 from s. No
functional dependency in Σ can become unsatisfied by removing a tuple from
s (Theorem 6.1), therefore now s |= Σ∗∪{γ1}. (Note that by removing a tuple,
however, a new functional dependency not in Γ∗2 may become satisfied.) This
process can be repeated for any additional γ2 ∈ Γ2 \ {γ1} and γ2 /∈ Σ∗ ∪ {γ1}
until s |= Γ2, as required. □
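The removal process used in the proof can be sketched as a greedy single pass per new functional dependency. The function name, the dictionary-based tuple representation, and the example values in the usage below are hypothetical, not taken from the paper's Table 1.

```python
def enforce_fd(tuples, lhs, rhs_attr):
    """Remove tuples violating the FD lhs -> rhs_attr: for each value of
    the lhs projection, keep the first rhs value encountered and drop
    any later tuple that disagrees with it (the t2 of the proof)."""
    kept, seen = [], {}
    for t in tuples:
        key = tuple(t[a] for a in sorted(lhs))
        if key not in seen:
            seen[key] = t[rhs_attr]
            kept.append(t)
        elif t[rhs_attr] == seen[key]:
            kept.append(t)
    return kept
```

Applying enforce_fd once per dependency in Γ2 \ Σ∗ mirrors the repetition step of the proof; as the proof notes, removing tuples can never cause a dependency already satisfied to become violated.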
The algorithm given in Figure 1 can easily be modified to sample with supersets of Σ, or to detect when this is not possible, as in the previous example. All that is required is to ensure that functions SelectTuple() and SelectTuple_local() do not select tuples that satisfy non-saturated sets of attributes. As an example, consider Γ2 = Σ ∪ {A → C} = {B → C, AC → B, A → C} and the (entire) relation shown in Table 1 of Section 4.2. The corresponding set of saturated sets is {Ø, C, BC}, therefore the set that must not
be satisfied is {A∗} (see Table 2). Applying the algorithm of Figure 1 with the
suggested modification, the following sample would be extracted: {t1, t5, t4, t6}.
This sample is indeed an Armstrong relation for Γ2. The same procedure will
demonstrate why, in the previous example, no sample of {t1, t2, t3, t4} can be
an Armstrong relation of Γ2.
The two theorems presented in this Section identify what can be expected
when sampling with functional dependencies, and the minimum knowledge
required about the database relation being sampled in order to do so (i.e. a
set of functional dependencies for which the input relation is an Armstrong
relation). If the set of functional dependencies is not the minimum required,
the algorithm given in Section 5.1.2 can detect this situation, notifying the
database designer and increasing the understanding of this relation.
7 Summary and Future Work
This paper has investigated how a database relation can be sampled so that
the resulting set of tuples satisfies precisely those functional dependencies
that are logical consequences of a given set of functional dependencies, that
is, it is an Armstrong relation. This has been achieved by using the concept
of closure of a set of attributes under a set of functional dependencies, fd-
closure, which has then been used to compute the so-called saturated sets of
attributes. The saturated sets guide the sampling process to select tuples in
such a way that all functional dependencies not in the initial set are violated
(i.e. not satisfied). Two algorithms have been proposed, one which extracts a
sample with the minimum number of tuples, and the other which selects tuples
randomly. Initially it has been assumed that sampling would be performed
according to the whole set of functional dependencies satisfied by the relation,
denoted by Σ. Then this assumption has been relaxed, considering subsets
and supersets of the initial set of functional dependencies.
As outlined in Section 5.1.2, the algorithm presented in this paper to extract
a minimal size sample is in fact suboptimal, in the sense that a sample of
smaller size than the one generated by the algorithm could exist in some cir-
cumstances. This is due to the fact that the algorithm considers only local
information to decide which tuples must be inserted into the sample. How
global information could be used to ensure minimal sample size, without ex-
cessively increasing its complexity, is still to be investigated.
The example in Section 6 showed that extracting an Armstrong relation for a superset of Σ is not always possible. An interesting line of work could investigate in which cases it is actually possible to extract an Armstrong relation for a superset of Σ.
Finally, it should be noted that the concept of Armstrong relation is not
exclusively related to functional dependencies, but it has been extended to a
more general type of dependencies [10]. Database sampling with this type of
dependencies, in order to extract Armstrong relations for them, would be an
interesting area for future work.
References
[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-
Wesley, 1995.
[2] W.W. Armstrong. Dependency structures of data base relationships. In J.L.
Rosenfeld, editor, Information Processing 74 (Proceedings of IFIP Congress 74),
pages 580–583. IFIP, North-Holland, August 1974.
[3] C. Beeri, M. Dowd, R. Fagin, and R. Statman. On the structure of Armstrong relations for functional dependencies. Journal of the ACM, 31(1):30–46, January 1984.
[4] B. Beizer. Black-Box Testing: Techniques for Functional Testing of Software and Systems. Wiley, 1995.
[5] J. Bisbal. Database Sampling to Support the Development of Data-Intensive
Applications. PhD thesis, University of Dublin, Trinity College, October 2000.
[6] J. Bisbal and J. Grimson. Database prototyping through consistent sampling.
In the International Conference on Advances in Infrastructure for Electronic
Business, Science, and Education on the Internet (SSGRR’2000). Scuola
Superiore Guglielmo Reiss Romoli (SSGRR), August 2000.
[7] J. Bisbal and J. Grimson. Generalising the consistent database sampling
process. In B. Sanchez, N. Nada, A. Rashid, T. Arndt, and M. Sanchez, editors,
Proceedings of the Joint meeting of the 4th World Multiconference on Systemics,
Cybernetics and Informatics (SCI’2000) and the 6th International Conference
on Information Systems Analysis and Synthesis (ISAS’2000), volume II -
Information Systems Development, pages 11–16. International Institute of
Informatics and Systemics (IIIS), July 2000.
[8] J. Bisbal, D. Lawless, B. Wu, and J. Grimson. Legacy information systems:
Issues and directions. IEEE Software, 16(5):103–111, September/October 1999.
[9] J. Bisbal, B. Wu, D. Lawless, and J. Grimson. Building consistent
sample databases to support information system evolution and migration.
In G. Quirchmayr, E. Schweighofer, and T. J.M. Bench-Capon, editors,
Proceedings of the 9th International Conference on Database and Expert
Systems Applications (DEXA’98), volume 1460 of Lecture Notes in Computer
Science, pages 196–205. Springer-Verlag, 1998.
[10] R. Fagin. Horn clauses and database dependencies. Journal of the ACM,
29(4):952–985, October 1982.
[11] J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger.
Quickly generating billion-record synthetic databases. In Proceedings of the
International Conference on Management of Data (SIGMOD 1994), pages 243–
252. ACM Press, 1994.
[12] D.E. Knuth. Fundamental Algorithms, volume 1 of The Art of Computer
Programming. Addison-Wesley, third edition, 1997.
[13] P. Lohr-Richter and A. Zamperoni. Validating database components of software
systems. Technical Report 94–24, Leiden University, Department of Computer
Science, 1994.
[14] H. Mannila and K.-J. Raiha. Small Armstrong relations for database design. In Proceedings of the Fourth ACM SIGACT-SIGMOD Symposium on Principles of Database Systems (PODS’85), pages 245–250, 1985.
[15] H. Mannila and K.-J. Raiha. Dependency inference. In P.M. Stocker, W. Kent,
and P. Hammersley, editors, Proceedings of 13th International Conference on
Very Large Data Bases (VLDB’87), pages 155–158. Morgan Kaufmann, 1987.
[16] H. Mannila and K.-J. Raiha. The Design of Relational Databases. Addison-
Wesley, 1992.
[17] A. Neufeld, G. Moerkotte, and P. C. Lockemann. Generating consistent test
data: Restricting the search space by a generator formula. VLDB Journal,
2(2):173–213, April 1993.
[18] H. Noble. The automatic generation of test data for a relational database.
Information Systems, 8(2):79–86, 1983.
[19] I. Sommerville. Software Engineering. Addison-Wesley, fifth edition, 1995.
[20] L. Tucherman, M.A. Casanova, and A.L. Furtado. The CHRIS consultant: a tool for database design and rapid prototyping. Information Systems, 15(2):187–195, 1990.
[21] G. Wiederhold. Modeling and system maintenance. In M.P. Papazoglou,
editor, Proceedings of the 14th International Conference on Object-Oriented and
Entity-Relationship Modeling (OOER’95), pages 1–20. Springer-Verlag, 1995.
[22] B. Wu, D. Lawless, J. Bisbal, R. Richardson, J. Grimson, V. Wade, and
D. O’Sullivan. The butterfly methodology: A gateway-free approach for
migrating legacy information systems. In B. Werner, editor, Proceedings
of the 3rd IEEE Conference on Engineering of Complex Computer Systems
(ICECCS’97), pages 200–205. IEEE Computer Society Press, September 1997.
[23] M. Yannakakis. Perspectives on database theory. In Proceedings of the 36th
Annual Symposium on Foundations of Computer Science, pages 224–246. IEEE
Computer Society Press, 1995.
[24] A. Zamperoni and P. Lohr-Richter. Enhancing the quality of conceptual
database specifications through validation. In Proceedings of the 12th
International Conference on Entity-Relationship Approach (ER’93), volume 823
of Lecture Notes in Computer Science, pages 85–98. Springer-Verlag, 1993.