
Database Sampling with Functional

Dependencies

Jesús Bisbal, Jane Grimson

Department of Computer Science, Trinity College Dublin, Dublin 2, Ireland

Abstract

During the development of information systems there is a need to prototype the

database which the applications will use when in operation. A prototype database

can be built by sampling data from an existing database. Including relevant semantic

information when extracting a sample from a database is considered invaluable to

support the development of data-intensive applications. Functional dependencies

are an example of semantic information that could be considered when sampling a

database.

This paper investigates how a database relation can be sampled so that the result-

ing sample satisfies precisely a given set of functional dependencies (and its logical

consequences), i.e. is an Armstrong relation.

Key words: Information System Development, Legacy Migration, Database

Sampling, Integrity Constraints, Database Prototyping, Armstrong Relation,

Functional Dependencies

1 Introduction

During the development of an information system there is a need to construct

a prototype of the operational database. Such a database can support several

stages of the development process:

- Validation of database and/or application requirements. Different designs can be checked for completeness and correctness [17];
- User training. Applications will need a database on which to operate during training;
- Testing. As is needed in, for example, a system's functional test [4], performance evaluation [11], and back-to-back testing [19].

Most of the existing approaches to building prototype databases use synthetic

data values to populate the resulting database. Although using synthetic data

may be the only alternative, the lack of data values relevant to the application

domain limits the usefulness of the prototype database [5,6]. Increasingly,

however, real (existing) data is actually available. It has been reported that

from 60% to 85% of the total software development costs in industry can be

attributed to maintenance [21]. Thus it is reasonable to expect that in over 60%

of projects there will be a database from which real data can be extracted. This

paper proposes that this data should be used to populate a prototype database.

The resulting database is called a Sample Database. A database can be sampled

according to many different criteria, such as for example random selection, or

selecting data items in such a way that the resulting sample satisfies a set of

integrity constraints. Functional dependencies are an example of such integrity

constraints, which have been widely investigated in the context of relational

databases [3,10,1]. They capture intuitive relationships between data items,


and play a central role in the design of databases [16]. Sampling a database in

such a way that the resulting sample satisfies a set of functional dependencies

will produce a sample with appropriate semantic information to support the

database design process, as outlined above. Since it will also contain domain

relevant data values this sample can be expected to support user training and

testing better than when using synthetic values.

This research was initially motivated in the context of Legacy Information

Systems Migration [8], where the need to identify a representative sample

(in terms of semantic information and data values) of an existing operational

database is essential to the success of any migration project [22]. A sample of

the migrating database can be used to thoroughly test the mapping between

legacy and target schemes; additionally, a migrated sample can support the

development of the target information system, in the stages identified above.

Using the entire legacy database for development and testing purposes only

may not be cost-effective. This paper investigates the use of functional de-

pendencies in the context of Database Sampling. This same problem has also

been investigated considering other types of integrity constraints, as will be

briefly outlined in Section 2.

The remainder of this paper is organised as follows. The next Section reviews

existing research in the area of database prototyping and database sampling.

Required notation and terminology are introduced in Section 3. Section 4 de-

fines the objectives when sampling with functional dependencies, and describes

a concrete database sampling example which will be used in later Sections.

Two alternative sampling algorithms are analysed in Section 5. The goal of

Sampling with functional dependencies, as defined in Section 4, is generalised

in Section 6. The final Section summarises the paper and gives a number of


future directions for this research.

2 Related Work

Existing research into database prototyping methods has focused mainly on

populating the database using synthetic values, i.e. not necessarily related to

the database application domain. The objective is to produce a database satis-

fying a predefined set of integrity constraints, which represent domain-relevant

semantics for the data. Noble [18] describes one such method, which considers

referential integrity constraints and functional dependencies, but populates

the database mainly with synthetic values, although the user can also enter a

list of values to be used as the domain for an attribute. A different database

prototyping approach is presented in [24]. This method firstly checks for the

consistency of an Extended Entity-Relationship (EER) schema defined for

the database being prototyped, based on cardinality constraints only. Once

the design has been proved consistent, a database according to this schema

is generated. To guide the generation process, a so-called general Dependency

Graph is created from the EER diagram. This graph represents the set of ref-

erential integrity constraints that must hold in the database and which is used

to define a partial order between the entities of the EER diagram. The syn-

thetic data generation process populates the database entities following this

partial order. Lohr-Richter [13] further develops the data generation step used

by this method, and recognises the need for additional information in order

to generate relevant data for the application domain, without addressing how

this is to be achieved. Tucherman [20] presents a database design tool which

prototypes a database also based on an Entity-Relationship Schema. The tool


automatically maps this design into a normalised relational schema. Special

attention is paid in this contribution to what are called restrict/propagate

rules, that is, to enforcing referential integrity constraints. When an operation

is going to violate one such constraint, the system can be instructed to either

block (restrict) the operation so that such violation does not occur or to prop-

agate it to the associated tables by deleting or inserting tuples as required. No

explicit reference is made as to how the resulting database should be popu-

lated for prototyping, although it identifies the possibility of interacting with

the user when new insertions are required, as the user may be the only source

for appropriate domain-relevant data values.

An approach more closely related to the work presented here was reported by

Mannila in [14]. This paper described a mechanism for efficiently populating

a database relation so that it satisfies exactly a predefined set of functional

dependencies (and thus being an Armstrong relation as will be defined in

Section 3). Such a relation can be used to assist the database designer in

identifying the appropriate set of functional dependencies which the database

must satisfy. Since this relation satisfies all the required dependencies and only

those, it can be seen as an alternative representation for the dependencies

themselves. A relation generated using this method is expected to expose

missing or undesirable functional dependencies, and the designer can use it

to iteratively refine the database design until it contains only the required

dependencies. Special attention is paid in [14] to the size of the relation being

generated. A designer might not be able to identify missing or undesirable

dependencies if he was presented with a relation containing a large number of

tuples. For this reason, the goal is to produce a relation that, satisfying exactly

a given set of functional dependencies, contains the smallest possible number


of tuples. Refer to [16] for a more extended treatment of this approach.

All database prototyping approaches described above populate the database

using data values not related to the application domain. This limits their ap-

plicability, especially in cases where data values from such prototype databases

must be presented to an end user, as is common in software prototyping for

requirements elicitation. A user who is unfamiliar with the data values pre-

sented, may not fully understand their semantics and therefore may not be

able to comment on required modifications, enhancements or missing func-

tionality.

This paper investigates how a prototype database can be built by sampling

from an operational database and thus using domain-relevant data. Limited

work has been reported to date in the area of Database Sampling resulting

in a database which satisfies a predefined set of integrity constraints. Bisbal

et al. [9] described a mechanism to represent a wide range of integrity con-

straints (including cardinality constraints, inclusion dependencies, hierarchical

relationships and a generalisation of referential integrity constraints) especially

well suited for sampling purposes. In a related paper [7] the database sampling

process was analysed from a more abstract point of view, without focusing on

any particular type of integrity constraint, and thus identifying the issues

that must always be addressed when consistently sampling from a database.

The authors [5,6] have extensively discussed the motivations and application

areas of database prototyping and database sampling, and reviewed existing

research in this area. The design of a database sampling tool which implements

the work presented here has been reported in [5,7].

The work presented here has similar objectives to those described in [14]. The


aim is also to produce a small database relation that satisfies a predefined set of

functional dependencies. The difference, however, is that the resulting relation

must be a subset of a relation in an existing operational database. Using such

a relation as the source of data in software prototyping, an end user would

be familiar with the data values presented. Also, a database designer, when

maintaining an existing database, could identify dependencies that no longer

should be satisfied or dependencies that should be added as a consequence of

a change in the environment in which the database operates. Using the entire

operational database relation may not be cost-effective in some applications

of database prototyping, as for example in legacy migration. Also, as outlined

above, the size of the entire relation is likely to be too large to be manageable

by the designer.

Note that the need for a sample of a database arises in a context significantly

different from that which motivated the database prototyping approaches de-

scribed earlier. Methods that produce synthetically generated databases are

needed when developing completely new applications. Currently, there is an

increasing need to develop or extend applications based on existing software

and data (enhancements, migration, etc.) which motivates the need to address

the traditional problem but with an additional constraint, the availability of

existing applications and related data. Refer to [8] for a more extended dis-

cussion on the challenges posed by existing (legacy) applications.

3 Definitions and Notation

This section defines the terminology and notation based on standard defini-

tions [23,1] that will be used throughout this paper. Examples to illustrate


these definitions will be given in later Sections, where they will be used.

Assume there is a set of attributes U , each with an associated domain. A relation schema R over U is a subset of U . A database schema D = {R1, . . . , Rn} is a set of relation schemas. A relation r over the relation schema R is a set of R-tuples, where an R-tuple is a mapping from the attributes of R to their domains.

Attributes in U are denoted with uppercase letters A, B, C, . . . . The notation

will be simplified when referring to sets of attributes, e.g. {A, B}, by removing

the curly brackets, e.g. AB, which should not cause confusion as it will be clear

from the context. R-tuples are denoted by lowercase letters, possibly with

subscripts, t1, t2, . . . . The relation r to which an R-tuple belongs will always

be clear from the context. The value of an attribute, A ∈ U , for a particular

R-tuple, t1, is denoted t1(A). This notation is extended to sets of attributes

so that for Y ⊆ U , t1(Y ) denotes a tuple, u1, over Y with u1(A) = t1(A) for

all attributes A ∈ Y .

A functional dependency can be defined as follows.

Definition 3.1 If U is a set of attributes, then a Functional Dependency over

U is an expression of the form Y → Z, where Y, Z ⊆ U . A relation r over U

satisfies Y → Z, denoted r |= Y → Z, if for each pair of tuples in r, t1 and t2,

t1(Y ) = t2(Y ) implies t1(Z) = t2(Z). If Σ is a set of functional dependencies,

then r |= Σ means that all dependencies in Σ are satisfied in r.
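Definition 3.1 translates directly into a check over all pairs of tuples. The following sketch (the dict encoding of tuples and the function name are illustrative, not from the paper) tests the first four tuples of the running example of Section 4.2 against Σ = {B → C, AC → B}:

```python
# Check Definition 3.1: r satisfies Y -> Z iff every pair of tuples that
# agrees on Y also agrees on Z. Tuples are modelled as attribute->value dicts.
from itertools import combinations

def satisfies_fd(r, Y, Z):
    """Return True iff relation r (a list of tuple-dicts) satisfies Y -> Z."""
    for t1, t2 in combinations(r, 2):
        if all(t1[a] == t2[a] for a in Y) and any(t1[a] != t2[a] for a in Z):
            return False
    return True

r = [
    {"A": "a1", "B": "b1", "C": "c1"},  # t1
    {"A": "a1", "B": "b2", "C": "c2"},  # t2
    {"A": "a2", "B": "b3", "C": "c1"},  # t3
    {"A": "a2", "B": "b2", "C": "c2"},  # t4
]
print(satisfies_fd(r, {"B"}, {"C"}))       # True:  B -> C holds
print(satisfies_fd(r, {"A", "C"}, {"B"}))  # True:  AC -> B holds
print(satisfies_fd(r, {"A"}, {"B"}))       # False: t1 and t2 violate A -> B
```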

Individual functional dependencies are denoted with lowercase Greek letters,

possibly with subscripts, σ, γ1, γ2 . . . . Sets of functional dependencies are de-

noted using uppercase Greek letters, possibly with subscripts, Σ, Γ1, Γ2, . . . .

Definition 3.2 Let Σ and Γ be sets of functional dependencies over U . Then

Σ implies Γ, denoted Σ |= Γ, iff for all relations r over U , r |= Σ implies

r |= Γ.

Definition 3.3 Let Σ be a set of functional dependencies over U . The closure

of Σ, denoted Σ∗, is defined as Σ∗ = {Y → Z|Y Z ⊆ U and Σ |= Y → Z}.

Σ∗ represents the set of all functional dependencies that are logical conse-

quences of Σ. This concept allows for the definition of what are called Arm-

strong Relations, a concept initially introduced in [2], although the term itself

was coined by Fagin in [10].

Definition 3.4 A database relation r is an Armstrong relation for a set of

functional dependencies Σ iff r satisfies all functional dependencies in Σ∗ and

no other functional dependency. It is minimal if every Armstrong Relation for

Σ has at least as many tuples as r.

It should be noted that a set of functional dependencies and an Armstrong

relation for it are dual representations of the same information [15]. During the

design and maintenance of a database, incorrect constraints are more easily

identified from a list of dependencies, whereas missing dependencies are better

revealed from an example of the relation. This justifies the use of Armstrong

relations for design purposes.

The construction of an Armstrong relation presented here relies on the closure

of a set of attributes under a set of functional dependencies, called fd-closure,

which is defined next [3,1].


Definition 3.5 Given a set Σ of functional dependencies over U and an attribute set X ⊆ U , the fd-closure of X under Σ, denoted (X , Σ)∗,U or simply X ∗ if Σ and U are understood from the context, is the set X ∗ = {A ∈ U | Σ |= X → A}.

Intuitively, X ∗ represents the set of all attributes that are determined by X

according to Σ. Some sets of attributes, referred to as saturated [2] (or closed

sets in [3]), do not determine any other attributes but themselves.

Definition 3.6 A set of attributes X over U is saturated with respect to a

set of functional dependencies Σ over U iff X ∗ = X .
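A naive sketch of Definitions 3.5 and 3.6 (a fixpoint iteration rather than the linear-time method of [1]; names are illustrative) recovers the saturated sets of the running example:

```python
# fd-closure X* of an attribute set X (Definition 3.5) by fixpoint iteration,
# and enumeration of all saturated subsets of U (Definition 3.6).
from itertools import chain, combinations

def fd_closure(X, sigma):
    """Closure of attribute set X under the dependencies sigma = [(lhs, rhs), ...]."""
    closure = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in sigma:
            if lhs <= closure and not rhs <= closure:
                closure |= rhs   # whole LHS is present, so the RHS is determined
                changed = True
    return frozenset(closure)

def saturated_sets(U, sigma):
    """All X subseteq U with X* = X, including U itself."""
    subsets = chain.from_iterable(combinations(sorted(U), k) for k in range(len(U) + 1))
    return {frozenset(X) for X in subsets if fd_closure(X, sigma) == frozenset(X)}

U = {"A", "B", "C"}
sigma = [(frozenset("B"), frozenset("C")), (frozenset("AC"), frozenset("B"))]
print(sorted("".join(sorted(s)) for s in saturated_sets(U, sigma)))
# -> ['', 'A', 'ABC', 'BC', 'C']  (the paper omits U = ABC by convention)
```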

Following the terminology of [3], we define the concept of two tuples agreeing

exactly on a set of attributes as follows.

Definition 3.7 Let X be a set of attributes, X ⊂ U . A pair of tuples t1 and t2 over U agree exactly on X iff ∀A ∈ X , t1(A) = t2(A) and ∀B ∉ X , t1(B) ≠ t2(B).

This concept is next extended to entire relations.

Definition 3.8 A database relation r over U satisfies X , with X ⊂ U , iff

there exist two different tuples in r that agree exactly on X .
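Definitions 3.7 and 3.8 amount to computing, for a pair of tuples, the exact set of attributes on which their values coincide. A minimal sketch (function names are illustrative):

```python
# Two tuples agree *exactly* on X when X is precisely the set of attributes
# on which they coincide (Definition 3.7); a relation satisfies X when some
# pair of its tuples agrees exactly on X (Definition 3.8).
from itertools import combinations

def agreement_set(t1, t2):
    """The set of attributes on which tuples t1 and t2 have equal values."""
    return frozenset(a for a in t1 if t1[a] == t2[a])

def relation_satisfies(r, X):
    """True iff some pair of tuples in r agrees exactly on X."""
    return any(agreement_set(t1, t2) == frozenset(X) for t1, t2 in combinations(r, 2))

t1 = {"A": "a1", "B": "b1", "C": "c1"}
t2 = {"A": "a1", "B": "b2", "C": "c2"}
print(sorted(agreement_set(t1, t2)))      # ['A']: t1 and t2 agree exactly on A
print(relation_satisfies([t1, t2], "A"))  # True
```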

This last definition is central to the purposes of this paper. As will be shown

by Theorem 5.1, if relation r satisfies precisely all saturated sets then it is an

Armstrong relation for the specified set of functional dependencies. Therefore,

the goal is to develop an algorithm that selects a set of tuples from a relation

in such a way that the resulting relation satisfies all saturated sets.


4 Sampling with Functional Dependencies

This Section defines the objective of the process described in this paper, that

is, what is involved when sampling a relation according to a given set of

functional dependencies, and also introduces an example.

4.1 Defining the Problem

Given a database relation r over U , and a set of functional dependencies Σ

satisfied by r, the goal is to extract a sample of this relation which is a small 1

Armstrong relation for Σ.

The exposition given here will firstly assume that r is an Armstrong relation

for Σ. Then, the same problem will be addressed in Section 6 considering a

subset Γ1 and a superset Γ2 of Σ, Γ1 ⊆ Σ ⊆ Γ2.

4.2 Running Example

The following is an example that will be used in this paper to illustrate the

concepts being developed.

Consider the set of functional dependencies Σ = {B → C, AC → B}, and

the database relation r over U = {A, B, C} shown in Table 1. It can be seen

how tuples t1, t2, t3 and t4 in this relation form a minimal Armstrong relation

for Σ. The entire relation r is also an Armstrong relation for Σ (adding tuples

t5 and t6 does not invalidate any functional dependency in Σ). It follows that

1 The sense in which it is small will be addressed in Section 5.1.2.


Table 1. Running Example - Armstrong relation for Σ = {B → C, AC → B}

       A    B    C
  t1   a1   b1   c1
  t2   a1   b2   c2
  t3   a2   b3   c1
  t4   a2   b2   c2
  t5   a4   b3   c1
  t6   a3   b1   c1

this is an example of a relation from which a sample can be extracted so that

the set of functional dependencies it satisfies is the same.

5 Sampling Algorithms

This Section describes the two algorithms that are used in this paper to select

a set of tuples from a relation so that the resulting sample is an Armstrong

relation for a predefined set of functional dependencies. The first algorithm

extracts a minimal size sample. The second algorithm selects tuples randomly.

Although the size of the sample resulting from the second algorithm can

be considerably larger than that from the first method, the efficiency improvement

can be very significant.

5.1 Sampling Method (I) - Minimal Sample Size

In order to guarantee the minimal possible size for the resulting sample, the

algorithm presented here relies on constructing what is called in this paper

an Agreements Table. This table represents the interactions between tuples in

the input relation r regarding the set of functional dependencies each pair of


tuples, if included in the sample, invalidates (see below for an example). This

information is used to guide the sampling process 2 . The construction of the

Agreements Table is based on the concept of a set of attributes being saturated

(Definition 3.6), as described below.

5.1.1 Agreements Table

The fd-closure is used in an Agreements Table to compute all sets of attributes

X ⊂ U such that X ∗ = X , that is, all saturated sets. An Agreements Table

gathers, for each tuple in the input database relation, which other tuples can

be used with that one to agree exactly (Definition 3.7) on each saturated set.

As an example, using Σ = {B → C, AC → B} as in Section 4.2, the saturated

sets of attributes are {A, C, BC, Ø} 3 . An Agreements Table has a column for

each such set. Table 2 shows the Agreements Table for the Running Example

of Section 4.2. For example, tuple t2 does not agree exactly on C∗ with any

other tuple. Even though t2 and t4 do have the same value for attribute C,

they also have the same value for B, and a different value for A, for this

reason they agree exactly on {B, C}∗, not on B∗. As another example, it can

be seen how tuple t1 can only use tuple t2 to agree exactly on A∗. This follows

from the fact that in relation r of Section 4.2 we have t1(A) = t2(A), but

t1(B) ≠ t2(B) and t1(C) ≠ t2(C), as required by Definition 3.7. The meaning

of this relationship between t1 and t2 is that if both tuples are included in

2 All database sampling strategies require some information structure to guide the sampling process [7].
3 To simplify the exposition, this paper is considering all saturated sets. It has been shown [2,3] that it suffices to consider only the set of so-called intersection generators of this set. The algorithm presented below is valid in either case. By convention, U is the intersection of the empty collection of sets, thus it is not a generator (although it is saturated). For this reason it has not been included here. In the relation example used in this paper, the set of generators and the set of all saturated sets are the same except for U.


Table 2. Agreements Table for the Running Example

       A∗      C∗          {BC}∗    Ø∗              No. X ∗
  t1   {t2}    {t3, t5}    {t6}     {t4}            4
  t2   {t1}    —           {t4}     {t3, t6, t5}    3
  t3   {t4}    {t1, t6}    {t5}     {t2}            4
  t4   {t3}    —           {t2}     {t1, t6, t5}    3
  t5   —       {t1, t6}    {t3}     {t2, t4}        3
  t6   —       {t3, t5}    {t1}     {t2, t4}        3

the sample, no functional dependency with only attribute ‘A’ in the left-hand

side will be satisfied. The algorithm described below exploits this fact in order

to build a sample of the input relation that violates (i.e. does not satisfy) all

functional dependencies not in Σ∗.
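The Agreements Table can be reconstructed mechanically from Definitions 3.6 and 3.7. The sketch below (illustrative, not the paper's implementation) rebuilds the rows of Table 2 for the running example:

```python
# Build the Agreements Table: for each tuple and each saturated set X (other
# than U), record which other tuples agree exactly on X with it.
from itertools import combinations

r = {
    "t1": {"A": "a1", "B": "b1", "C": "c1"},
    "t2": {"A": "a1", "B": "b2", "C": "c2"},
    "t3": {"A": "a2", "B": "b3", "C": "c1"},
    "t4": {"A": "a2", "B": "b2", "C": "c2"},
    "t5": {"A": "a4", "B": "b3", "C": "c1"},
    "t6": {"A": "a3", "B": "b1", "C": "c1"},
}
saturated = [frozenset("A"), frozenset("C"), frozenset("BC"), frozenset()]

def agreement_set(t1, t2):
    return frozenset(a for a in t1 if t1[a] == t2[a])

# table[name][X] = names of the tuples agreeing exactly on X with `name`
table = {name: {X: set() for X in saturated} for name in r}
for (n1, t1), (n2, t2) in combinations(r.items(), 2):
    X = agreement_set(t1, t2)
    if X in table[n1]:            # ignore agreement sets outside the saturated list
        table[n1][X].add(n2)
        table[n2][X].add(n1)

print(table["t1"][frozenset("A")])  # {'t2'}  - matches row t1 of Table 2
print(table["t2"][frozenset("C")])  # set()  - t2 agrees exactly on C with no tuple
```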

The rationale behind an Agreements Table is that if a sample is selected in

such a way that it satisfies all saturated attributes sets then all functional

dependencies in Σ∗, and only those, hold in the resulting sample. This is

formalised next.

Theorem 5.1 Let r be a database relation and Σ a set of functional depen-

dencies, both over U . If relation r satisfies precisely all sets of attributes X

with X ∗ = X , then r is an Armstrong relation for Σ.

Proof. 4 Firstly, to prove that all functional dependencies in Σ∗ hold in r, let σ be a functional dependency σ ∈ Σ∗. Without loss of generality, assume that σ is of the form X → A, with X ⊂ U and A ∈ U . Note that X∗ is saturated. By hypothesis, any pair of distinct tuples agrees exactly on some saturated set. If t1 and t2 agree on X, they agree exactly on a saturated set Z with X ⊆ Z; since Z is saturated, X∗ ⊆ Z∗ = Z, so t1 and t2 also agree on X∗. Since A ∈ X∗, t1 and t2 agree also on A. It follows that r |= σ, as required.

To prove that only the dependencies in Σ∗ hold in r, take a functional dependency X → A ∉ Σ∗. This implies A ∉ X∗. By hypothesis there is a pair of tuples, t1 and t2, that agree exactly on X∗, therefore t1(X) = t2(X) and t1(A) ≠ t2(A). So, by definition, r ⊭ X → A. This proves that r is an Armstrong relation for Σ. □

4 This proof has been included here for completeness only. Refer to [3, p. 37] or [2, p. 582] for alternative proofs.

In addition to one column for each saturated set, another column, termed ’No.

X ∗’ (Number X ∗), is shown in the Agreements Table of Table 2. It represents

the number of different sets X that could be satisfied by selecting this tuple

(i.e. number of non-empty columns in each row). This column is used to select

between several possible tuples that could be inserted into the sample, in order

for it to be of minimal size. Selecting tuples with the highest value for ‘No.

X ∗’ leads to the minimum possible additional tuples required to satisfy all

saturated sets.

5.1.2 Minimal Size Sampling Algorithm

The above discussion has shown how, using an Agreements Table, the result-

ing sample (1) is an Armstrong relation, and (2) is expected to be small. This

solves the problem as stated in Section 4.1. The algorithm presented in this

Section extracts a sample of minimal size. It is only minimal according to the

information available in the Agreements Table. Tuples are selected according

to their values for column ’No. X ∗’ in this table. These values represent the

number of saturated sets to which each particular tuple is directly related (in

the sense of being able to satisfy them), as opposed to considering all satu-


rated sets to which each tuple is transitively related. These two alternatives

can be identified with selecting tuples using either local information or global

information, respectively. The algorithm that follows would not change if the

second alternative was followed; however, the complexity (in terms of execu-

tion time) of building an Agreements Table would be significantly higher.

Fig 1 shows an algorithm that implements the sampling process described

above. The initial call for this recursive algorithm should be

MinimalSizeSample(r, Σ, Sample)

where r is the relation being sampled, Σ is the set of functional dependen-

cies being considered, and Sample contains the resulting sample relation. In

addition to a sample of relation r, this algorithm also returns whether sam-

pling was successfully performed. (The situation in which it is not possible to

extract an Armstrong relation for Σ from the input relation r is analysed in

Section 6.)

The algorithm initially selects a tuple, tj, with the highest value for column

‘No. X ∗’ according to the Agreements Table (if several tuples have the same

value, it can select randomly between them). This tuple is then included in the

sample. After that, a recursive procedure, MinimalSizeSamplerec, is used to

select tuples which, taken together with the tuple inserted last, agree exactly on

as many saturated sets as possible. Using this last inserted tuple, LastTuple,

function SelectTuple_local returns another tuple, ti, that agrees exactly

with LastTuple on one (or more) saturated set, X . The information stored

in the Agreements Table is used to select such a tuple. This information is

updated after each insertion in order to record that tuple ti and the newly

satisfied set (or sets) X are not to be considered in future selections. Then


/* Main function to extract an Armstrong relation for Σ from relation r */
Function MinimalSizeSample(input r : relation, Σ : Set of fd,
                           output Sample : relation) : bool
  /* AgrT.SetOfX is the family of sets X with X ∗ = X still to be satisfied.
     tj is a single tuple from the input relation. */
  ConstructAgreementsTable(r, Σ, AgrT);
  while (AgrT.SetOfX ≠ Ø) do
    tj := SelectTuple(AgrT);                  /* Global selection criterion */
    if (tj = 〈〉) then                         /* Consistent sampling not possible: */
      return false;                           /*   no Armstrong relation */
    InsertTuple(Sample, tj);                  /* Add to the Sample */
    UpdateAgreementsTable(AgrT, Sample, tj);  /* tj is not to be considered again;
                                                 some new sets X are now satisfied */
    MinimalSizeSamplerec(AgrT, Sample, tj);
  end while
  return true;
end Function

/* Recursive procedure used by the main function. */
Procedure MinimalSizeSamplerec(input/output AgrT : AgreementsTable,
                               input/output Sample : relation,
                               input LastTuple : tuple)
  /* LastTuple is the last tuple inserted in the sample before making this
     recursive call. ti is a single tuple from the input relation. */
  while (AgrT.SetOfX ≠ Ø) do                  /* Satisfy all X with X ∗ = X */
    ti := SelectTuple_local(AgrT, LastTuple); /* Local selection criterion */
    if (ti = 〈〉) then return;                 /* No tuple selected */
    InsertTuple(Sample, ti);                  /* Add to the Sample */
    UpdateAgreementsTable(AgrT, Sample, ti);  /* ti is not to be considered again;
                                                 some new sets X are now satisfied */
    MinimalSizeSamplerec(AgrT, Sample, ti);   /* Keep consistency */
  end while
end Procedure

Fig. 1. Algorithm for Minimal Size Sampling

17

MinimalSizeSamplerec is called again using the newly inserted tuple.

The chain of recursive calls terminates when either all saturated sets have been

satisfied, or no more sets can be satisfied using tuple tj. In the former case

the algorithm will terminate and the current sample is an Armstrong relation.

In the latter, a new tuple tj will be selected and the process described above

will be repeated. If no such tuple can be found, it means that there are some

saturated sets that cannot be satisfied, and therefore an Armstrong relation cannot be

extracted (see Section 6).
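The behaviour of the algorithm can be approximated with a simplified, iterative sketch: rather than the recursive SelectTuple/SelectTuple_local structure of Fig. 1, it greedily picks pairs of tuples that agree exactly on a still-unsatisfied saturated set, preferring pairs that reuse tuples already in the sample. This is an illustration of the idea under those simplifying assumptions, not the paper's exact procedure:

```python
# Greedy sketch: repeatedly choose a pair of tuples agreeing exactly on an
# unsatisfied saturated set, preferring pairs that add fewest new tuples.
from itertools import combinations

def agreement_set(t1, t2):
    return frozenset(a for a in t1 if t1[a] == t2[a])

def sample_armstrong(r, saturated):
    """Return (ok, sample): a subset of r agreeing exactly on every set in saturated."""
    unsatisfied = set(saturated)
    sample = []
    while unsatisfied:
        best = None
        for t1, t2 in combinations(r, 2):
            X = agreement_set(t1, t2)
            if X not in unsatisfied:
                continue
            new = sum(1 for t in (t1, t2) if t not in sample)
            if best is None or new < best[0]:   # reuse in-sample tuples first
                best = (new, t1, t2, X)
        if best is None:
            return False, sample  # no Armstrong relation can be extracted
        _, t1, t2, X = best
        for t in (t1, t2):
            if t not in sample:
                sample.append(t)
        # adding t1/t2 may have satisfied further sets as a side effect
        unsatisfied -= {agreement_set(u1, u2) for u1, u2 in combinations(sample, 2)}
    return True, sample

r = [
    {"A": "a1", "B": "b1", "C": "c1"}, {"A": "a1", "B": "b2", "C": "c2"},
    {"A": "a2", "B": "b3", "C": "c1"}, {"A": "a2", "B": "b2", "C": "c2"},
    {"A": "a4", "B": "b3", "C": "c1"}, {"A": "a3", "B": "b1", "C": "c1"},
]
saturated = [frozenset("A"), frozenset("C"), frozenset("BC"), frozenset()]
ok, sample = sample_armstrong(r, saturated)
print(ok, len(sample))  # True 4
```

The greedy tie-breaking here only approximates the 'No. X∗' criterion; on the running example it happens to find a sample of minimal size (four tuples).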

5.1.3 Sampling Example

Applying the algorithm of Figure 1 to the Running Example of Section 4.2 and the Agreements Table shown in Table 2 leads to the following scenario.

(1) According to the global selection criterion (function SelectTuple), two tuples can initially be selected since they have the highest value for ‘No. X ∗’: t1 and t3. Assume that tuple t1 is (randomly) chosen. Now the recursive procedure is called using MinimalSizeSamplerec(AgrT, Sample, t1).

(a) Using the Agreements Table it can be seen that t2 must be selected in order to satisfy A∗ (function SelectTuple_local). Now MinimalSizeSamplerec(AgrT, Sample, t2) is called.

(i) Set C∗ cannot be satisfied using t2. However, according to the Agreements Table (row t2), t4 must be selected so that {BC}∗ is satisfied. Additionally, due to the interaction between t1 (already in the sample) and t4 (just inserted), set Ø∗ is also satisfied. MinimalSizeSamplerec(AgrT, Sample, t4) is called.

(A) The only saturated set that remains to be satisfied is C∗. However, t4 does not agree exactly with any tuple on C∗, so this recursive call terminates.

(ii) For the same reason t2 cannot be used. This recursive call also terminates.

(b) In contrast, there are two tuples that agree exactly on C∗ with t1: t3 and t5. Function SelectTuple_local, using the values of column ‘No. X ∗’ in Table 2, would select t3 to be inserted into the sample. MinimalSizeSamplerec(AgrT, Sample, t3) is called.

(i) All X ∗ (saturated sets) are satisfied, so there is nothing for this recursive call to do.

(c) All X ∗ are satisfied, so this recursive call terminates.

(2) All X ∗ are satisfied; function MinimalSizeSample terminates.

The resulting sample (tuples t1, t2, t4, and t3) is exactly the minimal Armstrong relation identified in Section 4.2.
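That the extracted sample is indeed an Armstrong relation for Σ can be confirmed by brute force, checking that X → A holds in the sample exactly when A ∈ X∗. This verification is illustrative and not part of the paper's method:

```python
# Verify the Armstrong property of the sample {t1, t2, t3, t4}: for every
# X subseteq U and attribute A, X -> A holds iff A lies in the fd-closure X*.
from itertools import chain, combinations

U = ["A", "B", "C"]
sigma = [(frozenset("B"), frozenset("C")), (frozenset("AC"), frozenset("B"))]
sample = [
    {"A": "a1", "B": "b1", "C": "c1"}, {"A": "a1", "B": "b2", "C": "c2"},
    {"A": "a2", "B": "b3", "C": "c1"}, {"A": "a2", "B": "b2", "C": "c2"},
]

def fd_closure(X, fds):
    closure = set(X)
    while True:
        grown = set(closure)
        for lhs, rhs in fds:
            if lhs <= grown:
                grown |= rhs
        if grown == closure:
            return closure
        closure = grown

def holds(r, X, A):
    """True iff r satisfies X -> A (Definition 3.1)."""
    return all(t1[A] == t2[A] for t1, t2 in combinations(r, 2)
               if all(t1[a] == t2[a] for a in X))

armstrong = all((A in fd_closure(X, sigma)) == holds(sample, X, A)
                for X in chain.from_iterable(combinations(U, k) for k in range(len(U) + 1))
                for A in U)
print(armstrong)  # True: the sample satisfies precisely the closure of Sigma
```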

5.2 Sampling Method (II) - Random Sampling

An alternative method for sampling a database is to select the set of tuples

randomly. If such an approach is followed, the sample size cannot be assured to

be minimal, but the efficiency improvement (both in performance and memory

usage) will be significant, as discussed below.

Following this approach, a list of saturated sets of attributes still to be satisfied by the current sample is maintained. Whenever a new tuple is randomly

selected, the algorithm checks whether any new saturated set is satisfied. If

so, this tuple becomes part of the sample and the saturated set now satisfied


is deleted from the list. If not, a new tuple is sampled. This process continues

until the list of saturated sets of attributes to be satisfied is empty.
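A minimal sketch of this loop follows, under the same modelling assumptions as before (tuples as dictionaries, saturated sets known in advance); the fixed seed and the `max_draws` bound are invented safeguards for the sketch, not part of the method:

```python
import random

def agree_set(t, u, attrs):
    """Attributes on which tuples t and u agree exactly."""
    return frozenset(a for a in attrs if t[a] == u[a])

def random_sample(relation, saturated_sets, attrs, seed=0, max_draws=10_000):
    """Randomly draw tuples; keep a drawn tuple only if, paired with a
    tuple already sampled, it witnesses a saturated set still on the list."""
    rng = random.Random(seed)
    todo = {frozenset(x) for x in saturated_sets}
    sample = [rng.choice(relation)]       # seed the sample with one tuple
    for _ in range(max_draws):
        if not todo:
            break
        t = rng.choice(relation)
        hit = {agree_set(t, u, attrs) for u in sample} & todo
        if hit:                           # t satisfies a new saturated set
            todo -= hit                   # delete it from the list
            if t not in sample:
                sample.append(t)
    return sample, todo                   # todo is empty on success
```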

The two approaches, Minimal Size Sample and Random Sample Selection, can now be compared. In those cases where the minimal size sample is much smaller than the initial table, random sampling is expected, on average, to perform worse than the minimal size sampling algorithm, as the resulting sample will be significantly larger than necessary.

5.3 Complexity of Sampling Algorithms

This Section analyses the worst-case complexity 5 of the algorithms given in Sections 5.1.2 and 5.2. The standard O-notation [12] will be used.

The minimal size sampling algorithm consists of three phases: (1) computing all saturated 6 sets of attributes; (2) constructing the Agreements Table; (3) extracting a minimal size sample using the algorithm shown in Figure 1.

Let n be the number of attributes in U , p the number of tuples in the database relation, and q the number of functional dependencies in Σ; that is, n = |U|, p = |r|, and q = |Σ|.

(1) Computing X∗ can be done in time linear in the size of Σ and X [1], and thus checking whether X is saturated has complexity O(q + n).

Beeri et al. [3] showed that there exist functional dependency sets for which any Armstrong relation has size exponential in the size of Σ. Therefore any algorithm must face this worst-case complexity. However, this is only the case for highly unnormalised relations, which should be rare. For relations in Boyce-Codd Normal Form [16], the number of saturated sets is bounded by a polynomial in n and q [14]. The degree of this polynomial depends on the number of keys in the relation. Refer to this polynomial as pol(n, q). Therefore this first phase has complexity O(pol(n, q)(q + n)).

5 Set operations are ignored for simplicity.
6 Recall from Section 5.1 that it suffices to consider only the intersection generators.

(2) For each tuple t1 and each saturated set X , compute which tuples t2 agree with t1 exactly on X . Since |X | ≤ n, this phase has complexity O(pol(n, q) p² n). It can be improved, however, if the relation is sorted on X before each test [15], which results in complexity O(pol(n, q) n p log p).

(3) The algorithm of Figure 1 has complexity O(pol(n, q) p), if one ignores the construction of the Agreements Table, which was analysed in (2).

Therefore the entire algorithm to extract a Minimal Size Sample, if the relation

is normalised, has complexity

O(pol(n, q) n p log p).

This result is comparable to the complexity of known algorithms to build

Armstrong relations using synthetic data [14].
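Phase (1) rests on the closure computation. The following sketch uses a naive fixpoint iteration, quadratic rather than the linear-time algorithm of [1], to illustrate the saturation test on the set Σ = {B → C, AC → B} used in Section 6:

```python
def closure(X, fds):
    """Closure X* of attribute set X under the FDs in fds, computed by
    naive fixpoint iteration (quadratic, not the linear-time algorithm).
    fds is a list of (lhs, rhs) pairs of attribute sets."""
    result = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs            # fire the dependency lhs -> rhs
                changed = True
    return frozenset(result)

def is_saturated(X, fds):
    """X is saturated iff it equals its own closure under fds."""
    return closure(X, fds) == frozenset(X)
```

For Σ = {B → C, AC → B}, the saturated sets among the subsets of {A, B, C} come out as exactly Ø, A, C and BC, in agreement with the examples in this paper.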

Following a similar analysis, it can be seen that the algorithm for Random Sampling with functional dependencies has complexity

O(pol(n, q) p n).


6 Sampling with Subsets or Supersets of Σ

The previous Section assumed that the input relation r was an Armstrong

relation for Σ, and the goal was to extract a minimal Armstrong relation.

This Section relaxes this assumption in both directions: first considering a subset of Σ, the whole set of functional dependencies satisfied by r, and then considering a superset.

If a subset Γ1 of Σ, with Γ∗1 ⊂ Σ∗, is considered, then sampling is not possible. That is, no sample of a relation which satisfies exactly Σ∗ will satisfy exactly Γ∗1. This is proved next.

Theorem 6.1 Let r be an Armstrong relation for a set of functional dependencies Σ. There cannot be a sample s of relation r which is an Armstrong relation for a proper subset Γ1 of Σ, Γ∗1 ⊂ Σ∗.

Proof. Let σ be a functional dependency such that σ ∈ Σ∗ but σ /∈ Γ∗1. Since r is an Armstrong relation for Σ, there cannot be any pair of tuples in r that violates σ. Therefore, a sample s of r cannot violate σ either. Since s |= σ for some σ /∈ Γ∗1, s cannot be an Armstrong relation for Γ1, as required. □
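The key step of the proof, that a sample cannot violate a functional dependency satisfied by the relation it is drawn from, can be checked exhaustively on a small hypothetical relation (the values below are invented for illustration):

```python
from itertools import combinations

def satisfies(s, Y, A):
    """Does the set of tuples s satisfy the functional dependency Y -> A?"""
    return all(t1[A] == t2[A]
               for t1 in s for t2 in s
               if all(t1[y] == t2[y] for y in Y))

def no_sample_violates(r, Y, A):
    """If r satisfies Y -> A, check that every subset (sample) of r does too."""
    return all(satisfies(list(sub), Y, A)
               for k in range(len(r) + 1)
               for sub in combinations(r, k))
```

On a relation satisfying B → C, every one of its 2^p samples satisfies B → C as well: a sample can only fail to exhibit a violation, never introduce one.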

Consider now sampling with supersets of Σ. If the aim is also to extract an Armstrong relation for a given superset of Σ, then this may not always be possible. To see this, assume that a sample satisfying precisely Γ2 = Σ ∪ {A → C} is to be extracted from a relation containing only tuples {t1, t2, t3, t4} of Table 1. It can be seen that one tuple from {t1, t2} and another from {t3, t4} must be left out of the sample in order to satisfy the new functional dependency A → C. However, the resulting sample (e.g. {t1, t4} or {t2, t3}) satisfies more functional dependencies than only those in Γ∗2 (e.g. B → A, C → A), and thus cannot be an Armstrong relation for Γ2.

Therefore, in order to sample with a superset of Σ, it is necessary to weaken

the objective of sampling. Instead of requiring the resulting sample to be an

Armstrong relation, the goal is now only to ensure it satisfies all specified

functional dependencies (even if it may also satisfy others).

Theorem 6.2 Let r be an Armstrong relation for a set of functional dependencies Σ over U . There exists a sample s of r which satisfies a superset Γ2 of Σ, Σ∗ ⊂ Γ∗2.

Proof. Let γ1 be a functional dependency such that γ1 ∈ Γ2 and γ1 /∈ Σ∗.

Without loss of generality assume γ1 is of the form Y → A, with Y ⊂ U and

A ∈ U . Initially, let sample s be the entire relation r. For any pair of tuples t1 and t2 in s with t1(Y ) = t2(Y ) and t1(A) ≠ t2(A), remove t2 from s. No

functional dependency in Σ can become unsatisfied by removing a tuple from

s (Theorem 6.1), therefore now s |= Σ∗∪{γ1}. (Note that by removing a tuple,

however, a new functional dependency not in Γ∗2 may become satisfied.) This

process can be repeated for any additional γ2 ∈ Γ2 \ {γ1} with γ2 /∈ Σ∗ ∪ {γ1}, until s |= Γ2, as required. □
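The constructive argument of this proof translates directly into a sketch. The relation used below is hypothetical, and `enforce_fd` implements the removal step for a single dependency Y → A:

```python
def enforce_fd(s, Y, A):
    """Removal step from the proof: while two tuples agree on Y but
    disagree on A, remove the second one (t2)."""
    s = list(s)
    changed = True
    while changed:
        changed = False
        for i in range(len(s)):
            for j in range(i + 1, len(s)):
                t1, t2 = s[i], s[j]
                if all(t1[y] == t2[y] for y in Y) and t1[A] != t2[A]:
                    del s[j]             # remove t2 from s
                    changed = True
                    break
            if changed:
                break
    return s

def sample_with_superset(r, new_fds):
    """Repeat the removal step for every FD in the superset but not in
    the original set; the result satisfies all of them (and possibly
    more, as remarked in the proof)."""
    s = list(r)
    for Y, A in new_fds:
        s = enforce_fd(s, Y, A)
    return s

def satisfies(s, Y, A):
    """Does the set of tuples s satisfy the functional dependency Y -> A?"""
    return all(t1[A] == t2[A]
               for t1 in s for t2 in s
               if all(t1[y] == t2[y] for y in Y))
```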

The algorithm given in Figure 1 can easily be modified to sample with supersets of Σ, or to detect when this is not possible, as in the previous example. All that is required is to ensure that functions SelectTuple() and

SelectTuple_local() do not select tuples that satisfy non-saturated sets of

attributes. As an example, consider Γ2 = Σ ∪ {A → C} = {B → C, AC → B, A → C} and the (entire) relation shown in Table 1 of Section 4.2. The corresponding set of saturated sets is {Ø, C, BC}; therefore the set that must not be satisfied is A∗ (see Table 2). Applying the algorithm of Figure 1 with the

suggested modification, the following sample would be extracted: {t1, t5, t4, t6}.

This sample is indeed an Armstrong relation for Γ2. The same procedure will demonstrate why, in the previous example, no sample of {t1, t2, t3, t4} can be an Armstrong relation for Γ2.

The two theorems presented in this Section identify what can be expected

when sampling with functional dependencies, and the minimum knowledge

required about the database relation being sampled in order to do so (i.e. a

set of functional dependencies for which the input relation is an Armstrong

relation). If the set of functional dependencies is not the minimum required, the algorithm given in Section 5.1.2 can detect this situation, notifying the database designer and thereby improving the understanding of the relation.

7 Summary and Future Work

This paper has investigated how a database relation can be sampled so that

the resulting set of tuples satisfies precisely those functional dependencies

that are logical consequences of a given set of functional dependencies, that

is, it is an Armstrong relation. This has been achieved by using the concept of closure of a set of attributes under a set of functional dependencies, the fd-closure, which has then been used to compute the so-called saturated sets of

attributes. The saturated sets guide the sampling process to select tuples in

such a way that all functional dependencies not in the initial set are violated

(i.e. not satisfied). Two algorithms have been proposed, one which extracts a

sample with the minimum number of tuples, and the other which selects tuples


randomly. Initially it has been assumed that sampling would be performed

according to the whole set of functional dependencies satisfied by the relation,

denoted by Σ. Then this assumption has been relaxed, considering subsets

and supersets of the initial set of functional dependencies.

As outlined in Section 5.1.2, the algorithm presented in this paper to extract a minimal size sample is in fact suboptimal, in the sense that a sample of smaller size than the one generated by the algorithm could exist in some circumstances. This is because the algorithm considers only local information to decide which tuples must be inserted into the sample. How global information could be used to ensure minimal sample size, without excessively increasing the algorithm's complexity, remains to be investigated.

Section 6 showed that extracting an Armstrong relation for a superset of Σ is not always possible. An interesting line of work would be to investigate in which cases it is actually possible.

Finally, it should be noted that the concept of Armstrong relation is not exclusive to functional dependencies, but has been extended to a more general type of dependencies [10]. Database sampling with this type of dependencies, in order to extract Armstrong relations for them, would be an interesting area for future work.

References

[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-

Wesley, 1995.

[2] W.W. Armstrong. Dependency structures of data base relationships. In J.L.


Rosenfeld, editor, Information Processing 74 (Proceedings of IFIP Congress 74),

pages 580–583. IFIP, North-Holland, August 1974.

[3] C. Beeri, M. Dowd, R. Fagin, and R. Statman. On the structure of Armstrong

relations for functional dependencies. Journal of the ACM, 31(1):30–46, January

1984.

[4] B. Beizer. Black-Box Testing : Techniques for Functional Testing of Software

and Systems. Wiley, 1995.

[5] J. Bisbal. Database Sampling to Support the Development of Data-Intensive

Applications. PhD thesis, University of Dublin, Trinity College, October 2000.

[6] J. Bisbal and J. Grimson. Database prototyping through consistent sampling.

In the International Conference on Advances in Infrastructure for Electronic

Business, Science, and Education on the Internet (SSGRR’2000). Scuola

Superiore Guglielmo Reiss Romoli (SSGRR), August 2000.

[7] J. Bisbal and J. Grimson. Generalising the consistent database sampling

process. In B. Sanchez, N. Nada, A. Rashid, T. Arndt, and M. Sanchez, editors,

Proceedings of the Joint meeting of the 4th World Multiconference on Systemics,

Cybernetics and Informatics (SCI’2000) and the 6th International Conference

on Information Systems Analysis and Synthesis (ISAS’2000), volume II -

Information Systems Development, pages 11–16. International Institute of

Informatics and Systemics (IIIS), July 2000.

[8] J. Bisbal, D. Lawless, B. Wu, and J. Grimson. Legacy information systems:

Issues and directions. IEEE Software, 16(5):103–111, September/October 1999.

[9] J. Bisbal, B. Wu, D. Lawless, and J. Grimson. Building consistent

sample databases to support information system evolution and migration.

In G. Quirchmayr, E. Schweighofer, and T. J.M. Bench-Capon, editors,

Proceedings of the 9th International Conference on Database and Expert


Systems Applications (DEXA’98), volume 1460 of Lecture Notes in Computer

Science, pages 196–205. Springer-Verlag, 1998.

[10] R. Fagin. Horn clauses and database dependencies. Journal of the ACM,

29(4):952–985, October 1982.

[11] J. Gray, P. Sundaresan, S. Englert, K. Baclawski, and P. J. Weinberger.

Quickly generating billion-record synthetic databases. In Proceedings of the

International Conference on Management of Data (SIGMOD 1994), pages 243–

252. ACM Press, 1994.

[12] D.E. Knuth. Fundamental Algorithms, volume 1 of The Art of Computer

Programming. Addison-Wesley, third edition, 1997.

[13] P. Lohr-Richter and A. Zamperoni. Validating database components of software

systems. Technical Report 94–24, Leiden University, Department of Computer

Science, 1994.

[14] H. Mannila and K.-J. Räihä. Small Armstrong relations for database design. In

Proceedings of the Fourth ACM SIGACT-SIGMOD Symposium on Principles

of Database Systems (PODS’85), pages 245–250, 1985.

[15] H. Mannila and K.-J. Räihä. Dependency inference. In P.M. Stocker, W. Kent,

and P. Hammersley, editors, Proceedings of 13th International Conference on

Very Large Data Bases (VLDB’87), pages 155–158. Morgan Kaufmann, 1987.

[16] H. Mannila and K.-J. Räihä. The Design of Relational Databases. Addison-

Wesley, 1992.

[17] A. Neufeld, G. Moerkotte, and P. C. Lockemann. Generating consistent test

data: Restricting the search space by a generator formula. VLDB Journal,

2(2):173–213, April 1993.

[18] H. Noble. The automatic generation of test data for a relational database.

Information Systems, 8(2):79–86, 1983.


[19] I. Sommerville. Software Engineering. Addison-Wesley, fifth edition, 1995.

[20] L. Tucherman, M.A. Casanova, and A.L. Furtado. The CHRIS consultant - a tool

for database design and rapid prototyping. Information Systems, 15(2):187–195,

1990.

[21] G. Wiederhold. Modeling and system maintenance. In M.P. Papazoglou,

editor, Proceedings of the 14th International Conference on Object-Oriented and

Entity-Relationship Modeling (OOER’95), pages 1–20. Springer-Verlag, 1995.

[22] B. Wu, D. Lawless, J. Bisbal, R. Richardson, J. Grimson, V. Wade, and

D. O’Sullivan. The butterfly methodology: A gateway-free approach for

migrating legacy information systems. In B. Werner, editor, Proceedings

of the 3rd IEEE Conference on Engineering of Complex Computer Systems

(ICECCS’97), pages 200–205. IEEE Computer Society Press, September 1997.

[23] M. Yannakakis. Perspectives on database theory. In Proceedings of the 36th

Annual Symposium on Foundations of Computer Science, pages 224–246. IEEE

Computer Society Press, 1995.

[24] A. Zamperoni and P. Lohr-Richter. Enhancing the quality of conceptual

database specifications through validation. In Proceedings of the 12th

International Conference on Entity-Relationship Approach (ER’93), volume 823

of Lecture Notes in Computer Science, pages 85–98. Springer-Verlag, 1993.
