
Information Systems Vol. 22, No. 1, pp. 1-24, 1997

Pergamon © 1997 Elsevier Science Ltd. All rights reserved

Printed in Great Britain

PII: S0306-4379(97)00001-X 0306-4379/97 $17.00 + 0.00

VERTICAL FRAGMENTATION AND ALLOCATION IN DISTRIBUTED DEDUCTIVE DATABASE SYSTEMS†

SEUNG-JIN LIM and YIU-KAI NG

Department of Computer Science, Brigham Young University, Provo, Utah 84602, U.S.A.

(Received 15 November 1995; in final revised form 13 February 1997)

Abstract - Although approaches for vertical fragmentation and data allocation have been proposed, algorithms for vertical fragmentation and allocation of data and rules in distributed deductive database systems (DDDBSs) are lacking. In this paper, we present different approaches for vertical fragmentation of relations that are referenced by rules and an allocation strategy for rules and fragments in a DDDBS. The potential advantages of the proposed fragmentation and allocation scheme include maximal locality of query evaluation and minimization of communication cost in a distributed system, in addition to the desirable properties of (vertical) fragmentation and rule allocation as discussed in the literature. We also formulate the mathematical interpretation of the proposed vertical fragmentation and allocation algorithms. © 1997 Elsevier Science Ltd

Key words: Fragmentation, Allocation, Replication, Rules, Deductive Databases, Distributed Systems

1. INTRODUCTION

Deductive database systems enhance the expressive power of conventional relational database systems by adopting logic programming as a query language which allows recursion, while distributed database systems offer many advantages over centralized database systems, including enhanced reliability and availability of the involved databases, improved overall system performance through parallel execution of transactions, and reduced contention for system resources [14]. The integration of these two database systems appears to provide a promising, potentially more powerful and reliable database system for information processing.

Two of the main design activities in distributed systems are fragmentation and data allocation. Fragmentation allows parallel execution of a single query, reduces the amount of irrelevant data access and unnecessary data transfer, and increases the level of concurrency and therefore the system throughput in a distributed database system [6, 14]. Vertical fragmentation further enhances the performance of database transactions by closely matching fragments to the requirements of transactions [12]. Our design goals for the fragmentation and allocation of rules and data are to maximize concurrent rule execution, reduce replication of rules and data, minimize communication cost during query evaluation, and decrease the query response time.

Different approaches for (vertical) fragmentation and rule allocation in distributed (deductive) database systems have been proposed. Navathe and Ma [13] introduce a vertical partitioning algorithm using a graphical technique that improves the previous work on vertical partitioning [12]. Meghini and Thanos [10] present a theory of fragmentation and study the completeness and update problems of overlapping fragments. Ezeife and Barker [6] develop a fragmentation technique for class objects in a distributed object based system. Mohania and Sarda [11] discuss the rule allocation problem in a distributed database system and propose a rule partitioning method. Wolfson and Jajodia [16, 17] construct algorithms for dynamic data allocation in distributed systems. Apers [1] proposes a distributed data allocation algorithm that utilizes actual query processing schedules. The proposed method, which integrates the problems of distributed query optimization and optimal data allocation statically by sequentially optimizing query strategies and then data allocation, determines the (horizontal) fragments of relations to be allocated so that the total transmission cost of processing user queries and updates is minimized. The same problem has been

†Recommended by Nicole Bidoit


addressed by Blankinship et al. [3], who consider an iterative method for integrating query optimization and data allocation methods in distributed database design. An optimization heuristic is adopted in [3] which iteratively determines the minimum cost query strategies and minimum cost data allocation until a local minimum for the combined problem is found. All of these approaches, however, either address the problem of data allocation [16, 17] and optimal query processing [1, 3], focus on the partitioning problem of rules [11], or deal with the fragmentation of relations or class objects in a distributed system [6, 10, 12, 13]. Algorithms for vertical fragmentation of relations referenced by rules and allocation of rules and corresponding fragments in a distributed deductive database system (DDDBS) are lacking. In this paper, we present different approaches for vertical fragmentation of relations and allocation of rules and fragments. Data and rules are distributed across different sites in a network to meet the operational needs and to handle future information processing at each site. The proposed fragmentation and allocation strategy maximizes locality of query evaluation while minimizing communication cost and execution time during query processing.

We proceed to present our results as follows. In Section 2 we provide our basic definitions for dependencies among rule expressions and base relations. In Section 3 we propose four different algorithms: RCA for rule clustering, OVF for computing overlapping vertical fragmentation, DVF for generating disjoint vertical fragmentation, and CAA for allocating rules and corresponding fragments. In Section 4 we include the mathematical interpretation of the proposed algorithms. In Sections 5.1 and 5.2 we formulate the communication costs of distributing rules and fragments and query evaluation, respectively. In Section 6 we present the proofs of correctness and complexity analysis of the proposed algorithms and give the concluding remarks in Section 7.

2. BASIC DEFINITIONS

We consider each Datalog rule r in a DDDBS of the form

$$p(X_1, \ldots, X_n) \ \text{:-}\ q_1(Y_1, \ldots, Y_m), \ldots, q_t(Z_1, \ldots, Z_s)$$

where p is the head (predicate) of r and is either a derived (intensional) or mixed predicate (relation)†. Each $q_i$ ($1 \le i \le t$) is either a derived, mixed, or base (extensional) predicate (relation), and $q_1, \ldots, q_t$ form the body of r. An argument of a predicate is either a variable or a constant. A rule with an empty body is a fact or a base relation (i.e., an extensional predicate, which contains a set of facts), and a rule without a head predicate is a query. r is recursive if at least one of the predicates in the body of r is p [4]. Two predicates p and q are mutually recursive if p and q are (in)directly dependent on each other, i.e., in order to compute p, we need to compute q, and vice versa, and we call any two rules with p and q as head predicates, respectively, mutually recursive rules. r can be extended to handle complex data structures as in higher-order logic database languages [9, 15], and our proposed solutions for fragmentation and allocation problems are independent of the constructs of r, i.e., whether r is a Datalog rule or is extended as a rule in a higher-order logic database language.

We apply some basic principles of graph theory to the proposed fragmentation and allocation algorithms, and use matrices to capture the dependency relationships among rules and base relations.

Definition 1 A directed graph (digraph for short) G(N, E) consists of two sets, the non-empty set of nodes N (or N(G)) and the set of edges E (or E(G)). Each node in N represents either a rule or a base relation, whereas each edge $(n_i, n_j)$ in E denotes that the head predicate of node $n_j$ (which can be a base relation) appears in the body of node $n_i$.

Definition 2 In a digraph G, node nj is reachable from node ni if there exists a path from ni to nj, and node nj is directly reachable from node ni if there exists a path of length 1 from ni to nj.

†A predicate p is mixed if there is a set of ground facts for p and p appears as the head predicate of some rules [2].

Vertical Fragmentation and Allocation in DDDBSs

Definition 3 Given a digraph G, let A be the Boolean adjacency matrix of G. Then,

$$A^{(m)}[i,j] = \begin{cases} 1 & \text{if there exists a path of length } m \text{ from node } n_i \text{ to node } n_j \\ 0 & \text{otherwise} \end{cases}$$

In particular, $A^{(1)}$ is called a direct dependency matrix of G.

Definition 4 In a digraph G, a reachability matrix R is defined as

$$R = A^{(1)} \lor A^{(2)} \lor \cdots \lor A^{(n)}$$

where an entry of R is computed by applying the Boolean addition ($\lor$) to the corresponding entries in $A^{(1)}, \ldots, A^{(n)}$.
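Definitions 3 and 4 can be illustrated with a small sketch. It assumes that the upper limit n is the number of nodes in G (the text does not spell this out here) and uses a hypothetical three-node chain; the function names `bool_matmul` and `reachability` are illustrative and do not come from the paper.

```python
def bool_matmul(X, Y):
    """Boolean matrix product: entry (i, j) is 1 iff X[i][k] and Y[k][j] for some k."""
    n = len(X)
    return [[int(any(X[i][k] and Y[k][j] for k in range(n))) for j in range(n)]
            for i in range(n)]


def reachability(A):
    """Return R = A^(1) v A^(2) v ... v A^(n) for an n x n 0/1 adjacency matrix A.

    Assumption: n is taken to be the number of nodes.
    """
    n = len(A)
    power = [row[:] for row in A]          # A^(1)
    R = [row[:] for row in A]
    for _ in range(n - 1):
        power = bool_matmul(power, A)      # A^(m+1) from A^(m)
        R = [[R[i][j] | power[i][j] for j in range(n)] for i in range(n)]
    return R


# Hypothetical digraph: n1 -> n2 -> n3
A = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]
R = reachability(A)
print(R)  # n3 is reachable from n1 only through a path of length 2
```

Here `A` itself plays the role of the direct dependency matrix $A^{(1)}$, and `R` additionally records the indirect dependency of n1 on n3.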

There exist direct dependency matrices that capture the rule-to-rule and rule-to-relation relationships, respectively in DDDBSs.

Definition 5 In a digraph G with $N_r$ distinct rules, an $N_r \times N_r$ direct rule-to-rule dependency matrix, rr, is defined as

$$rr[i,j] = \begin{cases} 1 & \text{if rule } r_j \text{ is directly reachable from rule } r_i, \text{ i.e., the head predicate of } r_j \text{ appears in the body of } r_i \\ 0 & \text{otherwise} \end{cases}$$

Definition 6 In a digraph G with $N_r$ distinct rules and $N_R$ base relations, an $N_r \times N_R$ direct rule-to-relation dependency matrix, rR, is defined as

$$rR[i,j] = \begin{cases} 1 & \text{if base relation } R_j \text{ is directly reachable from rule } r_i, \text{ i.e., } R_j \text{ appears in the body of } r_i \\ 0 & \text{otherwise} \end{cases}$$

Definition 7 An NR x 1 table-size matrix TS is defined as

$TS[i] = n$, where i denotes base relation $R_i$, and n denotes the size (in bytes) of $R_i$.

Definition 8 A network topology matrix T, which is symmetric relative to the principal diagonal, is defined as T[i,j] = w, where w denotes the total weight of the shortest path (measured by the physical distance) from site ni to site nj in a network. w = 0 if i = j.

Accordingly, the connection weight of a site $S_i$ in a network with p other sites is defined as $\sum_{j \ne i} T[i,j]$, i.e., the sum of the total weights of the shortest paths from $S_i$ to each of the other sites.

Example 1 Consider a distributed deductive database (DDDB) D consisting of nine rules and four base relations whose relationships are captured by a direct rule-to-rule dependency matrix rr and a direct rule-to-relation dependency matrix rR. Further assume that a table-size matrix TS for the base relations in D and a network topology matrix T are given along with rr and rR as follows:

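$$rr = \begin{bmatrix}
0&0&0&1&0&0&0&0&0 \\
1&0&0&1&0&0&0&0&0 \\
0&0&0&0&0&0&0&0&0 \\
0&0&1&0&0&0&0&0&0 \\
0&0&1&0&0&0&0&0&0 \\
0&0&0&1&0&0&1&0&0 \\
0&0&0&1&0&0&0&0&0 \\
0&0&0&0&0&0&0&0&0 \\
0&0&0&0&0&0&0&0&0
\end{bmatrix}
\qquad
rR = \begin{bmatrix}
1&0&0&0 \\
0&0&0&0 \\
0&0&1&0 \\
0&0&0&0 \\
0&0&0&0 \\
0&1&0&0 \\
0&0&0&1 \\
0&0&1&0 \\
0&0&0&1
\end{bmatrix}$$

$$TS = [\,\ldots\,] \qquad T = [\,\ldots\,]$$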


r1: p1 :- p3, R1

r7: p5 :- p3, R4

Fig. 1: A Given DDDB and its Digraph G

Fig. 2: A Network and its Topology Matrix

Figure 1 depicts D, in which rules are shown without arguments for simplicity of presentation, and relationships among rules and base relations are captured by a digraph G. Figure 2 shows a network with labeled edges, which denote the weights of the edges, and its network topology matrix T. □

3. FRAGMENTATION AND ALLOCATION ALGORITHMS

In this section, we present a fragmentation and allocation algorithm (FAA). FAA, as defined, consists of three subalgorithms: a rule clustering algorithm (RCA), a data clustering algorithm (DCA), and a rule and fragment allocation algorithm (CAA).

FAA aims to maximize the locality of query evaluation and thus to minimize communication cost and search space during query processing. In order to minimize the communication cost of transmitting data or partial answers to a query, rules with the same head predicate and, if possible, fragments of the base relations on which they depend (either directly or indirectly) are allocated to the same site. Otherwise, the partial answers to a query Q involving rules with the same head predicate are computed at different sites, since none of these sites has all of these rules locally. As a result, these sites must communicate with one another in order to generate all the answers to Q, which adds to the communication cost. Furthermore, our clustering and allocation algorithms assign mutually recursive rules to the same site. Otherwise, the participating sites where these rules are stored have to communicate at each intermediate step of executing a query Q, which increases the communication cost and execution time of Q, since processors at different sites may spend most of their time waiting for one another or transmitting data across sites in the network [11].

RCA, which may generate replicated rules in a DDDBS D, is designed so that most, if not all†, of the rules used by a query in D are executed locally, which reduces communication overhead. Since knowledge and integrity constraints represented in a deductive database, which are captured by rule expressions, are much less time-variant than data [11], the effect of updates on replicated rules is reduced. DCA, on the other hand, provides two alternatives: either replication or partition of fragments of base relations referenced by rules. Replication is a desirable feature in a static DDDBS since it increases the locality of query processing and the availability and reliability of a DDDBS [5].

†When a query Q of the form ?- $q_1(V_1), \ldots, q_n(V_n)$, where $V_i$ ($1 \le i \le n$) denotes a vector of arguments of $q_i$, is submitted to a site S where (a subset of) the rules for computing the answers to Q do not reside, (subqueries in) Q must be remotely executed and the answers transmitted to S.


It is assumed that two direct dependency matrices, rr and rR, a table-size matrix TS, and a network topology matrix T are given as inputs (i.e., they are predetermined) to FAA. It is further assumed that each site has its local distributed data directory (which is called knowledge directory in [8]) that contains the information of “which site has which rules and fragments of base relations,” and that all the rules and fragments of base relations to be allocated are originally stored at a particular site, called the primary site, in the network.

3.1. RCA

We consider two kinds of rules in a DDDB: (i) a rule $r_1$ that is directly dependent on another rule $r_2$, i.e., the head predicate of $r_2$ appears in the body of $r_1$, and (ii) a rule $r_1$ that is indirectly dependent on another rule $r_2$, i.e., the head predicate of $r_2$ is reached from the body of $r_1$ through a number of intermediate rules. For example, in Figure 1, rules $r_4$ and $r_5$ are directly dependent on rule $r_3$, but rules $r_1$ and $r_2$ are indirectly dependent on rule $r_3$ through $r_4$.

RCA first constructs a digraph (DG) G using rr such that G represents both direct and indirect dependency relationships among rules. Then RCA computes each distinct subgraph (subDG) of G†. Distinct subgraphs of G are used by DCA to compute fragments of base relations on which rules in each distinct subgraph depend (either directly or indirectly).

Sections 3.1.1 and 3.1.2 include the steps of RCA.

3.1.1. Computing Prospective Distinct Subgraphs

We construct each prospective distinct subgraph‡ $\sigma$ of rules (that are not base relations) in G such that either

• $\sigma$ consists of a single rule that is not directly connected to other rules (that are not base relations) in G, or

• one (and only one) of the rules r in $\sigma$ can directly connect to other rules in $\sigma$, i.e., the head of every rule (except r) in $\sigma$ appears in the body of r.

Example 2 Given the DDDB and its dependency graph G in Figure 1, the sets of rules $\{r_1, r_4\}$, $\{r_2, r_1, r_4\}$, $\{r_3\}$, $\{r_4, r_3\}$, $\{r_5, r_3\}$, $\{r_6, r_4, r_7\}$, $\{r_7, r_4\}$, $\{r_8\}$, and $\{r_9\}$ with their corresponding edges, as shown in Figure 4, form different prospective distinct subgraphs of G in Figure 3.


Fig. 3: The Dependency Graph G of a Given DDDB

†A subgraph SG in G is called a distinct subgraph if SG has no outgoing edges to other nodes, which denote rules, not base relations, in G. All the rules in SG are eventually distributed to a particular site in the network.

‡A prospective distinct subgraph $\sigma = \{r_1, r_2\}$ of a DG G is a subgraph of G such that $r_2$ is directly reachable from $r_1$. Each prospective distinct subgraph eventually becomes (a portion of) a distinct subgraph of G. If there exists no node $r_2$ (which denotes a rule) in G which is directly reachable from $r_1$, then there is no $r_2$ in $\sigma$.


Fig. 4: Prospective Distinct Subgraphs

Fig. 5: Attach $\{r_1, r_4\}$, $\{r_3\}$, and $\{r_4, r_3\}$ to $\{r_2, r_1, r_4\}$

Fig. 6: Attach $\{r_3\}$, $\{r_4, r_3\}$ and $\{r_7, r_4\}$ to $\{r_6, r_4, r_7\}$, and Discard Embedded Subgraphs

□

To simplify the discussion, from now on we denote a subDG $\sigma$ of a DG by a set of nodes, assuming that the edges connecting nodes in $\sigma$ are implicitly represented.

3.1.2. Generating Distinct Subgraphs

The next step of RCA expands, if possible, a prospective distinct subgraph $\sigma$ iteratively to generate a distinct subgraph by adding indirectly dependent rules to $\sigma$ as follows:

• If every node in a prospective distinct subgraph $\sigma_j$ is directly or indirectly reachable from a node in another prospective distinct subgraph $\sigma_i$, expand $\sigma_i$ by attaching $\sigma_j$ to $\sigma_i$ and retain $\sigma_j$. Repeat this process until no further changes can be made.

• Each subgraph of G which is embedded within another subgraph of G is discarded. This yields a set of distinct subgraphs of G.

Example 3 Consider the set of prospective distinct subgraphs generated in Example 2. By applying the steps in this subsection, we obtain five distinct subgraphs: $subDG_1 = \{r_1, r_2, r_3, r_4\}$, $subDG_2 = \{r_3, r_5\}$, $subDG_3 = \{r_3, r_4, r_6, r_7\}$, $subDG_4 = \{r_8\}$ and $subDG_5 = \{r_9\}$. Figures 5 - 6 illustrate the process of merging the prospective distinct subgraphs shown in Figure 4. □
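The two steps of RCA can be sketched in a few lines. This is a simplified restatement rather than the iterative attach-and-discard procedure itself: it assumes that fully expanding the prospective subgraph rooted at a rule yields that rule together with every rule reachable from it, and then discards any cluster embedded in another. The names `rule_closure` and `distinct_subgraphs` are illustrative; the matrix is rr from Example 1.

```python
# rr from Example 1: RR[i][j] == 1 means rule r(i+1) directly depends on rule r(j+1).
RR = [
    [0, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]


def rule_closure(rr, start):
    """Rule `start` plus every rule directly or indirectly reachable from it."""
    seen = {start}
    stack = [start]
    while stack:
        i = stack.pop()
        for j, edge in enumerate(rr[i]):
            if edge and j not in seen:
                seen.add(j)
                stack.append(j)
    return seen


def distinct_subgraphs(rr):
    """Expanded clusters, minus those embedded in a larger cluster (1-based rule numbers)."""
    closures = {frozenset(rule_closure(rr, i)) for i in range(len(rr))}
    maximal = [c for c in closures if not any(c < other for other in closures)]
    return sorted(sorted(i + 1 for i in c) for c in maximal)


print(distinct_subgraphs(RR))
```

On this input the sketch returns the rule sets {r1, r2, r3, r4}, {r3, r4, r6, r7}, {r3, r5}, {r8} and {r9}, i.e., the five distinct subgraphs of Example 3 (listed in sorted order rather than in the paper's numbering).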


3.2. DCA

In this section, we propose a data clustering algorithm (DCA) which generates vertical fragments of base relations that are referenced by rules in a distinct subgraph computed by RCA. These fragments are attached to the distinct subgraphs and are allocated along with rules, which either directly or indirectly depend on the fragments, at chosen sites in a DDDBS according to the cluster allocation algorithm to be introduced in Section 3.3.

Conventional techniques for developing fragments have been tailored to the needs of user applications (queries). Our vertical fragmentation technique, however, is strictly based on the given set of rule expressions represented in a DG, as well as the access frequency of queries in one of the fragmentation algorithms. The uniqueness of our approach is two-fold. First, no information about access patterns on base relations is needed by our fragmentation approach, and hence our approach eliminates extra inputs required by conventional fragmentation methods [12, 13, 14]. Second, attributes clustered in a vertical fragment are often determined by using an attribute affinity matrix (constructed by using an attribute usage matrix and transactions) in the conventional approaches [12, 13]; however, fragments generated by our vertical fragmentation approaches are strictly based on the rule-to-attribute dependency matrices. We are motivated to investigate the vertical fragmentation problem in DDDBSs since it is inherently more complicated than horizontal fragmentation due to the total number of alternatives that are available in the vertical case [14]. More importantly, our vertical fragmentation methods generate an “optimal” clustering scheme of database relations that are referenced by rules in a DDDB. The resultant fragmentation scheme is optimal in the sense that only relevant rules and essential data that are needed for processing a particular query in a DDDB are clustered together to enhance the efficiency and minimize overheads (in terms of communication costs) during query processing.

Since disjoint fragmentation can be more easily handled by a distributed system than more sophisticated overlapping fragmentation [14], in this paper we present a disjoint vertical fragmentation algorithm, called DVF, as one of the two subalgorithms of DCA. DVF disallows distinct fragments of a base relation R to contain common attributes of R, and these fragments are not replicated over the network. On the other hand, since disjoint fragmentation is impractical in some real-world applications due to the constraints that it imposes on database design [10], we also consider an alternative fragmentation scheme, the overlapping vertical fragmentation, called OVF, the other subalgorithm of DCA. OVF allows overlapped fragments of a base relation to be replicated and distributed over the network. These two vertical fragmentation approaches are based on the notion of direct and indirect rule-to-attribute dependencies. In a DDDBS where response time and communication cost are the primary design issues and the involved databases are static, i.e., most of the data processing activities are retrievals (i.e., reads) rather than modifications (i.e., updates), OVF is preferable to DVF. On the other hand, if communication cost is not a major concern, as in DDDBSs built on local area networks, and database updates occur frequently, then DVF is preferable to OVF.

It is assumed that there exists a tuple identifier attribute TID [10] for each fragmented relation R such that TID is allocated with each fragment of R to the site where the fragment resides. (Tuple identifier attributes ensure the lossless-join decomposition of various vertical fragments of a base relation, and this concept is well understood in the literature [10, 14].) Prior to the introduction of the two fragmentation approaches, we give a few definitions that are used in the two proposed vertical fragmentation algorithms.

Definition 9 An $N_r \times N_{A_{R_k}}$ rule-to-attribute dependency matrix† $AD_k$ of base relation $R_k$, where $N_r$ is the number of distinct rules and $N_{A_{R_k}}$ is the number of attributes in $R_k$, in a DDDBS is defined as

$$AD_k[i,j] = \begin{cases} 1 & \text{if the } j\text{th attribute of } R_k \text{ is used by rule } r_i \\ 0 & \text{otherwise} \end{cases}$$

†A rule-to-attribute dependency matrix is similar to the attribute usage matrix as defined in [13, 14], except that the former is based on rule expressions while the latter is based on past query history.


It is assumed that, given a rule r that references a base relation p, all the attributes of p that are not used by r are replaced by the “don’t care” symbol [4], i.e., _, in p. For example, given the rule r: q(V) :- ..., p($A_1$, _, $A_3$), ..., where p is a base relation with attributes $A_1$, $A_2$, and $A_3$, r uses only attributes $A_1$ and $A_3$ of p since $A_2$ of p is replaced by ‘_’ in r.

Definition 10 Given a base relation $R_k$ and a distinct subgraph $subDG_i$, the minimal set of attributes $\Sigma_{i,k}$ is a subset of attributes in $R_k$ such that each attribute in $\Sigma_{i,k}$ is referenced by at least one rule in $subDG_i$.

Definition 11 Let $A_{k,i}$ denote the ith attribute of a base relation $R_k$. A vertical fragment of a base relation $R_k$, denoted $F_{i,k} = \{A_{k,j_1}, \ldots, A_{k,j_m}\}$†, is a subset of attributes of $R_k$ that are referenced by some rules in $subDG_i$.

3.2.1. Overlapping Vertical Fragmentation

Recall that in Section 3.1.2 subsets of rules are clustered into distinct subgraphs to be allocated to different sites in a DDDBS. In this subsection, we propose a strategy for clustering vertical fragments of base relations that are referenced by a set of rules $S_r$ in a distinct subgraph. These fragments are allocated along with $S_r$ to a chosen site in a DDDBS.

Algorithm 1 (OVF: Overlapping Vertical Fragmentation)

Input: Distinct subgraphs subDGs and the set of rule-to-attribute dependency matrices $AD_1, \ldots, AD_n$, where n denotes the number of base relations in a DDDB.

Output: A set of overlapping vertical fragments FS. $F_{i,k}$ in FS is a fragment of base relation $R_k$ that is referenced by a rule in $subDG_i$.

• For each distinct subgraph $subDG_i$, determine the minimal set of attributes $\Sigma_{i,k}$ of each base relation $R_k$ that are referenced by some rules in $subDG_i$.

In OVF, two attributes A and B in a base relation are assigned to the same fragment if A and B are referenced by a rule r to be stored at the same site, regardless of their degree of affinity [12, 13]. Hence, a query that uses r can process r at a single site. As a result, maximal locality of query evaluation is guaranteed in a DDDBS using our fragmentation and allocation approaches. Our clustering method, however, is not adopted by the conventional vertical fragmentation approaches. The vertical fragmentation algorithms in [12, 13] allocate A and B to different fragments if the affinity of A and B is negligible.

Example 4 Consider Example 3 again and let $AD_1$, $AD_2$, $AD_3$, and $AD_4$ be the given rule-to-attribute dependency matrices of the base relations $R_1$, $R_2$, $R_3$, and $R_4$ in D of Example 1, respectively, as shown below.

$AD_1$: row 1 (rule $r_1$) has 1s in columns 1, 2, and 3; all other entries are 0.

$AD_2$: row 6 (rule $r_6$) has 1s in columns 2 and 3; all other entries are 0.

†Note that $\Sigma_{i,k} = F_{i,k}$ in OVF, but $\Sigma_{i,k} \supseteq F_{i,k}$ in DVF.


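$$AD_3 = \begin{bmatrix}
0&0&0&0&0&0 \\
0&0&0&0&0&0 \\
1&1&0&1&1&0 \\
0&0&0&0&0&0 \\
0&0&0&0&0&0 \\
0&0&0&0&0&0 \\
0&0&0&0&0&0 \\
0&1&0&0&1&1 \\
0&0&0&0&0&0
\end{bmatrix}
\qquad
AD_4 = \begin{bmatrix}
0&0&0&0&0 \\
0&0&0&0&0 \\
0&0&0&0&0 \\
0&0&0&0&0 \\
0&0&0&0&0 \\
0&0&0&0&0 \\
1&0&0&0&1 \\
0&0&0&0&0 \\
1&1&0&1&1
\end{bmatrix}$$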

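OVF yields the minimal sets of attributes for $subDG_1, \ldots, subDG_5$ as follows:

$\Sigma_{1,1} = \{A_{1,1}, A_{1,2}, A_{1,3}\}$, $\Sigma_{1,2} = \Sigma_{1,4} = \emptyset$, $\Sigma_{1,3} = \{A_{3,1}, A_{3,2}, A_{3,4}, A_{3,5}\}$;

$\Sigma_{2,1} = \Sigma_{2,2} = \Sigma_{2,4} = \emptyset$, $\Sigma_{2,3} = \{A_{3,1}, A_{3,2}, A_{3,4}, A_{3,5}\}$;

$\Sigma_{3,1} = \emptyset$, $\Sigma_{3,2} = \{A_{2,2}, A_{2,3}\}$, $\Sigma_{3,3} = \{A_{3,1}, A_{3,2}, A_{3,4}, A_{3,5}\}$, $\Sigma_{3,4} = \{A_{4,1}, A_{4,5}\}$;

$\Sigma_{4,1} = \Sigma_{4,2} = \Sigma_{4,4} = \emptyset$, $\Sigma_{4,3} = \{A_{3,2}, A_{3,5}, A_{3,6}\}$;

$\Sigma_{5,1} = \Sigma_{5,2} = \Sigma_{5,3} = \emptyset$, $\Sigma_{5,4} = \{A_{4,1}, A_{4,2}, A_{4,4}, A_{4,5}\}$.

Hence, we obtain $F_{i,k}$ ($1 \le i \le 5$, $1 \le k \le 4$) for each distinct subgraph $subDG_i$ in Example 3. Appending each $F_{i,k}$ to the corresponding $subDG_i$ yields

$subDG_1 = \{r_1, r_2, r_3, r_4,\ F_{1,1} = \{A_{1,1}, A_{1,2}, A_{1,3}\},\ F_{1,3} = \{A_{3,1}, A_{3,2}, A_{3,4}, A_{3,5}\}\}$;

$subDG_2 = \{r_3, r_5,\ F_{2,3} = \{A_{3,1}, A_{3,2}, A_{3,4}, A_{3,5}\}\}$;

$subDG_3 = \{r_3, r_4, r_6, r_7,\ F_{3,2} = \{A_{2,2}, A_{2,3}\},\ F_{3,3} = \{A_{3,1}, A_{3,2}, A_{3,4}, A_{3,5}\},\ F_{3,4} = \{A_{4,1}, A_{4,5}\}\}$;

$subDG_4 = \{r_8,\ F_{4,3} = \{A_{3,2}, A_{3,5}, A_{3,6}\}\}$;

$subDG_5 = \{r_9,\ F_{5,4} = \{A_{4,1}, A_{4,2}, A_{4,4}, A_{4,5}\}\}$.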
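The single step of OVF amounts to a union of attribute sets over the rules of each distinct subgraph. The sketch below is one possible reading, not the paper's implementation: it stores each rule-to-attribute dependency matrix in a sparse form (a hypothetical `USES` dictionary listing, per base relation, only the rules with non-zero rows) and uses the distinct subgraphs of Example 3.

```python
# Distinct subgraphs from Example 3: subgraph number -> rule numbers.
SUBDGS = {1: [1, 2, 3, 4], 2: [3, 5], 3: [3, 4, 6, 7], 4: [8], 5: [9]}

# Sparse view of AD_1..AD_4 from Example 4 (illustrative encoding):
# USES[k][i] is the set of attribute positions j with AD_k[i, j] = 1.
USES = {
    1: {1: {1, 2, 3}},
    2: {6: {2, 3}},
    3: {3: {1, 2, 4, 5}, 8: {2, 5, 6}},
    4: {7: {1, 5}, 9: {1, 2, 4, 5}},
}


def ovf(subdgs, uses):
    """Return {(i, k): attribute positions of R_k referenced by some rule in subDG_i}."""
    fragments = {}
    for i, rules in subdgs.items():
        for k, by_rule in uses.items():
            attrs = set()
            for r in rules:
                attrs |= by_rule.get(r, set())
            if attrs:                      # empty minimal sets are simply omitted
                fragments[(i, k)] = attrs
    return fragments


FS = ovf(SUBDGS, USES)
for (i, k), attrs in sorted(FS.items()):
    print(f"F[{i},{k}] = {sorted(attrs)}")
```

Because overlapping is allowed, the same attribute set of $R_3$ appears in the fragments of $subDG_1$, $subDG_2$ and $subDG_3$, which all contain $r_3$.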

3.2.2. Disjoint Vertical Fragmentation

In this subsection, we present the approach of DVF, which is used as an alternative to OVF in conjunction with RCA.

It is common in deductive databases that different rules reference the (same set of attributes of the) same base relation. It is also likely that different distinct subgraphs, as discussed in Section 3.1, include the same subset of rules that are to be distributed over different sites in a DDDBS. Hence, different distinct subgraphs may be extended to include the (same set of attributes of the) same base relation. If we do not allow replication of a subset of attributes A of a base relation for various reasons, we must decide which distinct subgraph should include A. Therefore, one of the design issues of our disjoint vertical fragmentation approach is to determine the distinct subgraph $\sigma$ to which A should be assigned when different distinct subgraphs, including $\sigma$, depend on A†. Our primary goal in determining the allocation of A is to minimize data transfer across the network during the query evaluation process that involves A.

Given a set of attributes A in base relation R and a distinct subgraph $\sigma$, there are two cases to be considered for the allocation of A to $\sigma$: (i) $\sigma$ depends on A, or (ii) $\sigma$ does not depend on A. Obviously, we prefer to assign A to $\sigma$ if $\sigma$ depends on A to minimize data transfer during the query evaluation process involving rules in $\sigma$ that depend on A. We also need to consider the

†A rule r depends on (a subset of attributes A of) base relation R, or (A in) R is referenced by r, if (A in) R appears (as arguments of some predicates) in the body of r. Any distinct subgraph $\sigma$ that includes r is said to depend on (A in) R, or (A in) R is said to be referenced by $\sigma$.


situation when two or more distinct subgraphs depend on A. Our clustering strategy is based on the query access frequency of rules such that a more frequently referenced rule should be given priority on clustering with the data on which it depends. Before we discuss the approach of our DVF for assigning A to one of these distinct subgraphs, we give the following definitions.

Definition 12 Given a DDDBS D with $N_Q$ distinct queries, the query-access-frequency vector of D, denoted $freq_Q$, is defined as

$$freq_Q[i] = n$$

where i denotes query $Q_i$ ($1 \le i \le N_Q$) and n denotes the access frequency of query $Q_i$ during a given period of time in D [14].

Rule $r_j$ is said to be referenced by query $Q_i$, or $Q_i$ depends on $r_j$, if the head predicate of $r_j$ appears in $Q_i$. Furthermore, if distinct subgraph $\sigma_k$ includes $r_j$, which is referenced by $Q_i$, then we say that $\sigma_k$ is referenced by $Q_i$ or $Q_i$ depends on $\sigma_k$.

Definition 13 Given a DDDBS D with $N_Q$ distinct queries and $N_r$ distinct rules, the $N_Q \times N_r$ query-access-rule matrix Qr of D is defined as

$$Qr[i,j] = \begin{cases} 1 & \text{if rule } r_j \text{ is referenced by query } Q_i \\ 0 & \text{otherwise} \end{cases}$$

where $1 \le i \le N_Q$ and $1 \le j \le N_r$.

Definition 14 Given a DDDBS D, its query-access-frequency vector $freq_Q$ and query-access-rule matrix Qr, the rule-access-frequency vector $freq_r$ of D is defined as

$$freq_r = freq_Q \times Qr$$

where $freq_r[i]$ denotes the access frequency of rule $r_i$ according to the given set of queries during a given period of time in D.

Definition 15 Given a DDDBS D and its rule-access-frequency vector $freq_r$, the subgraph-access-frequency vector of a set of distinct subgraphs of D, denoted $freq_\sigma$, for the given set of queries during a given period of time in D is defined as

$$freq_\sigma[i] = \sum_{k} freq_r[k]$$

where rule $r_k$ is included in distinct subgraph $\sigma_i$.

Given the definitions above, we now define the most frequently referenced rules/distinct subgraphs as follows:

Definition 16 Given a rule-access-frequency vector $freq_r$, rule $r_i$ is the most frequently referenced rule among the set of $N_r$ rules denoted in $freq_r$ if

$$\max(freq_r[1], \ldots, freq_r[N_r]) = freq_r[i].$$

With respect to the reference frequency of rules according to a given set of queries, we can define the most frequently referenced distinct subgraph among the given set of distinct subgraphs. This can be done by replacing $r_i$, $N_r$, and $freq_r$ in Definition 16 by distinct subgraph $\sigma_i$, the number of the distinct subgraphs $N_\sigma$, and the subgraph-access-frequency vector $freq_\sigma$, respectively.

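Example 5 Consider the query-access-frequency vector $freq_Q$ and the query-access-rule matrix Qr given below.

$$freq_Q = [\,2\ \ 16\ \ 10\ \ 6\ \ 10\ \ 7\ \ 7\,] \qquad
Qr = \begin{bmatrix}
1&0&0&0&0&1&0&1&0 \\
0&0&1&0&0&0&0&0&0 \\
0&1&0&0&0&1&1&0&0 \\
1&0&0&1&0&0&0&1&0 \\
0&0&1&0&1&0&0&0&1 \\
0&1&1&0&0&1&0&0&0 \\
1&0&0&0&0&1&0&0&1
\end{bmatrix}$$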


The rule-access-frequency vector is calculated as

$$freq_r = freq_Q \times Qr = [\,15\ \ 17\ \ 33\ \ 6\ \ 10\ \ 26\ \ 10\ \ 8\ \ 17\,]$$

which indicates that $r_3$ is the most frequently referenced rule according to the given set of seven queries. Subsequently, using the distinct subgraphs generated in Example 3, we compute the subgraph-access-frequency vector as follows:

$$\begin{aligned}
freq_\sigma[1] &= freq_r[1] + freq_r[2] + freq_r[3] + freq_r[4] = 15 + 17 + 33 + 6 = 71 \\
freq_\sigma[2] &= freq_r[3] + freq_r[5] = 33 + 10 = 43 \\
freq_\sigma[3] &= freq_r[3] + freq_r[4] + freq_r[6] + freq_r[7] = 33 + 6 + 26 + 10 = 75 \\
freq_\sigma[4] &= freq_r[8] = 8 \\
freq_\sigma[5] &= freq_r[9] = 17
\end{aligned}$$

Hence, $subDG_3$ is the most frequently referenced distinct subgraph among all the distinct subgraphs computed in Example 3, followed by $subDG_1$, $subDG_2$, $subDG_5$, and $subDG_4$. □
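The arithmetic of Example 5 can be checked with a short sketch of Definitions 14 to 16. The function names are illustrative; the inputs are $freq_Q$ and Qr from Example 5 and the distinct subgraphs of Example 3.

```python
FREQ_Q = [2, 16, 10, 6, 10, 7, 7]
QR = [
    [1, 0, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0, 0, 0, 1],
    [0, 1, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 0, 0, 1],
]
SUBDGS = [[1, 2, 3, 4], [3, 5], [3, 4, 6, 7], [8], [9]]  # rule numbers per subDG


def rule_access_frequency(freq_q, qr):
    """freq_r = freq_Q x Qr (row vector times matrix)."""
    return [sum(f * row[j] for f, row in zip(freq_q, qr))
            for j in range(len(qr[0]))]


def subgraph_access_frequency(freq_r, subdgs):
    """freq_sigma[i] = sum of freq_r[k] over the rules r_k in subDG_i."""
    return [sum(freq_r[r - 1] for r in rules) for rules in subdgs]


freq_r = rule_access_frequency(FREQ_Q, QR)
freq_sigma = subgraph_access_frequency(freq_r, SUBDGS)
most_referenced_rule = freq_r.index(max(freq_r)) + 1          # 1-based rule number
most_referenced_subdg = freq_sigma.index(max(freq_sigma)) + 1  # 1-based subDG number
print(freq_r, freq_sigma, most_referenced_rule, most_referenced_subdg)
```

Note that `list.index` returns the first maximal position, which matches the tie-breaking convention of taking the first candidate considered.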

With the above definitions, we now present our approach for assigning a subset of attributes A of base relation R to one of the distinct subgraphs that depends on A. Given a set of queries, if a set of two or more distinct subgraphs S depends on A, we assign A to the distinct subgraph which is the most frequently referenced distinct subgraph in S so that data transfer during query processing involving A can be minimized. Hence, the following criteria are used in DVF:

[Assignment Criteria] Distinct subgraph $subDG_i$ in S is assigned a subset of attributes A of base relation R if

Criterion 1. $subDG_i$ depends on A, whereas no other distinct subgraph does, or

Criterion 2. $subDG_i$ is the most frequently referenced distinct subgraph in S. If there exists more than one most frequently referenced distinct subgraph in S, then A is assigned to $subDG_i$ if $subDG_i$ is the first to be considered for A.

Let’s consider Ckiteria 2 for A which is competed by more than one distinct subgraph. Suppose that distinct subgraph subDGi depends on the subset of attributes Si of R, subDGs depends on another subset of attributes 5’s of R, subDG3 depends on subset Ss of R, and so forth. Further assume that Si n S’Z n Ss fl . = A (# 0). If subDG i is the most frequently referenced distinct subgraph among all the distinct subgraphs that depends on A, then A is assigned to subDGr We apply the same criteria for determining the assignment of Sr - A, S2 - A, and so forth.

Algorithm 2 (DVF: Disjoint Vertical Fragmentation)

Input: Distinct subgraphs subDG$_1$, ..., subDG$_{N_\sigma}$, direct rule-to-rule dependency matrix $rr$, rule-to-attribute dependency matrix $AD_k$ for each base relation $R_k$, query-access-frequency vector $freq_Q$, and query-access-rule matrix $Qr$.

Output: A set of disjoint vertical fragments $FS$. $F_{i,k}$ in $FS$ is a fragment of base relation $R_k$ that is referenced by subDG$_i$, $1 \le i \le N_\sigma$.

Step 1. For each distinct subgraph subDG$_i$ ($1 \le i \le N_\sigma$), identify all the attributes of each base relation $R_k$ that are referenced by (a rule in) subDG$_i$ using $AD_k$.

Step 2. For each subset of attributes $A$ of base relation $R_k$ that is referenced by only subDG$_i$, apply the first criterion of the Assignment Criteria and let $F_{i,k} = A$.

Step 3. For each subset of attributes $A$ of base relation $R_k$ that is referenced by more than one distinct subgraph, apply the second criterion of the Assignment Criteria to determine the assignment of $A$ and let $F_{i,k} = A$.


subDG       rules                       dependent attributes
subDG$_1$   $r_1, r_2, r_3, r_4$        $\{A_{1,1}, A_{1,2}, A_{1,3}\}$, $\{A_{3,1}, A_{3,2}, A_{3,4}, A_{3,5}\}$
subDG$_2$   $r_3, r_5$                  $\{A_{3,1}, A_{3,2}, A_{3,4}, A_{3,5}\}$
subDG$_3$   $r_3, r_4, r_6, r_7$        $\{A_{2,2}, A_{2,3}\}$, $\{A_{4,1}, A_{4,5}\}$, $\{A_{3,1}, A_{3,2}, A_{3,4}, A_{3,5}\}$
subDG$_4$   $r_8$                       $\{A_{3,2}, A_{3,5}, A_{3,6}\}$
subDG$_5$   $r_9$                       $\{A_{4,1}, A_{4,2}, A_{4,4}, A_{4,5}\}$

Table 1: Attributes on which Distinct Subgraphs Depend

Example 6 Consider base relation $R_3$ of the DDDB in Example 1 and the subgraph-access-frequency vector computed in Example 5. Suppose that the rule-to-attribute dependency matrices are as given in Example 4. $AD_3$ in Example 4 indicates that rule $r_3$ depends on attributes $A_{3,1}$, $A_{3,2}$, $A_{3,4}$ and $A_{3,5}$, whereas rule $r_8$ depends on attributes $A_{3,2}$, $A_{3,5}$ and $A_{3,6}$. As a result, subDG$_1$, subDG$_2$, and subDG$_3$, which include $r_3$, are competing for $A_{3,1}$, $A_{3,2}$, $A_{3,4}$ and $A_{3,5}$. SubDG$_4$, which includes $r_8$, is competing with subDG$_1$, subDG$_2$, and subDG$_3$ for $A_{3,2}$ and $A_{3,5}$.

Using the subDGs and matrices in Examples 3 and 4, respectively, step 1 of DVF identifies the sets of dependent attributes as shown in Table 1.

Note that subDG$_1$, subDG$_2$, and subDG$_3$ are competing for the same subset of attributes in $R_3$, i.e., $\{A_{3,1}, A_{3,2}, A_{3,4}, A_{3,5}\}$, whereas subDG$_4$ is competing for a different subset of attributes of $R_3$, i.e., $\{A_{3,2}, A_{3,5}, A_{3,6}\}$. Since no other distinct subgraph competes for attribute $A_{3,6}$ with subDG$_4$, by step 2 of DVF, $A_{3,6}$ is assigned to subDG$_4$.

We now consider the assignment of $A_{3,1}$, $A_{3,2}$, $A_{3,4}$, and $A_{3,5}$ to either subDG$_1$, subDG$_2$, subDG$_3$, or partly to subDG$_4$. SubDG$_1$, subDG$_2$, and subDG$_3$ compete for the common attributes $A_{3,1}$, $A_{3,2}$, $A_{3,4}$ and $A_{3,5}$, whereas subDG$_1$, subDG$_2$, subDG$_3$ and subDG$_4$ compete for $A_{3,2}$ and $A_{3,5}$. According to the subgraph-access-frequency vector computed in Example 5 and the Assignment Criteria, we assign $A_{3,2}$ and $A_{3,5}$ to subDG$_3$ since subDG$_3$ is the most frequently referenced subgraph among the four distinct subgraphs. Also, we assign $A_{3,1}$ and $A_{3,4}$ to subDG$_3$ for the same reason among subDG$_1$, subDG$_2$, and subDG$_3$. Hence, appending the fragmentation of $R_3$ to the subDGs in Example 3 yields

subDG$_1 \supseteq \{r_1, r_2, r_3, r_4, F_{1,3} = \emptyset\} = \{r_1, r_2, r_3, r_4\}$;

subDG$_2 \supseteq \{r_3, r_5, F_{2,3} = \emptyset\} = \{r_3, r_5\}$;

subDG$_3 \supseteq \{r_3, r_4, r_6, r_7, F_{3,3} = \{A_{3,1}, A_{3,2}, A_{3,4}, A_{3,5}\}\}$;

subDG$_4 \supseteq \{r_8, F_{4,3} = \{A_{3,6}\}\}$; and

subDG$_5 \supseteq \{r_9, F_{5,3} = \emptyset\} = \{r_9\}$.

Note that we have yet to assign attribute $A_{3,3}$ to any of the distinct subgraphs. This happens because no distinct subgraph depends on $A_{3,3}$. We discuss the strategy to allocate $A_{3,3}$ in the Cluster Allocation Algorithm in the next section. □
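The assignment of the attributes of $R_3$ in Example 6 can be sketched as follows. This is a minimal illustration of the Assignment Criteria, not the authors' implementation: the dictionary `depends` restates the $R_3$ entries of Table 1, the six-attribute list for $R_3$ is inferred from the attribute names used in the example, and the function name `dvf_assign` is hypothetical.

```python
# Subgraph-access frequencies from Example 5 (index 0 holds subDG1).
freq_sigma = [71, 43, 75, 8, 17]

# Attributes of R3 on which each distinct subgraph depends (Table 1).
depends = {
    1: {"A3,1", "A3,2", "A3,4", "A3,5"},
    2: {"A3,1", "A3,2", "A3,4", "A3,5"},
    3: {"A3,1", "A3,2", "A3,4", "A3,5"},
    4: {"A3,2", "A3,5", "A3,6"},
    5: set(),
}

# Assumed attribute list of R3 (A3,1 .. A3,6).
r3_attributes = ["A3,1", "A3,2", "A3,3", "A3,4", "A3,5", "A3,6"]


def dvf_assign(attributes, depends, freq_sigma):
    """Assign each attribute to exactly one subgraph, or leave it unassigned."""
    fragments = {s: set() for s in depends}
    unassigned = []
    for attr in attributes:
        candidates = [s for s in sorted(depends) if attr in depends[s]]
        if not candidates:
            # No subgraph depends on the attribute; it is left for allocation.
            unassigned.append(attr)
            continue
        # Criterion 1 is the single-candidate case. Criterion 2 picks the most
        # frequently referenced candidate; max() returns the first maximum, so
        # ties go to the subgraph considered first.
        winner = max(candidates, key=lambda s: freq_sigma[s - 1])
        fragments[winner].add(attr)
    return fragments, unassigned


fragments, unassigned = dvf_assign(r3_attributes, depends, freq_sigma)
print(fragments, unassigned)
```

Processing attributes one at a time gives the same result as processing the competing subsets of the example, because every attribute of a competing subset has the same set of candidate subgraphs.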

3.3. Cluster Allocation

Having included in each distinct subgraph a set of rules $S_r$, together with the corresponding set of vertical fragments of base relations that are necessary to evaluate the rules in $S_r$, by using RCA and either OVF or DVF, FAA proceeds to choose network sites for the allocation of the clusters. Each cluster contains the rules and all the corresponding vertical fragments of base relations in a particular distinct subgraph, and the sites are chosen by the Cluster Allocation Algorithm (CAA). This section describes the steps in CAA.

Our design strategy for the allocation of clusters of rules and data is consistent with the strategy that we use for clustering rules and data discussed in the preceding sections, i.e., our primary concern in the allocation of the clusters over the network is to minimize data transfer across the network during the query evaluation process. For this purpose, we consider the access frequency of a query at a particular network site which is shown in the query-access-frequency-at-site matrix.


Definition 17 Given a DDDBS with $N_S$ network sites and $N_Q$ queries, an $N_S \times N_Q$ query-access-frequency-at-site matrix $freq_{SQ}$ is defined as follows:

$$freq_{SQ}[i, j] = n,$$

where $1 \le i \le N_S$, $1 \le j \le N_Q$, and $n$ denotes the number of times the $j$th query is initiated at site $i$ during a given period of time.

Furthermore, the sum of the $j$th column of $freq_{SQ}$ denotes the access frequency of the $j$th query at different sites. Thus, the sum of the $j$th column in $freq_{SQ}$ must be the same as $freq_Q[j]$ as defined in Definition 12. Subsequently, we can derive $freq_Q$ using $freq_{SQ}$ as follows:

$$freq_Q[j] = \sum_{i=1}^{N_S} freq_{SQ}[i, j], \quad \text{where } 1 \le j \le N_Q.$$

Example 7 Suppose that we are given the following query-access-frequency-at-site matrix for the four network sites in Example 1 and the seven queries mentioned in Example 5:

$$freq_{SQ} = \begin{bmatrix}
1&3&3&1&0&1&0\\
0&2&0&4&2&3&1\\
1&11&2&0&5&0&2\\
0&0&5&1&3&3&4
\end{bmatrix}$$

The summation of each column of $freq_{SQ}$ yields

$$\left[\ \sum_{i=1}^{4} freq_{SQ}[i,1],\ \sum_{i=1}^{4} freq_{SQ}[i,2],\ \ldots,\ \sum_{i=1}^{4} freq_{SQ}[i,7]\ \right] = [\,2\ \ 16\ \ 10\ \ 6\ \ 10\ \ 7\ \ 7\,]$$

which is $freq_Q$ in Example 5. □

Using the query-access-frequency-at-site matrix and the query-access-rule matrix of a DDDBS $D$, we determine the access frequency of a rule at a network site in $D$, represented by the rule-access-frequency-at-site matrix.

Definition 18 Given a query-access-frequency-at-site matrix $freq_{SQ}$ and a query-access-rule matrix $Qr$, an $N_S \times N_r$ rule-access-frequency-at-site matrix $freq_{Sr}$ can be computed as

$$freq_{Sr} = freq_{SQ} \times Qr,$$

where $freq_{Sr}[i, j] = n$ denotes that site $i$ accesses the $j$th rule $n$ times during a given period of time.

Furthermore, the sum of the $j$th column of $freq_{Sr}$ denotes the access frequency of rule $j$ at different sites. Thus, the sum of the $j$th column of $freq_{Sr}$ must be the same as $freq_r[j]$ in Definition 14. Subsequently, we can derive $freq_r$ using $freq_{Sr}$ as follows:

$$freq_r[j] = \sum_{i=1}^{N_S} freq_{Sr}[i, j], \quad \text{where } 1 \le j \le N_r.$$

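Example 8 Consider $freq_{SQ}$ in Example 7 and $Qr$ in Example 5 again. We obtain the access frequency of a rule at a network site by computing the rule-access-frequency-at-site matrix $freq_{Sr}$ as follows:

$$freq_{Sr} = freq_{SQ} \times Qr =
\begin{bmatrix}
1&3&3&1&0&1&0\\
0&2&0&4&2&3&1\\
1&11&2&0&5&0&2\\
0&0&5&1&3&3&4
\end{bmatrix}
\times
\begin{bmatrix}
1&0&0&0&0&1&0&1&0\\
0&0&1&0&0&0&0&0&0\\
0&1&0&0&0&1&1&0&0\\
1&0&0&1&0&0&0&1&0\\
0&0&1&0&1&0&0&0&1\\
0&1&1&0&0&1&0&0&0\\
1&0&0&0&0&1&0&0&1
\end{bmatrix}
=
\begin{bmatrix}
2&4&4&1&0&5&3&2&0\\
5&3&7&4&2&4&0&4&3\\
3&2&16&0&5&5&2&1&7\\
5&8&6&1&3&12&5&1&7
\end{bmatrix}$$

The summation of each column of $freq_{Sr}$ yields

$$\left[\ \sum_{i=1}^{4} freq_{Sr}[i,1],\ \sum_{i=1}^{4} freq_{Sr}[i,2],\ \ldots,\ \sum_{i=1}^{4} freq_{Sr}[i,9]\ \right] = [\,15\ \ 17\ \ 33\ \ 6\ \ 10\ \ 26\ \ 10\ \ 8\ \ 17\,]$$

which is $freq_r$ in Example 5. □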
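The relationships stated in Definitions 17 and 18 can be verified numerically. In the sketch below, the entries of `freq_sq` are the values that are consistent with both the column sums of Example 7 and the product of Example 8; the helper names `mat_mul` and `column_sums` are illustrative.

```python
# Query-access-frequency-at-site matrix (4 sites x 7 queries), Example 7.
freq_sq = [
    [1, 3, 3, 1, 0, 1, 0],
    [0, 2, 0, 4, 2, 3, 1],
    [1, 11, 2, 0, 5, 0, 2],
    [0, 0, 5, 1, 3, 3, 4],
]

# Query-access-rule matrix (7 queries x 9 rules), Example 5.
qr = [
    [1, 0, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0, 0, 0, 1],
    [0, 1, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 1, 0, 0, 1],
]


def mat_mul(a, b):
    """Ordinary (non-Boolean) matrix multiplication."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def column_sums(m):
    """Sum every column of a matrix."""
    return [sum(col) for col in zip(*m)]


# Rule-access-frequency-at-site matrix: freq_Sr = freq_SQ x Qr.
freq_sr = mat_mul(freq_sq, qr)

freq_q = column_sums(freq_sq)   # recovers freq_Q (Definition 17)
freq_r = column_sums(freq_sr)   # recovers freq_r (Definition 18)

print(freq_sr, freq_q, freq_r)
```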

Based on the notion of the access frequency of a rule at different network sites, we now define the access frequency of a distinct subgraph at a network site, represented by the subgraph-access-frequency-at-site matrix, which is computed by using the rule-access-frequency-at-site matrix.

Definition 19 Given a rule-access-frequency-at-site matrix $freq_{Sr}$, an $N_S \times N_\sigma$ subgraph-access-frequency-at-site matrix of the given set of sites and distinct subgraphs, denoted $freq_{S\sigma}$, is defined as follows:

$$freq_{S\sigma}[i, j] = \sum_{k} freq_{Sr}[i, k],$$

where rule $r_k$ is included in distinct subgraph $\sigma_j$ and $\sigma_j$ is referenced by a query at site $i$.

We now define the site which accesses distinct subgraph $\sigma_j$ most frequently among the given $N_S$ sites in a network.

Definition 20 Given the subgraph-access-frequency-at-site matrix $freq_{S\sigma}$ of a DDDBS with $N_S$ sites, site $S_i$ is the site which accesses distinct subgraph $\sigma_j$ ($1 \le j \le N_\sigma$) most frequently if

$$\max(freq_{S\sigma}[1, j], \ldots, freq_{S\sigma}[N_S, j]) = freq_{S\sigma}[i, j].$$

Example 9 Consider $freq_{Sr}$ in Example 8 and the five distinct subgraphs generated in Example 3 again. We obtain the subgraph-access-frequency-at-site matrix as follows:

$$freq_{S\sigma} =
\begin{bmatrix}
2+4+4+1 & 4+0 & 4+1+5+3 & 2 & 0\\
5+3+7+4 & 7+2 & 7+4+4+0 & 4 & 3\\
3+2+16+0 & 16+5 & 16+0+5+2 & 1 & 7\\
5+8+6+1 & 6+3 & 6+1+12+5 & 1 & 7
\end{bmatrix}
=
\begin{bmatrix}
11&4&13&2&0\\
19&9&15&4&3\\
21&21&23&1&7\\
20&9&24&1&7
\end{bmatrix}$$

Using $freq_{S\sigma}$, we can determine which site accesses subgraph $\sigma_i$, $1 \le i \le 5$, most frequently among all the network sites. As computed, site 3 is the one accessing $\sigma_1$ and $\sigma_2$ most frequently, sites 4 and 2 are the sites accessing $\sigma_3$ and $\sigma_4$ most frequently, respectively, whereas both sites 3 and 4 access $\sigma_5$ most frequently. □
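Definitions 19 and 20 can be traced on the data of Example 9 with the sketch below. It sums the per-rule frequencies of each subgraph at every site, as the example does, and does not model the additional condition that the subgraph be referenced by a query at the site. The names `subgraph_rules` and `most_frequent_sites` are illustrative.

```python
# Rule-access-frequency-at-site matrix (4 sites x 9 rules), Example 8.
freq_sr = [
    [2, 4, 4, 1, 0, 5, 3, 2, 0],
    [5, 3, 7, 4, 2, 4, 0, 4, 3],
    [3, 2, 16, 0, 5, 5, 2, 1, 7],
    [5, 8, 6, 1, 3, 12, 5, 1, 7],
]

# Rules (1-based) contained in subDG1..subDG5, as used in Example 9.
subgraph_rules = [[1, 2, 3, 4], [3, 5], [3, 4, 6, 7], [8], [9]]

# Subgraph-access-frequency-at-site matrix (4 sites x 5 subgraphs).
freq_s_sigma = [
    [sum(row[r - 1] for r in rules) for rules in subgraph_rules]
    for row in freq_sr
]


def most_frequent_sites(matrix, j):
    """Return the 1-based sites whose entry in column j equals the column maximum."""
    column = [row[j] for row in matrix]
    top = max(column)
    return [i + 1 for i, value in enumerate(column) if value == top]


top_sites = [most_frequent_sites(freq_s_sigma, j) for j in range(len(subgraph_rules))]
print(freq_s_sigma, top_sites)
```

Returning a list of sites rather than a single site keeps ties visible, which matters for the tie-breaking step of the allocation algorithm.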

We now propose a strategy for allocating each cluster of rules and vertical fragments of base relations computed by either OVF or DVF to a network site.


Algorithm 3 (CAA: Cluster Allocation)

Input: A set of clusters $C_1, \ldots, C_{N_\sigma}$, a set of network sites $S_1, \ldots, S_{N_S}$ with the associated subgraph-access-frequency-at-site matrix $freq_{S\sigma}$, the network topology matrix $T$, and $\mathcal{A}$, which is the set of attributes not referenced by any distinct subgraph.

Output: The allocation of each cluster $C_i$ ($1 \le i \le N_\sigma$) and $\mathcal{A}$ to a network site.

Step 1. For each cluster $C_i$ ($1 \le i \le N_\sigma$), identify the site $S_i^M$ which accesses $C_i$ most frequently among all of the sites in the network using $freq_{S\sigma}$.

Step 2. If there exists only one site $S_i^M$ which accesses $C_i$ most frequently, then allocate $C_i$ to $S_i^M$.

Step 3. If there exist a number of sites $S_1, \ldots, S_n$ which access $C_i$ most frequently, choose one of these sites whose connection weight is the smallest using $T$. (If more than one site has the smallest connection weight, arbitrarily choose one of these sites.) Allocate $C_i$ to the chosen site.

Step 4. Allocate $\mathcal{A}$ to the site in the network whose connection weight is the smallest. (If there exists more than one such site, arbitrarily choose one of them.)

Note that we consider the connection weight at step 3 above. The connection weight $CW_i$ of site $S_i$ ($1 \le i \le N_S$) indicates the data transfer cost† between $S_i$ and all other sites in the network. When two or more sites access a cluster $C$ at the same frequency, it is reasonable to allocate $C$ to the site whose connection weight is the smallest among them in order to minimize data transfer cost during the query evaluation process in general.

Example 10 Consider the subgraph-access-frequency-at-site matrix $freq_{S\sigma}$ in Example 9 and the network topology matrix $T$ given in Example 1 again. We identify the site $S_i^M$ ($1 \le i \le 5$) that accesses cluster $C_i$ most frequently as follows using $freq_{S\sigma}$:

$$S_1^M = S_3, \quad S_2^M = S_3, \quad S_3^M = S_4, \quad S_4^M = S_2, \quad S_5^M = S_3, S_4$$

Hence, we allocate $C_1$ and $C_2$ to site 3, $C_3$ to site 4, and $C_4$ to site 2. In the case of $C_5$, both site 3 and site 4 access it at the same frequency, and hence we consider the connection weight of these sites. The connection weight of each site $CW_i$, $1 \le i \le 4$, is computed as follows using $T$:

$CW_1 = 0 + 1 + 4 + 3 = 8$, $CW_2 = 1 + 0 + 3 + 2 = 6$,

$CW_3 = 4 + 3 + 0 + 5 = 12$, $CW_4 = 3 + 2 + 5 + 0 = 10$.

Hence, $C_5$ is allocated to site 4 since $CW_4 < CW_3$. Furthermore, by step 4 of Algorithm CAA, $A_{3,3}$ in base relation $R_3$ is allocated to site 2 since $A_{3,3}$ is not referenced by any rule, and hence not by any distinct subgraph. □
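The steps of Algorithm CAA on the data of Examples 9 and 10 can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: the rows of `t` are taken from the connection-weight sums shown in Example 10, the function names are hypothetical, and the "arbitrary" choices in steps 3 and 4 are resolved by taking the lowest-numbered site.

```python
# Subgraph-access-frequency-at-site matrix (4 sites x 5 clusters), Example 9.
freq_s_sigma = [
    [11, 4, 13, 2, 0],
    [19, 9, 15, 4, 3],
    [21, 21, 23, 1, 7],
    [20, 9, 24, 1, 7],
]

# Network topology matrix; each row matches a CW_i sum in Example 10.
t = [
    [0, 1, 4, 3],
    [1, 0, 3, 2],
    [4, 3, 0, 5],
    [3, 2, 5, 0],
]


def connection_weights(topology):
    """Connection weight of a site: the sum of its row in the topology matrix."""
    return [sum(row) for row in topology]


def allocate_clusters(freq, topology):
    """Return ({cluster: site}, site for unreferenced attributes), all 1-based."""
    cw = connection_weights(topology)
    n_sites = len(freq)
    allocation = {}
    for j in range(len(freq[0])):
        column = [freq[i][j] for i in range(n_sites)]
        top = max(column)
        # Step 1: every site that accesses the cluster most frequently.
        tied = [i for i in range(n_sites) if column[i] == top]
        # Steps 2 and 3: a single site wins outright; otherwise the tied site
        # with the smallest connection weight is chosen (first one on equal weights).
        allocation[j + 1] = min(tied, key=lambda i: cw[i]) + 1
    # Step 4: unreferenced attributes go to the site with the smallest weight.
    leftover_site = min(range(n_sites), key=lambda i: cw[i]) + 1
    return allocation, leftover_site


allocation, leftover_site = allocate_clusters(freq_s_sigma, t)
print(connection_weights(t), allocation, leftover_site)
```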

†It is assumed that the physical distance between two network sites is proportional to the data transfer cost.


4. MATHEMATICAL INTERPRETATION OF FAA

In this section, we present the mathematical implication of the proposed algorithms.

4.1. Mathematical Implication of RCA

We construct the direct and indirect rule-to-rule dependency matrix $R_{rr}$ and the segment-to-rule dependency matrix $R'_{rr}$, which correspond to the steps of constructing (prospective) distinct subgraphs as discussed in Section 3.1. This can be done by first computing the reachability matrix $R_{rule}$, which captures all direct and indirect rule-to-rule dependencies among different rules, from the given direct rule-to-rule dependency matrix $rr$, and then performing a Boolean addition on $R_{rule}$ and an $N_r \times N_r$ identity matrix $I$, where $N_r$ is the number of distinct rules in a given database. ($R_{rule} \vee I$ retains all the rules that are not reachable from other rules in the database.) Hence,

$$R_{rule} = rr^{(1)} \vee rr^{(2)} \vee \cdots \vee rr^{(N_r)} \qquad (1)$$

$$R_{rr} = R_{rule} \vee I \qquad (2)$$

Hereafter, we proceed to generate distinct subgraphs as computed in Section 3.1.2 using $R_{rr}$ by extracting each row of $R_{rr}$ that is not included in any other row of $R_{rr}$†. The resultant matrix is $R'_{rr}$, the segment-to-rule dependency matrix, where each row is called a segment, which is a vector representation of a distinct subgraph computed in Section 3.1.2.

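Example 11 Using the direct rule-to-rule dependency matrix $rr$ in Example 1, we compute

$$R_{rule} = rr^{(1)} \vee rr^{(2)} \vee \cdots \vee rr^{(9)}$$

$$= \begin{bmatrix}
0&0&0&1&0&0&0&0&0\\
1&0&0&1&0&0&0&0&0\\
0&0&0&0&0&0&0&0&0\\
0&0&1&0&0&0&0&0&0\\
0&0&1&0&0&0&0&0&0\\
0&0&0&1&0&0&1&0&0\\
0&0&0&1&0&0&0&0&0\\
0&0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0&0
\end{bmatrix}
\vee
\begin{bmatrix}
0&0&1&0&0&0&0&0&0\\
0&0&1&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0&0\\
0&0&1&0&0&0&0&0&0\\
0&0&1&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0&0
\end{bmatrix}
\vee \cdots \vee \mathbf{0}_{9 \times 9}$$

$$= \begin{bmatrix}
0&0&1&1&0&0&0&0&0\\
1&0&1&1&0&0&0&0&0\\
0&0&0&0&0&0&0&0&0\\
0&0&1&0&0&0&0&0&0\\
0&0&1&0&0&0&0&0&0\\
0&0&1&1&0&0&1&0&0\\
0&0&1&1&0&0&0&0&0\\
0&0&0&0&0&0&0&0&0\\
0&0&0&0&0&0&0&0&0
\end{bmatrix}$$

$$R_{rr} = R_{rule} \vee I = \begin{bmatrix}
1&0&1&1&0&0&0&0&0\\
1&1&1&1&0&0&0&0&0\\
0&0&1&0&0&0&0&0&0\\
0&0&1&1&0&0&0&0&0\\
0&0&1&0&1&0&0&0&0\\
0&0&1&1&0&1&1&0&0\\
0&0&1&1&0&0&1&0&0\\
0&0&0&0&0&0&0&1&0\\
0&0&0&0&0&0&0&0&1
\end{bmatrix}$$

$$R'_{rr} = \begin{bmatrix}
1&1&1&1&0&0&0&0&0\\
0&0&1&0&1&0&0&0&0\\
0&0&1&1&0&1&1&0&0\\
0&0&0&0&0&0&0&1&0\\
0&0&0&0&0&0&0&0&1
\end{bmatrix}$$

The $i$th row of $R'_{rr}$, $1 \le i \le 5$, corresponds to the distinct subgraph subDG$_i$ in Example 3. □

†Row $i$ in $R_{rr}$ includes row $j$ of $R_{rr}$ if $\forall k\ (R_{rr}[j, k] = 1$ implies $R_{rr}[i, k] = 1)$, where $1 \le i, j \le N_r$ and $N_r$ denotes the number of distinct rules in a database.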
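The construction in Example 11 can be reproduced with the sketch below. It computes the reachability matrix with Warshall's algorithm instead of forming the Boolean powers one by one, which yields the same closure. The function names are illustrative, and keeping only the first copy of identical rows in `segments` is an added assumption that this example never triggers.

```python
# Direct rule-to-rule dependency matrix rr (9 rules), as used in Example 11.
rr = [
    [0, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]


def transitive_closure(m):
    """Warshall's algorithm: reach[i][j] = 1 if rule i depends on rule j, directly or indirectly."""
    n = len(m)
    reach = [row[:] for row in m]
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = 1
    return reach


def add_identity(m):
    """Boolean addition of the identity matrix (R_rule v I)."""
    return [[1 if i == j else v for j, v in enumerate(row)] for i, row in enumerate(m)]


def segments(r_rr):
    """Keep each row that is not included in any other row."""
    kept = []
    for i, row in enumerate(r_rr):
        dominated = False
        for j, other in enumerate(r_rr):
            if i == j:
                continue
            if all(x <= y for x, y in zip(row, other)):
                # Identical rows include each other; keep only the first copy (assumption).
                if row != other or j < i:
                    dominated = True
                    break
        if not dominated:
            kept.append(row)
    return kept


r_rule = transitive_closure(rr)
r_rr = add_identity(r_rule)
r_rr_prime = segments(r_rr)
print(r_rr_prime)
```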

4.2. Mathematical Implication of OVF

Each row of the segment-to-rule matrix $R'_{rr}$ is a segment which includes rules that are referenced by the corresponding distinct subgraph. Since the given rule-to-attribute dependency matrix $AD_k$ of base relation $R_k$ captures the information of which rules reference which attributes in base relation $R_k$, using $R'_{rr}$ and $AD_k$, we can determine which segment references which attributes in $R_k$. Hence, the set of minimal overlapping vertical fragments $F_k$ of base relation $R_k$ can be computed as follows:

$$F_k = R'_{rr} \times AD_k$$

Note that the '$\times$' operation in the above formula is not a Boolean multiplication. Instead, it is a normal matrix multiplication. The $i$th row of $F_k$ with at least one non-zero entry yields a minimal vertical fragment† of base relation $R_k$ for the $i$th segment, which represents the $i$th distinct subgraph, i.e., subDG$_i$, computed by RCA.

Example 12 Consider the segments captured in $R'_{rr}$, where $R'_{rr}$ is computed in Example 11, and the rule-to-attribute dependency matrices $AD_1$, $AD_2$, $AD_3$, and $AD_4$ of base relations $R_1$, $R_2$, $R_3$, and $R_4$, respectively, given in Example 4. Then,

$$F_1 = R'_{rr} \times AD_1; \qquad F_2 = R'_{rr} \times AD_2;$$

$$F_3 = R'_{rr} \times AD_3 = \begin{bmatrix}
1&1&0&1&1&0\\
1&1&0&1&1&0\\
1&1&0&1&1&0\\
0&1&0&0&1&1\\
0&0&0&0&0&0
\end{bmatrix}; \qquad
F_4 = R'_{rr} \times AD_4 = \begin{bmatrix}
0&0&0&0&0\\
0&0&0&0&0\\
1&0&0&0&1\\
0&0&0&0&0\\
1&1&0&1&1
\end{bmatrix}$$

We see that the $i$th row of $F_k$ ($1 \le i \le 5$, $1 \le k \le 4$) corresponds to $C_{i,k}$ in Example 4. Using the $i$th row of each $F_k$ and the $i$th row of $R'_{rr}$, we can construct the corresponding distinct subDG$_i$ in Example 4. □
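The product $F_k = R'_{rr} \times AD_k$ can be illustrated for base relation $R_3$. The matrix `ad3` below is an assumption: only the rows for $r_3$ and $r_8$ are stated in the text (Example 6), and the remaining rows are set to zero because that is the only choice consistent with the attribute sets listed for $R_3$ in Table 1. The helper name `mat_mul` is illustrative.

```python
# Segment-to-rule matrix R'_rr (5 segments x 9 rules), Example 11.
r_rr_prime = [
    [1, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1],
]

# Assumed rule-to-attribute dependency matrix AD_3 (9 rules x 6 attributes of R3).
ad3 = [[0] * 6 for _ in range(9)]
ad3[2] = [1, 1, 0, 1, 1, 0]  # r3 depends on A3,1, A3,2, A3,4, A3,5
ad3[7] = [0, 1, 0, 0, 1, 1]  # r8 depends on A3,2, A3,5, A3,6


def mat_mul(a, b):
    """Ordinary (non-Boolean) matrix multiplication."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


f3 = mat_mul(r_rr_prime, ad3)

# Each row with a non-zero entry is an overlapping vertical fragment of R3.
fragments = {
    i + 1: [f"A3,{j + 1}" for j, v in enumerate(row) if v]
    for i, row in enumerate(f3)
    if any(row)
}
print(f3, fragments)
```

Because the multiplication is ordinary rather than Boolean, an entry may exceed 1 when several rules of a segment reference the same attribute; the fragment extraction therefore tests for non-zero entries.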

†The $i$th row in $F_k$ is a minimal overlapping vertical fragment of $R_k$ since the $i$th row includes only the attributes of $R_k$ that are referenced by subDG$_i$.


4.3. Mathematical Implication of DVF

As discussed in Section 3.2.2, the major task of DVF is to determine the distinct subgraph to which a subset of attributes $A$ of a base relation should be assigned when more than one distinct subgraph references $A$. This task is accomplished by Algorithm DVF using the Assignment Criteria as discussed in Section 3.2.2. We now show that the task performed by Algorithm DVF can be accomplished by manipulating the segment-to-rule matrix $R'_{rr}$, the rule-to-attribute dependency matrix $AD_k$ for each base relation $R_k$, the query-access-frequency vector $freq_Q$, which is derived from the query-access-frequency-at-site matrix $freq_{SQ}$, and the query-access-rule matrix $Qr$ as follows, given $N_\sigma$, the number of distinct subgraphs, and $N_{A_k}$, the number of attributes in $R_k$:

[Step 1 of DVF] For each distinct subgraph subDG$_i$, $1 \le i \le N_\sigma$, the subset of attributes $A$ of base relation $R_k$ that is referenced by subDG$_i$ can be identified by using $R'_{rr}$ and $AD_k$ as follows:

$$F_k = R'_{rr} \times AD_k$$

where '$\times$' is the matrix multiplication operation, and for any $j$ ($1 \le j \le N_{A_k}$), $F_k[i, j] = 1$ denotes that the $j$th attribute of $R_k$ is referenced by subDG$_i$, and

$$A = \bigcup_{1 \le j \le N_{A_k}} attr(F_k[i, j])$$

where

$$attr(F_k[i, j]) = \begin{cases} \{A_{k,j}\} & \text{if } F_k[i, j] = 1 \\ \emptyset & \text{otherwise.} \end{cases}$$

[Step 2 of DVF] The subset of attributes $B_i$ of base relation $R_k$ that is referenced by only one distinct subgraph subDG$_i$ can be determined by using $F_k$ computed above. If there exists only one $i$ ($1 \le i \le N_\sigma$) such that $F_k[i, j] = 1$, for any $j$ ($1 \le j \le N_{A_k}$), then the $j$th attribute of $R_k$ is referenced only by subDG$_i$. Hence,

$$B_i = \bigcup_{1 \le j \le N_{A_k}} attr(F_k[i, j])$$

where

$$attr(F_k[i, j]) = \begin{cases} \{A_{k,j}\} & \text{if } F_k[i, j] = 1,\ F_k[i', j] = 0, \text{ and } i \neq i', \text{ for all } i',\ 1 \le i' \le N_\sigma \\ \emptyset & \text{otherwise,} \end{cases}$$

which indicates that $B_i$ should be assigned to subDG$_i$ according to the first criterion of the Assignment Criteria.

[Step 3 of DVF] The $j$th attribute $C_{k,j}$ of base relation $R_k$ that is referenced by more than one distinct subgraph can also be determined by using $F_k$. If there exists more than one $i$ ($1 \le i \le N_\sigma$) such that $F_k[i, j] = 1$, for any $j$ ($1 \le j \le N_{A_k}$), $C_{k,j}$ is referenced by more than one distinct subgraph. Hence,

$$C_{k,j} = attr(F_k[i, j])$$

where

$$attr(F_k[i, j]) = \begin{cases} \{A_{k,j}\} & \text{if } F_k[i, j] = 1,\ F_k[i', j] = 1, \text{ and } i \neq i' \text{ for some } i'\ (1 \le i' \le N_\sigma) \\ \emptyset & \text{otherwise.} \end{cases}$$

To simulate the second criterion of the Assignment Criteria, we first compute the rule-access-frequency vector $freq_r = freq_Q \times Qr$ and the subgraph-access-frequency vector $freq_\sigma[i] = \sum_k freq_r[k]$, for any rule $r_k$ that is included in subgraph $\sigma_i$. (Note that $freq_\sigma$ can be computed by the matrix multiplication of $R'_{rr}$ and $freq_r^T$, i.e., $freq_\sigma = R'_{rr} \times freq_r^T$, where $freq_r^T$ is the column vector representation of $freq_r$.) Using $freq_\sigma$, we determine the most frequently referenced distinct subgraph $\sigma^M$ from the set of distinct subgraphs $S_\sigma$, in which each distinct subgraph depends on $C_{k,j}$. $\sigma^M$ is the chosen distinct subgraph in $S_\sigma$ whose corresponding entry in $freq_\sigma$ is larger than (or equal to) the entry of every other distinct subgraph in $S_\sigma$, i.e.,

$$freq_\sigma[i_M] = \max(freq_\sigma[i_1], freq_\sigma[i_2], \ldots, freq_\sigma[i_M], \ldots, freq_\sigma[i_n]),$$

where the corresponding distinct subgraphs of $freq_\sigma[i_1], \ldots, freq_\sigma[i_n]$ are the distinct subgraphs in $S_\sigma$†. Then, we assign $C_{k,j}$ to $\sigma^M$.

Example 13 Consider the resultant matrices $F_1$, $F_2$, $F_3$, and $F_4$ in Example 12. The value 1 in each matrix indicates that the corresponding attribute is referenced by a distinct subgraph. It is not difficult to see that the third column of the $i$th row of Table 1 includes all the attributes whose corresponding entries are set to 1 in the $i$th row of $F_1$, $F_2$, $F_3$, and $F_4$, respectively, which is step 1 of DVF. Let us consider $F_3$. Note that $F_3[4, 6]$ denotes the 6th attribute of $R_3$, i.e., $A_{3,6}$, which is referenced only by subDG$_4$. Hence, we assign $A_{3,6}$ to subDG$_4$, which is step 2 of DVF.

On the other hand, attributes $A_{3,1}$, $A_{3,2}$, $A_{3,4}$, and $A_{3,5}$ are referenced by subDG$_1$, subDG$_2$ and subDG$_3$, whereas $A_{3,2}$ and $A_{3,5}$ are referenced by subDG$_1$, subDG$_2$, subDG$_3$, and subDG$_4$. Therefore, we apply the second criterion of the Assignment Criteria to determine a distinct subgraph to which the set of attributes is to be assigned by the following matrix manipulation, where $freq_Q$ and $Qr$ are given in Example 5, and $R'_{rr}$ is given in Example 11:

$$freq_r = freq_Q \times Qr = [\,2\ \ 16\ \ 10\ \ 6\ \ 10\ \ 7\ \ 7\,] \times
\begin{bmatrix}
1&0&0&0&0&1&0&1&0\\
0&0&1&0&0&0&0&0&0\\
0&1&0&0&0&1&1&0&0\\
1&0&0&1&0&0&0&1&0\\
0&0&1&0&1&0&0&0&1\\
0&1&1&0&0&1&0&0&0\\
1&0&0&0&0&1&0&0&1
\end{bmatrix}
= [\,15\ \ 17\ \ 33\ \ 6\ \ 10\ \ 26\ \ 10\ \ 8\ \ 17\,]$$

Furthermore, we compute $freq_\sigma$ using $R'_{rr}$ and $freq_r^T$, the column vector representation of $freq_r$, as follows:

$$freq_\sigma = R'_{rr} \times freq_r^T =
\begin{bmatrix}
1&1&1&1&0&0&0&0&0\\
0&0&1&0&1&0&0&0&0\\
0&0&1&1&0&1&1&0&0\\
0&0&0&0&0&0&0&1&0\\
0&0&0&0&0&0&0&0&1
\end{bmatrix}
\times
\begin{bmatrix} 15\\ 17\\ 33\\ 6\\ 10\\ 26\\ 10\\ 8\\ 17 \end{bmatrix}
=
\begin{bmatrix} 71\\ 43\\ 75\\ 8\\ 17 \end{bmatrix}$$

The resultant vector shows that the third element is the largest, i.e., subDG$_3$ is the most frequently referenced subgraph, the same result as computed in Example 5. Hence, $A_{3,1}$, $A_{3,2}$, $A_{3,4}$ and $A_{3,5}$ are assigned to subDG$_3$ since they are all referenced by subDG$_3$, which is step 3 of Algorithm DVF. □

†$S_\sigma$ can be determined by extracting all the corresponding entries in the $j$th column of $F_k$. If $F_k[i, j] = 1$ ($1 \le i \le N_\sigma$), then subDG$_i$ depends on the $j$th attribute of base relation $R_k$, as discussed in step 1 of DVF.


4.4. Mathematical Implication of Allocating Clusters

We have discussed the strategy adopted by Algorithm CAA for allocating each cluster of rules and its corresponding vertical fragments of base relations in Section 3.3. In Algorithm CAA, the access frequency of a distinct subgraph at a particular network site, represented by the subgraph-access-frequency-at-site matrix $freq_{S\sigma}$, and the weight between network sites, denoted by the network topology matrix $T$, are the bases for determining the destination site of a cluster. As stated in Definition 19, $freq_{S\sigma}$ is computed by using the rule-access-frequency-at-site matrix $freq_{Sr}$, which can also be obtained by the multiplication of the query-access-frequency-at-site matrix and the query-access-rule matrix as stated in Definition 18, i.e.,

$$freq_{Sr}[i, j] = \sum_{k} \left( freq_{SQ}[i, k] \times Qr[k, j] \right).$$

We have also shown the relationship between $freq_Q$ and $freq_{SQ}$ as well as $freq_r$ and $freq_{Sr}$ in Definition 17 and Definition 18, respectively, as

$$freq_Q[j] = \sum_{i=1}^{N_S} freq_{SQ}[i, j], \quad \text{where } 1 \le j \le N_Q, \text{ and}$$

$$freq_r[j] = \sum_{i=1}^{N_S} freq_{Sr}[i, j], \quad \text{where } 1 \le j \le N_r.$$

Using $freq_{S\sigma}[i, j]$ ($1 \le i \le N_S$, $1 \le j \le N_\sigma$), step 1 of Algorithm CAA can be computed by choosing, for each column $j$, the row $i$ of $freq_{S\sigma}$ which has the largest value. If only one $i$ is identified for a particular $j$, step 2 of Algorithm CAA is computed. Otherwise, the connection weight is considered in step 3 of Algorithm CAA.

The connection weight matrix $CW$ of the network sites can be computed using the given network topology matrix $T$ and an $N_S \times 1$ identity vector $I$, i.e., a column vector whose entries are all 1, as follows:

$$CW = T \times I =
\begin{bmatrix}
t_{1,1} & t_{1,2} & \cdots & t_{1,N_S}\\
\vdots & \vdots & & \vdots\\
t_{N_S,1} & t_{N_S,2} & \cdots & t_{N_S,N_S}
\end{bmatrix}
\times
\begin{bmatrix} 1_1\\ \vdots\\ 1_{N_S} \end{bmatrix}$$

For example, using $T$ as given in Example 1, $CW$ is computed as

$$CW = T \times I =
\begin{bmatrix}
0&1&4&3\\
1&0&3&2\\
4&3&0&5\\
3&2&5&0
\end{bmatrix}
\times
\begin{bmatrix} 1\\ 1\\ 1\\ 1 \end{bmatrix}
=
\begin{bmatrix} 8\\ 6\\ 12\\ 10 \end{bmatrix}$$

which is the same as $CW_i$, $1 \le i \le 4$, computed in Example 10.

5. COMMUNICATION COSTS

5.1. Communication Cost of Distributing Distinct Subgraphs

We consider the communication cost of allocating clusters of rules and vertical fragments of base relations to the chosen sites according to Algorithm CAA. The communication cost $CC_d$ for allocating all the rules and fragments in $N_\sigma$ clusters from the primary site $S_p$ to the $N$ chosen sites is computed in the following steps:

Step 1. Construct a $1 \times N$ row vector $\rho$ from the network topology matrix $T$ as follows, assuming that $S_p$ is the primary site, $N$ is the number of chosen sites, and $Site_1, Site_2, \ldots, Site_N$ denote the chosen sites in the network:

$$\rho[k] = T[p, q], \quad 1 \le k \le N$$

where $Site_k \in \{Site_1, Site_2, \ldots, Site_N\}$, and the $k$th (in $\rho$) chosen site is the actual $q$th site (in $T$) in the network, i.e., $Site_k = S_q$.

Step 2. For each cluster $C_i$, $1 \le i \le N_\sigma$, calculate the size of the cluster, i.e., $|C_i|$.

The size of a cluster is determined by the size of each fragment of a base relation and the number of rules in $C_i$. Since the size of a rule is relatively small and insignificant compared with the size of a base relation, we assume that the size of a rule is one byte. Furthermore, the size of a fragment $F$ of a base relation $R$ is calculated by using the following formula:

$$\left\lceil \frac{\text{table size of } R \times \text{number of attributes in } F}{\text{number of attributes in } R} \right\rceil$$

Step 3. Construct an $N \times 1$ column vector $\Gamma$ as follows:

$$\Gamma[i] = \sum |C_j|, \quad \text{where clusters } C_j,\ 1 \le j \le N_\sigma, \text{ are allocated to site } i$$

Step 4. $CC_d = \rho \times \Gamma$

Besides the communication cost for allocating clusters, FAA requires message passing among the chosen sites for broadcasting the acknowledgments of receiving clusters of rules and vertical fragments of base relations. In the worst case, each site in the network sends a message to every other site in the network to notify the other sites of the clusters being stored at its site. Thus, the upper bound of the communication cost $CC_m$ for message passing is $N_m \times \sum_{i=1,j=1}^{N_S} T[i, j]$, where $N_m$ is the size of a message, $N_S$ is the number of sites in the network, and $T[i, j]$ is as defined in Definition 8. Hence, the upper bound of the communication cost for cluster allocation performed by FAA is $CC_d + CC_m$.

Example 14 First, let us consider the communication cost of allocating clusters of rules and fragments of base relation $R_3$ computed in Example 6, which are generated using DVF. Assume that site 1 is the primary site. Since sites 2, 3, and 4 are the chosen sites in Example 10, $\rho = [\,1\ \ 4\ \ 3\,]$. Furthermore, attribute $A_{3,3}$, which is not referenced by any rule, is allocated to site 2 since site 2 has the smallest connection weight. When only the fragments of $R_3$ are considered,

$$\Gamma = \begin{bmatrix} 149\\ 5\\ 301 \end{bmatrix}, \text{ and } \quad
CC_d = [\,1\ \ 4\ \ 3\,] \times \begin{bmatrix} 149\\ 5\\ 301 \end{bmatrix} = 149 + 20 + 903 = 1072.$$

Now, let us consider the clusters of rules and fragments computed in Example 4, which are generated using OVF. Note that attributes $A_{1,4}$, $A_{2,1}$, $A_{3,3}$ and $A_{4,3}$, which are not referenced by any rule, are allocated to site 2 for the same reason given above. Hence,

$$\Gamma = \begin{bmatrix} 918\\ 545\\ 1776 \end{bmatrix}, \text{ and } \quad
CC_d = [\,1\ \ 4\ \ 3\,] \times \begin{bmatrix} 918\\ 545\\ 1776 \end{bmatrix} = 918 + 2180 + 5328 = 8426.$$

In either case,

$$CC_m = N_m \times \sum_{i=1,j=1}^{4} T[i, j] = 36 \times N_m.$$
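The cost computation of Section 5.1 can be sketched as follows, using the cluster-size vectors quoted in Example 14. The rows of `t` are taken from the connection-weight sums of Example 10, the function names are hypothetical, and reading the brackets of the fragment-size formula as a ceiling is an assumption of this sketch.

```python
# Network topology matrix; each row matches a CW_i sum in Example 10.
t = [
    [0, 1, 4, 3],
    [1, 0, 3, 2],
    [4, 3, 0, 5],
    [3, 2, 5, 0],
]


def fragment_size(table_size, fragment_attrs, relation_attrs):
    """Size of a vertical fragment, rounded up (ceiling is an assumption)."""
    return -(-(table_size * fragment_attrs) // relation_attrs)


def allocation_cost(topology, primary, chosen_sites, gamma):
    """CC_d = rho x Gamma, with rho[k] = T[primary, chosen_sites[k]] (1-based sites)."""
    rho = [topology[primary - 1][q - 1] for q in chosen_sites]
    return sum(r * g for r, g in zip(rho, gamma))


def message_cost_bound(topology, message_size):
    """Upper bound CC_m = N_m x (sum of all entries of T)."""
    return message_size * sum(sum(row) for row in topology)


chosen = [2, 3, 4]                                   # chosen sites in Example 10
cc_d_dvf = allocation_cost(t, 1, chosen, [149, 5, 301])
cc_d_ovf = allocation_cost(t, 1, chosen, [918, 545, 1776])
cc_m_per_byte = message_cost_bound(t, 1)             # CC_m for a one-byte message

# Hypothetical illustration of the fragment-size formula: a 443-byte table
# with six attributes, of which four fall into the fragment.
example_fragment = fragment_size(443, 4, 6)

print(cc_d_dvf, cc_d_ovf, cc_m_per_byte, example_fragment)
```

Integer arithmetic is used for the ceiling so that no floating-point rounding enters the size computation.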

5.2. Communication Cost of Query Evaluation

Given a DDDBS $D$ such that rules and fragments in $D$ are distributed using Algorithm FAA, we determine the communication cost of processing a query in $D$ by using a cost function. Earlier, Mohania and Sarda [11] presented a communication cost function for DDDBSs as follows:

$$C = C_{DB} + C_{KB} + C_{SR}$$

where $C_{DB}$ is the communication cost due to rule-to-relation dependency, $C_{KB}$ is the communication cost due to rule-to-rule dependency, and $C_{SR}$ is the communication cost due to remote execution of rules in a DDDBS.

One of the strategies to minimize communication cost, which is adopted by OVF, is to allow replication of rules and fragments of base relations in a DDDBS $D$. The replication provides full locality of query evaluation since each rule $R$ and the other rules and fragments on which $R$ depends, either directly or indirectly, are stored together at the same site in $D$. The existence of communication costs $C_{DB}$ and $C_{KB}$ in $C$ is due to the fact that one or more rules or base relations referenced by a particular rule may be allocated at different sites in a DDDBS [11]. However, this is not the case for any DDDBS designed by using OVF in FAA. Thus, when computing the communication cost of query evaluation, $C_{DB}$ and $C_{KB}$ are excluded, and only $C_{SR}$ is accounted for.

$C_{SR}$, which denotes the total communication cost of executing any rules used by a query $Q$ in $D$ at sites in $D$ other than the one where $Q$ originated, is defined as

$$C_{SR} = \sum_{i=1}^{N_r} \sum_{k=1}^{N_S} C_{ik}$$

where $N_r$ is the number of rules and $N_S$ is the number of sites in $D$, and $C_{ik}$, which is the communication cost of sending results of rule $r_i$ executed at site $S_k$, is defined as

$$C_{ik} = \sum_{h=1, h \neq k}^{N_S} x_{ik} \times Z_i \times f_{ih}$$

where $x_{ik} = 1$ if rule $r_i$ is allocated at site $S_k$ and 0 otherwise, $Z_i$ is the average answer size returned from executing $r_i$, and $f_{ih}$ denotes the frequency of $r_i$ executed at sites other than $S_k$. For DVF, the communication cost of query evaluation is

$$C = C_{DB} + C_{SR}$$

where

N, Ns Ns

id kc1 k=l

Cik = gxik X rR[i, j] X $k X TSL] + 2 2 xik X xi'k' X rr[i,i’] X Ci’/c’, and j=l i’=l,i’#i k’=l

ajk = 1 if rule Tj is not present at site k, 0 otherwise

The additional cost $C_{DB}$ is included in the cost estimation for DVF, whereas it is excluded from the cost estimation for OVF. This is due to the sharing of data in $D$, since vertical fragments of base relations in $D$ computed by using DVF are non-replicated. In either case, i.e., OVF or DVF, the communication cost for query evaluation is lower than that of [11].


6. CORRECTNESS AND COMPLEXITY ANALYSIS OF FAA

Theorem 1 The fragmentation and allocation algorithm FAA is correct.

Proof. Since the mathematical interpretation of FAA, as shown in Section 4, captures the step-by-step process of the proposed fragmentation and allocation algorithm, the correctness of FAA follows. □

Theorem 2 The time complexity of FAA is $O(N_r^3)$.

Proof. Sections 4.1 through 4.4 show that the major tasks of RCA, DCA, and CAA, which together yield FAA and are described in Sections 3.1 through 3.3, can be accomplished by matrix manipulation. The dominating matrix manipulation operations in each of these subalgorithms and their time complexity are given in the table in Figure 7. The time complexity of FAA is $O(N_r^3)$, assuming that $N_r > N_S, N_Q, N_\sigma, N_{A_k}$. □

Matrix Manipulation          Size                        Complexity                                  Section
Reachability                 $N_r \times N_r$            $O(N_r^3)$†                                 4.1 (RCA)
Fragment-attribute           $N_\sigma \times N_{A_k}$   $O(N_\sigma \times N_r \times N_{A_k})$‡    4.2 (OVF)
Frequency-ref-subDG          $N_\sigma \times 1$         $O(N_\sigma \times N_r)$                    4.3 (DVF)
SubDG-access-freq-at-site    $N_S \times N_\sigma$       $O(N_S \times N_Q \times N_r)$              4.4 (CAA)
Connection-weight            $N_S \times 1$              $O(N_S \times N_S)$                         4.4 (CAA)

† Warshall's algorithm [7].

‡ $N_r$: the number of rules in a DDDBS. $N_\sigma$: the number of distinct subgraphs. $N_{A_k}$: the number of attributes in relation $R_k$. $N_Q$: the number of distinct queries. $N_S$: the number of sites.

Fig. 7: Computational Complexity of Subalgorithms in FAA

7. CONCLUSIONS

In this paper, the vertical fragmentation and allocation problems in distributed deductive database systems have been examined, and four different algorithms have been proposed. In order to determine which rules and fragments of base relations should be allocated to which sites so that system throughput can be maximized and response time can be minimized for query processing in a distributed deductive database system, we have considered the dependency relationships among rules and attributes in base relations that are used by rules. If a rule depends on a number of attributes of a base relation or on other rules, it is desirable to store them at the same site in order to minimize communication cost during query processing. Algorithm FAA clusters mutually recursive rules and rules with the same head predicate, along with the attributes in base relations that they depend on. These rules and data are allocated by FAA to the same site to facilitate sharing of data and rules, computational efficiency in terms of minimizing communication cost for data transfer, and maximal locality of query evaluation during query processing.

Acknowledgements - The authors are grateful to the anonymous referees for their many helpful comments and suggestions for improving the paper.


REFERENCES

[l] P.M.G. Apers. Data allocation in distributed database systems. ACM lhnsactions on Database Systems, 13(3):263-304 (1988).

[2] F. Bancilhon and R. Ramakrishnan. An Amateur’s Introduction to Recursive Query Processing Strategies. In Readings in Database Systems, M. Stonebraker, editor, pp. 507-555, Morgan Kaufmann (1988).

[3] R. Blankinship, A.R. Hevner, and S.B. Yao. An iterative method for distributed database design. In Proceedings of the 17th International Conference on Very Large Data Bases, pp. 389-400, Barcelona, Spain (1991).

[4] S. Ceri, G. Gottlob and L. Tanca. Logic Programming and Databases. Spring-Verlag, Germany (1990).

[5] S. Ceri and G. Pelagatti. Distributed Databases: Principles and Systems. McGraw-Hill (1984).

[6] C.I. Ezeife and K. Barker. A comprehensive approach to horizontal class fragmentation in a distributed object based system. Distributed and Parallel Databases, 3:247-272 (1995).

[7] J.L. Gersting. Mathematical Structures for Computer Science, 3rd Edition. Computer Science Press (1993).

[8] Y.P. Li. Dkm - A distributed knowledge representation framework. In Proceedings of the 2nd International Conference on Expert Database Systems, L. Kerschberg, editor, pp. 313-331, Benjamin/Cummings (1989).

[9] S.J. Lim and Y.-K. Ng. Set-term unification in a logic database language. In Proceedings of the 1st Annual International Computing and Combinatorics Conference, pp. 101-110, Springer-Verlag, Xian, China (1995).

[10] C. Meghini and C. Thanos. The complexity of operations on a fragmented relation. ACM Transactions on Database Systems, 16(1):56-87 (1991).

[11] M.K. Mohania and N.L. Sarda. Rule allocation in distributed deductive database systems. Data and Knowledge Engineering, 14:117-141 (1994).

[12] S. Navathe, S. Ceri, G. Wiederhold and J. Dou. Vertical partitioning algorithms for database design. ACM Transactions on Database Systems, 9(4):680-710 (1984).

[13] S. Navathe and M. Ra. Vertical partitioning for database design: A graphical algorithm. In Proceedings of the 1989 International Conference of SIGMOD, pp. 440-450, ACM (1989).

[14] M. Ozsu and P. Valduriez. Principles of Distributed Database Systems. Prentice Hall (1991).

[15] O. Shmueli, S. Tsur and C. Zaniolo. Compilation of set terms in the logic data language (LDL). Journal of Logic Programming, 12(1):89-119 (1992).

[16] O. Wolfson and S. Jajodia. Distributed algorithms for dynamic replication of data. In Proceedings of the 11th Principles of Database Systems, pp. 149-163 (1992).

[17] O. Wolfson and S. Jajodia. An algorithm for dynamic data allocation in distributed systems. Information Processing Letters, 3(2):113-119 (1995).
