
2012 2nd IEEE International Conference on Parallel, Distributed and Grid Computing

Algorithm for Finding Association Rules in Distributed Databases

Surbhi Bhatnagar Vellore Institute of Technology

Vellore, India [email protected]

Abstract - In the emerging networked environment we are encountering situations in which databases residing at geographically distinct sites must collaborate with each other to analyze their data together. But due to large sizes of the datasets it is neither feasible nor safe to transport large datasets across the network to some common server. We need algorithms that can process the databases at their own locations by exchanging needed information among them and obtain the same results that would have been obtained if the databases were merged. In this paper we present an algorithm for mining association rules from distributed databases by exchanging only the needed summaries among them.

Keywords: Distributed Databases, Association rule Discovery

I. INTRODUCTION

With an increasing number of databases being created and maintained at geographically distributed locations, it is becoming desirable to combine the information embedded in them. Many data analysis applications require that we simultaneously process the data stored in a number of distant but networked databases. These databases may be located at different sites and may have independent design and evolution histories, but the data stored in them may be closely related and together relevant for some data analysis application. In many applications the databases are also growing to such a size that it is neither feasible nor safe to transfer them across the network from one server node to another. The problem of performing computations with databases that are distributed across multiple servers is being faced in an increasing number of situations. The problem arises primarily with two types of database partitioning. In the first type, the records of a huge data set are distributed across multiple databases but each partition has exactly the same set of attributes. This is called horizontal partitioning. In the second type, multiple relations of a dataset reside independently, possibly on different nodes of a network. Each partition may have a different set of attributes, but some of the attributes may be shared across two or more relations. This is called vertical partitioning. A schematic of both types of partitioning is shown in Figure-1. Distributed file systems that are processed in cloud computing environments encounter both of these types of partitioning for databases.

Fig. 1: Database Partitioning (panels: Horizontal Partitioning, Vertical Partitioning)

978-1-4673-2925-5/12/$31.00 ©2012 IEEE

In many ordinary database and data warehouse situations we also need to simultaneously process multiple relations, possibly residing on servers at different nodes of a network. In data warehouses, multiple relations are created that are connected by a star schema to a central fact table [1]. For very large data sets stored across multiple servers of a cloud infrastructure a similar situation often arises. Relations stored on different servers share attributes with the central Fact-table. Figure-2 shows four relations of a data warehouse connected by a central Fact-table.

Fig. 2: A Star-shaped Data Warehouse (Product, Supplier, Customer, and Salesman relations connected to a central Fact-table)

Discovery of association rules from a single relation can be performed using the apriori algorithm [3] or one of its many enhanced versions. The algorithms presented in [2, 5, 6, 9] and other similar work address the problem of mining association rules from horizontally partitioned databases. The work in [4] presents an algorithm to build decision trees from vertically partitioned data but does not address the problem of discovering association rules. The work in [7] addresses the problem of mining a vertically partitioned dataset but significantly simplifies the more general problem we address here. We assume that the attributes shared across multiple relations can be used to perform a cross-product type of Join of the relations, and that the association rules are to be discovered in this implicit dataset. The work in [5] assumes that the only shared attributes are the key attributes, so that there is only one record for each shared key value in each vertically partitioned relation. This simplifies the more general problem that is encountered in many real situations.

The problem of mining a vertically partitioned dataset is qualitatively different from, and more difficult than, that of mining a horizontally partitioned dataset. In horizontal partitioning all the database tuples explicitly exist in one or the other partition. In vertical partitioning tuples can only be made explicit by a database Join of the two (or more) relations, mediated by the shared attributes. The set of attributes in each of the four outer relations in Figure-2 is completely distinct from those of the others, and the Fact-table shares attributes with each of the four relations. Making the tuples explicit, in this case, means converting the star structure to a single relation by performing a Join operation. Since performing this Join is too expensive, the goal of the algorithm presented here is to discover the association rules from the five tables of the warehouse without performing the explicit Join operation.

In this paper we address the problem of discovering association rules from a vertically partitioned data set without requiring the huge data in the participating relations to be moved to a single database site, Joined, and then processed by an association rule mining algorithm. We show that the association rules can be generated by exchanging messages among the relations residing at different network locations. This requires far less data to be exchanged among the relations than transferring all the datasets.

The remainder of the paper is organized as follows. Section II describes how association rules are discovered in a single relation using the well-known apriori algorithm [3], and then describes our approach for discovering association rules from the relations of a vertically partitioned dataset. Section III presents our implementation and results. Section IV discusses how our association rule mining algorithm can be made more efficient. The final section outlines the conclusions and contributions of the research described here.

II. ASSOCIATION RULE MINING

Association rules are a very popular form of co-occurrence dependencies among attributes of a relation. Algorithms for discovering such rules [1,2,3] were among the first contributions in the field of data mining. Most of the research on association rule discovery is done for binary datasets where each attribute can take a value of 0 or 1. This can be easily adapted to the case where multiple discrete values are possible for each attribute. The small example relation shown here has six attributes named A through F.

A  B  C  D  E  F
1  6  1  1  2  1
1  6  1  1  2  3
1  6  2  1  1  2
1  2  2  1  1  2
2  6  1  1  2  1
2  6  1  1  2  3

A k-itemset in this relation is a subset of attributes, of size k, with a specific value assigned to each attribute. The set {A=1} is a 1-itemset and the set {A=1, B=6} is a 2-itemset. The support of an itemset is defined to be the number of tuples in the relation that contain the attribute-value pairs of the itemset. In the example table the support of the 1-itemset {A=1} is 4 and that of {A=2} is 2. A k-itemset is called a frequent itemset if its support is above some pre-selected threshold value. If we set the threshold to 3 then the 1-itemset {A=1} is frequent with a support of 4 and the 1-itemset {A=2} is not frequent because its support is only 2. The 4-itemset {B=6, C=1, D=1, E=2} has a support of 4 and is frequent.

The two main operations of the apriori algorithm are to enumerate the k-itemsets of different sizes k and to compute the support of each k-itemset. A typical sequence of steps is as follows. Enumerate all 1-itemsets and compute their support values. Those that do not have enough support are dropped and the remaining 1-itemsets are used to enumerate candidates for the 2-itemsets. The apriori principle states that if a k-itemset is not frequent then none of its supersets can be frequent, and they therefore need not be enumerated. The second major step is counting the support for each candidate k-itemset. Once we know all the attributes of a relation, enumerating the candidate k-itemsets is not difficult and does not take much computational effort. Computing the support for the k-itemsets, however, requires scanning the entire relation. Therefore almost all the computational effort of the algorithm is expended in performing, typically, multiple passes over the database.
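The level-wise procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the names `support` and `frequent_itemsets` are hypothetical, and the six tuples are the example relation as implied by the supports quoted in the text (their row order is an assumption). A threshold of 3 is applied as "support above 3", following the text.

```python
from itertools import combinations

ATTRS = ("A", "B", "C", "D", "E", "F")
# Example relation with six tuples (row order is an assumption).
ROWS = [
    (1, 6, 1, 1, 2, 1),
    (1, 6, 1, 1, 2, 3),
    (1, 6, 2, 1, 1, 2),
    (1, 2, 2, 1, 1, 2),
    (2, 6, 1, 1, 2, 1),
    (2, 6, 1, 1, 2, 3),
]
RELATION = [dict(zip(ATTRS, row)) for row in ROWS]


def support(itemset, relation=RELATION):
    """Count the tuples that match every attribute=value pair of the itemset."""
    return sum(all(t[a] == v for a, v in itemset.items()) for t in relation)


def frequent_itemsets(relation, threshold):
    """Level-wise search; returns {frozenset of (attr, value) pairs: support}."""
    items = sorted({(a, t[a]) for t in relation for a in ATTRS})
    level = [frozenset([item]) for item in items]
    frequent = {}
    while level:
        # One pass over the relation per candidate to count its support.
        counted = {c: support(dict(c), relation) for c in level}
        kept = {c: s for c, s in counted.items() if s > threshold}
        frequent.update(kept)
        candidates = set()
        for x, y in combinations(kept, 2):
            union = x | y
            if len(union) != len(x) + 1:
                continue  # the pair does not form a (k+1)-itemset
            if len({a for a, _ in union}) != len(union):
                continue  # one attribute cannot take two values
            # Apriori principle: every k-subset must itself be frequent.
            if all(union - {item} in kept for item in union):
                candidates.add(union)
        level = list(candidates)
    return frequent


freq = frequent_itemsets(RELATION, threshold=3)
print(support({"A": 1}), support({"A": 2}))  # prints "4 2"
```

Counting support by rescanning the relation for every candidate keeps the sketch short; a practical implementation would count all candidates of one size in a single pass.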

After finding the frequent k-itemsets we can derive from them the association rules that have high confidence values. Let us say we have a frequent 4-itemset such as {B=6, C=1, D=1, E=2}. From this we can infer association rules such as {if B=6, C=1, D=1, then E=2}. The confidence value of this rule is the support of {B=6, C=1, D=1, E=2} divided by the support of {B=6, C=1, D=1}. Thus the computation of the confidence of a rule requires the support of the itemset containing all the attributes in the rule, and also the support of the itemset containing the attributes that appear on the left side of the rule. So, the main computationally expensive task in discovering association rules is determining the support values for the various frequent k-itemsets of the dataset.
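As a small illustration of the confidence calculation, the sketch below divides two stored support values. The function name `confidence` and the lookup table are hypothetical. The support of the 4-itemset is the value given in the text; the other three supports are assumptions, obtained by counting tuples of the example relation.

```python
def confidence(supports, antecedent, consequent):
    """confidence(X -> Y) = support(X and Y together) / support(X)."""
    whole = frozenset(antecedent) | frozenset(consequent)
    return supports[whole] / supports[frozenset(antecedent)]


# Support values as a coordinator might have saved them.
supports = {
    frozenset({("B", 6), ("C", 1), ("D", 1), ("E", 2)}): 4,  # stated in the text
    frozenset({("B", 6), ("C", 1), ("D", 1)}): 4,  # assumed: counted from the example
    frozenset({("D", 1), ("E", 2)}): 4,  # assumed: counted from the example
    frozenset({("D", 1)}): 6,  # assumed: counted from the example
}

strong_rule = confidence(supports, [("B", 6), ("C", 1), ("D", 1)], [("E", 2)])
weaker_rule = confidence(supports, [("D", 1)], [("E", 2)])
print(strong_rule)  # prints "1.0"
```

Under these counts the rule {if B=6, C=1, D=1, then E=2} holds for every tuple that matches its left side, while {if D=1, then E=2} holds for four of six.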

When we have a single relation to work with, a few passes over the database suffice and it is relatively simple to compute the support values for the k-itemsets of interest. The problem becomes much more complicated when the dataset is split across multiple partitions. For horizontal partitions each partition can compute its count of tuples for each k-itemset, and a coordination of the local results then determines the global support values for the k-itemsets. In a vertically partitioned dataset the problem of computing the support becomes more complex because we do not even have the explicit tuples available in a single table. The tuples remain implicit, with their vertical components distributed across the individual partitions, and the effect of their imagined Join operation must be simulated by the computations. It is this last complication that is addressed efficiently by the design of our algorithm.
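For the horizontal case, the coordination step amounts to adding local counts. The sketch below is only an illustration of that remark: the split of the example relation into two partitions and the names `local_support` and `global_support` are assumptions.

```python
ATTRS = ("A", "B", "C", "D", "E", "F")

# The example relation split (arbitrarily) into two horizontal partitions.
PARTITIONS = [
    [(1, 6, 1, 1, 2, 1), (1, 6, 1, 1, 2, 3), (1, 6, 2, 1, 1, 2)],
    [(1, 2, 2, 1, 1, 2), (2, 6, 1, 1, 2, 1), (2, 6, 1, 1, 2, 3)],
]


def local_support(partition, itemset):
    """Count matching tuples inside one partition (computed at that site)."""
    rows = (dict(zip(ATTRS, row)) for row in partition)
    return sum(all(r[a] == v for a, v in itemset.items()) for r in rows)


def global_support(partitions, itemset):
    """Coordinator step: every tuple lives in exactly one partition, so add."""
    return sum(local_support(p, itemset) for p in partitions)


print(global_support(PARTITIONS, {"A": 1}))  # prints "4"
```

No such simple addition works for vertical partitions, because no site holds a complete tuple.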

A. ASSOCIATION RULES FOR VERTICALLY PARTITIONED DATABASES

Let us consider the data warehouse situation described in Figure-2. It is very easy to find association rules that are limited to the attributes of any one of the four individual relations. But let us say we want to determine whether there is an association rule containing some attributes from the Product relation and some attributes from the Supplier relation. These two relations do not share any attribute. But the Fact-table records every sale transaction, and for each transaction it shares the product-id with the Product relation and the supplier-id with the Supplier relation. If a database Join were performed on the three relations Product, Fact-table, and Supplier, we would obtain a large relational table. In the Supplier relation of the warehouse only one entry (tuple) may exist for a supplier, but this Supplier tuple will be repeated many times in the result of the Join, because it will be included every time this supplier is mentioned in a transaction in the Fact-table. Similarly, a Product tuple will exist in the warehouse relation only once but will be included in many tuples of the Joined relation, once for each tuple of the Fact-table in which its product-id is mentioned. The number of times each tuple from Product or Supplier is repeated in the Join depends on the number of times its id is mentioned in the Fact-table.

Let us say that color is an attribute in the Product relation and location is an attribute in the Supplier relation. It is possible that an association rule with a high confidence value exists between the color of a product and the location of its supplier. For example, it may be the case that most of the suppliers located in the West manufacture red products and most of the suppliers located in the East manufacture blue products. This association rule cannot be discovered if we are limited to obtaining the support counts in the individual relations Product and Supplier of the warehouse, because without consulting the Fact-table we do not know how many times each Product or Supplier tuple must be repeated to account for the multiple sales transactions in which the product or the supplier occurs. We do not want to perform a Join because it is computationally too expensive and may not be feasible in many situations of geographically distributed databases, such as in a cloud computing infrastructure.

The solution presented in [1] seeks to solve this problem in a way that is not very efficient. It expects a user to select some set of attribute values in one relation and then, constrained by these values, it seeks patterns in the other relations of the warehouse. For example, this approach would expect a user to choose, say, red products, and then with this constraining information process the Product, Fact-table, and Supplier relations to discover whether any patterns or association rules exist for the limited case of red products. Such choices of limiting attribute values can be repeated in any one of the relations and association rules discovered in the limited contexts. However, there are far too many possible choices for a user to select a restricting condition. We want an algorithm that does not depend on such a limiting choice being made by a user and that discovers all possible association rules, the same set that would have been discovered if we were allowed to Join the relations of the dataset.

Explicit Component Databases

Node 1       Node 2       Node 3
A  B  C      D  C  F      A  E  C
1  2  2      1  2  2      1  1  2
1  6  2      1  1  1      1  2  1
1  6  1      1  1  3      2  2  1
2  6  1      -  -  -      -  -  -

Shareds
A  C
1  1
1  2
2  1
2  2

Our algorithm for discovering association rules in vertically partitioned databases works as described below. We assume that one of the distributed nodes is the coordinator and is responsible for producing the discovered association rules. This coordinator first probes all the participating nodes and obtains a list of attributes present in each database relation. This information is sufficient for the coordinator to enumerate all possible k-itemsets.

Computation of support values for the k-itemsets is the step that needs a new design of the algorithm. For example, let us consider the three datasets shown here. Node-1 has three attributes, A, B, and C; Node-2 has attributes D, C, and F; and Node-3 has attributes A, E, and C. The attribute A is shared between the relations at Node-1 and Node-3, and attribute C is shared among all three relations. Each relation has at least one attribute that is available only at its own location and is not shared with other relations. These three relations, when Joined in the cross-product manner, yield the tuples of the relation containing all six attributes used in the earlier discussion. So the goal now is to compute the support of a k-itemset by obtaining some summaries from the three individual datasets, instead of creating the complete relation containing all six attributes.

We also show here a relation called Shareds, formed of all the shared attributes, which in this case include attributes A and C. This relation is populated with all possible combinations of values that each attribute can take. The domain of possible values for each attribute can either be known to the coordinator node in advance or it can be determined by each node by making one pass of the relation and passing this information on to the coordinator node.
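A possible way for the coordinator to build the Shareds relation is sketched below. It is an illustration under assumptions: the function names are hypothetical, an attribute is treated as shared when it appears in more than one schema, and the domains {1, 2} for A and C are read off the example relations.

```python
from itertools import product


def shared_attributes(schemas):
    """Attributes that occur in more than one relation, in sorted order."""
    seen, shared = set(), set()
    for attrs in schemas.values():
        shared |= seen & set(attrs)
        seen |= set(attrs)
    return sorted(shared)


def build_shareds(schemas, domains):
    """One row for every combination of values of the shared attributes."""
    attrs = shared_attributes(schemas)
    value_lists = [domains[a] for a in attrs]
    return attrs, [dict(zip(attrs, values)) for values in product(*value_lists)]


schemas = {
    "Node-1": ("A", "B", "C"),
    "Node-2": ("D", "C", "F"),
    "Node-3": ("A", "E", "C"),
}
domains = {"A": [1, 2], "C": [1, 2]}  # assumed from the example data

attrs, shareds = build_shareds(schemas, domains)
print(attrs, len(shareds))  # prints "['A', 'C'] 4"
```

The number of rows is the product of the domain sizes of the shared attributes, so it grows quickly with the number of shared attributes.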


Our main idea for computing the support, or any other value, from vertically partitioned datasets can be summarized as follows. Let us say that we want to compute some quantity R which is a function of an explicit dataset D (obtained after a Join of the partitions). That is,

$$R = F(D)$$

But since the explicit dataset D is not available and we only have the individual vertical partitions, an alternative approach for calculating the quantity R is:

$$R = G\big(g_1(D_1), g_2(D_2), \ldots, g_n(D_n)\big)$$

where each $D_i$ is a local vertical partition, each $g_i$ is a function of the local partition only, and G is a combination performed by the coordinator node when it receives the local $g_i$ results from all the participating relations. It is not possible to decompose every function F into local $g_i$ functions and a coordinator's G function. The computation of support for a k-itemset, however, is easily decomposable in this manner. We first show how to compute the total number of tuples in a Joined relation when we can only receive summaries from the individual vertical partitions. Let us say the result R is the number of tuples in the Joined dataset. Then,

$$R = \sum_{j=1}^{m} \prod_{i=1}^{n} N(D_i)_{cond_j}$$

where $N(D_i)_{cond_j}$ is the number of tuples in relation $D_i$ that satisfy the condition $cond_j$, n is the number of vertical partitions participating in the computation, and m is the number of tuples in the Shareds relation. Each $cond_j$ is the condition formed by the j-th tuple of the Shareds relation, as described earlier. The sum of products in the expression above is the G function performed by the coordinating node.

At the implementation level the above expression can be described as follows. Ask each node to look at its relation and provide the count of tuples it has satisfying the condition in each row of the Shareds relation. That is, Node-1 will generate a vector $v_1$ giving the counts of tuples it has for the conditions of Shareds as [1, 2, 1, 0]: Node-1 has 1 tuple where A=1 and C=1, 2 tuples where A=1 and C=2, 1 tuple where A=2 and C=1, and no tuple where A=2 and C=2. Each participating node generates one such vector and sends it to the coordinating node. In this case the coordinating node will receive three vectors, $v_1$, $v_2$, and $v_3$, one from each of the participating nodes. The coordinating node then performs the sum of products on these three vectors. First, the entries of the three vectors are multiplied term by term; that is, $v_1(i) \cdot v_2(i) \cdot v_3(i)$ is obtained for each position i = 1, ..., m. These products are then summed to obtain the number of tuples that would be formed if the three vertical partitions were Joined to form an explicit relation. This process generalizes to any number of vertical partitions of a dataset. It is simple to verify algebraically that the above computation correctly provides the number of tuples in the Joined dataset. Since the computation of support for a k-itemset only requires counts of tuples, this information about the Join is all that we need to compute.
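The sum of products can be checked on the three example relations with the short sketch below. The helper names are hypothetical, and one detail is an assumption not spelled out in the text: a node that does not store a shared attribute (Node-2 has no A) ignores that part of the condition.

```python
from math import prod

NODE_1 = [{"A": 1, "B": 2, "C": 2}, {"A": 1, "B": 6, "C": 2},
          {"A": 1, "B": 6, "C": 1}, {"A": 2, "B": 6, "C": 1}]
NODE_2 = [{"D": 1, "C": 2, "F": 2}, {"D": 1, "C": 1, "F": 1},
          {"D": 1, "C": 1, "F": 3}]
NODE_3 = [{"A": 1, "E": 1, "C": 2}, {"A": 1, "E": 2, "C": 1},
          {"A": 2, "E": 2, "C": 1}]
SHAREDS = [{"A": 1, "C": 1}, {"A": 1, "C": 2}, {"A": 2, "C": 1}, {"A": 2, "C": 2}]


def matches(row, cond):
    """Test only the attributes this row actually has (assumption, see above)."""
    return all(row[a] == v for a, v in cond.items() if a in row)


def count_vector(relation, shareds):
    """Local step g_i: one tuple count per row of the Shareds relation."""
    return [sum(matches(row, cond) for row in relation) for cond in shareds]


def join_size(vectors):
    """Coordinator step G: multiply position by position, then add up."""
    return sum(prod(column) for column in zip(*vectors))


vectors = [count_vector(rel, SHAREDS) for rel in (NODE_1, NODE_2, NODE_3)]
print(vectors[0], join_size(vectors))  # prints "[1, 2, 1, 0] 6"
```

The vector for Node-1 agrees with the one given in the text, and the total of 6 is the number of tuples in the six-attribute example relation.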

Our next task is to determine the number of tuples that satisfy the conditions of a k-itemset. The above expression for the total number of tuples can be modified for this new task as follows.

$$Count_{test} = \sum_{j=1}^{m} \prod_{i=1}^{n} N(D_i)_{cond_j \wedge test}$$

where all the variables have the same interpretation as before and test is the condition imposed by the k-itemset of interest. That is, if the k-itemset of interest is {B=6, C=1, D=1, E=2} then the condition formed by each row of the Shareds relation is expanded to include the conditions of this itemset. For example, the first row of the example Shareds relation is expanded to (A=1, C=1) and (B=6, C=1, D=1, E=2). Each node is now asked to provide a count of tuples that meet this expanded condition. The vector formed by these values, one for each condition of the Shareds relation, is transmitted by each node to the coordinating node. A sum of products performed as before gives the count of tuples in the implicit Join of the relations that satisfy the conditions of the k-itemset of interest.
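The expanded conditions can be tried on the example data as follows. This is a sketch with hypothetical names; as an assumption, each node evaluates only the parts of the Shareds condition and of the itemset that refer to attributes it stores.

```python
from math import prod

NODE_1 = [{"A": 1, "B": 2, "C": 2}, {"A": 1, "B": 6, "C": 2},
          {"A": 1, "B": 6, "C": 1}, {"A": 2, "B": 6, "C": 1}]
NODE_2 = [{"D": 1, "C": 2, "F": 2}, {"D": 1, "C": 1, "F": 1},
          {"D": 1, "C": 1, "F": 3}]
NODE_3 = [{"A": 1, "E": 1, "C": 2}, {"A": 1, "E": 2, "C": 1},
          {"A": 2, "E": 2, "C": 1}]
SHAREDS = [{"A": 1, "C": 1}, {"A": 1, "C": 2}, {"A": 2, "C": 1}, {"A": 2, "C": 2}]


def matches(row, cond):
    """Check only the constraints on attributes present in this row."""
    return all(row[a] == v for a, v in cond.items() if a in row)


def count_vector(relation, shareds, test):
    """Counts of local tuples meeting both cond_j and the itemset test."""
    return [
        sum(matches(row, cond) and matches(row, test) for row in relation)
        for cond in shareds
    ]


def itemset_support(relations, shareds, test):
    vectors = [count_vector(rel, shareds, test) for rel in relations]
    return sum(prod(column) for column in zip(*vectors))


test = {"B": 6, "C": 1, "D": 1, "E": 2}
result = itemset_support((NODE_1, NODE_2, NODE_3), SHAREDS, test)
print(result)  # prints "4"
```

Rows of Shareds that contradict the itemset (here those with C=2) contribute zero, because no tuple can satisfy both C=2 and C=1. The result of 4 is the support quoted earlier for this 4-itemset.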

The mechanism of the communications among the nodes that we have implemented is as follows.

1. The coordinating node asks each relation for the set of attributes that it contains, and also the set of possible values that each attribute has within the relation.

2. The coordinating node then creates the Shareds relation and transmits it to each of the participating relations.

3. The coordinating node starts the apriori algorithm as per its single-relation version [3].

4. Whenever it needs to compute the support value for an itemset, it sends the conditions of the itemset to each of the participating relations. These relations use SQL queries to evaluate the vectors of frequencies and transmit them back to the coordinator node.

5. The coordinator node performs the sum-of-products operation on the received vectors and thus computes the support value for the itemset.

6. The coordinator saves the support values for all the frequent itemsets because these may be needed for computing the confidence values of rules generated from the frequent itemsets.

7. While generating the association rules from the frequent itemsets, the support value for any subset of a frequent itemset will always be available at the coordinator node because each subset must also be frequent. This saves the work of re-computing the support from the distributed relations but requires more storage at the coordinator site.
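Steps 4 and 5 can be imitated on one machine with Python's standard `sqlite3` module. This is a sketch only: the table name `r`, the helper names, and the use of one in-memory database per node are assumptions, and the query shape is one plausible form of the SQL the nodes might run.

```python
import sqlite3
from math import prod


def make_node(columns, rows):
    """Create an in-memory SQLite database holding one relation named r."""
    db = sqlite3.connect(":memory:")
    db.execute(f"CREATE TABLE r ({', '.join(c + ' INTEGER' for c in columns)})")
    db.executemany(f"INSERT INTO r VALUES ({', '.join('?' * len(columns))})", rows)
    return db, columns


def node_vector(node, shareds, itemset):
    """Step 4 at a node: one COUNT(*) query per row of Shareds."""
    db, columns = node
    vector = []
    for cond in shareds:
        # Keep only the constraints on attributes this node stores.
        terms = [(a, v) for a, v in list(cond.items()) + list(itemset.items())
                 if a in columns]
        where = " AND ".join(f"{a} = ?" for a, _ in terms) or "1 = 1"
        query = f"SELECT COUNT(*) FROM r WHERE {where}"
        (count,) = db.execute(query, [v for _, v in terms]).fetchone()
        vector.append(count)
    return vector


def coordinator_support(nodes, shareds, itemset):
    """Step 5 at the coordinator: sum of products over the received vectors."""
    vectors = [node_vector(node, shareds, itemset) for node in nodes]
    return sum(prod(column) for column in zip(*vectors))


nodes = [
    make_node(("A", "B", "C"), [(1, 2, 2), (1, 6, 2), (1, 6, 1), (2, 6, 1)]),
    make_node(("D", "C", "F"), [(1, 2, 2), (1, 1, 1), (1, 1, 3)]),
    make_node(("A", "E", "C"), [(1, 1, 2), (1, 2, 1), (2, 2, 1)]),
]
shareds = [{"A": 1, "C": 1}, {"A": 1, "C": 2}, {"A": 2, "C": 1}, {"A": 2, "C": 2}]

print(coordinator_support(nodes, shareds, {"B": 6, "C": 1, "D": 1, "E": 2}))  # prints "4"
```

Attribute names are interpolated into the SQL text for brevity, which is acceptable here only because they come from a fixed schema; values are passed as bound parameters.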

III. IMPLEMENTATION AND RESULTS


We have implemented the above algorithm with three different relations on three different nodes. It was a simulated situation in which the three relations were assumed to reside on different nodes but were housed on the same computer. The first dataset had four attributes (A, B, C, and D), the second dataset had three attributes (D, E, and F), and the third dataset had three attributes (C, F, and G). Each attribute could take any value between 0 and 9. The Shareds relation had three attributes (C, D, and F) and one thousand rows, because each of the three attributes has ten possible values.

We stored different numbers of tuples in the relations and checked two things: (i) whether the correct support values were obtained for each itemset tested, and (ii) the time taken by our implementation to compute the support values.

To improve the efficiency of our implementation we designed our algorithm so that each node collects the summaries for all possible 1-itemsets (or any k-itemsets) and sends them to the coordinator in the same message. This reduces the number of messages that need to be exchanged among the nodes participating in the computation. For example, when the coordinator node is computing the support values for 1-itemsets the communication works as follows. There are seven attributes in our system and each of them can take ten possible values. That means there are seventy possible 1-itemsets for this dataset. In general, if there are m variables and each has a domain of size d then there will be a total of m*d possible 1-itemsets. These values can be arranged in a table in which each column corresponds to one itemset and each row corresponds to one condition (= extension) taken from the Shareds relation.

The coordinator computes the support values for only a subset of all possible 2-itemsets. All those 2-itemsets that contain a 1-itemset without enough support are dropped from consideration. In this case the coordinator requests the relations to send summaries for only the 2-itemsets of interest and needs to communicate to the relations the conditions specifying the attribute-value pairs of these 2-itemsets. Once these are received by the individual relations, SQL queries are performed and tables of summaries are generated and transmitted back to the coordinator node. In these tables each column corresponds to one requested k-itemset and each row corresponds to one row of the Shareds relation. The process continues for larger k-itemsets.

The typical execution time of an SQL query ranged between 60 and 180 milliseconds, depending on the number of tuples found to match the conditions of the query. The SQL queries were formed automatically by combining the rows of the Shareds relation with the condition for which the summaries are sought. When the dataset grows, each individual relation may grow in size, and the querying time will increase proportionately with the size of the relations. However, one major advantage of our algorithm and implementation is that the sizes of the summary messages exchanged do not depend on the sizes of the relations from which the summaries are generated. This is very beneficial because it makes our implementation scalable with the size of the partition relations. Their sizes may increase arbitrarily but the sizes of the summary messages will remain constant. The messages, containing tables as described above, may seem large at first, but they are much smaller than the very large underlying relations. We verified in our implementation tests that as the number of tuples in the individual relations grows the sizes of the messages exchanged remain constant.

Our dataset was small in that it had only seven attributes in all. A larger dataset may have many more attributes. This will cause the number of possible k-itemsets for which summaries are exchanged to become much larger, and hence the summary tables exchanged between the relations and the coordinator node will grow to become very large. These and some other efficiency issues can be addressed to improve the performance in future adaptations of our algorithm.

IV. EFFICIENCY ENHANCEMENT

After developing the algorithm and testing it against a number of datasets we have identified the following sources of inefficiency, and we present ways to increase the efficiency on these fronts.

We used SQL queries to determine the counts of tuples that satisfy the conditions given in the Shareds tuple and in the k-itemset of interest. The SQL queries take much longer to execute than a sequential scan of the complete relation. This may not be true in general but holds in our case, where we are determining the counts for every possible 1-itemset. For larger values of k the number of candidate k-itemsets may decline because many of their subsets do not have sufficient support. In such cases it may become more efficient to resort to SQL queries again. A judgment needs to be made about which of these two methods to use, based on the size of the relation and the number of itemsets for which summaries are needed.

When the number of attributes in the implicit dataset is very large the message size will grow to become unmanageable because there are too many itemsets. This problem can be addressed in one of the following two ways. First, we can restrict the attributes to only those that we are interested in including in the association rules. Leaving out the irrelevant attributes will significantly reduce the size of each message. Second, we may apply the pruning step at the individual relations and send to the coordinator the summaries for only those itemsets that have support larger than the specified threshold. That is, if the coordinator node asks the nodes to provide summaries for a hundred itemsets and a relation returns support values for only 55 of them, then it can be assumed that the remaining 45 itemsets had support values below the threshold.

Another inefficiency we observed in our implementation is that a very large number of rows in each message contain only zero entries. That is, for many conditions formed from the Shareds tuples there are no matching tuples in the relation being summarized. The size of each message can be reduced significantly if each row containing only zeroes is removed from the message. This will, however, increase the work of the coordinator, which will now have to match the rows of messages received from different nodes, since each message may have different rows omitted. Nevertheless, in most cases, for a reasonably large relation, more than half of the rows in each message were all zeroes.
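One way the coordinator could match rows after all-zero rows have been dropped is sketched below. The message layout (a dictionary keyed by the index of the Shareds row) and the function names are assumptions made for illustration. The counts are those of the three-node example, with one column for the empty itemset and one for {B=6, C=1, D=1, E=2}.

```python
from math import prod


def compress(table):
    """Keep only the Shareds rows that have at least one non-zero count."""
    return {j: row for j, row in enumerate(table) if any(row)}


def supports_from_sparse(messages, n_itemsets):
    """Sum of products over the rows that every node actually sent.

    A row missing from any message was all zeroes at that node, so its
    product is zero for every itemset and it can be skipped.
    """
    common = set.intersection(*(set(m) for m in messages))
    return [
        sum(prod(m[j][k] for m in messages) for j in common)
        for k in range(n_itemsets)
    ]


# Rows follow the Shareds order (A,C) = (1,1), (1,2), (2,1), (2,2).
node_1 = [[1, 1], [2, 0], [1, 1], [0, 0]]
node_2 = [[2, 2], [1, 0], [2, 2], [1, 0]]
node_3 = [[1, 1], [1, 0], [1, 1], [0, 0]]

messages = [compress(t) for t in (node_1, node_2, node_3)]
print(supports_from_sparse(messages, n_itemsets=2))  # prints "[6, 4]"
```

Keying rows by their position in Shareds keeps the matching step to a set intersection, at the cost of sending one index per surviving row.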

These enhancements, when implemented, will further improve the performance of our algorithm.

V. CONCLUSION

In this paper we have presented a scalable algorithm that can discover association rules from a vertically partitioned dataset. The vertical partitions may be distributed across geographically distant nodes of a network. Our algorithm does not need to perform a Join of the participating relations but instead works by generating requisite summaries at individual nodes and transmitting them to a coordinator. The coordinator then combines these summaries and requests more summaries for the next step of the algorithm. We have shown that the algorithm works correctly and have presented a number of ways that can increase the efficiency of our implementation.

VI. REFERENCES

1. Nenad Jukic, Svetlozar Nestorov: "Comprehensive data warehouse exploration with qualified association-rule mining." Decision Support Systems 42(2): 859-878 (2006).

2. David Wai-Lok Cheung, Jiawei Han, Vincent T. Y. Ng, Ada Wai-Chee Fu, Yongjian Fu: "A Fast Distributed Algorithm for Mining Association Rules." PDIS 1996: 31-42.

3. Rakesh Agrawal, Ramakrishnan Srikant: "Fast Algorithms for Mining Association Rules in Large Databases." VLDB 1994: 487-499.

4. Raj Bhatnagar, Sriram Srinivasan: "Pattern Discovery in Distributed Databases." AAAI/IAAI 1997: 503-508.

5. Murat Kantarcioglu, Chris Clifton: "Privacy-Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data." DMKD 2002.

6. Mafruz Zaman Ashrafi, David Taniar, Kate A. Smith: "ODAM: An Optimized Distributed Association Rule Mining Algorithm." IEEE Distributed Systems Online 5(3) (2004).

7. Jaideep Vaidya, Chris Clifton: "Privacy preserving association rule mining in vertically partitioned data." KDD 2002: 639-644.

8. Murat Kantarcioglu, Chris Clifton: "Privacy-Preserving Distributed Mining of Association Rules on Horizontally Partitioned Data." IEEE Trans. Knowl. Data Eng. 16(9): 1026-1037 (2004).

9. Sujni Paul: "An optimized distributed association rule mining algorithm in parallel and distributed data mining with XML data for improved response time." International Journal of Computer Science and Information Technology, Volume 2, Number 2, April 2012, pp. 88-101.

10. Rakesh Agrawal, John C. Shafer: "Parallel Mining of Association Rules." IEEE Trans. Knowl. Data Eng. 8(6): 962-969 (1996).
