
A Highly Parallel Algorithm for Frequent Itemset Mining

Alejandro Mesa 1,2, Claudia Feregrino-Uribe 2, René Cumplido 2, and José Hernández-Palancar 1

1 Advanced Technologies Application Center, CENATAV. La Habana, Cuba. {amesa, jpalancar}@cenatav.co.cu

2 National Institute for Astrophysics, Optics and Electronics, INAOE. Puebla, Mexico. {amesa, cferegrino, rcumplido}@inaoep.mx

Abstract. Mining frequent itemsets in large databases is a widely used technique in Data Mining. Several sequential and parallel algorithms have been developed; however, when dealing with high data volumes, their execution takes more time and resources than expected, so finding alternatives to speed up their execution is an active topic of research. Previous attempts at acceleration using custom architectures have been limited because the underlying algorithms were conceived sequentially and do not exploit the intrinsic parallelism that the hardware provides. The innovation in this paper is a highly parallel algorithm that utilizes a vertical bit vector (VBV) data layout and its suitability for support counting. Our results show that for dense databases a custom architecture for this algorithm can perform faster than the fastest architecture reported in previous works by one order of magnitude.

1 Introduction

Nowadays, many data mining techniques have emerged to extract useful knowledge from large amounts of data. Finding correlations between items, specifically frequent itemsets, is a widely used technique in data mining. The algorithms developed in this area require powerful computational resources and a lot of time to cope with the combinatorial explosion of itemsets that can be found in a dataset. The high computational resources required to process large databases can render the implementation of this kind of algorithm impractical. This is mainly due to the presence of thousands of different items or the use of a very low support threshold (minsup³).

Attempts to accelerate the execution of algorithms for mining frequent itemsets have been reported. The most common practice in this area is the use of parallel algorithms such as CD [1], DD [1], CDD [1], IDD [5], HD [5], Eclat [12] and ParCBMine [7]. However, these efforts have not achieved good execution times with a reasonable amount of resources for some practical applications.

³ The minimum number of times an itemset must occur in a database to be considered frequent.


Recently, hardware architectures have been used to speed up the execution time of those algorithms. These architectures, presented in [2, 3, 9–11], improved the execution time of software implementations by some orders of magnitude.

Previously proposed parallel algorithms used coarse-grained parallelism: generally, the database is partitioned so that each block of data can be processed in parallel. In this paper, an algorithm is proposed that exploits the inherent task parallelism of hardware implementations and their ability to perform bitwise operations. A hardware architecture for mining frequent itemsets is developed to test the efficiency of the proposed algorithm.

The remainder of this paper is organized as follows. Section 2 describes related work. Section 3 discusses the proposed algorithm. Section 4 describes the systolic tree architecture that supports the algorithm. The results are discussed in Section 5 and Section 6 presents the conclusions.

2 Related work

Algorithms that use data parallelism must deal with load balancing and with the costs of communication and synchronization. These problems are not commonly present in algorithms that exploit task parallelism, because in that type of algorithm the efficiency relies on the high speed that a simple task can achieve when executed in a multiprocess environment; therefore, the data are normally accessed sequentially. Currently, hardware architectures for three frequent itemset mining algorithms (Apriori, DHP and FP-Growth) have been proposed in the literature [2, 3, 10, 11], all of which implement sequential algorithms.

A systolic array architecture for the Apriori algorithm was proposed in [2] and [3]. A systolic array is an array of processing units that processes data in a pipelined fashion. The Apriori algorithm generates potentially frequent itemsets (candidate itemsets) following a heuristic approach, then prunes the candidates known to be infrequent, and finally counts the support of the remaining itemsets. This is done for each itemset size until no frequent itemset is left. In the proposed architectures, each item of the candidate set and of the database is injected into the systolic array. In every unit of the array, the subset operation and the candidate generation are performed in a pipelined fashion. The entire database has to be injected several times through the array. In [2], the hardware approach provides a minimum of a 4× performance advantage over the fastest software implementation running on a single computer. In [3], an enhancement to the architecture is explored, introducing a bitmapped CAM to count the support of several itemsets in parallel, achieving a 24× speedup.

In [11] the authors use the architecture proposed in [3] as a starting point to implement the DHP algorithm [8]. They introduce two blocks at the end of the systolic array that collect useful information about the transactions and the candidate itemsets in order to trim them and reduce the amount of data that the architecture has to process.

In [10] a systolic tree architecture was proposed to implement the FP-Growth algorithm [6]. This algorithm uses a compact representation of the data in an FP-Tree data structure. To construct the FP-Tree, two passes through the database are required, and the remaining processing is done on this data structure. The tree architecture consists of an array of processing elements arranged in a multidimensional tree to implement the FP-Tree. With this architecture a reduction of the data workload is achieved.

All the works presented in this section use a horizontal item-list data layout. This representation requires a number of bits per item to identify it (usually 32). Also, all these architectures implement algorithms that were conceived for software environments and do not take full advantage of the intrinsic parallelism of the hardware.

3 The proposed algorithm

Conceptually, a dataset can be defined as a two-dimensional matrix, where the columns represent the elements of the dataset and the rows represent the transactions. A transaction is a set of one or more items obtained from the domain in which the data is collected. Considering a lexicographical order over the items, they can be grouped into equivalence classes. The equivalence class E(X) of an itemset X is given by:

E(X) = { Y | Prefix_{k-1}(Y) = X ∧ |Y| = k }    (1)
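As an illustration (the helper below is ours, not part of the paper), the equivalence class of a prefix can be enumerated directly from the lexicographical order of the items:

```python
def equivalence_class(prefix, items):
    """E(prefix): all itemsets of size len(prefix)+1 whose (k-1)-prefix equals
    `prefix`, assuming `items` is sorted in the chosen lexicographical order."""
    start = items.index(prefix[-1]) + 1 if prefix else 0
    return [tuple(prefix) + (item,) for item in items[start:]]

# Example over the items a..e used in Figure 1:
# equivalence_class(('a', 'b'), ['a', 'b', 'c', 'd', 'e'])
# -> [('a', 'b', 'c'), ('a', 'b', 'd'), ('a', 'b', 'e')]
```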

The proposed algorithm is based on a search over the solution space through the equivalence classes, considering a lexicographical order over the items. This is a two-dimensional search in which both breadth and depth are explored concurrently. Searching through the equivalence classes allows us to exploit a characteristic of the VBV data layout: with this type of representation, the support of an itemset can be defined as the number of active bits of the vector X that represents it. This vector X represents the co-occurrence of the items in the database and can be obtained as a consecutive bitwise and operation between all the vectors that represent each item of the itemset (see Equation 2).

X = {a, b, c, ..., n}
X_n = a and b and c and ... and n    (2)

The process to obtain X can be done in steps. First, an and operation can be performed between the first two vectors (a and b) and the result is accumulated. This accumulated value can then be used to perform another and operation with the third vector to obtain X for X = {a, b, c}. To obtain X_n this must be done for all the items in X. This procedure is shown in Algorithm 1.
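Algorithm 1 is not reproduced here; the following minimal sketch (in Python, with illustrative names) shows the same accumulation over a VBV layout, where each item is one bit vector with one bit per transaction and the support is the number of active bits of the accumulated and:

```python
def itemset_support(item_vectors, itemset):
    """Support of `itemset` in a VBV layout: item_vectors maps each item to an
    integer whose bit i is set when transaction i contains that item."""
    acc = None
    for item in itemset:            # accumulate a and b and c and ... and n
        acc = item_vectors[item] if acc is None else acc & item_vectors[item]
    return bin(acc).count("1") if acc is not None else 0

# Six transactions; bit i is set when transaction i contains the item.
vectors = {"a": 0b101101, "b": 0b100111, "c": 0b101110}
print(itemset_support(vectors, ["a", "b", "c"]))  # -> 2 (transactions 2 and 5)
```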

Defining the search space as a tree in which each node represents an item of the database (Figure 1 shows a 5-item search space), an itemset is determined by the shortest path between two nodes of the tree.


Using the procedure previously described, the vector X_n of an itemset can be obtained recursively as X_n = X_{n-1} and n, where X_{n-1} is the accumulated value in the parent node. Once X_{n-1} is calculated, it can be used to obtain all the accumulated vectors of the child nodes. With this process, each node provides partial results for each one of the itemsets that include it in their path through the tree.

[Figure: a 5-item search space tree over items a–e and the corresponding architecture structure, with the association between search-space nodes and processing elements.]

Fig. 1. Structure for processing a 5-item solution tree.

In this algorithm there is no candidate generation; instead, the search space is explored until a node whose support is zero is reached. This process is sustained by the downward closure property, which establishes that the support of any itemset is greater than or equal to the support of any of its supersets. Because of this, if a node has no active bits in its accumulated vector, the nodes in the subtree below it will not receive any contribution to their support values from that vector.
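The search just described can be summarized by the following software sketch (ours; the hardware explores breadth and depth concurrently, whereas this model is a plain depth-first traversal). Each node reuses the parent's accumulated vector, a branch is abandoned as soon as that vector has no active bits, and the threshold is only checked when a support is recorded, mirroring the absence of candidate generation:

```python
def mine(item_vectors, items, minsup):
    """Enumerate frequent itemsets over lexicographically ordered `items` using
    the accumulated and scheme; a subtree is pruned only when the accumulated
    vector becomes all zeros (downward closure)."""
    frequent = {}

    def expand(prefix, acc, start):
        for i in range(start, len(items)):
            item = items[i]
            vec = item_vectors.get(item, 0)
            new_acc = vec if acc is None else acc & vec
            if new_acc == 0:                 # no active bits: descend no further here
                continue
            support = bin(new_acc).count("1")
            itemset = prefix + (item,)
            if support >= minsup:
                frequent[itemset] = support
            expand(itemset, new_acc, i + 1)  # children of the same equivalence class

    expand((), None, 0)
    return frequent
```

With the vectors from the earlier sketch, mine(vectors, ["a", "b", "c"], 2) includes ("a", "b", "c") with support 2.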

The lexicographical order of the items is established according to their frequency, placing the least frequent items first. This order causes the support of itemsets to decrease rapidly when descending through the tree architecture.

Managing large databases with a VBV data layout can be expensive because the size of X_n is determined by the number of transactions in the database. To solve this, a horizontal partition of the database is made. Each section is processed independently, as shown in Algorithm 2, and each node accumulates the itemset supports of the section. Every time a node calculates a partial support, it is added to the accumulated support of the itemset.


When the processing of the database finishes, each node holds the global support of all the itemsets it calculated.
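Algorithm 2 is likewise not reproduced here; the sketch below (ours) illustrates the partitioning idea in software: the transaction list is split into sections of bw transactions, each section is turned into bw-bit vertical vectors, and the partial supports of every itemset of interest are added into a global accumulator, just as each node adds its per-section counts:

```python
def sectioned_supports(transactions, itemsets, bw=32):
    """Global supports computed section by section, mirroring the horizontal
    partition of the database used with the VBV layout."""
    totals = {tuple(s): 0 for s in itemsets}
    for start in range(0, len(transactions), bw):
        section = transactions[start:start + bw]
        vectors = {}                         # bw-bit vertical vector per item
        for pos, transaction in enumerate(section):
            for item in transaction:
                vectors[item] = vectors.get(item, 0) | (1 << pos)
        for itemset in totals:
            acc = (1 << len(section)) - 1    # all-ones mask for this section
            for item in itemset:
                acc &= vectors.get(item, 0)
            totals[itemset] += bin(acc).count("1")
    return totals
```

In the architecture each PE keeps the running total for the itemsets it is responsible for; here a single dictionary plays that role.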

A structure of processing elements is needed to implement the algorithm. This structure interchanges data as described in Algorithm 2. Two modes of data injection and one of data extraction are needed to accomplish the frequent itemset mining task: "Data In", "Support Threshold" and "Data Flush". In the first mode (Process Data()), all data vectors are fed into the structure through the root element to process the itemsets. The second mode (Process Minsup()) injects the support threshold so the processing elements can determine which itemsets are frequent and which are not. For the data extraction (Get Data()), all the elements of the structure flush the frequent itemsets through their parents and the data exits the structure through the root element.

To implement the algorithm we use a binary tree structure of processing elements (PEs). Figure 1 shows the processing structure for a five-item solution tree. The structure is a systolic tree, chosen because this type of construction allows the number of concurrent operations to grow exponentially at each processing step. The number of concurrent operations can be calculated as 1 + Σ_{i=0}^{cc} 2^i, where cc is the number of processing steps that have elapsed since the process started.
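For reference, this count is a geometric sum, so after cc steps there are 2^(cc+1) concurrent operations; a trivial check:

```python
def concurrent_operations(cc):
    """Concurrent operations after cc processing steps: 1 + sum_{i=0}^{cc} 2^i,
    which collapses to 2 ** (cc + 1)."""
    return 1 + sum(2 ** i for i in range(cc + 1))

assert all(concurrent_operations(cc) == 2 ** (cc + 1) for cc in range(20))
```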

Each node of the systolic tree (n_arq) is associated with a node of the solution tree (n_sol). This n_arq determines an itemset (IS_n), formed by the path from the tree root to n_sol. The n_arq calculates the support of the supersets of IS_n; those supersets are the ones determined by all the nodes on the path from the root to a leaf, following the leftmost node of the solution tree.

A PE has a connection from its parent and is connected to a lower PE (PEu, through an upper entry) and to a right PE (PEl, through a left entry). In general, the number of PEs of a structure can be calculated as follows:

ST_n = 1 + PEu_{n-1} + PEl_{n-1}, for n ≥ 2,

with PEu_n and PEl_n given by:

PEu_2 = 0,  PEl_1 = 0,
PEu_3 = 1,  PEl_2 = 1,
PEu_n = 1 + PEl_{n-2} + PEu_{n-1},  PEl_n = 1 + PEl_{n-1} + PEu_{n-1}
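The recurrence can be evaluated directly; the small sketch below (ours) tabulates PEu_n and PEl_n and, for n = 11, yields ST_11 = 264, consistent with the 264 PEs of the 11-item structure reported in Section 5:

```python
def tree_size(n):
    """ST_n, the number of processing elements of an n-item structure (n >= 3),
    from PEu_n = 1 + PEl_{n-2} + PEu_{n-1} and PEl_n = 1 + PEl_{n-1} + PEu_{n-1}."""
    PEu = {2: 0, 3: 1}
    PEl = {1: 0, 2: 1}
    for k in range(3, n):
        PEl[k] = 1 + PEl[k - 1] + PEu[k - 1]
        PEu[k + 1] = 1 + PEl[k - 1] + PEu[k]
    return 1 + PEu[n - 1] + PEl[n - 1]

print(tree_size(11))   # -> 264, the structure synthesized in the experiments
```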

3.1 Processing Elements

The itemsets that a PE calculates are determined by the data vectors that its parent feeds to it, in such a way that the first vector to reach each PE is the X_n of the prefix of the equivalence class that its IS_n belongs to. Each PE calculates the support of the vector it receives from its parent, accumulates the result of the bitwise and operation in a local memory, and propagates either the data vector received from the parent or the accumulated value, as appropriate. To achieve this, two types of PE behavior are defined.


In "Data In" mode, the main difference between PEl (described in Algorithm 3) and PEu (described in Algorithm 4) is that PEu does not accumulate the result of the bitwise and operation on the second data vector it receives. This operation provides the lower PE with the prefix of the equivalence class of its IS_n.

3.2 Feedback Strategy

As the number of PEs of the tree depends directly on the number of frequent items in the database, it is impractical to create this structure for databases with many frequent items. To solve this problem, a structure for n items can be designed and, taking into account the recursive definition of trees, the itemsets can be calculated stepwise (see Figure 2). In the first step, all the itemsets that can be calculated with this structure are obtained. Since it is known which level of the solution tree is processed for a given structure (shown in Figure 2 as the architecture processing border), data are injected back into the structure, except that this time the first vector is not the first item but the prefix of the equivalence class of the itemset determined by the border tree node that was not processed.

Figure 2 shows an example of a six-item search space. The dotted-line subtrees are examples of subtrees with different parents that have to be processed with the feedback strategy. In the example, to process the first subtree on the left, the first vector to be injected into the architecture is the vector that represents the itemset X = {a, b, c}.

This feedback is repeated for each solution tree node on the border of the nodes that were not processed. As each of these nodes defines a solution subtree (all the subtrees below the processing border in Figure 2), this structure is compatible with the entire tree and can consequently process a solution tree of any size. This process is defined recursively for the entire solution tree and, as a result, we obtain a partition of the tree in which each subtree represents a solution tree of n items.


[Figure: a 6-item search space (items a–f) processed with a 3-item architecture; the processing border separates the part handled directly from the subtrees processed through feedback.]

Fig. 2. Feedback strategy.


4 Architecture structure

The proposed architecture has three modes of operation: "Data In", "Support Threshold" and "Data Flush". In the "Data In" mode, all X_n of each section of the database are injected. When the entire database has been injected, the architecture enters the "Support Threshold" mode and the support threshold value is provided and propagated through the systolic tree. Once a node receives the support threshold value, all the PEs change to "Data Flush" mode and the results are extracted from the architecture.

In this architecture, the number of clock cycles for the entire process can be divided into two general stages: Data In (CC_data) and Data Flush (CC_flush). CC_data is mainly defined by the number of frequent items in the database for a specific support threshold (nf), the number of transactions of the database (T), and the chosen size of the data vector (bw). The data vector size defines the number of sections (S) in the partition of the database. CC_data can be calculated as follows:

CC_data = S * (nf + 1) + 1, where S = ⌈T / bw⌉

For the data flush stage, CC_flush is mainly defined by the number of PEs that have frequent itemsets and by the number of empty PEs on the path up to the next PE with useful data. In this strategy, once the data are extracted from a PE, data must also be extracted from its PEl, since there is no information to determine whether it holds frequent itemsets or not. In the case of the PEu in the lower subtree, if there are no frequent itemsets in the current PE, then no frequent itemsets will be found in the lower subtree. This is because all the itemsets calculated in that subtree have as prefix the first itemset of the current PE, and if this is not frequent, then no superset of it will be frequent.

The number of PEs with and without useful data, and the positions they occupy in the systolic tree, depend only on the nature of the data and the support threshold.


In the worst case, all the PEs will have frequent itemsets and CC_flush can be calculated as ST_n + |FI_set|.
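As a rough estimate (function and argument names are ours, and real flush time is usually shorter than this worst case because of the pruning described in Section 4.1), the two stages can be combined as follows:

```python
import math

def estimated_cycles(T, nf, st_n, num_frequent_itemsets, bw=32):
    """Worst-case cycle estimate: CC_data = S*(nf + 1) + 1 with S = ceil(T/bw),
    plus CC_flush = ST_n + |FI_set| when every PE holds frequent itemsets."""
    S = math.ceil(T / bw)                       # number of database sections
    cc_data = S * (nf + 1) + 1                  # "Data In" stage
    cc_flush = st_n + num_frequent_itemsets     # worst-case "Data Flush" stage
    return cc_data + cc_flush
```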

4.1 Flush strategy

To extract the data from the architecture, each PE enters the "Data Flush" operation mode. All PEs have a connection to their parent so that all the frequent itemsets can be flushed upwards (Parent out). In this strategy, each node of the systolic tree flushes the frequent itemsets stored in its local memory, then serves as a gateway for the data from the lower PE (Down in) and, when that ends, serves as a gateway for the data of the right PE (Right in).
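The sketch below is a toy software model (ours, with illustrative types) of this extraction order, including the lower-subtree pruning justified in Section 4: the lower PE is visited only when the current PE holds frequent itemsets, while the right PE is always visited.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class PE:
    local_frequent: List[Tuple[str, ...]] = field(default_factory=list)
    lower: Optional["PE"] = None   # PEu: its itemsets extend this PE's first itemset
    right: Optional["PE"] = None   # PEl: may hold results even if this PE has none

def flush(pe: PE):
    """Yield frequent itemsets in "Data Flush" order: own results first, then the
    lower subtree (skipped when this PE has no frequent itemsets), then the right."""
    yield from pe.local_frequent
    if pe.lower is not None and pe.local_frequent:
        yield from flush(pe.lower)
    if pe.right is not None:
        yield from flush(pe.right)
```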

Extracting the data through the root node of the systolic tree makes it possible to take advantage of the characteristics of the itemsets. These characteristics allow heuristics to be defined that prune subtrees during the flush, reducing the amount of data that has to be flushed from the architecture and, therefore, the time needed to complete the task.

5 Implementation results

The proposed architecture was modeled using VHDL and verified in simulation with ModelSim SE 6.5. Once the architecture was validated, it was synthesized using ISE 9.2i. The target device was a Xilinx Virtex-4 XC4VFX140 with package FF1517 and -10 speed grade.

For the experiments, three datasets from [4] were used: Chess, Accidents and Retail. As explained in Section 4, the size of the architecture grows with the number of frequent items it is capable of processing, so the size of the device being used strongly affects the time it takes to finish the task. For the experiments we used a 32-bit data vector, and the biggest architecture that fits the device was a structure to process 11 items with 264 PEs.


[Figure: mining time (ms, logarithmic scale) versus number of unique frequent items for the Chess, Retail and Accidents datasets, comparing the FP-Growth architecture with the proposed architecture.]

Fig. 3. Mining time comparison.

This architecture consumes 74.6% of the LUTs (94,248 out of 126,336) and 18.6% of the flip-flops (23,496 out of 126,336) available in the device.

Since the architecture is completely decentralized, there are no global connections and thus the maximum operating frequency is not affected by the number of PEs of the architecture. The maximum frequency obtained for this architecture was 137 MHz. The proposed algorithm is more sensitive to the number of frequent items than to the support threshold, so the experiments were carried out with respect to this variable in order to show the behavior more precisely.

In Figure 3 we compare the mining time of the architecture against the best time of the FP-Growth architecture presented in [9]. The figure shows that the greatest improvement in execution time was obtained for the Chess database, where a speedup of more than one order of magnitude was achieved. For the other two databases, the execution time of the proposed architecture grows more slowly than that of the FP-Growth architecture as the number of frequent items increases. Moreover, as the ratio between the number of obtained frequent itemsets and the size of the database increases, the percentage of time spent in CC_data decreases. For the Accidents and Retail databases, CC_data is 99% and 97% of the total time respectively, while for the Chess database the percentage is between 62% and 68%. This is because the execution time of the "Data In" mode depends only on the number of frequent items and the number of transactions of the database, while the remaining time depends on the number of frequent itemsets obtained.

This behavior illustrates a feature of the algorithm and its scalability: the architecture performs better as the density of the processed database increases, and generally performs better as the number of frequent itemsets obtained increases. This feature (better performance at higher density) is important considering that the denser the database, the more difficult it is to obtain the frequent itemsets and the larger the number of frequent itemsets obtained.

For sparse databases and high support thresholds (a low number of frequent items), the task can in most cases be solved without resorting to hardware acceleration. The need for acceleration is more evident when dealing with large databases with a low support threshold, a high density, or both. Because of this, this feature is desirable in algorithms for obtaining frequent itemsets.


6 Conclusion

In this paper we proposed a parallel algorithm that is specially designed for environments that allow a high number of concurrent processes. This characteristic best suits a hardware environment and allows task parallelism with fine granularity. Furthermore, a hardware architecture to validate the algorithm was developed. The experiments show that our approach outperforms the FP-Growth architecture presented in [10] by one order of magnitude when processing dense databases, and that its mining time grows more slowly than the FP-Growth architecture's mining time when processing sparse databases.

References

1. R. Agrawal and J.C. Shafer. Parallel mining of association rules: Design, implementation and experience. Technical Report RJ10004, IBM Research Report, February 1996.

2. Z. K. Baker and V. K. Prasanna. Efficient Hardware Data Mining with the Apriori Algorithm on FPGAs. In Proc. of the 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines 2005 (FCCM '05), pages 3–12, 2005.

3. Z. K. Baker and V. K. Prasanna. An Architecture for Efficient Hardware Data Mining using Reconfigurable Computing System. In Proc. of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines 2006 (FCCM '06), pages 67–75, 2006.

4. B. Goethals. Frequent itemset mining dataset repository. http://fimi.cs.helsinki.fi/data/.

5. E.H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In Proc. of the ACM SIGMOD Conference, pages 277–288, 1997.

6. J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In 2000 ACM SIGMOD Intl. Conf. on Management of Data, pages 1–12. ACM Press, 2000.

7. J.H. Palancar, O.F. Tormo, J.F. Cardenas, and R.H. Leon. Distributed and shared memory algorithm for parallel mining of association rules. In MLDM '07: Proceedings of the 5th Intl. Conf. on Machine Learning and Data Mining in Pattern Recognition, volume 4571 of LNAI, pages 349–363, 2007.

8. J. Park, M. Chen, and P. Yu. An effective hash based algorithm for mining association rules. In Michael J. Carey and Donovan A. Schneider, editors, SIGMOD Conference, pages 175–186. ACM Press, 1995.

9. S. Sun, M. Steffen, and J. Zambreno. A reconfigurable platform for frequent pattern mining. In RECONFIG '08: Proc. of the 2008 Intl. Conf. on Reconfigurable Computing and FPGAs, pages 55–60. IEEE Computer Society, 2008.

10. S. Sun and J. Zambreno. Mining association rules with systolic trees. In Proc. of the Intl. Conf. on Field-Programmable Logic and its Applications (FPL), pages 143–148. IEEE, 2008.

11. Y. Wen, J. Huang, and M. Chen. Hardware-enhanced association rule mining with hashing and pipelining. IEEE Trans. on Knowl. and Data Eng., 20(6):784–795, 2008.

12. M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In Proc. of the 3rd Intl. Conf. on KDD and Data Mining (KDD97), pages 283–286, 1997.