    Efficient frequent pattern mining based on Linear Prefix tree

Gwangbum Pyun a, Unil Yun a,*, Keun Ho Ryu b

a Department of Computer Engineering, Sejong University, Seoul, Republic of Korea
b Department of Computer Science, Chungbuk National University, Cheongju, Republic of Korea

Article info

    Article history:

    Received 24 April 2013

    Received in revised form 11 October 2013

    Accepted 12 October 2013

    Available online 24 October 2013

    Keywords:

    Data mining

    Frequent pattern mining

    Linear tree

    Pattern growth

    Knowledge discovery

Abstract

Outstanding frequent pattern mining guarantees both fast runtime and low memory usage with respect to various data of different types and sizes. However, it is hard to improve both elements since runtime is inversely proportional to memory usage in general. Researchers have made efforts to overcome this problem and have proposed mining methods which can improve both through various approaches. Many state-of-the-art mining algorithms use tree structures; they create nodes independently and connect them with pointers when constructing their own trees. Accordingly, these methods keep pointers for each node in their trees, which is inefficient since they must manage and maintain numerous pointers. In this paper, we propose a novel tree structure to overcome this limitation. Our new structure, LP-tree (Linear Prefix tree), is composed of array forms and minimizes pointers between nodes. In addition, LP-tree uses the minimum information required in the mining process and accesses the corresponding nodes linearly. We also suggest an algorithm applying LP-tree to the mining process. The algorithm is evaluated through various experiments, and the experimental results show that our approach outperforms previous algorithms in terms of runtime, memory, and scalability.

© 2013 Elsevier B.V. All rights reserved.

    1. Introduction

As a part of association rule mining, frequent pattern mining is a method for finding frequent patterns in large data [15]. The patterns obtained from mining operations are usefully utilized to analyze data characteristics or to gain information needed for decision-making. In addition, it can be applied in a variety of real data analyses such as web data [20], customer data in finance, correlation of product data, vehicle and communication data [9], bio data [13], hardware monitoring of computer systems [45], and regular pattern mining [28]. In pattern mining, a pattern is a set of items in a certain database, and the support of a pattern is defined as the number of transactions containing the pattern, where we regard patterns satisfying a given minimum support threshold as frequent ones. Apriori [1] and FP-growth [14] are fundamental algorithms in frequent pattern mining, and current studies proceed on the basis of these two algorithms. Moreover, numerous other methods have been suggested. First, there are methods using closed patterns such as BMCIF [8] and CEMiner [9], and those for maximal patterns such as MAFIA [5], FP-MAX [12], LFIMiner [16], MCWP [41], and MWS [42]. Furthermore, there exist other approaches for stream environments such as WMFP-SW [19], BSM [30], CPS-tree [31], and RPS-tree [32], and for utility patterns such as HUIPM [2], HUPMS [3], and UP-growth [34]. The following techniques apply item weights in the mining process: WARM [33], WAS [39], and MWFIM [40] are weight-based algorithms, and TIWS [7] adds weights with time. In addition, there is an approach which finds frequent patterns from the top support to the kth support without any given minimum support threshold. The method is called top-k pattern mining, and typical studies are MinSummary [18], PND [24], Chernoff [35], Topk-PU [43], SpiderMine [44], etc. In sequential pattern mining, which considers item sequence, there are SeqStream [6], StreamCloseq [10], ApproxMAP [17], TD-seq [22], CSP [27], WSpan [38], and so forth. U2P-Miner [23] mines uncertain data, and GAMiner [36] gives meaning to interesting patterns and then extracts those patterns. Developing an improved algorithm for frequent pattern mining can contribute to advancing mining performance in various mining fields.

FP-growth-based frequent pattern mining, such as FP-growth [12], Patricia-tree [26], and IFP-growth [21], has the following characteristics. FP-growth keeps connection information among all nodes in the FP-tree in order to search the nodes. Therefore, it has many pointers for connecting nodes, thereby using a lot of runtime and memory resources. In this paper, we therefore propose a novel tree structure, LP-tree (Linear Prefix tree), and an algorithm using the tree, called LP-growth, which can conduct mining operations more quickly and efficiently than previous algorithms. Our LP-tree can overcome the above limitation due to its special structure based on the linear form. We can obtain advantages by converting the tree's nodes into array forms. It can increase memory efficiency through arrayed nodes since they can reduce connection

0950-7051/$ - see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.knosys.2013.10.013

    Corresponding author. Tel.: +82 234082902.

E-mail addresses: [email protected] (G. Pyun), [email protected] (U. Yun), [email protected] (K.H. Ryu).

Knowledge-Based Systems 55 (2014) 125–139


information. We can also speed up item traversal times since LP-tree does not use pointers in most cases and generates a large number of nodes at once due to its linear structure. By applying the features of LP-tree to the mining process, we can obtain the following benefits: (1) The tree generation rate of our approach becomes faster than that of FP-growth since ours can create multiple nodes at once by a series of array operations, whereas FP-growth makes nodes one by one. (2) We can access parent or child nodes without corresponding pointers when searching trees since the nodes are stored in an array form. (3) Memory usage for each node becomes relatively small since LP-tree does not require internal node pointers. (4) It is possible to traverse trees more quickly compared to searching with pointers since our approach directly accesses the corresponding memory due to the feature of the array structure. This paper is organized as follows. In Section 2, we introduce related work with respect to LP-tree and LP-growth, and we describe details of our techniques and algorithm in Section 3. Next, we compare the performance of our algorithm with those of previous algorithms through various experiments in Section 4, and we finally conclude this paper in the last section.

    2. Related work

Frequent pattern mining extracts patterns with supports higher than or equal to a minimum support threshold, and many mining methods have been researched as mentioned above, but Apriori [1] and FP-growth [14] are still regarded as the underlying algorithms. Apriori is the oldest conventional mining algorithm, and it performs mining operations by extending pattern lengths. The algorithm generates candidate patterns through pattern extension in advance, and then confirms whether the candidates are actually frequent patterns by scanning the database. Consequently, Apriori has no choice but to scan the database as many times as the maximum length among the frequent patterns. UT-Miner [37] is an improved Apriori algorithm specialized for sparse data, where sparse data indicate that most transactions are different from each other. The algorithm uses an array structure, the unit triple, storing relations between items and transactions in a database to improve mining performance. However, UT-Miner does not guarantee fine performance in terms of runtime and memory usage since the algorithm is based on the Apriori method. On the other hand, FP-growth [14] solved the above problem by scanning the database only twice. It uses a tree structure, called FP-tree, which prevents the algorithm from generating candidate patterns. FP-tree consists of a tree for storing database information and a header table containing item names, supports, and node links. The tree is composed of nodes, where each of them includes an item name, a support, a parent pointer, a child pointer, and a node link. The node link is a pointer that connects all nodes with the same item to each other. Since the FP-growth algorithm was proposed, various algorithms have been published on the basis of it. FP-growth-goethals [11] is an FP-growth implementation optimized by Bart Goethals. To increase the efficiency of the search space in FP-growth, FP-growth-tiny [25] generates conditional FP-trees using conditional patterns without creating any conditional database. In CT-PRO [29], the authors suggested the Compressed FP-tree, adding a count array into the nodes of the FP-tree, where each entry of the array corresponds to the number of itemset occurrences. The algorithm mines frequent patterns using the information added in the tree without recursive calls. IFP-growth [21] enhanced the pruning effect with a new tree structure, FP-tree+, where the tree adds an address table to the FP-tree. Therefore, the algorithm decreases the number of conditional FP-trees, thereby improving mining speed. Meanwhile, it needs more information than the original FP-tree. In addition, IFP-growth does not improve memory efficiency although it contributes to reducing runtime. MAFIA-FI [5] saves data information in a bitmap form so as to reduce the number of tree searches. The bitmap is made up of two dimensions, where the x-axis means items and the y-axis means transactions. For example, a point (2,4) of a certain bitmap means that the second item exists in the fourth transaction. Thus, MAFIA-FI can compute the supports of patterns or items through AND operations on the bitmap without tree traversals. In addition, the algorithm can prevent creating needless trees with infrequent patterns and maximize pruning efficiency. However, MAFIA-FI requires more memory although its runtime is faster than that of the original method. Patricia-tree [26] also applies an array structure to a part of the FP-tree, where the algorithm generates paths with the same support as an array. Meanwhile, the LP-tree proposed in this paper constructs all paths as arrays regardless of the items' supports, where the shapes of the arrays vary depending on each transaction's form. FP-growth [12] proposed the FP-array with pattern information and increased pruning efficiency with it. The approach calculates the supports of patterns to be expanded in advance, and eliminates infrequent patterns effectively through the proposed FP-array. However, FP-growth also does not reduce the size of the trees since it still uses the original FP-tree-based structures. As a result, we need to develop a new tree structure to improve the fundamental performance of the mining algorithm. Consequently, we propose a novel tree and algorithm satisfying both runtime and memory efficiency. In our LP-tree, runtime and memory performance are more outstanding than those of FP-growth due to its special tree structure based on the array.
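The bitmap representation described for MAFIA-FI lends itself to a compact sketch. The following is only an illustration of the idea (integers as per-item bit rows, popcount as support), not the authors' implementation:

```python
# Illustrative sketch of bitmap-based support counting (MAFIA-FI style).
# Each item maps to an integer bitmap: bit t is set when the item occurs
# in transaction t. A pattern's support is then the popcount of the AND
# of its items' bitmaps -- no tree traversal is required.

def build_bitmaps(transactions):
    bitmaps = {}
    for tid, items in enumerate(transactions):
        for item in items:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)
    return bitmaps

def support(pattern, bitmaps):
    # AND together the bitmaps of all items in the pattern.
    result = bitmaps.get(pattern[0], 0)
    for item in pattern[1:]:
        result &= bitmaps.get(item, 0)
    return bin(result).count("1")  # popcount = number of transactions

transactions = [["a", "b", "c"], ["a", "b"], ["b", "c", "d"]]
bitmaps = build_bitmaps(transactions)
print(support(["a", "b"], bitmaps))  # prints 2 ({a,b} is in transactions 0 and 1)
```

This mirrors the trade-off noted above: the AND operations are fast, but a bitmap row is kept per item, which costs extra memory.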

    3. Frequent pattern mining based on Linear Prefix-tree

In this section, we present the details of the LP-growth algorithm and its related techniques. The algorithm conducts mining operations with LP-tree and the corresponding pattern growth methods.

    3.1. Preliminaries

Given a transaction database D, I = {i_1, i_2, . . ., i_n} is the set of items composing D, and D consists of multiple transactions. Each transaction has a unique set of items, and D assigns a unique ID, called a TID, to each transaction. A pattern is defined as a subset (or the whole) of I. Assuming that a pattern P has several items and its first and last ones are i_b and i_e respectively, P is denoted as follows:

P = {i_b, . . ., i_e}, 1 ≤ b < e ≤ n.

P's support means the number of transactions in D containing P; in other words, it indicates how often P occurs in D. Let |P| be the number of transactions including P and |D| be the number of all transactions in D. Then, we can calculate P's support rate, sup(P), as follows:

sup(P) = |P| / |D|,

where 0 ≤ sup(P) ≤ 1. P is regarded as a frequent pattern if sup(P) is not smaller than a given minimum support threshold (or minsup). Denoting the frequent pattern as L, it is included in I and satisfies sup(L) ≥ minsup, where 0 ≤ minsup ≤ 1:

L = {P ⊆ I | sup(P) ≥ minsup}.

For instance, given a database {{TID1: a,b,c}, {TID2: a,b}, {TID3: b,c,d,e}}, I becomes {a, b, c, d, e}. If the minimum support threshold is 60%, the pattern {a,b} is frequent since it appears in TID1 and TID2; thus, its support is higher than the threshold. Meanwhile, another



pattern, {a,c}, is infrequent because it is contained only in TID1; therefore, its support is lower than the threshold.
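These definitions can be written out directly in Python; the sketch below uses the three-transaction toy database and the 60% threshold from the example (the brute-force enumeration of L is for illustration only and is exponential in |I|):

```python
from itertools import combinations

# Toy database from the example: TID -> set of items.
database = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"b", "c", "d", "e"}}
minsup = 0.6  # minimum support threshold (60%)

def sup(pattern):
    # sup(P) = |P| / |D|: the fraction of transactions containing P.
    count = sum(1 for items in database.values() if pattern <= items)
    return count / len(database)

print(sup({"a", "b"}))  # 2/3: frequent, since it is at least 0.6
print(sup({"a", "c"}))  # 1/3: infrequent, only TID1 contains it

# L = {P subset of I | sup(P) >= minsup}, enumerated by brute force:
I = sorted(set().union(*database.values()))
L = [set(p) for r in range(1, len(I) + 1)
     for p in combinations(I, r) if sup(set(p)) >= minsup]
print(len(L))  # 5 frequent patterns: {a}, {b}, {c}, {a,b}, {b,c}
```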

    3.2. LP-tree: a novel tree structure for mining frequent patterns

There are several limitations in regard to previous frequent pattern mining methods. Basically, frequent pattern mining has to find all frequent patterns in a transaction database. Thus, in the worst case, the method should extract 2^n − 1 patterns, since all of the n items in the database may be frequent. Moreover, Apriori-based algorithms consume more time and memory due to the generation of candidate patterns, while FP-growth spends most of its time traversing and generating trees. Note that the problem of Apriori is not under consideration here; we focus on the FP-growth approach because FP-growth-based algorithms generally perform better than Apriori-based algorithms. To improve performance, we should decrease tree traversal and generation times. To do that, tree structures need to have a simple form, and each node in the trees has to occupy less memory space. Our LP-tree satisfies both requirements, so LP-growth with this tree structure can conduct tree creation and search efficiently.

Definition 1 (LP-tree (Linear Prefix tree)). LP-tree has the following structure: (1) a header list consisting of item names, supports, and node links, (2) Linear Prefix Nodes (LPNs) for storing the frequent items of each transaction together with a corresponding header, and (3) a Branch Node List (BNL) including information on branch nodes and their child nodes. An LP-tree consisting of c LPNs has the following structure:

LP-tree = {Header list, BNL, LPN_1, LPN_2, . . ., LPN_c}.

LP-tree entirely has a linear structure. Each set of frequent items is saved into nodes composed in an array form, where we use multiple arrays since one array structure cannot express items as a tree form with many branches. To connect the arrays, every array has a header in its first part, where the header indicates its parent array. An LPN contains a header and an array node storing a pattern, and the array node consists of several internal nodes. Moreover, the header of an LPN can indicate the root of the tree when the LPN is the first one inserted in the tree. Details of LPN are given in Definition 2. LP-tree is composed of one or more LPNs as shown in Fig. 1.

Definition 2 (LPN (Linear Prefix Node)). LPN is the fundamental structure of LP-tree. In an LPN, there are multiple internal nodes and a header in the top position of the LPN. Let Parent_Link, i, S, L, and b be a parent node pointer connected to another LPN (the parent LPN), an item, a support, a node link, and branch information, respectively. Then, the following Eq. (1) represents how an LPN is composed, where the information of each internal node is described between ⟨ and ⟩:

LPN = {⟨Parent_Link⟩, ⟨i_1, S, L, b⟩, ⟨i_2, S, L, b⟩, . . ., ⟨i_n, S, L, b⟩}.  (1)

LPN stores item information in each node. That is, if certain items {i_1, i_2, . . ., i_n} are added to an LPN, its array node has n internal nodes. In this process, finite internal nodes are generated according to the number of inserted items, thereby dealing with the whole pattern information. In Eq. (1), the parent node of ⟨i_n, S, L, b⟩ is ⟨i_{n−1}, S, L, b⟩, and the child node of ⟨i_{n−1}, S, L, b⟩ is ⟨i_n, S, L, b⟩. Parent_Link is a pointer indicating the parent node of ⟨i_1, S, L, b⟩, which is the first node of the LPN. The parent node connected to the Parent_Link becomes either a root or a node of another LPN. We define the symbol p_{c,k} to express a pointer to a specific node inside an LPN. Given the cth LPN with n nodes, LPN_c, p_{c,k} indicates the kth node of LPN_c (k = [0, . . ., n]), and p_root is a pointer to the root (the 0th node is the header node storing Parent_Link). If the parent of the first node in any LPN is the 5th node of LPN_1, then its Parent_Link becomes p_{1,5}. An internal node of an LPN has four elements as in Eq. (1), where each subscript indicates the corresponding ordinal number. The header of an LPN is linked to a branch node of its parent; thus, we can obtain patterns by tracking headers. A node link, L, plays a role in concatenating nodes with the same item. LPN does not have any pointers for connections among internal nodes. A branch node has two or more child nodes; therefore, LPN uses a branch node in order to express nodes having two or more child nodes. The b is used as a flag value to mark whether branch nodes exist or not. An LPN cannot manage two or more child nodes due to the array's limitation. For this reason, we propose and use BNL for managing the branch nodes in order to deal with multiple child nodes.
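Read literally, Eq. (1) maps onto a plain array-of-records layout. The sketch below is one possible rendering in Python (class and field names are ours, chosen to mirror the notation above; it is not the authors' code):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Pointer = Optional[Tuple[int, int]]  # (LPN index, node index); None = root

@dataclass
class InternalNode:
    # One <i, S, L, b> entry: item name, support, node link to the next
    # node holding the same item, and a branch flag.
    item: str
    support: int = 1
    node_link: Pointer = None
    branch: bool = False

@dataclass
class LPN:
    # Header (Parent_Link) followed by an array of internal nodes.
    # Parent/child relations inside the array are implicit: the parent of
    # nodes[k] is nodes[k - 1] and its child is nodes[k + 1] -- no pointers.
    parent_link: Pointer = None
    nodes: List[InternalNode] = field(default_factory=list)

# An LPN holding the item sequence {e, a, f}, hanging off the root:
lpn = LPN(parent_link=None,
          nodes=[InternalNode("e"), InternalNode("a"), InternalNode("f")])
k = 1
print(lpn.nodes[k].item, lpn.nodes[k - 1].item, lpn.nodes[k + 1].item)
# prints: a e f  (a node, its parent, and its child, by index arithmetic)
```

Note how the node itself carries no parent or child pointer; only the header's Parent_Link crosses LPN boundaries.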

Definition 3 (BNL (Branch Node List)). BNL helps manage numerous branch nodes when generating an LP-tree. When the items of each transaction are inserted, they are sequentially input from the root, and several branches can occur in this process. If the current position reaches a branch node during the insertion, we confirm all child nodes of the branch node and then move to the appropriate location by referring to the BNL information. We can easily access multiple child nodes through BNL, which is constructed in list form and stores only the information of branch nodes and their child nodes. BNL is composed of a branched node table and child node lists, where the branched node table stores pointers to all branched nodes, and each element (pointer) stored in it has one child node list. The child node list has the child node pointers of the corresponding branched node. Therefore, assuming that LP-tree has i branched nodes, B_i is the pointer indicating the ith branched node, and C_{i,j} is a pointer to the jth child node of B_i. Hence, BNL consists of the branched node table = {B_1, B_2, . . ., B_i} and the child node lists = {{C_{1,1}, C_{1,2}, . . .}, {C_{2,1}, C_{2,2}, . . .}, . . ., {C_{i,1}, C_{i,2}, . . ., C_{i,j}}}. After matching them using the symbol →, we can denote BNL as follows, where {B_i → C_{i,1}, C_{i,2}, . . ., C_{i,j}} means that B_i indicates the set of child node pointers {C_{i,1}, C_{i,2}, . . ., C_{i,j}}:

BNL = {{B_1 → C_{1,1}, C_{1,2}, . . .}, {B_2 → C_{2,1}, C_{2,2}, . . .}, . . ., {B_i → C_{i,1}, C_{i,2}, . . ., C_{i,j}}}.  (2)

Fig. 2 shows the entire BNL structure, where the structure is mapped to Eq. (2). We can also express the set of child node pointers for B_i stored in BNL as the following equation:

Fig. 1. Structure of Linear Prefix Nodes (LPNs).


BNL(B_i) = {C_{i,1}, C_{i,2}, . . ., C_{i,j}}.

BNL has as many child node pointers as there are child nodes of the branch nodes. The child node pointers stored in BNL are sorted in item name order so that a binary search can be conducted. Thus, we can directly access internal child nodes with no branches, while the other child nodes, which involve branches, are accessed indirectly through BNL.
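One possible reading of this lookup in Python: a table mapping each branch-node pointer to its child pointers, kept sorted by the child's item name so the matching child can be located by binary search. The (lpn_id, node_index) pointer encoding and the sample entries are hypothetical:

```python
import bisect

# Sketch of a BNL: branch-node pointer -> child pointers sorted by the
# child's item name. Pointers and entries here are made-up examples.
bnl = {
    ("root", 0): [("LPN1", 0, "b"), ("LPN2", 0, "e")],
}

def find_child(branch_ptr, item):
    # Binary-search the sorted child list of a branch node.
    children = bnl.get(branch_ptr, [])
    keys = [child[2] for child in children]
    pos = bisect.bisect_left(keys, item)
    if pos < len(keys) and keys[pos] == item:
        return children[pos]
    return None  # no matching child: insertion would create a new LPN here

print(find_child(("root", 0), "e"))   # ('LPN2', 0, 'e')
print(find_child(("root", 0), "x"))   # None
```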

Example 1. Given the database shown in Table 1, when the 4 sorted transactions from TID 1 to TID 4 are inserted into an LP-tree, the tree has 4 LPNs: LPN_1 = {e,a,f,c,d,g,h}, LPN_2 = {b,a,f}, LPN_3 = {c,h}, and LPN_4 = {b,f,g}, where the root of the tree and the node of LPN_1 having item e become branch nodes. Then, B_1 of the branched node table in Fig. 2 becomes p_root. C_{1,1} becomes the second child node of the root, i.e., the node of LPN_4 with b, p_{4,1}. B_2 is assigned to the node of LPN_1 with e, denoted as p_{1,1}, and C_{2,1} becomes the node of LPN_2 with b, which is the second child node of the node of LPN_1 having e, denoted as p_{2,1}. Similarly, C_{2,2} is p_{3,1}, which is the node of LPN_3 with c.

Definition 4 (Header list). One of the elements of LP-tree, the header list, has the information needed for mining patterns from the tree, where the list consists of item names, item supports, and node links. The item name denotes the name of an item contained in the LP-tree, and the item support means the number of nodes with that item name in the tree. For example, if an item name is a and its support is 5, item a occurs 5 times in the tree. The node link is connected to the first node among all of the nodes with the same item in the tree, and the first node is then concatenated with the second node, and so on. When the connections terminate, one chain has been generated concatenating all the nodes with the same item.

    3.3. Constructing LP-tree

    In this section, we describe a method for creating LP-tree. Treeconstruction is conducted as follows. We scan a database and

    count all item supports. Thereafter, we sort all items in their sup-

    port descending order and then generate a corresponding header

    list, where the list stores items according to the sorted order.

    Namely, the upper items in the list have greater supports while

    the lower ones have smaller values. The insertion approach of

    LP-trees transaction is divided into the two cases. The former

    one is that the first transaction is inserted into the LP-tree. Its pro-

    cedure is as follows. First, we generate LP-tree by scanning the

    database again and sort the first transaction depending on the se-

    quence of the header list. That is, its items with smaller support

    than the minimum support are deleted, and the remaining items

    are sorted in support descending order. After that, we insert the

    sorted transaction into the tree, where LPN is created and con-

    nected to a root since the tree is initially empty. Then, the first

    transaction is entered into one LPN, which has internal array nodes

    as many as the transaction length. That is, if any transaction length

    is n and all items of the transaction are inserted in one LPN, the size

    of LPN isn + 1 including a header. We connect LPNs header to its

    parent after inserting the transaction items, where the header is

    linked to a root since the current LPN is initially added to the tree.

    We add a pointer of the root to thebranch node table and store the

    first node of the newly created LPN into the child node list con-

    nected to the root pointer. The second case is when all of the trans-

    actions except for the first one are added in the tree. Its insertion is

    performed as follows. We remove infrequent items in the inserted

    transaction and sort its frequent items in support descending or-

    der. Next, we add into BNL the addresses of the root and the first

    node (i.e. header) of the current LPN since the root makes a child

    node and thereby a branch occurs. Then, we insert the transaction

    comparing corresponding paths from the root. Thereafter, we con-

    firm all the child nodes of the root with BNL information since the

    previous transaction is already added in the tree, i.e. the root has

    one or more child nodes, where, we initially check the internal

    child node of the current LPN. If the item to be inserted is the same

    as the item of the checked node (the internal node), the current

    location moves to that node, and its support is increased by 1.

    Otherwise, to confirm the other child nodes, we read the corre-sponding branch information in BNL and then increase support of

    the current item by 1 if the item is equal to the node derived from

    BNL. If it is not equal to that, we generate a new LPN and insert

    remaining items of the transaction in the new LPN, where the cur-

    rent node becomes a branch node and is added in BNL. Assuming

    thatn is the length of any transaction and ris the number of items

    already inserted in the previous LPN, we store the remaining items

    in the new LPN at the same time, where the number of array nodes

    in the LPN is nr+ 1 (including a header). To store all transactions

    with no problems, LP-tree connects all of its nodes in one of two

    ways. First, internal nodes of LPN are directly connected to each

    other without any pointer. Second, when any branch occurs, LP-

    tree links corresponding child nodes utilizing BNL. Processing all

    transactions, we can gain a complete LP-tree. Once the LP-tree con-struction terminates, BNL is eliminated because it is not used any

    longer. LP-tree generated by the above processes can store all

    transactions in a given database, and all internal and external

    nodes of LPNs can be connected by the following Lemma 1.
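The two-case insertion procedure above can be condensed into a small executable model. This is a simplification for illustration only (LPNs as Python lists, pointers as (lpn index, node index) pairs, BNL as a dict keyed by branch pointer); it follows the described control flow but omits node links, the sorted child lists, and the final BNL removal:

```python
from collections import Counter

ROOT = ("root", 0)

def build_lp_tree(database, minsup_count):
    # Scan 1: count item supports and fix the support-descending order.
    counts = Counter(it for tx in database for it in tx)
    rank = {it: r for r, (it, c) in enumerate(
                sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])))
            if c >= minsup_count}

    lpns = []  # LPN i: {"parent": ptr, "nodes": [[item, support], ...]}
    bnl = {}   # branch pointer -> {first child item: child pointer}

    def new_lpn(parent, items):
        # Create all remaining nodes of a transaction at once (one array).
        lpns.append({"parent": parent, "nodes": [[it, 1] for it in items]})
        bnl.setdefault(parent, {})[items[0]] = (len(lpns) - 1, 0)

    # Scan 2: insert each transaction, filtered and sorted by header order.
    for tx in database:
        items = sorted((it for it in tx if it in rank), key=rank.get)
        cur, i = ROOT, 0
        while i < len(items):
            item = items[i]
            # (1) direct access to the internal child of the current LPN
            if cur != ROOT:
                li, ni = cur
                if (ni + 1 < len(lpns[li]["nodes"])
                        and lpns[li]["nodes"][ni + 1][0] == item):
                    lpns[li]["nodes"][ni + 1][1] += 1
                    cur, i = (li, ni + 1), i + 1
                    continue
            # (2) otherwise look up the branch children through BNL
            child = bnl.get(cur, {}).get(item)
            if child is not None:
                lpns[child[0]]["nodes"][child[1]][1] += 1
                cur, i = child, i + 1
                continue
            # (3) no match: store all remaining items in one new LPN
            new_lpn(cur, items[i:])
            break
    return lpns, bnl

lpns, bnl = build_lp_tree([["a", "b", "c"], ["a", "b", "d"], ["b", "e"]], 1)
print(lpns[0]["nodes"])  # [['b', 3], ['a', 2], ['c', 1]] -- shared prefix path
```

On this toy input, the shared prefix b, a lives in one LPN, and the divergent suffixes d and e each become a one-node LPN registered under their branch points.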

Lemma 1. We can access all internal nodes in an LPN without any pointer connecting parent nodes with child nodes, while nodes of other LPNs can be linked through the LPN's header and BNL.

Proof. Since an LPN is composed of array nodes, we can find nodes directly without pointers due to the characteristics of the array. Given any node d, its parent node and child node are denoted as d − 1 and d + 1, respectively. However, d + 2 indicates d's descendant node, not d's second child node, since the array has only one child
    Fig. 2. Structure of Branch Node List (BNL).

Table 1
A transaction database.

TID | Original items | Sorted items
1 | a, c, d, e, f, g, h | e, a, f, c, d, g, h
2 | a, b, e, f | e, b, a, f
3 | c, e, h | e, c, h
4 | b, f, g | b, f, g
5 | a, b, d, e, g | e, b, a, d, g
6 | e, g | e, g
7 | b, c, e, f | e, b, f, c
8 | a, b, c, e, f | e, b, a, f, c
9 | a, d, e | e, a, d
10 | b, d, e | e, b, d



is allocated, both of the trees use O(|freq(Trans)|) with respect to the time to store item information. Thus, the runtime after storing the transaction becomes O(2 · |freq(Trans)|) in FP-tree and O(|freq(Trans)| + 1) in LP-tree. In other words, since FP-tree generates and records nodes one by one, it uses time O(2 · |freq(Trans)|). However, since LP-tree generates its nodes at once, node generation needs only O(1), and the total time becomes O(|freq(Trans)| + 1) after adding the information recording time. In the transaction insertion step, LP-tree is more efficient than FP-tree when the nodes related to the items of an inserted transaction have 3 or fewer child nodes on average, according to the following Lemma 2.

    Lemma 2. Let a and b be search times needed when we insert acertain transaction in LP-tree, and FP-tree, respectively. Then, it is true

    thata < bif the average number of child nodes related to the inserted

    transaction is not greater than 3.

    Proof. In the transaction insertion, we first confirm whether or not

    a certain itemto be inserted after the current node exists among its

    child nodes, and then, the current position moves to a correspond-

    ing node or a new child node is created according to the result. Let

    n,c, and K be the number of nodes which we have to visit to inserta transaction, the number of child nodes for each visited node, and

    a set of c, respectively. Then, K is denoted as the followingequation.

    K fc1; c2; . . . ; cng; cP 1:

    To check whether the next inserted item exists among the child

    nodes, FP-tree finds child nodes through the binary search method.

    Since it accesses child node pointers and then visits corresponding

    nodes, these processes require 2 Pn

    i1lgcitimes, andn times areadditionally considered because we should move to the next nodes

    as many as the number of the visited nodes, n. Therefore, the total

    search time of FP-tree, b is as follows.

    b 2 Xni1

    lgci n:

LP-tree directly accesses the internal child node of the current LPN, while it indirectly traverses the other child nodes through BNL. That is, LP-tree first checks whether the item to be inserted is equal to that of the internal node, and then it accesses the other child nodes using BNL if there is no matching item. Since searching for child nodes in BNL is based on binary search as in FP-tree, LP-tree needs n steps for traversing the child nodes in BNL and 2·Σ_{i=1}^{n} lg(c_i − 1) steps to search BNL, where LP-tree needs lg(c_i − 1) instead of lg(c_i) due to the advantage of the internal nodes. In addition, when the current location moves to the next inserted nodes, LP-tree requires n − 1 steps, not n, since it can directly access the nodes if they exist in the current LPN. Accordingly, the total time of LP-tree, a, is denoted as follows:

a = 2·Σ_{i=1}^{n} lg(c_i − 1) + (n − 1) + n.

The relation a < b is then equal to

2·Σ_{i=1}^{n} lg(c_i − 1) + (n − 1) + n < 2·Σ_{i=1}^{n} lg c_i + n.

Solving this is as follows:

Σ_{i=1}^{n} lg c_i − Σ_{i=1}^{n} lg(c_i − 1) > (n − 1)/2,

Σ_{i=1}^{n} (lg c_i − lg(c_i − 1) − 1/2) > −1/2.

In the above inequality, each term lg c_i − lg(c_i − 1) − 1/2 should not be smaller than 0 for the formula to be guaranteed. Thus, if c_i is less than approximately 3.414215 (= 2 + √2), the formula lg c_i − lg(c_i − 1) > 1/2 is satisfied. Consequently, it is certain that a < b when the average number of child nodes is not greater than 3. □

In Section 4.2, we will show the experimental results of calculating the average number of child nodes for the LP-trees generated from various datasets, where we will see that the number does not exceed 3 in any case.

Next, we compare runtimes with regard to searching the trees. FP-tree uses O(1) since it can find any target node directly using the node link. LP-tree also consumes O(1) because of the node link. However, a difference occurs when the two trees search from the item selected by the node link up to the root. Here, we have to consider the time cost depending on whether the current structure is an array-based or a pointer-based form. In the array-based form (LP-tree), all data is stored contiguously; therefore, it can directly access any node by addressing the corresponding real memory location. On the other hand, using the pointer-based form (FP-tree), we have to access a node indirectly: we first confirm where the pointer is stored, and then we approach the corresponding memory. That is, the first requires one access, while the second needs two tasks [4]. Thus, assuming that t is the memory access time, the direct approach uses O(t) while the indirect one uses O(2t). Therefore, considering all of the above results, we know that LP-tree is faster than FP-tree in most cases.

Fig. 4. A state of LP-tree after inserting 310 transactions additionally to Fig. 3.

130 G. Pyun et al. / Knowledge-Based Systems 55 (2014) 125–139
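The one-access versus two-access contrast described above can be illustrated with hypothetical stand-ins for the two storage forms (the names are ours, not the paper's implementation):

```python
# Contiguous (LPN-style): the parent of the node at slot k in the same array
# is simply slot k - 1, so a single direct access suffices.
lpn = ["i1", "i2", "i3", "i4"]  # array nodes, root side first

def parent_in_lpn(index):
    return lpn[index - 1] if index > 0 else None  # one memory access

# Pointer-based (FP-tree-style): each node stores a reference to its parent,
# so we first read the pointer field and then dereference it: two accesses.
class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.parent = parent  # pointer field read before the node itself

root = FPNode("root")
n1 = FPNode("i1", root)
n2 = FPNode("i2", n1)

assert parent_in_lpn(2) == "i2"      # direct slot access
assert n2.parent.item == "i1"        # pointer read, then dereference
```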

3.4.2. Integrating LPN

In the previous section, we learned how to create LP-tree. However, this method can cause fragmentation of LPNs since each transaction is processed individually without comprehensive considerations. That is, a transaction may be stored in multiple LPNs even though it could be inserted into only one LPN. For example, assuming that we insert two transactions, {a, b, c} and {a, b, c, d}, into an empty LP-tree, the first one is fully stored in one LPN. Thereafter, for the second transaction, a branch occurs at item c; a new LPN is then created, and the remaining item d is added to that LPN. Thus, the second LPN has a small number of array nodes. To generate as many internal nodes as possible for each LPN, we can consider attaching nodes at the very end of the LPN. Through the LPN integrating operations, certain nodes are inserted at the very end of an LPN. Let I = {i_1, i_2, ..., i_n} be any itemset to be added and a be an item of the internal nodes of the LPN, i.e., LPN = {⟨Parent_Link⟩, ⟨a_1, S, L, b⟩, ⟨a_2, S, L, b⟩, ..., ⟨a_m, S, L, b⟩}, m < n.

Then, in order to apply the operation, the following conditions have to be satisfied: (1) the length of the inserted itemset is longer than that of the target LPN (i.e., m < n); (2) the items of the internal nodes in the LPN are equal to the upper part of the inserted items; and (3) the sequence of the common part is consistent (i.e., i_1 = a_1, i_2 = a_2, ..., i_m = a_m, 1 ≤ m < n). If these conditions are completely satisfied, we conduct the item insertion according to the following process: (1) the supports of the common part between them increase by 1; (2) we assign an array for a new LPN with the length computed from the inserted items; (3) all the nodes of the previous LPN are inserted into the new LPN; (4) the remainder of the itemset is added at the very end of the new LPN; and (5) the previous LPN is deleted. By applying this technique, we can make LPNs have more array nodes compared with the previous LPNs. If the shapes of the transactions composing a database are similar to each other, the LPN integrating operations are needed more often, and the LPNs' lengths become longer whenever these operations are performed. Since the LPN integrating technique is used only when the length of an inserted transaction is longer than that of the target LPN, the longer the LPN is, the lower the possibility of an LPN integrating operation becomes.
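The three conditions and five steps above can be sketched as follows. This is a simplified, hypothetical model in which an LPN is just a list of [item, support] array nodes; parent links, node links, and BNL bookkeeping are omitted.

```python
def try_integrate(lpn, itemset):
    """Merge `itemset` into `lpn` if the three integrating conditions hold;
    return the (possibly new) LPN."""
    m, n = len(lpn), len(itemset)
    # Conditions (1)-(3): itemset longer than LPN, and common prefix matches
    if m < n and all(lpn[i][0] == itemset[i] for i in range(m)):
        for node in lpn:                    # step (1): raise common supports
            node[1] += 1
        new_lpn = [list(node) for node in lpn]           # steps (2)-(3): copy
        new_lpn += [[item, 1] for item in itemset[m:]]   # step (4): append rest
        return new_lpn                      # step (5): caller drops the old LPN
    return lpn                              # conditions not met: no change

lpn = [["a", 1], ["b", 1], ["c", 1]]
merged = try_integrate(lpn, ["a", "b", "c", "d"])
assert merged == [["a", 2], ["b", 2], ["c", 2], ["d", 1]]
```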

Example 4. Let {a, b, c, d, e, f} be a set of items to be inserted into LP-tree. Then, assume that, as shown in Fig. 5(a), LPN_1 is connected to the root, and LPN_2 and LPN_3 are linked to the node with b in LPN_1. Inserting the set of items without the LPN integrating technique, the corresponding LP-tree is shown in Fig. 5(b); in short, the number of LPNs increases from 3 to 4. In contrast, using the technique, the resulting LP-tree becomes Fig. 5(c). That is, the number of LPNs is not increased since LPN_1 is rebuilt through the series of tasks mentioned above.

3.5. Mining frequent patterns based on LP-tree (LP-growth)

LP-growth searches LP-tree and creates a conditional LP-tree for mining frequent patterns. To do that, our algorithm first selects the bottom item from the header list and traverses the nodes connected to the corresponding node links. Then, the supports of the visited nodes are stored, and the nodes from each linked node up to the root are searched. Each node can be accessed directly if the search is conducted within one LPN. In other words, given a current node N_k, the algorithm immediately accesses N_{k−1} to reach the parent node of N_k. Iterating the traversal over one LPN, the algorithm reaches the header of the LPN, which refers to its parent node, i.e., another LPN. LP-growth stops if the next position of the header is the root; otherwise, it continues to find nodes by tracking the other LPN linked from the header. If a header indicates the root, the corresponding path has been searched completely, and the items in the path become a conditional transaction with the support of the first visited node. After visiting all of the other nodes referred to by the node links, the algorithm constructs a conditional pattern base (conditional database) from the obtained results. After that, we compute the item supports in the conditional database, and items are eliminated from the database if their supports are less than the given minimum support threshold. Each transaction of the conditional database is sorted in support-descending order, and then a new LP-tree is generated from the sorted database. The new one becomes a conditional LP-tree and includes a prefix itemset, i.e., a frequent item or pattern selected in the previous phase.

Fig. 5. Item insertion applying the LPN integrating strategy.

If a certain LP-tree forms a single path, all combinations of the tree are considered as frequent patterns, in common with the FP-growth approach. Therefore, in this case, the algorithm extracts frequent patterns by joining the prefix itemset and each of the combinations.
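The single-path shortcut above can be sketched in a few lines. This is an illustrative Python stand-in, not the paper's C++ implementation; `prefix` and `path` are hypothetical names.

```python
from itertools import combinations

def single_path_patterns(prefix, path):
    """If a conditional LP-tree collapses to a single path, every combination
    of the path items joined with the prefix is a frequent pattern."""
    patterns = []
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            patterns.append(tuple(prefix) + combo)
    return patterns

# Path {a, b} under prefix {e} yields {e,a}, {e,b}, and {e,a,b}:
pats = single_path_patterns(("e",), ("a", "b"))
assert pats == [("e", "a"), ("e", "b"), ("e", "a", "b")]
```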

Searching trees in FP-tree generally requires numerous pointer uses since it must use a pointer to move from any node to another. Meanwhile, LP-tree can minimize the number of pointer uses through the LPN strategy, which is proved in the following Lemma 3.

Lemma 3. When any tree is traversed in a bottom-up manner, the number of pointer uses in LP-tree is lower than or equal to that of FP-tree.

Proof. Assuming that n is the length of a certain path from any node to the root, FP-tree needs n pointer uses in any case since it has to pass through n pointers along the path. LP-tree, in contrast, consists of one or more LPNs and uses pointers (i.e., headers) only when new branches occur. Thus, LP-growth uses a pointer only when visiting the header of each LPN. Here, we can consider the two cases shown in Fig. 6(a) and (b). The first is when every visited LPN has one node. Let |K_c| be the number of headers, i.e., LPNs, and |N| be the number of nodes. Then, the number of visited locations, R, is denoted as R = |K_c| + |N|. In the first case in Fig. 6, we have to visit the headers of the LPNs |N| times. FP-tree refers to the variables in which the parent pointers are stored so as to access the parent nodes, as shown in Fig. 6(c). Therefore, in FP-tree, the total number of pointer uses from a certain node to the root becomes 2·|N|, considering not only the pointer accesses but also the node visits. In case 1, LP-tree refers to headers in order to approach parent nodes just as FP-tree does, where R = 2·|N| because |K_c| = |N|. Therefore, this is regarded as the worst case, and LP-tree needs as many pointer uses as FP-tree. The number decreases continuously as the size of the LPNs increases. In case 2 of Fig. 6, we can calculate R = 1 + |N| since all the visited nodes belong to a single LPN and we visit only one header. Thus, if the path from any node to the root lies within one LPN, we visit one header regardless of the number of items. This is the best case: FP-tree uses |N| pointers while LP-tree needs only one pointer use. □

Fig. 6. Cases of tree searches in LP-tree and FP-tree.

Fig. 7. Algorithm for LP-tree construction.

Example 5. In Fig. 6, (a) and (b) show certain LPNs in LP-tree, and (c) is a part of FP-tree. Note that they contain the same data (items). When they search items from i4 to their own roots, Fig. 6(a) traverses the LPNs in the following sequence: ⟨i4, 1, NULL, false⟩, ⟨gp_{3,1}⟩, ⟨i3, 1, NULL, false⟩, ⟨gp_{2,1}⟩, ⟨i2, 1, NULL, false⟩, ⟨gp_{1,1}⟩, ⟨i1, 1, NULL, false⟩, and ⟨gp_root⟩. Thus, the number of memory accesses (i.e., pointer uses) needed for searching the nodes is 8. Meanwhile, Fig. 6(b) searches them in the following sequence: ⟨i4, 1, NULL, false⟩, ⟨i3, 1, NULL, false⟩, ⟨i2, 1, NULL, false⟩, ⟨i1, 1, NULL, false⟩, and ⟨gp_root⟩, thereby accessing the memory 5 times. Since Fig. 6(c) uses pointers to traverse 4 nodes, it needs 4 pointer accesses and 4 node accesses; thus, its total number of memory accesses is 8.

3.6. LP-growth algorithm

To mine frequent patterns, LP-growth constructs LP-tree with the algorithm Construct_LP-tree shown in Fig. 7. The algorithm first scans D to calculate the items' supports and generates a header list (lines 1–3), and thereafter, D is scanned again to construct LP-tree (line 4). LP-growth performs item insertion starting from the root (lines 5–42). If there is an item matched with one of the child nodes of the root in BNL (line 6), the algorithm moves to the corresponding node and increases its support (line 7). Otherwise, a new LPN is generated (lines 9–12). After that, the algorithm confirms whether the current node, gp_{c,r}, is a branch node (line 15). If gp_{c,r} is not a branch node, it has one child node or none, so the algorithm refers to the next node, gp_{c,r+1}. It increases the corresponding node's support by 1 if gp_{c,r+1} is equal to i_k, i.e., the child node has the same item as the inserted one (line 17). If they are not equal, LP-growth creates a new LPN, where gp_{c,r} becomes a new branch node (lines 25–29). After generating the new LPN with the size of the inserted itemset, the algorithm inserts the remaining items into the new LPN (line 26). Then, its branch information is recorded in BNL (line 27). In the LPN integrating procedure (lines 18–23), LP-growth conducts the LPN integrating operations if there are remaining items after the insertion into the current LPN. Since i_n is the last item, if i_k is not the last one, there are still items to be inserted. The algorithm generates a new LPN with the length of the inserted itemset (line 19) and copies the node information of the previous LPN to the new one (line 20). Thereafter, it stores the remaining items in the new LPN (line 21) and deletes the previous LPN (line 22). When the current node is a branch node, the steps corresponding to the branch node operations are performed (lines 31–42). If the next array node has an item equal to the inserted item (line 32), the item's support is increased by 1, and the current position moves to the next one (line 33). In case gp_{c,r+1} is not equal to the item to be inserted, the algorithm checks the child nodes with BNL. LP-growth finds the location corresponding to gp_{c,r} in BNL and searches the child nodes. If any of the found child nodes has an item matched with i_k (line 34), the algorithm increases its support by 1 and regards the corresponding node as the new current node (line 35). If no such node exists in BNL, the algorithm makes a new LPN and inserts the remaining items (lines 38–42).

Fig. 8 shows the overall LP-growth algorithm, which is performed as follows. First, LP-growth checks whether the current LP-tree is a single path (line 1). If so, the algorithm combines all the items in the path with the prefix (lines 2–4), and the results become valid frequent patterns. On the other hand, if the tree has multiple paths, the algorithm traverses the current tree and creates a conditional LP-tree (lines 5–19). Our LP-growth first selects an item, i, from the header list in order to search the tree (line 5), and it finds the nodes from each node with the selected item up to the root using the corresponding node links (line 6). The items of the visited nodes are stored into L (line 10), where LP-growth directly accesses the immediately preceding node if the current location is inside an LPN (line 11). If the current gp_{c,r} is the first node (header) of an LPN (line 12), the current position shifts to the parent node pointer stored in the header of the LPN (lines 13–14). Iterating the traversal, we obtain all conditional transactions including i, and the set of conditional transactions becomes a conditional database, L′ (line 15). The Construct_LP-tree procedure is called to generate a conditional LP-tree (line 16). After that, the algorithm removes BNL since it is no longer needed in the current step (line 17) and then calls LP-growth recursively to extend the pattern (line 19).
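To make the overall divide-and-conquer flow concrete, here is a simplified pattern-growth sketch over a plain list-of-transactions database. It mirrors the recursion described above (select a frequent item, project the conditional database, recurse) but is not the LP-tree implementation; the names and the canonical item ordering are our own simplifications.

```python
from collections import Counter

def pattern_growth(db, prefix, minsup, results):
    # Count each item once per transaction in the (conditional) database.
    counts = Counter(item for tx in db for item in set(tx))
    for item in sorted(counts):
        sup = counts[item]
        if sup < minsup:
            continue
        pattern = prefix + (item,)
        results[pattern] = sup
        # Project on `item`: keep only items after it in the canonical
        # (sorted) order, from transactions that contain it.
        projected = [[x for x in tx if x > item] for tx in db if item in tx]
        pattern_growth(projected, pattern, minsup, results)

results = {}
pattern_growth([["a", "b", "c"], ["a", "b"], ["a", "c"]], (), 2, results)
assert results[("a",)] == 3
assert results[("a", "b")] == 2 and results[("a", "c")] == 2
```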

4. Performance evaluation

In this section, we present experimental results comparing our algorithm, LP-growth, with the state-of-the-art algorithms. In order to show that the experiments are reasonable, we evaluate their performance based on three important criteria: runtime, memory usage, and scalability. In addition, we also present tests of the average number of child nodes to prove the efficiency of LP-tree.

4.1. Test environment and datasets

LP-growth is written in C++, compiled with gcc 3.4.4, and run on a 3.3 GHz Intel processor with 8 GB of memory under Windows 7. Based on this environment, we compare our algorithm, LP-growth,

Fig. 8. LP-growth algorithm.


number with various settings of the minimum support threshold for each dataset. The average number of child nodes is the sum of the numbers of child nodes over all nodes except the leaf nodes, divided by the number of all nodes excluding the leaves; the leaf nodes are not considered since they have no child nodes. Figs. 9 and 10 show the results for the dense datasets (Chess, Connect, and Pumsb) and the sparse ones (Retail, Kosarak, T10I4D1000K, BMS-WebView-1, and Chain-store), respectively. From these results, we can observe that all of the LP-trees generated from these datasets have an average number of child nodes (ACN) of less than 3 regardless of the minimum support threshold. Note that, when the minimum support is 10%, the results of T10I4D1000K, BMS-WebView-1, and Chain-store are not shown because all the items mined from these datasets have smaller supports than the given minimum support threshold. That is, no tree is constructed and no frequent patterns are generated under this threshold setting.
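The ACN metric described above can be computed with a few lines. This is a sketch; the dict-of-children tree representation is ours, not the paper's.

```python
def average_child_nodes(tree):
    """ACN: sum of child counts over non-leaf nodes, divided by the number
    of non-leaf nodes. `tree` maps each node to its list of children."""
    internal = [kids for kids in tree.values() if kids]  # skip the leaves
    return sum(len(kids) for kids in internal) / len(internal)

tree = {
    "root": ["a", "b"],
    "a": ["c", "d", "e"],
    "b": ["f"],
    "c": [], "d": [], "e": [], "f": [],
}
assert average_child_nodes(tree) == 2.0  # (2 + 3 + 1) / 3
```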

4.3. Runtime evaluation

Figs. 11–18 show the results of the runtime experiments on the real and synthetic datasets shown in Tables 2 and 3, respectively. In these figures, we can observe that our LP-growth outperforms the others in almost all of the cases. LP-growth uses the proposed linear structure for its trees instead of the previous tree form in order to minimize the access times needed to search nodes. As a result, its advantages have a positive effect on reducing runtime in all the experiments. Especially as the minimum support threshold becomes lower, the difference in runtime between our algorithm and the others becomes bigger.

FP-growth shows the worst performance with respect to all the datasets except for Retail. The mining times of the FP-growth algorithm are 3305 s when the minimum support threshold is 80% for the Connect dataset and 1065 s when the threshold is 70% for the Pumsb dataset. Note that the algorithm did not operate normally for the Kosarak dataset because its memory usage exceeded the limit allowed in our test environment. As an improved version of FP-growth, FP-growth-goethals shows better performance than the FP-growth algorithm on the dense datasets such as Connect, Pumsb, and Chess. Since FP-growth-tiny can reduce the size of the FP-tree, it presents more improved runtime performance compared to the above two algorithms. Its speed is also similar to that of FP-growth in many cases but lags behind that of CT-PRO and our LP-growth. CT-PRO has outstanding runtime performance in many cases due to its technique, which focuses on increasing mining speed by storing and utilizing additional data in a bit form. However, the CT-PRO algorithm generally uses enormous amounts of memory with respect to almost all of the used datasets, and thus, its overall efficiency falls remarkably behind that of LP-growth. Especially when the threshold was less than 0.1% for Chain-store and 0.7% for Kosarak, the algorithm failed to mine frequent patterns successfully because of heavy memory consumption that our system could not bear. For this reason, the results of CT-PRO for these datasets are not shown in our graph figures. FP-growth* shows fine runtime results in general since it uses its own technique, FP-array, for reducing the number of tree traversals. Nevertheless, the efficiency of FP-growth* falls behind that of our algorithm, as shown in the figures, since the benefit of LP-tree is higher than that of FP-array. MAFIA-FI requires more execution

Fig. 12. Runtime test (Retail).
Fig. 13. Runtime test (Pumsb).
Fig. 14. Runtime test (Kosarak).
Fig. 15. Runtime test (Chess).


time than that of LP-growth and FP-growth in all the cases. Especially on the sparse datasets such as Retail, Kosarak, T10I4D1000K, and BMS-WebView-1, MAFIA-FI performs worse than FP-growth. Moreover, we could not evaluate the runtime performance of the algorithm for Chain-store due to its enormous memory consumption. The reason is that it uses a vertical bitmap representation to increase mining performance, but this effect does not apply to sparse datasets, in contrast to dense ones. In Fig. 16, MAFIA-FI needs 1054 s with a minimum support of 0.01%, so its runtimes are not shown in these figures since we can infer that the algorithm continues to have the worst performance in all the cases of the T10I4D1000K dataset. This tendency is also similarly represented in Fig. 12. From all of the runtime results shown in Figs. 11–18, we can observe that LP-growth and CT-PRO present outstanding runtime performance. However, CT-PRO is unstable and not available in several cases, as shown in the figures, because it requires a very large amount of memory. Therefore, LP-growth is regarded as better than CT-PRO in terms of the overall capability of mining frequent patterns.

4.4. Memory usage evaluation

Fig. 16. Runtime test (T10I4D1000K).
Fig. 17. Runtime test (Chain-store).
Fig. 18. Runtime test (BMS-WebView-1).
Fig. 19. Memory test (Connect).
Fig. 20. Memory test (Retail).
Fig. 21. Memory test (Pumsb).

In this section, we evaluate memory usage for each algorithm with the same datasets as in the runtime tests. In Figs. 19–26, LP-growth and FP-growth present outstanding memory performance, while CT-PRO shows the worst results in almost all cases.

Although our algorithm does not show the best memory usage in a few cases, it guarantees memory consumption as good as that of the state-of-the-art algorithm, FP-growth. Moreover, our algorithm presents the most outstanding results in many cases. Especially in Fig. 20 for the Retail dataset, our LP-growth outperforms the others, including FP-growth, in all of the cases. For the Kosarak dataset, CT-PRO could not operate normally when the minimum support threshold was 0.6% or less, and for the Chain-store dataset, it did not run successfully when the threshold was 0.01% or less since it consumed too much memory, which is the reason why the results of the algorithm are not included in Figs. 22 and 25. In addition, CT-PRO uses a lot of memory compared to the other algorithms. Meanwhile, since LP-growth uses very little memory compared to CT-PRO and the others, it can be more effective in memory-constrained environments.

FP-growth also requires a lot of memory in many cases, in common with the CT-PRO algorithm. FP-growth used too much memory while mining frequent patterns, so we could not express the results in the graph figures for the following situations: it used 833 MB when the minimum support threshold was 80% for the Connect dataset and 1633 MB when the threshold was 80% for the Pumsb dataset. Furthermore, FP-growth could not mine patterns normally when the threshold was 0.7% or less for the Kosarak dataset since it required more memory than the limit allowed by our system. In addition, the algorithm consumed more memory than CT-PRO with respect to the Chess dataset, as shown in Fig. 23. Meanwhile, FP-growth-goethals, which is an optimized version of FP-growth, showed relatively fine memory efficiency, although its performance still lags behind that of ours. Due to its technique for saving memory space by not generating conditional databases, FP-growth-tiny reduces memory usage for the relatively large datasets such as Pumsb and Kosarak compared to FP-growth-goethals, although this effect is not available for the small datasets such as Connect and Chess. Since the LP-tree of LP-growth minimizes its tree sizes by using the linear structure, it guarantees outstanding performance, as shown in the figures. Moreover, LP-growth has almost constant and stable memory consumption regardless of the threshold in comparison to the other algorithms. Meanwhile, MAFIA-FI requires relatively large amounts of memory since the bitmap proposed in the algorithm needs more memory resources than the structures used by the others, such as LP-tree and FP-tree. FP-growth also shows these characteristics in many cases, but it does not guarantee them for the Retail dataset shown in Fig. 20. As in the previous runtime test in Fig. 16, the memory results of MAFIA-FI are not provided in Fig. 24. The reason is that, when the minimum support is 0.01%, it consumes 487 MB to perform its mining operations, and therefore, we can expect that the algorithm requires even more memory for the other minimum supports; thus, we do not need to denote the results in the figure.

Fig. 22. Memory test (Kosarak).
Fig. 23. Memory test (Chess).
Fig. 24. Memory test (T10I4D1000K).
Fig. 25. Memory test (Chain-store).
Fig. 26. Memory test (BMS-WebView-1).

4.5. Scalability evaluation

Figs. 27–30 show the results of the scalability tests performed with the datasets in Table 3. Note that MAFIA-FI is excluded from the test on the datasets with increasing numbers of transactions since it needs longer runtimes and more memory than the others in these tests, so it is hard to express its scalability results in the figures together with those of the other algorithms. In addition, FP-growth, MAFIA-FI, and CT-PRO are not evaluated in the test on the other datasets with increasing numbers of items because they cannot run normally on these datasets due to lack of memory. The minimum support threshold is fixed at 0.1% in these tests. In Fig. 27, the runtime increase of LP-growth is far smaller and more stable than that of the others since LP-tree allows LP-growth to perform mining operations effectively regardless of the increase in transactions. FP-growth* shows better scalability than FP-growth and FP-growth-goethals due to its special structure, FP-array, although its efficiency is lower than ours. FP-growth-tiny also presents fine scalability similar to that of FP-growth. CT-PRO shows an outstanding scalability result, but our LP-growth is still better. Our algorithm also guarantees the best runtime scalability in the test shown in Fig. 29. FP-growth-goethals has the worst result, while FP-growth* shows fine runtime scalability, although its performance falls behind ours. FP-growth-tiny presents good performance similar to that of FP-growth* in the beginning, but its scalability becomes drastically worse as the number of items gradually increases. In Fig. 28, all of the algorithms have almost constant memory usage since all of them are tree-based algorithms, but their absolute values differ from each other; in particular, LP-growth presents the smallest memory consumption due to the advantages of LP-tree. On the other hand, Fig. 30 shows results different from Fig. 28. Since the necessary tree sizes gradually become larger as the number of attributes is increased, the memory usages of the algorithms become bigger, as shown in the figure. However, LP-growth shows the best memory scalability while the others have relatively poor performance, which indicates that our LP-tree can store these increasing attributes more efficiently than the structures of the competitor algorithms. Through the above experimental results, we know that the proposed algorithm, LP-growth, outperforms the others with respect to increasing transactions and items in terms of scalability as well as runtime and memory usage for the real datasets.

Fig. 27. Scalability test (Runtime).
Fig. 28. Scalability test (Memory).
Fig. 29. Scalability test (Runtime).
Fig. 30. Scalability test (Memory).

5. Conclusion

In this paper, we proposed a new tree structure, LP-tree, and an algorithm, LP-growth, applying it to the mining process. The main goal of the proposed algorithm was to reduce not only the memory usage needed for building trees but also the time to traverse them by applying a linear structure instead of the previous form used in FP-growth. LP-tree contributed to improving the performance of frequent pattern mining since it spent less memory generating nodes compared to FP-tree and accessed them without any pointers in many cases. Our experimental results showed that LP-growth presents outstanding performance in terms of runtime, memory usage, and scalability. We could also observe that our algorithm outperformed the previous algorithms, especially in the runtime experiments, due to the reduced pointer accesses. The techniques and strategies described in this paper can be applied not only to general frequent pattern mining but also to a variety of pattern mining fields such as closed/maximal pattern mining, top-k pattern mining, and graph mining. We expect that this future research will lead to improvements in mining performance in various areas.

Acknowledgements

This research was supported by the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF Nos. 2013005682 and 20080062611).

