[ieee 2014 sixth world congress on nature and biologically inspired computing (nabic) - porto,...
Genetic Algorithm versus Memetic algorithm for
Association Rules Mining
Habiba Drias LRIA, USTHB
BP 32 El Alia Bab Ezzouar, Algiers, Algeria
Abstract: This paper deals with association rules mining using evolutionary algorithms. All previous bio-inspired association rules mining approaches generate non-admissible rules, which cannot be exploited by the end-user. To cope with this issue, we propose two approaches that avoid non-admissible rules by developing a new strategy called delete and decomposition. If an item appears in both the antecedent and the consequent parts of a given rule, the rule is decomposed into two admissible rules: the item is deleted from the antecedent part of the first rule and from the consequent part of the second rule. Based on this strategy, we design a genetic algorithm called IARMGA and a memetic algorithm called IARMMA for association rules mining. Several experiments were carried out using both synthetic and real instances. The results reveal a compromise between the execution time and the quality of the output rules: IARMGA is faster than IARMMA, whereas the latter outperforms the former in terms of rule quality.
Keywords: association rules mining, bio-inspired algorithms, genetic algorithm, memetic algorithm.
I. INTRODUCTION
Association Rules Mining (ARM) is one of the most important
and well-studied data mining techniques. It
aims at extracting frequent patterns, associations or causal
structures among sets of items from a given transactional
database. Formally, the association rule problem is defined
as follows: let $T = \{t_1, t_2, \ldots, t_n\}$ be a set of $n$ transactions
representing a transactional database, and $I = \{i_1, i_2, \ldots, i_m\}$ be a set of $m$
different items or attributes. An association
rule is an implication of the form $X \rightarrow Y$ where $X \subset I$, $Y \subset I$
and $X \cap Y = \emptyset$. The itemset $X$ is called the antecedent
while the itemset $Y$ is called the consequent, and the rule means
$X$ implies $Y$. The intuitive meaning is that the set $\{X, Y\}$ is
a frequent pattern.
ARM consists in discovering a set of association rules
covering a large percentage of the data and tends to produce an
important number of rules. Nowadays, since the databases are
increasingly large, the user no longer looks for all the rules
but only a subset of useful ones. Two basic parameters are
commonly used for measuring the usefulness of association
rules, namely the support and the confidence of a rule. The
support of an itemset $I' \subseteq I$ is the number of transactions
containing $I'$. The support of a rule $X \rightarrow Y$ is the support of
$X \cup Y$, expressed as the percentage of transactions containing $X$ and
$Y$ together. It is computed as:
$$\frac{\mathrm{support}(X \cup Y)}{|T|}$$
978-l-4799-5937-2114/$31.00 ©2014 IEEE 208
The confidence of a rule is calculated as:
$$\frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}$$
and measures the strength of the rule. An association rule
$X \rightarrow Y$ with a confidence of 80% means that 80% of the
transactions that contain $X$ also contain $Y$. Therefore, ARM
aims at extracting from a given database all interesting rules,
that is, rules that have a support $\geq$ MinSup and a confidence $\geq$ MinConf,
where MinSup and MinConf are two thresholds predefined by
users [1].
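The two measures can be illustrated with a minimal Python sketch. The function names and the toy database below are assumptions made for illustration only; they are not taken from the paper's C++ implementation.

```python
from typing import Iterable, List, Set


def support_count(itemset: Iterable[int], transactions: List[Set[int]]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def rule_support(x: Iterable[int], y: Iterable[int],
                 transactions: List[Set[int]]) -> float:
    """support(X u Y) / |T|: fraction of transactions containing X and Y."""
    return support_count(set(x) | set(y), transactions) / len(transactions)


def rule_confidence(x: Iterable[int], y: Iterable[int],
                    transactions: List[Set[int]]) -> float:
    """support(X u Y) / support(X); defined as 0 here when X never occurs."""
    count_x = support_count(x, transactions)
    if count_x == 0:
        return 0.0
    return support_count(set(x) | set(y), transactions) / count_x


# Hypothetical toy database: five transactions over the items 1..3.
toy_db = [{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]
print(rule_support([1], [2], toy_db))     # 0.6  (3 of 5 transactions)
print(rule_confidence([1], [2], toy_db))  # 0.75 (3 of the 4 containing item 1)
```

Returning 0 for a rule whose antecedent never occurs is a convention chosen for this sketch; the paper does not discuss that case.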
Nature-inspired approaches have attracted a lot of attention
in the artificial intelligence community. They are mainly based
on several inspiration sources such as swarm intelligence
(SI-based), biological systems (bio-inspired), and physical and
chemical phenomena [5]. The relationship between such methods
can be roughly summarized as follows:
SI-based $\subset$ bio-inspired $\subset$ nature-inspired
In this work, we investigate two bio-inspired algorithms,
namely genetic and memetic algorithms, for association rules
mining. In a standard genetic algorithm [2], an initial
population of individuals is drawn. The principle is to make this
population evolve towards another one with better
performance. To generate new high-quality individuals,
genetic operators such as crossover and mutation are applied. This
process is repeated until a given criterion is reached.
In the last decade, another approach based on the genetic
algorithm, called the memetic algorithm [3], was proposed.
The idea is to integrate in each new individual a cultural
concept perceived from the environment. Most memetic models
simulate this feature by a simple local search applied to such
new individuals.
In the literature, several ARM algorithms based on
evolutionary approaches were proposed. However, they all present
a serious drawback, which is the non-admissibility of the
rules [8],[9],[10],[12]. To overcome this issue, we address the
problem with a new modelling and then adapt it to the two
mentioned evolutionary algorithms.
The rest of this paper is organized as follows. The next
section provides the state of the art for ARM algorithms.
Section III and Section IV describe respectively the algorithms
IARMGA and IARMMA. Section V presents the experimental
study and the obtained outcomes. Finally, in Section VI we
conclude this work by outlining the overall achievements and
suggest future perspectives.
II. RELATED WORKS
Several algorithms have been designed for the ARM problem.
Some of them are based on exact approaches like Apriori [4] and
FP-growth [6]. These algorithms are time consuming when
applied to large data sets.
As an alternative, bio-inspired algorithms like
GAR [8], GENAR [9] and ARMGA [10] were proposed. These
genetic algorithms share a main limitation, which is the inefficient
representation of the individual. In [12], an adaptive GA
called AGA was developed for computing ARM. The two
major differences between classical GA and AGA reside in
the design of the mutation and crossover operators. Another
algorithm called PQGMA is proposed in [13]. Mining
is mainly performed by applying classical GA, while the mutation
and the crossover operations are performed using quantum
computing and simulated annealing respectively. However,
the use of quantum computing in the mutation suffers from
poor diversification and therefore leads to premature convergence.
In [11], another GA is proposed for mining association rules.
By using an adaptive mutation rate, this algorithm provides
an important population variation. Nevertheless, the mutation
probability is computed at each iteration, which increases
the computational time. In [15], the authors developed
G3PARM based on genetic programming. They use G3P
(Grammar Guided Genetic Programming) to avoid invalid
individuals found by the GP process. Also, G3PARM allows
multiple variants of data by using a context-free grammar.
An interesting work that provides a performance analysis
of association rules mining based on genetic algorithms is
described in [14]. All these algorithms, despite their refined
design, generate non-admissible rules. To tackle this difficulty,
we propose in the next section an efficient strategy called
delete and decomposition for modelling the problem and then
address it using the target evolutionary algorithms.
III. AN IMPROVED GENETIC ALGORITHM FOR
ASSOCIATION RULES MINING
All of the above related algorithms have two main limitations.
Firstly, they could generate false rules, that is, rules that have a
high fitness value without respecting the minimum support and
minimum confidence constraints. Secondly, their respective
processes create new solutions that may not be admissible and,
moreover, there is no special treatment to manage this issue.
To overcome these main drawbacks, the first contribution
in this work is to propose an improved association rules
mining with genetic algorithm (IARMGA for short) which
eliminates the risk of generating false rules and which solves
the admissibility problem by defining a new strategy for
crossover and mutation operators.
A. Encoding Solution
In [10] and [11], two representations called respectively
Binary encoding and Integer encoding were proposed. In
Binary encoding, each rule is encoded by a vector $S$ of $n$ elements,
where $n$ is the number of all items. The
$i$th element of $S$ is set to 1 if the item $i$ is in the rule and 0
otherwise. In Integer encoding, a rule is defined by a vector $S$
of $k + 1$ units, where $k$ is the number of items in the rule.
The first element is the separator index between the antecedent and
consequent parts of the solution. For all the other elements $i$
in $S$, if $S[i] = j$ then the item $j$ appears in the $i$th position
of the rule. In IARMGA, the second representation is used to
facilitate the implementation of genetic operators.
Example 1: Let $T = \{t_1, t_2, \ldots, t_{10}\}$ be a set of items.
• $S_1$ = {3, 2, 4, 5, 3, 6, 7, 8} represents the rule
$r_1$: $t_2, t_4 \Rightarrow t_5, t_3, t_6, t_7, t_8$.
• $S_2$ = {1, 4, 7, 1, 2, 9} represents the rule
$r_2$: $\emptyset \Rightarrow t_4, t_7, t_1, t_2, t_9$.
• $S_3$ = {5, 1, 6, 8, 7, 3, 2, 4} represents the rule
$r_3$: $t_1, t_6, t_8, t_7 \Rightarrow t_3, t_2, t_4$.
The symbol $\emptyset$ denotes an empty antecedent.
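The Integer encoding can be sketched in Python as follows. This is only an illustration consistent with Example 1; the helper names `decode` and `encode` are hypothetical and do not appear in the paper.

```python
from typing import List, Tuple


def decode(s: List[int]) -> Tuple[List[int], List[int]]:
    """Split an integer-encoded rule into (antecedent, consequent).

    s[0] is the separator index: items at positions 1 .. s[0]-1 form the
    antecedent, items from position s[0] onwards form the consequent.
    """
    sep = s[0]
    return list(s[1:sep]), list(s[sep:])


def encode(antecedent: List[int], consequent: List[int]) -> List[int]:
    """Inverse of decode: the separator is the index of the first consequent item."""
    return [len(antecedent) + 1, *antecedent, *consequent]


# The three solutions of Example 1.
print(decode([3, 2, 4, 5, 3, 6, 7, 8]))  # ([2, 4], [5, 3, 6, 7, 8])
print(decode([1, 4, 7, 1, 2, 9]))        # ([], [4, 7, 1, 2, 9])
print(decode([5, 1, 6, 8, 7, 3, 2, 4]))  # ([1, 6, 8, 7], [3, 2, 4])
```

A separator equal to 1 yields an empty antecedent, which matches the second solution of the example.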
B. Fitness Function Design
As mentioned above, the ARM problem aims at finding all
rules satisfying the minimum support and the minimum
confidence constraints. Let $\alpha$ and $\beta$ be two empirical parameters.
The fitness function of the solution $s$ is computed as:
$$F = \begin{cases} \alpha \times \mathrm{confidence}(s) + \beta \times \mathrm{support}(s) & \text{if } \mathrm{confidence}(s) \geq \mathrm{MinConf} \text{ and } \mathrm{support}(s) \geq \mathrm{MinSup} \\ -1 & \text{otherwise} \end{cases}$$
When the MinConf and MinSup constraints are not
satisfied, the fitness of the individual is penalized by
assigning it a negative value.
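A minimal Python sketch of this fitness function is given below. It assumes the Integer encoding of Section III-A and a database stored as a list of item sets; the signature, the parameter names and the toy data are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Set


def fitness(s: List[int], transactions: List[Set[int]],
            alpha: float, beta: float,
            min_sup: float, min_conf: float) -> float:
    """alpha * confidence + beta * support, or -1 if a threshold is violated."""
    sep = s[0]
    x, y = set(s[1:sep]), set(s[sep:])          # antecedent, consequent
    n_xy = sum(1 for t in transactions if (x | y) <= t)
    n_x = sum(1 for t in transactions if x <= t)
    support = n_xy / len(transactions)
    confidence = n_xy / n_x if n_x else 0.0     # assumption: 0 when X never occurs
    if support >= min_sup and confidence >= min_conf:
        return alpha * confidence + beta * support
    return -1.0


# Hypothetical toy database and the rule t1 => t2, encoded as [2, 1, 2].
toy_db = [{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]
good = fitness([2, 1, 2], toy_db, alpha=0.5, beta=0.5, min_sup=0.5, min_conf=0.7)
bad = fitness([2, 1, 2], toy_db, alpha=0.5, beta=0.5, min_sup=0.5, min_conf=0.8)
print(good, bad)
```

In this toy case the rule has support 0.6 and confidence 0.75, so it is scored in the first call and penalized with -1 in the second, where MinConf is raised to 0.8.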
C. Crossover operator
Two parents are first selected from the given population. To
create the new offspring, we apply the following two crossover
operators:
• The antecedent part of the first parent is transferred to
the antecedent part of the first child, and the consequent
part of the first parent is copied to the consequent of the
second child.
• The antecedent part of the second parent is copied to the
antecedent part of the second child, and the consequent
part of the second parent is copied to the consequent part
of the first child.
The general algorithm of the crossover process is outlined
in algorithm 1. The second and the third instructions indicate
that the antecedent part of the first child takes the same number
of items as the antecedent part of the first parent and the
antecedent part of the second child takes the same number
of items as the antecedent part of the second parent. From
the fourth instruction to the ninth one, the items of the first
parent are copied on the first and the second children using
the first constraint described above. From the tenth instruction
to the fifteenth instruction, the items of the second parent are
copied on the first and the second children using the second
constraint.
Example 2: Let us consider the following parents:
Parent1: {2, 1, 2, 5, 3} represents the rule: $t_1 \Rightarrow t_2, t_5, t_3$.
Parent2: {3, 4, 8, 7, 6} represents the rule: $t_4, t_8 \Rightarrow t_7, t_6$.
Algorithm 1 Crossover algorithm
1: Input data: Parent1, Parent2: Arrays of integers
   Output data: Child1, Child2: Arrays of integers
2: Child1[0] ← Parent1[0]
3: Child2[0] ← Parent2[0]
4: for i = 1 to Parent1[0] - 1 do
5:   Child1[i] ← Parent1[i]
6: end for
7: for i = Parent1[0] to Parent1.size do
8:   Child2[i] ← Parent1[i]
9: end for
10: for i = 1 to Parent2[0] - 1 do
11:   Child2[i] ← Parent2[i]
12: end for
13: for i = Parent2[0] to Parent2.size do
14:   Child1[i] ← Parent2[i]
15: end for
Algorithm 2 Mutation algorithm
1: for each new child ch do
2:   a ← choose a number between [1..n]
3:   b ← choose a number between [1..ch.size]
4:   ch[b] ← a
5: end for
The results of the crossover operator are:
Child1: {2, 1, 7, 6} represents the rule: $t_1 \Rightarrow t_7, t_6$.
Child2: {3, 4, 8, 2, 5, 3} represents the rule: $t_4, t_8 \Rightarrow t_2, t_5, t_3$.
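The crossover of Example 2 can be reproduced with the following Python sketch. It builds each child by concatenating one parent's antecedent with the other parent's consequent, which is one reading of the two constraints described above; the function name and the list-based representation are assumptions of this sketch.

```python
from typing import List, Tuple


def crossover(parent1: List[int], parent2: List[int]) -> Tuple[List[int], List[int]]:
    """Exchange consequents between two integer-encoded rules.

    Child 1 keeps the antecedent of parent 1 and takes the consequent of
    parent 2; child 2 keeps the antecedent of parent 2 and takes the
    consequent of parent 1. Each child inherits the separator of the parent
    that supplied its antecedent.
    """
    sep1, sep2 = parent1[0], parent2[0]
    antecedent1, consequent1 = parent1[1:sep1], parent1[sep1:]
    antecedent2, consequent2 = parent2[1:sep2], parent2[sep2:]
    child1 = [sep1, *antecedent1, *consequent2]
    child2 = [sep2, *antecedent2, *consequent1]
    return child1, child2


child1, child2 = crossover([2, 1, 2, 5, 3], [3, 4, 8, 7, 6])
print(child1)  # [2, 1, 7, 6]
print(child2)  # [3, 4, 8, 2, 5, 3]
```

Because the two antecedents may differ in length, the children can have sizes different from their parents, as in the example.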
D. Mutation operator
The mutation operation in the genetic algorithm stimulates
diversification of the search. The technique we use consists in
randomly altering one element of each child. In other words, an
item is replaced by another one. The general algorithm of this
operation is described in Algorithm 2.
If we consider the two children of the previous example, then
for a = 2 and b = 3 we obtain:
Child1: {2, 1, 7, 2} represents the rule: $t_1 \Rightarrow t_7, t_2$.
Child2: {3, 4, 8, 2, 5, 3} represents the rule: $t_4, t_8 \Rightarrow t_2, t_5, t_3$.
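A possible Python rendering of the mutation step is sketched below. It assumes that the position b is drawn among the item positions only (indices 1 to the last index), so that the separator stored at index 0 is never overwritten; this restriction and the function names are assumptions of the sketch rather than details stated in the paper.

```python
import random
from typing import List


def apply_mutation(ch: List[int], a: int, b: int) -> List[int]:
    """Return a copy of ch in which the item at position b is replaced by a."""
    mutated = list(ch)
    mutated[b] = a
    return mutated


def mutate(ch: List[int], n: int, rng: random.Random) -> List[int]:
    """Replace one randomly chosen item of ch by a random item in [1..n]."""
    a = rng.randint(1, n)              # new item
    b = rng.randint(1, len(ch) - 1)    # position among the items (assumption)
    return apply_mutation(ch, a, b)


# Reproducing the example with a = 2 and b = 3.
print(apply_mutation([2, 1, 7, 6], 2, 3))        # [2, 1, 7, 2]
print(apply_mutation([3, 4, 8, 2, 5, 3], 2, 3))  # [3, 4, 8, 2, 5, 3]
```

The second child is unchanged in the example because position 3 already holds item 2.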
E. Delete and decomposition strategy
The crossover and mutation operators can yield non-admissible
solutions. Indeed, an item can appear in both the
antecedent and consequent parts of a generated rule after
either of these two operations, which breaks
the admissibility constraint. To remedy this situation,
the "delete and decomposition" strategy was conceived
with two stages. During the first phase, the item is removed
from the antecedent part, while during the second phase it
is removed from the consequent part. This operation allows
us to decompose the non-admissible solution into two solutions
that are syntactically admissible. As a result, the
item belongs only to the consequent part of the first solution and
only to the antecedent part of the second one.
[Figure 1: flowchart with the boxes Evaluation and selection, Crossover, Delete and decomposition strategy, Mutation, Delete and decomposition strategy, and the test "IMAX is reached" (Yes: Exit; No)]
Fig. 1. IARMGA Algorithm
Consider $r$: A, B, C $\Rightarrow$ C, F. This rule is non-admissible
because the item C belongs at the same time to its antecedent
and its consequent parts. Using the delete and decomposition
strategy, $r$ can be decomposed into two rules:
$r_1$: A, B $\Rightarrow$ C, F, obtained by removing the item C from
the antecedent part of $r$.
$r_2$: A, B, C $\Rightarrow$ F, obtained by removing the item C from
the consequent part of $r$.
Notice that when the modified item already appears in the
same part of the rule, the admissibility constraint is not
affected. The treatment here is to keep only
one occurrence of the item, which shortens
the considered part and hence decrements the size of the
whole rule.
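The strategy can be sketched in Python as follows, using the Integer encoding of Section III-A. The paper describes the case of a single shared item; treating several shared items at once, and leaving rules with an emptied part untouched, are assumptions of this sketch, as are the function names.

```python
from typing import List


def _unique(items: List[int]) -> List[int]:
    """Keep only the first occurrence of each item, preserving order."""
    seen, out = set(), []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def delete_and_decompose(s: List[int]) -> List[List[int]]:
    """Return the admissible rule(s) derived from the encoded rule s."""
    sep = s[0]
    antecedent, consequent = _unique(s[1:sep]), _unique(s[sep:])
    shared = set(antecedent) & set(consequent)
    if not shared:
        # Already admissible (possibly shortened by duplicate removal).
        return [[len(antecedent) + 1, *antecedent, *consequent]]
    reduced_ant = [i for i in antecedent if i not in shared]
    reduced_cons = [i for i in consequent if i not in shared]
    rule1 = [len(reduced_ant) + 1, *reduced_ant, *consequent]   # delete from antecedent
    rule2 = [len(antecedent) + 1, *antecedent, *reduced_cons]   # delete from consequent
    return [rule1, rule2]


# r: A, B, C => C, F with the hypothetical item codes A=1, B=2, C=3, F=6.
print(delete_and_decompose([4, 1, 2, 3, 3, 6]))
# [[3, 1, 2, 3, 6], [4, 1, 2, 3, 6]], i.e. A, B => C, F and A, B, C => F
```

Applied to the mutated Child1 {2, 1, 7, 2} of the previous example, the function returns the rule unchanged, since item 2 occurs only in the consequent.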
F. The Improved Genetic Algorithm Principle
The initial population of pop_size individuals is first
generated at random. Each individual is constructed with respect
to the encoding presented in Section III-A. Then, for
each non-admissible solution, the delete and decomposition
strategy is applied. Since decomposition can increase the number
of individuals, all individuals are evaluated using the fitness
function and only the best pop_size individuals are kept in order
to maintain the population size; the
others are removed. As in the classical algorithm, we apply
the crossover and mutation operators, but after each operation
the delete and decomposition strategy is performed. The same
process is repeated until the maximum number of iterations
is reached. Figure 1 presents the overall architecture of
the IARMGA algorithm, where IMAX is the maximum number
of iterations.
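The overall loop can be outlined with the generic Python sketch below. The operators are passed in as callables so that the sketch stays self-contained; the name `evolve`, the choice of one random parent pair per iteration, and the elitist selection over parents and children are assumptions, since the paper does not fix these details.

```python
import random
from typing import Callable, Iterable, List

Rule = List[int]


def evolve(population: List[Rule],
           fitness: Callable[[Rule], float],
           crossover: Callable[[Rule, Rule], Iterable[Rule]],
           mutate: Callable[[Rule], Rule],
           repair: Callable[[Rule], List[Rule]],
           pop_size: int, imax: int, rng: random.Random) -> List[Rule]:
    """Run imax iterations of crossover, repair, mutation, repair, selection."""

    def repair_all(candidates: Iterable[Rule]) -> List[Rule]:
        # `repair` stands for delete and decomposition: one rule in, 1..2 rules out.
        return [fixed for c in candidates for fixed in repair(c)]

    def select(candidates: List[Rule]) -> List[Rule]:
        # Keep only the pop_size best individuals.
        return sorted(candidates, key=fitness, reverse=True)[:pop_size]

    pop = select(repair_all(population))
    for _ in range(imax):
        parent1, parent2 = rng.sample(pop, 2)   # assumption: one pair per pass
        children = repair_all(crossover(parent1, parent2))
        children = repair_all(mutate(c) for c in children)
        pop = select(pop + children)
    return pop
```

With this structure the population never exceeds pop_size after selection, and the best individual found so far is always retained. The population must contain at least two individuals for the parent sampling to work.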
Algorithm 3 Neighbors Determination
1: Input: Solution: S
   Output: Array: Neighbors
2: for each element i ∈ S do
3:   for each element j ∈ S do
4:     Neighbors[j] ← S[j]
5:   end for
6:   number ← Choose_Number(n)
7:   Neighbors[i] ← S[i]
8: end for
9: return Neighbors
Algorithm 4 Local Search Algorithm
1: for i = 1 to IMAX_ls do
2:   Neighbors ← Neighbors_Determination(S)
3:   Delete_Decomposition_Strategy(Neighbors)
4:   for each ng ∈ Neighbors do
5:     Fitness(ng)
6:   end for
7:   S ← best_neighbors(Neighbors)
8: end for
9: return S
IV. A MEMETIC ALGORITHM FOR ASSOCIATION RULES
MINING
This section presents the adaptation of IARMGA to
IARMMA using the memetic approach. After the mutation step,
a local search is applied to each generated solution. The
local search process is performed by successively applying
the neighborhood computation process. In the following, we
give details about the main points of the algorithm.
A. Neighborhood search computation
The neighbors of a given solution $S$ are determined by
replacing each element of $S$ by a number belonging to the
range [1..n], $n$ being the number of items. This way, at most
$n$ neighbors will be created. The detailed algorithm of this
process is presented in Algorithm 3.
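The neighborhood described above can be sketched in Python as follows. The sketch follows the textual description (each neighbor is a copy of the solution with one element replaced by a random item) and assumes that the separator at index 0 is left untouched; the function name is hypothetical.

```python
import random
from typing import List


def neighbors(s: List[int], n: int, rng: random.Random) -> List[List[int]]:
    """Build one neighbor per item position of the encoded rule s."""
    result = []
    for i in range(1, len(s)):          # skip the separator (assumption)
        neighbor = list(s)              # copy the current solution
        neighbor[i] = rng.randint(1, n) # replace element i by a random item
        result.append(neighbor)
    return result


for nb in neighbors([3, 2, 4, 5, 3], n=10, rng=random.Random(0)):
    print(nb)
```

Under these assumptions the number of neighbors equals the number of items in the rule. A neighbor may coincide with the original solution when the drawn item equals the one it replaces, and it may be non-admissible, which is why the delete and decomposition strategy is applied afterwards.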
B. Local search process
The local search process consists in carrying out the search
in successive passes, where each pass starts from the
best solution of the previous pass. This process is repeated
for at most IMAX_ls iterations. To avoid non-admissible
solutions, the delete and decomposition strategy is applied to
each generated neighbor. The search process returns the best
solution located in the specified region. The general algorithm
of this operation is shown in Algorithm 4.
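The local search passes can be outlined with the following generic Python sketch. The neighborhood, repair and fitness procedures are supplied as callables (hypothetical parameter names), and, as in Algorithm 4, the current solution is replaced by the best neighbor of each pass even when that neighbor is not better than the current solution.

```python
from typing import Callable, List, TypeVar

S = TypeVar("S")


def local_search(start: S,
                 neighbors_of: Callable[[S], List[S]],
                 repair: Callable[[S], List[S]],
                 fitness: Callable[[S], float],
                 imax_ls: int) -> S:
    """Move to the best repaired neighbor for at most imax_ls passes."""
    current = start
    for _ in range(imax_ls):
        # Repair every neighbor (delete and decomposition may return two rules).
        candidates = [fixed for nb in neighbors_of(current) for fixed in repair(nb)]
        if not candidates:
            break
        current = max(candidates, key=fitness)
    return current


# Toy usage on integers: climb towards the value 10, one step per pass.
result = local_search(0,
                      neighbors_of=lambda x: [x - 1, x + 1],
                      repair=lambda x: [x],
                      fitness=lambda x: -abs(x - 10),
                      imax_ls=5)
print(result)  # 5
```

A variant that keeps the current solution when no neighbor improves on it would be a different design choice from the one written in Algorithm 4.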
C. The Memetic Algorithm principle
First, a random population is generated. As for IARMGA,
the delete and decomposition strategy is performed, followed
by the crossover and mutation operators. Afterwards, for each
generated child, a local search algorithm is launched, aiming at
returning the best solution located in its region. This process
allows finding more and more pertinent rules for each region
of the rule space. At the end of each pass of the algorithm,
the selection operation returns the best solutions, which are used in the
next generation. This process is repeated until the maximum
number of iterations is reached. The general framework of
IARMMA is shown in Figure 2.
[Figure 2: flowchart with the boxes Crossover and Mutation, Delete and decomposition strategy, Local Search Process, Evaluation and selection, and the test "IMAX is reached" (Yes: Exit; No)]
Fig. 2. IARMMA Algorithm
[Figure 3: plot, legend "Genetic", vertical axis from 0 to 0.045]
Fig. 3. Solution Quality of Genetic Approach with different number of iterations
V. EXPERIMENTAL STUDY
In order to validate the suggested approaches, intensive
experiments were carried out. All programs were implemented
in C++ and run on a machine with an i3 processor and 4 GB of memory. We first tune the
parameters of both algorithms, IARMGA and IARMMA,
including the size of the population and the number of iterations.
The two algorithms are then compared using real-world
scientific databases that are frequently used in the data mining
community, like the Frequent Itemset Mining Dataset Repository [17]
and the Bilkent University Function Approximation Repository
[16].
[Figure 4: plot, legend "Memetic", horizontal axis IMAX]
Fig. 4. Solution Quality of Memetic Approach with different number of iterations
[Figure 5: plot, legend IARMGA and IARMMA, horizontal axis "# transactions", vertical axis "time" from 0 to 1800]
Fig. 5. Execution time (sec) of the two approaches with different number of transactions
Recall that the solution quality is measured through the
fitness function, which takes into account the support and the
confidence of a rule (Section III-B). Figures 3 and 4 present
the quality of the solutions returned by IARMGA and IARMMA
respectively. When increasing the number of iterations from
1 to 100, the quality of the solutions returned by the algorithms keeps increasing
until it stabilizes at 80 iterations for IARMGA and 78 iterations for IARMMA.
Consequently, the maximum number of iterations is set to 80
for IARMGA and 78 for IARMMA.
Table I shows how the quality of the solutions returned
by both approaches varies with the size of the
population. The best quality of both algorithms is achieved
when the size of the population is set to 60. This experiment
leads us to set the population size of both algorithms
to 60 individuals.
After having tuned the parameters (IMAX=80 for IARMGA,
IMAX=78 for IARMMA, and pop_size=60), we compare the
performance of the algorithms in terms of running time and
solution quality. Figure 5 reveals that IARMMA is highly time
consuming compared to IARMGA when the number of input
transactions is large. In fact, the local search added in
IARMMA makes the process respond very slowly.
However, Table II shows the efficiency of IARMMA
compared to IARMGA in terms of solution quality. Indeed, when
increasing the number of transactions, a large gap between
the outcomes of the two methods is observed, thanks to the local search
incorporated in IARMMA.

TABLE I. SOLUTION QUALITY ACCORDING TO DIFFERENT POPULATION SIZES
pop_size    IARMGA    IARMMA
10          0.0025    0.0028
20          0.005     0.0078
30          0.01      0.013
40          0.0025    0.003
50          0.003     0.0034
60          0.02      0.02
70          0.002     0.003
80          0.005     0.007
90          0.007     0.02
100         0.005     0.01

TABLE II. SOLUTION QUALITY OF THE TWO ALGORITHMS ACCORDING TO DIFFERENT DATA INSTANCES
Data set name   Number of transactions   Number of items   IARMGA                  IARMMA
Bolts           40                       8                 1                       1
Sleep           56                       8                 1                       1
Pollution       60                       16                1                       1
Basket          96                       5                 1                       1
IBM.Q.st        1000                     40                0.97                    1
Quake           2178                     4                 0.95                    0.99
Cheese          3196                     75                0.01                    0.03
Mushroom        8124                     119               0.008                   0.03
BMS-WEB-1       59602                    497               0.0005                  0.001
Connect         100000                   999               0.015                   0.02
BMPPOS          515597                   1657              0.0007                  0.0009
WebDocs         1692082                  526765            Blocked after 15 days   Blocked after 15 days
The final result can be summarized as follows: there exists
a compromise between the execution time and the solution
quality for both tools. If we desire to find an acceptable
solution in real time, IARMGA should be used, whereas if
we seek solutions close to the optimal solution, IARMMA
should be selected.
VI. CONCLUSION
In this paper, two bio-inspired approaches for the association
rules mining problem are proposed. The first one, called
IARMGA, is an efficient genetic algorithm designed to find a
set of acceptable rules in real time. The second one, called
IARMMA, is a memetic algorithm developed for the same
purpose. The memetic mechanism improves the quality
of the returned rules. Both approaches are able to overcome
the solution admissibility problem using an efficient strategy
called "delete and decomposition". To analyze the behavior of
the proposed approaches, several experiments were performed
on both synthetic and real data sets. The results show that
IARMGA is less time consuming than IARMMA,
whereas the solutions returned by IARMMA are better than
those found by IARMGA. We conclude that there exists a
compromise between CPU time and solution quality.
The choice of the algorithm depends on the application
at hand. If we want to extract association rules in real time
then IARMGA is more appropriate. However, if we desire
more precision in the resulting rules, IARMMA is preferred.
In addition, the two algorithms were blocked after 15 days of processing the WebDocs benchmark (the largest benchmark
existing on the web). Consequently, a parallel version of
these approaches could bring more efficiency in handling such a big
instance of transactions.
ACKNOWLEDGMENT: We are sincerely grateful to the
reviewers who provided us with interesting remarks to improve
the paper quality.
REFERENCES
[1] R. Agrawal and J. Shafer, "Parallel mining of association rules", IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, 1996, pp. 962-969.
[2] J. H. Holland, "Genetic algorithms and the optimal allocation of trials", SIAM Journal on Computing, Vol. 2, No. 2, 1973, pp. 88-105.
[3] J. M. Aurifeille, "A bio-mimetic approach to marketing segmentation: Principles and comparative analysis", European Journal of Economic and Social Systems, Vol. 14, No. 1, 2000, pp. 93-108.
[4] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases", ACM SIGMOD Record, Vol. 22, No. 2, 1993.
[5] I. Fister Jr., X.-S. Yang, I. Fister, J. Brest, and D. Fister, "A Brief Review of Nature-Inspired Algorithms for Optimization", Elektrotehniski vestnik, Vol. 80, No. 3, 2013.
[6] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation", ACM SIGMOD Record, Vol. 29, No. 2, 2000.
[7] Y. Djenouri, H. Drias, and A. Chemchem, "A hybrid Bees Swarm Optimization and Tabu Search algorithm for Association rule mining", In Nature and Biologically Inspired Computing, IEEE NaBIC, 2013, pp. 120-125.
[8] J. Mata, J. Alvarez, and J. Riquelme, "An Evolutionary algorithm to discover numeric association rules", In Proceedings of the ACM Symposium on Applied Computing SAC, 2002, pp. 590-594.
[9] J. Mata, J. Alvarez, and J. Riquelme, "Mining numeric association rules with genetic algorithms", In Proceedings of the International Conference ICANNGA, 2001, pp. 264-267.
[10] X. Yan and C. Zhang, "Genetic algorithm based strategy for identifying association rule without specifying minimum support", Expert Systems with Applications, Vol. 36, No. 2, 2009, pp. 3066-3076.
[11] G. Hong and Y. Zhou, "An algorithm for mining association rules based on improved genetic algorithm and its application", Third International Conference on Genetic and Evolutionary Computing, IEEE Computer Science, 2009, pp. 117-120.
[12] M. Wang, Q. Zou, and C. Lin, "Multi dimensions association rules mining on adaptive genetic algorithm", International Conference on Uncertainty Reasoning and Knowledge Engineering, 2011.
[13] D. Liu, "Improved genetic algorithm based on simulated annealing and quantum computing strategy for association rule mining", Journal of Software, 2010.
[14] K. Indira and S. Kanmani, "Performance Analysis of Genetic Algorithm for Mining Association Rules", International Journal of Computer Science Issues, Vol. 9, No. 1, 2012.
[15] C. Romero, A. Zafra, J. Luna, and S. Ventura, "Association rule mining using genetic programming to provide feedback to instructors from multiple-choice quiz data", Journal of Expert System, 2012.
[16] H. A. Guvenir and I. Uysal, Bilkent University Function Approximation Repository, http://funapp.cs.bilkent.edu.tr, 2000.
[17] B. Goethals, Frequent Itemset Mining Implementations Repository, http://fimi.ua.ac.be/, 2004.