[ieee 2014 sixth world congress on nature and biologically inspired computing (nabic) - porto,...

Genetic Algorithm versus Memetic Algorithm for Association Rules Mining

Habiba Drias LRIA, USTHB

BP 32 El Alia Bab Ezzouar , Algiers, Algeria ,

[email protected]

Abstract: This paper deals with association rules mining using evolutionary algorithms. All previous bio-inspired association rules mining approaches generate non-admissible rules, which cannot be exploited by the end user. To cope with this issue, we propose two approaches that avoid non-admissible rules by means of a new strategy called the delete and decomposition strategy. If an item appears in both the antecedent and the consequent parts of a given rule, the rule is decomposed into two admissible rules: the item is deleted from the antecedent part of the first rule and from the consequent part of the second rule. We then design a genetic algorithm called IARMGA and a memetic algorithm called IARMMA for association rules mining. Several experiments were carried out using both synthetic and real instances. The results reveal a compromise between the execution time and the quality of the output rules: IARMGA is faster than IARMMA, whereas the latter outperforms the former in terms of rule quality.

Keywords: association rules mining, bio-inspired algorithms, genetic algorithm, memetic algorithm.

I. INTRODUCTION

Association Rules Mining (ARM) is one of the most important and well-studied data mining techniques. It aims at extracting frequent patterns, associations or causal structures among sets of items from a given transactional database. Formally, the association rule problem is defined as follows: let $T$ be a set of $n$ transactions $\{t_1, t_2, \ldots, t_n\}$ representing a transactional database, and $I$ be a set of $m$ different items or attributes $\{i_1, i_2, \ldots, i_m\}$. An association rule is an implication of the form $X \rightarrow Y$ where $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$. The itemset $X$ is called the antecedent while the itemset $Y$ is called the consequent, and the rule means $X$ implies $Y$. The intuitive meaning is that the set $\{X, Y\}$ is a frequent pattern.

ARM consists in discovering a set of association rules covering a large percentage of the data, and it tends to produce a large number of rules. Nowadays, since databases are increasingly large, the user no longer looks for all the rules but only for a subset of useful ones. Two basic parameters are commonly used for measuring the usefulness of association rules, namely the support and the confidence of a rule. The support of an itemset $I' \subseteq I$ is the number of transactions containing $I'$. The support of a rule $X \rightarrow Y$ is the support of $X \cup Y$, expressed as the percentage of transactions containing $X$ and $Y$ together. It is computed as:

$$\frac{\mathrm{support}(X \cup Y)}{|T|}$$

978-1-4799-5937-2/14/$31.00 ©2014 IEEE

The confidence of a rule is calculated as:

$$\frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}$$

and measures the strength of the rule. An association rule $X \rightarrow Y$ with a confidence of 80% means that 80% of the transactions that contain $X$ also contain $Y$. Therefore, ARM aims at extracting from a given database all interesting rules, that is, rules that have a support $\geq$ MinSup and a confidence $\geq$ MinConf, where MinSup and MinConf are two thresholds predefined by users [1].
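As an illustration of the two measures, the following is a minimal Python sketch. The function names `support` and `confidence` and the toy database `db` are assumptions introduced here for illustration; they do not come from the paper. Support is returned as a fraction of the transactions rather than a raw count.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[Transaction]) -> float:
    """support(X u Y) / support(X); defined as 0 when X never occurs."""
    x = frozenset(antecedent)
    sup_x = support(x, transactions)
    if sup_x == 0.0:
        return 0.0
    return support(x | frozenset(consequent), transactions) / sup_x


# Hypothetical toy database of five transactions.
db = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
)]
```

On this toy database, the rule bread → milk has support 3/5 and confidence 3/4, since four transactions contain bread and three of them also contain milk.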

Nature-inspired approaches have attracted a lot of attention in the artificial intelligence community. They are mainly based on several sources of inspiration such as swarm intelligence (SI-based), biological systems (bio-inspired), and physical and chemical phenomena [5]. The relationship between such methods can be roughly summarized as follows:

SI-based $\subset$ bio-inspired $\subset$ nature-inspired

In this work, we investigate two bio-inspired algorithms, namely genetic and memetic algorithms, for association rules mining. In a standard genetic algorithm [2], an initial population of individuals is drawn. The principle is to make this population evolve towards another one with better performance. To generate new high-quality individuals, genetic operators such as crossover and mutation are applied. This process is repeated until a given criterion is reached.

In the last decade, another approach based on the genetic algorithm, called the memetic algorithm [3], was proposed. The idea is to integrate into each new individual a cultural concept perceived from the environment. Most memetic models simulate this feature by a simple local search applied to such new individuals.

In the literature, several ARM algorithms based on evolutionary approaches have been proposed. However, they all present a serious drawback, which is the non-admissibility of the rules [8], [9], [10], [12]. To overcome this issue, we address the problem with a new modelling and then adapt it to the two mentioned evolutionary algorithms.

The rest of this paper is organized as follows. The next

section provides the state of the art for ARM algorithms.

Section III and Section IV describe respectively the algorithms

IARMGA and IARMMA. Section V presents the experimental

study and the obtained outcomes. Finally, in Section VI we

conclude this work by outlining the overall achievements and

suggest future perspectives.

II. RELATED WORKS

Several algorithms have been designed for the ARM problem. Some of them are based on exact approaches, like Apriori [4] and FP-growth [6]. These algorithms are time consuming when applied to large data sets.

As an alternative, bio-inspired algorithms like GAR [8], GENAR [9] and ARMGA [10] were proposed. The main limitation of these genetic algorithms is the inefficient representation of the individual. In [12], an adaptive GA called AGA was developed for ARM. The two major differences between the classical GA and AGA reside in the design of the mutation and crossover operators. Another algorithm called PQGMA is proposed in [13]. Mining is mainly performed by applying the classical GA, while the mutation and the crossover operations are performed using quantum computing and simulated annealing respectively. However, the use of quantum computing in the mutation suffers from a lack of diversification and therefore leads to premature convergence. In [11], another GA is proposed for mining association rules. By using an adaptive mutation rate, this algorithm provides significant population variation. Nevertheless, the mutation probability is computed at each iteration, which increases the computational time. In [15], the authors developed G3PARM, based on genetic programming. They use G3P (Grammar Guided Genetic Programming) to avoid invalid individuals found by the GP process. G3PARM also allows multiple variants of data by using a context-free grammar. An interesting work that provides a performance analysis of association rules mining based on genetic algorithms is described in [14]. All these algorithms, despite their refined design, generate non-admissible rules. To tackle this difficulty, we propose in the next section an efficient strategy called delete and decomposition for modelling the problem, and then address it using the target evolutionary algorithms.

III. AN IMPROVED GENETIC ALGORITHM FOR ASSOCIATION RULES MINING

All of the related algorithms above have two main limitations. Firstly, they can generate false rules, that is, rules that have a high fitness without respecting the minimum support and minimum confidence constraints. Secondly, their respective processes create new solutions that may not be admissible and, moreover, there is no special treatment to manage this issue.

To overcome these drawbacks, the first contribution of this work is an improved association rules mining with genetic algorithm (IARMGA for short), which eliminates the risk of generating false rules and solves the admissibility problem by defining a new strategy for the crossover and mutation operators.

A. Encoding Solution

In [10] and [11], two representations called respectively binary encoding and integer encoding were proposed. In binary encoding, each rule is encoded by a vector $S$ of $n$ elements, where $n$ is the number of all items. The $i$th element of $S$ is set to 1 if item $i$ is in the rule and to 0 otherwise. In integer encoding, a rule is defined by a vector $S$ of $k + 1$ units, where $k$ is the number of items in the rule. The first element is the separator index between the antecedent and consequent parts of the solution. For every other element $i$ in $S$, if $S[i] = j$ then item $j$ appears in the $i$th position of the rule. In IARMGA, the second representation is used to facilitate the implementation of the genetic operators.

Example 1: Let $T = \{t_1, t_2, \ldots, t_{10}\}$ be a set of items.

• $S_1 = \{3, 2, 4, 5, 3, 6, 7, 8\}$ represents the rule $r_1: t_2, t_4 \Rightarrow t_5, t_3, t_6, t_7, t_8$.

• $S_2 = \{1, 4, 7, 1, 2, 9\}$ represents the rule $r_2: \emptyset \Rightarrow t_4, t_7, t_1, t_2, t_9$.

• $S_3 = \{5, 1, 6, 8, 7, 3, 2, 4\}$ represents the rule $r_3: t_1, t_6, t_8, t_7 \Rightarrow t_3, t_2, t_4$.

The symbol $\emptyset$ denotes an empty antecedent.
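The integer encoding can be read back into a rule with a few lines of Python. This is only a sketch: the helper names `decode` and `encode` are hypothetical, and the vector is stored as a 0-based Python list whose element 0 holds the separator index, which is how the three vectors of Example 1 are interpreted here.

```python
from typing import List, Tuple


def decode(s: List[int]) -> Tuple[List[int], List[int]]:
    """Split an integer-encoded rule into (antecedent, consequent) item lists."""
    sep = s[0]
    if not 1 <= sep <= len(s):
        raise ValueError("separator index out of range")
    # Positions 1 .. sep-1 hold the antecedent, positions sep .. end the consequent.
    return list(s[1:sep]), list(s[sep:])


def encode(antecedent: List[int], consequent: List[int]) -> List[int]:
    """Inverse of decode: prepend the separator index to the concatenated items."""
    return [len(antecedent) + 1] + list(antecedent) + list(consequent)


# The three vectors of Example 1.
s1 = [3, 2, 4, 5, 3, 6, 7, 8]
s2 = [1, 4, 7, 1, 2, 9]
s3 = [5, 1, 6, 8, 7, 3, 2, 4]
```

A separator equal to 1 yields an empty antecedent, which matches the second rule of the example.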

B. Fitness Function Design

As mentioned above, the ARM problem aims at finding all rules satisfying the minimum support and the minimum confidence constraints. Let $\alpha$ and $\beta$ be two empirical parameters. The fitness of a solution $s$ is computed as:

$$F = \begin{cases} \alpha \times \mathrm{confidence}(s) + \beta \times \mathrm{support}(s) & \text{if } \mathrm{confidence}(s) \geq \mathrm{MinConf} \text{ and } \mathrm{support}(s) \geq \mathrm{MinSup} \\ -1 & \text{otherwise.} \end{cases}$$

When the MinConf and MinSup constraints are not satisfied, the individual is penalized by assigning it a negative fitness value.
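The penalized fitness can be sketched in Python as follows. The function name `fitness` and the default weights `alpha = beta = 0.5` are assumptions made for illustration; the paper only states that the two parameters are chosen empirically.

```python
def fitness(conf: float, sup: float, min_conf: float, min_sup: float,
            alpha: float = 0.5, beta: float = 0.5) -> float:
    """Weighted sum of confidence and support, or -1 if a threshold is violated."""
    if conf >= min_conf and sup >= min_sup:
        return alpha * conf + beta * sup
    # Penalty: rules below MinConf or MinSup can never outrank a valid rule,
    # because every valid rule has a non-negative fitness.
    return -1.0
```

Because confidence and support are both non-negative, any rule that meets the thresholds scores at least 0, so the value -1 cleanly separates rejected rules from accepted ones during selection.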

C. Crossover operator

Two parents are first selected from the given population. To

create the new offspring, we apply the following two crossover

operators:

• The antecedent part of the first parent is transferred to

the antecedent part of the first child, and the consequent

part of the first parent is copied to the consequent of the

second child.

• The antecedent part of the second parent is copied to the

antecedent part of the second child, and the consequent

part of the second parent is copied to the consequent part

of the first child.

The general algorithm of the crossover process is outlined in Algorithm 1. The second and third instructions indicate that the antecedent part of the first child takes the same number of items as the antecedent part of the first parent, and the antecedent part of the second child takes the same number of items as the antecedent part of the second parent. From the fourth instruction to the ninth, the items of the first parent are copied to the first and second children according to the first rule described above. From the tenth instruction to the fifteenth, the items of the second parent are copied to the first and second children according to the second rule.

Example 2: Let us consider the following parents:

Parent1: $\{2, 1, 2, 5, 3\}$ represents the rule $t_1 \Rightarrow t_2, t_5, t_3$.

Parent2: $\{3, 4, 8, 7, 6\}$ represents the rule $t_4, t_8 \Rightarrow t_7, t_6$.

2014 Sixth World Congress on Nature and Biologically inspired Computing (NaBiC) 209

Algorithm 1 Crossover algorithm

1: Input data: Parent1, Parent2: Arrays of integers
   Output data: Child1, Child2: Arrays of integers
2: Child1[0] ← Parent1[0]
3: Child2[0] ← Parent2[0]
4: for i = 1 to Parent1[0] − 1 do
5:    Child1[i] ← Parent1[i]
6: end for
7: for i = Parent1[0] to Parent1.size do
8:    Child2[i] ← Parent1[i]
9: end for
10: for i = 1 to Parent2[0] − 1 do
11:    Child2[i] ← Parent2[i]
12: end for
13: for i = Parent2[0] to Parent2.size do
14:    Child1[i] ← Parent2[i]
15: end for

Algorithm 2 Mutation algorithm

1: for each new child ch do
2:    a ← choose a number in [1..n]
3:    b ← choose a number in [1..ch.size]
4:    ch[b] ← a
5: end for

The results of the crossover operator are:

Child1: $\{2, 1, 7, 6\}$ represents the rule $t_1 \Rightarrow t_7, t_6$.

Child2: $\{3, 4, 8, 2, 5, 3\}$ represents the rule $t_4, t_8 \Rightarrow t_2, t_5, t_3$.
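The exchange of consequent parts can be sketched in Python as follows. This is not a line-by-line transcription of Algorithm 1: the function name `crossover` is hypothetical, and the sketch rebuilds each child by concatenation so that children of different lengths are handled without index arithmetic. It assumes the 0-based list layout with the separator index in element 0, and it reproduces the children of Example 2.

```python
from typing import List, Tuple


def crossover(parent1: List[int], parent2: List[int]) -> Tuple[List[int], List[int]]:
    """Each child keeps one parent's antecedent and takes the other's consequent."""
    ante1, cons1 = parent1[1:parent1[0]], parent1[parent1[0]:]
    ante2, cons2 = parent2[1:parent2[0]], parent2[parent2[0]:]
    # The separator of each child is inherited with the antecedent it keeps.
    child1 = [len(ante1) + 1] + ante1 + cons2
    child2 = [len(ante2) + 1] + ante2 + cons1
    return child1, child2


# Parents of Example 2.
p1 = [2, 1, 2, 5, 3]
p2 = [3, 4, 8, 7, 6]
c1, c2 = crossover(p1, p2)
```

Note that the children may contain the same item on both sides of the separator; the operator itself does nothing to prevent this.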

D. Mutation operator

The mutation operation in the genetic algorithm stimulates diversification of the search. The technique we use consists in altering one element of each child at random. In other words, an item is replaced by another one. The general algorithm of this operation is described in Algorithm 2. If we consider the two children of the previous example, then for $a = 2$ and $b = 3$ we obtain:

Child1: $\{2, 1, 7, 2\}$ represents the rule $t_1 \Rightarrow t_7, t_2$.

Child2: $\{3, 4, 8, 2, 5, 3\}$ represents the rule $t_4, t_8 \Rightarrow t_2, t_5, t_3$.
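A minimal Python sketch of this mutation is given below. The names `apply_mutation` and `mutate` are hypothetical. One assumption goes beyond Algorithm 2: the position `b` is drawn from 1 to the last index of the list, so that the separator stored in element 0 is never overwritten.

```python
import random
from typing import List


def apply_mutation(child: List[int], a: int, b: int) -> List[int]:
    """Return a copy of `child` whose element at position b is replaced by item a."""
    mutated = list(child)
    mutated[b] = a
    return mutated


def mutate(child: List[int], n_items: int, rng: random.Random) -> List[int]:
    """Replace one randomly chosen item of `child` by a random item in [1..n_items]."""
    if len(child) < 2:  # only a separator, nothing to mutate
        return list(child)
    a = rng.randint(1, n_items)          # new item
    b = rng.randint(1, len(child) - 1)   # position, skipping the separator (assumption)
    return apply_mutation(child, a, b)
```

With `a = 2` and `b = 3`, `apply_mutation` turns `[2, 1, 7, 6]` into `[2, 1, 7, 2]` and leaves `[3, 4, 8, 2, 5, 3]` unchanged, since position 3 of the second child already holds item 2.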

E. Delete and decomposition strategy

The crossover and mutation operators can yield non-admissible solutions. Indeed, an item can appear in both the antecedent and consequent parts of a generated rule after either of these two processes, which breaks the admissibility constraint. To remedy this situation, the "delete and decomposition" strategy was conceived with two stages. During the first phase, the item is removed from the antecedent part, while during the second phase it is removed from the consequent part. This operation decomposes the non-admissible solution into two syntactically admissible solutions. Consequently, the item remains only in the consequent part of the first solution and only in the antecedent part of the second one.

[Figure 1: flowchart with the steps Evaluation and selection, Crossover, Delete and decomposition strategy, Mutation, Delete and decomposition strategy; the loop exits when IMAX is reached (Yes) and repeats otherwise (No).]

Fig. 1. IARMGA Algorithm

Consider $r: A, B, C \Rightarrow C, F$. This rule is non-admissible because the item $C$ belongs to both its antecedent and its consequent parts. Using the delete and decomposition strategy, $r$ can be decomposed into two rules:

$r_1: A, B \Rightarrow C, F$, obtained by removing the item $C$ from the antecedent part of $r$.

$r_2: A, B, C \Rightarrow F$, obtained by removing the item $C$ from the consequent part of $r$.

Notice that when the modified item already appears in the same part of the rule, the admissibility constraint is not affected. The treatment here is to keep only one occurrence of the item, which shortens the considered part and hence reduces the size of the whole rule.
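The strategy can be sketched in Python as follows. The function name `delete_and_decompose` is hypothetical, and the rule is represented as two item lists rather than as an encoded vector. The paper describes the case of a single shared item; when several items are shared, this sketch removes all of them at once from one side, which is an assumption.

```python
from typing import Hashable, List, Tuple

Rule = Tuple[List[Hashable], List[Hashable]]


def delete_and_decompose(antecedent: List[Hashable],
                         consequent: List[Hashable]) -> List[Rule]:
    """Return one admissible rule, or two if items occur on both sides."""
    # Keep a single occurrence of items repeated within the same part.
    ante = list(dict.fromkeys(antecedent))
    cons = list(dict.fromkeys(consequent))
    shared = set(ante) & set(cons)
    if not shared:
        return [(ante, cons)]
    # First rule: shared items deleted from the antecedent.
    rule1 = ([x for x in ante if x not in shared], cons)
    # Second rule: shared items deleted from the consequent.
    rule2 = (ante, [x for x in cons if x not in shared])
    return [rule1, rule2]


# The rule r: A, B, C => C, F from the text.
rules = delete_and_decompose(["A", "B", "C"], ["C", "F"])
```

Since one non-admissible rule produces two admissible ones, applying the strategy to a whole population can make it grow, which is why a selection step is needed afterwards to restore the population size.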

F. The Improved Genetic Algorithm Principle

The initial population of pop_size individuals is first generated at random. Each individual is constructed according to the encoding presented in Section III-A. Then, the delete and decomposition strategy is applied to each non-admissible solution. Because this strategy can increase the number of individuals, all individuals are evaluated using the fitness function and only the pop_size best ones are kept, while the others are removed, so that the population size remains constant. As in the classical algorithm, we apply the crossover and mutation operators, but the delete and decomposition strategy is performed after each operation. The same process is repeated until the maximum number of iterations is reached. Figure 1 presents the overall architecture of the IARMGA algorithm, where IMAX is the maximum number of iterations.


Algorithm 3 Neighbors Determination

1: Input: Solution: S
   Output: Array: Neighbors
2: for each element i ∈ S do
3:    for each element j ∈ S do
4:       Neighbors[j] ← S[j]
5:    end for
6:    number ← Choose_Number(n)
7:    Neighbors[i] ← number
8: end for
9: return Neighbors

Algorithm 4 Local Search Algorithm

1: for i = 1 to IMAX_ls do
2:    Neighbors ← Neighbors_Determination(S)
3:    Delete_Decomposition_Strategy(Neighbors)
4:    for each ng ∈ Neighbors do
5:       Fitness(ng)
6:    end for
7:    S ← best_neighbors(Neighbors)
8: end for
9: return S

IV. A MEMETIC ALGORITHM FOR ASSOCIATION RULES MINING

This section presents the adaptation of IARMGA to IARMMA using the memetic approach. After the mutation step, a local search is applied to each generated solution. The local search process is performed by successively applying the neighborhood computation process. In the following, we give details about the main points of the algorithm.

A. Neighborhood search computation

The neighbors of a given solution $S$ are determined by replacing each element of $S$ by a number belonging to the range $[1..n]$, $n$ being the number of items. This way, at most $n$ neighbors will be created. The detailed algorithm of this process is presented in Algorithm 3.
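The neighborhood computation can be sketched in Python as follows. The function name `neighbors` is hypothetical. The sketch builds one neighbor per item position by copying the solution and overwriting that position with a random item in [1..n]; leaving the separator in element 0 untouched is an assumption made here so that every neighbor remains a well-formed encoded rule.

```python
import random
from typing import List


def neighbors(solution: List[int], n_items: int, rng: random.Random) -> List[List[int]]:
    """One neighbor per item position, each differing from `solution` in one item."""
    result = []
    for i in range(1, len(solution)):        # skip the separator (assumption)
        neighbor = list(solution)            # copy every element of the solution
        neighbor[i] = rng.randint(1, n_items)  # then replace the i-th one
        result.append(neighbor)
    return result
```

A neighbor may coincide with the original solution when the random draw returns the item already in place, and it may be non-admissible, which is why a repair step is still needed before evaluation.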

B. Local search process

The local search process consists in carrying out the search successively, so that each pass starts from the best solution of the previous pass. This process is repeated for at most IMAX_ls iterations. To avoid non-admissible solutions, the delete and decomposition strategy is applied to each generated neighbor. The search process returns the best solution found in the explored region. The general algorithm of this operation is shown in Algorithm 4.
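The loop of Algorithm 4 can be sketched generically in Python. The function `local_search` and its callback parameters `neighbors_fn`, `repair_fn` and `fitness_fn` are hypothetical names standing for the neighborhood computation, the delete and decomposition strategy and the fitness function. One deliberate deviation is labeled in the code: the sketch stops as soon as no neighbor improves on the current solution, whereas Algorithm 4 always moves to the best neighbor.

```python
from typing import Callable, List, TypeVar

S = TypeVar("S")


def local_search(start: S,
                 neighbors_fn: Callable[[S], List[S]],
                 repair_fn: Callable[[S], List[S]],
                 fitness_fn: Callable[[S], float],
                 max_iters: int) -> S:
    """Move repeatedly to the best repaired neighbor, for at most max_iters passes."""
    current = start
    for _ in range(max_iters):
        candidates: List[S] = []
        for neighbor in neighbors_fn(current):
            # The repair step may split one neighbor into several admissible ones.
            candidates.extend(repair_fn(neighbor))
        if not candidates:
            break
        best = max(candidates, key=fitness_fn)
        # Assumption (not in Algorithm 4): stop at a local optimum instead of
        # moving to a neighbor that is no better than the current solution.
        if fitness_fn(best) <= fitness_fn(current):
            break
        current = best
    return current


# Toy usage on integers: neighbors are x-1 and x+1, the optimum is x = 5.
toy_result = local_search(
    0,
    neighbors_fn=lambda x: [x - 1, x + 1],
    repair_fn=lambda x: [x],
    fitness_fn=lambda x: -(x - 5) ** 2,
    max_iters=10,
)
```

In the toy run the search climbs from 0 to 5 in five passes and then stops, since both neighbors of 5 have a lower fitness.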

C. The Memetic Algorithm principle

First, a random population is generated. As for IARMGA, the delete and decomposition strategy is performed, followed by the crossover and mutation operators. Afterwards, for each generated child, a local search is launched, aiming at returning the best solution located in its region. This process allows finding more and more pertinent rules for each region of the rule space. At the end of each pass of the algorithm, the selection operation returns the best solutions, which are used in the next generation. This process is repeated until the maximum number of iterations is reached. The general framework of IARMMA is shown in Figure 2.

[Figure 2: flowchart with the steps Crossover and Mutation, Delete and decomposition strategy, Local Search Process, Evaluation and selection; the loop exits when IMAX is reached (Yes) and repeats otherwise (No).]

Fig. 2. IARMMA Algorithm

[Figure 3: plot of the fitness F (vertical axis, from 0 to 0.045) for the series "Genetic".]

Fig. 3. Solution Quality of Genetic Approach with different number of iterations

V. EXPERIMENTAL STUDY

In order to validate the suggested approaches, intensive experiments were carried out. All programs were implemented in C++ and run on a machine with an i3 processor and 4 GB of memory. We first tune the parameters of both algorithms, IARMGA and IARMMA, including the size of the population and the number of iterations. The two algorithms are then compared using real scientific databases that are frequently used in the data mining community, such as the Frequent Itemset Mining Dataset Repository [17] and the Bilkent University Function Approximation Repository [16].

[Figure 4: plot of the fitness F (vertical axis, from 0 to 0.4) against IMAX (horizontal axis) for the series "Memetic".]

Fig. 4. Solution Quality of Memetic Approach with different number of iterations

[Figure 5: plot of the execution time (vertical axis, from 0 to 1800) against the number of transactions (horizontal axis) for the series IARMGA and IARMMA.]

Fig. 5. Execution time (sec) of the two approaches with different number of transactions

Recall that the solution quality is measured through the fitness function, which takes into account the support and the confidence of a rule (Section III-B). Figures 3 and 4 present the quality of the solutions returned by the two algorithms respectively. When the number of iterations is increased from 1 to 100, the quality of the solutions returned by the algorithms keeps increasing until it stabilizes at 80 iterations for IARMGA and 78 for IARMMA. Consequently, the maximum number of iterations is set to 80 for IARMGA and 78 for IARMMA.

Table I shows how the quality of the solutions returned by both approaches varies with the size of the population. The best quality for both algorithms is achieved when the size of the population is set to 60. This experiment leads us to set the population size of both algorithms to 60 individuals.

After having tuned the parameters (IMAX = 80 for IARMGA, IMAX = 78 for IARMMA, and pop_size = 60), we compare the performance of the algorithms in terms of running time and solution quality. Figure 5 reveals that IARMMA is considerably more time consuming than IARMGA when the number of input transactions is very large. In fact, the local search added in IARMMA makes the process respond very slowly.

However, Table II shows the efficiency of IARMMA compared to IARMGA in terms of solution quality. Indeed, as the number of transactions increases, a large gap between the outcomes of the two methods is observed, thanks to the local search incorporated in IARMMA.

TABLE I. SOLUTION QUALITY ACCORDING TO DIFFERENT POPULATION SIZES

| pop_size | IARMGA | IARMMA |
|---|---|---|
| 10 | 0.0025 | 0.0028 |
| 20 | 0.005 | 0.0078 |
| 30 | 0.01 | 0.013 |
| 40 | 0.0025 | 0.003 |
| 50 | 0.003 | 0.0034 |
| 60 | 0.02 | 0.02 |
| 70 | 0.002 | 0.003 |
| 80 | 0.005 | 0.007 |
| 90 | 0.007 | 0.02 |
| 100 | 0.005 | 0.01 |

TABLE II. SOLUTION QUALITY OF THE TWO ALGORITHMS ACCORDING TO DIFFERENT DATA INSTANCES

| Data set name | Number of transactions | Number of items | IARMGA | IARMMA |
|---|---|---|---|---|
| Bolts | 40 | 8 | 1 | 1 |
| Sleep | 56 | 8 | 1 | 1 |
| Pollution | 60 | 16 | 1 | 1 |
| Basket | 96 | 5 | 1 | 1 |
| IBM.Q.st | 1000 | 40 | 0.97 | 1 |
| Quake | 2178 | 4 | 0.95 | 0.99 |
| Chess | 3196 | 75 | 0.01 | 0.03 |
| Mushroom | 8124 | 119 | 0.008 | 0.03 |
| BMS-WEB-1 | 59602 | 497 | 0.0005 | 0.001 |
| Connect | 100000 | 999 | 0.015 | 0.02 |
| BMS-POS | 515597 | 1657 | 0.0007 | 0.0009 |
| WebDocs | 1692082 | 526765 | Blocked after 15 days | Blocked after 15 days |

The final result can be summarized by the fact that there is a compromise between execution time and solution quality for both tools. If we wish to find an acceptable solution in real time, IARMGA should be used, whereas if we seek solutions close to the optimal one, IARMMA should be selected.

VI. CONCLUSION

In this paper, two bio-inspired approaches for the association rules mining problem are proposed. The first one, called IARMGA, is an efficient genetic algorithm designed to find a set of acceptable rules in real time. The second one, called IARMMA, is a memetic algorithm developed for the same purpose. The memetic mechanism improves the quality of the returned rules. Both approaches are able to overcome the solution admissibility problem using an efficient strategy called "delete and decomposition". To analyze the behavior of the proposed approaches, several experiments were performed on both synthetic and real data sets. The results show that IARMGA is less time consuming than IARMMA, whereas the solutions returned by IARMMA are better than those found by IARMGA. We conclude that there is a compromise between CPU time and solution quality. The choice of the algorithm depends on the application at hand. If we want to extract association rules in real time, then IARMGA is more appropriate. However, if we desire more precision in the resulting rules, IARMMA is preferred. In addition, both algorithms were still blocked after 15 days of processing the WebDocs benchmark (the largest benchmark available on the web). Consequently, a parallel version of these approaches could bring more efficiency in handling such a big instance of transactions.

ACKNOWLEDGMENT: We are sincerely grateful to the reviewers, who provided us with interesting remarks to improve the quality of the paper.

REFERENCES

[1] R. Agrawal and J. Shafer, "Parallel mining of association rules", IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, 1996, pp. 962-969.

[2] J. H. Holland, "Genetic algorithms and the optimal allocation of trials", SIAM Journal on Computing, 2(2), 1973, pp. 88-105.

[3] J. M. Aurifeille, "A bio-mimetic approach to marketing segmentation: Principles and comparative analysis", European Journal of Economic and Social Systems, 14(1), 2000, pp. 93-108.

[4] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases", ACM SIGMOD Record, Vol. 22, No. 2, ACM, 1993.

[5] I. Fister Jr., X.-S. Yang, I. Fister, J. Brest, and D. Fister, "A Brief Review of Nature-Inspired Algorithms for Optimization", Elektrotehniski vestnik, 80(3), 2013.

[6] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation", ACM SIGMOD Record, Vol. 29, No. 2, ACM, 2000.

[7] Y. Djenouri, H. Drias, and A. Chemchem, "A hybrid Bees Swarm Optimization and Tabu Search algorithm for Association rule mining", in Nature and Biologically Inspired Computing, IEEE NaBIC, 2013, pp. 120-125.

[8] J. Mata, J. Alvarez, and J. Riquelme, "An Evolutionary algorithm to discover numeric association rules", in Proceedings of the ACM Symposium on Applied Computing (SAC), 2002, pp. 590-594.

[9] J. Mata, J. Alvarez, and J. Riquelme, "Mining numeric association rules with genetic algorithms", in Proceedings of the International Conference ICANNGA, 2001, pp. 264-267.

[10] X. Yan and C. Zhang, "Genetic algorithm based strategy for identifying association rule without specifying minimum support", Expert Systems with Applications, Vol. 36, No. 2, 2009, pp. 3066-3076.

[11] G. Hong and Y. Zhou, "An algorithm for mining association rules based on improved genetic algorithm and its application", Third International Conference on Genetic and Evolutionary Computing, IEEE Computer Science, 2009, pp. 117-120.

[12] M. Wang, Q. Zou, and C. Lin, "Multi dimensions association rules mining on adaptive genetic algorithm", International Conference on Uncertainty Reasoning and Knowledge Engineering, 2011.

[13] D. Liu, "Improved genetic algorithm based on simulated annealing and quantum computing strategy for association rule mining", Journal of Software, 2010.

[14] K. Indira and S. Kanmani, "Performance Analysis of Genetic Algorithm for Mining Association Rules", International Journal of Computer Science Issues, Vol. 9, No. 1, 2012.

[15] C. Romero, A. Zafra, J. Luna, and S. Ventura, "Association rule mining using genetic programming to provide feedback to instructors from multiple-choice quiz data", Journal of Expert System, 2012.

[16] H. A. Guvenir and I. Uysal, Bilkent University Function Approximation Repository, http://funapp.cs.bilkent.edu.tr, 2000.

[17] B. Goethals, Frequent Itemset Mining Implementations Repository, http://fimi.ua.ac.be/, 2004.
