applied soft computingstatic.tongtianta.site/paper_pdf/953a421e-5cec-11e9-affb... · 2019. 4....

A

GD

a

ARRAA

KDRA

1

icm

cOr(rcwtnratww

(

h1

Applied Soft Computing 36 (2015) 519–533

Contents lists available at ScienceDirect

Applied Soft Computing

j ourna l h o mepage: www.elsev ier .com/ locate /asoc

novel approach to adaptive relational association rule mining

abriela Czibula, Istvan Gergely Czibula ∗, Adela-Maria Sîrbu, Ioan-Gabriel Mirceaepartment of Computer Science, Babes -Bolyai University, 1, M. Kogalniceanu Street, 400084 Cluj-Napoca, Romania1

r t i c l e i n f o

rticle history:eceived 21 July 2014eceived in revised form 28 April 2015ccepted 30 June 2015vailable online 29 July 2015

a b s t r a c t

The paper focuses on the adaptive relational association rule mining problem. Relational association rulesrepresent a particular type of association rules which describe frequent relations that occur between thefeatures characterizing the instances within a data set. We aim at re-mining an object set, previouslymined, when the feature set characterizing the objects increases. An adaptive relational association rule

eywords:ata miningelational association ruledaptive algorithm

method, based on the discovery of interesting relational association rules, is proposed. This method,called ARARM (Adaptive Relational Association Rule Mining) adapts the set of rules that was establishedby mining the data before the feature set changed, preserving the completeness. We aim to reach theresult more efficiently than running the mining algorithm again from scratch on the feature-extendedobject set. Experiments testing the method’s performance on several case studies are also reported. The

t the e
obtained results highligh
. Introduction

It is well known that mining different kinds of data is of greatnterest in various domains such as medicine, bioinformatics, bioar-haeology, as it can lead to the discovery of useful patterns andeaningful knowledge.Association rule mining [6] means searching attribute-value

onditions that occur frequently together in a data set [25,42,44].rdinal association rules [8] are a particular type of association

ules. Given a set of records described by a set of characteristicsfeatures or attributes), the ordinal association rules specify ordinalelationships between record features that hold for a certain per-entage of the records. However, in real world data sets, featuresith different domains and relationships between them, other

han ordinal, exist. In such situations, ordinal association rules areot powerful enough to describe data regularities. Consequently,elational association rules were introduced in [40] in order to beble to capture various kinds of relationships between record fea-
ures. The DRAR method (Discovery of Relational Association Rules)as introduced for mining interesting relational association rulesithin data sets [40].
∗ Corresponding author.E-mail addresses: [email protected] (G. Czibula), [email protected]

I.G. Czibula), [email protected] (A.-M. Sîrbu), [email protected] (I.-G. Mircea).URLs: http://www.cs.ubbcluj.ro/ gabis (G. Czibula), http://www.cs.ubbcluj.ro/

istvanc (I.G. Czibula).1 Tel.: +40 264 405 327; fax: +40 264 591 906.

ttp://dx.doi.org/10.1016/j.asoc.2015.06.059568-4946/© 2015 Elsevier B.V. All rights reserved.

fficiency of the ARARM method and confirm the potential of our proposal.© 2015 Elsevier B.V. All rights reserved.

Relational association rule mining can be used in solving prob-lems from a variety of domains, such as: data cleaning, naturallanguage processing, databases, healthcare, bioinformatics, bioar-chaeology, etc. We have previously applied, so far, relationalassociation rule mining in different data mining tasks such as: med-ical diagnosis prediction [41], predicting if a DNA sequence containsa promoter region or not [15], software defect prediction [16], soft-ware design defect detection [17], data cleaning [8].

The method DRAR for relational association rule mining startswith a known set of objects, measured against a known set of fea-tures and discovers interesting relational association rules withinthe data set. But there are various applications where the object setis dynamic, or the feature set characterizing the objects evolves.Obviously, for obtaining the interesting relational association ruleswithin the object set in these conditions, the mining algorithmcan be applied over and over again, beginning from scratch, everytime when the objects or the features change. But this can beinefficient.

In this paper, we propose an adaptive relational associationrule method, named Adaptive Relational Association Rule Mining(ARARM), that is capable to efficiently mine relational associationrules within the object set, when the feature set increases with oneor more features. The ARARM method starts from the set of interest-ing rules that was established by applying DRAR before the featureset changed and adapts it considering the newly added features.
The result is reached faster than running DRAR again from scratchon the feature-extended object set.
We have to mention that the adaptive relational association rulemining method, proposed in this paper, is a novel approach. Thereexist in the data mining literature approaches which consider the

5 ft Com

abp

gTiepaotd

2

ivis

brr

av

iladtest

D(a

D

b

oriia

ssu

Ofiw

the attribute a9 in 88% of instances within the data set (i.e in 22

20 G. Czibula et al. / Applied So

daptive association rule mining process for particular problems,ut none of them deal with relational association rules as in ourroposal.

The remaining of the paper is organized as follows. A back-round on relational association rule mining is given in Section 2.he Adaptive Relational Association Rule Mining (ARARM) methods described in Section 3. Section 4 presents the experimentalvaluation of our approach and shows the efficiency of the pro-osed method on several case studies. An analysis of the adaptivepproach introduced in this paper, as well as a discussion on thebtained results and comparison to related work are given in Sec-ion 5. Section 6 outlines the conclusions of the paper as well asirections for further improvements.

. Background on relational association rule mining

There is a continuous interest in applying association rule min-ng [39] in order to discover relevant patterns and rules in largeolumes of data. Data mining methods [46,2] are applied in var-ous domains such as medicine, bioinformatics, bioarchaeology,oftware engineering.

In order to be able to capture various kinds of relationshipsetween record attributes, the definition of ordinal associationules from [8,7] was extended in [40] toward relational associationules.

In the following we will briefly review the concept of relationalssociation rules, as well as the mechanism for identifying the rele-ant relational association rules that hold within a data set.

Let R = {r1, r2, . . ., rn} be a set of instances (entities or recordsn the relational model), where each instance is characterized by aist of m attributes, (a1, . . ., am). We denote by �(rj, ai) the value ofttribute ai for the instance rj. Each attribute ai takes values from aomain Di, which contains the empty value denoted by ε. Betweenwo domains Di and Dj relations can be defined, such as: less (<),qual (=), greater or equal (≥), etc. We denote by M the set of all pos-ible relations that can be defined on Di x Dj and by A = {a1, . . ., am}he attribute set.

efinition 1. [40] A relational association rule is an expressionai1 �1 ai2 �2 ai3 . . .��−1 ai�

), where {ai1 , ai2 , ai3 , . . ., ai�} ⊆ A,

ij/= aik

, j, k ∈ {1 . . . � }, j /= k and �i ∈ M is a relation over Dij× Dij+1

,

ijis the domain of the attribute aij

. If:

a) ai1 , ai2 , ai3 , . . ., ai�occur together (are non-empty) in s% of the n

instances, then we call s the support of the rule,and

) we denote by R′ ⊆ R the set of instances where ai1 , ai2 , ai3 , . . ., ai�occur together and �(r′, aij

) �1 �(r′, aij+1) is true ∀ 1 ≤ j ≤ � −1

and for each instance r′ from R′; then we call c = |R′|/|R| the con-fidence of the rule.

The length of a relational association rule is given by the numberf attributes in the rule. Users usually need to uncover interestingelational association rules that hold within a data set; they arenterested in relational rules which hold in a minimum number ofnstances, that are rules with support at least smin, and confidencet least cmin (smin and cmin are user-provided thresholds).

A relational association rule in R is called interesting [40] if itsupport s is greater than or equal to a user-specified minimumupport, smin, and its confidence c is greater than or equal to a
ser-specified minimum confidence, cmin.
In [7] an A-Priori [1] like algorithm, called DOAR (Discovery ofrdinal Association Rules), was introduced in order to efficientlynd all ordinal association rules (i.e. relational association rules inhich the relations are ordinal) of any length, that hold over a data

puting 36 (2015) 519–533

set. The DOAR algorithm was proven to be correct and completeand it efficiently explores the search space of the possible rules [7].

The DOAR algorithm was further extended in [40,15] towardthe DRAR algorithm (Discovery of Relational Association Rules) forfinding interesting relational association rules, i.e. association ruleswhich are able to capture various kinds of relationships betweenrecord attributes. The DRAR algorithm provides two functionali-ties: (a) it finds all interesting relational association rules of anylength; (b) it finds all maximal interesting relational associationrules of any length, i.e. if an interesting rule r of a certain length lcan be extended with one attribute and it remains interesting (itsconfidence is greater than the threshold), only the extended rule iskept.

So far, relational association rules were successful in differentdata mining tasks in domains like: medicine (for diagnosis predic-tion [41]), bioinformatics (for predicting if a DNA sequence containsa promoter region or not [15], software engineering [16,17], as wellas for data cleaning tasks [8].

2.1. Example

It is well known that mining medical data (like the exampleconsidered in the following) is of great interest in modern medicine,as it can lead to the discovery of useful patterns and knowledgethat can be important for the diagnosis and treatment of differentdiseases. That is why researchers are still focusing on applying datamining techniques to medical data in order to discover interestingpatterns [13].

In order to better explain the concept of relational associationrules and the DRAR algorithm [40] that is used for discovering inter-esting relational association rules, we give an example that illus-trates how it can be applied on a medical data set sample. The dataset considered in our experiment is a subset of the breast cancerdatabase obtained from the University of Wisconsin Hospitals,Madison, Dr. William H. Wolberg. The file for this experiment wasobtained from [35].

The instances (entities) in this experiment are patients: eachpatient is identified by 9 attributes [47]. The attributes representmeasurements from malignant and benign tumors and have inte-ger values between 1 and 10. Each instance has one of 2 possibleclasses: benign or malignant.

In this example we have considered only the first 25 recordsrepresenting “malignant” instances. The data set considered in ourexample is given in Table 1.

As all attributes in the experiment have integer values, we havedefined three possible binary relations between integer valuedattributes: =, <, >.

We executed the DRAR algorithm with a minimum supportthreshold of 1 and a minimum confidence threshold of 0.65. Thediscovered interesting relational rules are shown in Table 2 andthe maximal interesting association rules are given in Table 3. Foreach discovered rule, its confidence is also provided. The attributescharacterizing the instances are denoted by a1, a2, . . .,a9.

Each line from Table 2 expresses a relational association rule ofa certain length, which was discovered in the data set indicated inTable 1 with a specified confidence. For example, the first line inTable 2 refers to the relational association rule a1 > a9 of length 2(i.e the rule contains two attributes) having a confidence of 0.88.That is, the value of the attribute a1 is greater than the value of

instances).As it can be seen in the results above, interesting relational asso-

ciation rules can be discovered within the set of malignant patients.Further analysis of these relational association rules may give rel-evant information regarding the diagnosis process.

G. Czibula et al. / Applied Soft Com

Table 1Sample data set.

a1 a2 a3 a4 a5 a6 a7 a8 a9

7 5 10 10 10 10 4 10 310 8 8 2 3 4 8 7 8

5 6 7 8 8 10 3 10 35 3 3 1 3 3 3 3 35 5 5 2 5 10 4 3 1

10 10 10 3 10 8 8 1 15 6 6 2 4 10 3 6 14 1 1 3 1 5 2 1 15 10 10 3 7 3 8 10 25 7 9 8 6 10 8 10 17 3 2 10 5 10 5 4 48 10 10 10 5 10 8 10 68 10 10 1 3 6 3 9 18 4 4 1 6 10 2 5 23 10 3 10 6 10 5 1 48 10 10 10 6 10 10 10 105 10 10 10 10 2 10 10 107 6 6 3 2 10 7 1 1

10 2 2 1 2 6 1 1 210 4 4 10 6 10 5 5 1

7 6 10 5 3 10 9 10 25 10 10 10 6 10 6 5 2

10 10 10 2 10 10 5 3 310 10 7 8 7 1 10 10 3

3 7 7 4 4 9 4 8 1

Table 2Interesting relational association rules.

Length Rule Confidence

2 a1 > a9 0.882 a2 = a3 0.722 a2 > a9 0.722 a3 > a9 0.682 a4 > a9 0.682 a5 < a6 0.722 a5 > a9 0.722 a6 > a7 0.722 a6 > a9 0.82 a7 > a9 0.72

3 a1 > a9 < a2 0.683 a1 > a9 < a3 0.683 a1 > a9 < a5 0.683 a1 > a9 < a6 0.763 a1 > a9 < a7 0.683 a2 > a9 < a3 0.683 a2 > a9 < a5 0.683 a2 > a9 < a6 0.683 a5 > a9 < a6 0.683 a9 < a6 > a7 0.723 a6 > a9 < a7 0.68

4 a1 > a9 < a6 > a7 0.68

Table 3Maximal interesting relational association rules.

Length Rule Confidence

2 a2 = a3 0.722 a4 > a9 0.682 a5 < a6 0.72

3 a1 > a9 < a2 0.683 a1 > a9 < a3 0.683 a1 > a9 < a5 0.683 a1 > a9 < a7 0.683 a2 > a9 < a3 0.683 a2 > a9 < a5 0.683 a2 > a9 < a6 0.683 a5 > a9 < a6 0.683 a6 > a9 < a7 0.68

4 a1 > a9 < a6 > a7 0.68

puting 36 (2015) 519–533 521

3. Methodology

In this section, we introduce the adaptive relational associa-tion rule mining approach, as well as an algorithm called ARARM(Adaptive Relational Association Rule Mining) that is capable to effi-ciently mine relational association rules within a data set, when thefeature set increases with one or more features.

Let us consider a data set R = {r1, r2, . . ., rn} consisting of high-dimensional instances (objects). Each instance is characterized by alist of m attributes (also called features), (a1, . . ., am) and is thereforedescribed by an m-dimensional vector ri = (ri1, . . ., rim). Differenttypes of relations can be defined between the values of the featurescharacterizing the instances from the data set. We denote by Relthe set of all possible relations that can be defined between thefeatures values. As presented in Section 2, interesting relationalassociation rules that are able to express relations (from the setRel) between the features values may be discovered using the DRARmethod [15].

The measured set of features is afterwards extendedwith s (s ≥ 1) new features, numbered as m + 1, m + 2, . . .,m + s. After extension, the objects’ feature vectors becomerexti

= (ri1, . . ., rim, ri,m+1, . . .ri,m+s), 1 ≤ i ≤ n. The set of extendedinstances is denoted by Rext = {rext

1 , rext2 , . . ., rext

n }.Considering certain minimum support and confidence thresh-

olds (denoted by smin and cmin), we want to analyze the problem ofmining interesting relational association rules within the data setRext, i.e. after object extension, and starting from the set of rules dis-covered in the data set R before the feature set extension. We aimat obtaining a better time performance with respect to the miningfrom scratch process. We denote in the following by RAR the setof interesting relational association rules having a minimum sup-port and confidence within the data set R, and by RARext the set ofinteresting relational association rules having a minimum supportand confidence within the extended data set Rext.

Certainly, the newly arrived features can generate new rela-tional association rules. The new set of rules RARext could be ofcourse obtained by applying the DRAR method from scratch onthe set of extended objects. But we try to avoid this process andreplace it with one less expensive, but preserving the completenessof the rule generation process. More specifically, we will proposea method called ARARM (Adaptive Relational Association Rule Min-ing), which starts from the set RAR of rules mined from the dataset before feature extension and adapts it (considering the newlyadded features) in order to obtain the set of interesting relationalassociation rules within the set of extended objects Rext. Definitely,through the adaptive process, we want to preserve the complete-ness of the DRAR method.

Let us denote by l the maximum length of the rules from the setRAR and by RARk (1 ≤ k ≤ l) the set of interesting relational associa-tion rules of length k discovered in the data set R (before the featureset extension). Obviously, RAR =

⋃lk=1RARk.

In the following we give a brief description of the idea of discov-ering the set RARext through adapting the set RAR of rules minedin the data set R before feature extension.

The ARARM algorithm identifies the interesting relational asso-ciation rules using an iterative process that consists in length-levelgeneration of rules, followed by the verification of the candidatesfor minimum support and confidence compliance. ARARM performsmultiple passes over the data set Rext. In the first pass, it calculatesthe support and confidence of the 2-length rules and determineswhich of them are interesting, i.e. verify the minimum support and
confidence requirements. Every subsequent pass over the data con-sists of two phases. The k-length (k ≥ 2) rules from Rext will certainlycontain the k-length rules from RAR (the interesting rules discov-ered in the data set before extension) – if such rules exist. But,there is another possibility to obtain a k-length rule in the extended

522 G. Czibula et al. / Applied Soft Com

dktodtmwi

caggaow[t

together with their mirrors). For optimization reasons, when apply-

Fig. 1. The candidate generation process in the ARARM algorithm.

ata set, through generating a candidate rule through joining two − 1-length rules from RARext (generated at the previous itera-ion). During the second phase, a scan over Rext is performed inrder to compute the actual support and confidence of the candi-ate rules generated as described above. At the end of this step,he algorithm keeps the rules that are deemed interesting (have

inimum support and satisfy the confidence requirements), whichill be used in the next iteration. The process stops when no new

nteresting rules were found in the latest iteration.At a certain iteration performed by the ARARM algorithm, the

andidate generation process (denoted by GenCandidates in thelgorithm from Fig. 2) is essentially the same as the candidateseneration process of the DRAR method [7] (the particularities ofenerating the candidates in the ARARM algorithm will be discussedfterwards). More specifically, for joining two k − 1 length rules in
rder to obtain a k-length candidate rule, there are four possibilitieshich are illustrated in Fig. 1. Similarly to the proof presented in
7] it may be proven that the candidate generation process ensureshe correctness and completeness of the ARARM algorithm.

Fig. 2. The ARARM

puting 36 (2015) 519–533

In the following we denote by �−1 the inverse of the relationdenoted by �.

Regarding the binary relations that may be defined between theattributes domains, we mention that we do not assume any par-ticular property (such that the transitivity property), both DRARand ARARM are working with general relationships between theattributes domains. Supposing we have three attributes a1, a2 anda3 and the set of relations { < , > , = }, the relational association rulesr1 = (a1 < a2 < a3) and r2 = (a1 < a3 > a2) are viewed as distinct rules(otherwise, if the transitivity of the relation < would be consid-ered, then r2 would be seen as a generalization of r1). Furthermore,there are no constraints placed on the relationships. Thus, therules a1 = a2 < a3 and a2 = a1 < a3 are considered by ARARM distinctrules (even if they are equivalent since = is a symmetric relation).Obviously, if one would need to deal with relations having particu-lar properties (e.g transitive or symmetric), after the set of relationalassociation rules are discovered by ARARM (or DRAR), a filtering stepmay be added. This step may be used to remove equivalent rules,like those from the examples above.

Fig. 2 gives the ARARM algorithm.In the algorithm presented in Fig. 2, by ⊕ we have denoted

a special “add” operation. If r is a relational association rule andL is a set of relational association rules, by r ⊕ L we refer to theset of relational association rules obtained by adding r to L if Ldoes not contain the “mirror” rule of r. If L contains the “mirror”of r, then r ⊕ L equals to L. We mention that by the “mirror” ofa relational association rule r ≡ (a1�1a2�2 . . . �n−1an) we refer tothe rule r−1 ≡ (an�−1

n−1. . .�−12 a2�−1

1 a1) (e.g the “mirror” of the rulea1 < a2 > a3 is the rule a3 < a2 > a1). The special operation ⊕ we haveused in constructing the set of relational association rules assuresthat the resulting set does not contain duplicate rules (i.e rules

ing the ⊕ operation, the verification if the “mirror” of rule r is in L isskipped, when appropriate. For example, the candidate generationprocess may generate a “mirror” of a rule only if the set Rel of rela-tions used in the mining process contains at least a rule � together

algorithm.

G. Czibula et al. / Applied Soft Computing 36 (2015) 519–533 523

Candi

wt

dwprficIFikf

persfdatwtoTfdieTdra

tadftait

Fig. 3. The Gen

ith its inverse �−1. Only in this case, when adding a rule r withhe ⊕ operation into a set L we must verify if the “mirror” of r is in L.

The most important step in the ARARM algorithm is the candi-ate generation process (denoted by the GenCandidates function),hich is also computationally expensive. This function has as aarameter a set Rules of k-length relational association rules andeturns a set of k + 1 length relational association rules generatedrom Rules through the join operations depicted in Fig. 1. The maindea of the candidate generation process is the following. All distinctombinations of two rules (r1, r2) from the set Rules are considered.f r1 and r2 match for join (in one of the four cases indicated inig. 1, the rule rjoin obtained by joining r1 and r2 is constructed ands added to the resulting set of rules. Obviously, since r1 and r2 are-length rules, rjoin will be a k + 1-length rule. The GenCandidatesunction is described in Fig. 3.

It has to be stated that running the ARARM method with m = 0rovides the set of interesting relational association rules discov-red in the input data set of s-dimensional entities. Thus, thisunning is equivalent with applying the DRAR method on the dataet of s-dimensional entities. It can be seen in the GenCandidatesunction (Fig. 3) that two rules are joined during the adaptive can-idate generation process only if at least one rule has at least anttribute from the additional attributes, which are present only inhe enlarged attribute set. Obviously, it is not necessary to join ruleshich contain only attributes from the original attribute set, since

he joint rules are already known (these rule are in the set RARf relational rules discovered in the set of m-dimensional entities).his way, when generating the k-length relational association rulesrom the extended data set of m + s dimensional entities, the candi-ate generation process (expressed by the GenCandidates function)

s not applied on the set of k − 1 length rules from the set off inter-sting rules extracted from the data set of m-dimensional entities.hus, unlike in the DRAR algorithm applied from scratch on the m + simensional entities, the join operations between the k − 1 lengthules from the data set of entities before the attribute set extensionre skipped.

The time savings in the ARARM execution time come from theime reduction of the candidate generation process as well as fromn reduced number of support and confidence computations (aetailed analysis will be given in Section 4.4). Obviously, as seen
rom Fig. 3, the step of computing the support and confidence forhe rules from the set RAR is skipped, since for these rules welready have their support and confidence. Certainly, the reductionn the execution time of the ARARM increases with the increase ofhe set RAR. This usually happens when decreasing the number s
dates function.

of added attributes. A smaller number of added attribute means alarger set of already known rules and this implies a smaller numberof rules generated by GenCandidates function, as well as less timefor support and confidence computations.

In the current implementation of the ARARM method, we did notdeal with optimizing the rules matching step in the candidate gen-eration process described in Fig. 1. Since the same implementationfor this step is used both in the ARARM and DRAR implementation, amore efficient implementation of it will lead to lower running timesfor both methods. Still, it will not significantly impact the improve-ment in running time (in %) of ARARM with respect to DRAR (whatwe are interested in). We will further investigate optimizations ofthe rules matching step in order to increase the efficiency of thecandidate generation process (e.g using efficient data structuressuch as hash tables or canonical forms).

The current implementation of the ARARM algorithm providesthe functionality of discovering all interesting relational associationrules of any length, as well as the functionality of finding all maxi-mal interesting relational association rules of any length. Moreover,we have parameterized the set of features used in the mining pro-cess, thus it is very easy to remove certain features from the miningprocess.

4. Experimental evaluation

In order to show the effectiveness of the adaptive relationalassociation rule mining method that was introduced in Section 3,we consider in the following two case studies that will be furtherdescribed. All the experiments presented in this section were car-ried out on a PC with an Intel Core i7 Processor at 1.87 GHz, with8 GB of RAM.

We have to mention that the adaptive method we propose in thispaper is appropriate for processes (like gene expression or bioarche-ological tasks which will be considered in Sections 4.2 and 4.3) thathave a temporal evolution and are conducted over periods of time.The use of the ARARM method is justified in such cases since theattributes characterizing the instances are obtained progressively.The data sets we will consider for evaluating the performance ofthe ARARM methods are not important, but the obtained results arerelevant, since our goal is only to show that the adaptive relational
association rule mining process is less expensive (from the runningtime point of view) than the non-adaptive one. Still, the first twocase studies we have chosen for evaluation (a gene expression dataand a human skeletal remains data) are appropriate for the adaptivemining process, since there is a temporal evolution of the attribute

5 ft Com

srepd

4

ihwwvamem

1

2

twnt

aaaeotaudAceiAm

tiwtocb

p

truteIsto


et characterizing the instances. We are currently working on usingelational association rules on these two problems, and thereforemerged the need of the adaptive relational association rule miningrocess. The last two data sets used in our experiments are largerata sets synthetically generated.

.1. Experiments

In the experiments we have performed for discovering thenteresting relational association rules within the data sets, weave initially considered m attributes characterizing the instancesithin the data set and afterwards the set of features was extendedith s attributes. The experiments were made considering different

alues for the minimum support and confidence thresholds (sminnd cmin) and different type of relational association rules, i.e maxi-al rules vs. all rules (as indicated in Section 2). For each performed

xperiment, the set of interesting relational association rules on the + s dimensional instances were obtained in two ways:

. by applying the DRAR method from scratch on the data set afterthe feature set extension (containing all m + s features).

. by adapting (through the ARARM algorithm) the rules obtainedon the data set before the feature set extension (containing mfeatures).

We mention that the same set of interesting relational associa-ion rules is discovered in data, independent to the way the rulesere generated (1 or 2), but, obviously, we are expecting the run-ing time of the adaptive algorithm to be lower than the runningime of the DRAR method applied from scratch.

For all experiments that will be presented in the following, weim at emphasizing that ARARM has a lower running time that DRARpplied from scratch on the set of m + s dimensional entities. Thedaptive method we propose in this paper uses intermediary inter-sting relational association rules in order to obtain subsequentnes, thus avoiding the need to start the mining algorithm fromhe very beginning each time. Due to the temporal evolution of thettribute set, at a given time t when there are available the val-es of the first m attributes for the instances, the set RAR of rulesiscovered in the data set of m-dimensional entities are extracted.ssuming that at time t + 1 new values for the attributes are pro-ured, the ARARM method uses the existing rules from RAR andfficiently obtains new accurate relational association rules thatnclude the latest available data. Thus, the running time of theRARM algorithm (Fig. 2) is not influenced by the time needed toine the rules on the data set before the attribute set extension.For implementation, we have developed a Java API which allows

he discovery, in an adaptive manner (using the ARARM algorithmntroduced in Section 3), of interesting relational association rules

hich occur within a data set. The API can be used to uncover (adap-ive) relational association rules in various data sets, independentf the objects (instances) within the data set, the type of featuresharacterizing the instances, as well as the relations that are definedetween the features.

The experiments that will be presented in Sections 4.2–4.4 wereerformed using the ARARM API.

In all the data sets which will be further considered for evalua-ion, we are dealing with numerical attributes. Before applying theelational association mining process, we first normalize the datasing the min-max normalization method. For all the experiments,wo possible relations between the attributes values are consid-
red in the relational association rule mining process: Rel = {≤, >}.t has to be noted that for the first human skeletal remains dataet which contains missing attribute values, the minimum supporthreshold smin used in the mining process is less than 1, for all thether data sets smin is set to 1.
puting 36 (2015) 519–533

Regardless of the threshold cmin value, the ARARM methodadaptively mines the set of interesting relational association ruleshaving a minimum confidence of cmin within a data set, when newattributes are added to the data set. Even if the minimum confi-dence threshold value is irrelevant, in our experiments we haveselected the value for cmin such that the number of discovered rulesin data to be large enough. This way, with a reasonable number ofrules, the time reduction of the ARARM algorithm with respect toDRAR may be better illustrated. Heuristics for selecting appropri-ate values for the threshold cmin would be useful when applyingthe ARARM algorithm in concrete data mining tasks.

4.2. Gene expression data

The first experiment used in our evaluation is related to theproblem of mining gene expression data, a problem of major impor-tance within the bioinformatics and the genomic research. In thisfirst experiment we used a subset of a real-life gene expression dataset [21], which contains the levels of expression of 65 genes belong-ing to the Saccharomyces cerevisiae organism, measured during itsmetabolic shift from fermentation to respiration. The genes withinthe data set are characterized by 7 attributes which represent geneexpression levels measured at seven time points: 0, 9, 11.5, 13.5,15.5, 18.5 and 20.5 h. These measurements are numerical valuesobtained progressively, that is why the adaptive mining process isappropriate.

In the experiments we have performed in order to discoverinteresting relational association rules within the gene expressiondata set, we have initially considered m attributes (measurementsof gene expression levels) and afterwards the set of features wasextended with s attributes.

The experiments were made (as indicated in Section 4.1) byapplying the ARARM algorithm (the adaptive method introducedin Section 3) and the DRAR algorithm (applied from scratch afterthe feature set extension). Different values for the minimum confi-dence threshold cmin, different type of relational association rules(maximal rules vs. all rules) and different combinations of valuesfor m and s are considered in the experiments performed on thegene expression data set. These values as well as the comparativeresults obtained through the performed experiments are given inTable 4. The running times are given in milliseconds and for eachexperiment, the lower running time value is marked with bold. Inthis table, nm denotes the number of rules discovered in the dataset before the attribute set extension (containing m attributes) andnm+s represents the number of rules mined in the data set after theattribute set extension (containing m + s attributes).

Table 4 illustrates the performance of the ARARM method, foreach of the performed experiment. It can be observed that the timeneeded to obtain the rules adaptively is less than the time neededto obtain the rules from scratch.

4.3. Human skeletal remains data

We also experimented the ARARM algorithm on two humanskeletal remains data sets. The data sets contain instances rep-resenting skeletal remains which are represented by a list ofbone measurements. The considered case studies are relevant withrespect to the present research since the measurements (attributes)are usually obtained incrementally in time as some of the attributesare computed earlier in the process than others.

4.3.1. First data setThe first data set contains actual data representing the encod-

ing of 12 measurements concerning the skull and pelvis of humanremains from several archeological sites. The data set containingmeasurements for 28 human skeletal remains was obtained from


Table 4Results for the gene expression data set for smin = 1.

Experiment cmin No. ofattributes (m)

No. of addedattributes (s)

Type of rules No. of rules nm No. of rulesnm+s

Time fromscratch (ms)

Time adaptive (ms)

13 0.3 3 4 Maximal 3 53 118 3814 0.3 4 3 Maximal 6 53 118 3215 0.3 5 2 Maximal 18 53 118 19

4 0.3 6 1 Maximal 35 53 118 12

16 0.3 3 4 All 6 114 31 3317 0.3 4 3 All 14 114 31 1518 0.3 5 2 All 35 114 31 13

4 0.3 6 1 All 67 114 31 4

1 0.4 3 4 Maximal 3 33 49 132 0.4 4 3 Maximal 6 33 49 123 0.4 5 2 Maximal 15 33 49 124 0.4 6 1 Maximal 22 33 49 10

4 0.4 3 4 All 6 64 20 95 0.4 4 3 All 12 64 20 76 0.4 5 2 All 24 64 20 64 0.4 6 1 All 41 64 20 2

7 0.5 3 4 Maximal 3 21 28 158 0.5 4 3 Maximal 5 21 28 79 0.5 5 2 Maximal 10 21 28 64 0.5 6 1 Maximal 15 21 28 4

10 0.5 3 4 All 3 38 11 7

12

tBs

itcecsds

(a(eav

t“liemia

ifi

4

icA

11 0.5 4 3 All

12 0.5 5 2 All

4 0.5 6 1 All

he Babes -Bolyai University Interdisciplinary Research Institute onio & Nano Sciences [20]. We have to mention that within this dataet there are attributes with missing values.

In the experiments we have performed for discovering thenteresting relational association rules within the human skele-al remains data set described in Section 4.3.1, we have initiallyonsidered m attributes and subsequently the set of features wasxtended with s new features. The data set was divided in twolasses of instances: the set of instances representing males and theet of instances representing females. We have considered indepen-ently the two classes, since each class is characterized by its ownet of interesting relational association rules.

As for the gene expression data set, the experiments were madeas indicated in Section 4.1) by applying the ARARM algorithm (thedaptive method introduced in Section 3) and the DRAR algorithmapplied from scratch after the extension of the feature set), consid-ring different values for the minimum confidence threshold cminnd different type of relational association rules, i.e maximal ruless. all rules.

We are indicating in Table 5 the comparative results obtainedhrough experiments performed on instances belonging to bothMale” and “Female” classes. The running times are given in mil-iseconds and for each experiment, the lower running time values marked with bold. By nm we denote the number of rules discov-red in the data set before the attribute set extension (containing

attributes) and by nm+s we denote the number of rules minedn the data set after the attribute set extension (containing m + sttributes).

From Table 5 it can be observed that, for all the performed exper-ments, the ARARM method is more effective than the DRAR appliedrom scratch, since the time needed to obtain the rules adaptivelys less than the time needed to obtain the rules from scratch.

.3.2. Second data setThe second data set contains bone measurements published

n a paper which is a milestone for forensic analysis [43]. Itontains seven bone measurements for 92 Caucasians and 92fro-American males and females. This data set was previously

8 38 11 54 38 11 42 38 11 2

used in the literature for the estimation of stature from the lengthof long bones of the free limbs [43].

In the experiments we have performed for discovering theinteresting relational association rules within the human skeletalremains data set described in Section 4.3.2, m attributes are ini-tially considered. Subsequently, the set of attributes was extendedwith s new attributes. The data set was divided in four classes ofinstances: the set of instances representing caucasian males, a set ofinstances representing Caucasian females, a set of instances repre-senting Afro-American males, and the set of instances representingAfro-American females. We have considered independently the fourclasses, since each class is characterized by its own set of interestingrelational association rules.

We are indicating in Table 6 the results obtained through exper-iments performed on instances belonging to the “Caucasian Male”and “Caucasian Female” classes. Table 7 shows the results obtainedthrough experiments performed on instances belonging to the“Afro-american Male”and “Afro-american Female” classes. The run-ning times are given in milliseconds and for each experiment, thelower running time value is marked with bold. The column denotedby nm contains the number of rules discovered in the data setbefore the attribute set extension (containing m attributes) andthe column denoted by nm+s contains the number of rules minedin the data set after the attribute set extension (containing m + sattributes).

For all the experiments indicated in Tables 6 and 7, the ARARMmethod is more effective than DRAR from scratch, since the timeneeded to obtain the rules adaptively is less than the time neededto obtain the rules from scratch.

In order to highlight the usefulness of the relational associationrules mined in the data sets of Caucasian males and females, welooked into the obtained set of rules. We mention that the rulesare discovered in data after applying the min–max normalization
to scale the data in [0,1]. We noticed, for example, that the ruler = (a2 > a3 ≤ a4) was discovered with a confidence of 0.77 in theset of caucasian females, but was not discovered in the set of cau-casian males when the minimum confidence threshold was set to0.55. The obtained rule has the following interpretation. For 77%

526 G. Czibula et al. / Applied Soft Computing 36 (2015) 519–533

Table 5Results for the first human skeletal data set for smin = 0.3.

Experiment cmin Class No. ofattributes (m)




Time adaptive(ms)

1 0.3 Male 6 6 Maximal 8 34 58 222 0.3 Male 8 4 Maximal 8 34 58 133 0.3 Male 10 2 Maximal 13 34 58 12

4 0.3 Male 6 6 All 14 66 11 85 0.3 Male 8 4 All 14 66 11 86 0.3 Male 10 2 All 28 66 11 4



13 0.3 Female 6 6 Maximal 3 39 83 3114 0.3 Female 8 4 Maximal 5 39 83 2615 0.3 Female 10 2 Maximal 18 39 83 16

16 0.3 Female 6 6 All 5 105 16 1217 0.3 Female 8 4 All 14 105 16 1118 0.3 Female 10 2 All 42 105 16 10



Table 6Results for the “CaucasianMale” and “CaucasianFemale” classes for smin = 1.





Time adaptive(ms)








22 0.4 Female 3 4 All 4 86 49 16

o(atiCtf

23 0.4 Female 4 3 All


f the Caucasian females, the normalized length of the radius bonea2) is greater than the normalized length of the ulna bone (a3)nd the normalized length of the ulna bone (a3) is less or equalo the normalized length of the femur (a4). Much more, this rule
s not valid for a significant percentage of the caucasian males.onsequently, the rule r may be very useful in a gender detectionask, for discriminating between Caucasian males and Caucasianemales.
11 86 49 725 86 49 6

4.4. Synthetic data

In the following we aim at testing the performance of the ARARMalgorithm on larger data sets. We have to mention that the time
performance of the ARARM algorithm for adaptive relational associ-ation rule mining depend not only on the dimensionality of the dataset (number of instances and attributes), but mainly on the numberof interesting relational associations rules which were discovered


Table 7Results for the “Afro-americanMale” and “Afro-americanFemale” classes for smin = 1.





Time adaptive(ms)








22 0.4 Female 3 4 All 4 72 19 10

isorl

aiaOmi

ceafastwmdtia

tarae

mTa



n data. Since not the data set is relevant in our study, but its dimen-ionality, we have considered two synthetic data sets which werebtained by merging publicly available data sets from the NASAepository [33] which are used in the software defect predictioniterature.

Our goal is to investigate the effectiveness of the ARARMlgorithm when (1) the number of attributes characterizing thenstances within the data sets is large (the first synthetic data)nd when (2) the number of instances within the data set is large.bviously, in both cases, the number of relational association rulesined in data may be very large and this also depends on the min-

mum confidence threshold.The first synthetic data set considered in our experiments

onsists of 125 instances characterized by 116 attributes. Thexperiments were performed by applying the adaptive ARARMlgorithm and the DRAR algorithm applied from scratch after theeature set extension. Different values for the initial number ofttributes (m) and the minimum confidence threshold cmin are con-idered for discovering all the relational association rules withinhe data set. The obtained results are presented in Tables 8 and 9here the improvement achieved through the adaptive process isarked with bold. In these tables, nm denotes the number of rules

iscovered in the data set before the attribute set extension (con-aining m attributes) and nm+s represents the number of rules minedn the data set after the attribute set extension (containing m + sttributes).

From Tables 8 and 9 one can observe that, for all the experiments,he ARARM method has a lower running time than the DRAR methodpplied from scratch. Even if the number of relational associationules mined in the data set is large, our adaptive method obtains anverage improvement in running time of 21%, which indicates the
ffectiveness of the ARARM method.
The second synthetic data set used for evaluating the ARARMethod consists of 2366 instances characterized by 38 attributes.

he experiments were performed by applying the adaptive ARARMlgorithm and the DRAR algorithm applied from scratch after the

11 72 19 622 72 19 4

feature set extension using a minimum confidence threshold of0.82. A number of 38, 811 interesting relational association ruleswere discovered in data. Different values for the initial number ofattributes (m) are considered in our experiments. Table 10 presentsthe performed experiments and the obtained results. For all the per-formed experiments, there is an improvement achieved throughthe adaptive process and it is marked with bold. In this table nm

denotes the number of interesting relational association rules dis-covered in the data set before the feature set extension (containingm features) and nm+s represents the rules obtained in the extendeddata set (with m + s features).

Table 10 reveals that, for all the experiments performed onthe second synthetic data, the ARARM method is more perfor-mant (from the running time point of view) than the DRAR methodapplied from scratch.

For a better analysis of the ARARM performance on the secondsynthetic data set, the running time (both for the ARARM and DRARmethods) was decomposed in: time for support and confidencecomputations and time for the candidate generation process. Theobtained values are indicated in Table 11. nscscratch and nscadaptiverepresent the number of support and confidence computations per-formed by the DRAR method applied from scratch and by the ARARMmethod, respectively. By tscscratch and tscadaptive we denote the timeneeded for the support and confidence computations performed bythe DRAR and ARARM methods. The time needed for the candidategeneration process is represented by tcgscratch (for DRAR) and bytcgadaptive (for ARARM).

Analyzing the results indicated in Table 11 we observe thatthe time for support and confidence computations performed byARARM decreases as the number rules found on the data set ofm-dimensional instances is large enough and increases. The results
reveal that the performance of ARARM (with respect to DRAR) is sig-nificant when the number nm of rules is large. Fig. 4 illustrates, foreach experiment performed on the second synthetic data, the per-centage of execution time used by ARARM and DRAR for confidenceand support computations, as well as for the candidate generation

528 G. Czibula et al. / Applied Soft Computing 36 (2015) 519–533

Table 8Results for the first synthetic data set for smin = 1.



No. of rules nm No. of rulesnm+s


Time adaptive(ms)

Improvementin running time(%)

1 0.96 26 90 23 13,437 3350 3000 10.452 0.96 30 86 40 13,437 3350 2908 13.193 0.96 34 82 64 13,437 3350 2928 12.64 0.96 38 78 208 13,437 3350 2918 12.95 0.96 42 74 310 13,437 3350 2940 12.246 0.96 46 70 366 13,437 3350 2932 12.487 0.96 50 66 366 13,437 3350 2924 12.728 0.96 54 62 369 13,437 3350 2897 13.529 0.96 58 58 399 13,437 3350 2907 13.22

10 0.96 102 14 3688 13,437 3350 2684 19.8811 0.96 103 13 4050 13,437 3350 2615 21.9412 0.96 104 12 4642 13,437 3350 2518 24.8413 0.96 105 11 4646 13,437 3350 2472 26.2114 0.96 106 10 4815 13,437 3350 2466 26.3915 0.96 107 9 4920 13,437 3350 2457 26.6616 0.96 108 8 5140 13,437 3350 2408 28.1217 0.96 109 7 5150 13,437 3350 2387 28.7518 0.96 110 6 5412 13,437 3350 2295 31.4919 0.96 111 5 6772 13,437 3350 2156 35.6420 0.96 112 4 7124 13,437 3350 2014 39.8821 0.96 113 3 9218 13,437 3350 1517 54.7222 0.96 114 2 11,280 13,437 3350 884 73.6123 0.96 115 1 11,358 13,437 3350 880 73.73

24 0.94 26 90 41 65,887 57,173 56,195 1.7125 0.94 30 86 72 65,887 57,173 55,978 2.0926 0.94 34 82 114 65,887 57,173 56,484 1.2127 0.94 38 78 389 65,887 57,173 56,505 1.1728 0.94 42 74 578 65,887 57,173 56,469 1.2329 0.94 46 70 719 65,887 57,173 56,523 1.1430 0.94 50 66 723 65,887 57,173 56,595 1.0131 0.94 54 62 808 65,887 57,173 56,634 0.9432 0.94 58 58 1077 65,887 57,173 56,596 1.0133 0.94 102 14 16,501 65,887 57,173 51,653 9.6534 0.94 103 13 18,226 65,887 57,173 50,474 11.7235 0.94 104 12 20,835 65,887 57,173 49,335 13.7136 0.94 105 11 20,936 65,887 57,173 49,547 13.3437 0.94 106 10 21,917 65,887 57,173 48,986 14.3238 0.94 107 9 22,652 65,887 57,173 48,236 15.6339 0.94 108 8 24,010 65,887 57,173 47,319 17.2440 0.94 109 7 24,082 65,887 57,173 47,101 17.6241 0.94 110 6 25,067 65,887 57,173 46,478 18.7142 0.94 111 5 31,230 65,887 57,173 41,997 26.54

prs

5

pci

5

udd

wTr

43 0.94 112 4 33,094

44 0.94 113 3 43,893

45 0.94 114 2 54,613

46 0.94 115 1 55,106

rocess. We note that as the number nm of rules increases, theeduction in execution time of the adaptive method given by theupport and confidence computations becomes greater.

. Discussion

In the following we aim to analyze the method proposed in thisaper by emphasizing its advantages and drawbacks, as well asomparing ARARM method with other related approaches existingn the data mining literature.

.1. Analysis of the ARARM method

Experiments were conducted in Section 4 in order to test thesefulness of ARARM method on five data sets: a gene expressionata set, two human skeletal remains data sets and two synthetic
ata sets.
There are no existing approaches in the literature that dealith the adaptive relational association rule mining problem.

hus, we have conducted our evaluations toward comparing theunning time of the adaptive method with the running time of

65,887 57,173 40,225 29.6465,887 57,173 29,575 48.2765,887 57,173 18,465 67.765,887 57,173 18,111 68.32

the non-adaptive ones. We have focused on highlighting that theARARM method provides the results faster than the DRAR methodapplied from scratch.

In order to conclude about the efficiency of the adaptive methodagainst the non-adaptive one, we depict in Figs. 5–9 (for eachexperiment we have performed on the case studies consideredfor evaluation in Section 4) the reduction (in percentage) of therunning time of ARARM with respect to the running time ofDRAR.

We notice that, for all the experiments represented in Figs. 5–9,the running time for ARARM is less than the running time of DRARapplied from scratch. Larger the value on the y-axis is, greater isthe reduction in the execution time of the ARARM method. Thus,the time needed to adaptively discover the interesting relationassociation rules is less than the time needed to obtain the rulesnon-adaptively, i.e by running from scratch the algorithm for find-
ing the rules, and this emphasizes the effectiveness of our proposal.
To conclude about the performance of the ARARM methodintroduced in this paper, additional experiments were performed.The experiments were performed on an open source data set fromthe NASA repository [33] which was used in the literature for


Table 9Results for the first synthetic data set for smin = 1 and cmin = 0.935.



No. of rules nm No. of rulesnm+s


Time adaptive(ms)


47 0.935 26 90 59 147,042 292,652 291,665 0.3448 0.935 30 86 95 147,042 292,652 292,531 0.0449 0.935 34 82 159 147,042 292,652 292,556 0.0350 0.935 38 78 511 147,042 292,652 292,021 0.2251 0.935 42 74 757 147,042 292,652 288,363 1.4752 0.935 46 70 980 147,042 292,652 290,428 0.7653 0.935 50 66 1020 147,042 292,652 288,040 1.5854 0.935 54 62 1251 147,042 292,652 289,367 1.1255 0.935 58 58 1896 147,042 292,652 285,637 2.456 0.935 102 14 38,434 147,042 292,652 259,216 11.4357 0.935 103 13 42,142 147,042 292,652 254,846 12.9258 0.935 104 12 47,967 147,042 292,652 247,048 15.5859 0.935 105 11 48,179 147,042 292,652 246,894 15.6460 0.935 106 10 50,424 147,042 292,652 245,175 16.2261 0.935 107 9 52,070 147,042 292,652 240,410 17.8562 0.935 108 8 55,007 147,042 292,652 234,799 19.7763 0.935 109 7 55,217 147,042 292,652 234,950 19.7264 0.935 110 6 57,679 147,042 292,652 230,088 21.3865 0.935 111 5 70,601 147,042 292,652 209,675 28.3566 0.935 112 4 75,478 147,042 292,652 201,583 31.1267 0.935 113 3 99,043 147,042 292,652 147,498 49.668 0.935 114 2 122,452 147,042 292,652 89,480 69.4269 0.935 115 1 123,500 147,042 292,652 86,725 70.37

Table 10Results for the second synthetic data set for smin = 1 and cmin = 0.82.

Experiment No. ofattributes (m)


nm nm+s Time fromscratch (ms)

Time adaptive(ms)


1 15 23 76 38,811 36,314 34,435 5.172 16 22 93 38,811 36,314 34,361 5.383 17 21 102 38,811 36,314 34,544 4.874 18 20 201 38,811 36,314 34,635 4.625 19 19 290 38,811 36,314 34,781 4.226 20 18 381 38,811 36,314 34,887 3.937 21 17 580 38,811 36,314 34,770 4.258 22 16 905 38,811 36,314 34,527 4.929 23 15 1314 38,811 36,314 34,266 5.64

10 24 14 1813 38,811 36,314 34,167 5.9111 25 13 2988 38,811 36,314 33,772 712 26 12 4041 38,811 36,314 32,982 9.1813 27 11 4132 38,811 36,314 32,891 9.4314 28 10 4215 38,811 36,314 32,993 9.1515 29 9 6476 38,811 36,314 31,701 12.716 30 8 7279 38,811 36,314 31,101 14.3617 31 7 10,676 38,811 36,314 29,154 19.7218 32 6 14,077 38,811 36,314 26,928 25.8519 33 5 16,711 38,811 36,314 25,071 30.96

sfE6nc

dtoorod

20 34 4 25,882

21 35 3 32,937

22 36 2 32,937

23 37 1 32,963

oftware defect prediction. The PC1 data set is built for functionsrom a flight software for earth orbiting satellite, written in C.xperiments were performed on the subset of PC1 consisting of44 instances which are non-defects. On this data set, a totalumber of 40, 947 relational association rules having a minimumonfidence of 0.9 were identified.

Experiments with different number of instances (which lead toifferent number of relational association rules) are performed onhe PC1 non-defects data set (with a minimum confidence thresh-
ld of 0.9, m = 35 and s = 2). Fig. 10 depicts how the running timef ARARM decreases with respect to the running time of DRARegarding the number of considered instances. The performancef ARARM in comparison with DRAR considering the number ofiscovered relational association rules is illustrated in Fig. 11.
38,811 36,314 16,888 53.4938,811 36,314 8740 75.9338,811 36,314 8740 75.9338,811 36,314 8702 76.04

From Figs. 10 and 11 one can conclude that the adaptive rela-tional association rule method is more effective (in running timeperformance) than the non-adaptive one.

Table 12 indicates, for each data set we have considered in ourexperimental evaluation, the reduction in the average running timeof ARARM with respect to the average running time of DRAR appliedfrom scratch.

In order to perform a statistical analysis the obtained results, foreach case study considered for evaluation, we computed a 95% Con-
fidence Interval (CI) [5] for the average of running time reductionsobtained by applying the ARARM method.
The reduction in running time obtained through the adaptivemethod is natural. It comes from the fact that the ARARM methodstarts from the set of relational association rules discovered in

530 G. Czibula et al. / Applied Soft Com

Table 11Additional results for the second synthetic data set.

Experiment nscscratch nscadaptive tscscratch tscadaptive tcgscratch tcgadaptive

1 49,744 49,458 20,204 18,370 16,110 16,0652 49,744 49,412 20,204 18,277 16,110 16,0843 49,744 49,311 20,204 18,367 16,110 16,1774 49,744 49,190 20,204 18,339 16,110 16,2965 49,744 49,052 20,204 18,313 16,110 16,4686 49,744 48,920 20,204 18,590 16,110 16,2977 49,744 48,640 20,204 18,297 16,110 16,4738 49,744 48,172 20,204 18,143 16,110 16,3849 49,744 47,615 20,204 17,969 16,110 16,297

10 49,744 46,980 20,204 17,943 16,110 16,22411 49,744 45,518 20,204 17,406 16,110 16,36612 49,744 44,109 20,204 16,757 16,110 16,22513 49,744 43,945 20,204 16,784 16,110 16,10714 49,744 43,771 20,204 16,737 16,110 16,25615 49,744 41,150 20,204 15,724 16,110 15,97716 49,744 40,205 20,204 15,312 16,110 15,78917 49,744 36,588 20,204 14,055 16,110 15,09918 49,744 32,983 20,204 12,635 16,110 14,29319 49,744 30,042 20,204 11,588 16,110 13,48320 49,744 19,741 20,204 7524 16,110 936421 49,744 9652 20,204 3667 16,110 507322 49,744 9652 20,204 3667 16,110 507323 49,744 9546 20,204 3601 16,110 5101

Fig. 4. The candidate generation process influence on the A

Fig. 5. Results for the fir

puting 36 (2015) 519–533

the data set of m-dimensional instances and adapts it (using theadaptive algorithm from Fig. 2) considering the newly added sattributes. This, obviously, is more effective, since we are start-ing from the set of relational association rules identified in thedata set of m-dimensional instances and adapt it, instead of run-ning the relational association rule mining method from scratch onthe m + s dimensional instances. It is very likely that the adaptationtime decreases with decreasing of the number of added attributes.Clearly, the reduction in running time of ARARM with respect toDRAR increases as the number n of rules from the data set beforethe feature set extension is larger. When n becomes close to 0, thenthe ARARM running time becomes closer to the running time ofDRAR.

From Table 12 we notice an average of 53% for the ARARM run-ning time reduction with respect to the running time from scratch.We note that, when the number of rules is very large (e.g 147,042in Table 10) the reduction in running time obtained through theadaptive mining algorithm (an average of 30% when adding at least14 new attributes) is significant, since it may lead to saving hours(even days) of running. Fig. 12 illustrates, for all of the case stud-
ies considered for evaluation, the time performance improvementobtained using the ARARM method (the values for all the runningtimes are scaled proportionally).
RARM running time for the second synthetic data set.

st synthetic data.


Fig. 6. Results for the gene expression data set.

Fig. 7. Results for first human skeletal data set (both “Male” and “Female” classes).

Fig. 8. Results for the second human skeletal remains data set (all classes:“CaucasianMale”, “CaucasianFemale”, “Afro-americanMale” and “Afro-americanFemale”).

ifsA

Fig. 10. ARARM performance – number of instances vs. running time for the PC1data set.

Fig. 11. ARARM performance – number of rules vs. running time for the PC1 dataset.

TA

Fig. 9. Results for the second synthetic data.

Considering the experiments from Section 4 and the compar-sons shown above, we can conclude that ARARM is more efficient,rom the running time point of view, than DRAR applied fromcratch. Still, there may be situations in which the running time ofRARM is not significantly smaller than the running time of DRAR

able 12verage running time reduction for the performed experiments. A 95% CI is used for the v

Data set Average ofnumber of rule

Averrunnfor A

Gene expression data 54

First human skeletal data 31

Second human skeletal data 73

First synthetic data 75,455 112,Second synthetic data 38,811 33,PC1 data set 25,637 4

Fig. 12. ARARM performance for all considered case studies.

applied from scratch. These situations occur when a small num-ber of interesting relational association rules occur within the dataset before the feature set extension. As shown by the experimental
results, this usually happens when the number s of initial attributesis small.
An extension of the ARARM method to a fuzzy approach [10,11]would be relevant in practical applications, where we are dealing

alues on the fifth column.

age ofing timeRARM (ms)

Average ofrunning timesfor DRAR fromscratch (ms)

ARARM runningtimereductions (%)

11.21 42.83 69.8±5.769.75 23.375 61.58±9.33

18.48 66.38 71.05±23.38907.54 137,110.59 21.75 ± 4.7161 36,314 20.38±2.46767 15,389 71.18±3.58

5 ft Com

wrafbro

tmrwwwtfop

5

iiaapith

mttlmo

mFtledaso

tosmtC

tocpsiirau


e noisy data. We are currently working on extending the conceptelational association rules toward fuzzy relational association rulesnd developing a method (similar to DRAR) for mining interestinguzzy relational association rules in data sets. Afterwards, it woulde possible to investigate the adaptive fuzzy relational associationule mining. The fuzzy extension of the ARAM method will be onef our further research.

The main goal of the proposed ARARM algorithm is to adap-ively mine relational association rules within a data set. The ARARM

ethod is complete and it reduces the time needed to discover theules from scratch, when the attribute set increases. Certainly, itould be of great interest to analyze the relational association ruleshich were discovered in data (as shown in Section 4.3.2). Furtherork will be carried out in order to investigate the usefulness of

he mined information [11,12]. We plan to use the ARARM methodor classification tasks: in bioarchaeology for detecting the genderf skeletal remains (see Section 4.3) or in software engineering forredicting defective software entities (see the PC1 data set).

.2. Comparison to related work

The adaptive relational association rule mining approachntroduced in Section 3 is innovative, since there are no exist-ng similar approaches in the data mining literature. Existingpproaches deal with non-relational associational rules and theyre adaptive with respect to other aspects, like data dependentarameters. Thus, we cannot compare our approach with the exist-

ng ones, since the perspectives are different. Still, we will present inhe following several existing data mining methods that are some-ow similar with our approach.

Sarda and Srinivas [37] present an adaptive algorithm for incre-ental mining of association rules, which is able to decide based on

he type of increment – similar or significantly different – whethero scan the original database for updating the rules obtained in ear-ier mining processes. This approach deals with incremental rule

ining, in which new instances are added to the data set, unlikeurs in which new features are added to the existing objects.

Het et al propose in [27] an adaptive fuzzy association ruleining approach for decision support. Their algorithm, called

ARM − DS, builds a decision support system for binary classifica-ion problems in biomedical applications, which besides the classabel also returns the rules fired for an unseen sample. The param-ters of the algorithm are adaptive in the sense that they areata-dependent and optimized by cross validation. Our ARARMlgorithm is essentially different from the FARM − DS algorithm,ince the only adaptive viewpoint in [27] is related to parametersptimization.

An association rule mining model with dynamic adaptivehresholds is introduced in [38]. Their algorithm, called DASApri-ri is an extension of Apriori algorithm for finding large itemets. They propose two minimum support counts: Dynamic Mini-um Support and Adaptive Minimum Support and two confidence

hresholds: Dynamic Minimum Confidence and Adaptive Minimumonfidence.

Zhang et al introduce in [45] an adaptive association rule miningechnique, which adapts the support value according to the F-scorebtained through prior classification in order to classify web videoontent based on mixed visual and textual information. The sup-ort in this case is computed as a measure for the similarity of theet of terms characterizing the videos under classification. A sim-
lar technique for adapting the support value is presented in [32],n which the minimum support threshold is established during theule generation process in order to maintain the number of gener-ted rules within a desired range. An algorithm called FRG − AARM,sing a similar approach to the aforementioned paper in order to
puting 36 (2015) 519–533

devise an efficient market basket analysis method, is introduced in[18].

The adaptive association rule mining approaches [38,45,32,18]described above focus on adapting the support value used in rulegeneration to the actual training data set, such that the generatedrules are relevant. However, the algorithm proposed in our paper isadaptive to changes in the nature of the data set, such as the appear-ance of new features over time, whereas the articles which werepresented above focus on improving the quality of the generatedrules by adapting certain parameters of the rule mining processsuch as the support value to the data set particularities.

6. Conclusions and future work

In this paper we have approached the problem of adaptive rela-tional association rule mining and we proposed a novel algorithm,called ARARM for adapting the set of relational association rulesmined in a data set, when the feature set describing the objectsincreases. The experiments are performed on six case studies forwhich the application of adaptive relational association rule min-ing is useful. The obtained results on these data sets prove that theresult is reached more efficiently using the proposed method thanrunning the mining algorithm again from scratch, on the feature-extended object set.

So far we have focused only on the discovery of the relationalassociation rules within the data sets considered in our experi-ments. Further investigations will be made in order to find out thesuitability of the relational association rule mining for the prob-lems considered in our case studies (Sections 4.2 and 4.3). Thus,we want to obtain a feedback from experts that will analyze theresults of the mining processes and will interpret the significanceof the relational association rules that were found to be interestingwithin the data sets.

Another possible direction to continue our research would be toextend the linear “chain-like” representation for relational associ-ation rules to more complex structures. The problem of relationalassociation rule mining may be viewed as a special case of findingfrequent chains in a database of labeled graphs [24]. Relational asso-ciation rules may be viewed as the frequent chains (frequent lineargraphs) in this graph database. Using this representation, one maybe able to compute the labeled graph canonical form [26] whichcan be used to avoid reporting duplicate rules (especially “mirror”rules), leading this way to a more efficient filtering. This represen-tation for relational association rules also suggests an extension toallow more general graphs, like trees or graphs including cycles.

As future work we also plan to extend the evaluation of theARARM method to different data sets and real life problems. In addi-tion, an extension of the method proposed in this paper to a fuzzyapproach [36,11] and other approaches for mining association rules[30,19] will be further considered.

Acknowledgments

The authors would like to thank the editor and the anonymousreviewers for their valuable comments and suggestions to improvethe paper and the presentation. They also gratefully acknowledgethe assistance received from the Babes -Bolyai University Interdisci-plinary Research Institute on Bio & Nano Sciences [20] for providing
the data sets used in the experimental part of the paper. Specialthanks are due to Professor Beatrice Kelemen and her researchteam. This work was supported by a grant of the Romanian NationalAuthority for Scientific Research, CNCS–UEFISCDI, project numberPN-II-RU-TE-2014-4-0082.

ft Com

R

[

[

[

[[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

[

G. Czibula et al. / Applied So

eferences

[1] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in largedatabases, in: Proceedings of the 20th International Conference on Very LargeData Bases, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1994,pp. 487–499.

[2] R. Ahmed, G. Karypis, Algorithms for mining the evolution of conservedrelational states in dynamic networks, Knowl. Inf. Syst. 33 (3) (2012)603–630.

[5] L. Brown, T. Cat, A. DasGupta, Interval estimation for a proportion, Stat. Sci. 16(2001) 101–133.

[6] T. Calders, N. Dexters, J.J. Gillis, B. Goethals, Mining frequent itemsets in astream, Inf. Syst. 39 (2014) 233–255.

[7] A. Campan, G. Serban, T.M. Truta, A. Marcus, An algorithm for the discovery ofarbitrary length ordinal association rules, in: DMIN, 2006, pp. 107–113.

[8] A. Câmpan, G. Serban, A. Marcus, Relational association rules and error detec-tion, Stud. Univ. Babes-Bolyai Inform. LI (1) (2006) 31–36.

10] G.-Y. Chan, C.-S. Lee, S.-H. Heng, Defending against XML-related attacks ine-commerce applications with predictive fuzzy associative rules, Appl. SoftComput. 24 (0) (2014) 142–157.

11] C.-H. Chen, A.-F. Li, Y.-C. Lee, A fuzzy coherent rule mining algorithm, Appl. SoftComput. 13 (7) (2013) 3422–3428.

12] C.-H. Chen, A.-F. Li, Y.-C. Lee, Actionable high-coherent-utility fuzzy itemsetmining, Soft Comput. 18 (12) (2014) 2413–2424.

13] K.J. Cios, Medical Data Mining and Knowledge Discovery, Springer, 2001.15] G. Czibula, M.-I. Bocicor, I.G. Czibula, Promoter sequences prediction using

relational association rule mining, Evol. Bioinform. 8 (2012) 181–196.16] G. Czibula, Z. Marian, I.G. Czibula, Software defect prediction using relational

association rule mining, Inf. Sci. 264 (2014) 260–278.17] G. Czibula, Z. Marian, I.G. Czibula, Detecting software design defects using rela-

tional association rule mining, Knowl. Inf. Syst. (2014) 1–33.18] M. Dhanabhakyam, M. Punithavalli, An efficient market basket analysis based

on adaptive association rule mining with faster rule generation algorithm, SIJTrans. Comput. Sci. Eng. Appl. 1 (3) (2013) 105–110.

19] Y. Du, H. Li, Strategy for mining association rules for web pages based on formalconcept analysis, Appl. Soft Comput. 10 (3) (2010) 772–783.

20] Institute of Interdisciplinary Research in Bio-Nano-Sciences, http://bionanosci.institute.ubbcluj.ro/, 2001.

21] J. DeRisi, P. Iyer, V. Brown, Exploring the metabolic and genetic control of geneexpression on a genomic scale, Science 278 (5338) (1997) 680–686.

24] J.A. Gallian, A dynamic survey of graph labeling, Electron. J. Comb. 17 (2014)1–384.

25] J. Han, Data Mining: Concepts and Techniques, Morgan Kaufmann PublishersInc., San Francisco, CA, USA, 2005.

26] S.G. Hartke, A.J. Radcliffe, Mckay’s canonical graph labeling algorithm, Con-temp. Math. 479 (2009) 99–111.

27] Y. He, Y. Tang, Y. Zhang, R. Sunderraman, Adaptive fuzzy association rule min-ing for effective decision support in biomedical applications, Int. J. Data Min.Bioinform. 1 (1) (2006) 3–18.

30] R. Kuo, C. Chao, Y. Chiu, Application of particle swarm optimization to associa-tion rule mining, Appl. Soft Comput. 11 (1) (2011) 326–336.

32] W. Lin, S.A. Alvarez, C. Ruiz, Efficient adaptive-support association rulemining for recommender systems, Data Min. Knowl. Discov. 6 (1) (2002)83–105.

33] T. Menzies, B. Caglayan, Z. He, E. Kocaguneli, J. Krall, F. Peters, B. Turhan, Thepromise repository of empirical software engineering data, June 2012 http://promisedata.googlecode.com

puting 36 (2015) 519–533 533

35] NeuNetPro, Discover the patterns in your data, 2010 http://www.cormactech.com/neunet

36] S. Ribaric, T. Hrkac, A model of fuzzy spatio-temporal knowledge representationand reasoning based on high-level Petri nets, Inf. Syst. 37 (3) (2012) 238–256.

37] N.L. Sarda, N.V. Srinivas, An adaptive algorithm for incremental mining of asso-ciation rules, in: Proceedings of the 9th International Workshop on Databaseand Expert Systems Applications, DEXA ‘98, IEEE Computer Society, Washing-ton, DC, USA, 1998, pp. 240–245.

38] C.S.K. Selvi, A. Tamilarasi, Association rule mining with dynamic adaptivesupport thresholds for associative classification, in: Proceedings of the Interna-tional Conference on Computational Intelligence and Multimedia Applications(ICCIMA 2007) – vol. 02, ICCIMA’07, IEEE Computer Society, Washington, DC,USA, 2007, pp. 76–80.

39] B. Soua, A. Borgi, M. Tagina, An ensemble method for fuzzy rule-based classifi-cation systems, Knowl. Inf. Syst. 36 (2013) 385–410.

40] G. Serban, A. Câmpan, I.G. Czibula, A programming interface for finding rela-tional association rules, Int. J. Comput. Commun. Control I (S.) (2006) 439–444.

41] G. Serban, I.G. Czibula, A.C. ampan, Medical diagnosis prediction using rela-tional association rules, in: Proceedings of the International Conference onTheory and Applications of Mathematics and Informatics (ICTAMI’07), 2008,pp. 339–352.

42] P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, First edition,Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2005.

43] M. Trotter, G. Gleser, Estimation of Stature from Long Bones of American Whitesand Negroes, Press of the Wistar Institute of Anatomy and Biology, 1952.

44] R. Vimieiro, P. Moscato, A new method for mining disjunctive emerging pat-terns in high-dimensional datasets using hypergraphs, Inf. Syst. 40 (2014) 1–10.

45] C. Zhang, X. Wu, M.-L. Shyu, Q. Peng, Adaptive association rule mining for webvideo event classification, in: IRI, IEEE, 2013, pp. 618–625.

46] K. Zhang, D. Lo, E.-P. Lim, P.K. Prasetyo, Mining indirect antagonistic commu-nities from social interactions, Knowl. Inf. Syst. 5 (3) (2013) 553–583.

47] W.H. Wolberg, O.L. Mangasarian, Multisurface method of pattern separationfor medical diagnosis applied to breast cytology, Proc. Natl. Acad. Sci. U. S. A.87 (1990) 9193–9196.

Gabriela Czibula works as a professor in the Computer Science Department, Fac-ulty of Mathematics and Computer Science from the Babes -Bolyai University, city ofCluj-Napoca, Romania. She received her PhD degree in Computer Science in 2003,with the “cum laude” distinction. She published more than 140 papers in presti-gious journals and conferences proceedings. Her research interests include Appliedartificial intelligence, Machine learning and Multiagent systems.

István Gergely Czibula works as an associate professor at the Computer ScienceDepartment, Faculty of Mathematics and Computer Science from the Babes -BolyaiUniversity, city of Cluj-Napoca, Romania. He received his PhD degree in ComputerScience in 2009. He published more than 70 papers in various journals and confer-ences proceedings. His main research interests are Software engineering and Objectoriented design.

Adela-Maria Sîrbu is currently a PhD student in the Computer Science Department,Faculty of Mathematics and Computer Science from the Babes -Bolyai University,city of Cluj-Napoca, Romania. Her main research interests are Machine learning and
Software engineering.
Ioan-Gabriel Mircea is currently a PhD student in the Computer Science Depart-ment, Faculty of Mathematics and Computer Science from the Babes -BolyaiUniversity, city of Cluj-Napoca, Romania. His main research interests are Data min-ing, Multiagent systems and Software engineering.

applied soft computingstatic.tongtianta.site/paper_pdf/953a421e-5cec-11e9-affb... · 2019. 4....

Documents