[ieee 2005 portuguese conference on artificial intelligence - covilha, portugal...

Post-processing of Association Rules using Taxonomies

Marcos Aurelio Domingues
LIACC-NIAAD - Universidade do Porto
Rua de Ceuta, 118, Andar 6
4050-190 Porto, Portugal
marcos@liacc.up.pt
http://www.liacc.up.pt

Solange Oliveira Rezende
ICMC - Universidade de Sao Paulo
Av. Trabalhador Sao-Carlense, 400, C.P. 668
13560-970 Sao Carlos, SP, Brazil
solange@icmc.usp.br
http://www.icmc.usp.br

Abstract- The Data Mining process enables end users to analyse, understand and use the extracted knowledge in an intelligent system or to support decision-making processes. However, many algorithms used in the process produce large quantities of patterns, which complicates their analysis. This occurs with Association Rules, a Data Mining technique that tries to identify intrinsic patterns in large data sets. One method that can help the analysis of Association Rules is the use of taxonomies in the knowledge post-processing step. In this paper, the GART algorithm is proposed, which uses taxonomies to generalize Association Rules, together with the RulEE-GAR computational module, which enables the analysis of the generalized rules.

Index Terms- Association Rules, Data Mining, Post-processing, Taxonomies.

I. INTRODUCTION

The development of data storage technologies has increased the data storage capacity of companies. Nowadays companies have the technology to store detailed information about each transaction carried out, generating large databases. This stored information may help companies to improve themselves, and because of this they have sponsored research and the development of tools to analyse the databases and generate useful information.

For years, manual methods were used to convert data into knowledge. However, the use of these methods has become expensive, time consuming, subjective and non-viable when applied to large databases.

The problems with manual methods stimulated the development of automatic analysis processes, such as Knowledge Discovery in Databases or Data Mining. This process is defined as a process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [1]. It has also been used in several application domains [2].

Although the use of the Data Mining process to identify patterns in large databases has become necessary, such a process may generate large quantities of patterns. Part of these patterns may not be useful for the user, who usually prefers a small number of interesting patterns. Therefore, the development of techniques that provide only the more interesting patterns is important [3].

In the Data Mining process, the use of the Association Rules technique may generate large quantities of patterns. This technique has caught the attention of companies and research centers [4]. Several studies have been developed with this technique and the results are used by companies to improve their businesses (insurance policy, health policy, geo-processing, molecular biology) [5], [6], [7].

One way to address the problem of the large quantities of patterns extracted by the Association Rules technique is the use of taxonomies in the knowledge post-processing step [8], [5], [9]. The taxonomies may be used to prune uninteresting and/or redundant rules (patterns) [8].

In this paper the GART algorithm and the RulEE-GAR computational module are proposed. The GART algorithm (Generalization of Association Rules using Taxonomies) uses taxonomies to generalize Association Rules. The RulEE-GAR computational module uses the GART algorithm to generalize Association Rules, and provides several means to analyze the generalized rules.

This paper is organized as follows: first we present some selected works; second we present the Association Rules technique and some general features of the use of taxonomies; then we describe the GART algorithm and the RulEE-GAR computational module. Finally, the results of some experiments carried out with the GART algorithm are presented, along with our conclusion.

II. RELATED WORK

In this section some selected works about generalization and post-processing of Association Rules are presented.

Srikant and Agrawal [9] introduce the problem of generalized Association Rules using taxonomies. They also present a new interest measure for rules which uses information from the taxonomy.

Baesens et al. [4] situate and motivate the need for a post-processing step after the Association Rules mining algorithms. They give a tentative overview of some of the main post-processing tasks, considering the efforts that have already been reported in the literature.

0-7803-9365-1/05/$17.00 ©2005 IEEE

Liu et al. [5] propose a new approach to assist the user in finding interesting rules (in particular, unexpected rules) from a set of discovered Association Rules. The technique is characterized by analysing the discovered Association Rules using the user's existing knowledge about the domain and then ranking the discovered rules according to various criteria of interestingness, e.g., conformity and various types of unexpectedness.

Jorge et al. [10] propose a post-processing methodology and a tool for browsing/visualizing large sets of Association Rules. The methodology is based on a set of operators that transform sets of rules into sets of rules, allowing the user to focus on interesting regions of the rule space.

III. ASSOCIATION RULES AND TAXONOMIES

The problem of mining Association Rules was introduced in [11]. Given a set of transactions, where each transaction is a set of literals, called items, an Association Rule is an expression $LHS \Rightarrow RHS$. The LHS and RHS are, respectively, the Left Hand Side and the Right Hand Side of the rule, defined by disjoint sets of items. The intuitive meaning of such a rule is that transactions in the database which contain the items in LHS tend to also contain the items in RHS. Thus, Association Rules are used to find tendencies that allow the user to understand and explore the patterns of behavior of the data. An example of such a rule might be that 80% of the customers who purchase the Q product also buy the W product. Here 80% is called the confidence of the rule. The following is a formal statement of the Association Rules [11]:

Let $A = \{a_1, \ldots, a_m\}$ be a set of literals, called items. Let $T = \{T_1, \ldots, T_n\}$ be a set of transactions, where each transaction $T_i \in T$ is a set of items such that $T_i \subseteq A$. We say that a transaction $T_i$ contains $X$, a set of some items in $A$, if $X \subseteq T_i$.

An Association Rule is an implication of the form $LHS \Rightarrow RHS$, where $LHS \subset A$, $RHS \subset A$ and $LHS \cap RHS = \emptyset$. The rule $LHS \Rightarrow RHS$ holds in the transactions set $T$ with confidence $conf$ if $conf\%$ of the transactions in $T$ that contain $LHS$ also contain $RHS$. The rule $LHS \Rightarrow RHS$ has support $sup$ in the transactions set $T$ if $sup\%$ of the transactions in $T$ contain $LHS \cup RHS$.

Given a set of transactions $T$, the problem of mining Association Rules is to generate all Association Rules that have support and confidence greater than the user-specified minimum support and minimum confidence, respectively. The support value represents the proportion of transactions in $T$ that contain all the elements in $LHS \cup RHS$. The confidence value represents the proportion of the number of transactions that contain both LHS and RHS in relation to the number of transactions that contain LHS. Usually, minimum support and minimum confidence values are defined by the user to mine Association Rules. High values of minimum support and minimum confidence tend to generate only trivial rules. Low values of minimum support and minimum confidence generate large quantities of rules (patterns), complicating the user's analysis.
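The two measures defined above can be illustrated with a minimal sketch. The function names and the toy transactions below are assumptions made for illustration only; they are not taken from the database used in this paper.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(lhs: Iterable[str], rhs: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Fraction of the transactions containing LHS that also contain RHS."""
    lhs_items, rhs_items = frozenset(lhs), frozenset(rhs)
    covered = [t for t in transactions if lhs_items <= t]
    if not covered:
        return 0.0
    return sum(1 for t in covered if rhs_items <= t) / len(covered)


# Hypothetical toy transactions (illustrative only).
transactions = [
    frozenset({"t-shirt", "sandal", "cap"}),
    frozenset({"t-shirt", "sandal"}),
    frozenset({"short", "slipper", "cap"}),
    frozenset({"jeans"}),
]

# Rule: t-shirt & sandal => cap
sup = support({"t-shirt", "sandal", "cap"}, transactions)         # 1 of 4 transactions
conf = confidence({"t-shirt", "sandal"}, {"cap"}, transactions)   # 1 of 2 covered transactions
```

A rule would be kept by the mining step only if both values exceed the user-specified minimum support and minimum confidence.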

One way of overcoming the difficulties in the analysis of large quantities of Association Rules is the use of taxonomies in the knowledge post-processing step. The use of taxonomies may help the user to identify interesting and useful knowledge in the extracted rule set. The taxonomies represent a collective or individual characterization of how the items can be classified hierarchically [8]. Fig. 1 presents an example of a taxonomy where it can be observed that: t-shirts are light clothes, shorts are light clothes, light clothes are a kind of sport clothes, and sandals are a kind of shoes.

Fig. 1. An example of taxonomy for clothes.

However, for some taxonomies it is difficult to assign a name to the classes that characterize the items hierarchically. An example of a taxonomy without classification is shown in Fig. 2, where there are no specific names for the classes products1 and products2.

Fig. 2. An example of taxonomy without classification for clothes.
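One simple way to hold such taxonomies in memory is a mapping from each item to its parent class. The sketch below follows the relations described for Fig. 1; the representation, the function name `ancestors` and the member items of products1 are assumptions made for illustration, not part of the proposed method.

```python
from typing import Dict, List

# Child -> parent links following the relations described for Fig. 1.
taxonomy: Dict[str, str] = {
    "t-shirt": "light clothes",
    "short": "light clothes",
    "light clothes": "sport clothes",
    "sandal": "shoes",
}

# A taxonomy without classification: the class only has a placeholder name.
# The member items "belt" and "wallet" are hypothetical.
taxonomy_without_classification: Dict[str, str] = {
    "belt": "products1",
    "wallet": "products1",
}


def ancestors(item: str, tax: Dict[str, str]) -> List[str]:
    """Walk the child -> parent links from `item` up to the top of the taxonomy."""
    chain: List[str] = []
    seen = {item}
    while item in tax:
        item = tax[item]
        if item in seen:  # guard against accidental cycles in the mapping
            break
        seen.add(item)
        chain.append(item)
    return chain
```

With this representation, `ancestors("t-shirt", taxonomy)` walks from t-shirt to light clothes and then to sport clothes, while an item that is not in the taxonomy has no ancestors.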

In the next section the proposed algorithm is described. The algorithm uses taxonomies (made by the users) to generalize Association Rules in the knowledge post-processing step [12]. The algorithm uses both types of taxonomies, with and without classification.

IV. THE ALGORITHM GART

We analysed the structure of the Association Rules generated by algorithms that do not use taxonomies. The results of the analysis show that it is possible to generalize Association Rules using taxonomies. In Fig. 3 we show how the Association Rules can be generalized.

First, we replaced the items t-shirt and short in the rules

short & slipper => cap,
sandal & short => cap,
sandal & t-shirt => cap and
slipper & t-shirt => cap

with the item light clothes (which represents a generalization). This change generated two rules light clothes & slipper => cap and two rules light clothes & sandal => cap. Next, we pruned the repeated generalized rules, keeping only the two rules:

light clothes & slipper => cap and
light clothes & sandal => cap.

The two rules generated by Step 1 (Fig. 3) were generalized again. We replaced the items slipper and sandal with the item light shoes (which represents another generalization), generating two rules light clothes & light shoes => cap. Then we pruned the repeated generalized rules again, keeping only one generalized Association Rule: light clothes & light shoes => cap.

[Figure 3 diagram: Taxonomy; source rules; Step 1; Generalized Rules; Step 2; light clothes & light shoes => cap]

Fig. 3. Generalization of Association Rules using two taxonomies.
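The two steps of Fig. 3 can be reproduced with a short sketch. It assumes that every left-hand-side item with a parent in the taxonomy is replaced unconditionally, which is a simplification made for illustration; the helper names `rule` and `generalize_lhs` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str]]  # (LHS, RHS)


def generalize_lhs(rules: List[Rule], taxonomy: Dict[str, str]) -> Set[Rule]:
    """Replace LHS items by their parent in `taxonomy`; the set drops repeated rules."""
    return {
        (frozenset(taxonomy.get(item, item) for item in lhs), rhs)
        for lhs, rhs in rules
    }


def rule(lhs: str, rhs: str) -> Rule:
    """Parse 'a & b' and 'c' into a (LHS, RHS) pair of item sets."""
    return (frozenset(i.strip() for i in lhs.split("&")), frozenset({rhs}))


source_rules = [
    rule("short & slipper", "cap"),
    rule("sandal & short", "cap"),
    rule("sandal & t-shirt", "cap"),
    rule("slipper & t-shirt", "cap"),
]
clothes = {"t-shirt": "light clothes", "short": "light clothes"}
shoes = {"slipper": "light shoes", "sandal": "light shoes"}

step1 = generalize_lhs(source_rules, clothes)  # two distinct rules remain
step2 = generalize_lhs(list(step1), shoes)     # one distinct rule remains
```

Storing each side of a rule as a set makes the pruning of repeated generalized rules implicit, since identical rules collapse into one element.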

Given the possibility of generalizing Association Rules (Fig. 3), we propose an algorithm to generalize Association Rules. The proposed algorithm is illustrated in Fig. 4.

The proposed algorithm generalizes only one side of the Association Rules (LHS or RHS). First, we group the rules (represented in the standard syntax defined by [13]) in subsets that present equal antecedents or consequents. If the algorithm is used to generalize the left hand side of the rules (LHS), the subsets are generated using equal consequents (RHS). If the algorithm is used to generalize the right hand side of the rules (RHS), the subsets are generated using equal antecedents (LHS). Next, we use the taxonomies to generalize each subset. In the final part of the algorithm, we order the items of the generalized rules lexicographically and then store the rules in a set of generalized Association Rules.

In the final part of the algorithm, we also calculate the Contingency Table for each generalized Association Rule, because the standard syntax defined by [13] includes the rule followed by its Contingency Table. The Contingency Table of a rule represents the coverage of the rule with respect to the database used in its mining [14]. The algorithm finishes with the calculation of the Contingency Table.

We called the proposed algorithm GART (Generalization of Association Rules using Taxonomies). A formal statement of the GART algorithm is shown in Algorithm 1.

Algorithm 1 GART

Require: A set of Association Rules $R$, a set of taxonomies $T$, a database $D$ and the side of the rules that will be generalized: left hand side (antecedent of the rules) or right hand side (consequent of the rules).

    $R_g := \emptyset$; // The variable $R_g$ will store the generalized rules set
    $\mathcal{E} :=$ group-association-rules($R$, side); // Rules are grouped by the side that will not be generalized
    for all subset $E \in \mathcal{E}$ do
        generalize-rules($E$, $T$, side);
        order-lexicographically($E$, side);
        $R_g := R_g \cup E$;
    end for
    for all rule $r \in R_g$ do
        if $r$ is a generalized rule then
            calculate-contingency-table($r$, $T$, $D$);
        end if
    end for
    return $R_g$; // Return the generalized rules set
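A rough Python sketch of Algorithm 1 is given below. It is not the RulEE-GAR implementation: it assumes a single-level child-to-parent taxonomy, replaces every item that has a parent, and counts a generalized item as present in a transaction when any of its specific items is present. All function names, the layout of the contingency table and the toy database are assumptions. Because each rule side is stored as a set, the lexicographic ordering step is only needed when printing a rule (`format_rule`).

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str]]  # (LHS, RHS)
Table = Dict[str, int]


def group_rules(rules: List[Rule], side: str) -> List[List[Rule]]:
    """Group rules that share the side that will NOT be generalized."""
    groups = defaultdict(list)
    for lhs, rhs in rules:
        groups[rhs if side == "lhs" else lhs].append((lhs, rhs))
    return list(groups.values())


def generalize_group(group: List[Rule], taxonomy: Dict[str, str], side: str) -> Set[Rule]:
    """Replace items on `side` by their parent; the set drops repeated rules."""
    out: Set[Rule] = set()
    for lhs, rhs in group:
        if side == "lhs":
            lhs = frozenset(taxonomy.get(i, i) for i in lhs)
        else:
            rhs = frozenset(taxonomy.get(i, i) for i in rhs)
        out.add((lhs, rhs))
    return out


def covers(items: FrozenSet[str], transaction: Set[str], taxonomy: Dict[str, str]) -> bool:
    """True if every item, specific or generalized, is matched by the transaction."""
    lifted = set(transaction) | {taxonomy[i] for i in transaction if i in taxonomy}
    return items <= lifted


def contingency_table(rule: Rule, taxonomy: Dict[str, str], database: List[Set[str]]) -> Table:
    """Count transactions by whether they match the LHS and/or the RHS of the rule."""
    lhs, rhs = rule
    table = {"lhs_rhs": 0, "lhs_not_rhs": 0, "not_lhs_rhs": 0, "not_lhs_not_rhs": 0}
    for t in database:
        has_lhs, has_rhs = covers(lhs, t, taxonomy), covers(rhs, t, taxonomy)
        key = ("lhs" if has_lhs else "not_lhs") + "_" + ("rhs" if has_rhs else "not_rhs")
        table[key] += 1
    return table


def gart(rules: List[Rule], taxonomy: Dict[str, str], database: List[Set[str]],
         side: str = "lhs") -> Dict[Rule, Optional[Table]]:
    """Generalize one side of the rules; attach a table to each generalized rule."""
    source = set(rules)
    generalized: Set[Rule] = set()
    for group in group_rules(rules, side):
        generalized |= generalize_group(group, taxonomy, side)
    # Rules left unchanged by the taxonomy are not generalized rules: no new table.
    return {r: None if r in source else contingency_table(r, taxonomy, database)
            for r in generalized}


def format_rule(rule: Rule) -> str:
    """Print a rule with the items of each side in lexicographic order."""
    lhs, rhs = rule
    return " & ".join(sorted(lhs)) + " => " + " & ".join(sorted(rhs))


# The four source rules of Fig. 3 and a hypothetical toy database.
source_rules = [
    (frozenset({"short", "slipper"}), frozenset({"cap"})),
    (frozenset({"sandal", "short"}), frozenset({"cap"})),
    (frozenset({"sandal", "t-shirt"}), frozenset({"cap"})),
    (frozenset({"slipper", "t-shirt"}), frozenset({"cap"})),
]
taxonomy = {"t-shirt": "light clothes", "short": "light clothes",
            "slipper": "light shoes", "sandal": "light shoes"}
database = [{"short", "slipper", "cap"}, {"t-shirt", "sandal", "cap"},
            {"sandal", "short"}, {"cap"}]

result = gart(source_rules, taxonomy, database, side="lhs")
generalized_rule = (frozenset({"light clothes", "light shoes"}), frozenset({"cap"}))
```

In this sketch both taxonomies of Fig. 3 are applied in a single pass, so the four source rules collapse directly into one generalized rule.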

V. THE COMPUTATIONAL MODULE RulEE-GAR

In this section we present the RulEE-GAR computational module, which provides means to generalize Association Rules and also to analyze the generalized rules [12]. The generalization of the Association Rules is performed by the GART algorithm, described in the previous section. Next we describe the means to analyze the generalized Association Rules. In Fig. 5 we show the screen of the interface that enables the user to analyze and explore the generalized rule sets.

On the screen of the analysis interface for generalized rules (Fig. 5) there are fields where the user enters data to make a query and select a set of generalized rules to be analyzed, optionally accompanied by several evaluation measures [14]. With the query the user chooses the consequents and/or antecedents that must be in the selected rules and also the evaluation measures that must appear together with the rules. Besides allowing the user to select a set of rules, the interface provides four links in the Downloads section to view and/or download the files. The files contain, respectively, the set of transactional data (Data Set), the set of source rules (Rule Set), the set of generalized rules (Generalized Rule Set) and the set of taxonomies used to generalize the rules (Taxonomy Set).

Fig. 4. The proposed algorithm to generalize Association Rules.

Fig. 5. Screen of the analysis interface of generalized Association Rules.

Besides the links for visualization and/or download of the files, each generalized Association Rule presents other links that enable the user to explore information about the generalization of the rule. The links are positioned at the left side of the rules (Fig. 5) and are described as follows:

* Expanded Rule: It is represented in the interface by the letter "E". This link enables the user to see the generalized rule in expanded form. The generalized items of a rule are replaced by the respective specific items.

* Source Rules: It is represented in the interface by the letter "S". This link enables the user to see the source rules that were generalized.

* Measures: It is represented in the interface by the letter "M". This link is available only if the user selects the support (Sup) and/or confidence (Cov) measures in the query and these measures present values lower than the minimum support and/or minimum confidence values defined for the mining process of the non-generalized rule set. With this link it is possible to see which generalized rules have support and/or confidence values lower than the minimum support and/or minimum confidence values.

In Fig. 5 we also see that the generalized items in a rule (items between parentheses) are presented as links. These links enable the user to see the source items that were generalized. In the analysis interface, the user can also store the information selected by the query in a text file.
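The idea behind the Expanded Rule link and the generalized-item links can be sketched as a lookup from a generalized item back to its specific items. This is only an illustration of the concept and not the RulEE-GAR code; the function names, the `|` separator used in the output and the members of products1 are assumptions.

```python
from typing import Dict, Iterable, List


def expand_item(item: str, taxonomy: Dict[str, str]) -> List[str]:
    """Return the specific items generalized into `item`, or the item itself."""
    children = sorted(child for child, parent in taxonomy.items() if parent == item)
    return children or [item]


def expand_side(items: Iterable[str], taxonomy: Dict[str, str]) -> str:
    """Show each generalized item as a parenthesized list of its specific items."""
    parts = []
    for item in sorted(items):
        specific = expand_item(item, taxonomy)
        if specific == [item]:
            parts.append(item)
        else:
            parts.append("(" + " | ".join(specific) + ")")
    return " & ".join(parts)


def expand_rule(lhs: Iterable[str], rhs: Iterable[str], taxonomy: Dict[str, str]) -> str:
    """Render a generalized rule in expanded form."""
    return expand_side(lhs, taxonomy) + " => " + expand_side(rhs, taxonomy)


# Hypothetical members of an unnamed class.
taxonomy = {"belt": "products1", "wallet": "products1"}
expanded = expand_rule({"jeans", "products1"}, {"t-shirts"}, taxonomy)
```

For a class without a meaningful name, the expanded form lets the reader see which specific items stand behind the generalized item.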

VI. EXPERIMENTS

We carried out some experiments using the GART algorithm to demonstrate that the use of taxonomies to generalize large rule sets reduces large quantities of Association Rules and makes the analysis of the rules easier.


The experiments were carried out using a sales database of a Brazilian supermarket. The database contained sales data for the most recent 3 months. We made 4 partitions of the database to carry out the experiments. The partitions were made using the sales data over 1 day, 7 days, 14 days and 1 month.

To generate the Association Rules, we used the implementation of the Apriori algorithm by Christian Borgelt¹, with minimum support value equal to 0.5, minimum confidence value equal to 0.5 and a maximum number of 5 items per rule. The generated rule sets are described as follows:

* RuleSet 1day - 32668 rules generated using the partition of 1 day;

* RuleSet 7days - 19166 rules generated using the partition of 7 days;

* RuleSet 14days - 16053 rules generated using the partition of 14 days;

* RuleSet 1month - 21505 rules generated using the partition of 1 month;

* RuleSet 3months - 19936 rules generated using the whole database (3 months of sales data).

To carry out the experiments, we manually made 13 sets of taxonomies through the analysis of the database and of the 5 sets of generated Association Rules. Then we ran the GART algorithm combining each set of taxonomies with each set of rules. Fig. 6 presents a chart that shows the reduction rates of the 5 rule sets after running the GART algorithm using the 13 sets of taxonomies to generalize each rule set. In Fig. 6 the sets of taxonomies are called "Tax" followed by an identification number, for example: Tax01.

As can be observed in Fig. 6, the experiments show reduction rates of the sets of Association Rules varying from 3.98% to 18.65%.
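The text does not spell out the formula for the reduction rate. A natural reading, assumed here, is the percentage of rules eliminated when the source rule set is replaced by the generalized rule set. The function name and the rule counts in the usage line are hypothetical and are not results from the experiments.

```python
def reduction_rate(n_source_rules: int, n_generalized_rules: int) -> float:
    """Percentage of rules eliminated by generalization (assumed definition)."""
    if n_source_rules <= 0:
        raise ValueError("the source rule set must contain at least one rule")
    return 100.0 * (n_source_rules - n_generalized_rules) / n_source_rules


# Hypothetical counts: 1000 source rules reduced to 850 rules after generalization.
rate = reduction_rate(1000, 850)
```

Under this definition, a rule set that is left unchanged by the taxonomies has a reduction rate of zero.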

[Chart: reduction rate (%) for each set of taxonomies; series: RuleSet 1day, RuleSet 7days, RuleSet 14days, RuleSet 1month, RuleSet 3months]

Fig. 6. Reduction rates obtained using taxonomies to generalize Association Rules.

¹ Available for download at http://fuzzy.cs.uni-magdeburg.de/borgelt/software.html.

During the experiments, some taxonomies without classification were identified in the sets of Association Rules. The use of the GART algorithm with a taxonomy without classification may generate, for example, generalized rules like: jeans & products1 => t-shirts. This generalized Association Rule is not meaningful for the user, because it is difficult to identify the products in products1. However, the rule can be meaningful if it is analyzed using the generalized rules analysis interface of the RulEE-GAR computational module (this interface is described in Section V). We can use this interface to see the source items that were generalized in the products1 item and to understand the meaning of the rule.

Considering this fact, we carried out some experiments using 3 sets of taxonomies without classification. In Fig. 7 we present a chart that shows the reduction rates of the 5 sets of rules when we ran the GART algorithm with each set of taxonomies without classification. In Fig. 7 we call each set of taxonomies without classification "Twc" followed by an identification number, for example: Twc01.

In the experiments carried out using the sets of taxonomies without classification, we obtained reduction rates of the Association Rule sets varying from 14.12% to 42.97%.

[Chart: reduction rate (%) for each set of taxonomies without classification; series: RuleSet 1day, RuleSet 7days, RuleSet 14days, RuleSet 1month, RuleSet 3months]

Fig. 7. Reduction rates obtained using taxonomies without classification to generalize Association Rules.

In the experiments carried out, the sets of taxonomies with and without classification were made by the user. Thus, other sets of taxonomies may generate reduction rates higher than the rates presented in our experiments, especially if the sets of taxonomies are made by an expert in the application domain.

We also carried out experiments combining some of the 13 sets of taxonomies that present classification with the 3 sets of taxonomies without classification. For these experiments, we combined the 3 sets of taxonomies without classification with 6 sets of taxonomies with classification. Of the sets of taxonomies with classification, we chose 2 sets with the lowest quantity of taxonomies, 2 sets with a median quantity of taxonomies and 2 sets with the highest quantity of taxonomies. We obtained a total of 18 combinations of taxonomy sets (3 x 6). In Fig. 8 we present a chart that shows the reduction rates of the 5 sets of rules when we ran the GART algorithm with each combination of the sets of taxonomies with and without classification. We call each combination "C" followed by an identification number, for example C01. The reduction rates presented in Fig. 8 vary from 14.61% to 50.11%.

[Chart: reduction rate (%) for each combination of taxonomy sets; series: RuleSet 1day, RuleSet 7days, RuleSet 14days, RuleSet 1month, RuleSet 3months]

Fig. 8. Reduction rates obtained combining taxonomies with and without classification to generalize Association Rules.

VII. CONCLUSION

A problem found in the Data Mining process is that several of the algorithms used generate large quantities of patterns, complicating the analysis of the patterns. This problem occurs with Association Rules, a Data Mining technique that tries to identify all the patterns in a database.

The use of taxonomies in the knowledge post-processing step, to generalize rules and to prune uninteresting and/or redundant rules, may help the user to analyze the generated Association Rules.

In this paper we proposed the GART algorithm, which uses taxonomies to generalize Association Rules. We also proposed the RulEE-GAR computational module, which uses the GART algorithm to generalize Association Rules and provides several means to analyse the generalized Association Rules. Then we presented the results of some experiments carried out to demonstrate that the GART algorithm, using sets of taxonomies with and without classification and also combining both, may reduce the volume of the sets of Association Rules. As the sets of taxonomies were made by the user, other sets of taxonomies may generate reduction rates higher than the rates presented in our experiments, especially if the sets are made by experts in the application domain.

ACKNOWLEDGMENT

This work was supported by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) and by the Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), Brazil.

REFERENCES

[1] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "The KDD process for extracting useful knowledge from volumes of data," Communications of the ACM, vol. 39, no. 11, pp. 27-34, 1996.

[2] S. O. Rezende, J. B. Pugliesi, E. A. Melanda, and M. F. Paula, "Mineração de dados," in Sistemas Inteligentes: Fundamentos e Aplicações, S. O. Rezende, Ed., vol. 1. Barueri, SP: Editora Manole, 2003, pp. 307-335.

[3] A. Silberschatz and A. Tuzhilin, "On subjective measures of interestingness in knowledge discovery," in Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), U. M. Fayyad and R. Uthurusamy, Eds., 1995, pp. 275-281.

[4] B. Baesens, S. Viaene, and J. Vanthienen, "Post-processing of association rules," in Proceedings of the Special Workshop on Post-Processing. The Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000, pp. 2-8. [Online]. Available: www.cas.mcmaster.ca/~bruha/kdd2000/kddrep.html

[5] B. Liu, W. Hsu, S. Chen, and Y. Ma, "Analyzing the subjective interestingness of association rules," IEEE Intelligent Systems & their Applications, vol. 15, no. 5, pp. 47-55, 2000.

[6] E. Clementini, P. D. Felice, and K. Koperski, "Mining multiple-level spatial association rules for objects with a broad boundary," Data & Knowledge Engineering, vol. 34, no. 3, pp. 251-270, 2000. [Online]. Available: www.elsevier.com/locate/datak

[7] T. Semenova, M. Hegland, W. Graco, and G. Williams, "Effectiveness of mining association rules for identifying trends in large health databases," in Workshop on Integrating Data Mining and Knowledge Management. ICDM'01: The 2001 IEEE International Conference on Data Mining, 2001, available at http://cui.unige.ch/~hilario/icdm-01/DM-KM-Final/Semenova.pdf. Accessed 11/01/2005.

[8] J. M. Adamo, Data Mining for Association Rules and Sequential Patterns. New York, NY: Springer-Verlag, 2001.

[9] R. Srikant and R. Agrawal, "Mining generalized association rules," Future Generation Computer Systems, vol. 13, no. 2-3, pp. 161-180, 1997. [Online]. Available: citeseer.nj.nec.com/srikant95mining.html

[10] A. Jorge, J. Poças, and P. Azevedo, "A post processing environment for browsing large sets of association rules," in ECML/PKDD'02 Workshop on Integrating Aspects of Data Mining, Decision Support and Meta-Learning, M. Bohanec, B. Kavsek, N. Lavrac, and D. Mladenic, Eds., Helsinki, 2002, pp. 53-64.

[11] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings of Twentieth International Conference on Very Large Data Bases, VLDB, J. B. Bocca, M. Jarke, and C. Zaniolo, Eds., 1994, pp. 487-499. [Online]. Available: citeseer.nj.nec.com/agrawal94fast.html

[12] M. A. Domingues, "Generalização de regras de associação," 2004, Master's thesis, Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos, SP, Brazil.

[13] E. A. Melanda, "Pós-processamento de regras de associação," 2004, PhD thesis, Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos, SP, Brazil.

[14] N. Lavrac, P. Flach, and R. Zupan, "Rule evaluation measures: A unifying view," in Proceedings of the Ninth International Workshop on Inductive Logic Programming (ILP-99), S. Dzeroski and P. Flach, Eds., vol. 1634. Springer-Verlag, 1999, pp. 174-185, LNAI.
