
An itemset-driven cluster-oriented approach to extract compact and meaningful sets of association rules

Claudio H. Yamamoto, Maria Cristina F. de Oliveira, Magaly L. Fujimoto, Solange O. Rezende

Universidade de São Paulo, Instituto de Ciências Matemáticas e de Computação

Av. Trabalhador São-carlense, 400, São Carlos, SP, Brazil
{haruo, cristina, mlika, solange}@icmc.usp.br

Abstract

Extracting association rules from large datasets typically results in a huge number of rules. One approach to this problem is to filter the resulting rule set, which reduces its size at the cost of also eliminating potentially interesting rules. When exploring a new dataset in search of relevant associations, it may be more useful for miners to have an overview of the space of rules obtainable from the dataset than to get an arbitrary set satisfying high values for given interest measures. We describe a rule extraction approach that favors rule diversity, allowing miners to gain an overview of the rule space while reducing semantic redundancy within the rule set. The approach couples itemset-driven rule generation with a cluster-based filtering process. The resulting rule set provides a starting point for user-driven exploration.

1. Introduction

The conventional approach to an association rule mining task is to submit the data to an automatic rule extraction algorithm, e.g., Apriori [2], after an ad hoc tuning of support and confidence thresholds. It is well known that, when running this process, users usually get far more rules than they can comfortably handle. Obtaining a manageable set therefore usually involves eliminating rules, which can be done by filtering based on suitable interestingness measures [13].

We consider an exploration scenario in which a user has little knowledge about the data and wants a glimpse of what kind of knowledge it holds, rather than extracting all possible rules given strict support, confidence, or other measure thresholds. We hypothesize that, in this situation, the user wants to obtain a broad domain view [12] and inspect a wide variety of associations between items. In this scenario, identifying associations originating from diverse itemsets is more desirable. Nonetheless, it is still important to keep the number of rules manageable to encourage further user-driven analysis and exploration.

Conventional rule extraction followed by filtering may output many rules derived from a few itemsets, while other itemsets are not represented at all in the final rule set. Although each rule has its own meaning, concentrating on rules originating from the same itemset adds little semantic information in exploratory contexts. We propose tackling this problem with an approach that extracts rules representative of a greater variety of associations, that is, rules involving a higher diversity of items, while keeping the total number of rules manageable. We select a small but significant subset of representative rules obtained from diverse itemsets, which may be used as a starting point for further exploration. Three real datasets are used to motivate, describe and illustrate the itemset-driven rule extraction approach.

2. Problem Identification

We extracted rules from three real datasets, namely Iris, Flags (mlearn.ics.uci.edu/MLRepository.html) and Coffee (made available by EMBRAPA, Empresa Brasileira de Pesquisa Agropecuária). The first contains 150 samples of 3 species of the iris plant, with numeric attributes describing petal and sepal length and width. The continuous attributes were discretized into the categoric values "large", "medium" and "small", while keeping the class attribute. The Flags data contains 30 attributes describing the national flags of 194 countries, of which we selected 9 discrete attributes (religion, stripes, red, green, blue, white, crosses, sunstars and icon). The Coffee data contains a human expert evaluation, described by 12 attributes, of 11,526 coffee samples. Nine discrete attributes were selected for analysis, namely: agtron, grind, rating, class, consistency, aroma, flavor, body and type.



Itemsets were generated with an implementation of Apriori, considering the following (arbitrary) support thresholds: 30 (20%) for the Iris dataset; 20 (∼10%) for Flags; and 3,467 (∼30%) for Coffee. This resulted in 46 itemsets for the Iris data (25 2-itemsets, 15 3-itemsets and 6 4-itemsets); 88 itemsets for the Flags data (37 2-itemsets, 37 3-itemsets and 14 4-itemsets); and 176 itemsets for Coffee (35 2-itemsets, 58 3-itemsets, 53 4-itemsets, 25 5-itemsets and 5 6-itemsets).
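As a rough illustration of this step, the sketch below tallies frequent k-itemsets by size. It assumes a one-hot encoded transaction table and uses mlxtend's Apriori as a stand-in for the paper's own implementation; the function and variable names are ours.

```python
# Sketch: tallying frequent k-itemsets by size. mlxtend's apriori is an
# assumed substitute for the paper's own Apriori implementation.
import pandas as pd
from mlxtend.frequent_patterns import apriori

def itemsets_by_size(onehot: pd.DataFrame, min_support: float) -> dict:
    """Count frequent k-itemsets for each k >= 2."""
    frequent = apriori(onehot, min_support=min_support, use_colnames=True)
    frequent["k"] = frequent["itemsets"].apply(len)
    frequent = frequent[frequent["k"] >= 2]   # rules need at least 2 items
    return frequent.groupby("k").size().to_dict()

# For the discretized Iris data at the paper's 20% threshold, Section 2
# reports {2: 25, 3: 15, 4: 6}; exact counts depend on the discretization.
# counts = itemsets_by_size(iris_onehot, min_support=0.20)
```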

For each dataset, rules were extracted from the itemsets, and filtering was applied to reduce the resulting rule set. Rules were ranked based on a combination of interestingness measures (see Section 3.2) and the r best-ranked rules were selected, where r is the total number of itemsets obtained in the itemset generation step.

We then verified the rule diversity in the filtered rule sets. For each rule set, consider the itemsets that generated the most rules. Observing the top t itemsets, for varying values of t, and the percentage of rules they generate, as shown in Table 1, one notices that in all cases the filtered rule sets include rules from a very limited number of itemsets. For example, 25% of the itemsets account for nearly all the r best-ranked rules: 93.5% for the Iris dataset, 94.3% for Flags and 99.4% for Coffee. After filtering, many itemsets are not represented in the final rule set, although they may generate interesting rules.

Table 1. Rules generated by the top t% itemsets

Itemsets selected   Rules: Iris    Rules: Flags   Rules: Coffee
Top ∼5%             22 (47.8%)     39 (44.3%)      87 (49.4%)
Top ∼10%            30 (65.2%)     58 (65.9%)     126 (71.6%)
Top ∼15%            35 (76.1%)     71 (80.7%)     151 (85.8%)
Top ∼20%            41 (89.1%)     79 (89.8%)     167 (94.9%)
Top ∼25%            43 (93.5%)     83 (94.3%)     175 (99.4%)
...                 ...            ...            ...
Top 100%            46 (100.0%)    88 (100.0%)    176 (100.0%)

3. Itemset-Driven Rule Extraction

We define the rule diversity of a rule set S as the number of distinct itemsets that generated the rules in S, relative to the total number of extracted itemsets. Our goal is to maximize rule diversity while keeping the resulting rule set manageable. This is made possible by an itemset-driven rule generation approach, in which rules are gradually extracted from groups of similar frequent itemsets. The overall rationale is as follows. Several steps are executed, each handling itemsets of size k, for k ≥ 2, as illustrated in Figure 1.
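A minimal sketch of this diversity measure, assuming rules are represented as (antecedent, consequent) pairs of frozensets; the representation and names are ours, not from I2E.

```python
# Sketch: rule diversity as defined above. The source itemset of a rule is
# the union of its antecedent and consequent.
from typing import FrozenSet, Iterable, Set, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str]]  # (antecedent, consequent)

def rule_diversity(rules: Iterable[Rule],
                   all_itemsets: Set[FrozenSet[str]]) -> float:
    """Fraction of extracted itemsets represented by at least one rule."""
    source_itemsets = {ant | cons for ant, cons in rules}
    return len(source_itemsets & all_itemsets) / len(all_itemsets)
```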

[Figure 1. The itemset-driven approach: for each k = 2..i, the steps are Data Pre-processing → Generating and Filtering k-itemsets → Generating k-rules → Cluster-oriented filtering of k-rules → Navigating rule space.]

For each valid k, frequent k-itemsets are obtained first. The resulting itemsets are clustered based on their similarity, so that itemsets with many items in common are assigned to the same cluster. Rules are then extracted from each cluster of itemsets and may be filtered according to a user-defined metric. The resulting rule set is the union of all surviving rules, from all k-itemsets considered, and provides a starting point for further user navigation through the rule space in search of additional interesting rules. This approach is supported by I2E (Interactive Itemset Exploration), a system devised to assist miners in exploring itemsets and association rules. Each step of the approach, except pre-processing, is described in the following.
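The skeleton below summarizes the loop of Figure 1. All helper names are hypothetical placeholders for the steps detailed in Sections 3.1 to 3.3; they are injected as callables so the skeleton stays independent of any concrete implementation.

```python
# Hypothetical orchestration of the Figure 1 loop; not the I2E code itself.
def itemset_driven_extraction(data, max_k, steps):
    """steps: dict of callables 'generate', 'review', 'cluster',
    'extract', 'filter', one per Figure 1 stage (names are ours)."""
    final_rules = []
    for k in range(2, max_k + 1):
        itemsets = steps["generate"](data, k)        # s-Apriori step (Sec. 3.1)
        itemsets = steps["review"](itemsets)         # interactive discard (Sec. 3.1)
        for cluster in steps["cluster"](itemsets):   # k-means grouping (Sec. 3.1)
            rules = steps["extract"](cluster, data)  # representatives (Sec. 3.2)
            final_rules.extend(steps["filter"](rules))  # per-cluster cut (Sec. 3.3)
    return final_rules
```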

3.1. Generating and Filtering k-itemsets

The core of this step is the Apriori algorithm, which starts from 1-itemsets and performs a series of joins over k-itemsets to generate (k + 1)-itemsets, for increasing k. The I2E system embeds a modified implementation introduced by Holler [7], referred to here as s-Apriori (stepped Apriori). It generates k-itemsets step by step, for varying sizes of k, enabling user interaction after each step and user control over the generation of larger itemsets. Thus, after each step, users may interact with a visual representation of the resulting itemsets (see Figure 2) and discard those considered not relevant. Filtering based on user interest at this early stage reduces the number of subsequent frequent itemset combinations, and the number of final rules.
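A simplified sketch of the stepping idea, not Holler's actual code: each iteration yields the frequent k-itemsets after a user-supplied review callback has had a chance to discard some, so later joins operate only on the survivors.

```python
# Sketch of stepped Apriori: joins proceed one level at a time, with a
# review hook between levels. `transactions` is an iterable of item sets.
from itertools import combinations

def step_apriori(transactions, min_count, max_k, review):
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= min_count]
    for k in range(2, max_k + 1):
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == k}
        frequent = [c for c in candidates
                    if sum(c <= t for t in transactions) >= min_count]
        current = review(frequent)   # user inspects and discards itemsets here
        yield k, current
```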

Finally, the remaining k-itemsets are clustered based on their similarity. A k-means clustering [11] is applied, employing the same similarity metric used to generate the projection-based itemset visualization, and outputs clusters of similar itemsets, that is, itemsets that share many items. These clusters typically reflect strong associations between specific items, and provide the basis for further rule filtering, performed later in the process.
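A rough sketch of this clustering step, under the assumption that one-hot encoding the itemsets and running scikit-learn's k-means approximates the similarity metric used by I2E's projection view (the paper does not publish that metric).

```python
# Sketch: grouping itemsets so that those sharing many items land together.
import numpy as np
from sklearn.cluster import KMeans

def cluster_itemsets(itemsets, n_clusters):
    vocab = sorted({i for s in itemsets for i in s})
    # One-hot encode each itemset over the item vocabulary.
    X = np.array([[1.0 if i in s else 0.0 for i in vocab] for s in itemsets])
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(X)
    clusters = {}
    for itemset, label in zip(itemsets, labels):
        clusters.setdefault(label, []).append(itemset)
    return list(clusters.values())
```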



3.2. Generating k-rules

Rules are now generated for each k-itemset that remained after user interaction. All possible rules are extracted from the current k-itemsets, and then, for each itemset, a subset of the best-ranked rules is picked as its representative rule set. Representative rules must be well ranked, but deciding how many rules to keep requires some thought. One may, for example, employ Pareto's principle to define a suitable subset of representative rules. This principle, also known as the "80-20 rule", states that, for many phenomena, 80% of the consequences stem from 20% of the causes [9]. Following this principle, one might pick the 20% best-ranked rules as representatives of each itemset. Alternatively, more or fewer rules may be picked, depending on characteristics of the application. For example, to keep the rule set as small as possible, one may select only the best-ranked rule as the representative of each itemset.
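A small sketch of the Pareto-style selection, assuming each itemset's rules are already sorted best-first; names are illustrative.

```python
# Sketch: keep the best `fraction` of rules per itemset, at least one each.
import math

def representatives(ranked_rules: dict, fraction: float = 0.20) -> list:
    """ranked_rules: itemset -> rules sorted best-first."""
    kept = []
    for itemset, rules in ranked_rules.items():
        n_keep = max(1, math.ceil(fraction * len(rules)))
        kept.extend(rules[:n_keep])
    return kept
```

Setting fraction so that n_keep is always 1 reproduces the minimal variant mentioned above, one representative rule per itemset.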

Many interestingness metrics can be employed to evaluate association rules [5], and selecting the relevant ones for the task is not straightforward. In its current implementation, I2E combines several metrics for rule ranking in the following manner: each rule is ranked according to each metric separately, and its individual ranks are averaged to obtain its final rank. This approach is insensitive to the scale problem associated with combining different metrics. The current ranking system uses the following metrics: support, confidence, lift, leverage, certainty factor, Jaccard, Gini index and Yule's Q. Nonetheless, other user-defined metrics or ranking systems could be incorporated.
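A sketch of this rank-averaging scheme; the metric functions are assumed given, and tie handling is simplified since the paper does not specify it.

```python
# Sketch: combine metrics by averaging per-metric ranks, which avoids the
# scale problem of mixing raw metric values directly.
def combined_rank(rules, metrics):
    """metrics: list of functions rule -> float, higher = more interesting."""
    rank_sums = {id(r): 0.0 for r in rules}
    for metric in metrics:
        ordered = sorted(rules, key=metric, reverse=True)
        for position, rule in enumerate(ordered, start=1):
            rank_sums[id(rule)] += position
    # Lower average rank = better rule; return rules best-first.
    return sorted(rules, key=lambda r: rank_sums[id(r)] / len(metrics))
```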

3.3. Cluster-oriented rule filtering

Rules selected as representatives of k-itemsets but having low interestingness measures are candidates to be filtered out. The itemset clusters obtained previously provide the basis for an automatic cluster-oriented rule filtering approach. Rules generated from each itemset cluster are ranked, and the policy is to filter out a percentage of the worst-ranked ones. This filtering scheme is at the core of the itemset-driven approach, as it preserves itemset coverage in the resulting rule set. By filtering low-ranked rules within each cluster, filtering is distributed throughout the space of itemsets. Rather than eliminating low-ranked rules from the complete final rule set, this approach eliminates low-ranked rules originating from itemsets as diverse as possible. Without the clustering information, filtering is likely to strongly affect specific portions of the rule set, decreasing the coverage of itemsets by the rules.
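A sketch of the per-cluster cut, assuming a best-first scoring function is given; the 15% default mirrors the cut used in Section 4 but is otherwise arbitrary.

```python
# Sketch: apply the cut inside each itemset cluster rather than over the
# pooled rule set, so every region of the itemset space keeps its best rules.
def filter_per_cluster(clustered_rules, cut=0.15, score=None):
    """clustered_rules: list of rule lists, one per itemset cluster.
    score: rule -> float, higher = better (assumed given)."""
    survivors = []
    for rules in clustered_rules:
        ordered = sorted(rules, key=score, reverse=True) if score else list(rules)
        n_keep = max(1, int(round((1.0 - cut) * len(ordered))))
        survivors.extend(ordered[:n_keep])
    return survivors
```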

3.4. Navigating rule space

As a final step, interactive mechanisms are offered for users to navigate the space of rules, departing from the set of representative rules. The union of all extracted rules that survived the filtering process offers a compact set for further analysis, e.g., regarding rule composition (head and body), support and confidence, and interestingness metrics. Queries depart from the set of representative rules, while the search for related rules occurs over the complete rule set derived with the itemset-driven approach. From a user-selected target rule, the user can query for related rules of interest, such as: rules with the same antecedent or consequent as the target; rules originating from a subset or a superset of the itemset originating the target; or other rules originating from the same itemset. By inspecting and comparing such rules, a user can conduct a deeper analysis of the extracted rules and explore the rule space.
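Sketches of the four query types, reusing the (antecedent, consequent) pair representation assumed earlier; src(r) denotes the itemset that originated rule r. These are our illustrations, not the I2E query code.

```python
# Sketch: the four neighborhood queries over a rule set.
def src(rule):
    ant, cons = rule
    return ant | cons  # the itemset that originated the rule

def same_antecedent(target, rules):
    return [r for r in rules if r[0] == target[0] and r != target]

def same_consequent(target, rules):
    return [r for r in rules if r[1] == target[1] and r != target]

def sub_or_superset(target, rules):
    # Rules from a proper subset or superset of the target's source itemset.
    t = src(target)
    return [r for r in rules if src(r) < t or src(r) > t]

def same_itemset(target, rules):
    return [r for r in rules if src(r) == src(target) and r != target]
```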

4. Results and Comparison

To validate the itemset-driven association rule extraction methodology, we performed a case study on the three datasets described previously. We compare the rule sets extracted from them using a conventional approach coupled with filtering and using the itemset-driven approach. A more detailed description is given for the Iris data; the approach was applied in the same way to the other two datasets. We assume that preprocessing to conform with the standard input was already done.

Association rules were extracted from the preprocessed Iris dataset, adopting a support of 20% (30). Our goal was to compare the set of rules obtained with the itemset-driven approach (the ID set) with the one obtained with a conventional rule extraction approach coupled with filtering based on a rank of combined measures (the Co set). The ID set contains n rules, where n is the total number of distinct itemsets computed with s-Apriori. For each itemset, a single rule (the best-ranked one) was selected as its representative. For comparison, the Co set includes the n best-ranked rules extracted with the conventional approach.

To measure the diversity of items appearing in the rule sets, we compare the number of rules derived from distinct itemsets in both rule sets (ID and Co), for all frequent k-itemsets. To evaluate the itemset-driven approach against the conventional one, we check, for both approaches, the percentage of itemsets whose rules remained in the final rule set.

Results regarding itemset coverage for the Iris data are shown in Table 2. Evidently, the ID set covers 100% of the itemsets, as one representative was selected from each of the 46 distinct itemsets extracted with s-Apriori.


The Co set, on the other hand, has an itemset coverage of 32.6%: the rule set extracted with the conventional approach followed by filtering contains rules derived from only 15 of the 46 possible itemsets. The Co set indeed keeps many rules generated from a few itemsets: the itemset "Iris-setosa, petalLengthSmall, petalWidthSmall, sepalLengthSmall" alone derived 12 of the 46 rules; another itemset derived 6 rules; 3 itemsets derived 4 rules each; one itemset derived 3 rules; 4 itemsets derived 2 rules each; and 5 itemsets derived 1 rule each. The remaining 31 itemsets are simply not represented in the Co rule set.

Tables 3 and 4 show the results of a similar analysis on the Flags and Coffee datasets, with supports of 20 (10.3%) and 3,467 (30.1%), respectively. The results were similar for the three datasets, and indicate that the conventional extract-and-filter approach fails to preserve itemset coverage in the resulting rule set. Since most itemsets are small (low values of k), and these small itemsets have low coverage in the resulting rule sets, overall itemset coverage is also low.

Table 2. Itemset coverage: Iris dataset

k    N. itemsets      Co set: N. rules  Co set: itemset  ID set: N. rules  ID set: itemset
     (N. max. rules)  (20.53% of max)   coverage         (20.53% of max)   coverage
2    25 (50)          9                 6 (24.0%)        25                25 (100.0%)
3    15 (90)          22                6 (40.0%)        15                15 (100.0%)
4    6 (84)           15                3 (50.0%)        6                 6 (100.0%)
All  46 (224)         46                15 (32.6%)       46                46 (100.0%)

Table 3. Itemset coverage: Flags dataset

k    N. itemsets      Co set: N. rules  Co set: itemset  ID set: N. rules  ID set: itemset
     (N. max. rules)  (17.89% of max)   coverage         (17.89% of max)   coverage
2    37 (74)          9                 7 (18.9%)        37                37 (100.0%)
3    37 (222)         33                12 (32.4%)       37                37 (100.0%)
4    14 (196)         46                8 (57.2%)        14                14 (100.0%)
All  88 (492)         88                27 (30.7%)       88                88 (100.0%)

Table 4. Itemset coverage: Coffee dataset

k    N. itemsets      Co set: N. rules  Co set: itemset  ID set: N. rules  ID set: itemset
     (N. max. rules)  (7.92% of max)    coverage         (7.92% of max)    coverage
2    35 (70)          6                 3 (8.6%)         35                35 (100.0%)
3    58 (348)         30                14 (24.1%)       58                58 (100.0%)
4    53 (742)         62                16 (25.8%)       53                53 (100.0%)
5    25 (750)         62                10 (40.0%)       25                25 (100.0%)
6    5 (310)          16                2 (40.0%)        5                 5 (100.0%)
All  176 (2220)       176               45 (25.6%)       176               176 (100.0%)

The ID set does include low-ranked rules, and these may be filtered out at the user's discretion. For the datasets considered above, we analyzed the impact of rule filtering on itemset coverage for both the ID and Co rule sets. Results are shown in Table 5.

For the Iris data, filtering out 15% of the lowest-ranked rules eliminates from its ID set the rules shown in Table 6. All such rules have very low interest measures, particularly the bottom one. For the Flags data, a 15% cut eliminates rules with low values for leverage (≤ 0.07), Jaccard (≤ 0.35) and Gini index (≤ 0.04). Finally, for the Coffee data, a 15% cut eliminates rules with low values for lift (≤ 1.17), leverage (≤ 0.06), Jaccard (≤ 0.56) and Gini index (≤ 0.03). Since rule interestingness is subjective, suitable filtering thresholds are user and application dependent. For these examples, 15% seemed a good percentage for an automatic filter.

Table 5. Itemset coverage after rule filtering

Cut (%)   Iris ID   Iris Co   Flags ID   Flags Co   Coffee ID   Coffee Co
0         100.0%    32.6%     100.0%     30.7%      100.0%      25.6%
15        84.8%     23.9%     84.1%      28.4%      84.7%       23.3%
25        73.9%     21.7%     73.9%      28.4%      74.4%       21.0%
50        47.8%     17.4%     48.9%      19.2%      49.4%       17.1%

Table 6. Rules dropped at rule filtering: Iris

Antec.   Cons.   Sup   Conf   Lift   Lev     CF      Jacc   Gini
PLS      SWM     36    0.72   1.23    0.04    0.33   0.35   0.02
PWS      SWM     36    0.72   1.23    0.04    0.33   0.35   0.02
setosa   SWM     36    0.72   1.23    0.04    0.33   0.35   0.02
PWL      SLM     30    0.65   1.38    0.06    0.34   0.34   0.03
virg.    SLM     32    0.64   1.35    0.06    0.32   0.36   0.03
SLS      SWM     37    0.63   1.07    0.02    0.10   0.34   0.00
SWM      SLM     37    0.42   0.89   -0.03   -0.10   0.30   0.01

The itemset-driven approach outputs a compact set of rules that provides a starting point for further exploration by the miner, who can navigate the rule space using the queries described in Section 3.4. We now illustrate possible queries over the rules obtained from the Iris dataset. Attribute names are abbreviated for brevity as follows: P stands for "petal", S for "sepal"; L for "length", W for "width"; L for "large", M for "medium", S for "small". Thus, PLM means "petal length medium", and so on.

By querying rules that share the same antecedent, users can compare which items appear most often with the items in the antecedent. For example, given the rule versicolor → PLM, with support 48 (32%) and 96% confidence, the query returns versicolor → PWM, with support 49 (32.7%) and 98% confidence, and versicolor → SLM, with support 36 (24%) and 72% confidence, as shown in the second table (from top to bottom) in Figure 2, which shows on the right the rule query interface of the I2E system. The analyst can then inspect the items most strongly associated with versicolor and their interest metrics; in this case, versicolor → PLM and versicolor → PWM have the higher measures.


By querying rules sharing the same consequent, users may compare which sets of items lead to it and the strength of these relations. For example, given the rule SWS → PLM, with support 30 (20%) and 64% confidence, the query returns PWM → PLM, with support 48 (32%) and 89% confidence, which has higher interest measures.

By querying other rules originating from a subset, or a superset, of the source itemset, users may check the impact of adding or removing an item from a target rule. For example, given the rule SWM, setosa → PLS, with support 36 (24%) and 100% confidence, the query returns setosa → PLS, with support 50 (33.3%) and 100% confidence, and SWM, PWS, setosa → PLS, with support 36 and 100% confidence. One verifies that "removing" SWM from the rule causes support to increase from 36 to 50, implying that SWM affects support negatively. On the other hand, "adding" PWS does not affect support.

Finally, querying other rules derived from the same itemset allows comparing the strength of different item combinations in the antecedent and consequent. This query helps evaluate how the interest measures of rules from a given itemset differ, as small differences probably indicate highly correlated items. For example, given the rule virginica → PLL, with support 44 (29.3%) and 88% confidence, the query returns PLL → virginica, with support 44 and 96% confidence, indicating a high correlation between virginica and PLL.

5. Related Work

Since association rules were first introduced [2], many related issues have been addressed, e.g., model interpretation, definition of interestingness measures, and combinatorial rule explosion. The last, the focus of this contribution, concerns reducing the cardinality of the final rule set so that it can be handled comfortably by an analyst.

Klemettinen et al. [10] use rule templates to separate interesting from uninteresting rules. Templates describe a set of rules by specifying the attributes in the antecedent and the attribute in the consequent (ignoring consequents with multiple items). Also seeking to reduce the quantity of rules to ease analysis, Domingues and Rezende [4] propose a post-processing stage that generalizes the rules based on a taxonomy of items.

Goethals and den Bussche [6] introduce the concepts of a priori and a posteriori filtering. In the first approach, the idea is to specify items of interest in the head, body or both, as a parameter that modifies the mining algorithm. This has the advantage of not generating irrelevant rules. The second approach consists of generating as many itemsets as possible, so that one can perform queries over the whole set of rules.

Aggarwal and Yu [1] proposed generating association rules online. The database is preprocessed into an adjacency lattice holding information about frequent itemsets, over which online queries can be performed. The approach finds rules with specific items in the antecedent or in the consequent. They also eliminate redundant rules, i.e., rules whose information is contained in another rule. For example, if X → YZ holds, then the rules XY → Z, XZ → Y, X → Y and X → Z are redundant. Ashrafi et al. [3] argue that this concept of redundancy causes information loss. In their view, only two kinds of rules are redundant: those with the same antecedent whose consequent is redundant (i.e., if X → Y and X → Z hold, then X → YZ is redundant), and those with the same consequent whose antecedent is redundant.

Jorge [8] employs hierarchical clustering to group and summarize large sets of association rules. Rules involving items of related themes are grouped together, and the rule set is then summarized by promoting one rule per group to representative, which eases further interactive user exploration. However, this approach deals with an already-determined set of rules, and does not guarantee itemset coverage.

Though the above approaches are effective in many situations, users handling a new dataset often face an exploratory scenario in which they strive to attain a global view of the domain. This is critical even for relatively small rule sets. In this context, an alternative is to give users, as a starting point for exploration, a subset of the rule space that preserves rule diversity. We believe the itemset-driven approach is a step in this direction, giving users additional control over which rules survive and which are filtered out.

6. Conclusions and Further Work

Devising effective ways to handle the huge rule sets output by association rule mining is still a major research challenge. The common approach of filtering rules based on global interestingness measures ignores aspects that may be relevant in exploratory mining scenarios, e.g., rule diversity. Consequently, analysts may remain unaware of potentially interesting rules. We propose an alternative approach that allows controlling the number of output rules while preserving rule diversity. We believe this approach provides a good starting point for exploratory scenarios in which users have little knowledge about the data. Compared with conventional rule filtering, our approach preserves itemset coverage even when increasing levels of rule filtering are applied to the resulting rule set.

As future work, we intend to further investigate the behavior and performance of itemset-driven rule extraction on larger real datasets. One issue is the choice of a proper set of representative rules. We are also improving the I2E system, looking for better visual representations of itemsets and rules, coupled with effective interaction facilities to support user-driven rule exploration.


[Figure 2. The rule exploration interface of the I2E system.]

7. Acknowledgements

We acknowledge the financial support of CNPq (the Brazilian Research Funding Agency), Grants 470272/2004-0 and 305861/2006-9, and C. H. Yamamoto's D.Sc. scholarship. We are also grateful to Fernando Paulovich for providing the k-means code, to Thiago Berlingieri for some system coding, and to EMBRAPA for the Coffee dataset.

References

[1] C. C. Aggarwal and P. S. Yu. A new approach to online generation of association rules. IEEE Trans. on Knowledge and Data Engineering, 13(4):527-540, Jul./Aug. 2001.

[2] R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In Proc. ACM Int'l Conf. on Management of Data, pages 207-216, Washington, DC, USA, May 1993.

[3] M. Z. Ashrafi, D. Taniar, and K. A. Smith. A new approach of eliminating redundant association rules. In Proc. Int'l Conf. on Database and Expert Systems Applications, pages 465-474, Zaragoza, Spain, Sept. 2004.

[4] M. A. Domingues and S. O. Rezende. Using taxonomies to facilitate the analysis of association rules. In Int'l Workshop on Knowledge Discovery and Ontologies, held within the European Conf. on Principles and Practice of Knowledge Discovery in Databases, pages 59-66, Porto, Portugal, Oct. 2005.

[5] L. Geng and H. J. Hamilton. Interestingness measures for data mining: A survey. ACM Comput. Surv., 38(3):9, 2006.

[6] B. Goethals and J. V. den Bussche. A priori versus a posteriori filtering of association rules. In Proc. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 1055-1061, Philadelphia, PA, USA, May 1999.

[7] M. Holler. http://www.helsinki.fi/∼holler/datamining/src/. Website, July 2007.

[8] A. Jorge. Hierarchical clustering for thematic browsing and summarization of large sets of association rules. In Proc. SIAM Int'l Conf. on Data Mining, Lake Buena Vista, FL, USA, Apr. 2004.

[9] J. Juran. The non-Pareto principle; mea culpa. http://www.juran.com/pdf/SP7518.doc, 1975.

[10] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. In Proc. ACM Int'l Conf. on Information and Knowledge Management, pages 401-407, Gaithersburg, MD, USA, Nov./Dec. 1994.

[11] B. G. Mirkin. Clustering for Data Mining: A Data Recovery Approach, pages 78-81. Chapman & Hall/CRC, 2005.

[12] R. Natarajan and B. Shekar. Interestingness of association rules in data mining: Issues relevant to e-commerce. SADHANA - Academy Proceedings in Engineering Sciences, 30(2&3):291-309, April/June 2005.

[13] R. A. Sinoara and S. O. Rezende. A methodology for identifying interesting association rules by combining objective and subjective measures. Revista Iberoamericana de Inteligencia Artificial, 10(32):19-27, 2006.
