an integrated approach for regional association rule ...ceick/kdd/edwyk06.pdf · 2 an integrated...

An Integrated Approach for Regional Association Rule Mining and

Scoping

Christoph F. Eick1,Wei Ding1, Jing Wang1, Xiaojing Yuan2, Shuhab Khan3

University of Houston{ceick, wding, jwang29, xyuan, sdkhan}@uh.edu∗

14th October 2006

Abstract

A special challenge for spatial data mining is that in-formation is not spread uniformly in spatial data sets.Consequently, the discovery of regional knowledge isof fundamental importance. However, traditional datamining techniques are ill-prepared for discovering re-gional knowledge. This paper introduces a methodol-ogy for mining spatial association rules and proposesnovel algorithms to determine the scope of a regionalassociation rule. We introduce a reward-based regiondiscovery framework that employs clustering to find in-teresting regions. An integrated approach is applied tosolve two distinct region discovery problems: identify-ing interesting regions for regional association rule min-ing, and determining the scope where a regional associ-ation rule holds. Moreover, our association rule miningmethodology is supervised, centering on finding ruleswith respect to an underlying class structure. The classstructure itself is also used for rule pruning and for thediscretization of numerical attributes.We evaluate ourapproach in a real-world case study to identify spatialrisk patterns from arsenic in the Texas water supply.The experimental are validated by domain experts andcompared with published results in geoscience research.

1 Introduction

Data mining has been recognized as a key means offinding interesting patterns in large datasets. Thegoal of spatial data mining is to automate the extrac-tion of interesting, useful but implicit spatial patterns[1, 2, 3, 4, 5, 6]. One special challenges in spatial datamining is that information is usually not uniformly dis-

∗1. Computer Science Department, University of Houston,

Houston, TX 77004

2. Engineering Technology Department, University of Houston,

Houston, TX 77004

3. Geosciences Department, University of Houston, Houston, TX

77004

tributed in spatial datasets. Consequently, the discov-ery of regional knowledge is of fundamental importancefor spatial data mining. It has been pointed out in lit-erature [7, 8] that “whole map statistics are seldom use-ful ”, that “most relationships in spatial data sets are geo-graphically regional, rather than global ”, and that “thereis no average place on the Earth’s surface” – a county isnot a representative of a state, and a state is not a rep-resentative of a country. Therefore, it is not surprisingthat domain experts are most interested in discoveringhidden patterns at a regional scale rather than a globalscale [7, 8].

Unfortunately, most data mining techniques are ill-prepared for discovering regional knowledge. For ex-ample, when using traditional association rule mining,regional patterns frequently fail to be discovered dueto insufficient global confidence and/or support. Fur-thermore, there are no pre-specified regions for a spa-tial dataset. This raises the questions on how to mea-sure the interestingness of a set of regions and how toidentify interesting regions algorithmically. One popu-lar approach is to select regions based on a priori givenstructure, such as a grid structure based on longitudeand latitude or political boundaries; for example, us-ing counties as regions of a state. Unfortunately, theboundary of so constructed regions frequently does notmatch the surface boundary of the interesting patterns,thus making them unlikely to be discovered. For exam-ple, assume there are high arsenic concentrations alonga river that crosses multiple counties in Texas, miningregional rules at county level is unlikely to detect thispattern, due to Simpson’s paradox.

Another challenge for mining regional spatial asso-ciation rules is from the characteristics of regions andregional knowledge. Regional association rules, by def-inition, only hold in a subspace and not in the globalspace; otherwise, they would be global. Furthermore,regional association rules can only be discovered in aparticular subspace of the global space. However, a re-

Table 1: A two-way contingency table between the welldepth and arsenic concentration.Well Depth Dangerous Safe Total(0, 251.5] 1000 1000 2000(251.5, ∞) 1200 800 2000Total 2200 1800 4000

Table 2: A three-way contingency table between geo-graphic zone A and zone B.

Well Depth Dangerous Safe TotalZoneA (0, 251.5] 400 100 500

(251.5, ∞) 1050 450 1500ZoneB (0, 251.5] 600 900 1500

(251.5, ∞) 150 350 500Total 2200 1800 4000

gional association rule that was discovered in a partic-ular subspace might also be valid in another subspace.Consequently, we are interested to determine the set ofregions in which the association rule is valid, which wecall the scope of a regional association rule.

This paper introduces a methodology for miningspatial association rules and proposes novel algorithmsto determine the scope of a spatial association rule.We propose a reward-based region discovery frameworkthat employs clustering to find interesting regions. Theframework is applied to solve two distinct region discov-ery problems: identifying interesting regions for regionalpattern discovery, and determining the scope of a givenassociation rule. Moreover, our association rule miningmethodology is supervised, centering on finding ruleswith respect to an underlying class structure, and theclass structure itself is also used for rule pruning and forthe discretization of numerical attributes.

This paper is organized as follows: Section 1.1discusses Simpson’s paradox. Section 2 introducesour integrated approach for regional association rulemining and scoping and Section 3 describes algorithmsof region discovery algorithm and regional associationrule mining. Section 4 presents the results of a real-world case case study, Section 5 reviews related work.We conclude the paper in Section 6.

1.1 Simpson’s Paradox Simpson’s Paradox [9], awell known issue in statistics, indicates that globalpatterns can be very different from local patterns. Thephenomenon is also known as spatial heterogeneity [5]in GIS. We illustrate the nature of this paradox using asample dataset in Table 1 and 2.

Consider the relationship between well depth andarsenic concentration in Table 1. An association rule,

sample_rule, suggests that a well up to 251.5-feet deepis associated with dangerous arsenic levels:

sample_rule : is_a(X, well) ∧ depth(X, 0 − 251.5)

→ arsenic_level(danderous).

whether the rule holds or not depends on where the welllocates. We calculate the confidence of the sample_ruleglobally (zone A and B in Table 1) and locally (ZoneA and B separately in Table 2). Let’s assume thatthe minimum confidence threshold is 70%. In Table1, the rule is not strong enough to be identified globallybecause its confidence, 1000

2000 = 50%, is less than theminimum confidence threshold. But in Zone A (Table2) the rule holds because its confidence, 400

500 = 80%, isabove the specified minimum confidence threshold. Thisrule does not hold in zone B (Table 2) either, due to itslow confidence ( 600

1500 = 40%).Hence a well up to 251.5-feet deep is positively

associated with dangerous arsenic levels in zone Abut is negatively associated in the zone B and thecombined global dataset. The reversal in the directionof association is known as Simpson’s paradox.

The Simpson’s paradox occurs when some under-lying variables that have a large effect on the ratios ofstratified data. For the dataset in the Table 1 and 2, theunderlying variable would be the zones; and the strati-fied data are in zone A and zone B respectively.

2 An Integrated Approach for RegionalAssociation Rule Mining and Scoping

Our approach uses a reward-based framework to find in-teresting regional association rules, and then determinethe scope of those rules. Here we break our approachinto four steps:

1. Discover and identify interesting regions, based onour reward criteria.

2. Mine regional association rules in regions generatedin the first step.

3. Identify scope of each interesting association rulediscovered in the second step.

4. Evaluate regional patterns and scopes; fine tunethe parameter settings and repeat step 1 to 3 if theresults are not satisfactory.

The first three steps are illustrated in Figure 1. First, weuse a reward-based region discovery algorithm to searchfor highly rewarded regions of hotspots and coolspots.Secondly, each region is examined and frequent itemsetsfor that region are generated. Regional association

Figure 1: An Integrated approach for Regional Association Rule Mining and Scoping. The example depicts aregion of arsenic hotspots and the scope of the identified association rule a on the terrain map of Texas.

rules are then constructed from these frequent itemsets.Thirdly, for each interesting association rule, we scanthe dataset for a second time to identify the scope wherethe rule satisfies the minimum support and confidencethresholds. Informally, scope is a set of regions wherethe confidence and support of the association rule aresignificantly higher than the minimum support andconfidence thresholds. The higher the confidence andsupport of the association rule in a region are, thegreater the reward the region receives. We shall definethe regions and scopes more precisely later. Finally, werepeat step 1 and 3 if the results are not satisfactory.Although the dataset is scanned twice, our hierarchicalgrid-based algorithm can reduce the cost significantly.Detailed computation cost analysis is in Section 3.2.

As a real example, Figure 1 shows a region ofarsenic hotspots in Texas (step 1). We find the followingassociation rule a in the region (Step 2) with 100%confidence: The wells, with nitrate concentration lowerthan 0.085mg/l, have dangerous arsenic concentrationlevel. The scope of the rule is a set of regions on thesouth-east coast of Texas (step 3). Our experimentalanalysis shows that the rule a fails to be discoveredat the global (Texas state) level due to insufficientconfidence (less than 50%).

2.1 Measuring the Interestingness of a Set ofRegions Our region discovery algorithm employs areward-based evaluation scheme that evaluates the qual-

ity of the generated regions. Given regions RX ={r1, ..., rm}, the fitness function is defined as the sumof the rewards obtained from each region ri (i = 1..m)(Equation 1).

q(RX) =

m∑

i=1

reward(ri)

=

m∑

i=1

(interestingness(ri) × |ri|β), (1)

where |ri|β(β > 1) in q(RX) increases the value of

the fitness nonlinearly with respect to the region size|ri|. Consequently, the evaluation scheme encouragesmerging small regions if the overall rewards do notdecrease.

A region contains a set of spatial objects. Inaddition, for each pair of objects belonging to the sameregion, there always exists a path within this region thatconnects them. We search for regions r1, ..., rm suchthat:

1. The regions are disjoint: ri ∩ rj = Ø, i 6= j.

2. RX = {r1, ..., rm} maximizes q(RX).

3. The generated regions are not required to be a totalpartition of the global dataset D, that is, r1∪...∪rm

⊆ D.

4. r1, ..., rm are ranked based on the reward value.Regions that receive low rewards or non-rewardsare frequently discarded.

In the rest of the section, we first present how to applyour reward-based scheme to identify interesting regionsfor regional association mining. Then we define transac-tions and supervised association rules for regional pat-terns. At last, we present how to measure the inter-estingness of a set of regions with respect to a givenregional association rule.

2.2 Identifying Interesting Regions for Re-gional Association Rule Mining In our work, weconsider a region that is interesting for regional rulemining if it is a hotspot or coolspot with respect to aparticular class of interest. The concept of hotspots orcoolspots were first introduced in crime/disorder map-ping [10]. Techniques to discover hotspots and coolspotsin spatial datasets have been developed in our previouswork [6, 11, 12] and proved to be effective. We definea region of hotspots/coolspots as a region in which thedensity of the class of interest cl is much higher/lowerthan its average density in the whole dataset. Interest-ingness for a region ri, function τ = interestiness(ri)(Equation 2), is defined as

if P (ri, cl) < priori(cl) × γ1

τ =[

prior(cl)×γ1−P (ri,cl)prior(cl)×γ1

× R−

]η

.

elseIf P (ri, cl) > priori(cl) × γ2

τ =[

P (ri,cl)−prior(cl)×γ2

1−prior(cl)×γ2

× R+

]η

.

otherwise

τ = 0.

(2)

where prior(cl) × γ1 and prior(cl) × γ2 (γ1, γ2 ∈[0, 1]) are the thresholds to determine whether or not aregion is rewarded; P (ri, cl) is the probability of objectsin region ri belonging to the class of interest cl; prior(cl)is the average density of objects in the whole datasets D

belonging to the class cl; R+ and R−( ∈ [0, 1]) are themaximum reward for regions of hotspots and coolspotsrespectively. The parameter η (η > 0) determineshow quickly the value of interestingness grows to themaximum reward (either R+ or R−). If η is set to 1, thereward function changes linearly, as shown in Figure 2.In general, the larger value for η is, the more interestedwe are in high density regions.

2.3 Regional Association Rule Mining Ourframework leads to a class-guided generation of associa-tion rules that sheds more light on the patterns relatedto the given class structure. Let D be a spatial dataset,

Figure 2: The function of interestingness measurement,τ, when η = 1.

and S = {s1, s2, ..., sl} be a set of spatial attributes,such as longitude or latitude. A = {a1, a2, ..., am} be aset of non-spatial attributes (continuous attributes arerequired to be transformed into nominal attributes), andcl be the class label. Let

I = S ∪ A ∪ {cl}

= {s1, s2, ..., sl, a1, a2, ..., am, cl}

be the set of all the items in D. . Let T = {t1, t2, ..., tN}be the set of all the transactions. T can be representedas a relational table, which contains N tuples conform-ing to the schema I (I contains l + m + 1 number ofitems). Thus an item i ∈ I is a binary variable whosevalue is 1 if the item is present in ti (i = 1, ..., N) and0 otherwise. Consequently, the set of transactions T isclassified based the class cl. We are interested in asso-ciation rules that associate the class label cl either inthe antecedent or the consequent predicate. We definesuch rules as supervised association rules. The formaldefinition is

Definition 1 A supervised association rule r isof the form P → Q, where P ⊆ I, Q ⊆ I,and (P ∪ Q) ∩ {cl} 6= Ø.

The rule r holds in the D with confidence con andsupport sup, where sup(P → Q) = σ(P∪Q)

N, con(P →

Q) = σ(P∪Q)σ(P ) . The support count is defined as σ(α) =

|{ti|α ⊆ ti, ti ∈ T }|, (i = 1, ..., N), where | . |denotes the number of elements in a set. A supervisedassociation rule is strong if it satisfies user-specifiedminimum support (min_sup) and minimum confidence(min_conf) thresholds.

2.4 Association Rule Scoping: Measuring theInterestingness of Association Rule Scope Whenmining regional association rules, we are interestedin a high-density region with respect to the class ofinterest cl. However, we need to redefine the measureof interestingness for regional association rule scoping.

The goal of association rule scoping is to compute aset of regions where a given association rule is valid.The scope of a regional association rule indicates howregional or global the rule is. In this section, we presentan approach for association rule scoping using the regiondiscovery framework introduced in Section 2.1, wherethe measure of interestingness depends on support andconfidence of a given association rule.

Let a be an association rule, r be a region,conf(a, r) denotes the confidence of a in region r, andsup(a, r) denotes the support of a in r.

Definition 2. The scope of the association rule a areregions in which the association rule a satisfies themin_sup and min_conf threshholds.

We define the interestingness of a region ri with respectto a given association rule a, τ ′ = interestingness(ri)(Equation 3) as follows:

if sup(a, ri) < min_sup × δ1 or

conf(a, ri) < min_conf × δ2,τ ′ = 0.otherwise

τ ′ = sup(a,ri)min_sup

× (conf(a,ri)−min_conf×δ2

.

1−min_conf×δ2

)η

(3)

where larger value of η, for example η = 2, givesmore weight to the increment of the confidence, becauseconfidence is more important when measuring an asso-ciation rule. Typically we choose δ1 = δ2 = 0.9 if theregion ri is at least 90% satisfied with the min_sup andmin_conf thresholds.

The measure of interestingness (Equation 3) is de-signed to efficiently identify the scope of a given regionalassociation rule. First, in contrast to traditional asso-ciation rule mining, the proposed measure of interest-ingness uses “soft” instead of “hard” support and con-fidence threshholds. For example, with δ1 = δ2 = 0.9,it rewards regions as long as their confidence or sup-port threshhold are within 90% of the hard threshholdsmin_conf and min_sup. Assume min_sup = 10%,min_conf = 80%, and an association rule whose sup-port is 9% and confidence is 100% in a region r′. Insteadof assigning zero reward to region r′, we argue to re-ward the region because confidence of the rule is abovemin_conf threshold and its support is just a little bitlower (1%) than the min_sup threshold. Secondly, ourapproach uses a quantitative evaluation method that as-signs higher degree of interestingness and consequentlyhigher reward to regions whose support and confidenceare high with respect to the association rule of interest.Consequently, those regions are most likely ranked ashighly rewarded regions. Thirdly, with our methodol-ogy, an association rule a can be discovered by applying

supervised association rule mining to a particular regionr. Due to the fact that a satisfies the support and con-fidence threshold in r, we already know that the region,r, from which a originated, receives a positive reward.This is also confirmed by our experimental results. Insummary, our method provides a quantitative way todetermine which region will be reported to form thescope of rule a; e.g. we may decide only include regionswhose rewards are ranked as top 80%.

2.5 Other Applications of Association RuleScoping Association Rule Scoping has many applica-tions that lie outside of the regional association miningmethodology we introduced in the former section. Welist below several interesting applications. First, a do-main expert can use it to determine the scope of his ownassociation that is not necessarily originated from ourregional association rule mining. For example, a domainexpert can check whether an arsenic association, whichis valid in Texas, also holds in Bangladesh, a countrythat has serious arsenic contamination in drinking wa-ter. Secondly, a domain expert may be interested in thechange of the scope if a condition in the antecedent ofan association rule is dropped. Furthermore, in additionto find the scope where an association holds, it mightbe interesting to search for the scope where it does nothold. For example, if we find that high levels of iron as-sociates with high arsenic concentration in one region,but with low arsenic concentration in another region,this case should be further analyzed.

The association rule scoping can also be used todevelop interactive tools that allow domain experts tomodify association rules and to visualize how the scopeof a rule changes corresponding to such modifications.Such a tool is very useful for educational purpose tosimulate and visualize the spatial variability of anyscientific knowledge that can be represented in the formof association rules.

Last but not least, the regions obtained from theassociation rule scoping can serve as a source for creat-ing new interesting association rules. For example, wecan conduct a study of regions where high levels of ironassociate high levels of fluoride; we can then determinethe regions using our method for the association rulehigh_iron(X) → high_fluoride(X).

In summary, it is very important for spatial datamining to realize and utilize the duality between re-gional patterns and regions where the patterns are sup-ported: regions can be used to create novel regionalpatterns and regional patterns can be used to createinteresting regions.

3 Algorithms

In this section, we first give details of our regiondiscovery algorithm based on the reward and fitnessfunctions defined in section 2. Then we explain ourregional association rule mining algorithm.

3.1 Region Discovery Algorithm:Supervised Clustering Using Multi-ResolutionGrids (SCMRG) We have developed an algorithmcalled Supervised Clustering using Multi-ResolutionGrids to identify promising regions. The algorithm isa hierarchical grid-based method that utilizes a divi-sive, top-down search: each cell at a higher level is par-titioned further into a number of smaller cells at thelower level, and this process continues if the sum of therewards of the lower level cells is greater than the ob-tained reward for the cell at the higher level. The re-turned cells usually have different sizes, because theywere obtained at different level of resolution. A queuedata structure is used to store all the cells that need beprocessed. The algorithm (see Figure 3) starts at a userdefined level of resolution, and considers the followingthree cases when processing a cell c.

1. Case 1. If the cell c receives a reward, and itsreward is greater than the sum of the rewards ofits children (succ(c)) and the sum of rewards of itsgrandchildren respectively, this cell is returned asa cluster by the algorithm (step 15-17).

2. Case 2. If the cell c does not receive a rewardand its children and grandchildren do not receivea reward, neither the cell nor any of its decedentswill be labeled as a cluster (step 23-29).

3. Case 3. Otherwise, if the cell c does not receivea reward, but its children receive rewards, put allthe children of the cell c (succ(c)) into a queue forfurther processing (step 18-21, step 24-28).

We traverse through the hierarchical structure andexamine those cells in the queue from the higher level.The algorithm uses a user-defined cell size as depthbound. Cells that are smaller than this cell will not besplit any further (step 19, step 25). Finally, we collect allthe cells that have been identified in case 1 from differentlevels, and we merge neighbor clusters if it improves thefitness as defined in Equation 2. The obtained regionsare returned as the result of executing SCMRG (step31-33).

The example in Figure 4 explains the procedureof this algorithm using a sample dataset. The first

SCMRG (min_cell_size)1. Determine a level of resolution l to

start with.

2. Assign spatial objects to grid cells.

3. for each cell c at current level l do

4. enqueue(c, cellQueue).5. end for

6. while NOT empty(cellQueue) do

7. c = dequeue(cellQueue).8. r = reward (c). {Calculate reward for

the cell.}

9. for each cchild ∈succ(c) do

10. rchildren = rchildren+reward (cchild).

11. end for {Calculate reward for its

children.}

12. for each cgrandchild ∈succ(succ(c)) do

13. rgrandchildren = rgrandchildren+reward (cgrandchild).

14. end for {Calculate reward for its

grandchildren.}

15. if r > 0 {The cell receives a

reward.}

16. if r > rchildren ANDr > rgrandchildren

17. label the cell as a cluster.

18. else {The cell should be divided

further.}

19. if ( the size of each

cchild ∈succ(c) >min_cell_size)20. enqueue(succ(c), cellQueue).21. end if

22. end if

23. else if r = 0{The cell does not receive

a reward.}

24. if NOT (rchildren = 0 ANDrgrandchildren = 0)

25. if ( the size of each

cchild ∈succ(c) >min_cell_size){The cell will be divided

further}

26. enqueue(succ(c), cellQueue).27. end if

28. end if

29. end if

30.end while

31.Collect all the cluster-labeled cells

from different levels.

32.Obtain regions by merging neighbor

clusters if it improves the fitness.

33.Return the obtained regions.

Figure 3: The Algorithm of Supervised Clustering usingMulti-Resolution Grids (SCMRG)

Figure 4: Running the SCMRG algorithm.

decomposition results in 4 cells c11, c12, c13, c14 at level1. Assume only c11 and c12 receive rewards, and thereward of c11 is greater than the sum of the rewards ofits children and the sum of rewards of its grandchildrenrespectively, c11 is then labeled as a cluster accordingto case 1. We also assume the cell, c14, does not receiveany rewards. Neither do its children nor grandchildrenreceive any rewards. According to case 2, c14 is notlabeled as a cluster, and its successors are not saved intothe queue. Although the cell, c13, receives no reward,assume its children receive rewards. According to case 3,all the children of c13 are saved into a queue for furtherprocessing. The cells at level 1 are then divided intolevel 2 and 3 and the same procedure is applied to all thecells in the queue. Each cell is labeled accordingly. Theintermediate results are shown at level 2 and 3 in Figure4. Neighbor clusters are then merged if it improves thefitness. In this example, there are two regions identified.

3.2 Analysis of the SCMRG Algorithm In theabove algorithm, Step 1 and 2 take constant time.Step 3 and 4 take a time linearly proportional tothe number of cells at the first layer, which is 4 inour experiment. Regarding the time spent from step6 to 30, there are two major activities: check thecandidate cells in the queue, and calculate the rewardof the cells. Our hierarchical grid-based algorithmcaptures clustering information associated with spatialcells without recourse to the individual objects as we donot drill down a cell if it does not look so promising.Hence the total number of cells to be examined isless than the total number of cells in our hierarchicalstructure. Notice that the total number of cells is 1.33K,where K is the number of cells at bottom layer. Weobtain the factor 1.33 because the number of cells ofa layers is always one-forth of the number of cells of

Figure 5: An itemset lattice.

the layer one level lower. The SCMRG algorithm eitheruses Equation 1 and 2 or Equation 1 and 3 describedin Section 2 to calculate the measure of interestingness.For Equation 2, prior(cl) is a constant, and P (ri, cl)is pre-calculated only once for each grid cell at thebottom level. For Equation 3, min_sup and min_confare constants, sup(a, ri) and conf(a, ri) again are pre-calculated only once for each grid cell at the bottomlevel. So in either case the computational cost is linearlyproportional to the number of objects. Consideringthe pre-computation only has to be done once, theoverall cost is less than O(N) if the algorithm is re-run several times with a different parameter setting. Instep 31 and 32, the time it takes to form the regionsis linearly proportional to the number of cells. Sothe overall computation complexity for multiple runsis far less than O(N), where N is the total number ofobjects. Thus the algorithm is capable of processinglarge datasets efficiently. The employed framework hassome similarity with the framework introduced in theSTING [13], another hierarchical grid-based algorithm.The difference is that our algorithm focuses on findinginteresting cells (that receive high rewards) instead ofcells that contain answers to a given query.

3.3 Generation of Regional Rules Once regionsare identified, we construct frequent itemsets for eachregion. We extend the Apriori algorithm [14] to gen-erate supervised association rules using the given classstructure: our algorithm not only prunes infrequent can-didate itemsets, but also discards itemsets that do notinvolve the class label. For example, Figure 5 shows anitemset lattice for an itemset I = {a, b, d, cl}. The item-sets in the lattice are divided into two groups: itemsetsthat are used to generate candidate itemsets containingclass label (colored in green/grey), and itemsets that are

Figure 6: Map of Texas showing arsenic concentrationlevel.

not involved with class label (they are pruned withoutcalculation of frequency). After frequent itemsets aregenerated, we use the same approach proposed by theApriori algorithm to generate strong associations usingthe min_confidence threshold.

4 A Real-World Case Study: Arsenic SpatialRisk Pattern Discovery in Texas

In this section we describe the experiment procedures ofapplying the proposed framework to a real world casestudy that identifies spatial risk patterns from arsenicin Texas water supply.

4.1 Data Collection and Data PreprocessingThe datasets used in this study are extracted from theTexas Ground Water Database (GWDB) maintainedby the Texas Water Development Board, the stateagency in charge of statewide water planning [15]. TheTexas Water Board has monitored and analyzed theconcentrations of this super toxic element over thelast 25 years. Arsenic in very high concentrations ispoisonous. Low-level, long term exposure to arsenic canlead to increased risk of cancer [16]. Arsenic is derivedfrom both anthropogenic sources – such as drainagefrom mines, pesticides, and biocides, and from naturalsources – such as hydrothermal leaching of arseniccontaining minerals or rocks. U.S. is one of the ninecountries that have been reported arsenic in drinkingwater by the World Health Organization [17].

Because data collection and maintenance proce-dures and standards have been changed over the yearsin the GWDB, datasets have to be cleaned to deal withproblems such as missing values, inconsistent data, and

Figure 7: Interesting regions are identified using β =1.01, η = 1, γ1 = 0.5, γ2 = 1.5, R+ = 1, R− = 1.Average region purity = 0.85.

duplicate entries. The obtained arsenic spatial datasetincludes spatial attributes (S), non-spatial attributes(A), and a class label (cl) for each water well. Spatialattributes include latitude and longitude, river basin,and state zone. Non-spatial attributes are selected withthe assistance of domain experts [18, 19, 20], and in-clude well depth, concentration of fluoride, nitrate, andother chemical metal elements, such as vanadium, iron,molybdenum and selenium etc. Regarding the class la-bel, water wells are classified into safe wells and danger-ous wells. A well is labeled as dangerous if its arsenicconcentration level is above 10µg/l, the standard fordrinking water by the Environment Protection Agency[21]. The GWDB data consists of 16,136 arsenic sam-ples of 131,949 water wells. We select 12,055 records1

from the GWDB. Figure 6 illustrates arsenic concen-tration in Texas, where safe wells are in green (or lightgrey), dangerous wells in red (or dark grey).

In preparation of the association rule mining, con-tinuous attributes need to be transformed into nominalattributes (as discussed in Section 2.3) except longitudeand latitude. We use the Recursive Minimal EntropyPartitioning algorithm [22] to discretize continuous at-tributes, because there is empirical evidence that thealgorithm produces better results in several real-worlddata mining tasks[23]. Discretization must avoid data

1Those records are sampled in accordance with the Texas

Water Development Board’s A Field Manual for Ground-Water

Sampling, 1990. "Samples are collected when temperature,

conductivity, and PH have stabilized. The sample is filtered and

field tested for alkalinity. Samples are preserved as applicable,

kept chilled, and delivered to the lab. Holding times are honored.

Organic sub-samples are not filtered."

Nitrate Safe DangerousBin 1 (0-0.085] 0% 100%Bin 2 (0.085-0.125] 59% 41%Bin 3 (0.125-0.135] 0% 100%Bin 4 (0.135-0.265] 72% 28%Bin 5 (0.265-16.1] 42% 58%Bin 6 (16.1-28.31] 63% 37%Bin 7 (28.31-∞) 38% 62%

Figure 8: Attribute discretization.priori(dangerous) = 22%,priori(safe) = 78%

fragmentation as too many small intervals may resultin low support count. This entropy-based algorithmsolves the problem in a global search: it uses a top-down method to recursively partition a range of valuesinto a set of intervals. The partition stops when thereis no more information gain. It is possible that someareas in the continuous spaces will be partitioned finelywhereas others will be partitioned coarsely if it has lowentropy. For example, Figure 8 shows a histogram forthe discretization of attribute nitrate in Region 3 (seeFigure 6), a region of arsenic hotspots. Bin 1, 3, and5 significantly deviate from the prior probability of thetwo classes – dangerous and safe. The discretizationmaximizes purity and therefore increases the probabil-ity of finding associations between the attribute rangeand the majority class of that range. In Region 3, wediscover six strong associations between nitrate and higharsenic concentration, where the nitrate concentrationsare in Bin 1 and 5 (Bin 3 does not show in the result be-cause its support is too low – only two wells have nitrateconcentration in this range).

4.2 Experimental Results Evaluation In this sec-tion we present experimental results in region discov-ery, regional association mining, and regional associa-

tion rule scoping with the validation from the publishedresults in geoscience. In Section 4.2-A we show the re-gions of arsenic hotspots and coolspots discovered byour region discovery algorithm. In Section 4.2-B, wepresent several interesting regional association rules forarsenic. We then show the scope of the regional asso-ciation rules. The experiment results clearly show thatour integrated framework will help the geoscientists toidentify the cause of high/low arsenic concentration indifferent regions.

A. Region Discovery As described in Section 3.1,the SCMRG algorithm uses spatial attribute longitude,latitude, and arsenic class label to do a divisive, top-down search. Figure 7 shows the result of such a run. Itillustrates four most highly rewarded regions – Region 1and 3 are regions of hotspots (high density of dangerouswells), and Region 2 and 4 are regions of coolspots (highdensity of safe wells). Region 1 and 3 overlap with thearsenic risk zone discussed by geoscientists in [24, 20].

B. Regional Association Rule Mining and Scop-ing Supervised association rules are generated fromeach region identified in the step of region discovery.The algorithm uses the whole set of attributes exceptlongitude and latitude, where min_sup = 10% andmin_conf = 80%. In the rest of section, we discussthe association rules derived from the four highly re-warded regions. The following four regional associationrules originally discovered from Region 1, 2, 3, and 4 re-spectively. Intuitively, mining regional rules in arsenichotspots/coolspots discovers attributes that are associ-ated with high/low arsenic concentrations. Associationrules 1 and 3 are confirmed by arsenic literature [18, 19].All the four association rules are meaningful and impor-tant according to domain experts.

is_a(X, well) ∧ nitrate(X, 28.31−∞) ∧

arsenic_level(danderous)

→ depth(X, 0 − 251.5) (100%). (1)

is_a(X, well) ∧ depth(X, 0 − 251.5) ∧

fluoride(X, 0 − 0.085)

→ arsenic_level(safe) (100%). (2)

is_a(X, well) ∧ nitrate(X, 0 − 0.085)

→ arsenic_level(danderous) (100%). (3)

is_a(X, well) ∧ depth(X, 251.5−∞) ∧

Figure 9: Region - Regional association rule - Scope. Legend: regions are highlighted by bold border line; scopesare in color blue (or light grey).

nitrate(X, 0.265− 16.1)

→ arsenic_level(safe) (100%). (4)

Figure 9 depicts that a scope may include severalregions. The experimental results also confirm ourdiscussion in Section 2.4: the figure clearly shows thatthe region, where an association rule is originated, is asubset of the scope where the association rule holds.

It is also important to point out that the scope of anassociation rule indicates how regional or global a localpattern is. For example, the scope of the associationrule 4 in Figure 9 covers a large percentage of theglobal space (> 75%). We find that the associationrule is strong (holds with 85% confidence) in the globaldataset. Hence it is indeed a global association rule.However, none of the association rule 1, 2, and 3 arediscovered at global level. We can also fine tune themeasurement interestingness of association rule scoping.Figure 10 shows such a scope tuning for the associationrule 3. Typically, lower value for min_sup resultsin larger scope, higher value for min_conf results insmaller scope.

Our algorithms are very computational efficient.On average, it takes 3.031 seconds for region discovery,and 1.67 seconds for regional association rule scopingwith a dataset that contains 12,055 spatial objects and24 attributes. The computer uses MS Windows XP,Intel(R) Pentium(R) M, processor 1.2GHz, 632 MB ofRAM.

In summary, from these experiments we identifiedmeaningful regions, generated meaningful regional as-sociation rules, and further identified the scopes of re-gional patterns at different granularity. Our experimen-tal results were confirmed and validated by literature ingeoscience[20, 18, 19, 24], with new rules and scopes fornew regions. The results also provide more systematicand concrete evidence that regional rules are not the

representative of global rules, and vice verse.

5 Related Work

The areas most relevant to our work are: associationrule mining, supervised and semi-supervised clustering,co-location rule discovery, and localized association rulemining.

Association rule mining has been introduced in[14] to mine interesting relationships hidden in marketbasket transactions. Spatial association rule mining [1,25] extends association rule mining to spatial datasets.A spatial association rule takes the form of

P1 ∧ P2 ∧ ... ∧ Pm → Q1 ∧ Q2 ∧ ... ∧ Qn (sup%, con%).

It denotes association relationships among a set of pred-icates Pi (i = 1, ..., m) and Qj (j = 1, ..., n), wherethere exists at least one spatial predicate. Most spa-tial association rule mining uses Apriori-style [14] algo-rithms, which requires dataset in the form of transac-tions. However, transaction definition is often implicitin spatial space. If spatial association rule discovery isrestricted to a reference feature (such as cities or wa-ter wells), then transactions can be defined using theinstances of this reference feature, as in [1]. Otherwise,transactions must be generated by mining algorithms,as in spatial co-location mining [26]. This paper adoptsthe transaction model in [1] (in our case, a water well isa reference feature) . While co-location rule discoveryidentifies subsets of spatial features frequently locatedtogether [26]. It focuses on finding frequent, global pat-terns that characterize the complete dataset, whereasour approach centers on analyzing spatial datasets atdifferent levels of granularity to discover regional pat-terns.

Supervised clustering [6, 27] focuses on partition-ing classified examples, maximizing cluster purity whilekeeping the number of clusters low. Semi-supervised

Figure 10: The scope of a particular rule changes based on the different values of min_sup or min_conf .

clustering employs a small amount of labeled data to aidunsupervised learning [28]. This paper applies super-vised clustering to a new problem: region discovery forregional association rule mining, and then determinesthe scope of a spatial association rule.

Aggarwal et al. [29] discuss a technique for discov-ering localized associations in segments of the basketdata using clustering. They share the same objectiveas ours: finding significant localized associations in thedataset which cannot be found from the aggregate data.However, there exist two major differences between thetwo approaches. First, we target on spatial datasets,which have very different characteristics from the tradi-tional basket data they use. Second, our approach usessupervised, reward-based, hierarchical top-down search,while their approach compare similarity between differ-ent basket transactions and uses bottom-up agglomera-tion to cluster transactions.

6 Conclusions

One critical requirement for spatial data mining is thecapability to analyze datasets at different levels of gran-ularity, in addition to analyze data globally. Further-more, it is desirable to have the capability to move be-tween different granularity, particularly if the obtainedresults are unsatisfactory. We also provided evidencethat discovering regional patterns is very important inspatial data mining. Unfortunately, the currently em-ployed association rule mining techniques do not offersuch capability. We see our work as a first breakthroughto provide such capabilities.

This paper introduces a methodology for miningspatial association rules and proposes novel algorithmsto determine the scope of a spatial association rule. Thepresented approach also takes advantage of the dualitybetween regions and rules: regions are used to discoverrules, and rules are used to discover regions.

We propose a reward-based region discovery frame-work that employs clustering to find interesting regions.The integrated approach is applied to solve two dis-tinct region discovery problems: identifying interestingregions for regional pattern discovery and determiningthe scope of a given association rule. Moreover, thepresented association rule mining methodology is su-pervised, centering on finding rules with respect to anunderlying class structure. In addition, a divisive, grid-based supervised clustering algorithm named SCMRGhas been discussed that searches for interesting regionsin large spatial datasets, maximizing a reward-basedfitness function that measures the interestingness of agiven set of regions.

Moreover, rule scoping provides unique capabilitiesto domain experts to explore how the changes of a rulecan impact the regions for which it is valid. We plan toimplement an interactive tool of rule scoping that allowsfor changing existing rules and for testing new rules, andthe changes in rule scope can be visually presented toa domain expert, as illustrated in Figure 10. We claim,providing such capabilities will be useful for exploratoryhypothesis testing and to increase the understandingof spatial datasets, if a domain expert concerns theregional characteristics of particular patterns she/he isinterested in.

We evaluated the proposed framework in a real-world case study to identify spatial risk patterns ofarsenic in Texas water supply. We identified arsenichotspots and coolspots and created regional rules fromthe obtained regions, and then determined the scopeof regional patterns. We have rediscovered several re-lationships that are already reported in the scientificliterature. Moreover, our approach identified new rela-tionships and regions for arsenic that provide scientistswith novel hypothesis that deserve further exploration.

References

[1] K. Koperski and J. Han, “Discovery of spatial asso-ciation rules in geographic information databases,” inProc. 4th Int. Symp. Advances in Spatial Databases,SSD, M. J. Egenhofer and J. R. Herring, Eds., vol.951, 6–9 1995, pp. 47–66.

[2] R. Munro, S. Chawla, and P. Sun, “Complex spatialrelationships,” in The Third IEEE International Con-ference on Data Mining (ICDM2003), 2003.

[3] S. Papadimitriou, A. Gionis, P. Tsaparas, A. Väisänen,H. Mannila, and C. Faloutsos, “Parameter-free spatialdata mining using MDL,” in 5th International Confer-ence on Data Mining (ICDM) 2005, 2005.

[4] J. F. Roddick and M. Spiliopoulou, “A bibliographyof temporal, spatial and spatio-temporal data miningresearch,” in SIGKDD Explorations, vol. 1, no. 1, 1999,pp. 34–38.

[5] S. Shekhar, P. Zhang, Y. Huang, and R. R. Vatsavai,“Book chapter in data mining: Next generation chal-lenges and future directions,” in Spatial Data Mining,H. Kargupta and A. Joshi, Eds. AAAI/MIT Press,2003.

[6] C. Eick, B. Vaezian, D. Jiang, and J. Wang, “Discov-ering of interesting regions in spatial data sets usingsupervised cluster,” in PKDD’06, 10th European Con-ference on Principles and Practice of Knowledge Dis-covery in Databases, 2006.

[7] M. F. Goodchild, “The fundamental laws of GIScience,”Invited talk at University Consortium for GeographicInformation Science, University of California, SantaBarbara, 2003.

[8] S. Openshaw, “Geographical data mining: Key designissues,” in GeoComputation, 1999.

[9] S. EH, “The interpretation of interaction in contingencytables,” Journal of the Royal Statistical Society, vol.B13, pp. 238–241, 1951.

[10] Mapping Crime: Understanding Hot Spots. U.S.Department of Justice, Office of Justice Programs:National Institude of Justice, 2005.

[11] J. Wang, “Region discovery using hierarchical super-vised clustering,” Master’s thesis, Computer ScienceDepartment, Univeristy of Houston, 2006.

[12] W. Ding, C. F. Eick, J. Wang, and X. Yuan, “Aframework for regional association rule mining in spa-tial datasets,” in The 6th IEEE International Confer-ence on Data Mining (ICDM’06), to appear, December2006.

[13] W. Wang, J. Yang, and R. R. Muntz, “STING: Astatistical information grid approach to spatial datamining,” in Twenty-Third International Conference onVery Large Data Bases. Athens, Greece: MorganKaufmann, 1997, pp. 186–195.

[14] R. Agrawal, T. Imielinski, and A. N. Swami, “Min-ing association rules between sets of items in largedatabases,” in Proceedings of the 1993 ACM SIGMODInternational Conference on Management of Data,

P. Buneman and S. Jajodia, Eds., Washington, D.C.,26–28 1993, pp. 207–216.

[15] T. W. D. Board, http://www.twdb.state.tx.us/home/index.asp,2006.

[16] S. A. H. et al., “Cancer risks from arsenic in drinkingwater,” in Environmental Health Perspectives, vol. 97,1992, pp. 259–267.

[17] W. H. Organization, http://www.who.int/, 2006.[18] P. F. Hudak, “Arsenic, nitrate, chloride and bromide

contamination in the gulf coast aquifer, south-centralTexas, USA,” International Journal of EnvironmentalStudies, vol. 60, pp. 123–133, 2003.

[19] L. M. Lee and B. Herbert, “A GIS survey of arsenicand other trace metals in groundwater resources ofTexas,” in Natural Arsenic in Groundwater: Science,Regulation, and Health Implications (Posters), 2001.

[20] R. Parker, “Ground water discharge from mid-tertiaryrhyolitic ash-rich sediments as the source of elevatedarsenic in south texas surface waters,” in NaturalArsenic in Groundwater: Science, Regulation, andHealth Implications, 2001.

[21] U. E. P. Agency, http://www.epa.gov/, 2006.[22] U. M. Fayyad and K. B. Irani, “Multi-interval dis-

cretization of continuous-valued attributes for classi-fication learning,” in Proceedings of the 13th Inter-national Joint Conference on Artificial Intelligence,M. Kaufmann, Ed., 1993, pp. 1022–1027.

[23] J. Dougherty, R. Kohavi, and M. Sahami, “Super-vised and unsupervised discretization of continuousfeatures,” in International Conference on MachineLearning, 1995, pp. 194–202.

[24] N. W.-Q. A. Program, “Ground-water quality of thesouthern high plains aquifer, Texas and New Mexico,open-file report 03-345,” U.S. Department of the Inte-rior and U.S. Geological Survey, Tech. Rep., 2001.

[25] S. Shekhar and S. Chawla, Spatial Databases: A Tour,T. D. Holm, Ed. Prentice Hall, 2003 (ISBN 013-017480-7), 2003.

[26] S. Shekhar and Y. Huang, “Discovering spatial co-location patterns: A summary of results,” LectureNotes in Computer Science, vol. 2121, pp. 236+, 2001.

[27] C. F. Eick, N. Zeidat, and Z. Zhao, “Supervised clus-tering: Algorithms and application,” in InternationalConference on Tools with AI, Boca Raton, Florida,2004, pp. 774–776.

[28] S. Basu, M. Bilenko, and R. J. Mooney, “Compar-ing and unifying search-based and similarity-based ap-proaches to semi-supervised clustering,” in the ICML-2003 Workshop on the Continuum from Labeled to Un-labeled Data in Machine Learning and Data MiningSystems, Washington DC, August 2003, pp. 42–49.

[29] C. C. Aggarwal, C. M. Procopiuc, and P. S. Yu,“Finding localized associations in market basket data,”IEEE Transactions on Knowledge and Data Engineer-ing, vol. 14, pp. 51–62, 2002.

an integrated approach for regional association rule ...ceick/kdd/edwyk06.pdf · 2 an integrated...

Documents