

Regional Association Rule Mining

Wei Ding, Christoph F. Eick, Jing Wang, Xiaojing Yuan∗

Department of Computer Science, University of Houston, Houston, TX 77204-3010

30th November 2006

1 Regional Association Rule Mining

This project [4] centers on regional association rule mining and scoping in spatial datasets. We introduce a methodology for mining spatial association rules and propose new algorithms to determine the scope of a spatial association rule. We develop a reward-based region discovery framework that employs clustering to find interesting regions. The framework is applied to two distinct region discovery problems: identifying interesting regions for regional association rule discovery, and determining the scope of a given association rule. Moreover, our association rule mining methodology is supervised: it centers on finding rules with respect to an underlying class structure, and the class structure itself is also used for rule pruning and for the discretization of numerical attributes.

2 Motivation

The goal of spatial data mining is to automate the extraction of interesting, useful, but implicit spatial patterns. One special challenge in spatial data mining is that information is usually not uniformly distributed in spatial datasets. The Earth’s terrain varies from place to place: a county is not representative of a state, and a state is not representative of a country. Many interesting associations in spatial datasets are geographically regional rather than global. Consequently, the discovery of regional knowledge is of fundamental importance for spatial data mining.

Unfortunately, traditional data mining techniques are ill-prepared for discovering regional knowledge. For example, when using association rule mining, regional patterns frequently fail to be discovered due to insufficient global confidence and/or support. Furthermore, the number of regions is not pre-defined for a spatial dataset. This raises the questions of how to measure the interestingness of a set of regions and how to identify interesting regions algorithmically. One popular approach is to select regions based on an a priori given structure, such as a grid based on longitude and latitude, or political boundaries (for example, using counties as regions of a state). However, the boundaries of regions constructed this way frequently do not match the surface boundaries of the interesting patterns, making those patterns unlikely to be discovered. For example, if there are high arsenic concentrations along a river that crosses multiple counties in Texas, mining regional rules at the county level is unlikely to detect this pattern because of insufficient support or confidence.

∗ Engineering Technology Department, University of Houston, Houston, TX 77204-3010

On the other hand, regional spatial association rule mining has its own unique characteristics. Regional association rules, by definition, hold in a subspace. Consequently, a regional association rule may be discoverable only in a particular subspace of the global space. In addition, a regional association rule that was discovered in one subspace may also be valid in another. Hence we are interested in determining the set of regions where the regional rule is valid. We define this set of regions as the scope of the regional association rule.

3 The Framework

Our approach uses a reward-based framework to find interesting regional association rules and then determine the scope of those rules. We break the approach into four steps:

1. Discover and identify interesting regions. We consider a region interesting for regional rule mining if it is a hot spot or cool spot with respect to a particular class of interest. Techniques to discover hot spots and cool spots in spatial datasets have been developed in our previous work [1, 2, 3];

2. Mine regional association rules in the regions generated in the first step;

3. Identify the scope of each interesting association rule discovered in the second step;

4. Evaluate regional patterns and scopes; fine-tune the parameter settings and repeat steps 1 to 3 if the results are not satisfactory.
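The control flow of the four steps above can be sketched as a small driver loop. This is only an illustrative outline, not the authors' implementation: the names `run_framework`, `discover_regions`, `mine_rules`, `find_scope`, `is_satisfactory` and the parameter `min_reward` are hypothetical, and the step functions are passed in as plain callables so that any region discovery or rule mining method could be plugged in.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

Params = Dict[str, float]
Region = str  # hypothetical: a region is identified by a name
Rule = str    # hypothetical: a rule is identified by a label


def run_framework(
    discover_regions: Callable[[Params], List[Region]],
    mine_rules: Callable[[Region, Params], List[Rule]],
    find_scope: Callable[[Rule, Params], Set[Region]],
    is_satisfactory: Callable[[Dict[Region, List[Rule]], Dict[Rule, Set[Region]]], bool],
    candidate_params: Iterable[Params],
) -> Optional[Tuple[Params, Dict[Region, List[Rule]], Dict[Rule, Set[Region]]]]:
    """Try parameter settings in order until the evaluation step accepts one."""
    for params in candidate_params:
        regions = discover_regions(params)                   # step 1
        rules = {r: mine_rules(r, params) for r in regions}  # step 2
        scopes = {rule: find_scope(rule, params)             # step 3
                  for found in rules.values() for rule in found}
        if is_satisfactory(rules, scopes):                   # step 4
            return params, rules, scopes
    return None  # no setting produced satisfactory results


# Toy stand-ins for the real steps (all made up for illustration).
result = run_framework(
    discover_regions=lambda p: ["R1"] if p["min_reward"] <= 0.5 else [],
    mine_rules=lambda region, p: [f"rule_in_{region}"],
    find_scope=lambda rule, p: {"R1", "R2"},
    is_satisfactory=lambda rules, scopes: len(scopes) > 0,
    candidate_params=[{"min_reward": 0.9}, {"min_reward": 0.4}],
)
```

In this toy run the first setting finds no regions, so the loop moves on to the second setting, which mirrors the "repeat steps 1 to 3" part of step 4.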

As a real example, Figure 1 shows a region of arsenic hot spots located on the southwestern Gulf Coast of Texas (step 1). In this region we find the following association rule a (step 2) with 100% confidence¹: wells with a nitrate concentration lower than 0.085 mg/l have a dangerous arsenic concentration level. The scope of the rule is a set of regions on the south-east coast of Texas (step 3). Our experimental analysis shows that rule a fails to be discovered at the global level (the state of Texas) due to insufficient confidence (less than 50%).
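To make the gap between regional and global confidence concrete, the following sketch computes support and confidence for a rule of the form "nitrate < 0.085 mg/l → dangerous arsenic level", once inside a single region and once over all wells. The well records are invented for illustration (they are not the Texas data), and the `Well` class and `support_confidence` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Well:
    region: str
    nitrate: float           # mg/l
    dangerous_arsenic: bool


def support_confidence(wells: Iterable[Well]) -> Tuple[float, float]:
    """Support and confidence of: nitrate < 0.085 -> dangerous arsenic."""
    wells = list(wells)
    antecedent = [w for w in wells if w.nitrate < 0.085]
    both = [w for w in antecedent if w.dangerous_arsenic]
    support = len(both) / len(wells) if wells else 0.0
    confidence = len(both) / len(antecedent) if antecedent else 0.0
    return support, confidence


# Made-up data: the pattern is perfect in one region and absent elsewhere.
wells = (
    [Well("gulf_coast", 0.05, True)] * 4
    + [Well("gulf_coast", 0.50, False)] * 1
    + [Well("elsewhere", 0.05, False)] * 6
    + [Well("elsewhere", 0.50, False)] * 9
)

regional = support_confidence(w for w in wells if w.region == "gulf_coast")
global_ = support_confidence(wells)
print("regional (support, confidence):", regional)
print("global   (support, confidence):", global_)
```

With these invented numbers the rule has full confidence inside the region but falls below a 50% confidence threshold globally, which is the kind of situation in which a global miner would discard it.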

The following examples show four regional association rules with 100% confidence identified from the four regions in Texas (see Figure 2), where Region 1 and Region 3 are arsenic hot spots located in the southern High Plains and on the southwestern Gulf Coast of Texas. The four association rules indicate what is associated with dangerous or safe arsenic concentrations. Figure 3 depicts the scope of the four regional association rules. We can also fine-tune the interestingness measure used for association rule scoping. Figure 4 shows such scope tuning for association rule 3. Typically, a lower value of min_sup results in a larger scope, and a higher value of min_conf results in a smaller scope.²

¹ Confidence of the association rule a measures how frequently dangerous arsenic concentration levels are associated with nitrate concentrations below 0.085 mg/l. The frequency is 100% in this example.

Figure 1: Regional Association Rule Mining and Scoping. The example depicts a region of arsenic hot spots and the scope of the identified association rule a on the terrain map of Texas.

Region 1. is_a(X, well) ∧ nitrate(X, 28.31 − ∞) ∧ arsenic_level(X, dangerous) → depth(X, 0 − 251.5)

Region 2. is_a(X, well) ∧ depth(X, 0 − 251.5) ∧ fluoride(X, 0 − 0.085) → arsenic_level(X, safe)

Region 3. is_a(X, well) ∧ nitrate(X, 0 − 0.085) → arsenic_level(X, dangerous)

Region 4. is_a(X, well) ∧ depth(X, 251.5 − ∞) ∧ nitrate(X, 0.265 − 16.1) → arsenic_level(X, safe)
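The effect of the min_sup and min_conf thresholds on the size of a scope can be illustrated with a deliberately simplified sketch. It assumes that candidate regions are already given and that the rule's support and confidence have been computed for each one; the framework itself discovers regions by reward-based clustering, which is not reproduced here. The region names and statistics are invented, and `scope` is a hypothetical helper.

```python
from typing import Dict, Set, Tuple


def scope(
    region_stats: Dict[str, Tuple[float, float]],
    min_sup: float,
    min_conf: float,
) -> Set[str]:
    """Regions in which the rule meets both thresholds."""
    return {
        name
        for name, (support, confidence) in region_stats.items()
        if support >= min_sup and confidence >= min_conf
    }


# Invented (support, confidence) of one rule in four candidate regions.
stats = {
    "A": (0.30, 0.95),
    "B": (0.12, 0.90),
    "C": (0.25, 0.70),
    "D": (0.05, 0.60),
}

base = scope(stats, min_sup=0.20, min_conf=0.65)
lower_min_sup = scope(stats, min_sup=0.10, min_conf=0.65)
higher_min_conf = scope(stats, min_sup=0.20, min_conf=0.80)
print(sorted(base), sorted(lower_min_sup), sorted(higher_min_conf))
```

Lowering min_sup admits region B in addition to A and C, while raising min_conf leaves only A, in line with the tendency described for Figure 4.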

4 Applications

We have successfully evaluated our approach in a real-world case study to identify spatial risk patterns from arsenic in the Texas water supply [4]. We list several other interesting applications below.

First, a domain expert can use the approach to determine the scope of an association of their own, one that does not necessarily originate from our regional association rule mining. For example, a domain expert can check whether an arsenic association that is valid in Texas also holds in Bangladesh, a country that has serious arsenic contamination in drinking water. Second, a domain expert may be interested in how the scope changes if a condition in the antecedent of an association rule is dropped. Furthermore, in addition to finding the scope where an association holds, it might be interesting to search for the scope where it does not hold. For example, if we find that high levels of iron are associated with high arsenic concentrations in one region but with low arsenic concentrations in another, this case can then be analyzed further. Our framework can also be used to develop interactive tools that allow domain experts to modify association rules and to visualize how the scope of a rule changes in response to such modifications. Last but not least, the regions obtained from association rule scoping can serve as a source for creating new interesting association rules. For example, we can conduct a study of regions where high levels of iron are associated with high levels of fluoride; we can then determine the regions using our method for the association rule high_iron(X) → high_fluoride(X).

² min_sup (minimum support) and min_conf (minimum confidence) are the thresholds that determine whether an association rule holds in a given region.
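The second application above, dropping a condition from a rule's antecedent and observing how the scope changes, can be sketched as follows. The example reuses the shape of the Region 2 rule (depth and fluoride conditions implying a safe arsenic level), but the well records, the two region names, the thresholds, and the half-open interval boundaries are all assumptions made for illustration; `rule_stats` and `rule_scope` are hypothetical helpers.

```python
from typing import Callable, Dict, List, Set, Tuple

Record = Dict[str, object]            # hypothetical well record
Condition = Callable[[Record], bool]


def rule_stats(
    records: List[Record], antecedent: List[Condition], consequent: Condition
) -> Tuple[float, float]:
    """Support and confidence of antecedent -> consequent over the records."""
    matched = [r for r in records if all(cond(r) for cond in antecedent)]
    hits = [r for r in matched if consequent(r)]
    support = len(hits) / len(records) if records else 0.0
    confidence = len(hits) / len(matched) if matched else 0.0
    return support, confidence


def rule_scope(
    regions: Dict[str, List[Record]],
    antecedent: List[Condition],
    consequent: Condition,
    min_sup: float,
    min_conf: float,
) -> Set[str]:
    """Names of the regions in which the rule meets both thresholds."""
    result = set()
    for name, records in regions.items():
        support, confidence = rule_stats(records, antecedent, consequent)
        if support >= min_sup and confidence >= min_conf:
            result.add(name)
    return result


def shallow(r: Record) -> bool:        # depth(X, 0 - 251.5), assumed half-open
    return 0 <= r["depth"] < 251.5


def low_fluoride(r: Record) -> bool:   # fluoride(X, 0 - 0.085), assumed half-open
    return 0 <= r["fluoride"] < 0.085


def safe(r: Record) -> bool:
    return bool(r["arsenic_safe"])


# Invented records for two invented regions.
regions = {
    "north": [{"depth": 100, "fluoride": 0.05, "arsenic_safe": True}] * 3
    + [{"depth": 100, "fluoride": 0.50, "arsenic_safe": False}] * 2,
    "south": [{"depth": 100, "fluoride": 0.50, "arsenic_safe": True}] * 4
    + [{"depth": 300, "fluoride": 0.05, "arsenic_safe": True}] * 1,
}

full = rule_scope(regions, [shallow, low_fluoride], safe, 0.5, 0.9)
reduced = rule_scope(regions, [shallow], safe, 0.5, 0.9)
print("full rule scope:", sorted(full))
print("scope without fluoride condition:", sorted(reduced))
```

In this toy data, dropping the fluoride condition removes one region from the scope (its confidence falls) and adds another (its support rises), so a shorter antecedent does not simply enlarge the scope. The regions where a rule does not hold are the complement, `set(regions) - full`.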

In summary, it is critical to analyze datasets at different levels of granularity. Our framework utilizes the duality between regional patterns and the regions where the patterns are supported: regions can be used to create novel regional patterns, and regional patterns can be used to create interesting regions.

References

[1] C. Eick, B. Vaezian, D. Jiang, and J. Wang, “Discovering of interesting regions in spatial data sets using supervised cluster,” in PKDD’06, 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2006.

[2] J. Wang, “Region discovery using hierarchical supervised clustering,” Master’s thesis, Computer Science Department, University of Houston, 2006.

[3] W. Ding, C. F. Eick, J. Wang, and X. Yuan, “A framework for regional association rule mining in spatial datasets,” in The 6th IEEE International Conference on Data Mining (ICDM’06), to appear, December 2006.

[4] C. F. Eick, W. Ding, J. Wang, X. Yuan, and S. Khan, “An integrated approach to regional association rule mining and scoping,” submitted to the SIAM Data Mining Conference, Minnesota, April 2007.


Figure 2: Arsenic distribution in Texas. (a) Map of Texas showing arsenic concentration level. (b) Identified interesting regions.

Figure 3: Region - Regional association rule - Scope. Legend: regions are highlighted by a bold border line; scopes are in blue (or light grey).

Figure 4: The scope of a particular rule changes based on the different values of min_sup or min_conf.
