Extracting Regional Knowledge from Spatial Datasets
Christoph F. Eick
Department of Computer Science, University of Houston
1. Motivation: Why is Regional Knowledge Important?
2. Region Discovery Framework
3. A Family of Clustering Algorithms for Region Discovery
4. Case Studies: Extracting Regional Knowledge
• Regional Regression
• Regional Association Rule Mining
• [Regional Models of User Behaviour on the Internet]
• [Co-location Mining]
5. [Analyzing Related Datasets]
6. Summary
Ch. Eick: Extracting Regional Knowledge from Spatial Datasets
Spatial Data Mining
• Definition: Spatial data mining is the process of discovering interesting patterns from large spatial datasets; it organizes by location what is interesting.
• Challenges:
– Information is not uniformly distributed
– Autocorrelation
– Space is continuous
– Complex spatial data types
– Large dataset sizes and many possible patterns
– Patterns exist at different levels of resolution
– Importance of maps as summaries
– Importance of regional knowledge
Why Is Regional Knowledge Important in Spatial Data Mining?
• It has been pointed out in the literature that “whole map statistics are seldom useful”, that “most relationships in spatial data sets are geographically regional, rather than global”, and that “there is no average place on the Earth’s surface” [Goodchild03, Openshaw99].
• Simpson’s Paradox – global models may be inconsistent with regional models [Simpson1951].
• Therefore, it is not surprising that domain experts are mostly interested in discovering hidden patterns at a regional scale rather than a global scale.
Example: Regional Association Rules
[Figure: map regions labelled Rule 1, Rule 2, Rule 3, and Rule 4. Caption as it appears in the source: "Scopes of the 4 Rules in"]
Goal of the Presented Research
Develop and implement an integrated computational framework useful for data analysts and scientists from diverse disciplines for extracting regional knowledge in spatial datasets in a highly automated fashion.
Related Work
• Spatial co-location pattern discovery [Shekhar et al.]
• Spatial association rule mining [Han et al.]
• Localized associations in segments of the basket data [Yu et al.]
• Spatial statistics on hot spot detection [Tay and Brimicombe et al.]
• There is some work on geo-regression techniques (to be discussed later)
• …
Comment: Most work centers on extracting global knowledge from spatial datasets.
Department of Computer Science
Preview: A Framework for Extracting Regional Knowledge from Spatial Datasets
RD-Algorithm
Application 1: Supervised Clustering [EVJW07]
Application 2: Regional Association Rule Mining and Scoping [DEWY06, DEYWN07]
Application 3: Find Interesting Regions with Respect to a Continuous Variable [CRET08]
Application 4: Regional Co-location Mining Involving Continuous Variables [EPWSN08]
Application 5: Find "representative" regions (Sampling)
Application 6: Regional Regression [CE09]
Application 7: Multi-Objective Clustering [JEV09]
Application 8: Change Analysis in Related Datasets [RE09]
Wells in Texas: green = safe well with respect to arsenic; red = unsafe well.
[Figure: two result panels, labelled "=1.01" and "=1.04" in the source]
2. Region Discovery Framework
Region Discovery Framework (2)
• We assume we have spatial or spatio-temporal datasets that have the following structure: (<spatial attributes>; <non-spatial attributes>), e.g. (longitude, latitude, class_variable) or (longitude, latitude, continuous_variable).
• Clustering occurs in the space of the spatial attributes; regions are found in this space.
• The non-spatial attributes are used by the fitness function but neither in distance computations nor by the clustering algorithm itself.
For the remainder of the talk, we view region discovery as a clustering task and assume that regions and clusters are the same.
Region Discovery Framework (3)
The algorithms we currently investigate solve the following problem:
Given:
• A dataset O with a schema R
• A distance function d defined on instances of R
• A fitness function q(X) that evaluates clusterings X = {c_1, …, c_k} as follows:
$$q(X) = \sum_{c \in X} \mathrm{reward}(c) = \sum_{c \in X} i(c) \cdot \mathrm{size}(c)^{\beta} \quad \text{with } \beta > 1$$
Objective: Find c_1, …, c_k ⊆ O such that:
1. c_i ∩ c_j = ∅ if i ≠ j
2. X = {c_1, …, c_k} maximizes q(X)
3. All clusters c_i ∈ X are contiguous (each pair of objects belonging to c_i has to be Delaunay-connected with respect to c_i and to d)
4. c_1 ∪ … ∪ c_k ⊆ O
5. c_1, …, c_k are usually ranked based on the reward each cluster receives, and low-reward clusters are frequently not reported
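The additive, reward-based fitness function can be sketched in a few lines of Python. This is a minimal illustration, not the talk's implementation: the purity-based interestingness function, its threshold, and all function names are hypothetical stand-ins for whatever measure i(c) a domain expert plugs in, and each cluster is represented simply as the list of class labels of its objects.

```python
from collections import Counter


def purity_interestingness(labels, threshold=0.5):
    """Hypothetical i(c): the share of the majority class above a threshold."""
    if not labels:
        return 0.0
    majority_share = Counter(labels).most_common(1)[0][1] / len(labels)
    return max(0.0, majority_share - threshold)


def fitness(clustering, interestingness, beta=1.1):
    """q(X): sum over all clusters c of i(c) * size(c) ** beta."""
    return sum(interestingness(c) * len(c) ** beta for c in clustering)


def ranked_rewards(clustering, interestingness, beta=1.1):
    """Return (reward, cluster) pairs, highest reward first."""
    rewards = [(interestingness(c) * len(c) ** beta, c) for c in clustering]
    return sorted(rewards, key=lambda pair: pair[0], reverse=True)
```

Because the size term is raised to a power above 1, two equally interesting clusters earn more reward when merged than when kept apart, which is how the exponent influences the granularity of the result.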
Measure of Interestingness i(c)
The function i(c) is an interestingness measure for a region c, a quantity based on domain interest to reflect how "newsworthy" the region is.
In our past work, we have designed a suite of measures of interestingness for:
• Supervised Clustering [PKDD06]
• Hot spots and cool spots [ICDM06]
• Scope of regional patterns [SSTDM07]
• Co-location patterns involving variables [PAKDD08, ACM-GIS08]
• High-variance regions involving a continuous variable [PAKDD09]
• Regional Regression [ACM-GIS09]
Example 1: Finding Regional Co-location Patterns in Spatial Data
Objective: Find co-location regions using various clustering algorithms and novel fitness functions.
Applications:
1. Finding regions on planet Mars where shallow and deep ice are co-located, using point and raster datasets. In Figure 1, regions in red have very high co-location and regions in blue have anti co-location.
2. Finding co-location patterns involving chemical concentrations with values on the wings of their statistical distribution in Texas' ground water supply. Figure 2 indicates discovered regions and their associated chemical patterns.
Figure 1: Co-location regions involving deep and shallow ice on Mars
Figure 2: Chemical co-location patterns in Texas Water Supply
Example 2: Regional Regression
Geo-regression approaches: Multiple regression functions are used that vary depending on location.
Regional Regression:
I. Discover regions with strong relationships between dependent and independent variables
II. Construct regional regression functions for each region
III. When predicting the dependent variable of an object, use the regression function associated with the location of the object
Challenges for Region Discovery
1. Recall and precision with respect to the discovered regions should be high
2. Definition of measures of interestingness and of corresponding parameterized reward-based fitness functions that capture “what domain experts find interesting in spatial datasets”
3. Detection of regions at different levels of granularity (from very local to almost global patterns)
4. Detection of regions of arbitrary shapes
5. Necessity to cope with very large datasets
6. Regions should be properly ranked by relevance (reward)
7. Design and implementation of clustering algorithms that are suitable to address challenges 1, 3, 4, 5 and 6.
Clustering with Plug-in Fitness Functions
In the last 5 years, my research group developed families of clustering algorithms that find contiguous spatial clusters by maximizing a plug-in fitness function.
This work is motivated by a mismatch between evaluation measures of traditional clustering algorithms (such as cluster compactness) and what domain experts are actually looking for.
3. Current Suite of Clustering Algorithms
• Representative-based: SCEC, SRIDHCR, SPAM, CLEVER
• Grid-based: SCMRG, SCHG
• Agglomerative: MOSAIC, SCAH
• Density-based: SCDE, DCONTOUR
Representative-based Clustering
[Figure: scatter plot over Attribute1 and Attribute2 with labels 1 to 4]
Objective: Find a set of objects O_R such that the clustering X obtained by using the objects in O_R as representatives optimizes q(X) (minimizing an error-based objective such as the one used by K-means, or maximizing a reward-based fitness function as in this framework).
Characteristic: clusters are formed by assigning objects to the closest representative.
Popular Algorithms: K-means, K-medoids, CLEVER, …
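The assignment step that defines a representative-based clustering can be sketched as follows. The function name and the use of Euclidean distance are assumptions made for illustration; any distance function d could be substituted.

```python
import math


def assign_to_representatives(points, representatives):
    """Form one cluster per representative by nearest-representative assignment."""
    clusters = [[] for _ in representatives]
    for point in points:
        # Index of the representative closest to this point (Euclidean distance).
        nearest = min(
            range(len(representatives)),
            key=lambda j: math.dist(point, representatives[j]),
        )
        clusters[nearest].append(point)
    return clusters
```

Algorithms in this family differ mainly in how they search for the set of representatives; the assignment itself stays the same.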
CLEVER [ACM-GIS08]
• Is a representative-based clustering algorithm, similar to PAM.
• Searches a variable number of clusters and uses larger neighborhood sizes to battle premature termination; uses randomized hill climbing and adaptive sampling to reduce complexity.
• In general, new clusters are generated in the neighborhood of the current solution by inserting, deleting, and replacing representatives.
• Searches for the optimal number of clusters.
Rinsurongkawong & Eick: Correspondence Clustering, PAKDD'10
Advantages of Grid-based Clustering Algorithms
• Fast: no distance computations
• Clustering is performed on summaries and not on individual objects; complexity is usually O(#populated-grid-cells) and not O(#objects)
• Easy to determine which clusters are neighboring
• Shapes are limited to unions of grid cells
Ideas SCMRG (Divisive, Multi-Resolution Grids)
Cell Processing Strategy
1. If a cell receives a reward that is larger than the sum of the rewards of its descendants: return that cell.
2. If a cell and its descendants do not receive any reward: prune.
3. Otherwise, process the children of the cell (drill down).
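A rough sketch of this cell processing strategy is given below. It rests on several assumptions that are not in the slides: the grid hierarchy is available as a pre-built tree, the comparison in rule 1 is made against the rewards of the cell's children in that hierarchy, and the `Cell` class and function names are hypothetical.

```python
from collections import deque


class Cell:
    """Hypothetical grid cell: its own reward plus the cells it splits into."""

    def __init__(self, reward, children=()):
        self.reward = reward
        self.children = list(children)


def subtree_has_reward(cell):
    """True if the cell or any cell below it receives a reward."""
    return cell.reward > 0 or any(subtree_has_reward(ch) for ch in cell.children)


def process_cells(root):
    """Return the cells selected as clusters by the three rules."""
    selected, queue = [], deque([root])
    while queue:
        cell = queue.popleft()
        child_sum = sum(ch.reward for ch in cell.children)
        if cell.reward > 0 and cell.reward > child_sum:
            selected.append(cell)          # rule 1: return the cell
        elif not subtree_has_reward(cell):
            continue                       # rule 2: prune
        else:
            queue.extend(cell.children)    # rule 3: drill down
    return selected
```

A real implementation would create child cells lazily by splitting a cell only when rule 3 applies, rather than walking a tree that already exists.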
Code SCMRG
4. Case Studies: Regional Knowledge Extraction
4.1 Regional Regression
4.2 Regional Association Rule Mining & Scoping
4.3 Association-List Based Discrepancy Mining of User Behavior
4.4 Co-location Mining (to be skipped!)
4.1 REG^2: A Framework of Regional Regression
Motivation: Regression functions vary spatially; they are not constant over space.
Goal: To discover regions with strong relationships between dependent and independent variables and extract their regional regression functions.
• Clustering algorithms with plug-in fitness functions are employed to find such regions; the employed fitness functions reward regions with a low generalization error.
• Various schemes are explored to estimate the generalization error: example weighting, regularization, penalizing model complexity, using validation sets, …

Dataset   AIC Fitness   VAL Fitness   RegVAL Fitness   WAIC Fitness
Arsenic   5.01%         11.19%        3.58%            13.18%
Boston    29.80%        35.69%        38.98%           36.60%

[Figure panels: Discovered Regions and Regression Functions; REG^2 Outperforms Other Models in SSE_TR; Regularization Improves Prediction Accuracy]
Motivation: Regional Knowledge & Regression
• 1st law of geography: "Everything is related to everything else but nearby things are more related than distant things" (Tobler)
• Coefficient estimates in geo-referenced datasets vary spatially, so we need regression methods that discover regional coefficient estimates capturing the underlying structure of the data.
• Using human-made boundaries (zip codes etc.) is not a good idea since spatial variation is rarely rectangular.
Motivation: Other Geo-Regression Analysis Methods
Regression Trees
• Data is split in a top-down approach using a greedy algorithm
• Discovers only rectangular shapes
Geographically Weighted Regression (GWR)
• An instance-based, local spatial statistical technique used to analyze spatial non-stationarity
• Generates a separate regression equation for a set of observation points, determined using a grid or kernel
• The weight assigned to each observation is based on a distance decay function centered on the observation
Motivation: Example 1: Why Do We Need Regional Knowledge?
Regression result: a positive linear regression line (Arsenic increases with increasing Fluoride concentration).
[Figure: scatter plot of Arsenic (y-axis) against Fluoride (x-axis)]
Motivation: Example 1: Why Do We Need Regional Knowledge?
A negative linear regression line in both locations (Arsenic decreases with increasing Fluoride concentration): a reflection of Simpson's paradox.
[Figure: scatter plot of Arsenic (y-axis) against Fluoride (x-axis), split into Location 1 and Location 2]
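The effect can be reproduced with a few made-up numbers. The sketch below uses invented fluoride and arsenic values (not the talk's data) and a hypothetical helper `ols_slope` that computes the slope of a simple least-squares line.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Invented (fluoride, arsenic) measurements for two locations.
location_1 = ([1, 2, 3], [5, 4, 3])
location_2 = ([6, 7, 8], [10, 9, 8])

slope_1 = ols_slope(*location_1)   # negative within location 1
slope_2 = ols_slope(*location_2)   # negative within location 2
slope_global = ols_slope(location_1[0] + location_2[0],
                         location_1[1] + location_2[1])  # positive overall
```

Within each location arsenic falls as fluoride rises, yet the pooled data yield a rising line because location 2 sits at higher values of both variables.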
Motivation: Example 2: Houston House Price Estimate
Dependent variable: House_Price
Independent variables: noOfRooms, squareFootage, yearBuilt, havePool, attachedGarage, etc.
Motivation: Example 2: Houston House Price Estimate
• Global regression (OLS) produces a single global model: coefficient estimates, R² value, error, etc.
• This model assumes all areas have the same coefficients.
• E.g. attribute havePool has a coefficient of +9,000 (i.e. having a pool adds about $9,000 to a house price).
• In reality this varies, e.g. between a $100K house and a $500K house, or between different zip codes or locations.
• Having a pool in a house in a luxury area (~$40K) is very different from having a pool in a house in the suburbs (~$5K).
Motivation: Example 2: Houston House Price Estimate
[Figure: houses A and B, with price labels $180,000 and $350,000]
• Houses A and B have very similar characteristics.
• OLS produces single parameter estimates for predictor variables like noOfRooms, squareFootage, yearBuilt, etc.
Motivation: Example 2: Houston House Price Estimate
• If we use zip codes as regions, they are in the same region.
• If we use a grid structure, they are in different regions, but some houses similar to B (lake view) are in the same region as A, and this will affect the coefficient estimates.
• More importantly, the houses around the U-shaped lake show a similar pattern and should be in the same region; otherwise we miss important information.
Motivation: Our Approach: Capture the True Pattern Structure!
We need to discover arbitrarily shaped regions, and not rely on some a priori defined artificial boundaries.
Problems to be solved:
1. Find regions whose objects have a strong relationship between the dependent variable and the independent variables
2. Extract regional regression functions
3. Develop a method to select which regression function to use for a new object to be predicted
Methodology: The REGional REGression Framework (REG^2)
Employs a two-phased approach:
Phase I: Discovering regions (along with regional coefficient estimates) using a clustering algorithm that maximizes a regression-based (R-sq or AIC) fitness function
Phase II: Applying techniques to select the correct regional regression function and improve prediction for unseen data
Methodology: So, What Can We Use as Interestingness?
• The natural first candidate is adjusted R². R-sq is a measure of the extent to which the total variation of the dependent variable is explained by the model.
• R-sq alone is not a good measure to assess the goodness of fit: it only deals with the bias of the model and ignores the complexity of the model, which leads to overfitting.
• There are better model selection criteria that balance the tradeoff between bias and variance.
Methodology: Fitness Function Candidates
• R²-based fitness functions
• Fitness functions that consider model complexity in addition to goodness of fit, such as AIC or BIC
• Regularization approaches that penalize large coefficients
• Fitness functions that employ validation sets, which provide a better measure of the generalization error (the model's performance on unseen examples)
• An improvement of the previous approach that additionally considers training set/test set similarity
• Combinations of the approaches mentioned above
Methodology: R-sq Based Fitness Function
Given
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad \text{and} \quad SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
the interestingness is:
$$i_{Rsq}(r) = \begin{cases} 1 - \dfrac{SSE}{SST} & \text{if } n \ge minRegSize \\ 0 & \text{if } n < minRegSize \end{cases}$$
To battle the tendency towards having small regions with high correlation (false correlation), we:
• used a scaled version of the fitness function
• employed a parameter to limit the minimum size of a region
The Rsq-based fitness function then becomes:
$$q_{Rsq}(R) = \sum_{j=1}^{k} i_{Rsq}(r_j) \cdot size(r_j)^{\beta}$$
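The R-sq based interestingness and fitness can be sketched as follows, reading the interestingness as 1 - SSE/SST for regions with at least minRegSize objects and 0 otherwise. The exact boundary condition, the use of the exponent β in the sum, and all function names are assumptions. Each region is represented here by its observed values and the predictions of its regional regression function, and the observed values are assumed not to be all identical, so that SST is nonzero.

```python
def r_squared(ys, predictions):
    """1 - SSE / SST for observed values ys and model predictions."""
    mean_y = sum(ys) / len(ys)
    sse = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - sse / sst


def i_rsq(ys, predictions, min_reg_size):
    """Interestingness of a region; regions that are too small earn nothing."""
    if len(ys) < min_reg_size:
        return 0.0
    return r_squared(ys, predictions)


def q_rsq(regions, min_reg_size, beta=1.1):
    """Sum of i_rsq(r) * size(r) ** beta over (ys, predictions) pairs."""
    return sum(
        i_rsq(ys, preds, min_reg_size) * len(ys) ** beta
        for ys, preds in regions
    )
```

The size threshold prevents tiny regions, where a near-perfect fit is easy to obtain by chance, from dominating the fitness.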
Methodology: AIC Based Fitness Function (AICFitness)
We prefer Akaike's Information Criterion (AIC) because:
• it takes model complexity (number of observations etc.) into consideration more effectively
• AIC provides a balance between bias and variance, and is estimated using the following formula, where n is the number of observations and k the number of model parameters:
$$AIC = 2k + n\left[\ln\left(\frac{2\pi \cdot SSE}{n}\right) + 1\right]$$
• Variations of AIC are available, including AICu [McQuarrie], which is used for small datasets and is therefore a good fit for our small regions:
$$AIC_u = \ln\left(\frac{SSE}{n-k}\right) + \frac{n+k}{n-k-2}$$
Methodology: AIC Based Fitness Function (AICFitness)
AIC-based interestingness i_AIC(r):
$$i_{AIC}(r) = \begin{cases} \dfrac{1}{2k + n\left[\ln\left(\frac{2\pi \cdot SSE}{n}\right) + 1\right]} & \text{if } n > th \\[2ex] \dfrac{1}{\ln\left(\frac{SSE}{n-k}\right) + \frac{n+k}{n-k-2}} & \text{if } n \le th \end{cases}$$
The AICFitness function then becomes:
$$q_{AIC}(R) = \sum_{j=1}^{k} i_{AIC}(r_j) \cdot size(r_j)^{\beta}$$
The AICFitness function repeatedly applies regression analysis during the search for the optimal set of regions, which overall provides the best (minimum) AIC values.
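A small sketch of the AIC-based interestingness follows. It takes AIC = 2k + n[ln(2π·SSE/n) + 1] for regions larger than a size threshold and AICu = ln(SSE/(n-k)) + (n+k)/(n-k-2) for smaller ones, and uses the reciprocal as interestingness so that lower AIC values earn more reward. Which side of the threshold gets which criterion, and whether the boundary is inclusive, are assumptions here, as are the function names.

```python
import math


def aic(sse, n, k):
    """AIC from the sum of squared errors, n observations, k parameters."""
    return 2 * k + n * (math.log(2 * math.pi * sse / n) + 1)


def aic_u(sse, n, k):
    """Small-sample variant AICu; requires n - k - 2 > 0."""
    return math.log(sse / (n - k)) + (n + k) / (n - k - 2)


def i_aic(sse, n, k, th):
    """Reciprocal of the criterion value: AIC above the threshold, AICu otherwise.

    The sketch assumes the criterion value is positive; a zero or negative
    value would need separate handling.
    """
    score = aic(sse, n, k) if n > th else aic_u(sse, n, k)
    return 1.0 / score
```

Regression has to be re-run for every candidate region to obtain its SSE, which is why this fitness function is comparatively expensive to evaluate during the search.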
Methodology: Controlling Regional Granularity
• β is used to control the number of regions to be discovered, and thus the overall model complexity.
• Finding a good value for β means striking the right balance between underfitting and overfitting for a given dataset.
• Small values for small number of regions; large values for large number of regions
Reminder: Region Discovery Framework fitness function:
$$q(X) = \sum_{c \in X} \mathrm{reward}(c) = \sum_{c \in X} i(c) \cdot \mathrm{size}(c)^{\beta}$$
Experiments & Results: Generalization Error Improvement (SSE_TE)
Generalization Error Results - Boston Housing Data

β     SSE_TE (GL)   SSE_TE (REG2)   SSE Improvement   % of objects with better prediction
1.1   17,182        12,566          27%               72%
1.7   17,182        14,799          26%               65%

• Discovered regions and their regional regression coefficients give better predictions than the global model.
• Some regions with very high error reduce the overall accuracy, but there is still a 27% improvement (future work item).
• The relationship between variables varies spatially.
Experiments & Results
Generalization Error Results - Arsenic Data

β      SSE_TE (GL)   SSE_TE (REG2)   SSE Improvement   % of objects with better prediction
1.1    102,578       98,879          3.6%              57%
1.25   102,578       92,200          8.01%             61%

• Regional regression coefficients give only slightly better predictions.
• Some of this is due to external factors, e.g. toxic waste, power plants (analyzed previously using the PCAFitness approach, MLDM09).
• Some regions with very high error reduce the overall accuracy.
• Still, around 60% of objects are better predicted.
• Open for improvement; new fitness functions (next).
4.2 A Framework for Regional Association Rule Mining and Scoping [GeoInformatica10]
Step 1: Region Discovery (arsenic hot spots)
Step 2: Regional Association Rule Mining (an association rule a is discovered)
Step 3: Regional Association Rule Scoping (scope of the rule a)
Arsenic Hot Spots and Cool Spots
Step 1: Region Discovery
Step 2: Regional Association Rule Mining
Step 3: Regional Association Rule Scoping
Example Regional Association Rules
Step 1: Region Discovery
Step 2: Regional Association Rule Mining
Step 3: Regional Association Rule Scoping
[Figure: map regions labelled rule 1, rule 2, rule 3, and rule 4]
Region vs. Scope
• The scope of an association rule indicates how regional or global a local pattern is.
• The region where an association rule originated is a subset of the scope where the association rule holds.
Association Rule Scope Discovery Framework
Let a be an association rule and r be a region; conf(a,r) denotes the confidence of a in region r, and sup(a,r) denotes the support of a in r.
Goal: Find all regions for which an association rule a satisfies its minimum support and confidence thresholds; regions in which a's confidence and support are significantly higher than the min-support and min-conf thresholds receive higher rewards.
Association Rule Scope Discovery Methodology: For each rule a that was discovered for region r', we run our region discovery algorithm that defines the interestingness of a region r_i with respect to an association rule a as follows:
Remarks:
• Typically 1=2=0.9; =2 (confidence increase is more important than support increase)
• Obviously the region r' from which rule a originated, or some variation of it, should be "rediscovered" when determining the scope of a.
Regional Association Rule Scoping
Ogallala Aquifer
Gulf Coast Aquifer
Fine Tuning Confidence and Support
We can fine tune the measure of interestingness for association rule scoping by changing the minimum confidence and support thresholds.
Regional Models for User Behaviour on the Internet
Problem: We are interested in predicting a performance variable based on some performance context that is described using a set of binary variables.
Example: We try to predict whether a user clicks on an ad based on the keywords that occur in the ad.
Our angle: As usual, we are interested in extracting knowledge concerning the "regional variation of clicking behavior".
Association List Based Discrepancy Mining (ALDM)
Given a set of keywords with an associated performance measure for a group G of transactions, we create association lists; for example:
G:=((A 0.002 17 2)(B 0.001 222 1))
that models user behavior for group G on the internet, such as clicking of ads in Texas.
Meaning, for the group G analyzed:
• If A was present, the performance variable has an average value of 0.002; A is present in 17 objects; the average value of the performance measure if A is present is twice as high as its average value for all transactions.
• If B was present, the performance variable has an average value of 0.001; B is present in 222 objects; the average value of the performance measure if B is present is the same as the average value for all transactions.
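One way to derive such an association list from raw transactions is sketched below. The input format (a keyword set plus a performance value per transaction) and the function name are assumptions made for illustration; each resulting entry holds the keyword, the average performance when the keyword is present, the number of transactions containing it, and the ratio of that average to the overall average. The overall average is assumed to be nonzero.

```python
def association_list(transactions):
    """Build (keyword, avg_if_present, count, ratio_to_overall) entries.

    transactions: list of (set_of_keywords, performance_value) pairs.
    """
    overall = sum(value for _, value in transactions) / len(transactions)
    stats = {}
    for keywords, value in transactions:
        for kw in keywords:
            total, count = stats.get(kw, (0.0, 0))
            stats[kw] = (total + value, count + 1)
    entries = []
    for kw in sorted(stats):
        total, count = stats[kw]
        average = total / count
        entries.append((kw, average, count, average / overall))
    return entries
```

For instance, with four transactions in which A occurs only in clicked ads and B occurs in one clicked and one unclicked ad, A gets a ratio of 2 and B a ratio of 1, mirroring the pattern of the example list.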
Research Goals ALDM
1. Develop algorithms that generate groups and association lists that characterize those groups
2. Propose similarity measures for association lists to compare different groups
3. Compare different regional groups with respect to discrepancies of user behavior to:
• Extract regional knowledge from the groups
• Extract discrepancy knowledge that describes how the behavior of different users differs in different regions and how regional behavior differs from global behavior
4. Develop regional prediction techniques:
• By using knowledge that has been obtained in step 2
• By generalizing our regional prediction work, presented in part 4.1
5. Methodologies and Tools to Analyze Related Datasets
Subtopics:
• Disparity Analysis/Emergent Pattern Discovery ("how do two groups differ with respect to their patterns?") [SDE10]
• Change Analysis ("what is new/different?") [CVET09]
• Correspondence Clustering ("mining interesting relationships between two or more datasets") [RE10]
• Meta Clustering ("cluster cluster models of multiple datasets")
• Analyzing Relationships between Polygonal Cluster Models
Example: Analyze Changes with Respect to Regions of High Variance of Earthquake Depth.
$$\mathrm{Novelty}(r') = r' - (r_1 \cup \dots \cup r_k)$$
[Figure: emerging regions based on the novelty change predicate, shown for Time 1 and Time 2]
6. Summary
1. A framework for region discovery that relies on additive, reward-based fitness functions and views region discovery as a clustering problem has been introduced.
2. Families of clustering algorithms and families of measures of interestingness are provided that form the core of the framework.
3. Evidence concerning the usefulness of the framework for regional association rule mining, regional regression, and co-location mining has been presented.
4. The special challenges in designing clustering algorithms for region discovery have been identified.
5. The ultimate vision of this research is the development of region discovery engines that assist data analysts and scientists in finding interesting regions in spatial datasets.
Other Contributors to the Work Presented Today
Graduated PhD Students:
• Wei Ding (Regional Association Rule Mining, Grid-based Clustering)
• Rachsuda Jiamthapthaksin (Agglomerative Clustering, Multi-Run Clustering)
• Oner Ulvi Celepcikay (Regional Regression)
Current PhD Students:
• Chun-sheng Chen (Density-based Clustering, Regional Knowledge Extraction from Ads)
• Vadeerat Rinsurongkawong (Analyzing Multiple Datasets, Change Analysis)
Graduated Master Students:
• Rachana Parmar (CLEVER, Co-location Mining)
• Seungchan Lee (Grid-based Clustering, Agglomerative Clustering)
• Dan Jiang (Density-based Clustering, Co-location Mining)
• Jing Wang (Grid-based and Representative-based Clustering)
Software Platform and Software Design:
• Abraham Bagherjeiran (PhD student UH, now at Yahoo!)
Domain Experts:
• Tomasz Stepinski (Lunar and Planetary Institute, Houston, Texas)
• J.-P. Nicot (Bureau of Economic Geology, UT, Austin)
• Michael Two (College of Optometry, University of Houston)
CLEVER Pseudo Code
Inputs: Dataset O, k', neighborhood-size, p, p'
Outputs: Clustering X, fitness q
Algorithm:
1. Create a current solution by randomly selecting k' representatives from O.
2. Create p neighbors of the current solution randomly using the given neighborhood definition.
3. If the best neighbor improves the fitness q, it becomes the current solution. Go back to step 2.
4. If the fitness does not improve, the solution neighborhood is re-sampled by generating p' more neighbors. If re-sampling does not lead to a better solution, terminate, returning the current solution; otherwise, go back to step 2, replacing the current solution by the best solution found by re-sampling.
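The four steps can be turned into a compact Python sketch of randomized hill climbing with re-sampling. Several simplifications are assumptions rather than part of the original: a neighbor is created by applying a single insert, delete, or replace operator (the neighborhood-size parameter is not modelled), distance is Euclidean, p and p' are at least 1, and `fitness` is any plug-in function that maps a clustering (a list of lists of point indices) to a number.

```python
import math
import random


def assign(points, reps):
    """Cluster point indices by their nearest representative (an index into points)."""
    clusters = {r: [] for r in reps}
    for i, point in enumerate(points):
        nearest = min(reps, key=lambda r: math.dist(point, points[r]))
        clusters[nearest].append(i)
    return [c for c in clusters.values() if c]


def random_neighbor(reps, n_points, rng):
    """Apply one randomly chosen operator: insert, delete, or replace."""
    reps = list(reps)
    non_reps = [i for i in range(n_points) if i not in reps]
    op = rng.choice(["insert", "delete", "replace"])
    if op == "insert" and non_reps:
        reps.append(rng.choice(non_reps))
    elif op == "delete" and len(reps) > 1:
        reps.remove(rng.choice(reps))
    elif non_reps:
        reps[rng.randrange(len(reps))] = rng.choice(non_reps)
    return reps


def clever(points, fitness, k_prime, p, p_prime, seed=0):
    """Return (clustering, fitness value) found by hill climbing."""
    rng = random.Random(seed)
    current = rng.sample(range(len(points)), k_prime)            # step 1
    best_q = fitness(assign(points, current))
    while True:
        improved = False
        for budget in (p, p_prime):       # step 2, then re-sampling (step 4)
            neighbors = [random_neighbor(current, len(points), rng)
                         for _ in range(budget)]
            scored = [(fitness(assign(points, nb)), nb) for nb in neighbors]
            q, best_nb = max(scored, key=lambda pair: pair[0])
            if q > best_q:                                        # step 3
                current, best_q, improved = best_nb, q, True
                break
        if not improved:
            return assign(points, current), best_q
```

Since a neighbor is accepted only when it strictly improves the fitness and there are finitely many sets of representatives, the loop always terminates, although possibly in a local optimum.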