A spatial neighborhood graph approach to Regional Co-location pattern discovery:
summary of results
Pradeep Mohan*, Shashi Shekhar, Zhe Jiang
University of Minnesota, Twin-Cities, MN
James A. Shine, James P. Rogers, Nicole Wayant
US Army ERDC, Topographic Engineering Center, Alexandria, VA
*Contact: [email protected]
Outline
Motivation
Problem Formulation
Computational Approach
Conclusions and Future work
Motivation: Spatial Heterogeneity, the second law of Geography
Spatial Heterogeneity (Goodchild, 2003; Goodchild, 2004)
Expectations vary across space. Global models may not explain locally observed phenomena, so place-based analysis is needed.
Spatial Heterogeneity in Retail
Traditional data mining: Which pairs of items sell together frequently? Answer: Diaper in transaction ⇒ Beer in transaction.
Is this association true everywhere? Answer: Blue-collar neighborhoods.
Global Spatial Data Mining: Global Co-location Patterns
Which pairs of spatial features are located together frequently?
Example: Gas stations and convenience stores.
Our Focus: Where do certain pairs of spatial features co-locate frequently?
Example: Assaults happen frequently around downtown bars.
Applications
Crime analysis
Localizing frequent crime patterns. Opportunities for crime vary across space!
Predicting localities of the next crime.
Question: Do downtown bars lead to assaults more frequently?
Public Health
Localizing elevated disease risks around putative sources (e.g. mining areas).
Question: Where does high asbestos concentration often lead to lung cancer?
Ecology
Localizing symbiotic relationships between different species of plants or animals.
Question: Where are plover birds frequently found in the vicinity of a crocodile?
Image credits: www.startribune.com, www.amazon.com
Regional co-location patterns (RCP)
Definition: Subsets of spatial features that are frequently located together in certain regions of a study area.
Input: Spatial features, crime reports.
Output: RCPs, e.g. < (Bar, Assaults), Downtown >.
Outline
Motivation
Problem Formulation
  Basic Concepts
  Problem Statement
  Challenges
  Related Work
Computational Approach
Conclusions and Future work
Basic Concepts: Neighborhoods
Neighborhood Graph
Given: A spatial neighbor relation (spatial neighborhood size).
Nodes: Individual event instances.
Edges: Present if the neighbor relation is satisfied.
Based on the event-centric model (Huang, 2004).
Prevalence Locality
Subsets of the spatial framework containing instances of a pattern.
Simple representation to visualize: convex hull. Other representations are possible.
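The neighborhood graph described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes a Euclidean distance threshold as the neighbor relation, and the function name `neighborhood_graph`, the instance ids and the coordinates are all hypothetical.

```python
from itertools import combinations
from math import dist


def neighborhood_graph(instances, radius):
    """Return an adjacency map linking instances that lie within `radius`.

    `instances` maps an instance id (e.g. "A1") to its (x, y) coordinates.
    The distance threshold stands in for the spatial neighborhood size.
    """
    graph = {name: set() for name in instances}
    for (u, pu), (v, pv) in combinations(instances.items(), 2):
        if dist(pu, pv) <= radius:  # neighbor relation satisfied: add an edge
            graph[u].add(v)
            graph[v].add(u)
    return graph


# Hypothetical event instances: feature letter plus a running number.
points = {"A1": (0.0, 0.0), "B1": (0.5, 0.0), "C1": (0.5, 0.6), "B2": (5.0, 5.0)}
graph = neighborhood_graph(points, radius=1.0)
```

Here A1, B1 and C1 are pairwise within the radius and form a triangle, while B2 has no neighbors. The pairwise loop is quadratic in the number of instances; a spatial index would be the usual replacement for larger datasets.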
Basic Concepts: Quantifying regional interestingness
Regional Participation Ratio (RPR)
Conditional probability of observing a pattern instance within a locality given an instance of a feature within that locality.
Example:
\[ RPR(\langle \{ABC\}, PL_2 \rangle, B) = \frac{2}{6} \]
\[ RPR(\langle \{ABC\}, PL_2 \rangle, C) = \frac{1}{4} \]
Regional Participation Index (RPI)
Quantifies the local fraction participating in a relationship.
Example (the first term, 2/4, is the participation ratio of the remaining feature, A):
\[ RPI(\langle \{ABC\}, PL_2 \rangle) = \min\left\{ \frac{2}{4}, \frac{2}{6}, \frac{1}{4} \right\} = \frac{1}{4} \]
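The RPI arithmetic in the example can be checked with a short Python sketch. The ratios are the ones on the slide for <{ABC}, PL2>; the value for A is read from the first term of the min, and the function name is an illustrative choice, not taken from the paper.

```python
from fractions import Fraction


def regional_participation_index(ratios):
    """RPI is the smallest regional participation ratio among the features."""
    return min(ratios.values())


# Participation ratios of each feature in <{ABC}, PL2>, as on the slide.
rpr = {"A": Fraction(2, 4), "B": Fraction(2, 6), "C": Fraction(1, 4)}
rpi = regional_participation_index(rpr)
print(rpi)  # 1/4
```

With the example threshold of 0.25 used elsewhere in the deck, this pattern would sit exactly at the threshold and so be reported as prevalent.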
Detailed Statement
Given:
  A spatial framework.
  A collection of boolean spatial event types and their instances.
  A minimum interestingness threshold, Pθ.
  A symmetric and transitive neighbor relation R (e.g. based on spatial neighborhood size).
Find: All RCPs with prevalence >= Pθ.
Objective: Minimize computational cost.
Constraints:
  (i) Spatial framework is heterogeneous.
  (ii) Interest measure captures spatial heterogeneity.
  (iii) Completeness: All prevalent RCPs are reported.
  (iv) Correctness: Only prevalent RCPs are reported.
Example parameters: Prevalence threshold = 0.25; spatial neighborhood size = 1 mile.
Challenges
Conflicting requirements: The interest measure must capture spatial heterogeneity while supporting scalable algorithms.
Exponential search space: Candidate pattern set cardinality is exponential in the number of event types.
Illustration: figure with the labels Computational Scalability, Statistics Rigor and Spatial Data Mining (e.g. RCP).
Challenges: Exponential search space
Illustration: lattice of candidate patterns over event types A, B and C, with levels {NULL}; A, B, C; AB, AC, BC; ABC.
Growth of the candidate pattern set with the number of event types n:
n    # Patterns    O(2^n)
3    4             k1*2^3 (k1 > 0)
4    11            k2*2^4 (k2 > 0)
5    26            k3*2^5 (k3 > 0)
6    57            k4*2^6 (k4 > 0)
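The pattern counts in the table (4, 11, 26, 57) agree with counting every subset of at least two event types, which gives 2^n - n - 1. The sketch below checks this by brute force; the helper names are illustrative, and the closed form is an observation about the table rather than a formula stated on the slide.

```python
from itertools import combinations


def candidate_patterns(event_types):
    """Yield every subset of the event types that has at least two members."""
    for size in range(2, len(event_types) + 1):
        yield from combinations(event_types, size)


def count_candidates(n):
    """All 2^n subsets, minus the empty set and the n singletons."""
    return 2**n - n - 1


# Enumerate explicitly for n = 3..6 and compare against the closed form.
counts = {n: len(list(candidate_patterns("ABCDEF"[:n]))) for n in (3, 4, 5, 6)}
# counts == {3: 4, 4: 11, 5: 26, 6: 57}
```

Each candidate pattern may also have several prevalence localities, so the number of RCPs to examine can exceed these counts.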
Contributions
Regional Co-location Patterns: Neighborhood-based formulation.
Interest Measure: Captures the local fraction of events participating in patterns. Shows attractive computational properties and honors spatial heterogeneity.
Computational Approach: Computational structure (pattern space enumeration). Performance enhancement (maximal-locality-based pruning strategies).
Experimental Evaluation: Performance evaluation using real datasets (Lincoln, NE). Real-world case study.
Related Work
Approaches for regional co-location pattern discovery:
  Zoning based (Celik et al., 2007)
  Fitness function clustering (Eick et al., 2008)
  Spatial neighborhood based (our work)
Zoning based and fitness function clustering:
  Reports one pattern per interesting region based on a criterion (e.g. max).
  Computational structure and pruning strategies not explored.
  Clustering is based on real-valued attributes.
Outline
Motivation
Problem Formulation
Computational Approach
  Pattern Space Enumeration
  Performance Tuning
  Experimental Evaluation
Conclusions and Future work
Computational Approach
Key Idea: Enumerate the entire pattern space. Examine each pattern and prune. Expensive!
Steps: Compute neighborhoods. Identify candidate RCP instances.
Figure: enumeration lattice rooted at {Null} over event types A, B and C, with prevalence threshold = 0.25. Candidate RCPs shown:
<{AB},PL1({AB})>, <{AB},PL2({AB})>, <{AB},PL3({AB})>
<{AC},PL1({AC})>, <{AC},PL2({AC})>, <{AC},PL3({AC})>
<{BC},PL1({BC})>, <{BC},PL2({BC})>, <{BC},PL3({BC})>, <{BC},PL4({BC})>
<{ABC},PL1({ABC})>, <{ABC},PL2({ABC})>, <{ABC},PL3({ABC})>
Each candidate is annotated with its RPI (0.16, 0.25 or 0.33) and marked as an accepted RCP (✔) or a pruned RCP (✕): the 8 candidates with an RPI of 0.25 or 0.33 are accepted and the 5 with an RPI of 0.16 are pruned.
Performance Tuning: Key Ideas
Key Idea
The interest measure shows special pruning properties in certain subsets of the spatial framework.
Maximal Locality
Collection of connected instances.
Key Observations
Maximal localities are mutually disjoint.
A maximal locality contains several RCPs.
Key Properties
RPI shows the anti-monotonicity property within maximal localities: pruning a co-location, {AB}, prunes all its supersets (e.g. {ABC}, {ABCD}, etc.).
RPI within a maximal locality is an upper bound on the RPI of its constituent prevalence localities.
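One way to read "collection of connected instances" is as a connected component of the neighborhood graph, which also explains why maximal localities are mutually disjoint. The sketch below takes that reading as an assumption; the function name and the toy adjacency map are hypothetical, and whether an isolated instance counts as a maximal locality is not stated on the slide.

```python
from collections import deque


def maximal_localities(graph):
    """Split an adjacency map into its connected components via breadth-first search."""
    seen, localities = set(), []
    for start in graph:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        localities.append(component)
    return localities


# Hypothetical neighborhood graph: instance id -> set of neighboring instance ids.
graph = {
    "A1": {"B1"}, "B1": {"A1", "C1"}, "C1": {"B1"},
    "A2": {"C2"}, "C2": {"A2"},
    "B2": set(),
}
localities = maximal_localities(graph)
```

Every instance lands in exactly one component, so patterns can be evaluated one locality at a time without double counting.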
Performance Tuning
Step: Compute maximal localities.
Figure: lattice rooted at {Null} over event types A, B and C, evaluated within maximal localities ML1, ML2 and ML3, with prevalence threshold = 0.25. Annotations shown:
{AB}, 0.167 and {BC}, 0.167: pruned (✕). No RCP is reported for them, due to the upper bound property of RPI.
{ABC}: pruned automatically, due to the anti-monotonicity of RPI.
{AC}, 0.25: kept, with <{AC},PL1({AC})>, 0.25.
{AB}, 0.25; {BC}, 0.33; {AC}, 0.25: kept at the maximal locality level.
<{BC},PL3({BC})>, 0.167 and <{BC},PL4({BC})>, 0.167: pruned (✕).
Completeness: Pruning a pattern within a maximal locality does not prune any valid RCPs.
Correctness: Accepting a pattern involves additional checks so that only prevalent RCPs are reported.
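The effect of anti-monotonicity inside one maximal locality can be sketched as a level-wise search that never evaluates a superset of a pruned pattern. The Apriori-style candidate generation used here is a stand-in chosen for illustration, not necessarily the authors' algorithm. `ml_rpi` is a hypothetical callable returning a pattern's RPI over the whole maximal locality, and the scores are the {AB}, {BC} and {AC} values from the slide.

```python
from itertools import combinations


def prune_within_locality(event_types, ml_rpi, threshold):
    """Return (surviving patterns, evaluated patterns) for one maximal locality."""
    survivors, evaluated = [], []
    level = [frozenset(pair) for pair in combinations(sorted(event_types), 2)]
    while level:
        kept = set()
        for pattern in level:
            evaluated.append(pattern)
            if ml_rpi(pattern) >= threshold:
                kept.add(pattern)
        survivors.extend(sorted(kept, key=sorted))
        # Join survivors into patterns one feature larger, keeping a candidate
        # only if all of its subsets one size smaller survived.
        size = len(next(iter(kept))) + 1 if kept else 0
        candidates = {a | b for a in kept for b in kept if len(a | b) == size}
        level = sorted(
            (c for c in candidates
             if all(frozenset(s) in kept for s in combinations(c, size - 1))),
            key=sorted,
        )
    return survivors, evaluated


# Maximal-locality RPI values for the size-2 patterns, as on the slide.
scores = {frozenset("AB"): 0.167, frozenset("BC"): 0.167, frozenset("AC"): 0.25}
survivors, evaluated = prune_within_locality("ABC", scores.__getitem__, 0.25)
```

Only {AC} survives, and {ABC} is never scored because {AB} and {BC} were pruned. A survivor is only a candidate: each of its prevalence localities would still need its own RPI check before being reported, which is the additional check named under Correctness.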
Experimental Evaluation: Spatial Neighborhood Size
Question: What is the effect of spatial neighborhood size on the performance of different algorithms?
Fixed parameters: Dataset size: 7498 instances; number of features: 5; prevalence threshold: 0.07.
Figure panels: Run Time; # of RCPs.
Trends
Run time: ML Pruning outperforms PS Enumeration by a factor of 1.5 to 5.
Number of RCPs examined: ML Pruning outperforms PS Enumeration by a factor of 4.13 to 19.
Experimental Evaluation: Feature Types
Question: What is the effect of the number of feature types on the performance of different algorithms?
Fixed parameters: Dataset size: 7498 instances; spatial neighborhood size: 800 feet; prevalence threshold: 0.07.
Figure panels: Run Time; # of RCPs.
Trends
Run time: ML Pruning outperforms PS Enumeration by a factor of 1.2.
Number of RCPs examined: ML Pruning outperforms PS Enumeration by a factor of 1.6 to 3.5.
Real Dataset Case Study
Q: Where do assaults frequently occur around bars? Are there other factors?
Dataset: Lincoln, NE, crime data (Winter '07); neighborhood size = 0.25 miles; prevalence threshold = 0.07.
Figure panels: RCP of Bar and Assaults; RCP of Larceny and Assaults; RCP of Larceny, Bars and Assaults.
Observations
Bars in downtown are more likely to be crime prone than bars in other areas (e.g. 21.1%, 20.1%).
Assaults are more likely to be found in areas reporting larceny crimes (e.g. 47.6% vs. 21.1%).
Conclusion and Future work
Conclusions
  Neighborhood-based formulation of regional spatial patterns.
  Regional Participation Index: Measures the local fraction of the global count.
  Vector representation for prevalence localities (other representations possible; convex for simplicity).
Future Work
  Other representations for prevalence localities.
  Statistical interpretation: LISA statistics, variants of local Ripley's K, multiple hypothesis testing.
  Interpretation using predictive methods (e.g. Geographically Weighted Regression).
Acknowledgement: Reviewers of ACM GIS; members of the Spatial database and spatial data mining group, UMN; U.S. Department of Defense; Mr. Tom Casady and Kim Koffolt.