
Page 1: Pradeep Mohan*,  Shashi Shekhar ,   Zhe  Jiang University of Minnesota, Twin-Cities, MN

A spatial neighborhood graph approach to Regional Co-location pattern discovery: summary of results

Pradeep Mohan*, Shashi Shekhar, Zhe Jiang

University of Minnesota, Twin-Cities, MN

James A. Shine, James P. Rogers, Nicole Wayant

US Army ERDC, Topographic Engineering Center, Alexandria, VA

*Contact: [email protected]

Page 2:

Outline

Motivation
Problem Formulation
Computational Approach
Conclusions and Future work

Page 3:

Motivation: Spatial Heterogeneity, the second law of Geography

Spatial Heterogeneity (Goodchild, 2004; Goodchild 2003)

Expectations vary across space. Global models may not explain locally observed phenomena. Need for place based analysis.

Traditional data mining: Which pairs of items sell together frequently? Answer: Diaper in transaction ⇒ Beer in transaction.

Spatial heterogeneity in retail: Is this association true everywhere? Answer: It holds in blue-collar neighborhoods.

Global spatial data mining (global co-location patterns): Which pairs of spatial features are located together frequently?

Example: Gas stations and Convenience Stores

Our Focus: Where do certain pairs of spatial features co-locate frequently ?

Example: Assaults happen frequently around downtown bars.

Page 4:

Applications

Crime analysis
Localizing frequent crime patterns; opportunities for crime vary across space.
Predicting localities of the next crime.
Question: Do downtown bars lead to assaults more frequently?

Public Health
Localizing elevated disease risks around putative sources (e.g. mining areas).
Question: Where does high asbestos concentration often lead to lung cancer?

Ecology
Localizing symbiotic relationships between different species of plants and animals.
Question: Where are plover birds frequently found in the vicinity of a crocodile?

Courtesy: www.startribune.com

Courtesy: www.amazon.com

Page 5:

Regional co-location patterns (RCP)

Subsets of spatial features that are frequently located together in certain regions of a study area.

Input: Spatial features, crime reports.
Output: RCP (e.g. < (Bar, Assaults), Downtown >)

Page 6:

Outline

Motivation
Problem Formulation
  Basic Concepts
  Problem Statement
  Challenges
  Related Work
Computational Approach
Conclusions and Future work

Page 7:

Basic Concepts: Neighborhoods

Prevalence Locality
Subset of the spatial framework containing instances of a pattern.
Simple representation to visualize: convex hull. Other representations are possible.

Neighborhood Graph
Given: A spatial neighbor relation (spatial neighborhood size).
Nodes: Individual event instances.
Edges: Present if the neighbor relation is satisfied.
Based on the event-centric model (Huang, 2004).
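The neighborhood graph described above can be sketched in a few lines. This is only an illustration: it assumes the neighbor relation is a Euclidean distance threshold, and the function name `build_neighbor_graph`, the instance ids, and the coordinates are hypothetical rather than taken from the slides.

```python
from itertools import combinations
from math import dist


def build_neighbor_graph(instances, radius):
    """Return adjacency sets linking instances that lie within `radius`.

    `instances` maps an instance id to a (feature, (x, y)) pair.
    Assumption: the neighbor relation is a Euclidean distance threshold.
    """
    graph = {iid: set() for iid in instances}
    for (a, (_, pa)), (b, (_, pb)) in combinations(instances.items(), 2):
        if dist(pa, pb) <= radius:  # neighbor relation is satisfied
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical toy data, not taken from the slides.
instances = {
    "A1": ("A", (0.0, 0.0)),
    "B1": ("B", (0.5, 0.0)),
    "C1": ("C", (5.0, 5.0)),
}
graph = build_neighbor_graph(instances, radius=1.0)
```

The pairwise loop is quadratic in the number of instances; a spatial index would be the usual replacement for larger datasets.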

Page 8:

Basic Concepts: Quantifying regional interestingness

Regional Participation Ratio (RPR)
Conditional probability of observing a pattern instance within a locality given an instance of a feature within that locality.

Example:

$$RPR(\langle \{ABC\}, PL_2 \rangle, A) = \frac{2}{4}; \quad RPR(\langle \{ABC\}, PL_2 \rangle, B) = \frac{2}{6}; \quad RPR(\langle \{ABC\}, PL_2 \rangle, C) = \frac{1}{4}$$

Regional Participation Index (RPI)
Quantifies the local fraction participating in a relationship.

Example:

$$RPI(\langle \{ABC\}, PL_2 \rangle) = \min\left\{\frac{2}{4}, \frac{2}{6}, \frac{1}{4}\right\} = \frac{1}{4}$$
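The arithmetic of the example can be checked with a short sketch. It assumes each ratio is the number of instances of a feature that participate in the pattern inside the locality, divided by a total count of that feature's instances; the slide does not say how the totals are obtained, so the `(participating, total)` pairs below are simply the numerators and denominators of the example. The function names are hypothetical.

```python
from fractions import Fraction


def regional_participation_ratio(participating, total):
    """Fraction of a feature's instances that take part in the pattern."""
    return Fraction(participating, total)


def regional_participation_index(counts):
    """Minimum participation ratio over all features of the pattern.

    `counts` maps feature -> (participating instances, total instances).
    """
    return min(regional_participation_ratio(p, t) for p, t in counts.values())


# Numerators and denominators from the <{ABC}, PL2> example.
counts = {"A": (2, 4), "B": (2, 6), "C": (1, 4)}
rpi = regional_participation_index(counts)
print(rpi)  # 1/4
```

Using `Fraction` keeps the result exact, so comparisons against a threshold such as 0.25 do not depend on floating-point rounding.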

Page 9:

Detailed Statement

Given:
A spatial framework.
A collection of boolean spatial event types and their instances.
A minimum interestingness threshold, Pθ.
A symmetric and transitive neighbor relation R (e.g. based on spatial neighborhood size).

Find: All RCPs with prevalence >= Pθ.

Objective: Minimize computational cost.

Constraints:
(i) Spatial framework is heterogeneous.
(ii) Interest measure captures spatial heterogeneity.
(iii) Completeness: All prevalent RCPs are reported.
(iv) Correctness: Only prevalent RCPs are reported.

* Prevalence threshold = 0.25
* Spatial neighborhood size = 1 mile

Page 10:

Challenges

Conflicting requirements: Interest measure captures spatial heterogeneity while supporting scalable algorithms.

Exponential search space: Candidate pattern set cardinality is exponential in the number of event types.

Illustration: [Figure labels: Computational Scalability; Statistics Rigor; Spatial Data Mining (e.g. RCP)]

Page 11:

Challenges

Exponential search space: Candidate pattern set cardinality is exponential in the number of event types.

Illustration (pattern lattice for event types A, B, C):

{NULL}
A   B   C
AB  AC  BC
ABC

n    # Patterns    O(2^n)
3    4             k1 * 2^3 (k1 > 0)
4    11            k2 * 2^4 (k2 > 0)
5    26            k3 * 2^5 (k3 > 0)
6    57            k4 * 2^6 (k4 > 0)
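The pattern counts in the table match the number of feature subsets with at least two members, which is 2^n - n - 1. The sketch below checks this by direct enumeration; the helper name `candidate_patterns` is hypothetical.

```python
from itertools import combinations


def candidate_patterns(features):
    """Yield every subset of `features` with at least two members."""
    for size in range(2, len(features) + 1):
        yield from combinations(features, size)


for n in range(3, 7):
    features = [chr(ord("A") + i) for i in range(n)]
    count = sum(1 for _ in candidate_patterns(features))
    assert count == 2**n - n - 1  # all subsets minus the empty set and singletons
    print(n, count)
```

Each added event type roughly doubles the candidate set, which is why exhaustive examination becomes costly so quickly.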

Page 12:

Contributions

Regional Co-location Patterns
Neighborhood-based formulation.

Interest Measure
Captures the local fraction of events participating in patterns.
Shows attractive computational properties and honors spatial heterogeneity.

Computational Approach
Computational structure: pattern space enumeration.
Performance enhancement: maximal-locality-based pruning strategies.

Experimental Evaluation
Performance evaluation using real datasets (Lincoln, NE).
Real-world case study.

Page 13:

Related Work

Approaches for Regional Co-location Pattern discovery
Zoning based (Celik et al., 2007)
Fitness function clustering (Eick et al., 2008)
Spatial neighborhood based (our work)

Limitations noted for the zoning-based and fitness-function clustering approaches:
Reports one pattern per interesting region based on a criterion (e.g. max).
Computational structure and pruning strategies not explored.
Clustering is based on real-valued attributes.

Page 14:

Outline

Motivation
Problem Formulation
Computational Approach
  Pattern Space Enumeration
  Performance Tuning
  Experimental Evaluation
Conclusions and Future work

Page 15:

Computational Approach

{Null}

A, B, C

<{AB},PL1({AB})>

<{AB},PL2({AB})>

<{AB},PL3({AB})>

<{AC},PL1({AC})>

<{AC},PL2({AC})>

<{AC},PL3({AC})>

<{BC},PL1({BC})>

<{BC},PL2({BC})>

<{BC},PL3({BC})>

<{BC},PL4({BC})>

<{ABC},PL1({ABC})>

<{ABC},PL2({ABC})>

<{ABC},PL3({ABC})>

Steps: Compute neighborhoods. Identify candidate RCP instances.

[Figure: each candidate RCP in the lattice is annotated with a prevalence value (between 0.16 and 0.33 in this example) and marked as an accepted RCP (✔) or a pruned RCP (✕).]

Key Idea
Enumerate the entire pattern space.
Examine each pattern and prune.
Expensive!

Prevalence Threshold = 0.25
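The enumerate-then-prune idea can be sketched as a brute-force loop over every candidate pattern and every one of its prevalence localities. This is a minimal illustration, not the authors' implementation: `enumerate_and_filter`, `localities_of`, `rpi_of`, and the values in `table` are hypothetical stand-ins for the real neighborhood and interest-measure computations.

```python
from itertools import combinations


def enumerate_and_filter(features, localities_of, rpi_of, threshold):
    """Brute force: visit every candidate pattern and each of its localities."""
    accepted = []
    for size in range(2, len(features) + 1):
        for pattern in combinations(features, size):
            for locality in localities_of(pattern):
                if rpi_of(pattern, locality) >= threshold:
                    accepted.append((pattern, locality))
    return accepted


# Hypothetical interest values; the slide's figure does not give this mapping.
table = {
    (("A", "B"), "PL1"): 0.33,
    (("A", "B"), "PL2"): 0.16,
    (("B", "C"), "PL1"): 0.25,
}


def localities_of(pattern):
    return sorted(loc for (pat, loc) in table if pat == pattern)


def rpi_of(pattern, locality):
    return table[(pattern, locality)]


result = enumerate_and_filter(["A", "B", "C"], localities_of, rpi_of, 0.25)
```

Every pattern is visited whether or not any of its subsets was prevalent, which is the cost the slide labels as expensive.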

Page 16:

Performance Tuning: Key Ideas

Key Idea
Interest measure shows special pruning properties in certain subsets of the spatial framework.

Maximal Locality

Key Observations
Collection of connected instances.
Maximal localities are mutually disjoint.
Contains several RCPs.

Key Properties
RPI shows the anti-monotonicity property within maximal localities: pruning a co-location, {AB}, prunes all its supersets (e.g. {ABC}, {ABCD}, etc.).
RPI within a maximal locality is an upper bound on the RPI of its constituent prevalence localities.
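One way to read "collection of connected instances" is as a connected component of the neighborhood graph; components are mutually disjoint by construction. The sketch below takes that reading as an assumption. The function name `maximal_localities` and the toy graph are hypothetical.

```python
def maximal_localities(graph):
    """Split an adjacency-set graph into its connected components."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical neighborhood graph over instances of features A, B, C.
graph = {
    "A1": {"B1"},
    "B1": {"A1", "C1"},
    "C1": {"B1"},
    "A2": {"C2"},
    "C2": {"A2"},
    "B2": set(),
}
components = maximal_localities(graph)
```

Here the instances fall into three groups, so any pattern instance lies entirely inside one of them.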

Page 17:

Performance Tuning

Compute maximal localities: ML1, ML2, ML3.

[Figure: the pattern lattice ({Null}; A, B, C) evaluated within maximal localities ML1, ML2 and ML3. Annotations recovered from the figure follow.]
{AB}, 0.167 and {BC}, 0.167.
{ABC}: pruned automatically, due to anti-monotonicity of RPI.
{AC}, 0.25; <{AC},PL1({AC})>, 0.25.
{AB}, 0.25; {BC}, 0.33; {AC}, 0.25.
<{BC},PL3({BC})>, 0.167 and <{BC},PL4({BC})>, 0.167: pruned (✕✕).
No RCP, No RCP: due to upper bound property of RPI.

Completeness: Pruning a pattern within a maximal locality does not prune any valid RCPs.

Correctness: Accepting a pattern involves additional checks so that only prevalent RCPs are reported.

Prevalence Threshold = 0.25
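The anti-monotonicity argument suggests a level-wise search inside each maximal locality, in the style of Apriori: a larger pattern is evaluated only if all of its subsets one size smaller survived. The sketch below is an illustration under that assumption, not the authors' algorithm. The name `prune_within_locality` is hypothetical, and grouping the three pair values from the slide into a single maximal locality is also an assumption.

```python
from itertools import combinations


def prune_within_locality(features, rpi_in_locality, threshold):
    """Level-wise search that never evaluates supersets of pruned patterns."""
    survivors = []
    size = 2
    level = [frozenset(p) for p in combinations(features, size)]
    while level:
        kept = {p for p in level if rpi_in_locality(p) >= threshold}
        survivors.extend(sorted(kept, key=sorted))
        size += 1
        # Candidate only if every subset one size smaller survived.
        level = [
            frozenset(c)
            for c in combinations(features, size)
            if all(frozenset(s) in kept for s in combinations(c, size - 1))
        ]
    return survivors


# Pair values as shown on the slide; placing them in one locality is assumed.
scores = {frozenset("AB"): 0.167, frozenset("BC"): 0.167, frozenset("AC"): 0.25}
evaluated = []


def rpi_in_locality(pattern):
    evaluated.append(pattern)
    return scores[pattern]


result = prune_within_locality("ABC", rpi_in_locality, 0.25)
```

With these values only {AC} survives and {ABC} is never evaluated. Survivors at this level would still need the additional per-prevalence-locality checks mentioned under Correctness.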

Page 18:

Experimental Evaluation: Spatial Neighborhood Size

What is the effect of spatial neighborhood size on the performance of different algorithms?
Fixed parameters: dataset size: 7498 instances; # features: 5; prevalence threshold: 0.07.

[Figure panels: Run Time; # of RCPs]

Trends
Run time: ML Pruning outperforms PS Enumeration by a factor of 1.5 to 5.
# of RCPs examined: ML Pruning outperforms PS Enumeration by a factor of 4.13 to 19.

Page 19:

Experimental Evaluation: Feature Types

What is the effect of the number of feature types on the performance of different algorithms?
Fixed parameters: dataset size: 7498 instances; spatial neighborhood size: 800 feet; prevalence threshold: 0.07.

[Figure panels: Run Time; # of RCPs]

Trends
Run time: ML Pruning outperforms PS Enumeration by a factor of 1.2.
# of RCPs examined: ML Pruning outperforms PS Enumeration by a factor of 1.6 to 3.5.

Page 20:

Real Dataset Case Study

Q: Where do assaults frequently occur around bars? Are there other factors?

Dataset: Lincoln, NE, crime data (Winter '07); neighborhood size = 0.25 miles; prevalence threshold = 0.07.

[Figure panels: RCP of Bar and Assaults; RCP of Larceny and Assaults; RCP of Larceny, Bars and Assaults]

Observations
Bars in downtown are more likely to be crime prone than bars in other areas (e.g. 21.1%, 20.1%).
Assaults are more likely to be found in areas reporting larceny crimes (e.g. 47.6% vs 21.1%).

Page 21:

Conclusion and Future work

Conclusions
Neighborhood-based formulation of regional spatial patterns.
Regional Participation Index: measures the local fraction of the global count.
Vector representation for prevalence localities (other representations possible; convex for simplicity).

Future Work
Other representations for prevalence localities.
Statistical interpretation: LISA statistics, variants of local Ripley's K, multiple hypothesis testing.
Interpretation using predictive methods (e.g. Geographically Weighted Regression).

Acknowledgement
Reviewers of ACM GIS.
Members of the Spatial Database and Spatial Data Mining group, UMN.
U.S. Department of Defense.
Mr. Tom Casady and Kim Koffolt.