
Spatial Data Mining: Three Case Studies

For additional details: www.cs.umn.edu/~shekhar/problems.html

Shashi Shekhar, University of Minnesota

Presented to UCGIS Summer Assembly 2001

Background

• NSF workshop on GIS and DM (3/99)

• Spatial data[1, 8] - traffic, bird habitats, global climate, logistics, ...

• Spatial patterns: outliers, location prediction, associations, sequential associations, trends, …

Framework

• Problem statement: capture special needs

• Data exploration: maps, new methods

• Try reusing classical methods – from data mining, spatial statistics

• If reuse is not possible, invent new methods

• Validation, Performance tuning

Case 1: Spatial Outliers

• Problem: stations different from neighbors [SIGKDD 2001]

• Data: space-time plot, distributions of f(x) and S(x)

• Distribution of base attribute:

– spatially smooth

– frequency distribution over value domain: normal

• Classical test: flag an item when Pr.[item in population] is low

– Question: what is the distribution of the difference between f(x) and the neighborhood aggregate of f?

– Insight: this statistic is normally distributed!

– Test: |z-score of the statistic| > 2

– Performance: spatial join, clustering methods

Spatial outlier detection[4]

Spatial outlier: a data point that is extreme relative to its neighbors

Given:

• A spatial graph G = {V, E}

• A neighbor relationship (K neighbors)

• An attribute function $f: V \to \mathbb{R}$

• An aggregation function $f_{aggr}: \mathbb{R}^k \to \mathbb{R}$

• Confidence level threshold $\theta$

Find: $O = \{v_i \mid v_i \in V,\ v_i \text{ is a spatial outlier}\}$

Objective:

• Correctness: the attribute value of $v_i$ is extreme compared with its neighbors

• Computational efficiency

Constraints:

• Attribute value is normally distributed

• Computation cost dominated by I/O operations

Spatial outlier detection

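Spatial Outlier Detection Test

1. Choice of spatial statistic: $S(x) = f(x) - E_{y \in N(x)}[f(y)]$

Theorem: S(x) is normally distributed if f(x) is normally distributed

2. Test for outlier detection: $\left| \frac{S(x) - \mu_s}{\sigma_s} \right| > \theta$, where $\mu_s$ and $\sigma_s$ are the mean and standard deviation of S(x)

Hypothesis: I/O cost is determined by clustering efficiency

[Figure panels: f(x), S(x)]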
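The test on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the neighborhood aggregate is a plain average, uses the population mean and standard deviation of S (named `mu_s` and `sigma_s` here), and runs on a made-up line of ten stations. The function name `spatial_outliers` and the data are hypothetical.

```python
from statistics import mean, pstdev

def spatial_outliers(f, neighbors, theta=2.0):
    """Return nodes x with |(S(x) - mu_s) / sigma_s| > theta.

    f:         dict mapping node -> attribute value f(x)
    neighbors: dict mapping node -> list of neighboring nodes N(x)
    theta:     z-score threshold (2 follows the "> 2" rule on the slide)
    """
    # S(x) = f(x) minus the average of f over the neighbors of x
    s = {x: f[x] - mean(f[y] for y in neighbors[x]) for x in f}
    mu_s = mean(s.values())
    sigma_s = pstdev(s.values())
    if sigma_s == 0:
        return []  # every node matches its neighborhood equally well
    return sorted(x for x in f if abs((s[x] - mu_s) / sigma_s) > theta)

# Hypothetical data: ten stations on a line; station 5 reads far higher.
values = [10, 11, 10, 12, 11, 40, 11, 10, 12, 11]
f = dict(enumerate(values))
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < len(values)] for i in f}

outliers = spatial_outliers(f, neighbors)
print(outliers)
```

Note that the neighbors of station 5 also get large values of S (their neighborhood average is inflated), but only station 5 itself crosses the threshold in this example.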

Spatial outlier and its neighbors

Spatial outlier detection

Results

1. CCAM achieves higher clustering efficiency (CE)

2. CCAM has lower I/O cost

3. Higher CE leads to lower I/O cost

4. Page size improves CE for all methods

[Figure: CE value and I/O cost for Z-order, CCAM, and Cell-Tree]

Case 2: Location Prediction

• Citations: SIAM DM Conf. 2001, SIGKDD DMKD 2000

• Problem: predict nesting sites in marshes

– given vegetation, water depth, distance to edge, etc.

• Data: maps of nests and attributes

– spatially clustered nests, spatially smooth attributes

• Classical methods: logistic regression, decision trees, Bayesian classifier

– but the independence assumption is violated: they miss auto-correlation!

– Alternatives: spatial auto-regression (SAR), Markov random field Bayesian classifier

– Open issue: spatial accuracy vs. classification accuracy

– Open issue: performance, since SAR learning is slow!

Location Prediction[6, 7, 8]

Given:

1. Spatial framework $S = \{s_1, \ldots, s_n\}$

2. Explanatory functions: $f_{X_k}: S \to \mathbb{R}$

3. A dependent function: $f_Y: S \to \{0, 1\}$

4. A family of function mappings: $\mathbb{R} \times \cdots \times \mathbb{R} \to \{0, 1\}$

Find: a function $\hat{f}_Y$

Objective: maximize classification_accuracy$(\hat{f}_Y, f_Y)$

Constraints: spatial autocorrelation exists

[Figure panels: nest locations, distance to open water, vegetation durability, water depth]

Evaluation: Changing Model

• Linear regression: $y = X\beta + \epsilon$

• Spatial regression: $y = \rho W y + X\beta + \epsilon$

• Spatial model is better
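To see what the spatial term adds, here is a small sketch that assumes the usual spatial autoregressive form y = rho * W y + X beta + noise, with W a row-normalized neighbor matrix. It only evaluates the model for given coefficients (noise omitted) by fixed-point iteration; it does not estimate rho or beta, which is the slow learning step mentioned earlier. The function name `sar_predict`, the matrix `W`, and all numbers are hypothetical.

```python
def sar_predict(X, beta, W, rho, iterations=200):
    """Solve y = rho * W y + X beta (noise term omitted) by fixed-point iteration.

    The iteration converges when W is row-normalized and |rho| < 1.
    """
    n = len(X)
    xb = [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]
    y = xb[:]  # with rho = 0 this is already the linear-regression prediction
    for _ in range(iterations):
        y = [rho * sum(W[i][j] * y[j] for j in range(n)) + xb[i] for i in range(n)]
    return y

# Hypothetical example: four sites on a line; each row of W averages the neighbors.
W = [
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]
X = [[1.0], [0.0], [0.0], [0.0]]  # one explanatory attribute, nonzero only at site 0
beta = [1.0]

y_linear = sar_predict(X, beta, W, rho=0.0)   # no spatial term
y_spatial = sar_predict(X, beta, W, rho=0.5)  # neighbors influence each other
print(y_linear)
print([round(v, 3) for v in y_spatial])
```

With rho = 0 only site 0 gets a nonzero prediction. With rho > 0 the prediction decays smoothly across neighboring sites, which is the auto-correlation that the classical model ignores.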

Evaluation: Changing measure

New measure:

$ADNP(A, P) = \frac{1}{K} \sum_{k=1}^{K} dist(A_k, A_k.nearest(P))$
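As a rough illustration of why a distance-based measure helps, the sketch below computes the average distance from each actual location to its nearest predicted location. It assumes Euclidean distance on 2-D points; the function name `adnp` and the sample coordinates are made up for this example.

```python
import math

def adnp(actual, predicted):
    """Average distance from each actual location to its nearest predicted location."""
    if not actual or not predicted:
        raise ValueError("both location lists must be non-empty")
    return sum(min(math.dist(a, p) for p in predicted) for a in actual) / len(actual)

# Hypothetical data: two actual nests and two candidate prediction maps.
actual = [(0, 0), (4, 0)]
near_miss = [(0, 1), (4, 1)]  # every prediction is one cell off
far_miss = [(0, 3), (4, 3)]   # every prediction is three cells off

print(adnp(actual, near_miss))
print(adnp(actual, far_miss))
```

Neither prediction map hits an actual nest cell, so a cell-by-cell classification accuracy scores both identically. The distance-based measure ranks the near miss as better.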

Case 3: Spatial Association Rules

• Citation: Symp. on Spatial Databases 2001

• Problem: given a set of boolean spatial features

– find subsets of co-located features, e.g. (fire, drought, vegetation)

– Data: continuous space, partition not natural, no reference feature

• Classical data mining approach: association rules

– But, Look Ma! No transactions! No support measure!

• Approach: work with continuous data without transactionizing it!

– confidence = Pr.[fire at s | drought in N(s) and vegetation in N(s)]

– support: cardinality of the spatial join of instances of fire, drought, dry veg.

– participation: min. fraction of instances of a feature in the join result

– new algorithm using spatial joins and apriori_gen filters

Co-location Patterns[2, 3]

Can you find co-location patterns from the following sample dataset?

[Figure: sample dataset; the answers appear as feature symbols on the original slide]

Co-location Patterns

Spatial co-location: a set of features frequently co-located

Given:

• A set T of K boolean spatial feature types, $T = \{f_1, f_2, \ldots, f_K\}$

• A set P of N locations, $P = \{p_1, \ldots, p_N\}$, in a spatial framework S; each $p_i \in P$ is of some spatial feature in T

• A neighbor relation R over locations in S

Find: $T_c$ = subsets of T frequently co-located

Objective: correctness, completeness, efficiency

Constraints:

• R is symmetric and reflexive

• Monotonic prevalence measure

[Figure panels: reference feature centric, window centric, event centric]

Co-location Patterns

Participation index

1. Participation ratio $pr(f_i, c)$ of feature $f_i$ in co-location $c = \{f_1, f_2, \ldots, f_k\}$: fraction of instances of $f_i$ with features $\{f_1, \ldots, f_{i-1}, f_{i+1}, \ldots, f_k\}$ nearby

2. Participation index = $\min_i \{pr(f_i, c)\}$

Algorithm: Hybrid Co-location Miner
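The two definitions can be made concrete with a brute-force sketch. This is not the Hybrid Co-location Miner; it simply enumerates every combination of one instance per feature and keeps those whose points are pairwise within a distance threshold, which is an assumed stand-in for the neighbor relation R. The function name `participation_index`, the threshold, and the toy dataset are hypothetical.

```python
import math
from itertools import combinations, product

def participation_index(instances, pattern, max_dist):
    """Return (participation index, participation ratios) for `pattern`.

    instances: dict mapping feature name -> list of (x, y) points
    pattern:   tuple of feature names forming the candidate co-location
    A row instance picks one point per feature such that all chosen points
    are pairwise within max_dist (brute force; only for tiny datasets).
    """
    participating = {f: set() for f in pattern}
    for row in product(*(range(len(instances[f])) for f in pattern)):
        points = [instances[f][i] for f, i in zip(pattern, row)]
        if all(math.dist(p, q) <= max_dist for p, q in combinations(points, 2)):
            for f, i in zip(pattern, row):
                participating[f].add(i)
    # Participation ratio: fraction of a feature's instances found in some row instance
    ratios = {f: len(participating[f]) / len(instances[f]) for f in pattern}
    return min(ratios.values()), ratios

# Hypothetical toy dataset.
instances = {
    "fire": [(0, 0), (10, 10)],
    "drought": [(0, 1), (10, 11), (20, 20), (30, 30)],
}
pi, ratios = participation_index(instances, ("fire", "drought"), max_dist=2)
print(pi, ratios)
```

Taking the minimum over features means the index can only stay the same or shrink as features are added to a pattern, which is the monotonic prevalence measure that apriori-style pruning relies on.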

Comparison with association rules

                                  Association rules         Co-location rules
underlying space                  discrete sets             continuous space
item-types                        item-types                events / Boolean spatial features
collections                       transactions              neighborhoods
prevalence measure                support                   participation index
conditional probability measure   Pr.[ A in T | B in T ]    Pr.[ A in N(L) | B at L ]
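The co-location version of the conditional probability measure, Pr.[ A in N(L) | B at L ], can be estimated directly from point data without building transactions. The sketch below assumes N(L) is a disc of fixed radius around L; the function name `colocation_confidence` and the coordinates are invented for illustration.

```python
import math

def colocation_confidence(a_points, b_points, max_dist):
    """Estimate Pr[A in N(L) | B at L]: share of B instances with an A instance nearby."""
    if not b_points:
        raise ValueError("need at least one instance of B")
    hits = sum(
        1 for b in b_points
        if any(math.dist(a, b) <= max_dist for a in a_points)
    )
    return hits / len(b_points)

# Hypothetical toy dataset.
fire = [(0, 0), (10, 10)]
drought = [(0, 1), (10, 11), (20, 20), (30, 30)]

fire_given_drought = colocation_confidence(fire, drought, max_dist=2)
drought_given_fire = colocation_confidence(drought, fire, max_dist=2)
print(fire_given_drought, drought_given_fire)
```

As with association rules, the measure is asymmetric: the rule "drought implies fire nearby" and the rule "fire implies drought nearby" can have very different values.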

Conclusions & Future Directions

• Spatial domains may not satisfy the assumptions of classical methods

– data: auto-correlation, continuous geographic space

– patterns: global vs. local, e.g. spatial outliers vs. outliers

– data exploration: maps and albums

• Open issues

– patterns: hot-spots, blobology (shape), spatial trends, …

– metrics: spatial accuracy (predicted locations), spatial contiguity (clusters)

– spatio-temporal datasets

– scale and resolution sensitivity of patterns

– geo-statistical confidence measures for mined patterns

References

1. S. Shekhar, S. Chawla, S. Ravada, A. Fetterer, X. Liu and C.T. Lu, “Spatial Databases: Accomplishments and Research Needs”, IEEE Transactions on Knowledge and Data Engineering, Jan.-Feb. 1999.

2. S. Shekhar and Y. Huang, “Discovering Spatial Co-location Patterns: a Summary of Results”, In Proc. of 7th International Symposium on Spatial and Temporal Databases (SSTD01), July 2001.

3. S. Shekhar, Y. Huang, and H. Xiong, “Performance Evaluation of Co-location Miner”, IEEE International Conference on Data Mining (ICDM’01), Nov. 2001. (Submitted)

4. S. Shekhar, C.T. Lu, and P. Zhang, “Detecting Graph-based Spatial Outliers: Algorithms and Applications”, Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.

5. S. Shekhar and S. Chawla, “Spatial Database: Concepts, Implementation and Trends” (book, to be published in 2001).

6. S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, “Extending Data Mining for Spatial Applications: A Case Study in Predicting Nest Locations”, Proc. of the 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2000), Dallas, TX, May 14, 2000.

7. S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, “Modeling Spatial Dependencies for Mining Geospatial Data”, First SIAM International Conference on Data Mining, 2001.

8. S. Shekhar, P.R. Schrater, R. R. Vatsavai, W. Wu, and S. Chawla, “Spatial Contextual Classification and Prediction Models for Mining Geospatial Data”, IEEE Transactions on Multimedia, 2001. (Submitted)

Some papers are available at: http://www.cs.umn.edu/research/shashi-group/