
Chapter 19

Trends in Spatial Data Mining

Shashi Shekhar, Pusheng Zhang, Yan Huang, and Ranga Raju Vatsavai

The explosive growth of spatial data and widespread use of spatial databases emphasize the need for the automated discovery of spatial knowledge. Spatial data mining (Roddick and Spiliopoulou 1999, Shekhar and Chawla 2002) is the process of discovering interesting and previously unknown, but potentially useful patterns from spatial databases. The complexity of spatial data and intrinsic spatial relationships limits the usefulness of conventional data mining techniques for extracting spatial patterns. Efficient tools for extracting information from geospatial data are crucial to organizations that make decisions based on large spatial datasets, including NASA, the National Imagery and Mapping Agency (NIMA), the National Cancer Institute (NCI), and the United States Department of Transportation (USDOT). These organizations are spread across many application domains including ecology and environmental management, public safety, transportation, earth science, epidemiology, and climatology.

General purpose data mining tools, such as Clementine, See5/C5.0, and Enterprise Miner, are designed to analyze large commercial databases. Although these tools were primarily designed to identify customer-buying patterns in market basket data, they have also been used in analyzing scientific and engineering data, astronomical data, multimedia data, genomic data, and web data. Extracting interesting and useful patterns from spatial datasets is more difficult than extracting corresponding patterns from traditional numeric and categorical data due to the complexity of spatial data types, spatial relationships, and spatial autocorrelation.

Specific features of geographical data that preclude the use of general purpose data mining algorithms are: (1) rich data types (e.g., extended spatial objects), (2) implicit spatial relationships among the variables, (3) observations that are not independent, and (4) spatial autocorrelation among the features. In this chapter we focus on the unique features that distinguish spatial data mining from classical data mining in the following four categories: data input, statistical foundation, output patterns, and computational process. We present major accomplishments of spatial data mining research, especially regarding output patterns known as predictive models, spatial outliers, spatial colocation rules, and clusters. Finally, we identify areas of spatial data mining where further research is needed.

19.1 Data Input

The data inputs of spatial data mining are more complex than the inputs of classical data mining because they include extended objects such as points, lines, and polygons. The data inputs of spatial data mining have two distinct types of attributes: nonspatial attributes and spatial attributes. Nonspatial attributes are used to characterize nonspatial features of objects, such as the name, population, and unemployment rate of a city. They are the same as the attributes used in the data inputs of classical data mining. Spatial attributes are used to define the spatial location and extent of spatial objects (Bolstad 2002). The spatial attributes of a spatial object most often include information related to spatial locations, e.g., longitude, latitude, and elevation, as well as shape.

Relationships among nonspatial objects are explicit in data inputs, e.g., arithmetic relation, ordering, is instance of, subclass of, and membership of. In contrast, relationships among spatial objects are often implicit, such as overlap, intersect, and behind. One possible way to deal with implicit spatial relationships is to materialize the relationships into traditional data input columns and then apply classical data mining techniques (Quinlan 1993, Barnett and Lewis 1994, Agrawal and Srikant 1994, Jain and Dubes 1988). However, the materialization can result in loss of information. Another way to capture implicit spatial relationships is to develop models or techniques to incorporate spatial information into the spatial data mining process. We discuss a few case studies of such techniques in section 19.3.
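As a minimal sketch of this materialization step, the following (with invented coordinates and a hypothetical `materialize_near` helper) turns the implicit "near a landmark" relationship into an explicit Boolean column that a classical mining algorithm could then consume:

```python
import numpy as np

def materialize_near(coords, landmark, threshold):
    """Materialize the implicit 'near' relationship as a 0/1 column:
    1 if a point lies within `threshold` Euclidean distance of
    `landmark`, else 0."""
    coords = np.asarray(coords, dtype=float)
    dists = np.linalg.norm(coords - np.asarray(landmark, dtype=float), axis=1)
    return (dists <= threshold).astype(int)

# Invented point objects and landmark for illustration.
points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
near_col = materialize_near(points, landmark=(0.0, 0.0), threshold=2.0)
print(near_col)  # [1 1 0]
```

Note that the column records only a single thresholded relationship; the exact distances, and any relationships not materialized, are lost, which is the information loss mentioned above.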


Nonspatial Relationship (Explicit)   Spatial Relationship (Often Implicit)
Arithmetic                           Set-oriented: union, intersection, membership, ...
Ordering                             Topological: meet, within, overlap, ...
Is instance of                       Directional: North, NE, left, above, behind, ...
Subclass of                          Metric: e.g., distance, area, perimeter, ...
Part of                              Dynamic: update, create, destroy, ...
Membership of                        Shape-based and visibility

Table 19.1: Relationships among nonspatial data and spatial data.

19.2 Statistical Foundation

Statistical models (Cressie 1993) are often used to represent observations in terms of random variables. These models can then be used for estimation, description, and prediction based on probability theory. Spatial data can be thought of as resulting from observations on the stochastic process Z(s): s ∈ D, where s is a spatial location and D is possibly a random set in a spatial framework. Here we present three spatial statistical problems one might encounter: point process, lattice, and geostatistics.

Point process: A point process is a model for the spatial distribution of the points in a point pattern. Several natural processes can be modeled as spatial point patterns, e.g., positions of trees in a forest and locations of bird habitats in a wetland. Spatial point patterns can be broadly grouped into random or nonrandom processes. Real point patterns are often compared with a random pattern (generated by a Poisson process) using the average distance between a point and its nearest neighbor. For a random pattern, this average distance is expected to be 1/(2√density), where density is the average number of points per unit area. If, for a real process, the computed distance falls within a certain limit, then we conclude that the pattern is generated by a random process; otherwise it is a nonrandom process.
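This nearest-neighbor comparison can be sketched as follows; the points here are generated under an assumed complete spatial randomness, and the helper name is ours:

```python
import math
import random

def mean_nn_distance(points):
    """Average distance from each point to its nearest neighbor."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        nearest = min(math.hypot(xi - xj, yi - yj)
                      for j, (xj, yj) in enumerate(points) if j != i)
        total += nearest
    return total / len(points)

# Generate a random (Poisson-like) pattern in a 100 x 100 study area.
random.seed(0)
n, side = 500, 100.0
points = [(random.uniform(0, side), random.uniform(0, side)) for _ in range(n)]

density = n / side**2
expected = 1.0 / (2.0 * math.sqrt(density))  # expected mean NN distance for a random pattern
observed = mean_nn_distance(points)
print(round(expected, 3), round(observed, 3))
```

For a clustered pattern the observed mean would fall well below the expected value, and for a regular (dispersed) pattern well above it.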

Lattice: A lattice is a model for a gridded space in a spatial framework. Here the lattice refers to a countable collection of regular or irregular spatial sites related to each other via a neighborhood relationship. Several spatial statistical analyses, e.g., the spatial autoregressive model and Markov random fields, can be applied on lattice data.

Geostatistics: Geostatistics deals with the analysis of spatial continuity and weak stationarity (Cressie 1993), which are inherent characteristics of spatial datasets. Geostatistics provides a set of statistical tools, such as kriging (Cressie 1993), for the interpolation of attributes at unsampled locations.

One of the fundamental assumptions of statistical analysis is that the data samples are independently generated, like successive tosses of a coin or the rolling of a die. However, in the analysis of spatial data, the assumption about the independence of samples is generally false. In fact, spatial data tends to be highly self-correlated. For example, people with similar characteristics, occupation, and background tend to cluster together in the same neighborhoods. The economies of a region tend to be similar. Changes in natural resources, wildlife, and temperature vary gradually over space. The property of like things to cluster in space is so fundamental that geographers have elevated it to the status of the first law of geography: "Everything is related to everything else but nearby things are more related than distant things" (Tobler 1979). In spatial statistics, an area within statistics devoted to the analysis of spatial data, this property is called spatial autocorrelation. For example, figure 19.1 shows the value distributions of an attribute in a spatial framework for an independent identical distribution and a distribution with spatial autocorrelation.

Figure 19.1: Attribute values in space with independent identical distribution ("White noise: no spatial autocorrelation") and spatial autocorrelation ("Water depth variation across marshland").

Knowledge discovery techniques that ignore spatial autocorrelation typically perform poorly in the presence of spatial data. Often the spatial dependencies arise due to the inherent characteristics of the phenomena under study, but in particular they arise because the spatial resolution of imaging sensors is finer than the size of the objects being observed. For example, remote sensing satellites have resolutions ranging from 30 meters (e.g., the Enhanced Thematic Mapper of NASA's Landsat 7 satellite) to one meter (e.g., the IKONOS satellite from Space Imaging), while the objects under study (e.g., urban areas, forest, water) are often much larger than 30 meters. As a result, per-pixel-based classifiers, which do not take spatial context into account, often produce classified images with salt-and-pepper noise. These classifiers also suffer in terms of classification accuracy.

The spatial relationship among locations in a spatial framework is often modeled via a contiguity matrix. A simple contiguity matrix may represent a neighborhood relationship defined using adjacency, Euclidean distance, etc. Example definitions of neighborhood using adjacency include a four-neighborhood and an eight-neighborhood. Given a gridded spatial framework, a four-neighborhood assumes that a pair of locations influence each other if they share an edge. An eight-neighborhood assumes that a pair of locations influence each other if they share either an edge or a vertex.

Figure 19.2: A spatial framework and its four-neighborhood contiguity matrix.

Figure 19.2(a) shows a gridded spatial framework with four locations, A, B, C, and D. A binary matrix representation of a four-neighborhood relationship is shown in figure 19.2(b). The row-normalized representation of this matrix is called a contiguity matrix, as shown in figure 19.2(c). Other contiguity matrices can be designed to model neighborhood relationships based on distance. The essential idea is to specify the pairs of locations that influence each other along with the relative intensity of interaction. More general models of spatial relationships using cliques and hypergraphs are available in the literature (Warrender and Augusteijn 1999). In spatial statistics, spatial autocorrelation is quantified using measures such as Ripley's K-function and Moran's I (Cressie 1993).
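A sketch of these steps for the 2 × 2 grid of figure 19.2, with cells A, B, C, and D in row-major order (the function name is ours):

```python
import numpy as np

def four_neighborhood(rows, cols):
    """Binary adjacency matrix of a gridded spatial framework under
    the four-neighborhood (shared-edge) definition."""
    n = rows * cols
    W = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    W[i, rr * cols + cc] = 1.0
    return W

W = four_neighborhood(2, 2)            # binary matrix, as in figure 19.2(b)
C = W / W.sum(axis=1, keepdims=True)   # row-normalized contiguity matrix, figure 19.2(c)
print(C)
```

In the 2 × 2 grid each cell has exactly two edge-sharing neighbors, so every nonzero entry of the row-normalized matrix is 0.5, and diagonal cells (e.g., A and D), which share only a vertex, are not neighbors.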


19.3 Output Patterns

In this section, we present case studies of four important output patterns for spatial data mining: predictive models, spatial outliers, spatial colocation rules, and spatial clustering.

19.3.1 Predictive Models

The prediction of events occurring at particular geographic locations is very important in several application domains. Examples of problems that require location prediction include crime analysis, cellular networking, and natural disasters such as fires, floods, droughts, vegetation diseases, and earthquakes. In this section we provide two spatial data mining techniques for predicting locations, namely the spatial autoregressive model (SAR) and Markov random fields (MRF).

An Application Domain

We begin by introducing an example to illustrate the different concepts related to location prediction in spatial data mining. We are given data about two wetlands, named Darr and Stubble, on the shores of Lake Erie in Ohio, USA, in order to predict the spatial distribution of a marsh-breeding bird, the red-winged blackbird (Agelaius phoeniceus). The data was collected from April to June in two successive years, 1995 and 1996.

A uniform grid was imposed on the two wetlands and different types of measurements were recorded at each cell or pixel. In total, the values of seven attributes were recorded at each cell. Domain knowledge is crucial in deciding which attributes are important and which are not. For example, vegetation durability was chosen over vegetation species because specialized knowledge about the bird-nesting habits of the red-winged blackbird suggested that the choice of nest location depends more on plant structure, plant resistance to wind, and wave action than on the plant species.

An important goal is to build a model for predicting the location of bird nests in the wetlands. Typically, the model is built using a portion of the data, called the learning or training data, and then tested on the remainder of the data, called the testing data. In this study we built a model using the 1995 Darr wetland data and then tested it on the 1995 Stubble wetland data. In the learning data, all the attributes are used to build the model; in the testing data, one value is hidden, in our case the location of the nests. Using knowledge gained from the 1995 Darr data and the values of the independent attributes in the test data, we want to predict the location of the nests in the 1995 Stubble data.

Figure 19.3: Top left: learning dataset: the geometry of the Darr wetland and the locations of the nests. Top right: the spatial distribution of vegetation durability over the marshland. Bottom left: the spatial distribution of water depth. Bottom right: the spatial distribution of distance to open water.

Modeling Spatial Dependencies Using the SAR and MRF Models

Several previous studies (Jhung and Swain 1996; Solberg, Taxt, and Jain 1996) have shown that modeling spatial dependency (often called context) during the classification process improves overall classification accuracy. Spatial context can be defined by the relationships between spatially adjacent pixels in a small neighborhood. In this section, we present two models of spatial dependency: the spatial autoregressive model (SAR) and Markov random field (MRF)-based Bayesian classifiers.

Spatial Autoregressive Model

The spatial autoregressive model decomposes a classifier f̂_C into two parts, namely spatial autoregression and logistic transformation. We first show how spatial dependencies are modeled using the framework of logistic regression analysis. In the spatial autoregression model, the spatial dependencies of the error term, or the dependent variable, are directly modeled in the regression equation (Anselin 1988). If the dependent values y_i are related to each other, then the regression equation can be modified as

y = ρWy + Xβ + ε. (19.1)

Here W is the neighborhood relationship contiguity matrix and ρ is a parameter that reflects the strength of the spatial dependencies between the elements of the dependent variable. After the correction term ρWy is introduced, the components of the residual error vector ε are then assumed to be generated from independent and identical standard normal distributions. As in the case of classical regression, the SAR equation has to be transformed via the logistic function for binary dependent variables.

We refer to this equation as the spatial autoregressive model (SAR). Notice that when ρ = 0, this equation collapses to the classical regression model. The benefits of modeling spatial autocorrelation are many: the residual error will have much lower spatial autocorrelation (i.e., systematic variation). With the proper choice of W, the residual error should, at least theoretically, have no systematic variation. If the spatial autocorrelation coefficient is statistically significant, then SAR will quantify the presence of spatial autocorrelation. It will indicate the extent to which variations in the dependent variable (y) are explained by the average of neighboring observation values. Finally, the model will have a better fit (i.e., a higher R-squared statistic).
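Since equation 19.1 can be rearranged to y = (I − ρW)⁻¹(Xβ + ε), one can simulate from the SAR model directly; the sketch below uses hypothetical values for W, β, and ρ:

```python
import numpy as np

# Simulate from the SAR model y = rho*W*y + X*beta + eps by solving the
# reduced form y = (I - rho*W)^{-1} (X*beta + eps). W is the row-normalized
# four-neighborhood contiguity matrix of a 2x2 grid; beta and rho are invented.
rng = np.random.default_rng(0)
n = 4
W = np.array([[0.0, 0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0, 0.5],
              [0.5, 0.0, 0.0, 0.5],
              [0.0, 0.5, 0.5, 0.0]])
X = rng.normal(size=(n, 2))
beta = np.array([1.0, -2.0])
rho = 0.6
eps = rng.normal(size=n)

y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + eps)

# The solution satisfies the original SAR equation:
print(np.allclose(y, rho * W @ y + X @ beta + eps))  # True
# With rho = 0 the model collapses to classical regression y = X*beta + eps:
print(np.allclose(np.linalg.solve(np.eye(n) - 0.0 * W, X @ beta + eps),
                  X @ beta + eps))  # True
```

Estimating ρ and β from data is a separate maximum-likelihood problem not shown here; the sketch only illustrates the structure of the model.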

Markov Random Field-based Bayesian Classifiers

Markov random field-based Bayesian classifiers estimate the classification model f̂_C using MRF and Bayes' rule. A set of random variables whose interdependency relationship is represented by an undirected graph (i.e., a symmetric neighborhood matrix) is called a Markov random field (Li 1995). The Markov property specifies that a variable depends only on its neighbors and is independent of all other variables. The location prediction problem can be modeled in this framework by assuming that the class labels, l_i = f_C(s_i), of different locations, s_i, constitute an MRF. In other words, random variable l_i is independent of l_j if W(s_i, s_j) = 0.

The Bayesian rule can be used to predict l_i from feature value vector X and neighborhood class label vector L_i as follows:

Pr(l_i | X, L_i) = Pr(X | l_i, L_i) Pr(l_i | L_i) / Pr(X). (19.2)

The solution procedure can estimate Pr(l_i | L_i) from the training data, where L_i denotes a set of labels in the neighborhood of s_i excluding the label at s_i, by examining the ratios of the frequencies of class labels to the total number of locations in the spatial framework. Pr(X | l_i, L_i) can be estimated using kernel functions from the observed values in the training dataset. For reliable estimates, even larger training datasets are needed relative to those needed for the Bayesian classifiers without spatial context, since we are estimating a more complex distribution. An assumption on Pr(X | l_i, L_i) may be useful if the training dataset available is not large enough. A common assumption is the uniformity of influence from all neighbors of a location. For computational efficiency it can be assumed that only the local explanatory data X(s_i) and neighborhood label L_i are relevant in predicting class label l_i = f_C(s_i). It is common to assume that all interaction between neighbors is captured via the interaction in the class-label variable. Many domains also use specific parametric probability distribution forms, leading to simpler solution procedures. In addition, it is frequently easier to work with a Gibbs distribution specialized by the locally defined MRF through the Hammersley-Clifford theorem (Besag 1974).
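A toy sketch of this estimation procedure, with invented training records of the form (label, neighborhood label, feature bin) and frequency-based counts standing in for the kernel estimates:

```python
from collections import Counter

# Invented training records: (label, majority label of the neighborhood,
# discretized feature value). Real data would use continuous features and
# kernel-density estimates of Pr(X | l, L).
train = [(1, 1, 'hi'), (1, 1, 'hi'), (1, 0, 'hi'),
         (0, 0, 'lo'), (0, 0, 'lo'), (0, 1, 'lo'), (0, 0, 'hi')]

prior = Counter()    # counts of (label, neighborhood label)
likeli = Counter()   # counts of (feature, label, neighborhood label)
nbhd = Counter()     # counts of neighborhood label
for l, L, x in train:
    prior[(l, L)] += 1
    likeli[(x, l, L)] += 1
    nbhd[L] += 1

def posterior_score(l, L, x):
    """Unnormalized Pr(l | X, L) ~ Pr(X | l, L) * Pr(l | L);
    Pr(X) cancels when comparing class labels."""
    p_l_given_L = prior[(l, L)] / nbhd[L]
    p_x_given = likeli[(x, l, L)] / max(prior[(l, L)], 1)
    return p_x_given * p_l_given_L

# Predict the label of a location with neighborhood label 1 and feature 'hi'.
scores = {l: posterior_score(l, 1, 'hi') for l in (0, 1)}
pred = max(scores, key=scores.get)
print(pred)  # 1
```

The comparison of unnormalized scores mirrors equation 19.2: the denominator Pr(X) is the same for both candidate labels and can be dropped.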

Comparison of the Methods

A more detailed theoretical and experimental comparison of these methods can be found in Shekhar, Vatsavai, Wu, and Chawla (2002). Although MRF and SAR classification have different formulations, they share a common goal, estimating the posterior probability distribution p(l_i | X). However, the posterior for the two models is computed differently, with different assumptions. For MRF the posterior is computed using Bayes' rule. On the other hand, in logistic regression, the posterior distribution is directly fit to the data. One important difference between logistic regression and MRF is that logistic regression assumes no dependence on neighboring classes. Logistic regression and logistic SAR models belong to a more general exponential family. The exponential family is given by

Pr(u | v) = e^{A(θ_v) + B(u, π) + θ_v^T u},

where u and v are location and label, respectively. This exponential family includes many of the common distributions, such as Gaussian, binomial, Bernoulli, and Poisson, as special cases.

Experiments were carried out on the Darr and Stubble wetlands to compare classical regression, SAR, and the MRF-based Bayesian classifiers. The results showed that the MRF models yield better spatial and classification accuracies than SAR in the prediction of the locations of bird nests. We also observed that SAR predictions are extremely localized, missing actual nests over a large part of the marshlands.

19.3.2 Spatial Outliers

Outliers have been informally defined as observations in a dataset that appear to be inconsistent with the remainder of that set of data (Barnett and Lewis 1994), or that deviate so much from other observations as to arouse suspicion that they were generated by a different mechanism (Hawkins 1980). The identification of global outliers can lead to the discovery of unexpected knowledge and has a number of practical applications in areas such as credit card fraud, athlete performance analysis, voting irregularity, and severe weather prediction. This section focuses on spatial outliers, i.e., observations that appear to be inconsistent with their neighborhoods. Detecting spatial outliers is useful in many applications of geographic information systems and spatial databases, including transportation, ecology, public safety, public health, climatology, and location-based services.

A spatial outlier is a spatially referenced object whose nonspatial attribute values differ significantly from those of other spatially referenced objects in its spatial neighborhood. Informally, a spatial outlier is a local instability (in values of nonspatial attributes), or a spatially referenced object whose nonspatial attributes are extreme relative to its neighbors, even though they may not be significantly different from the entire population. For example, a new house in an old neighborhood of a growing metropolitan area is a spatial outlier based on the nonspatial attribute house age.

Illustrative Examples

We use an example to illustrate the differences between global and spatial outlier detection methods. In figure 19.4(a), the X-axis is the location of data points in one-dimensional space; the Y-axis is the attribute value for each data point. Global outlier detection methods ignore the spatial location of each data point and fit the distribution model to the values of the nonspatial attribute. The outlier detected using this approach is the data point G, which has an extremely high attribute value of 7.9, exceeding the threshold of µ + 2σ = 4.49 + 2 × 1.61 = 7.71, as shown in figure 19.4(b). This test assumes a normal distribution for attribute values. On the other hand, S is a spatial outlier whose observed value is significantly different from those of its neighbors P and Q.

Tests for Detecting Spatial Outliers

Tests to detect spatial outliers separate spatial attributes from nonspatial attributes. Spatial attributes are used to characterize location, neighborhood, and distance. Nonspatial attribute dimensions are used to compare a spatially referenced object to its neighbors. The spatial statistics literature provides two kinds of bipartite multidimensional tests, namely graphical tests and quantitative tests. Graphical tests, which are based on the visualization of spatial data, highlight spatial outliers. Example methods include variogram clouds and Moran scatterplots. Quantitative methods provide a precise test to distinguish spatial outliers from the remainder of the data. Scatterplots (Luc 1994) are a representative technique from the quantitative family.

Figure 19.4: A dataset for outlier detection. (Left: attribute values versus location, with points S, P, Q, D, G, and L marked against a fitting curve; right: histogram of attribute values.)

A variogram cloud (Cressie 1993) displays data points related by neighborhood relationships. For each pair of locations, the square root of the absolute difference between the attribute values at the two locations is plotted against the Euclidean distance between the locations. In datasets exhibiting strong spatial dependence, the variance in the attribute differences will increase with increasing distance between locations. Locations that are near one another, but with large attribute differences, might indicate a spatial outlier, even though the values at both locations may appear to be reasonable when examining the dataset nonspatially. Figure 19.5(a) shows a variogram cloud for the example dataset shown in figure 19.4(a). This plot shows that two pairs, (P, S) and (Q, S), on the left-hand side lie above the main group of pairs and are possibly related to spatial outliers. The point S may be identified as a spatial outlier since it occurs in both pairs (Q, S) and (P, S). However, graphical tests of spatial outlier detection are limited by the lack of precise criteria to distinguish spatial outliers. In addition, a variogram cloud requires nontrivial post-processing of highlighted pairs to separate spatial outliers from their neighbors, particularly when multiple outliers are present or density varies greatly.
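The variogram-cloud computation can be sketched with invented locations and attribute values; location 2 plays the role of S here:

```python
import math

# Invented example: location 2 disagrees with its close neighbors 0 and 1,
# while location 3 is far away from everything.
locs = [(1, 1), (1, 2), (2, 1), (10, 10)]
vals = [5.0, 5.2, 9.0, 5.1]

# For every pair: (Euclidean distance, sqrt of absolute attribute difference).
cloud = []
for i in range(len(locs)):
    for j in range(i + 1, len(locs)):
        d = math.dist(locs[i], locs[j])
        g = math.sqrt(abs(vals[i] - vals[j]))
        cloud.append((d, g, (i, j)))

# Pairs close in space but far apart in attribute value hint at a spatial
# outlier (the distance/difference cutoffs are arbitrary here).
suspect = [p for d, g, p in cloud if d < 2 and g > 1]
print(suspect)  # [(0, 2), (1, 2)]
```

Location 2 appears in both suspect pairs, mirroring how S is singled out in figure 19.5(a) by occurring in both (Q, S) and (P, S).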

A Moran scatterplot (Luc 1995) is a plot of the normalized attribute value Z[f(i)] = (f(i) − µ_f)/σ_f against the neighborhood average of normalized attribute values (W · Z), where W is the row-normalized (i.e., Σ_j W_ij = 1) neighborhood matrix (i.e., W_ij > 0 iff neighbor(i, j)). The upper left and lower right quadrants of figure 19.5(b) indicate a spatial association of dissimilar values: low values surrounded by high-value neighbors (e.g., points P and Q), and high values surrounded by low values (e.g., point S). Thus we can identify points (nodes) that are surrounded by unusually high or low value neighbors.

Figure 19.5: Variogram cloud and Moran scatterplot to detect spatial outliers. (Left: square root of absolute difference of attribute values versus pairwise distance, with pairs (Q, S) and (P, S) lying above the main group; right: weighted neighbor z-score versus z-score of attribute values, highlighting S, P, and Q.)

These points can be treated as spatial outliers.
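The Moran-scatterplot quantities can be sketched as follows, with invented attribute values on a one-dimensional chain of locations; index 2 plays the role of S:

```python
import numpy as np

# Invented attribute values; index 2 is a spatial outlier among low values.
f = np.array([2.0, 2.2, 8.0, 2.1, 1.9])
n = len(f)

# Chain neighborhood: each location's neighbors are i-1 and i+1.
W = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i + 1):
        if 0 <= j < n:
            W[i, j] = 1.0
W /= W.sum(axis=1, keepdims=True)   # row-normalize so each row sums to 1

z = (f - f.mean()) / f.std()        # normalized attribute values Z[f(i)]
wz = W @ z                          # weighted neighbor z-scores W . Z
print(np.round(z, 2))
print(np.round(wz, 2))
```

Point 2 has a positive z-score but a negative weighted neighbor z-score (lower right quadrant), while its neighbors 1 and 3 land in the upper left quadrant, matching the quadrant interpretation above.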

A scatterplot (Luc 1994) shows attribute values on the X-axis and the average of the attribute values in the neighborhood on the Y-axis. A least squares regression line is used to identify spatial outliers. A scatter sloping upward to the right indicates a positive spatial autocorrelation (adjacent values tend to be similar); a scatter sloping upward to the left indicates a negative spatial autocorrelation. The residual is defined as the vertical distance (Y-axis) between a point P with location (X_p, Y_p) and the regression line Y = mX + b, that is, residual ε = Y_p − (mX_p + b). Cases with standardized residuals ε_standard = (ε − µ_ε)/σ_ε greater than 3.0 or less than −3.0 are flagged as possible spatial outliers, where µ_ε and σ_ε are the mean and standard deviation of the distribution of the error term ε. In figure 19.6(a), a scatterplot shows the attribute values plotted against the average of the attribute values in neighboring areas for the dataset in figure 19.4(a). The point S turns out to be the farthest from the regression line and may be identified as a spatial outlier.
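A sketch of the scatterplot test on a small invented sample (neighbors form a chain); with so few points none of the standardized residuals reaches the ±3 cutoff, so the code simply reports them:

```python
import numpy as np

# Invented attribute values on a chain of five locations.
f = np.array([2.0, 2.2, 8.0, 2.1, 1.9])

# Neighborhood average: interior points average their two chain neighbors;
# endpoints use their single neighbor.
nbr_avg = np.array([
    (f[i - 1] + f[i + 1]) / 2 if 0 < i < len(f) - 1
    else (f[1] if i == 0 else f[-2])
    for i in range(len(f))
])

m, b = np.polyfit(f, nbr_avg, 1)          # least squares line Y = mX + b
eps = nbr_avg - (m * f + b)               # vertical residuals
eps_std = (eps - eps.mean()) / eps.std()  # standardized residuals
print(np.round(eps_std, 2))
```

In a realistic dataset the ±3 rule would flag only gross local inconsistencies; on tiny samples the threshold has to be relaxed or the most extreme residual inspected directly.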

A location (sensor) is compared to its neighborhood using the function S(x) = [f(x) − E_{y∈N(x)}(f(y))], where f(x) is the attribute value for a location x, N(x) is the set of neighbors of x, and E_{y∈N(x)}(f(y)) is the average attribute value for the neighbors of x (Shekhar, Lu, and Zhang 2001). The statistic function S(x) denotes the difference between the attribute value of a sensor located at x and the average attribute value of x's neighbors.

Spatial statisticS(x) is normally distributed if the attribute valuef(x)is normally distributed. A popular test for detecting spatial outliers for nor-mally distributedf(x) can be described as follows: Spatial statisticZs(x) =|S(x)−µs

σs| > θ. For each locationx with an attribute valuef(x), theS(x) is the

difference between the attribute value at locationx and the average attribute



[Figure 19.6 panels: (a) a scatter plot of attribute values (X-axis) against average attribute values over the neighborhood (Y-axis), with points S, P, and Q marked; (b) the spatial statistic Zs(x) test, plotting Zs(x) against location, with S, P, and Q marked.]

value of x's neighbors, µ_s is the mean value of S(x), and σ_s is the standard deviation of S(x) over all stations. The choice of θ depends on a specified confidence level. For example, a confidence level of 95 percent leads to θ ≈ 2.

Figure 19.6: Scatterplot and spatial statistic Z_s(x) to detect spatial outliers.

Figure 19.6(b) shows the visualization of the spatial statistic method described above. The X-axis is the location of data points in one-dimensional space; the Y-axis is the value of spatial statistic Z_s(x) for each data point. We can easily observe that point S has a Z_s(x) value exceeding 3, and will be detected as a spatial outlier. Note that the two neighboring points P and Q of S have Z_s(x) values close to −2 due to the presence of spatial outliers in their neighborhoods.
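The Z_s(x) test described above can be sketched directly from its definition. The function name and neighbor encoding below are our own; the statistic follows the text: S(x) is the value at x minus the neighborhood average, standardized over all locations.

```python
import numpy as np

def zs_outliers(values, neighbors, theta=2.0):
    """Spatial outlier test Z_s(x) = |(S(x) - mu_s) / sigma_s| > theta,
    with S(x) = f(x) minus the average of f over x's neighbors
    (after Shekhar, Lu, and Zhang 2001)."""
    f = np.asarray(values, dtype=float)
    # S(x) for every location x.
    s = np.array([f[list(nbrs)].mean() for nbrs in neighbors])
    s = f - s
    # Standardize S(x) using its mean and standard deviation over all locations.
    z = (s - s.mean()) / s.std()
    return np.where(np.abs(z) > theta)[0]
```

With θ = 2 (the 95 percent confidence level mentioned in the text), a single spiked value in an otherwise smooth one-dimensional series is flagged, while its neighbors receive negative Z_s values, mirroring points P and Q in figure 19.6(b).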

19.3.3 Spatial Colocation Rules

Boolean spatial features are geographic object types that are either present or absent at different locations in a two dimensional or three dimensional metric space, e.g., the surface of the earth. Examples of Boolean spatial features include plant species, animal species, road types, cancers, crime, and business types. Colocation patterns represent subsets of the Boolean spatial features whose instances are often located in close geographic proximity. Examples include symbiotic species, e.g., the Nile crocodile and Egyptian plover in ecology, and frontage roads and highways in metropolitan road maps.

Colocation rules are models to infer the presence of Boolean spatial features in the neighborhood of instances of other Boolean spatial features. For example, “Nile Crocodiles → Egyptian Plover” predicts the presence of Egyptian Plover birds in areas with Nile crocodiles. Figure 19.7(a) shows a dataset




Figure 19.7: Illustration of point spatial colocation patterns. Shapes represent different spatial feature types. Spatial features in sets {‘+’, ‘×’} and {‘o’, ‘*’} tend to be located together.

consisting of instances of several Boolean spatial features, each represented by a distinct shape. A careful review reveals two colocation patterns, i.e., (‘+’, ‘×’) and (‘o’, ‘*’).

Colocation rule discovery is a process to identify colocation patterns from large spatial datasets with a large number of Boolean features. The spatial colocation rule discovery problem looks similar to, but, in fact, is very different from the association rule mining problem (Agrawal and Srikant 1994) because of the lack of transactions. In market basket datasets, transactions represent sets of item types bought together by customers. The support of an association is defined to be the fraction of transactions containing the association. Association rules are derived from all the associations with support values larger than a user given threshold. The purpose of mining association rules is to identify frequent item sets for planning store layouts or marketing campaigns. In the spatial colocation rule mining problem, transactions are often not explicit. The transactions in market basket analysis are independent of each other. Transactions are disjoint in the sense of not sharing instances of item types. In contrast, the instances of Boolean spatial features are embedded in a continuous space and share a variety of spatial relationships (e.g., neighbor) with each other.



[Figure 19.8 content: the example dataset has instances A1, A2, B1, B2, C1, C2, with neighboring instances of different features connected by edges. In the data partition approach, the two partitionings shown give Support(A,B) = 1 and Support(A,B) = 2, so support for (A,B) is not well defined, i.e., it is order sensitive. With reference feature C, the transactions are {{B1}, {B2}} and support(A,B) = φ. In the event centric model, Support(A,B) = min(2/2, 2/2) = 1 and Support(B,C) = min(2/2, 2/2) = 1.]

Figure 19.8: Example to illustrate different approaches to discovering colocation patterns. (a) Example dataset. (b) Data partition approach; the support measure is ill-defined and order sensitive. (c) Reference feature centric model. (d) Event centric model.

Colocation Rule Approaches

Approaches to discovering colocation rules in the literature can be categorized into three classes, namely spatial statistics, association rules, and the event centric approach. Spatial statistics-based approaches use measures of spatial correlation to characterize the relationship between different types of spatial features using the cross K function with Monte Carlo simulation and quadrat count analysis (Cressie 1993). Computing spatial correlation measures for all possible colocation patterns can be computationally expensive due to the exponential number of candidate subsets given a large collection of spatial Boolean features.

Association rule-based approaches focus on the creation of transactions over space so that an apriori-like algorithm (Agrawal and Srikant 1994) can be used. Transactions over space can use a reference-feature centric (Koperski and Han 1995) approach or a data-partition (Morimoto 2001) approach. The reference feature centric model is based on the choice of a reference spatial feature and is relevant to application domains focusing on a specific Boolean spatial feature, e.g., incidence of cancer. Domain scientists are interested in finding the colocations of other task relevant features (e.g., asbestos) to the reference feature. A specific example is provided by the spatial association rule presented in Koperski and Han (1995). Transactions are created around instances of one user-specified reference spatial feature. The association rules are derived using the apriori (Agrawal and Srikant 1994) algorithm. The rules found are all related to the reference feature. For example, consider the spatial dataset in figure 19.8(a) with three feature types, A, B, and C. Each feature type has two instances. The neighbor relationships between instances are shown as edges. Colocations (A,B) and (B,C) may be considered to be frequent in this example. Figure 19.8(b) shows transactions created by choosing C as the reference feature. Colocation (A,B) will not be found since it does



Model: reference feature centric.
  Items: predicates on reference and relevant features.
  Transactions defined by: instances of the reference feature involved with C1 and C2.
  Prevalence: fraction of instances of the reference feature with C1 ∪ C2.
  Conditional probability for C1 → C2: Pr(C2 is true for an instance of the reference feature given C1 is true for that instance of the reference feature).

Model: data partitioning.
  Items: Boolean feature types.
  Transactions defined by: a partitioning of the spatial dataset.
  Prevalence: fraction of partitions with C1 ∪ C2.
  Conditional probability for C1 → C2: Pr(C2 in a partition given C1 in that partition).

Model: event centric.
  Items: Boolean feature types.
  Transactions defined by: neighborhoods of instances of feature types.
  Prevalence: participation index of C1 ∪ C2.
  Conditional probability for C1 → C2: Pr(C2 in a neighborhood of C1).

Table 19.2: Interest measures for different models.

not involve the reference feature.
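The reference feature centric construction can be sketched as follows. The code is our own toy illustration, not the algorithm of Koperski and Han (1995): the instance/edge encoding mimics figure 19.8(a), and the edge set used in the test is an assumption consistent with the supports quoted in the figure.

```python
def reference_feature_transactions(instances, edges, reference):
    """Build one transaction per instance of the reference feature.

    instances: dict mapping instance id (e.g., 'C1') to its feature type ('C').
    edges: iterable of frozensets of neighboring instance ids.
    Each transaction is the set of feature types neighboring that instance.
    """
    transactions = []
    for inst, feat in instances.items():
        if feat != reference:
            continue
        txn = {instances[other]
               for e in edges if inst in e
               for other in e if other != inst}
        transactions.append(txn)
    return transactions

def support(transactions, itemset):
    """Fraction of transactions containing every feature type in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)
```

On the figure 19.8(a)-style dataset with reference feature C, the transactions are {{B}, {B}}, so support({A, B}) is 0 and colocation (A,B) is missed, exactly the limitation the text describes.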

The data-partition approach (Morimoto 2001) defines transactions by dividing spatial datasets into disjoint partitions. There may be many distinct ways of partitioning the data, each yielding a distinct set of transactions, which in turn yields different values of support of a given colocation. Figure 19.8(c) shows two possible partitions for the dataset of figure 19.8(a), along with the supports for colocation (A,B).

The event centric model finds subsets of spatial features likely to occur in a neighborhood around instances of given subsets of event types. For example, let us determine the probability of finding at least one instance of feature type B in the neighborhood of an instance of feature type A in figure 19.8(a). There are two instances of type A and both have some instance(s) of type B in their neighborhoods. The conditional probability for the colocation rule spatial feature A at location l → spatial feature type B in neighborhood is 100%. This yields a well-defined prevalence measure (i.e., support) without the need for transactions. Figure 19.8(d) illustrates that our approach will identify both (A,B) and (B,C) as frequent patterns.
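The event centric prevalence measure for a feature pair can be sketched as a participation index: for each of the two features, the fraction of its instances having a neighboring instance of the other feature, minimized over both features. This is a simplified, pairwise sketch of the measure named in table 19.2 (see Shekhar and Huang [2001] for the general definition); the data encoding is our own.

```python
def participation_index(instances, edges, pair):
    """Pairwise participation index under the event centric model.

    instances: dict mapping instance id to feature type.
    edges: iterable of frozensets of neighboring instance ids.
    Returns min over the pair of the fraction of each feature's instances
    that have a neighbor of the other feature type.
    """
    f1, f2 = pair

    def ratio(feat, other):
        ids = [i for i, f in instances.items() if f == feat]
        having = [i for i in ids
                  if any(i in e and any(instances[o] == other
                                        for o in e if o != i)
                         for e in edges)]
        return len(having) / len(ids)

    return min(ratio(f1, f2), ratio(f2, f1))
```

On the figure 19.8(a)-style dataset, both (A,B) and (B,C) score min(2/2, 2/2) = 1, while (A,C) scores 0, matching figure 19.8(d).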

Prevalence measures and conditional probability measures, called interest measures, are defined differently in different models, as summarized in table 19.2. The reference feature centric and data partitioning models “materialize” transactions and thus can use traditional support and confidence measures. The event centric approach defined new transaction free measures, e.g., the participation index (see Shekhar and Huang [2001] for details).



19.3.4 Spatial Clustering

Spatial clustering is a process of grouping a set of spatial objects into clusters so that objects within a cluster have high similarity in comparison to one another, but are dissimilar to objects in other clusters. For example, clustering is used to determine the “hot spots” in crime analysis and disease tracking. Hot spot analysis is the process of finding unusually dense event clusters across time and space. Many criminal justice agencies are exploring the benefits provided by computer technologies to identify crime hot spots in order to take preventive strategies such as deploying saturation patrols in hot spot areas.

Spatial clustering can be applied to group similar spatial objects together; the implicit assumption is that patterns in space tend to be grouped rather than randomly located. However, the statistical significance of spatial clusters should be measured by testing the assumption in the data. The test is critical before proceeding with any serious clustering analyses.

Complete Spatial Randomness, Cluster, and Decluster

In spatial statistics, the standard against which spatial point patterns are often compared is a completely spatially random point process, and departures indicate that the pattern is not distributed randomly in space. Complete spatial randomness (CSR) (Cressie 1993) is synonymous with a homogeneous Poisson process. The patterns of the process are independently and uniformly distributed over space, i.e., the patterns are equally likely to occur anywhere and do not interact with each other. However, patterns generated by a nonrandom process can be either cluster patterns (aggregated patterns) or decluster patterns (uniformly spaced patterns).

To illustrate, figure 19.9 shows realizations from a completely spatially random process, a spatial cluster process, and a spatial decluster process (each conditioned to have 80 points) in a square. Notice in figure 19.9(a) that the complete spatial randomness pattern seems to exhibit some clustering. This is not an unrepresentative realization, but illustrates a well-known property of homogeneous Poisson processes: event-to-nearest-event distances are proportional to χ² random variables with two degrees of freedom, whose densities have a substantial amount of probability near zero (Cressie 1993). Spatial clustering is more statistically significant when the data exhibit a cluster pattern rather than a CSR pattern or decluster pattern.

Several statistical methods can be applied to quantify deviations of patterns from a complete spatial randomness point pattern (Cressie 1993). One type of descriptive statistic is based on quadrats (i.e., well-defined areas, often rectangular in shape). Usually quadrats of random location and orientation are placed, events in the quadrats are counted, and statistics derived from the counts are computed. Another type of statistic is based on distances between patterns; one



Figure 19.9: Illustration of CSR, cluster, and decluster patterns.

such type is Ripley’s K-function (Cressie 1993).

After the verification of the statistical significance of the spatial clustering,

classical clustering algorithms (Han, Kamber, and Tung 2001) can be used to discover interesting clusters.
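A minimal quadrat-based CSR check can be sketched as follows. This is an illustrative sketch, not a method from the chapter: it uses a fixed grid of quadrats rather than randomly placed ones, and compares the chi-square dispersion statistic informally instead of against tabulated critical values.

```python
import random

def quadrat_counts(points, region_size, grid):
    """Count events in each cell of a grid x grid partition of a square region."""
    counts = [0] * (grid * grid)
    cell = region_size / grid
    for x, y in points:
        i = min(int(x / cell), grid - 1)
        j = min(int(y / cell), grid - 1)
        counts[i * grid + j] += 1
    return counts

def dispersion_statistic(counts):
    """Chi-square statistic sum((n_i - mean)^2 / mean); under CSR it is
    approximately chi-square with (#quadrats - 1) degrees of freedom."""
    mean = sum(counts) / len(counts)
    return sum((c - mean) ** 2 / mean for c in counts)
```

A uniformly scattered point set yields a statistic near the degrees of freedom (15 for a 4x4 grid), while a tightly clustered set, with most counts in one cell, yields a statistic orders of magnitude larger.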

19.4 Computational Process

Many generic algorithmic strategies have been generalized to apply to spatial data mining. For example, as shown in table 19.3, algorithmic strategies, such as divide-and-conquer, filter-and-refine, ordering, hierarchical structure, and parameter estimation, have been used in spatial data mining.

In spatial data mining, spatial autocorrelation and low dimensionality in space (e.g., 2-3) provide more opportunities to improve computational efficiency than classical data mining. NASA earth observation systems currently generate a large sequence of global snapshots of the earth, including various atmospheric, land, and ocean measurements such as sea surface temperature, pressure, precipitation, and net primary production. Each climate attribute in



Generic                 | Spatial Data Mining
Divide-and-Conquer      | Space Partitioning
Filter-and-Refine       | Minimum Bounding Rectangle (MBR)
Ordering                | Plane Sweeping, Space Filling Curves
Hierarchical Structures | Spatial Index, Tree Matching
Parameter Estimation    | Parameter estimation with spatial autocorrelation

Table 19.3: Algorithmic strategies in spatial data mining.

a location has a sequence of observations at different time slots, e.g., a collection of monthly temperatures from 1951-2000 in Minneapolis. Finding locations where climate attributes are highly correlated is frequently used to retrieve interesting relationships among spatial objects of earth science data. For example, such queries are used to identify the land locations whose climate is severely affected by El Niño. However, such correlation-based queries are computationally expensive due to the large number of spatial points, e.g., more than 250k spatial cells on the earth at a 0.5 degree by 0.5 degree resolution, and the high dimensionality of sequences, e.g., 600 for the 1951-2000 monthly temperature data.

A spatial indexing approach proposed by Zhang, Huang, Shekhar, and Kumar (2003) exploits spatial autocorrelation to facilitate correlation-based queries. The approach groups similar time series together based on spatial proximity and constructs a search tree. The queries are processed using the search tree in a filter-and-refine style at the group level instead of at the time series level. Algebraic analyses using cost models and experimental evaluations showed that the proposed approach saves a large portion of computational cost, ranging from 40% to 98% (see Zhang, Huang, Shekhar, and Kumar [2003] for details).
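The group-level filter-and-refine idea can be sketched in simplified form. This is not the actual index of Zhang et al. (2003): spatial proximity is approximated here by grouping consecutive series, and the pruning bound relies on the fact that for zero-mean, unit-norm series correlation equals a dot product, so by the Cauchy-Schwarz inequality a series can beat its group centroid's correlation by at most the group radius.

```python
import numpy as np

def normalize(ts):
    """Zero-mean, unit-norm version of a time series; the correlation of
    two normalized series is simply their dot product."""
    v = np.asarray(ts, dtype=float)
    v = v - v.mean()
    return v / np.linalg.norm(v)

def build_groups(series, group_size):
    """Group (here: consecutive) normalized series; keep each group's
    centroid and radius for the filter step."""
    groups = []
    for i in range(0, len(series), group_size):
        block = [normalize(s) for s in series[i:i + group_size]]
        centroid = np.mean(block, axis=0)
        radius = max(np.linalg.norm(b - centroid) for b in block)
        groups.append((i, block, centroid, radius))
    return groups

def correlated_locations(groups, query, threshold):
    """Filter-and-refine: prune a whole group when even its best possible
    member cannot reach the correlation threshold."""
    q = normalize(query)
    hits = []
    for start, block, centroid, radius in groups:
        # For unit q: |q.s - q.centroid| <= ||s - centroid|| <= radius.
        if np.dot(q, centroid) + radius < threshold:
            continue                      # filter step: skip the whole group
        for j, s in enumerate(block):     # refine step: exact correlations
            if np.dot(q, s) >= threshold:
                hits.append(start + j)
    return hits
```

The filter step checks one centroid per group instead of every series, which is where the reported 40% to 98% savings would come from when groups are spatially coherent.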

19.5 Research Needs

In this section, we discuss some areas where further research is needed in spatial data mining.

Comparison of classical data mining techniques with spatial data mining techniques: As we discussed in section 19.1, relationships among spatial objects are often implicit. It is possible to materialize the implicit relationships into traditional data input columns and then apply classical data mining techniques (Quinlan 1993, Barnett and Lewis 1994, Agrawal and Srikant 1994, Jain and Dubes 1988). Another way to deal with implicit relationships is to use specialized spatial data mining techniques, e.g., the spatial autoregression and colocation mining. However, existing literature does not provide guidance



Predictive model: classification accuracy (classical); spatial accuracy (spatial).
Cluster: low coupling and high cohesion in feature space (classical); spatial continuity, unusual density, boundary (spatial).
Outlier: different from population or neighbors in feature space (classical); significant attribute discontinuity in geographic space (spatial).
Association: subset prevalence and Pr[B ∈ T | A ∈ T, T: a transaction] (classical); spatial pattern prevalence and Pr[B ∈ N(A) | N: neighborhood] (spatial).
Correlation: cross K-function (spatial).

Table 19.4: Interest measures of patterns for classical data mining and spatial data mining.

regarding the choice between classical data mining techniques and spatial data mining techniques to mine spatial data. Therefore new research is needed to compare the two sets of approaches in effectiveness and computational efficiency.

Modeling semantically rich spatial properties, such as topology: The spatial relationship among locations in a spatial framework is often modeled via a contiguity matrix using a neighborhood relationship defined using adjacency and distance. However, spatial connectivity and other complex spatial topological relationships in spatial networks are difficult to model using the contiguity matrix. Research is needed to evaluate the value of enriching the contiguity matrix beyond the neighborhood relationship.
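The adjacency-based contiguity matrix the paragraph refers to can be sketched as follows; the row-normalized form shown (each nonempty row summing to 1) is one common convention, and the function name is our own.

```python
import numpy as np

def contiguity_matrix(neighbors):
    """Row-normalized contiguity matrix W from adjacency lists:
    W[i, j] > 0 iff j is a neighbor of i, and each nonempty row sums to 1."""
    n = len(neighbors)
    w = np.zeros((n, n))
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            w[i, j] = 1.0
        if nbrs:
            w[i] /= len(nbrs)
    return w
```

Every entry encodes only "is adjacent to," which is exactly the limitation noted above: network connectivity or richer topological relationships would require weights beyond this simple neighborhood indicator.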

Statistical interpretation models for spatial patterns: Spatial patterns, such as spatial outliers and colocation rules, are identified in the spatial data mining process using unsupervised learning methods. There is a need for an independent measure of the statistical significance of such spatial patterns. For example, we may compare the colocation model with dedicated spatial statistical measures, such as Ripley’s K-function, characterize the distribution of the participation index interest measure under complete spatial randomness using Monte Carlo simulation, and develop a statistical interpretation of colocation rules to compare the rules with other patterns in unsupervised learning.

Another challenge is the estimation of the detailed spatial parameters in a statistical model. Research is needed to design effective estimation procedures for the contiguity matrices used in the spatial autoregressive model and Markov random field-based Bayesian classifiers from learning samples.

Spatial interest measures: The interest measures of patterns in spatial data mining are different from those in classical data mining, especially regarding the four important output patterns shown in table 19.4.

For a two-class problem, the standard way to measure classification accuracy is to calculate the percentage of correctly classified objects. However,



[Figure 19.10 legend: P = predicted nest in pixel; A = actual nest in pixel; panels (a)-(d) show nest locations and predictions on a pixel grid.]

Figure 19.10: (a) The actual locations of nests. (b) Pixels with actual nests. (c) Locations predicted by a model. (d) Locations predicted by another model. Prediction (d) is spatially more accurate than (c).

this measure may not be the most suitable in a spatial context. Spatial accuracy, i.e., how far the predictions are from the actuals, is equally important in this application domain due to the effects of the discretizations of a continuous wetland into discrete pixels, as shown in figure 19.10. Figure 19.10(a) shows the actual locations of nests and 19.10(b) shows the pixels with actual nests. Note the loss of information during the discretization of continuous space into pixels. Many nest locations barely fall within the pixels labeled “A” and are quite close to other blank pixels, which represent “no-nest.” Now consider the two predictions shown in figures 19.10(c) and 19.10(d). Domain scientists prefer prediction 19.10(d) over 19.10(c), since the predicted nest locations are closer on average to some actual nest locations. However, the classification accuracy measure cannot distinguish between 19.10(c) and 19.10(d) since spatial accuracy is not incorporated in the classification accuracy measure. Hence, there is a need to investigate proper measures for location prediction to improve spatial accuracy.
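The contrast between the two measures can be made concrete with a small sketch. The specific distance-to-nearest-actual proxy below is our own illustration of "spatial accuracy," not a measure defined in the chapter.

```python
import math

def classification_accuracy(pred, actual):
    """Fraction of pixels labeled correctly (the classical measure)."""
    return sum(p == a for p, a in zip(pred, actual)) / len(actual)

def avg_distance_to_nearest_actual(pred_sites, actual_sites):
    """One possible spatial accuracy proxy: mean Euclidean distance from
    each predicted site to its nearest actual site (lower is better)."""
    return sum(min(math.dist(p, a) for a in actual_sites)
               for p in pred_sites) / len(pred_sites)
```

Two predictions that each mislabel the same number of pixels get identical classification accuracy, yet the one whose predicted site lies nearer the actual nest scores strictly better on the distance proxy, which is the distinction between figures 19.10(c) and 19.10(d).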

Effective visualization of spatial relationships: Visualization in spatial data mining is useful to identify interesting spatial patterns. As we discussed in section 19.1, the data inputs of spatial data mining have both spatial and nonspatial features. To facilitate the visualization of spatial relationships, research is needed on ways to represent both spatial and nonspatial features.

For example, many visual representations have been proposed for spatial outliers. However, we do not yet have a way to highlight spatial outliers within visualizations of spatial relationships. For instance, in variogram cloud (figure 19.5(a)) and scatterplot (figure 19.6(a)) visualizations, the spatial relationship between a single spatial outlier and its neighbors is not obvious. It is necessary to transfer the information back to the original map in geographic space to check neighbor relationships. As a single spatial outlier tends to flag not only the spatial location of local instability but also its neighboring locations, it is important to group flagged locations and identify real spatial outliers from the group in the post-processing step.

Improving computational efficiency: Mining spatial patterns is often computationally expensive. For example, the estimation of the parameters for the spatial autoregressive model is an order of magnitude more expensive than that for the linear regression in classical data mining. Similarly, the colocation mining algorithm is more expensive than the apriori algorithm for classical association rule mining (Agrawal and Srikant 1994). Research is needed to reduce the computational costs of spatial data mining algorithms by a variety of approaches, including the classical data mining algorithms as potential filters or components.

Input: simple types with explicit relationships (classical); complex types with implicit relationships (spatial).
Statistical foundation: independence of samples (classical); spatial autocorrelation (spatial).
Output: set-based interest measures, e.g., classification accuracy (classical); spatial interest measures, e.g., spatial accuracy (spatial).
Computational process: combinatorial optimization and numerical algorithms (classical); computational efficiency opportunities from spatial autocorrelation and plane-sweeping, and new complexity from SAR and colocation (spatial).

Table 19.5: Difference between classical data mining and spatial data mining.

Preprocessing spatial data: Spatial data mining techniques have been widely applied to the data in many application domains. However, research on the preprocessing of spatial data has lagged behind. Hence, there is a need for preprocessing techniques for spatial data to deal with problems such as treatment of missing location information and imprecise location specifications, cleaning of spatial data, feature selection, and data transformation.

19.6 Summary

In this chapter we have presented the features of spatial data mining that distinguish it from classical data mining in the following four categories: input, statistical foundation, output, and computational process, as shown in table 19.5. We have discussed major research accomplishments and techniques in spatial data mining, especially those related to four important output patterns: predictive models, spatial outliers, spatial colocation rules, and spatial clusters. We have also identified research needs for spatial data mining.



Acknowledgments

This work was supported in part by the Army High Performance Computing Research Center under the auspices of the Department of the Army, Army Research Laboratory cooperative agreement number DAAD19-01-2-0014, the content of which does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred.

We are particularly grateful to our collaborators Prof. Vipin Kumar, Prof. Paul Schrater, Dr. Sanjay Chawla, Dr. Chang-Tien Lu, Dr. Weili Wu, and Prof. Uygar Ozesmi for their various contributions. We also thank Xiaobin Ma, Hui Xiong, Jin Soung Yoo, Qingsong Lu, Baris Kazar, and anonymous reviewers for their valuable feedback on early versions of this chapter. We would also like to express our thanks to Kim Koffolt for improving the readability of this chapter.

Shashi Shekhar is a professor of computer science at the University of Minnesota.

Pusheng Zhang is a Ph.D. candidate in the Department of Computer Science and Engineering at the University of Minnesota.

Yan Huang is an assistant professor of computer science and engineering at the University of North Texas.

Ranga Raju Vatsavai is a research fellow at the Army High Performance Computing Research Center, University of Minnesota.