Clustering An overview of clustering algorithms Dènis de Keijzer GIA 2004


Clustering

An overview of clustering algorithms

Dènis de Keijzer

GIA 2004

Overview

Algorithms
  GRAVIclust
  AUTOCLUST
  AUTOCLUST+
  3D Boundary-based Clustering
  SNN

Gravity based spatial clustering

GRAVIclust
  Initialisation Phase
    calculate the initial cluster centres
  Optimisation Phase
    improve the position of the cluster centres so as to achieve a solution which minimises the distance function

$$ \sum_{i=1}^{k} \sum_{p \in C_i} d(p, L_i) $$

where C_i is the i-th cluster and L_i its centre

GRAVIclust: Initialisation Phase

Input:
  set of points P
  matrix of distances between all pairs of points
    assumption: the actual access path distance exists in GIS maps, e.g. http://www.transinfo.qld.gov.au
    very versatile: footpath, road map, rail map
  number of required clusters k

GRAVIclust: Initialisation Phase

Step 1: calculate the first initial centre
  the point with the largest number of points within radius r
  remove the first initial centre and all points within radius r from further consideration

Step 2: repeat Step 1 until k initial centres have been chosen

Step 3: create initial clusters by assigning all points to the closest cluster centre
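
The three initialisation steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `initial_centres` and `assign` are hypothetical, "within radius r" is assumed to be inclusive, ties are assumed to go to the lowest point index, and the radius r is taken as given.

```python
def initial_centres(dist, k, r):
    """Pick up to k initial centres from a full distance matrix `dist`."""
    remaining = set(range(len(dist)))
    centres = []
    # The loop also stops early if no points are left (assumption).
    while len(centres) < k and remaining:
        # Step 1: the remaining point with the most remaining points within r.
        # Iterating in sorted order makes ties go to the lowest index.
        best = max(sorted(remaining),
                   key=lambda p: sum(1 for q in remaining if dist[p][q] <= r))
        centres.append(best)
        # Remove the centre and all points within r from further consideration.
        remaining -= {q for q in remaining if dist[best][q] <= r}
    return centres


def assign(dist, centres):
    """Step 3: assign every point to its closest cluster centre."""
    clusters = {c: [] for c in centres}
    for p in range(len(dist)):
        clusters[min(centres, key=lambda c: dist[p][c])].append(p)
    return clusters


# Six points on a line, forming two obvious groups.
coords = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in coords] for a in coords]
centres = initial_centres(dist, k=2, r=1.5)
clusters = assign(dist, centres)
```

Because the function only reads the distance matrix, the same sketch works with access path distances from a road or rail map instead of straight-line distances.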

GRAVIclust: radius calculation

Radius r
  calculated based on the area of the region considered for clustering
  static radius
    based on the assumption that all clusters are of the same size
  dynamic radius
    recalculated after each initial cluster centre is chosen

$$ r = \sqrt{\frac{A}{\pi}}, \qquad A = \frac{\text{area of minimum bounding rectangle}}{\text{\# required clusters}} $$
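
As a worked example of the static radius, the sketch below reads the slide's formula as r = sqrt(A / π), with A the area of the minimum bounding rectangle divided by the number of required clusters. The function name `static_radius` is hypothetical, and the bounding rectangle is assumed to be axis-aligned.

```python
import math


def static_radius(points, k):
    """Static radius for k clusters over a list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Area of the axis-aligned minimum bounding rectangle (assumption).
    area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    # Each of the k clusters is allotted a circle of area A = area / k.
    a = area / k
    return math.sqrt(a / math.pi)


# A 4 x pi rectangle split into 4 clusters gives A = pi, hence r = 1.
corners = [(0, 0), (4, 0), (4, math.pi), (0, math.pi)]
r = static_radius(corners, k=4)
```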

GRAVIclust: Static vs. Dynamic

Static
  reduced computation: the number of points within radius r has to be calculated only once
  not suitable for problems where the points are separated by large empty areas

Dynamic
  increases computation time
  ensures the radius is adjusted as the points are removed

The two differ only when the distribution is non-uniform

GRAVIclust: Optimisation Phase

Step 1: for each cluster, calculate the new centre
  based on the point closest to the cluster's centre of gravity

Step 2: re-assign points to the new cluster centres

Step 3: recalculate the distance function
  never greater than the previous value

Step 4: repeat Steps 1 to 3 until the value of the distance function equals the previous value
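
A minimal sketch of the optimisation loop follows. It assumes 2D coordinates are available for computing the centre of gravity, that no two points coincide, and that the loop should also stop if the distance function fails to improve. The name `optimise` is hypothetical.

```python
import math


def optimise(points, dist, centres):
    """Refine cluster centres (point indices) until the cost stops improving."""

    def assign(cs):
        clusters = {c: [] for c in cs}
        for p in range(len(points)):
            clusters[min(cs, key=lambda c: dist[p][c])].append(p)
        return clusters

    def cost(clusters):
        # The distance function: sum of distances from points to their centre.
        return sum(dist[p][c] for c, members in clusters.items() for p in members)

    clusters = assign(centres)
    previous = cost(clusters)
    while True:
        new_centres = []
        for members in clusters.values():
            # Step 1: centre of gravity, then the member point closest to it.
            gx = sum(points[p][0] for p in members) / len(members)
            gy = sum(points[p][1] for p in members) / len(members)
            new_centres.append(
                min(members, key=lambda p: math.hypot(points[p][0] - gx,
                                                      points[p][1] - gy)))
        # Steps 2 and 3: re-assign the points and recalculate the cost.
        new_clusters = assign(new_centres)
        current = cost(new_clusters)
        # Step 4: stop when the cost no longer decreases.
        if current >= previous:
            return clusters, previous
        clusters, previous = new_clusters, current


points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
dist = [[math.hypot(a[0] - b[0], a[1] - b[1]) for b in points] for a in points]
clusters, total = optimise(points, dist, centres=[0, 3])
```

Starting from the poorly placed centres 0 and 3, the centres move to the middle point of each group and the distance function drops from 6 to 4.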


GRAVIclust

Deterministic

Can handle obstacles

Monotonic convergence of the distance function to a stable point

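AUTOCLUST

Definitions

$$ \mathrm{LocalMean}(p_i) = \sum_{e \in N(p_i)} |e| \,/\, \|N(p_i)\| = \sum_{j=1}^{d(p_i)} |e_j| \,/\, d(p_i) $$

$$ \mathrm{LocalStDev}(p_i) = \sqrt{\sum_{j=1}^{d(p_i)} \left( \mathrm{LocalMean}(p_i) - |e_j| \right)^2 / \, d(p_i)} $$

$$ \mathrm{MeanStDev}(P) = \sum_{i=1}^{n} \mathrm{LocalStDev}(p_i) \,/\, n $$

$$ \mathrm{RelativeStDev}(p_i) = \mathrm{LocalStDev}(p_i) \,/\, \mathrm{MeanStDev}(P) $$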
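
The four statistics can be computed directly from an edge list. The sketch below assumes the edges of the Delaunay Diagram are already available as index pairs and that every point has at least one incident edge. The name `local_stats` and the input layout are hypothetical.

```python
import math


def local_stats(points, edges):
    """Return LocalMean, LocalStDev, MeanStDev and RelativeStDev.

    points: list of (x, y); edges: iterable of (i, j) index pairs.
    """
    incident = {i: [] for i in range(len(points))}
    for i, j in edges:
        length = math.dist(points[i], points[j])
        incident[i].append(length)
        incident[j].append(length)

    local_mean, local_stdev = {}, {}
    for i, lengths in incident.items():
        d = len(lengths)  # degree d(p_i)
        local_mean[i] = sum(lengths) / d
        local_stdev[i] = math.sqrt(
            sum((local_mean[i] - e) ** 2 for e in lengths) / d)

    mean_stdev = sum(local_stdev.values()) / len(points)
    # Guard against division by zero when all edges have equal length.
    relative = {i: (s / mean_stdev if mean_stdev else 0.0)
                for i, s in local_stdev.items()}
    return local_mean, local_stdev, mean_stdev, relative


# Three collinear points joined by edges of length 1 and 3.
pts = [(0, 0), (1, 0), (4, 0)]
local_mean, local_stdev, mean_stdev, relative = local_stats(pts, [(0, 1), (1, 2)])
```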

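AUTOCLUST

Definitions II

$$ \mathrm{ShortEdges}(p_i) = \{\, e_j \in N(p_i) \mid |e_j| < \mathrm{LocalMean}(p_i) - \mathrm{MeanStDev}(P) \,\} $$

$$ \mathrm{LongEdges}(p_i) = \{\, e_j \in N(p_i) \mid |e_j| > \mathrm{LocalMean}(p_i) + \mathrm{MeanStDev}(P) \,\} $$

$$ \mathrm{OtherEdges}(p_i) = N(p_i) - \left( \mathrm{ShortEdges}(p_i) \cup \mathrm{LongEdges}(p_i) \right) $$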
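
The classification can be sketched as follows, reading ShortEdges as the edges shorter than LocalMean minus MeanStDev and LongEdges as those longer than LocalMean plus MeanStDev. The name `classify_edges` and the dictionary layout are hypothetical.

```python
def classify_edges(incident, local_mean, mean_stdev):
    """Split each point's incident edges into short, long and other.

    incident: {point: {neighbour: edge_length}}
    Returns {point: (short, long, other)} as sets of neighbours.
    """
    result = {}
    for p, neighbours in incident.items():
        low = local_mean[p] - mean_stdev
        high = local_mean[p] + mean_stdev
        short = {q for q, length in neighbours.items() if length < low}
        long_ = {q for q, length in neighbours.items() if length > high}
        other = set(neighbours) - short - long_
        result[p] = (short, long_, other)
    return result


# Point 0 has three edges of lengths 1, 5 and 3; its LocalMean is 3.
groups = classify_edges({0: {1: 1.0, 2: 5.0, 3: 3.0}},
                        local_mean={0: 3.0}, mean_stdev=1.0)
```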

AUTOCLUST

Phase 1: finding boundaries

Phase 2: restoring and re-attaching

Phase 3: detecting second-order inconsistency

AUTOCLUST: Phase 1

Finding boundaries
  Calculate
    Delaunay Diagram
    for each point p_i
      ShortEdges(p_i)
      LongEdges(p_i)
      OtherEdges(p_i)
  Remove ShortEdges(p_i) and LongEdges(p_i)

AUTOCLUST: Phase 2

Restoring and re-attaching
  for each point p_i where ShortEdges(p_i) ≠ ∅
    Determine a candidate connected component C for p_i
      If there are 2 edges e_j = (p_i, p_j) and e_k = (p_i, p_k) in ShortEdges(p_i) with CC[p_j] ≠ CC[p_k], then
        Compute, for each edge e = (p_i, p_j) ∈ ShortEdges(p_i), the size ||CC[p_j]|| and let M be the maximum of ||CC[p_j]|| over all edges e = (p_i, p_j) ∈ ShortEdges(p_i)
        Let C be the class label of the largest connected component (if there are two different connected components with cardinality M, we let C be the one with the shortest edge to p_i)
      Otherwise, let C be the label of the connected component that all edges e ∈ ShortEdges(p_i) connect p_i to
    If the edges in OtherEdges(p_i) connect to a connected component different from C, remove them. Note that all edges in OtherEdges(p_i) are removed, and only in this case will p_i swap connected components
    Add all edges e ∈ ShortEdges(p_i) that connect to C

AUTOCLUST: Phase 3

Detecting second-order inconsistency
  compute the LocalMean for 2-neighbourhoods
  remove all edges in N_{2,G}(p_i) that are long edges

$$ \mathrm{LocalMean}_{2,G}(p_i) = \sum_{e \in N_{2,G}(p_i)} |e| \,/\, \|N_{2,G}(p_i)\| $$

$$ |e| > \mathrm{LocalMean}_{2,G}(p_i) + \mathrm{MeanStDev}(P) $$
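
The second-order test can be sketched as below. It assumes that the 2-neighbourhood N_{2,G}(p_i) consists of the edges on paths of at most two edges starting at p_i in the current graph G, a reading the slide does not spell out. The function name and data layout are hypothetical.

```python
def second_order_long_edges(graph, lengths, mean_stdev, p):
    """Return the edges in the 2-neighbourhood of p that count as long.

    graph: {point: set of neighbours}
    lengths: {frozenset({a, b}): edge length}
    """
    # Collect edges on paths of one or two edges starting at p (assumption).
    edges = set()
    for q in graph[p]:
        edges.add(frozenset((p, q)))
        for s in graph[q]:
            edges.add(frozenset((q, s)))
    if not edges:
        return set()
    local_mean_2 = sum(lengths[e] for e in edges) / len(edges)
    return {e for e in edges if lengths[e] > local_mean_2 + mean_stdev}


# Path 0 - 1 - 2 - 3 with edge lengths 1, 5 and 1.
graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
lengths = {frozenset((0, 1)): 1.0, frozenset((1, 2)): 5.0, frozenset((2, 3)): 1.0}
long_edges = second_order_long_edges(graph, lengths, mean_stdev=1.0, p=0)
```

From point 0 the 2-neighbourhood holds the edges of length 1 and 5, so its mean is 3 and only the edge of length 5 exceeds 3 + 1.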


AUTOCLUST

No user-supplied arguments
  eliminates expensive human-based exploration time for finding best-fit arguments

Robust to noise, outliers, bridges and type of distribution

Able to detect clusters with arbitrary shapes, different sizes and different densities

Can handle multiple bridges

O(n log n)


AUTOCLUST+

Construct Delaunay Diagram

Calculate MeanStDev(P)

For all edges e, remove e if it intersects some obstacles

Apply the 3 phases of AUTOCLUST to the planar graph resulting from the previous steps
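
The obstacle step can be sketched with a standard segment intersection test. The sketch assumes obstacles are given as line segments, which the slide does not specify, and that an edge merely touching an obstacle counts as intersecting it. All function names are hypothetical.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1 or 0 if collinear."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def _in_box(a, b, c):
    """True if c lies inside the bounding box of segment ab."""
    return (min(a[0], b[0]) <= c[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= c[1] <= max(a[1], b[1]))


def segments_intersect(p, q, a, b):
    """True if segment pq and segment ab share at least one point."""
    o1, o2 = _orient(p, q, a), _orient(p, q, b)
    o3, o4 = _orient(a, b, p), _orient(a, b, q)
    if o1 != o2 and o3 != o4:
        return True
    # Collinear cases: an endpoint lies on the other segment.
    return ((o1 == 0 and _in_box(p, q, a)) or (o2 == 0 and _in_box(p, q, b))
            or (o3 == 0 and _in_box(a, b, p)) or (o4 == 0 and _in_box(a, b, q)))


def remove_blocked_edges(points, edges, obstacles):
    """Keep only the edges that do not intersect any obstacle segment."""
    return [(i, j) for i, j in edges
            if not any(segments_intersect(points[i], points[j], a, b)
                       for a, b in obstacles)]


pts = [(0, 0), (2, 0), (0, 2)]
edges = [(0, 1), (0, 2), (1, 2)]
wall = [((1, -1), (1, 0.5))]  # a short wall crossing the edge between 0 and 1
kept = remove_blocked_edges(pts, edges, wall)
```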

3D Boundary-based Clustering

Benefits from 3D Clustering
  more accurate spatial analysis
  distinguish
    positive clusters: clusters in higher dimensions but not in lower dimensions
    negative clusters: clusters in lower dimensions but not in higher dimensions

3D Boundary-based Clustering

Based on AUTOCLUST

Uses Delaunay Tetrahedrizations

Definitions:
  e_j is a potential inter-cluster edge if:

$$ |e_j| > \mathrm{LocalMean}(p_i) + l \cdot \mathrm{LocalStDev}(p_i) $$

$$ l = m \cdot \mathrm{RelativeStDev}(p_i)^{-1} = m \cdot \mathrm{MeanStDev}(P) \,/\, \mathrm{LocalStDev}(p_i) $$

$$ \mathrm{LocalMean}(p_i) - m \cdot \mathrm{MeanStDev}(P) \le AI(p_i) \le \mathrm{LocalMean}(p_i) + m \cdot \mathrm{MeanStDev}(P) $$

3D Boundary-based Clustering

Phase I
  For all p_i ∈ P, classify each edge e_j incident to p_i into one of three groups
    ShortEdges(p_i) when the length of e_j is less than the range in AI(p_i)
    LongEdges(p_i) when the length of e_j is greater than the range in AI(p_i)
    OtherEdges(p_i) when the length of e_j is within AI(p_i)
  For all p_i ∈ P, remove all edges in ShortEdges(p_i) and LongEdges(p_i)
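
A minimal sketch of Phase I in 3D follows, assuming the tetrahedrization is already available as 4-tuples of point indices and that AI(p_i) is the interval LocalMean(p_i) ± m · MeanStDev(P). The names `tetra_edges` and `classify_edge` and the default m = 1 are hypothetical.

```python
from itertools import combinations


def tetra_edges(tetrahedra):
    """Unique edges (i, j), i < j, of a list of tetrahedra given as 4-tuples."""
    return sorted({tuple(sorted(pair))
                   for tetra in tetrahedra
                   for pair in combinations(tetra, 2)})


def classify_edge(length, local_mean, mean_stdev, m=1.0):
    """Place an edge length relative to the interval AI(p_i)."""
    low = local_mean - m * mean_stdev
    high = local_mean + m * mean_stdev
    if length < low:
        return "short"
    if length > high:
        return "long"
    return "other"


# Two tetrahedra sharing the face (1, 2, 3) have 9 distinct edges.
edges = tetra_edges([(0, 1, 2, 3), (1, 2, 3, 4)])
```

A larger m widens the interval, so fewer edges are classified as short or long and removed.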

3D Boundary-based Clustering

Phase II
  Recuperate ShortEdges(p_i) incident to border points using connected component analysis

Phase III
  Remove exceptionally long edges in local regions

$$ |e_j| > \mathrm{LocalMean}_{2,G}(p_i) + m \cdot \mathrm{MeanStDev}(P) $$

Shared Nearest Neighbour

Clustering in higher dimensions
  Distances or similarities between points become more uniform, making clustering more difficult
  Also, similarity between points can be misleading
    i.e. a point can be more similar to a point that “actually” belongs to a different cluster

Solution
  Shared nearest neighbour approach to similarity

SNN: An alternative definition of similarity

Euclidean distance
  most common distance metric used
  while useful in low dimensions, it doesn’t work well in high dimensions

      A1  A2  A3  A4  A5  A6  A7  A8  A9  A10
  P1   3   0   0   0   0   0   0   0   0   0
  P2   0   0   0   0   0   0   0   0   0   4
  P3   3   2   4   0   1   2   3   1   2   0
  P4   0   2   4   0   1   2   3   1   2   4
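
The table can be checked numerically. In the sketch below, P1 and P2 have no non-zero attribute in common, while P3 and P4 share seven non-zero attribute values, yet both pairs end up at the same Euclidean distance. The variable names are illustrative only.

```python
import math

P = {
    "P1": (3, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    "P2": (0, 0, 0, 0, 0, 0, 0, 0, 0, 4),
    "P3": (3, 2, 4, 0, 1, 2, 3, 1, 2, 0),
    "P4": (0, 2, 4, 0, 1, 2, 3, 1, 2, 4),
}

# Both pairs differ only in A1 (by 3) and A10 (by 4).
d12 = math.dist(P["P1"], P["P2"])  # sqrt(3**2 + 4**2) = 5
d34 = math.dist(P["P3"], P["P4"])  # sqrt(3**2 + 4**2) = 5
```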

SNN: An alternative definition of similarity

Define similarity in terms of their shared nearest neighbours
  the similarity of the points is “confirmed” by their common shared nearest neighbours

$$ \mathrm{similarity}(p, q) = \mathrm{size}\left( NN(p) \cap NN(q) \right) $$
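
The similarity can be sketched as the size of the intersection of two nearest-neighbour lists. The helper names `knn` and `snn_similarity` are hypothetical, and a point is assumed not to be its own neighbour.

```python
def knn(dist, k):
    """k nearest neighbours of every point, from a full distance matrix."""
    n = len(dist)
    return [set(sorted((q for q in range(n) if q != p),
                       key=lambda q: dist[p][q])[:k])
            for p in range(n)]


def snn_similarity(nn, p, q):
    """Number of nearest neighbours that p and q have in common."""
    return len(nn[p] & nn[q])


coords = [0, 1, 2, 3, 10]
dist = [[abs(a - b) for b in coords] for a in coords]
nn = knn(dist, k=2)
```

With k = 2, points 0 and 3 both list points 1 and 2 as neighbours, so their SNN similarity is 2 even though they are not each other's nearest neighbours.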

SNN: An alternative definition of density

SNN similarity, with the k-nearest neighbour approach
  if the k-th nearest neighbour of a point, with respect to SNN similarity, is close, then we say that there is a high density at this point
  since it reflects the local configuration of the points in the data space, it is relatively insensitive to variations in density and the dimensionality of the space

SNN: Algorithm

Compute the similarity matrix
  corresponds to a similarity graph with data points for nodes and edges whose weights are the similarities between data points

Sparsify the similarity matrix by keeping only the k most similar neighbours
  corresponds to keeping only the k strongest links of the similarity graph

Construct the shared nearest neighbour graph from the sparsified similarity matrix

Find the SNN density of each point

Find the core points

Form clusters from the core points

Discard all noise points

Assign all non-noise, non-core points to clusters
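
The steps above can be sketched end to end. The slides do not give the thresholds, so this sketch assumes two hypothetical parameters: `eps`, the minimum number of shared neighbours for a strong link, and `min_pts`, the minimum number of strong links for a core point. It also assumes that two points are linked only when each appears in the other's neighbour list.

```python
def snn_cluster(dist, k, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise."""
    n = len(dist)
    # Similarity matrix and sparsification: keep the k nearest neighbours.
    nn = [set(sorted((q for q in range(n) if q != p),
                     key=lambda q: dist[p][q])[:k])
          for p in range(n)]

    # SNN graph: weight = shared neighbours, for mutual neighbours only.
    def sim(p, q):
        if p in nn[q] and q in nn[p]:
            return len(nn[p] & nn[q])
        return 0

    # SNN density: number of strong links of each point.
    density = [sum(1 for q in range(n) if q != p and sim(p, q) >= eps)
               for p in range(n)]
    # Core points: density at or above min_pts.
    core = [p for p in range(n) if density[p] >= min_pts]

    # Form clusters: core points joined by strong links share a label.
    labels = {}
    next_label = 0
    for c in core:
        if c in labels:
            continue
        labels[c] = next_label
        stack = [c]
        while stack:
            u = stack.pop()
            for v in core:
                if v not in labels and sim(u, v) >= eps:
                    labels[v] = next_label
                    stack.append(v)
        next_label += 1

    # Noise is discarded (-1); other points join their most similar core point.
    for p in range(n):
        if p in labels:
            continue
        best = max(core, key=lambda c: sim(p, c), default=None)
        if best is not None and sim(p, best) >= eps:
            labels[p] = labels[best]
        else:
            labels[p] = -1
    return [labels[p] for p in range(n)]


# Two tight groups and one far outlier on a line.
coords = [0, 1, 2, 3, 20, 21, 22, 23, 100]
dist = [[abs(a - b) for b in coords] for a in coords]
labels = snn_cluster(dist, k=3, eps=2, min_pts=2)
```

The number of clusters is not passed in: it falls out of how the core points connect. The outlier appears in no other point's neighbour list, so it has no strong links and is marked as noise.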


Shared Nearest Neighbour

Finds clusters of varying shapes, sizes, and densities, even in the presence of noise and outliers

Handles data of high dimensionality and varying densities

Automatically detects the number of clusters