Clustering

An overview of clustering algorithms

Dènis de Keijzer

GIA 2004

Overview

Algorithms
  GRAVIclust
  AUTOCLUST
  AUTOCLUST+
  3D Boundary-based Clustering
  SNN

Gravity based spatial clustering

GRAVIclust
  Initialisation Phase
    calculate the initial cluster centres
  Optimisation Phase
    improve the position of the cluster centres so as to achieve a solution which minimizes the distance function

$$\sum_{i=1}^{k} \sum_{p \in C_i} d(p, L_i)$$

where $C_i$ is the $i$-th cluster and $L_i$ its centre.

GRAVIclust: Initialisation Phase

Input:
  set of points P
  matrix of distances between all pairs of points
    assumption: actual access path distance
    exists in GIS maps, e.g. http://www.transinfo.qld.gov.au
    very versatile: footpath, road map, rail map
  # of required clusters k


Step 1: calculate first initial centre
  the point with the largest number of points within radius r
  remove first initial centre & all points within radius r from further consideration

Step 2: repeat Step 1 until k initial centres have been chosen

Step 3: create initial clusters by assigning all points to the closest cluster centre
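
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GRAVIclust authors' code: the function names are hypothetical, "within radius r" is taken to mean distance <= r, ties go to the lowest point index, and the loop simply stops early if no points remain before k centres are found.

```python
def initial_centres(dist, k, r):
    """Pick up to k initial centres from a symmetric distance matrix."""
    remaining = set(range(len(dist)))
    centres = []
    while len(centres) < k and remaining:
        # Step 1: the remaining point with the most remaining points within r.
        # Iterating in sorted order makes max() resolve ties to the lowest index.
        best = max(
            sorted(remaining),
            key=lambda p: sum(1 for q in remaining if q != p and dist[p][q] <= r),
        )
        centres.append(best)
        # Remove the centre and all points within r of it (dist[best][best] is 0).
        remaining -= {q for q in remaining if dist[best][q] <= r}
    return centres


def assign(dist, centres):
    """Step 3: label every point with the index of its closest centre."""
    return [min(centres, key=lambda c: dist[p][c]) for p in range(len(dist))]
```

Because the input is a distance matrix rather than coordinates, the same sketch works with access path distances taken from a road or rail map.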

GRAVIclust: radius calculation

Radius r
  calculated based on the area of the region considered for clustering
  static radius
    based on the assumption that all clusters are of the same size
  dynamic radius
    recalculated after each initial cluster centre is chosen

$$r = \sqrt{\frac{A}{\pi \cdot \#\,\text{required clusters}}}$$

A = area of minimum bounding rectangle
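
As a small worked example of the static radius, the sketch below assumes the slide formula reads r = sqrt(A / (pi * k)), i.e. k equal circles of radius r together cover the area A. The helper name is hypothetical, and the minimum bounding rectangle is taken to be axis-aligned, which is an assumption.

```python
import math


def static_radius(points, k):
    """Radius r such that k circles of area pi * r**2 sum to the bounding area."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Area of the axis-aligned bounding rectangle (assumed shape of the region).
    area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    return math.sqrt(area / (math.pi * k))
```

For a region of area 4π and k = 4 this gives r = 1.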

GRAVIclust: Static vs. Dynamic

Static
  reduced computation
    # points within a radius r has to be calculated only once
  not suitable for problems where the points are separated by large empty areas

Dynamic
  increases computation time
  ensures the radius is adjusted as the points are removed

Differs only when distribution is non-uniform

GRAVIclust: Optimisation Phase

Step 1: for each cluster, calculate new centre
  based on the point closest to the cluster's centre of gravity

Step 2: re-assign points to new cluster centres

Step 3: recalculate distance function
  never greater than previous

Step 4: repeat Steps 1 to 3 until the value of the distance function equals the previous value
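
A minimal sketch of this loop follows, assuming each point has 2D coordinates and that the centre of gravity is the coordinate mean of the cluster members (neither is spelled out on the slide). The function name is hypothetical. The slide states that the distance function never increases; the `>=` guard is a defensive assumption so the loop always terminates.

```python
import math


def optimise(points, dist, centres):
    """Move centres towards cluster centres of gravity until no improvement."""
    n = len(points)

    def assign(cs):
        # Step 2: every point goes to its closest centre.
        return [min(cs, key=lambda c: dist[p][c]) for p in range(n)]

    def cost(labels):
        # Step 3: sum of distances from each point to its cluster centre.
        return sum(dist[p][labels[p]] for p in range(n))

    labels = assign(centres)
    best = cost(labels)
    while True:
        new_centres = []
        for c in centres:
            # A centre is always its own closest centre, so members is non-empty
            # as long as the points are distinct.
            members = [p for p in range(n) if labels[p] == c]
            gx = sum(points[p][0] for p in members) / len(members)
            gy = sum(points[p][1] for p in members) / len(members)
            # Step 1: the member closest to the centre of gravity.
            new_centres.append(
                min(members, key=lambda p: math.dist(points[p], (gx, gy)))
            )
        new_labels = assign(new_centres)
        value = cost(new_labels)
        if value >= best:  # Step 4: stop when the value no longer improves.
            return centres, labels, best
        centres, labels, best = new_centres, new_labels, value
```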

GRAVIclust

Deterministic

Can handle obstacles

Monotonic convergence of the distance function to a stable point

AUTOCLUST

Definitions

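With $N(p_i)$ the set of Delaunay edges incident to $p_i$, $d(p_i) = \|N(p_i)\|$ and $|e_j|$ the length of edge $e_j$:

$$\mathrm{LocalMean}(p_i) = \frac{\sum_{e \in N(p_i)} |e|}{\|N(p_i)\|} = \frac{\sum_{j=1}^{d(p_i)} |e_j|}{d(p_i)}$$

$$\mathrm{LocalStDev}(p_i) = \sqrt{\frac{\sum_{j=1}^{d(p_i)} \left(\mathrm{LocalMean}(p_i) - |e_j|\right)^2}{d(p_i)}}$$

$$\mathrm{MeanStDev}(P) = \frac{\sum_{i=1}^{n} \mathrm{LocalStDev}(p_i)}{n}$$

$$\mathrm{RelativeStDev}(p_i) = \frac{\mathrm{LocalStDev}(p_i)}{\mathrm{MeanStDev}(P)}$$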

AUTOCLUST

Definitions II

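$$\mathrm{ShortEdges}(p_i) = \{\, e_j \in N(p_i) \mid |e_j| < \mathrm{LocalMean}(p_i) - \mathrm{MeanStDev}(P) \,\}$$

$$\mathrm{LongEdges}(p_i) = \{\, e_j \in N(p_i) \mid |e_j| > \mathrm{LocalMean}(p_i) + \mathrm{MeanStDev}(P) \,\}$$

$$\mathrm{OtherEdges}(p_i) = N(p_i) - \left(\mathrm{ShortEdges}(p_i) \cup \mathrm{LongEdges}(p_i)\right)$$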
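
The definitions can be tried out on any edge list. The sketch below is an illustration only: it takes the edges as given instead of computing a Delaunay Diagram, averages LocalStDev over the points that have at least one incident edge, and uses a hypothetical function name and return layout.

```python
import math
from collections import defaultdict


def classify_edges(points, edges):
    """Split each point's incident edges into short, long and other edges."""
    incident = defaultdict(list)  # point index -> [(length, edge), ...]
    for i, j in edges:
        length = math.dist(points[i], points[j])
        incident[i].append((length, (i, j)))
        incident[j].append((length, (i, j)))

    local_mean, local_stdev = {}, {}
    for p, inc in incident.items():
        lengths = [length for length, _ in inc]
        mean = sum(lengths) / len(lengths)
        local_mean[p] = mean
        local_stdev[p] = math.sqrt(
            sum((mean - length) ** 2 for length in lengths) / len(lengths)
        )
    # MeanStDev(P): average of the local standard deviations.
    mean_stdev = sum(local_stdev.values()) / len(local_stdev)

    groups = {}
    for p, inc in incident.items():
        short = [e for length, e in inc if length < local_mean[p] - mean_stdev]
        long_ = [e for length, e in inc if length > local_mean[p] + mean_stdev]
        other = [e for _, e in inc if e not in short and e not in long_]
        groups[p] = {"short": short, "long": long_, "other": other}
    return groups, mean_stdev
```

Note that the classification is per point: the same edge can be a long edge for one endpoint and an other edge for the opposite endpoint.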

AUTOCLUST

Phase 1: finding boundaries

Phase 2: restoring and re-attaching

Phase 3: detecting second-order inconsistency

AUTOCLUST: Phase 1

Finding boundaries
  Calculate
    Delaunay Diagram
    for each point $p_i$
      ShortEdges($p_i$)
      LongEdges($p_i$)
      OtherEdges($p_i$)
  Remove ShortEdges($p_i$) and LongEdges($p_i$)

AUTOCLUST: Phase 2

Restoring and re-attaching
  for each point $p_i$ where $\mathrm{ShortEdges}(p_i) \neq \emptyset$
    Determine a candidate connected component C for $p_i$
      If there are 2 edges $e_j = (p_i, p_j)$ and $e_k = (p_i, p_k)$ in ShortEdges($p_i$) with $CC[p_j] \neq CC[p_k]$, then
        Compute, for each edge $e = (p_i, p_j) \in \mathrm{ShortEdges}(p_i)$, the size $\|CC[p_j]\|$ and let $M = \max_{e = (p_i, p_j) \in \mathrm{ShortEdges}(p_i)} \|CC[p_j]\|$
        Let C be the class label of the largest connected component (if there are two different connected components with cardinality M, we let C be the one with the shortest edge to $p_i$)

AUTOCLUST: Phase 2

Restoring and re-attaching
  for each point $p_i$ where $\mathrm{ShortEdges}(p_i) \neq \emptyset$
    Determine a candidate connected component C for $p_i$
      If … Otherwise, let C be the label of the connected component that all edges $e \in \mathrm{ShortEdges}(p_i)$ connect $p_i$ to

AUTOCLUST: Phase 2

Restoring and re-attaching
  for each point $p_i$ where $\mathrm{ShortEdges}(p_i) \neq \emptyset$
    Determine a candidate connected component C for $p_i$
    If the edges in OtherEdges($p_i$) connect to a connected component different from C, remove them. Note that all edges in OtherEdges($p_i$) are removed and, only in this case, $p_i$ will swap connected components
    Add all edges $e \in \mathrm{ShortEdges}(p_i)$ that connect to C

AUTOCLUST: Phase 3

Detecting second-order inconsistency
  compute the LocalMean for 2-neighbourhoods
  remove all edges in $N_{2,G}(p_i)$ that are long edges

$$\mathrm{LocalMean}_{2,G}(p_i) = \frac{\sum_{e \in N_{2,G}(p_i)} |e|}{\|N_{2,G}(p_i)\|}$$

$$|e| > \mathrm{LocalMean}_{2,G}(p_i) + \mathrm{MeanStDev}(P)$$
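
The following sketch illustrates this phase on a plain edge list. It assumes that the 2-neighbourhood of a point is the set of edges lying on a path of at most two edges starting at that point, which the slides do not define. MeanStDev(P) is passed in as a number, and the function name is hypothetical.

```python
import math
from collections import defaultdict


def remove_second_order_long_edges(points, edges, mean_stdev):
    """Drop edges that are long relative to some point's 2-neighbourhood."""
    adj = defaultdict(set)
    length = {}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
        length[frozenset((i, j))] = math.dist(points[i], points[j])

    to_remove = set()
    for p in list(adj):
        # Edges incident to p or to one of p's neighbours (assumed N_2,G).
        near = {p} | adj[p]
        n2 = {frozenset((u, v)) for u in near for v in adj[u]}
        mean2 = sum(length[e] for e in n2) / len(n2)
        to_remove |= {e for e in n2 if length[e] > mean2 + mean_stdev}
    return [e for e in edges if frozenset(e) not in to_remove]
```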

AUTOCLUST

No user supplied arguments
  eliminates expensive human-based exploration time for finding best-fit arguments

Robust to noise, outliers, bridges and type of distribution

Able to detect clusters with arbitrary shapes, different sizes and different densities

Can handle multiple bridges

O(n log n)

AUTOCLUST+

Construct Delaunay Diagram

Calculate MeanStDev(P)

For all edges e, remove e if it intersects some obstacles

Apply the 3 phases of AUTOCLUST to the planar graph resulting from the previous steps
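
The obstacle step comes down to a segment intersection test. The sketch below assumes obstacles are given as line segments (the slides do not say how they are represented) and uses the standard orientation test. All names are hypothetical, and touching an obstacle counts as intersecting it.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, 0 or -1."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def _on_segment(a, b, c):
    """True if c, already known to be collinear with a-b, lies on segment a-b."""
    return (min(a[0], b[0]) <= c[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= c[1] <= max(a[1], b[1]))


def segments_intersect(p, q, r, s):
    """True if segment p-q and segment r-s share at least one point."""
    o1, o2 = _orient(p, q, r), _orient(p, q, s)
    o3, o4 = _orient(r, s, p), _orient(r, s, q)
    if o1 != o2 and o3 != o4:
        return True
    # Collinear special cases: an endpoint lies on the other segment.
    return ((o1 == 0 and _on_segment(p, q, r))
            or (o2 == 0 and _on_segment(p, q, s))
            or (o3 == 0 and _on_segment(r, s, p))
            or (o4 == 0 and _on_segment(r, s, q)))


def remove_blocked_edges(points, edges, obstacles):
    """Keep only the edges that do not cross any obstacle segment."""
    return [
        (i, j) for i, j in edges
        if not any(segments_intersect(points[i], points[j], a, b)
                   for a, b in obstacles)
    ]
```

The surviving edges would then be passed to the three AUTOCLUST phases.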

3D Boundary-based Clustering

Benefits from 3D Clustering
  more accurate spatial analysis
  distinguish
    positive clusters: clusters in higher dimensions but not in lower dimensions
    negative clusters: clusters in lower dimensions but not in higher dimensions


Based on AUTOCLUST

Uses Delaunay Tetrahedrizations

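Definitions:
  $e_j$ is a potential inter-cluster edge if:

$$|e_j| > \mathrm{LocalMean}(p_i) + l \cdot \mathrm{LocalStDev}(p_i)$$

$$l = m \cdot \mathrm{RelativeStDev}(p_i)^{-1} = m \cdot \frac{\mathrm{MeanStDev}(P)}{\mathrm{LocalStDev}(p_i)}$$

$$\mathrm{LocalMean}(p_i) - m \cdot \mathrm{MeanStDev}(P) \le \mathrm{AI}(p_i) \le \mathrm{LocalMean}(p_i) + m \cdot \mathrm{MeanStDev}(P)$$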


Phase I
  For all the $p_i \in P$, classify each edge $e_j$ incident to $p_i$ into one of three groups
    ShortEdges($p_i$) when the length of $e_j$ is less than the range in AI($p_i$)
    LongEdges($p_i$) when the length of $e_j$ is greater than the range in AI($p_i$)
    OtherEdges($p_i$) when the length of $e_j$ is within AI($p_i$)
  For all the $p_i \in P$, remove all edges in ShortEdges($p_i$) and LongEdges($p_i$)


Phase II
  Recuperate ShortEdges($p_i$) incident to border points using connected component analysis

Phase III
  Remove exceptionally long edges in local regions

$$|e_j| > \mathrm{LocalMean}_{2,G}(p_i) + m \cdot \mathrm{MeanStDev}(P)$$

Shared Nearest Neighbour

Clustering in higher dimensions
  Distances or similarities between points become more uniform, making clustering more difficult
  Also, similarity between points can be misleading
    i.e. a point can be more similar to a point that “actually” belongs to a different cluster

Solution
  Shared nearest neighbour approach to similarity

SNN: An alternative definition of similarity

Euclidean distance
  most common distance metric used
  while useful in low dimensions, it doesn’t work well in high dimensions

      A1  A2  A3  A4  A5  A6  A7  A8  A9  A10
P1     3   0   0   0   0   0   0   0   0    0
P2     0   0   0   0   0   0   0   0   0    4
P3     3   2   4   0   1   2   3   1   2    0
P4     0   2   4   0   1   2   3   1   2    4
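
One way to read the table, offered as an assumption about the slide's intent, is that Euclidean distance cannot tell the pair P1, P2 apart from the pair P3, P4, even though P3 and P4 agree on most attributes and P1 and P2 share none. The short check below computes both distances and counts shared non-zero attributes; the helper name is hypothetical.

```python
import math

P1 = [3, 0, 0, 0, 0, 0, 0, 0, 0, 0]
P2 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 4]
P3 = [3, 2, 4, 0, 1, 2, 3, 1, 2, 0]
P4 = [0, 2, 4, 0, 1, 2, 3, 1, 2, 4]


def shared_nonzero(a, b):
    """Number of attributes that are non-zero in both points."""
    return sum(1 for x, y in zip(a, b) if x != 0 and y != 0)


for name, a, b in (("P1-P2", P1, P2), ("P3-P4", P3, P4)):
    print(name, math.dist(a, b), shared_nonzero(a, b))
```

Both pairs differ only in A1 (by 3) and A10 (by 4), so both distances are 5, while the pairs share 0 and 7 non-zero attributes respectively.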

SNN: An alternative definition of similarity

Define similarity in terms of their shared nearest neighbours
  the similarity of the points is “confirmed” by their common shared nearest neighbours

$$\mathrm{similarity}(p, q) = \mathrm{size}(NN(p) \cap NN(q))$$
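
The similarity above can be computed directly from a distance matrix. This is a minimal sketch with hypothetical function names; it assumes NN(p) is the set of the k nearest neighbours of p, excluding p itself, with ties resolved in favour of the lower index.

```python
def knn(dist, k):
    """For each point, the set of its k nearest neighbours (excluding itself)."""
    n = len(dist)
    return [
        set(sorted((q for q in range(n) if q != p), key=lambda q: dist[p][q])[:k])
        for p in range(n)
    ]


def snn_similarity(nn, p, q):
    """Number of nearest neighbours that p and q have in common."""
    return len(nn[p] & nn[q])
```

Two points far apart in the data space end up with similarity 0 as soon as their neighbour lists do not overlap, regardless of the raw distance between them.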

SNN: An alternative definition of density

SNN similarity, with the k-nearest neighbour approach
  if the k-th nearest neighbour of a point, with respect to SNN similarity, is close, then we say that there is a high density at this point
  since it reflects the local configuration of the points in the data space, it is relatively insensitive to variations in density and the dimensionality of the space

SNN: Algorithm

Compute the similarity matrix
  corresponds to a similarity graph with data points for nodes and edges whose weights are the similarities between data points

Sparsify the similarity matrix by keeping only the k most similar neighbours
  corresponds to keeping only the k strongest links of the similarity graph

Construct the shared nearest neighbour graph from the sparsified similarity matrix

Find the SNN density of each point

Find the core points

Form clusters from the core points

Discard all noise points

Assign all non-noise, non-core points to clusters
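
The steps of the algorithm can be sketched end to end as below. Several details are assumptions that the slides do not state: the thresholds `eps` and `min_pts` are hypothetical parameters, two points are linked only if each appears in the other's neighbour list, SNN density is taken as the number of links with similarity of at least `eps`, and a non-core point joins the cluster of the core point it is most similar to. It is an illustration, not a reference implementation.

```python
def snn_cluster(dist, k, eps, min_pts):
    """Return one cluster label per point; None marks a noise point."""
    n = len(dist)
    # Similarity matrix and sparsification: a distance matrix stands in for
    # the similarities, and each point keeps its k nearest neighbours.
    nn = [
        set(sorted((q for q in range(n) if q != p), key=lambda q: dist[p][q])[:k])
        for p in range(n)
    ]

    def sim(p, q):
        # SNN graph weight; mutual neighbours only (assumption).
        if p in nn[q] and q in nn[p]:
            return len(nn[p] & nn[q])
        return 0

    # SNN density: number of strong links per point (assumed definition).
    strong = [[q for q in range(n) if q != p and sim(p, q) >= eps]
              for p in range(n)]
    core = [p for p in range(n) if len(strong[p]) >= min_pts]
    core_set = set(core)

    # Form clusters: connected components of core points over strong links.
    labels = [None] * n
    cluster = 0
    for p in core:
        if labels[p] is not None:
            continue
        labels[p] = cluster
        stack = [p]
        while stack:
            u = stack.pop()
            for v in strong[u]:
                if v in core_set and labels[v] is None:
                    labels[v] = cluster
                    stack.append(v)
        cluster += 1

    # Non-core points with no strong link to a core point stay None (noise);
    # the others join the cluster of their most similar core point.
    for p in range(n):
        if labels[p] is None:
            linked = [q for q in strong[p] if q in core_set]
            if linked:
                labels[p] = labels[max(linked, key=lambda q: sim(p, q))]
    return labels
```

The number of clusters is not an input: it falls out of the connected components formed by the core points.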

Shared Nearest Neighbour

Finds clusters of varying shapes, sizes, and densities, even in the presence of noise and outliers

Handles data of high dimensionality and varying densities

Automatically detects the # of clusters
