Pairwise Nearest Neighbor Method Revisited
Parittainen yhdistelymenetelmä uudistettuna (Finnish title)

UNIVERSITY OF JOENSUU DEPARTMENT OF COMPUTER SCIENCE JOENSUU, FINLAND

Olli Virmajoki

11.12.2004

Clustering

Important combinatorial optimization problem that must often be solved as a part of more complicated tasks in data analysis, pattern recognition, data mining, and other fields of science and engineering

Entails partitioning a data set so that similar objects are grouped together and dissimilar objects are placed in separate groups

Example of data sets

ID   Postal zone        Self-employed  Civil servants  Clerks  Manual workers
800  Munchen            56750          57218           300201  242375
801  Munchen-land ost   7684           5790            20279   23491
802  Munchen-land sued  3780           1977            11058   7398
803  Munchen-land west  7226           5623            25571   20380
804  Munchen-land nord  2226           1305            9347    12432
805  Freising           8187           5140            14632   24377
806  Dachau             8165           2763            11638   24489
807  Ingolstadt         5810           5212            15019   30532

Employment statistics

R   G   B
26  20  45
28  5   46
28  12  44
23  13  46
31  4   51

RGB-data

Summary of data sets

Data set        Type of data set  Number of data vectors (N)  Number of clusters (M)  Dim of data vector
Bridge          Gray-scale        4096                        256                     16
House           RGB               34112                       256                     3
Miss America    Residual vectors  6480                        256                     16
Data set S1-S4  Synthetic         5000                        15                      2

Data sets

An example of clustering

Clustering

Given a set of N data vectors $X = \{x_1, x_2, \ldots, x_N\}$ in K-dimensional space, clustering aims at solving the partition $P = \{p_1, p_2, \ldots, p_N\}$, which defines for each data vector the index of the cluster to which it is assigned.

Cluster: $s_a = \{x_i \mid p_i = a\}$

Clustering: $S = \{s_1, s_2, \ldots, s_M\}$

Codebook: $C = \{c_1, c_2, \ldots, c_M\}$

Cost function of the combinatorial optimization problem:

$$f(P, C) = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - c_{p_i} \rVert^2$$
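As a minimal sketch, the cost function f(P, C) can be evaluated directly from its definition. The function name `mse_cost`, the 0-based cluster indices, and the toy data are assumptions made only for this illustration.

```python
def mse_cost(X, P, C):
    """Return f(P, C) = (1/N) * sum_i ||x_i - c_{p_i}||^2.

    X: list of data vectors, P: cluster index of every vector (0-based,
    an assumption of this sketch), C: list of code vectors (centroids).
    """
    total = 0.0
    for x, p in zip(X, P):
        c = C[p]
        # Squared Euclidean distance between the vector and its code vector.
        total += sum((xk - ck) ** 2 for xk, ck in zip(x, c))
    return total / len(X)


# Toy data: three vectors, two clusters.
X = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]]
C = [[1.0, 0.0], [10.0, 0.0]]
P = [0, 0, 1]
cost = mse_cost(X, P, C)
```

Here the first two vectors each contribute a squared distance of 1 and the third contributes 0, so the cost is their sum divided by N = 3.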

Clustering algorithms

Heuristic methods
Optimization methods
  K-means
  Genetic algorithms
Graph-theoretical methods
Hierarchical methods
  Divisive
  Agglomerative (merging)

Agglomerative clustering

N = 22 (number of data points)
M = 3 (number of final clusters)

Ward’s method (PNN in VQ)

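Merge cost:

$$d_{a,b} = \frac{n_a n_b}{n_a + n_b} \, \lVert c_a - c_b \rVert^2$$

where $n_a$ and $n_b$ are the sizes of the clusters and $c_a$ and $c_b$ their centroids.

Local optimization strategy:

$$(a, b) = \arg\min_{i, j \in [1, N],\; i \neq j} d_{i,j}$$

Nearest neighbor search is needed:
(1) finding the cluster pair to be merged
(2) updating of NN pointers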
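The merge cost and the local optimization strategy can be sketched as a straightforward loop that always merges the cheapest pair. This is an illustrative baseline with an exhaustive pair search, not the optimized method of the talk; the names `merge_cost`, `merged_centroid`, and `pnn` are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def merge_cost(na, ca, nb, cb):
    """Ward merge cost: n_a * n_b / (n_a + n_b) * ||c_a - c_b||^2."""
    return na * nb / (na + nb) * sq_dist(ca, cb)


def merged_centroid(na, ca, nb, cb):
    """Centroid of the merged cluster: size-weighted mean of both centroids."""
    return [(na * a + nb * b) / (na + nb) for a, b in zip(ca, cb)]


def pnn(X, M):
    """Start from singleton clusters and merge the cheapest pair until M remain."""
    cents = [[float(v) for v in x] for x in X]
    sizes = [1] * len(X)
    while len(cents) > M:
        best = None
        # Exhaustive search over all cluster pairs (the local optimization step).
        for i in range(len(cents)):
            for j in range(i + 1, len(cents)):
                d = merge_cost(sizes[i], cents[i], sizes[j], cents[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, a, b = best
        cents[a] = merged_centroid(sizes[a], cents[a], sizes[b], cents[b])
        sizes[a] += sizes[b]
        del cents[b], sizes[b]  # b > a, so index a stays valid
    return cents, sizes


X = [[0, 0], [1, 0], [10, 0], [11, 0], [20, 0]]
cents, sizes = pnn(X, 3)
```

Because every iteration rescans all pairs, this version is only practical for small inputs, which is why the nearest neighbor search is the target of the speed-up methods.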

The PNN method

M=5000, M=4999, M=4988, ..., M=50, ..., M=16, M=15

[Figure: clustering result at M=5000, M=50, M=16, and M=15]

Nearest neighbor pointers

[Figure: nearest neighbor pointers between the clusters a, b, c, d, e, f, and g]

Fast exact PNN method: reduces the number of nearest neighbor searches in each iteration, from O(N^3) to Ω(N^2)
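The idea of nearest neighbor pointers can be sketched as follows: every cluster stores its cheapest merge partner, and after a merge only the merged cluster and the clusters that pointed at one of the two merged clusters are searched again. This is a simplified illustration under that assumption; the function names and the `searches` counter are hypothetical and not taken from the talk.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def merge_cost(na, ca, nb, cb):
    return na * nb / (na + nb) * sq_dist(ca, cb)


def find_nn(i, sizes, cents):
    """Full search for the cheapest merge partner of cluster i."""
    best_j, best_d = None, float("inf")
    for j in cents:
        if j != i:
            d = merge_cost(sizes[i], cents[i], sizes[j], cents[j])
            if d < best_d:
                best_j, best_d = j, d
    return best_j, best_d


def pnn_with_pointers(X, M):
    cents = {i: [float(v) for v in x] for i, x in enumerate(X)}
    sizes = {i: 1 for i in cents}
    # nn[i] = (nearest neighbor of i, merge cost of that pair)
    nn = {i: find_nn(i, sizes, cents) for i in cents}
    searches = len(cents)
    while len(cents) > M:
        a = min(nn, key=lambda i: nn[i][1])  # cheapest pair via the pointers
        b = nn[a][0]
        na, nb = sizes[a], sizes[b]
        cents[a] = [(na * p + nb * q) / (na + nb)
                    for p, q in zip(cents[a], cents[b])]
        sizes[a] = na + nb
        del cents[b], sizes[b], nn[b]
        # Re-search only the merged cluster and clusters that pointed at a or b.
        for i in cents:
            if i == a or nn[i][0] in (a, b):
                nn[i] = find_nn(i, sizes, cents)
                searches += 1
    return cents, sizes, searches


X = [[0, 0], [1, 0], [10, 0], [11, 0], [20, 0]]
cents, sizes, searches = pnn_with_pointers(X, 3)
```

The remaining pointers can be left untouched because the merge cost from any other cluster to the merged cluster is never smaller than the cheaper of its costs to the two original clusters.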

Combining the PNN and k-means

[Figure: codebook size as the algorithms proceed, with the levels N, M0, M, and 1 marked; curves: combined PNN, k-means, standard PNN, random selection]

PNN as a crossover method in the genetic algorithm

Two random codebooks, M=15

Combined codebook M=30 and final codebook M=15

[Figure: panels Initial 1, Initial 2, Combined (union), and Result of PNN]

Publication 1: Speed-up methods

Partial distortion search (PDS)
Mean-distance-ordered search (MPS)
  Uses the component means of the vectors
  Derives a precondition for the distance calculations
Reduces the run time to 2-15% of the original
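Both pruning ideas can be illustrated on a plain nearest code vector search. This sketch makes two simplifying assumptions: it uses the squared Euclidean distance rather than the merge cost, and it applies only the mean-based precondition K·(mean(x) − mean(c))² ≤ ||x − c||², without sorting the code vectors by their means. The function name is hypothetical.

```python
def nearest_code_vector(x, codebook):
    """Return (index, squared distance) of the code vector nearest to x."""
    K = len(x)
    mean_x = sum(x) / K
    best_i, best_d = -1, float("inf")
    for i, c in enumerate(codebook):
        mean_c = sum(c) / K  # could be precomputed once per code vector
        # MPS precondition: K * (mean_x - mean_c)^2 is a lower bound of the
        # squared distance, so the candidate can be rejected without
        # computing the distance at all.
        if K * (mean_x - mean_c) ** 2 >= best_d:
            continue
        # PDS: accumulate the distance and stop as soon as the partial sum
        # already reaches the best distance found so far.
        d = 0.0
        for xk, ck in zip(x, c):
            d += (xk - ck) ** 2
            if d >= best_d:
                break
        else:
            best_i, best_d = i, d
    return best_i, best_d


codebook = [[0, 0, 0, 0], [5, 5, 5, 5], [1, 2, 1, 2], [9, 9, 9, 9]]
x = [1, 1, 2, 2]
index, dist = nearest_code_vector(x, codebook)
```

Neither test changes the result of the search; both only skip work for candidates that cannot win.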

Example of the MPS method

[Figure: vectors A, B, and C with their projections A', B', and C'; labels: Input vector, Best candidate]

Publication 2: Graph-based PNN

Based on the exact PNN method
NN search is limited only to the k clusters that are connected by the graph structure
Reduces the time complexity of every search from O(N) to O(k)
Reduction in the run time to 1 to 4%

Why graph structure?

O(N) searches with the full search (N=4096)

Only O(k) searches with the graph structure (k = 3)
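The difference between the full search and the graph-limited search can be sketched as below. The graph here is built by brute force and is not updated after merges, so this shows only the search step; the names `build_knn_graph` and `graph_nn` are hypothetical, and using the merge cost as the neighborhood criterion is an assumption of this sketch.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def merge_cost(na, ca, nb, cb):
    return na * nb / (na + nb) * sq_dist(ca, cb)


def build_knn_graph(sizes, cents, k):
    """For every cluster, list its k cheapest merge partners (brute force)."""
    graph = {}
    for i in cents:
        costs = sorted(
            (merge_cost(sizes[i], cents[i], sizes[j], cents[j]), j)
            for j in cents if j != i
        )
        graph[i] = [j for _, j in costs[:k]]
    return graph


def graph_nn(i, graph, sizes, cents):
    """O(k) search: only the graph neighbors of cluster i are examined."""
    return min(
        (merge_cost(sizes[i], cents[i], sizes[j], cents[j]), j)
        for j in graph[i]
    )


points = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0], [20.0, 0.0]]
cents = dict(enumerate(points))
sizes = {i: 1 for i in cents}
graph = build_knn_graph(sizes, cents, k=2)
cost, neighbor = graph_nn(4, graph, sizes, cents)
```

With k much smaller than N, each search touches only a handful of candidates instead of every cluster.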

Sample graph

Publication 3: Multilevel thresholding

Can be considered as a special case of vector quantization (VQ), where the vectors are 1-dimensional
Existing method: O(N^2)
PNN thresholding can be implemented in O(N·logN)
The proposed method works in real time for any number of thresholds

Distances in heap structure

[Figure: distances kept in a heap structure; labels: minimum distance, remove, update; complexities O(1) and O(log N)]
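For the 1-dimensional case, a heap-based PNN can be sketched as follows. It assumes that, after sorting, only adjacent clusters need to be considered as merge candidates, and it handles outdated heap entries by lazy invalidation with version stamps. The function name and the data layout are assumptions of this sketch; it works on individual values rather than on a histogram.

```python
import heapq


def pnn_thresholding(values, M):
    """Merge adjacent 1-D clusters until M remain; return (size, mean) pairs."""
    vals = sorted(values)
    n = len(vals)
    size = [1] * n
    mean = [float(v) for v in vals]
    prev = list(range(-1, n - 1))  # -1 means "no predecessor"
    nxt = list(range(1, n + 1))    # n means "no successor"
    alive = [True] * n
    stamp = [0] * n                # bumped whenever a cluster changes

    def cost(a, b):
        return size[a] * size[b] / (size[a] + size[b]) * (mean[a] - mean[b]) ** 2

    # One heap entry per adjacent pair: (cost, left, right, stamps at push time).
    heap = [(cost(i, i + 1), i, i + 1, 0, 0) for i in range(n - 1)]
    heapq.heapify(heap)
    count = n
    while count > M and heap:
        _, a, b, sa, sb = heapq.heappop(heap)
        if not (alive[a] and alive[b]) or stamp[a] != sa or stamp[b] != sb:
            continue  # stale entry: one of the clusters has changed since
        # Merge b into its left neighbor a.
        mean[a] = (size[a] * mean[a] + size[b] * mean[b]) / (size[a] + size[b])
        size[a] += size[b]
        alive[b] = False
        stamp[a] += 1
        nxt[a] = nxt[b]
        if nxt[a] < n:
            prev[nxt[a]] = a
            r = nxt[a]
            heapq.heappush(heap, (cost(a, r), a, r, stamp[a], stamp[r]))
        if prev[a] >= 0:
            l = prev[a]
            heapq.heappush(heap, (cost(l, a), l, a, stamp[l], stamp[a]))
        count -= 1
    return [(size[i], mean[i]) for i in range(n) if alive[i]]


clusters = pnn_thresholding([1, 2, 3, 10, 11, 12, 30, 31], 3)
```

Each merge pushes at most two new entries, and every push or pop costs O(log N), so the merge phase stays within O(N·logN) after the initial sort. Thresholds could then be placed between the neighboring clusters that remain.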

Publication 4: Iterative shrinking (IS)

Generates the clustering by a sequence of cluster removal operations
In the IS method the vectors can be reassigned more freely than in the PNN method
Can be applied as a crossover method in the genetic algorithm (GAIS)
GAIS outperforms all other clustering algorithms in the comparison
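A cluster removal step can be sketched in a brute-force way: for each cluster, estimate how much the total squared error grows if the cluster is removed and each of its vectors moves to its nearest remaining centroid, then remove the cheapest cluster. This is a simplified illustration, not the removal cost formula of the publication: the cost is estimated with the centroids held fixed, and the function names are hypothetical.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def iterative_shrinking(X, M):
    """Start from singleton clusters and remove one cluster at a time."""
    clusters = {i: [[float(v) for v in x]] for i, x in enumerate(X)}
    cents = {i: centroid(vs) for i, vs in clusters.items()}

    def second_nearest(x, a):
        # Nearest centroid of x when cluster a is no longer available.
        return min((j for j in cents if j != a),
                   key=lambda j: sq_dist(x, cents[j]))

    def removal_cost(a):
        # Estimated increase in squared error (centroids held fixed).
        return sum(sq_dist(x, cents[second_nearest(x, a)]) - sq_dist(x, cents[a])
                   for x in clusters[a])

    while len(clusters) > M:
        a = min(clusters, key=removal_cost)
        orphans = clusters.pop(a)
        targets = [second_nearest(x, a) for x in orphans]
        del cents[a]
        # Every vector of the removed cluster may go to a different cluster.
        for x, j in zip(orphans, targets):
            clusters[j].append(x)
        for j in set(targets):
            cents[j] = centroid(clusters[j])
    return clusters, cents


X = [[0, 0], [1, 0], [10, 0], [11, 0], [20, 0]]
clusters, cents = iterative_shrinking(X, 3)
```

Unlike a merge, which moves all vectors of one cluster into a single neighbor, the removal lets the orphaned vectors spread over several clusters.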

Example of the PNN method

[Figure: clusters S1 to S5 before and after the cluster merge. Legend, code vectors: vectors to be merged, remaining vectors. Legend, data vectors: data vectors of the clusters to be merged, other data vectors]

Example of the iterative shrinking method

[Figure: clusters S1 to S5 before and after the cluster removal. Legend, code vectors: vector to be removed, remaining vectors. Legend, data vectors: data vectors of the cluster to be removed, other data vectors]

The PNN and IS in the search of the number of clusters

[Figure: data set S4. F-ratio (0.000080 to 0.000120) versus number of clusters (25 down to 5) for IS and PNN, with the minimum marked]

Time-distortion performance

[Figure: MSE (160 to 190) versus time in seconds (axis ticks 0, 1, 10, 100, 1000, 10000, 100000) for repeated K-means, RLS, GAIS, PNN, IS, and SAGA]

Publication 5: Optimal clustering

Can be found by considering all possible merge sequences and finding the one that minimizes the optimization function
Can be implemented as a branch-and-bound (BB) technique
Two suboptimal but polynomial-time variants:
  Piecewise optimization
  Look-Ahead optimization
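For a very small data set, a search over merge sequences with a simple bound can be sketched as below. Two assumptions are made for this illustration: the optimization function is the total squared error, which never decreases when two clusters are merged, and redundancy is avoided by remembering the partitions already visited. The function names are hypothetical, and the running time is exponential.

```python
from itertools import combinations


def sse(cluster, X):
    """Sum of squared distances from the cluster's vectors to its centroid."""
    pts = [X[i] for i in cluster]
    c = [sum(col) / len(pts) for col in zip(*pts)]
    return sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in pts)


def optimal_clustering(X, M):
    """Exhaustive merge-sequence search with pruning (tiny inputs only)."""
    start = frozenset(frozenset([i]) for i in range(len(X)))
    best = {"cost": float("inf"), "partition": None}
    seen = set()

    def visit(partition):
        if partition in seen:
            return  # reached earlier through another merge order
        seen.add(partition)
        cost = sum(sse(c, X) for c in partition)
        if cost >= best["cost"]:
            return  # bound: further merges can only increase the cost
        if len(partition) == M:
            best["cost"], best["partition"] = cost, partition
            return
        for a, b in combinations(partition, 2):
            visit((partition - {a, b}) | {a | b})

    visit(start)
    return best["cost"], best["partition"]


X = [[0, 0], [1, 0], [10, 0], [11, 0], [20, 0]]
cost, partition = optimal_clustering(X, 3)
```

The bound is valid because a branch whose current cost already reaches the best known result cannot lead to a better clustering after more merges.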

Example of non-redundant search tree

[Figure: search tree of merge sequences over the clusters A, B, C, D, and E]

Branches that do not have any valid clustering have been cut out

Illustration of the Piecewise optimization

[Figure: the number of clusters is reduced in stages of Z merge steps: N clusters, N - Z clusters, N - 2Z clusters, N - 3Z clusters, ..., M clusters (final result)]

Comparative results

[Figure: Bridge. MSE (160 to 180) versus running time in seconds (log scale, 1 to 100000) for GAIS (short), GAIS (long), IS, PNN, Standard k-means, PNN+PDS+MPS+Lazy, Graph-PNN, Graph-PNN+K-means, and K-means+PDS+MPS+Activity]

Comparative results

[Figure: House. MSE (5.5 to 8.0) versus a log-scale axis from 1 to 1000000 for GAIS (long), GAIS (short), Standard k-means, PNN, IS, PNN+PDS+MPS+Lazy, Graph-PNN+K-means, and Graph-PNN]

Comparative results

[Figure: Miss America. MSE (5.0 to 6.0) versus a log-scale axis from 1 to 1000000 for Standard k-means, GAIS (long), GAIS (short), IS, PNN, PNN+PDS+MPS+Lazy, Graph-PNN, and Graph-PNN+K-means]

Example of clustering

[Figure: panels k-means and agglomerative clustering]

Conclusions

Several speed-up methods
  Projection-based search
  Partial distortion search
  k nearest neighbor graph
Efficient O(N·logN) time implementation for the 1-dimensional case
Generalization of the merge phase by the cluster removal philosophy (IS) for better quality
Optimal clustering based on the PNN method
