Page 1

Pairwise Nearest Neighbor Method Revisited (Parittainen yhdistelymenetelmä uudistettuna)

UNIVERSITY OF JOENSUU DEPARTMENT OF COMPUTER SCIENCE JOENSUU, FINLAND

Olli Virmajoki

11.12.2004

Page 2

Clustering

An important combinatorial optimization problem that must often be solved as part of more complicated tasks in data analysis, pattern recognition, data mining, and other fields of science and engineering

Entails partitioning a data set so that similar objects are grouped together and dissimilar objects are placed in separate groups

Page 3

Example of data sets

Employment statistics

ID   POSTAL ZONE        Self employed  Civil servants  Clerks  Manual workers
800  Munchen                    56750           57218  300201          242375
801  Munchen-land ost            7684            5790   20279           23491
802  Munchen-land sued           3780            1977   11058            7398
803  Munchen-land west           7226            5623   25571           20380
804  Munchen-land nord           2226            1305    9347           12432
805  Freising                    8187            5140   14632           24377
806  Dachau                      8165            2763   11638           24489
807  Ingolstadt                  5810            5212   15019           30532

RGB data

 R   G   B
26  20  45
28   5  46
28  12  44
23  13  46
31   4  51

Page 4

Summary of data sets

Data set        Type of data set  Number of data vectors (N)  Number of clusters (M)  Dimension of data vector
Bridge          Gray-scale                              4096                     256                        16
House           RGB                                    34112                     256                         3
Miss America    Residual vectors                        6480                     256                        16
Data set S1-S4  Synthetic                               5000                      15                         2

Page 5

Data sets

Page 6

An example of clustering

Page 7

Clustering

Given a set of N data vectors $X = \{x_1, x_2, \ldots, x_N\}$ in K-dimensional space, clustering aims at solving the partition $P = \{p_1, p_2, \ldots, p_N\}$, which defines for each data vector the index of the cluster to which it is assigned.

Cluster: $s_a = \{x_i \mid p_i = a\}$

Clustering: $S = \{s_1, s_2, \ldots, s_M\}$

Codebook: $C = \{c_1, c_2, \ldots, c_M\}$

Cost function:

$$f(P, C) = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - c_{p_i} \rVert^2$$

Combinatorial optimization problem
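As a rough illustration of the cost function on this slide, the sketch below computes the mean squared distance between each data vector and the code vector of its cluster. The function name `cost`, the toy data, and the 0-based cluster indices (the slide counts clusters from 1) are assumptions made for the example.

```python
def cost(X, P, C):
    """Mean squared Euclidean distance from each vector to its code vector.

    X: list of data vectors, P: cluster index of each vector (0-based here),
    C: list of code vectors.
    """
    total = 0.0
    for x, p in zip(X, P):
        total += sum((xk - ck) ** 2 for xk, ck in zip(x, C[p]))
    return total / len(X)


# Toy data (hypothetical): three 2-dimensional vectors, two clusters.
X = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
P = [0, 0, 1]
C = [(1.0, 0.0), (10.0, 0.0)]
print(cost(X, P, C))
```

Clustering then amounts to searching for the pair (P, C) that makes this value as small as possible for a given M.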

Page 8

Clustering algorithms

Heuristic methods

Optimization methods: K-means, Genetic algorithms

Graph-theoretical methods

Hierarchical methods: Divisive, Agglomerative

Page 9

Agglomerative clustering

N = 22 (number of data points)

M = 3 (number of final clusters)

Page 10

Ward's method (PNN in VQ)

Merge cost:

$$d_{a,b} = \frac{n_a n_b}{n_a + n_b} \lVert c_a - c_b \rVert^2$$

Local optimization strategy:

$$a, b = \arg\min_{i,j \in [1,N],\; i \neq j} d_{i,j}$$

Nearest neighbor search is needed:

(1) finding the cluster pair to be merged

(2) updating of NN pointers
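The merge cost and the local optimization strategy can be sketched as follows, assuming each cluster is represented only by its size n and its centroid c. The merge cost is the cluster sizes' product over their sum, times the squared distance between the centroids, which equals the increase in total squared error caused by the merge. The names `merge_cost` and `best_pair` and the toy clusters are illustrative, and the search shown is the plain full search over all pairs.

```python
def merge_cost(n_a, c_a, n_b, c_b):
    """Cost of merging two clusters given their sizes and centroids."""
    sq = sum((u - v) ** 2 for u, v in zip(c_a, c_b))
    return n_a * n_b / (n_a + n_b) * sq


def best_pair(sizes, centroids):
    """Full search for the cluster pair with the smallest merge cost."""
    best = None
    for i in range(len(sizes)):
        for j in range(i + 1, len(sizes)):
            d = merge_cost(sizes[i], centroids[i], sizes[j], centroids[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    return best


# Toy clusters (hypothetical): two singletons and one cluster of size 2.
sizes = [1, 1, 2]
centroids = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
print(best_pair(sizes, centroids))  # (0.5, 0, 1)
```

Repeating this search after every merge is what makes the straightforward method expensive: each of the roughly N merges scans all remaining pairs.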

Page 11

The PNN method

M=5000, M=4999, M=4998, ..., M=50, ..., M=16, M=15

[Figure: panels showing the clustering at M=5000, M=50, M=16, and M=15]

Page 12

Nearest neighbor pointers

[Figure: clusters a, b, c, d, e, f, and g with their nearest neighbor pointers]

Fast exact PNN method: reduces the number of nearest neighbor searches in each iteration: O(N^3) → Ω(N^2)
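A minimal sketch of how nearest neighbor pointers can cut down the searches, assuming the Ward merge cost (size product over size sum, times the squared centroid distance) and clusters stored as (size, centroid) pairs. After a merge, only the merged cluster and the clusters whose pointer referred to one of the merged clusters are searched again. This relies on the property that, for this cost, the merged cluster is never cheaper to join than the cheaper of its two parts, so the other pointers stay valid. The function names and the data layout are assumptions for the example, not the thesis code.

```python
def merge_cost(n_a, c_a, n_b, c_b):
    sq = sum((u - v) ** 2 for u, v in zip(c_a, c_b))
    return n_a * n_b / (n_a + n_b) * sq


def find_nn(a, clusters):
    """Full search for the cheapest merge partner of cluster a."""
    n_a, c_a = clusters[a]
    best_id, best_cost = None, float("inf")
    for j, (n_j, c_j) in clusters.items():
        if j != a:
            d = merge_cost(n_a, c_a, n_j, c_j)
            if d < best_cost:
                best_id, best_cost = j, d
    return best_id, best_cost


def pnn(X, M):
    """Merge N singleton clusters down to M clusters (assumes 1 <= M <= N)."""
    clusters = {i: (1, tuple(x)) for i, x in enumerate(X)}
    nn = {i: find_nn(i, clusters) for i in clusters}  # NN pointers
    while len(clusters) > M:
        a = min(nn, key=lambda i: nn[i][1])  # cheapest pointer
        b = nn[a][0]
        (n_a, c_a), (n_b, c_b) = clusters[a], clusters[b]
        n = n_a + n_b
        merged = tuple((n_a * u + n_b * v) / n for u, v in zip(c_a, c_b))
        clusters[a] = (n, merged)
        del clusters[b], nn[b]
        # Only pointers that referred to a or b need a new search.
        for i in clusters:
            if i == a or nn[i][0] in (a, b):
                nn[i] = find_nn(i, clusters)
    return sorted(clusters.values())


print(pnn([(0,), (1,), (10,), (11,), (20,)], 3))
```

In this toy run, most pointers survive each merge, so only one or two full searches are repeated per iteration instead of one per cluster.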

Page 13

Combining the PNN and k-means

[Figure: codebook size on an axis from N down to 1, with M0 and M marked, for standard PNN, combined PNN, random selection, and k-means]

Page 14

PNN as a crossover method in the genetic algorithm

Two random codebooks, M=15

Combined codebook, M=30, and final codebook, M=15

[Figure: Initial1 and Initial2 are joined by a union into Combined, which the PNN reduces to Result of PNN]

Page 15

Publication 1: Speed-up methods

Partial distortion search (PDS)

Mean-distance-ordered partial search (MPS): uses the component means of the vectors and derives a precondition for the distance calculations

Reduction of the run time to 2 to 15%
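The two speed-up ideas can be illustrated on a plain nearest code vector search. In this sketch the PDS part stops accumulating a squared distance as soon as it reaches the best distance found so far. The MPS-style precondition uses the fact that K times the squared difference of the component means is a lower bound on the squared Euclidean distance, so a candidate can be rejected before any distance calculation. The function name and data are assumptions. A practical MPS implementation would also precompute the means and visit candidates in order of mean difference, which is omitted here.

```python
def nearest_pds_mps(x, codebook):
    """Return (index, squared distance) of the code vector nearest to x."""
    K = len(x)
    mean_x = sum(x) / K
    best_i, best_d = None, float("inf")
    for i, c in enumerate(codebook):
        # Precondition from component means: a lower bound on the distance.
        # (Recomputed here for brevity; normally stored with the codebook.)
        mean_c = sum(c) / K
        if K * (mean_x - mean_c) ** 2 >= best_d:
            continue
        # Partial distortion search: abandon the sum once it is too large.
        d = 0.0
        for xk, ck in zip(x, c):
            d += (xk - ck) ** 2
            if d >= best_d:
                break
        else:
            best_i, best_d = i, d
    return best_i, best_d


# Toy codebook (hypothetical).
codebook = [(5.0, 5.0), (1.0, 3.0), (0.0, 0.0)]
print(nearest_pds_mps((1.0, 2.0), codebook))  # (1, 1.0)
```

Both tests only skip work that cannot change the answer, so the result is the same as with the full search.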

Page 16

Example of the MPS method

[Figure: two panels with points A, B, C and A', B', C'; legend: input vector, best candidate]

Page 17

Publication 2: Graph-based PNN

Based on the exact PNN method

NN search is limited only to the k clusters that are connected by the graph structure

Reduces the time complexity of every search from O(N) to O(k)

Reduction in the run time to 1 to 4%

Page 18

Why graph structure?

O(N) searches with the full search (N = 4096)

Only O(k) searches with the graph structure (k = 3)

Page 19

Sample graph

Page 20

Publication 3: Multilevel thresholding

Can be considered a special case of vector quantization (VQ), where the vectors are 1-dimensional

Existing method: time complexity on the order of N^2

PNN thresholding can be implemented in O(N·logN)

The proposed method works in real time for any number of thresholds

Page 21

Distances in heap structure

[Figure: heap of distances with update and remove operations; labels: minimum distance, O(1), O(log N)]
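One way to picture the 1-dimensional case is the following sketch, which assumes that only neighboring clusters on the sorted axis are merged and that merge costs are kept in a binary heap. Entries made obsolete by a merge are skipped when they surface (lazy deletion), which stands in for the update and remove operations of the figure. The function name, the version counters, and the choice to report each class's largest value as its threshold are assumptions of the example.

```python
import heapq


def pnn_thresholds(values, M):
    """Merge sorted 1-D values into M classes (assumes 1 <= M <= len(values)).

    Returns the largest value of each class except the last one.
    """
    vals = sorted(values)
    n = len(vals)
    size = [1] * n
    mean = [float(v) for v in vals]
    last = list(vals)               # largest value in each cluster
    prev = list(range(-1, n - 1))   # left neighbor, -1 if none
    nxt = list(range(1, n + 1))     # right neighbor, n if none
    alive = [True] * n
    version = [0] * n               # bumped whenever a cluster changes

    def cost(i, j):
        return size[i] * size[j] / (size[i] + size[j]) * (mean[i] - mean[j]) ** 2

    heap = [(cost(i, i + 1), i, i + 1, 0, 0) for i in range(n - 1)]
    heapq.heapify(heap)
    count = n
    while count > M:
        _, i, j, vi, vj = heapq.heappop(heap)
        if not (alive[i] and alive[j]) or version[i] != vi or version[j] != vj:
            continue  # stale entry left over from an earlier merge
        total = size[i] + size[j]
        mean[i] = (size[i] * mean[i] + size[j] * mean[j]) / total
        size[i], last[i], alive[j] = total, last[j], False
        version[i] += 1
        nxt[i] = nxt[j]
        if nxt[i] < n:
            r = nxt[i]
            prev[r] = i
            heapq.heappush(heap, (cost(i, r), i, r, version[i], version[r]))
        if prev[i] >= 0:
            l = prev[i]
            heapq.heappush(heap, (cost(l, i), l, i, version[l], version[i]))
        count -= 1
    thresholds, i = [], 0  # cluster 0 always survives: merges absorb rightwards
    while nxt[i] < n:
        thresholds.append(last[i])
        i = nxt[i]
    return thresholds


print(pnn_thresholds([1, 2, 3, 10, 11, 12, 30, 31], 3))  # [3, 12]
```

Each merge pushes at most two new heap entries, so the total work after sorting is a bounded number of O(log N) heap operations per merge.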

Page 22

Publication 4: Iterative shrinking (IS)

Generates the clustering by a sequence of cluster removal operations

In the IS method the vectors can be reassigned more freely than in the PNN method

Can be applied as a crossover method in the genetic algorithm (GAIS)

GAIS outperforms all the other clustering algorithms compared
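The cluster removal idea can be sketched in a simplified form. The variant below is an assumption made for illustration and may differ from the published IS method in how the removal cost is computed. It estimates the cost of removing a cluster as the increase in squared error when each of its vectors moves to the nearest remaining code vector, removes the cheapest cluster, then repartitions all vectors and recomputes the centroids. All names are hypothetical.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def centroid(vectors):
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))


def assign(X, C):
    """Index of the nearest code vector for every data vector."""
    return [min(range(len(C)), key=lambda j: sq_dist(x, C[j])) for x in X]


def iterative_shrinking(X, C, M):
    """Remove clusters one at a time until M remain (assumes 1 <= M <= len(C))."""
    C = [tuple(c) for c in C]
    P = assign(X, C)
    while len(C) > M:
        best = None
        for a in range(len(C)):
            others = [c for j, c in enumerate(C) if j != a]
            # Increase in squared error if the vectors of cluster a are
            # moved to their nearest remaining code vectors.
            inc = sum(min(sq_dist(x, c) for c in others) - sq_dist(x, C[a])
                      for x, p in zip(X, P) if p == a)
            if best is None or inc < best[0]:
                best = (inc, a)
        del C[best[1]]
        P = assign(X, C)
        for j in range(len(C)):
            members = [x for x, p in zip(X, P) if p == j]
            if members:
                C[j] = centroid(members)
    return C, P


# Toy data (hypothetical): the middle cluster is the cheapest to remove.
X = [(0, 0), (1, 0), (10, 0), (11, 0), (5, 0)]
C0 = [(0.5, 0.0), (10.5, 0.0), (5.0, 0.0)]
print(iterative_shrinking(X, C0, 2))
```

Unlike a merge, a removal lets the vectors of the removed cluster scatter to different neighbors, which is the extra freedom the slide refers to.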

Page 23

Example of the PNN method

[Figure: clusters S1 to S5 before and after the cluster merge. Code vectors: vectors to be merged, remaining vectors. Data vectors: data vectors of the clusters to be merged, other data vectors]

Page 24

Example of the iterative shrinking method

[Figure: clusters S1 to S5 before and after the cluster removal. Code vectors: vector to be removed, remaining vectors. Data vectors: data vectors of the cluster to be removed, other data vectors]

Page 25

The PNN and IS in the search of the number of clusters

[Figure: F-ratio (0.000080 to 0.000120) versus number of clusters (25 down to 5) for data set S4, with curves for IS and PNN and the minimum marked]

Page 26

Time-distortion performance

[Figure: MSE (160 to 190) versus time in seconds (0 to 100000) for repeated K-means, RLS, GAIS, PNN, IS, and SAGA]

Page 27

Publication 5: Optimal clustering

Can be found by considering all possible merge sequences and finding the one that minimizes the optimization function

Can be implemented as a branch-and-bound (BB) technique

Two suboptimal but polynomial-time variants: Piecewise optimization, Look-Ahead optimization

Page 28

Example of non-redundant search tree

[Figure: search tree over the clusters A, B, C, D, and E, with nodes labelled by merged clusters such as A B, A B C, and A B C D]

Branches that do not have any valid clustering have been cut out

Page 29

Illustration of the Piecewise optimization

[Figure: N clusters, N - Z clusters, N - 2Z clusters, N - 3Z clusters, and finally M clusters (final result), with Z merge steps between consecutive stages]

Page 30

Comparative results

[Figure: MSE (160 to 180) versus running time in seconds (1 to 100000) for the Bridge data set. Methods: GAIS (short), GAIS (long), IS, PNN, Standard k-means, PNN+PDS+MPS+Lazy, Graph-PNN, Graph-PNN+K-means, K-means+PDS+MPS+Activity]

Page 31

Comparative results

[Figure: MSE (5.5 to 8.0) versus running time in seconds (1 to 1000000) for the House data set. Methods: GAIS (long), GAIS (short), Standard k-means, PNN, IS, PNN+PDS+MPS+Lazy, Graph-PNN+K-means, Graph-PNN, K-means+PDS+MPS+Activity]

Page 32

Comparative results

[Figure: MSE (5.0 to 6.0) versus running time in seconds (1 to 1000000) for the Miss America data set. Methods: Standard k-means, GAIS (long), GAIS (short), IS, PNN, PNN+PDS+MPS+Lazy, Graph-PNN, Graph-PNN+K-means, K-means+PDS+MPS+Activity]

Page 33

Example of clustering

[Figure: two panels, k-means and agglomerative clustering]

Page 34

Conclusions

Several speed-up methods: projection-based search, partial distortion search, k nearest neighbor graph

Efficient O(N·logN) time implementation for the 1-dimensional case

Generalization of the merge phase by the cluster removal philosophy (IS) for better quality

Optimal clustering based on the PNN method