Pairwise Nearest Neighbor Method Revisited (Parittainen yhdistelymenetelmä uudistettuna)
UNIVERSITY OF JOENSUU DEPARTMENT OF COMPUTER SCIENCE JOENSUU, FINLAND
Olli Virmajoki
11.12.2004
Clustering
An important combinatorial optimization problem that must often be solved as part of more complicated tasks in data analysis, pattern recognition, data mining, and other fields of science and engineering
Entails partitioning a data set so that similar objects are grouped together and dissimilar objects are placed in separate groups
Example of data sets

Employment statistics:
ID   Postal zone        Self employed  Civil servants  Clerks  Manual workers
800  Munchen                    56750           57218  300201          242375
801  Munchen-land ost            7684            5790   20279           23491
802  Munchen-land sued           3780            1977   11058            7398
803  Munchen-land west           7226            5623   25571           20380
804  Munchen-land nord           2226            1305    9347           12432
805  Freising                    8187            5140   14632           24377
806  Dachau                      8165            2763   11638           24489
807  Ingolstadt                  5810            5212   15019           30532

RGB data:
R   G   B
26  20  45
28   5  46
28  12  44
23  13  46
31   4  51
Summary of data sets

Data set        Type of data set  Number of data vectors (N)  Number of clusters (M)  Dimension of data vector
Bridge          Gray-scale        4096                        256                     16
House           RGB               34112                       256                     3
Miss America    Residual vectors  6480                        256                     16
Data set S1-S4  Synthetic         5000                        15                      2
Data sets
An example of clustering
Clustering
Given a set of N data vectors $X = \{x_1, x_2, \ldots, x_N\}$ in K-dimensional space, clustering aims at solving the partition $P = \{p_1, p_2, \ldots, p_N\}$, which defines for each data vector the index of the cluster to which it is assigned.
Cluster: $s_a = \{x_i \mid p_i = a\}$
Clustering: $S = \{s_1, s_2, \ldots, s_M\}$
Codebook: $C = \{c_1, c_2, \ldots, c_M\}$
Cost function: $f(P, C) = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - c_{p_i} \rVert^2$
Combinatorial optimization problem: minimize $f(P, C)$
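The cost function can be made concrete with a small sketch. It assumes that each code vector is the centroid of its cluster (the slides do not state this explicitly), uses 0-based cluster indices, and the names `cost` and `centroids` are illustrative only.

```python
def squared_distance(x, c):
    """Squared Euclidean distance between two K-dimensional vectors."""
    return sum((xk - ck) ** 2 for xk, ck in zip(x, c))


def cost(X, P, C):
    """f(P, C): mean squared distance from each x_i to its code vector c_{p_i}."""
    return sum(squared_distance(x, C[p]) for x, p in zip(X, P)) / len(X)


def centroids(X, P, M):
    """Codebook built from the partition: the mean vector of every cluster.

    Assumes that no cluster is empty.
    """
    K = len(X[0])
    sums = [[0.0] * K for _ in range(M)]
    counts = [0] * M
    for x, p in zip(X, P):
        counts[p] += 1
        for k in range(K):
            sums[p][k] += x[k]
    return [[s / counts[a] for s in sums[a]] for a in range(M)]


# Toy data: N = 4 vectors, K = 2 dimensions, M = 2 clusters.
X = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
P = [0, 0, 1, 1]            # 0-based cluster index of each vector
C = centroids(X, P, M=2)    # [[0.0, 1.0], [10.0, 2.0]]
print(cost(X, P, C))        # 2.5
```

Here the four squared distances are 1, 1, 4 and 4, so the mean is 2.5.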
Clustering algorithms
Heuristic methods
Optimization methods: K-means, genetic algorithms
Graph-theoretical methods
Hierarchical methods: divisive, agglomerative (Finnish: yhdistelevä)
Agglomerative clustering
N = 22 (number of data points)
M = 3 (number of final clusters)
Ward’s method (PNN in VQ)
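Merge cost: $d_{a,b} = \frac{n_a n_b}{n_a + n_b} \lVert c_a - c_b \rVert^2$, where $n_a$ and $n_b$ are the cluster sizes and $c_a$, $c_b$ the code vectors
Local optimization strategy: $(a, b) = \arg\min_{i,j \in [1,N],\; i \neq j} d_{i,j}$
Nearest neighbor search is needed for:
(1) finding the cluster pair to be merged
(2) updating the NN pointers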
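A minimal sketch of the merge cost and the local optimization strategy, using a plain full search over all cluster pairs. It assumes every data vector starts as its own cluster and that a merged code vector is the size-weighted mean of the two merged ones. The function names are illustrative.

```python
def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def merge_cost(na, ca, nb, cb):
    """Ward's merge cost: n_a * n_b / (n_a + n_b) * ||c_a - c_b||^2."""
    return na * nb / (na + nb) * squared_distance(ca, cb)


def merge(na, ca, nb, cb):
    """Size and centroid of the merged cluster (size-weighted mean)."""
    n = na + nb
    return n, tuple((na * a + nb * b) / n for a, b in zip(ca, cb))


def pnn(X, M):
    """Merge the cheapest cluster pair until M clusters remain."""
    sizes = [1] * len(X)
    cents = [tuple(float(v) for v in x) for x in X]
    while len(cents) > M:
        best = None
        # Full search: every pair (i, j) with i < j is evaluated.
        for i in range(len(cents)):
            for j in range(i + 1, len(cents)):
                d = merge_cost(sizes[i], cents[i], sizes[j], cents[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, a, b = best
        sizes[a], cents[a] = merge(sizes[a], cents[a], sizes[b], cents[b])
        del sizes[b], cents[b]      # b > a, so index a stays valid
    return sizes, cents


sizes, cents = pnn([(0, 0), (0, 1), (5, 5), (6, 5), (20, 0)], M=3)
print(sizes)   # [2, 2, 1]
print(cents)   # [(0.0, 0.5), (5.5, 5.0), (20.0, 0.0)]
```

The pair search costs O(N^2) per merge, so this naive version is O(N^3) overall.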
The PNN method
M = 5000, M = 4999, M = 4988, ..., M = 50, ..., M = 16, M = 15
[Figure panels: M = 5000, M = 50, M = 16, M = 15]
Nearest neighbor pointers
[Figure: nearest neighbor pointers between clusters a, b, c, d, e, f, g]
Fast exact PNN method: reduces the number of nearest neighbor searches in each iteration, from O(N^3) to Ω(N^2)
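A sketch of the nearest neighbor pointer idea, under the assumption that only the merged cluster and the clusters whose pointer referred to one of the two merged clusters need a new search. This is safe for the merge cost n_a n_b / (n_a + n_b) ||c_a - c_b||^2, because a merged cluster is never cheaper to reach than the cheaper of its two parts. The dictionary layout and the `searches` counter are illustrative choices, not taken from the slides.

```python
def merge_cost(na, ca, nb, cb):
    sq = sum((p - q) ** 2 for p, q in zip(ca, cb))
    return na * nb / (na + nb) * sq


def nearest(i, sizes, cents):
    """Full search: the cluster j != i with the smallest merge cost to i."""
    best_j, best_d = None, float("inf")
    for j in cents:
        if j != i:
            d = merge_cost(sizes[i], cents[i], sizes[j], cents[j])
            if d < best_d:
                best_j, best_d = j, d
    return best_j, best_d


def fast_pnn(X, M):
    sizes = {i: 1 for i in range(len(X))}
    cents = {i: tuple(float(v) for v in x) for i, x in enumerate(X)}
    nn = {i: nearest(i, sizes, cents) for i in cents}   # id -> (neighbor, cost)
    searches = len(cents)
    while len(cents) > M:
        a = min(nn, key=lambda i: nn[i][1])   # cheapest pointer, O(N)
        b = nn[a][0]
        n = sizes[a] + sizes[b]
        cents[a] = tuple((sizes[a] * p + sizes[b] * q) / n
                         for p, q in zip(cents[a], cents[b]))
        sizes[a] = n
        del sizes[b], cents[b], nn[b]
        # Only the merged cluster and clusters that pointed at a or b are re-searched.
        stale = [i for i in nn if i == a or nn[i][0] in (a, b)]
        for i in stale:
            nn[i] = nearest(i, sizes, cents)
        searches += len(stale)
    return sizes, cents, searches


sizes, cents, searches = fast_pnn([(0, 0), (0, 1), (5, 5), (6, 5), (20, 0)], M=3)
print(sorted(cents.values()))   # [(0.0, 0.5), (5.5, 5.0), (20.0, 0.0)]
```

Each iteration performs only as many O(N) searches as there are stale pointers, instead of re-evaluating all cluster pairs.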
Combining the PNN and k-means
[Figure: codebook size on a scale marked N, M0, M and 1; labels: random selection, standard PNN, k-means, combined PNN]
PNN as a crossover method in the genetic algorithm
Two random codebooks, M = 15
Combined codebook M = 30 and final codebook M = 15
[Figure panels: Initial 1, Initial 2, Combined (union), Result of PNN]
Publication 1: Speed-up methods
Partial distortion search (PDS)
Mean-distance-ordered search (MPS)
  Uses the component means of the vectors
  Derives a precondition for the distance calculations
Reduction of the run time to 2-15%
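Both speed-up ideas can be sketched for a plain nearest code vector search. The sketch assumes squared Euclidean distance and uses the bound ||x - c||^2 >= K (mean(x) - mean(c))^2, which follows from the Cauchy-Schwarz inequality. It applies only the mean-based precondition and does not sort the codebook by mean. The function name and the `skipped` counter are illustrative.

```python
def nearest_code_vector(x, codebook):
    """Return (index, squared distance, number of candidates rejected by the precondition)."""
    K = len(x)
    mean_x = sum(x) / K
    best_i, best_d = -1, float("inf")
    skipped = 0
    for i, c in enumerate(codebook):
        # Mean-based precondition: the bound already reaches the best
        # distance so far, so the full distance need not be computed.
        # (In practice the code vector means would be precomputed.)
        if K * (mean_x - sum(c) / K) ** 2 >= best_d:
            skipped += 1
            continue
        # Partial distortion search: stop accumulating as soon as the
        # partial sum reaches the best distance so far.
        d = 0.0
        for xk, ck in zip(x, c):
            d += (xk - ck) ** 2
            if d >= best_d:
                break
        else:
            best_i, best_d = i, d
    return best_i, best_d, skipped


codebook = [(0, 0, 0, 0), (10, 10, 10, 10), (1, 1, 1, 2), (9, 0, 0, -5)]
result = nearest_code_vector((1, 1, 1, 1), codebook)
print(result)   # (2, 1.0, 1)
```

In this example the second code vector is rejected by the precondition alone, and the fourth is abandoned by PDS after its first component.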
Example of the MPS method
[Figure: points A, B, C and A', B', C'; legend: input vector, best candidate]
Publication 2: Graph-based PNN
Based on the exact PNN method
NN search is limited only to the k clusters that are connected by the graph structure
Reduces the time complexity of every search from O(N) to O(k)
Reduction in the run time to 1-4%
Why graph structure?
O(N) searches with the full search (N = 4096)
Only O(k) searches with the graph structure! (k = 3)
Sample graph
Publication 3: Multilevel thresholding
Can be considered a special case of vector quantization (VQ), where the vectors are 1-dimensional
Existing method: O(N^2)
PNN thresholding can be implemented in O(N·log N)
The proposed method works in real time for any number of thresholds
Distances in heap structure
[Figure: merge distances stored in a heap; labels: minimum distance, remove, update; complexities O(1) and O(log N)]
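A minimal sketch of 1-dimensional PNN thresholding with a heap. It assumes that only adjacent clusters on the gray-level axis are merged, and it invalidates outdated heap entries lazily with version counters instead of updating them in place. The function name and the histogram format (gray level mapped to count) are illustrative.

```python
import heapq


def pnn_thresholds(histogram, num_classes):
    """Return the upper gray level of every class except the last one.

    Assumes num_classes >= 1.
    """
    levels = sorted(g for g, n in histogram.items() if n > 0)
    N = len(levels)
    size = [histogram[g] for g in levels]
    mean = [float(g) for g in levels]
    last = list(levels)                 # largest gray level in each cluster
    prev = list(range(-1, N - 1))       # -1 means "no left neighbor"
    nxt = list(range(1, N + 1))         # N means "no right neighbor"
    alive = [True] * N
    version = [0] * N

    def cost(a, b):
        return size[a] * size[b] / (size[a] + size[b]) * (mean[a] - mean[b]) ** 2

    heap = [(cost(a, a + 1), a, a + 1, 0, 0) for a in range(N - 1)]
    heapq.heapify(heap)
    remaining = N
    while remaining > num_classes:
        d, a, b, va, vb = heapq.heappop(heap)            # O(log N)
        if not (alive[a] and alive[b]) or version[a] != va or version[b] != vb:
            continue                                     # outdated entry
        n = size[a] + size[b]                            # merge b into a
        mean[a] = (size[a] * mean[a] + size[b] * mean[b]) / n
        size[a], last[a], alive[b] = n, last[b], False
        version[a] += 1
        nxt[a] = nxt[b]
        if nxt[a] < N:
            prev[nxt[a]] = a
            heapq.heappush(heap, (cost(a, nxt[a]), a, nxt[a],
                                  version[a], version[nxt[a]]))
        if prev[a] >= 0:
            heapq.heappush(heap, (cost(prev[a], a), prev[a], a,
                                  version[prev[a]], version[a]))
        remaining -= 1
    return [last[a] for a in range(N) if alive[a]][:-1]


hist = {10: 5, 11: 5, 12: 4, 100: 3, 101: 3, 200: 6}
thresholds = pnn_thresholds(hist, num_classes=3)
print(thresholds)   # [12, 101]
```

Every merge pushes at most two new entries, so the heap holds O(N) entries in total and the whole run takes O(N·log N) time.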
Publication 4: Iterative shrinking (IS)
Generates the clustering by a sequence of cluster removal operations
In the IS method the vectors can be reassigned more freely than in the PNN method
Can be applied as a crossover method in the genetic algorithm (GAIS)
GAIS outperforms all other clustering algorithms
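The cluster removal idea can be sketched as follows. This is a simplified illustration, not the published algorithm: the removal cost here is the increase in squared error when every vector of the removed cluster moves to its nearest remaining code vector, with the centroids held fixed during the cost evaluation. All names are illustrative.

```python
def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def centroid(vectors):
    K = len(vectors[0])
    return tuple(sum(v[k] for v in vectors) / len(vectors) for k in range(K))


def removal_cost(a, clusters, cents):
    """Error increase if cluster a is removed (simplified: centroids held fixed)."""
    total = 0.0
    for x in clusters[a]:
        nearest = min(squared_distance(x, cents[j]) for j in cents if j != a)
        total += nearest - squared_distance(x, cents[a])
    return total


def iterative_shrinking(X, M):
    """Remove the cheapest cluster until M clusters remain (assumes M >= 1)."""
    clusters = {i: [tuple(x)] for i, x in enumerate(X)}
    cents = {i: centroid(v) for i, v in clusters.items()}
    while len(clusters) > M:
        a = min(clusters, key=lambda i: removal_cost(i, clusters, cents))
        orphans = clusters.pop(a)
        del cents[a]
        touched = set()
        # Each vector of the removed cluster goes to its own nearest cluster,
        # so the vectors may end up in different clusters.
        for x in orphans:
            j = min(cents, key=lambda i: squared_distance(x, cents[i]))
            clusters[j].append(x)
            touched.add(j)
        for j in touched:
            cents[j] = centroid(clusters[j])
    return clusters, cents


clusters, cents = iterative_shrinking([(0, 0), (0, 1), (5, 5), (6, 5), (20, 0)], M=3)
print(sorted(cents.values()))   # [(0.0, 0.5), (5.5, 5.0), (20.0, 0.0)]
```

Unlike a pairwise merge, the removal step does not force all vectors of the removed cluster into one neighbor, which is where the extra freedom in reassignment comes from.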
Example of the PNN method
[Figure: code vectors and data vectors of clusters S1 to S5, before and after the cluster merge. Legend: vectors to be merged, remaining vectors, data vectors of the clusters to be merged, other data vectors]
Example of the iterative shrinking method
[Figure: code vectors and data vectors of clusters S1 to S5, before and after the cluster removal. Legend: vector to be removed, remaining vectors, data vectors of the cluster to be removed, other data vectors]
The PNN and IS in the search of the number of clusters
[Figure: data set S4. F-ratio (0.000080 to 0.000120) versus number of clusters (25 down to 5); curves for IS and PNN, with the minimum marked]
Time-distortion performance
[Figure: MSE (160 to 190) versus time in seconds (0 to 100000, logarithmic scale). Methods: repeated K-means, RLS, GAIS, PNN, IS, SAGA]
Publication 5: Optimal clustering
Can be found by considering all possible merge sequences and finding the one that minimizes the optimization function
Can be implemented as a branch-and-bound (BB) technique
Two suboptimal, but polynomial-time, variants:
  Piecewise optimization
  Look-Ahead optimization
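A small sketch of the search over merge sequences with a simple bound. It assumes the merge cost n_a n_b / (n_a + n_b) ||c_a - c_b||^2, whose sum along a merge sequence equals the total squared error of the resulting clustering. Because merge costs are non-negative, a branch can be cut as soon as its accumulated cost reaches the best complete solution found so far. The sketch does not remove redundant merge orders and is exponential in time, so it is only meant for very small inputs. The names are illustrative.

```python
def merge_cost(na, ca, nb, cb):
    sq = sum((p - q) ** 2 for p, q in zip(ca, cb))
    return na * nb / (na + nb) * sq


def optimal_clustering(X, M):
    """Exhaustive search over merge sequences, pruned by the best cost so far."""
    best = {"cost": float("inf"), "clusters": None}

    def search(clusters, cost_so_far):
        if cost_so_far >= best["cost"]:
            return                      # bound: this branch cannot improve
        if len(clusters) == M:
            best["cost"], best["clusters"] = cost_so_far, clusters
            return
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (na, ca), (nb, cb) = clusters[i], clusters[j]
                n = na + nb
                merged = (n, tuple((na * p + nb * q) / n for p, q in zip(ca, cb)))
                rest = [c for k, c in enumerate(clusters) if k != i and k != j]
                search(rest + [merged], cost_so_far + merge_cost(na, ca, nb, cb))

    # Every data vector starts as a cluster of size 1.
    search([(1, tuple(float(v) for v in x)) for x in X], 0.0)
    return best["cost"], best["clusters"]


total, clusters = optimal_clustering([(0, 0), (0, 1), (5, 5), (6, 5), (20, 0)], M=3)
print(total)   # 1.0
```

On this toy input the optimal result coincides with the greedy one, but the exhaustive search does not depend on that.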
Example of non-redundant search tree
[Figure: search tree over the merge sequences of clusters A, B, C, D and E]
Branches that do not have any valid clustering have been cut out
Illustration of the Piecewise optimization
[Figure: N clusters, N - Z clusters, N - 2Z clusters, N - 3Z clusters, ..., M clusters (final result); Z merge steps between consecutive stages]
Comparative results
[Figure: Bridge. MSE (160 to 180) versus running time in seconds (1 to 100000, logarithmic scale). Methods: GAIS (short), GAIS (long), IS, PNN, standard k-means, PNN+PDS+MPS+Lazy, Graph-PNN, Graph-PNN+K-means, K-means+PDS+MPS+Activity]
Comparative results
[Figure: House. MSE (5.5 to 8.0) versus running time in seconds (1 to 1000000, logarithmic scale). Methods: GAIS (long), GAIS (short), standard k-means, PNN, IS, PNN+PDS+MPS+Lazy, Graph-PNN+K-means, Graph-PNN, K-means+PDS+MPS+Activity]
Comparative results
[Figure: Miss America. MSE (5.0 to 6.0) versus running time in seconds (1 to 1000000, logarithmic scale). Methods: standard k-means, GAIS (long), GAIS (short), IS, PNN, PNN+PDS+MPS+Lazy, Graph-PNN, Graph-PNN+K-means, K-means+PDS+MPS+Activity]
Example of clustering
[Figure panels: k-means; agglomerative clustering]
Conclusions
Several speed-up methods:
  Projection-based search
  Partial distortion search
  k nearest neighbor graph
Efficient O(N·log N) time implementation for the 1-dimensional case
Generalization of the merge phase by the cluster removal philosophy (IS) for better quality
Optimal clustering based on the PNN method