applications similarity search - cs.tau.ac.ilnin/courses/advsem0405a/nearneigh-intr… · minkowsi...
Similarity Search
Meir Cohen
Applications
• content-based retrieval (streams, time series, text)
• case-based reasoning (machine learning)
• orthogonal range search (databases)
• pattern recognition (signal processing)
• novelty detection
• clusters creation, segmentation
• data compression
• bioinformatics & computational chemistry
• Computational Geometry
Definitions
• S is a set of distinct objects (the database) from a space $X$.
• |S| = n is the number of objects.
• Q is a set of queries, also from $X$.
• f is a similarity measure between objects.
• $f : X \times X \to [0,1]$
– 0 for identical objects and
– 1 for non-similar, independent objects.
• Weighted graph (matrix) – embedding in $R^d$
Similarity Search
• NN - Nearest Neighbor search: $\arg\min_{s \in S} f(s, q)$
• Range search (rect. window)
• Proximity search: $\{ s \in S : f(s, q) < r \}$
• $(1+\varepsilon)$-Approximate: $s \in S : f(s, q) < (1+\varepsilon) \min_{s' \in S} f(s', q)$
• Probabilistic: $\Pr\left( s = \arg\min_{s' \in S} f(s', q) \right) > 1 - \delta$
• k-NN
• post office problem
Similarity Measure
• Humans: object definition (segmentation, features) and similarity measure are subjective, context dependent and dynamic. (visual paradoxes)
• Similarity measure is based on the semantics of the data.
• Noise modeling. (error modeling)
• Transformation modeling.
• Data modeling - feature extraction.
• Common measures - Hamming, Correlation, L1, L2, L∞, Lp, Euclidean distance, Minkowski norm, Metric, Dynamic Partial, Mutual Information, Edit distance.
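Two of the common measures listed for this slide, the Minkowski (Lp) family and the Hamming distance, can be sketched in a few lines. This is a minimal illustration only: the function names `minkowski` and `hamming` are assumptions of this sketch, and the values are raw distances, not normalized into the [0,1] range used in the definitions.

```python
def minkowski(x, y, p=2.0):
    """L_p (Minkowski) distance; p=1 is L1, p=2 is Euclidean, p=inf is L-infinity."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        # L-infinity: the largest coordinate-wise difference.
        return max(diffs, default=0.0)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(x, y))
```

For example, `minkowski([0, 0], [3, 4], p=1)` sums the coordinate differences, while `p=float("inf")` keeps only the largest one.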
Minkowski Norms
Search Setups
• Off Line - S & Q are fixed
• On Line - Q unknown in advance
• Dynamic - insertions & deletions to/from S
• Time / Space constraints (streams)
• Batch (continuous - sliding window)
• Approximate
• Distributed - different dynamic S at each node (approx.)
• Integrated with an existing system (rel. DB)
• Simplicity (Implementation)
Continuous Batch Query
Exhaustive Search
Simply the Best for High Dim.
• Given a query q, compute f(q, s) for every s in S and return the most similar one.
• Time: O(n) evaluations of f - O(n d)
• Space: O(|q|) - O(d)
• I/Os: O(n) (sequential scan [x10])
• Update: Insert & Delete in O(1)
• Batch: no additional I/Os
• Parallel: O(n) processors
– O(d log(n)) time and O(1) I/Os
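The exhaustive scan described on this slide can be sketched as follows. This is a minimal illustration, assuming f is the Euclidean distance; the helper names `l2`, `nearest`, `k_nearest` and `within` are choices of this sketch, not taken from the slides.

```python
import heapq


def l2(x, y):
    """Euclidean distance, used here as the (assumed) similarity measure f."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def nearest(S, q, f=l2):
    """NN query: one evaluation of f per database object, O(1) extra space."""
    return min(S, key=lambda s: f(s, q))


def k_nearest(S, q, k, f=l2):
    """k-NN query: the k objects with the smallest f(s, q), closest first."""
    return heapq.nsmallest(k, S, key=lambda s: f(s, q))


def within(S, q, r, f=l2):
    """Proximity query: every object with f(s, q) < r."""
    return [s for s in S if f(s, q) < r]
```

Because the scan keeps no index, inserting into or deleting from `S` needs no extra bookkeeping, which is the O(1) update cost noted above.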
Static NNS
• Unsupervised learning (clustering)
• PCA – principal component analysis (SVD, KLT, LSI)
• SVM – support vector machine
• VQ – vector quantization
• BP – back propagation
• RBF – radial basis functions
Optimizations
• Prestructuring (similarity measure properties)
– Indexing Method : searching a data structure
• Partial Distances (multi scale)
• Approximating
– Dimension Reduction
  • distance preserving – DFT, SVD, wavelet, ...
  • low rate distortion (contraction, expansion).
  • feature extraction.
  • random projections.
• Editing, Pruning or Condensing (borders)
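One common reading of the "Partial Distances" item is early abandoning: accumulate the squared differences coordinate by coordinate and stop as soon as the partial sum can no longer beat the best candidate found so far. The sketch below assumes squared Euclidean distance; the names `partial_sq_distance` and `nearest_partial` are hypothetical, and the slides may intend a different multi-scale scheme.

```python
def partial_sq_distance(x, y, bound):
    """Squared L2 distance, or None as soon as the partial sum reaches `bound`."""
    total = 0.0
    for a, b in zip(x, y):
        total += (a - b) ** 2
        if total >= bound:
            # The remaining coordinates can only increase the sum: abandon early.
            return None
    return total


def nearest_partial(S, q):
    """Exhaustive NN search that abandons each distance once it exceeds the best so far."""
    best, best_sq = None, float("inf")
    for s in S:
        d = partial_sq_distance(s, q, best_sq)
        if d is not None:
            best, best_sq = s, d
    return best
```

The result is the same as a full scan; only the number of coordinate operations per candidate changes.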
Cost Models
• Main memory
– CPU operations
– Distance evaluations
• Secondary memory
– accessed blocks / pages
• Construction / Update
– Insert
– Delete
• Data Distribution - Worst, Average, Best
• Integrated with an existing system (rel. DB)
• Simplicity (Implementation)
Dimension Scalability
• d=1 Binary search
– Pre: O(n log(n)) Space: O(n) Search Time: O(log(n))
• d=2 Voronoi Diag., Fractional cascading, persistence
– Pre: O(n log(n)) Space: O(n) Search Time: O(log(n))
• d>2 Exponential dependency on d.
• d>20 No better than exhaustive search.
• d>360 Unaware of any tests.
• General metric space:
– intrinsic dim > 20: no better than exhaustive search.
0 < low < 6 < high < 20 < very high < 400 < huge?
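The d=1 case above can be sketched with Python's standard `bisect` module: sorting is the O(n log(n)) preprocessing step and each query is an O(log(n)) binary search. The function names `build_index` and `nearest_1d` are assumptions of this sketch.

```python
from bisect import bisect_left


def build_index(values):
    """Preprocessing: sort the database once, O(n log n) time and O(n) space."""
    return sorted(values)


def nearest_1d(sorted_values, q):
    """Nearest neighbor of q on the real line, found by binary search."""
    if not sorted_values:
        raise ValueError("empty database")
    i = bisect_left(sorted_values, q)
    # Only the elements just below and just above the insertion point can be nearest.
    candidates = sorted_values[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda v: abs(v - q))
```

No comparably simple ordering exists for d>2, which is one way to see why the costs listed above deteriorate with the dimension.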
Curse of Dimensionality
• General Metric
– Discrete Metric
– Histogram of Distances
• Vector Space
– Cube volume grows exponentially.
– Points are sparse.
– The variance of the distances becomes small.
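Data Distribution
• Uniform (independent) (worst case)
• Sparse
• Homogeneity
• Order of Arrival
• Intrinsic Dimension – embedding
– Correlation Fractal Dimension
– VC Dimension
• Dimension Reduction
• Domain Reshape
$\rho = \frac{\mu^2}{2\sigma^2}$, where $\mu$ and $\sigma^2$ are the mean and variance of the histogram of distances.
$(U(X^d), L_p)$: $\mu = \Theta(d^{1/p})$, $\sigma = \Theta(d^{1/p - 1/2})$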
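The intrinsic dimension formula can be checked empirically by sampling pairwise distances and plugging their mean and variance into the ratio of squared mean to twice the variance. The sketch below is an illustration under stated assumptions: points drawn uniformly from the unit cube, the Lp distance, and hypothetical helper names (`lp`, `intrinsic_dimension`, `uniform_cube`).

```python
import random
import statistics


def lp(x, y, p=2.0):
    """L_p distance between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def intrinsic_dimension(points, f=lp, pairs=2000, seed=0):
    """Estimate mu^2 / (2 sigma^2) from distances between randomly sampled pairs."""
    rng = random.Random(seed)
    dists = []
    for _ in range(pairs):
        x, y = rng.sample(points, 2)
        dists.append(f(x, y))
    mu = statistics.fmean(dists)
    var = statistics.pvariance(dists, mu)
    return mu * mu / (2.0 * var)


def uniform_cube(n, d, seed=0):
    """n points drawn independently and uniformly from [0, 1]^d."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(d)] for _ in range(n)]
```

With a mean of order d^(1/p) and a standard deviation of order d^(1/p - 1/2), the ratio grows linearly in d, so the estimate for `uniform_cube(300, 16)` should come out well above the one for `uniform_cube(300, 2)`.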
Dynamic Indexing Methods
– Space Organizing – $R^n$
– Space Partitioning - (kd-tree)
– Data Organizing
– Metric Spaces
– Pivoting – internal
– $R^n$
– Pivoting – external - (LSH)
– Space Partitioning – (R-tree)
Prestructuring (99-00)
• Vector Space Index Structures (C. Bohm 2000)
– Hierarchical Indexes
  • Data organizing structures (R-trees)
    – Metric Spaces (M-tree)
  • Space organizing structures (LSD-h-tree)
  • insert & delete in $O(\log(n))$ ? $O(\log^{d-1}(n))$ ?
  • construction time $O(n \log(n))$ ? $O(n \log^{d-1}(n))$ ?
– Bucket Indexes (hashing, grid file)
  • Mixture of Gaussians (EM)
• Static NNS problems (T. Panayiotis 1999)
– Random Sampling & Voronoi Diag.
– Parametric Search
Prestructuring (01)
• Metric Spaces Methods
– Building Set of Equivalence Classes
– Discarding Some Classes
– Searching Exhaustively the Rest
– $F_0([x],[y]) = \min_{x \in [x],\, y \in [y]} \{ f(x,y) \}$
• Voronoi Partitions (centers)
• Pivoting
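The build / discard / search-the-rest scheme can be illustrated with a simple pivot table, in the spirit of LAESA-like methods. This is a sketch under assumptions: f is a metric (the triangle inequality holds), pivots are chosen at random, and the class name `PivotIndex` and its methods are hypothetical. Objects with the same distances to all pivots play the role of an equivalence class.

```python
import random


def l2(x, y):
    """Euclidean distance, used here as the (assumed) metric f."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


class PivotIndex:
    def __init__(self, S, f, num_pivots=4, seed=0):
        rng = random.Random(seed)
        self.S = list(S)
        self.f = f
        self.pivots = rng.sample(self.S, min(num_pivots, len(self.S)))
        # table[i][j] = f(S[i], pivots[j]), computed once at build time.
        self.table = [[f(s, p) for p in self.pivots] for s in self.S]

    def range_query(self, q, r):
        """Return (objects with f(s, q) < r, number of f evaluations on database objects)."""
        dq = [self.f(q, p) for p in self.pivots]
        matches, evaluations = [], 0
        for s, ds in zip(self.S, self.table):
            # Triangle inequality: |f(q, p) - f(s, p)| <= f(q, s),
            # so a gap of at least r to any pivot rules s out without evaluating f(s, q).
            if any(abs(a - b) >= r for a, b in zip(dq, ds)):
                continue
            evaluations += 1
            if self.f(s, q) < r:
                matches.append(s)
        return matches, evaluations
```

The query pays one evaluation of f per pivot up front, then evaluates f only on the objects that survive the filter.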
Metric Spaces
[Figure: pivoting. Taken from E. Chavez 2001]
Indexing Algorithms for Metric Spaces
[Figure: pivoting (median). Taken from E. Chavez 2001]
[Figure: pivoting, GHT. Taken from E. Chavez 2001]
[Figure: Voronoi. Taken from C. Bohm 2000]
[Figure: taxonomy of indexing algorithms for metric spaces. Category labels: Pivot-Based, Voronoi Type, Hyperplane, Covering Radius, Scope Coarsened, Coarsified, Fixed Height, Trees, Arrays. Method labels: GNAT, GHT, SAT, VT, MT, BST, VPF, BKT, MVPT, FMVPA, FHQA, LAESA-like, AESA, FHQT, FQT, FMVPT. Taken from E. Chavez 2001]
R Tree
[Figure: R-tree. Taken from P. Agarwal 1999]
Data Organizing Structures
• Metric Spaces: BKT(73-94), VPT(91-99), MT(97-02), GHT(91), GNAT(95), FQT(94)
• Spatial Access Methods ($R^d$): RT(84), R+T(87), R*T(90), XT(96), SST(96), SRT(97), VA-File(98), IQT(00), TVT(95), Pyramid Tree (98)
Based on C. Bohm 2000
K-d-B tree
[Figure: K-d-B tree. Taken from C. Bohm 2000]
Space Organizing Structures
• Spatial Access Methods ($R^d$): kdT(79), kdBT(81), hBT(94), LSDhT(98)
• Update: O(log(n))
• Space: O(n d)
• Search: O(n d), O(d log(n)) (T. Panayiotis 1999)
Based on C. Bohm 2000
Exact NNS (T. Panayiotis 1999)
Friedman et al. - Optimized k-d-tree
Taken from E. Chavez 2001
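Exact nearest neighbor search in a k-d tree can be sketched as below. This is a plain textbook k-d tree that cycles through the axes and splits at the median, not the optimized variant of Friedman et al. named above; the function names and the nested-tuple node layout are assumptions of this sketch.

```python
def sq_dist(x, y):
    """Squared Euclidean distance (enough for comparing candidates)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def build_kdtree(points, depth=0):
    """Split at the median along axis depth mod d; a node is (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def nearest_kd(node, q, best=None):
    """Return the stored point closest to q, pruning subtrees that cannot contain it."""
    if node is None:
        return best
    point, axis, left, right = node
    if best is None or sq_dist(point, q) < sq_dist(best, q):
        best = point
    gap = q[axis] - point[axis]
    near, far = (left, right) if gap < 0 else (right, left)
    best = nearest_kd(near, q, best)
    # The far side can only hold a closer point if the splitting plane
    # is nearer to q than the best candidate found so far.
    if gap * gap < sq_dist(best, q):
        best = nearest_kd(far, q, best)
    return best
```

The pruning test is what degrades in high dimension: the splitting plane is almost always closer than the current best, so most of the tree ends up being visited.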
Taken from Indyk Lecture
(Orthogonal)
Taken from Indyk Lecture
Approx. NNS (T. Panayiotis 1999)
Approx. L2
Taken from Indyk Lecture
SR-tree
Google?
Approx. L
Taken from Indyk Lecture
A Few More Points
– Nearest neighbor by binary search of near neighbor.
– Continuous batch queries using FFT.
– 2-Stable LSH in depth.
– Yianilos in depth.
– Empirical evaluation – graphs.
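As a pointer for the 2-stable LSH item, the basic hash for the L2 norm projects a vector onto a random Gaussian direction, adds a random offset, and cuts the line into buckets of width w. The sketch below builds a single hash table from k such hashes; the class name `L2LSH` and the defaults for `k` and `w` are illustrative assumptions, and a practical index would use several tables.

```python
import math
import random
from collections import defaultdict


class L2LSH:
    """One hash table keyed by k concatenated hashes h(v) = floor((a . v + b) / w)."""

    def __init__(self, dim, k=4, w=1.0, seed=0):
        rng = random.Random(seed)
        self.w = w
        # Gaussian entries are 2-stable; offsets are uniform over one bucket width.
        self.a = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]
        self.b = [rng.uniform(0.0, w) for _ in range(k)]
        self.buckets = defaultdict(list)

    def key(self, v):
        """Bucket key of v: one integer per hash function."""
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / self.w)
            for a, b in zip(self.a, self.b)
        )

    def insert(self, v):
        self.buckets[self.key(v)].append(v)

    def candidates(self, q):
        """Objects sharing q's bucket; they still need an exact distance check."""
        return list(self.buckets.get(self.key(q), []))
```

Nearby points are likely, but not guaranteed, to share a bucket, so the answer is approximate and probabilistic in the sense defined earlier.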
Omitted Topics
• Lower Bounds
• Fingerprints, Signatures
• Query Languages for Similarity Search
• Metric Space Mapping into Vector Space
• Clustering
• Semi-Group, Partition-Graph, ...
• Geometric/Simplex/Half-Space/Rectangle Range-Searching (Counting/Reporting)
Further Research
• Schemes for Evaluations of Algorithms.
– Real Data
– Distributions
– Similarity Measures
• Better Exact Methods.
• Better Approx. Methods.