applications similarity search - cs.tau.ac.ilnin/courses/advsem0405a/nearneigh-intr… · minkowsi...


Page 1: Applications Similarity Search - cs.tau.ac.ilnin/Courses/AdvSem0405A/NearNeigh-intr… · Minkowsi Norms Search Setups Ł Off Line - S & Q are fixed Ł On Line - Q unknown in advance

Similarity Search

Meir Cohen

Applications

• content-based retrieval (streams, time series, text)

• case-based reasoning (machine learning)

• orthogonal range search (databases)

• pattern recognition (signal processing)

• novelty detection

• cluster creation, segmentation

• data compression

• bioinformatics & computational chemistry

• Computational Geometry

Definitions

• S is a set of distinct objects - the database - from a space X.

• |S| = n is the number of objects.

• Q is a set of queries, also from X.

• f is a similarity measure between objects.

• $f : X \times X \to [0,1]$

– 0 for identical objects and

– 1 for non-similar, independent objects.

• Weighted graph (matrix) – embedding in R^d

Similarity Search

• NN - Nearest Neighbor search: $\arg\min_{s \in S} f(s, q)$

• Range search (rect. window)

• Proximity search: $\{ s \in S : f(s, q) < r \}$

• (1+ε)-Approximate: $s \in S : f(s, q) < (1+\varepsilon) \min_{s' \in S} f(s', q)$

• Probabilistic: $\Pr\big(s = \arg\min_{s' \in S} f(s', q)\big) > 1 - \delta$

• k-NN

• post office problem
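The query types on this slide can be written down directly as brute-force functions. The following is a minimal sketch, not an efficient implementation: the function names, the threshold name `r`, and the toy dissimilarity `abs_diff` are illustrative assumptions, and the approximate check uses `<=` (the slide writes a strict `<`) so that an exact nearest neighbor always qualifies.

```python
import heapq


def nearest_neighbor(S, q, f):
    """Exact NN: the s in S that minimizes f(s, q)."""
    return min(S, key=lambda s: f(s, q))


def proximity_search(S, q, f, r):
    """All objects whose dissimilarity to q is below the threshold r."""
    return [s for s in S if f(s, q) < r]


def k_nearest(S, q, f, k):
    """The k objects with the smallest f(s, q), closest first."""
    return heapq.nsmallest(k, S, key=lambda s: f(s, q))


def is_approx_nn(S, q, f, s, eps):
    """True if s is within a factor (1 + eps) of the best achievable value.

    Uses <= instead of the strict < on the slide (an assumption), so that
    an exact nearest neighbor passes even when the minimum is 0.
    """
    best = min(f(t, q) for t in S)
    return f(s, q) <= (1 + eps) * best


def abs_diff(a, b):
    """Toy dissimilarity on numbers (hypothetical stand-in for f)."""
    return abs(a - b)


S = [1.0, 4.0, 9.0, 10.5]
q = 8.0
nn = nearest_neighbor(S, q, abs_diff)
close = proximity_search(S, q, abs_diff, r=3.0)
top2 = k_nearest(S, q, abs_diff, k=2)
```

Every function here evaluates f on all of S, which is exactly the exhaustive baseline that the indexing methods later in the deck try to beat.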

Similarity Measure

• Humans: object definition (segmentation, features) and similarity measure are subjective, context dependent and dynamic. (visual paradoxes)

• Similarity measure is based on the semantics of the data.

• Noise modeling. (error modeling)

• Transformation modeling.

• Data modeling - feature extraction.

• Common measures - Hamming, Correlation, L1, L2, L∞, Lp, Euclidean distance, Minkowski norm, Metric, Dynamic Partial, Mutual Information, Edit distance.
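A few of the common measures named on this slide are short enough to sketch in plain Python. The function names are illustrative, and none of these is normalized to the [0,1] range used in the Definitions slide; how to rescale them is left open here.

```python
def minkowski(x, y, p):
    """L_p (Minkowski) distance between two equal-length vectors.

    p = 1 gives L1, p = 2 gives the Euclidean distance, and
    p = float("inf") gives the maximum (L-infinity) norm.
    """
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(a != b for a, b in zip(x, y))


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]
```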


Minkowski Norms

Search Setups

• Off Line - S & Q are fixed

• On Line - Q unknown in advance

• Dynamic - insertions & deletions to/from S

• Time / Space constraints (streams)

• Batch (continuous - sliding window)

• Approximate

• Distributed - different dynamic S at each node (approx.)

• Integrated with an existing system (rel. DB)

• Simplicity (Implementation)

Continuous Batch Query

Exhaustive Search

Simply the Best for High Dim.

• Given a query q, compute f(q, s) for every s in S and return the most similar one.

• Time: O(n) evaluations of f - O(n d)

• Space: O(|q|) - O(d)

• I/Os: O(n) (sequential scan [x10])

• Update: Insert & Delete in O(1)

• Batch: no additional I/Os

• Parallel: O(n) processors

– O(d log(n)) time and O(1) I/Os
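The "Batch: no additional I/Os" point can be illustrated with a scan that reads each database object once and updates the running best answer of every query in the batch. This is a sketch under simple assumptions: S is an in-memory list standing in for a sequentially read file, and the function name and the evaluation counter are illustrative.

```python
import math


def batch_exhaustive_nn(S, queries, f):
    """Answer all queries with one sequential pass over S.

    Returns the nearest object per query and the number of evaluations
    of f, which is len(S) * len(queries).
    """
    best = [(math.inf, None)] * len(queries)
    evaluations = 0
    for s in S:                      # each object is read exactly once
        for i, q in enumerate(queries):
            d = f(q, s)
            evaluations += 1
            if d < best[i][0]:
                best[i] = (d, s)
    return [s for _, s in best], evaluations


S = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
queries = [(0.9, 1.2), (4.0, 4.0)]
answers, evaluations = batch_exhaustive_nn(S, queries, math.dist)
```

The number of distance evaluations still grows with the batch size; only the number of passes over S stays at one.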

Static NNS

• Unsupervised learning (clustering)

• PCA – principal component analysis (SVD, KLT, LSI)

• SVM – support vector machine

• VQ – vector quantization

• BP – back propagation

• RBF – radial basis functions

Optimizations

• Prestructuring (similarity measure properties)

– Indexing Method: searching a data structure

• Partial Distances (multi scale)

• Approximating

– Dimension Reduction

  • distance preserving – DFT, SVD, wavelet, ...

  • low rate distortion (contraction, expansion).

  • feature extraction.

  • random projections.

• Editing, Pruning or Condensing (borders)
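One common reading of the "Partial Distances" optimization is early abandonment: accumulate the squared Euclidean distance coordinate by coordinate and stop as soon as the partial sum can no longer beat the best candidate found so far. The sketch below assumes that reading and Euclidean distance; the function name and the coordinate counter are illustrative.

```python
import math


def partial_distance_nn(S, q):
    """Exhaustive NN under L2 with early abandonment of hopeless candidates.

    Returns (nearest point, its distance, number of coordinates touched).
    """
    best_sq, best = math.inf, None
    coords_touched = 0
    for s in S:
        acc = 0.0
        for a, b in zip(s, q):
            acc += (a - b) ** 2
            coords_touched += 1
            if acc >= best_sq:       # partial sum already too large: abandon s
                break
        else:                        # loop finished, so acc < best_sq
            best_sq, best = acc, s
    return best, math.sqrt(best_sq), coords_touched


S = [(0, 0, 0, 0), (10, 10, 10, 10), (1, 0, 0, 0)]
q = (1, 0, 0, 1)
point, dist, touched = partial_distance_nn(S, q)
```

The result is identical to a full scan; the saving is in coordinates inspected, and it depends on finding a good candidate early.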


Cost Models

• Main memory

– CPU operations

– Distance evaluations

• Secondary memory

– accessed blocks / pages

• Construction / Update

– Insert

– Delete

• Data Distribution - Worst, Average, Best

• Integrated with an existing system (rel. DB)

• Simplicity (Implementation)

Dimension Scalability

• d=1 Binary search

– Pre: O(n log(n)) Space: O(n) Search Time: O(log(n))

• d=2 Voronoi Diag., Fractional cascading, persistence

– Pre: O(n log(n)) Space: O(n) Search Time: O(log(n))

• d>2 Exponential dependency on d.

• d>20 No better than exhaustive search.

• d>360 Unaware of any tests.

• General metric space:

– intrinsic dim > 20: no better than exhaustive search.

0 < low < 6 < high < 20 < very high < 400 < huge?
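The d=1 case is easy to make concrete: sorting is the O(n log(n)) preprocessing, the sorted list is the O(n) index, and a binary search locates the two candidates around the query in O(log(n)). A minimal sketch using the standard `bisect` module (function names are illustrative):

```python
import bisect


def build_index(values):
    """Preprocessing: sort once, O(n log n) time and O(n) space."""
    return sorted(values)


def nn_1d(index, q):
    """Nearest value to q in a non-empty sorted list, O(log n) per query."""
    i = bisect.bisect_left(index, q)
    # Only the elements just below and just above q can be nearest.
    candidates = index[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda v: abs(v - q))


index = build_index([7, 1, 12, 4])
```

For example, `nn_1d(index, 5)` inspects only 4 and 7. No comparably simple total order exists for d > 1, which is where the structures on the following slides come in.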

Curse of Dimensionality

• General Metric

– Discrete Metric

– Histogram of Distances

• Vector Space

– Cube volume grows exponentially.

– Points are sparse.

– The variance of the distances becomes small.

Data Distribution

• Uniform (independent) (worst case)

• Sparse

• Homogeneity

• Order of Arrival

• Intrinsic Dimension – embedding

– Correlation Fractal Dimension

– VC Dimension

• Dimension Reduction

• Domain Reshape

$\rho = \dfrac{\mu^2}{2\sigma^2}$

$(U(X^d), L_p)$: $\mu = \Theta(d^{1/p})$, $\sigma = \Theta(d^{1/p - 1/2})$
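The effect can be checked empirically. The sketch below assumes the garbled formula on this slide is the intrinsic-dimension measure ρ = μ²/(2σ²) used by Chávez et al., with μ and σ² the mean and variance of the histogram of distances, and estimates it by sampling pairs of points drawn uniformly from the unit cube under an L_p distance. The function name, sample size, and seed are illustrative choices.

```python
import random
import statistics


def intrinsic_dimension(d, p=2.0, pairs=2000, seed=0):
    """Monte Carlo estimate of rho = mu^2 / (2 sigma^2) for uniform data.

    mu and sigma^2 are the mean and variance of the L_p distance between
    random pairs of points in the d-dimensional unit cube.
    """
    rng = random.Random(seed)
    dists = []
    for _ in range(pairs):
        x = [rng.random() for _ in range(d)]
        y = [rng.random() for _ in range(d)]
        dists.append(sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p))
    mu = statistics.fmean(dists)
    var = statistics.pvariance(dists, mu)
    return mu * mu / (2.0 * var)


estimates = {d: intrinsic_dimension(d) for d in (2, 16, 64)}
```

If μ grows like d^(1/p) while σ grows like d^(1/p - 1/2), their ratio squared grows linearly in d, so the estimates should rise roughly in proportion to d: the histogram of distances concentrates around its mean, and distance-based pruning loses its power.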

Dynamic Indexing Methods

– Space Organizing – R^n

– Space Partitioning - (kd-tree)

– Data Organizing

– Metric Spaces

– Pivoting – internal

– R^n

– Pivoting – external - (LSH)

– Space Partitioning – (R-tree)

Prestructuring (99-00)

• Vector Space Index Structures (C. Bohm 2000)

– Hierarchical Indexes

  • Data organizing structures (R-trees)

    – Metric Spaces (M-tree)

  • Space organizing structures (LSD-h-tree)

  • insert & delete in O(log(n)) ? O(log^(d-1)(n)) ?

  • construction time O(n log(n)) ? O(n log^(d-1)(n)) ?

– Bucket Indexes (hashing, grid file)

  • Mixture of Gaussians (EM)

• Static NNS problems (T. Panayiotis 1999)

– Random Sampling & Voronoi Diag.

– Parametric Search


Prestructuring (01)

• Metric Spaces Methods

– Building Set of Equivalence Classes

– Discarding Some Classes

– Searching Exhaustively the Rest

– $F_0([x],[y]) = \min_{x \in [x],\, y \in [y]} \{ f(x,y) \}$

• Voronoi Partitions (centers)

• Pivoting

Taken from E. Chavez 2001
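The "discard some classes, search the rest exhaustively" scheme can be sketched with pivots. If f is a metric, the triangle inequality gives |f(q,p) - f(s,p)| <= f(q,s) for any pivot p, so an object whose precomputed pivot distance differs from the query's by at least r cannot lie within r of the query. The code below is a minimal illustration under that assumption; the function names, the choice of pivots, and the strict `< r` range condition are illustrative.

```python
import math


def build_pivot_table(S, pivots, f):
    """Preprocessing: store f(s, p) for every object s and pivot p."""
    return [[f(s, p) for p in pivots] for s in S]


def pivot_range_search(S, table, pivots, f, q, r):
    """Objects with f(q, s) < r, skipping those the pivots rule out.

    Returns (results, number of objects compared directly against q).
    """
    q_to_pivots = [f(q, p) for p in pivots]
    results, evaluated = [], 0
    for s, row in zip(S, table):
        # A gap of at least r to any pivot proves f(q, s) >= r.
        if any(abs(dq - ds) >= r for dq, ds in zip(q_to_pivots, row)):
            continue
        evaluated += 1               # survivors are checked exhaustively
        if f(q, s) < r:
            results.append(s)
    return results, evaluated


S = [(0, 0), (1, 0), (5, 5), (6, 5), (10, 0)]
pivots = [(0, 0)]
table = build_pivot_table(S, pivots, math.dist)
found, evaluated = pivot_range_search(S, table, pivots, math.dist, (5, 4), 1.5)
```

Here a single pivot removes three of the five objects before any direct comparison with the query; the query still pays one evaluation of f per pivot.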

Metric Spaces

[Figure: pivoting. Taken from E. Chavez 2001]

Indexing Algorithms for Metric Spaces

[Figure: pivoting (median). Taken from E. Chavez 2001]

Indexing Algorithm for Metric Spaces

[Figure: GHT, Voronoi. Taken from C. Bohm 2000]

Indexing Algorithms for Metric Spaces

[Figure: taxonomy of indexing algorithms for metric spaces, split into Pivot-Based and Voronoi Type. Group labels: Hyperplane, Covering Radius, Scope Coarsened, Coarsified, Fixed Height, Trees, Arrays, LAESA-like. Structures: GHT, GNAT, SAT, VT, MT, BST, VPF, BKT, MVPT, FMVPA, FHQA, AESA, FHQT, FQT, FMVPT. Taken from E. Chavez 2001]


Data Organizing Structures

[Figure: R Tree. Taken from P. Agarwal 1999]

Data Organizing Structures

• Metric Spaces: BKT (73-94), GHT (91), VPT (91-99), FQT (94), GNAT (95), MT (97-02)

• Spatial Access Methods (R^d): RT (84), R+T (87), R*T (90), TVT (95), XT (96), SST (96), SRT (97), VA-File (98), Pyramid Tree (98), IQT (00)

Based on C. Bohm 2000

Space Organizing Structures

[Figure: K-d-B tree. Taken from C. Bohm 2000]

Space Organizing Structures

• Spatial Access Methods (R^d): kdT (79), kdBT (81), hBT (94), LSDhT (98)

• Update: O(log(n))

• Space: O(n d)

• Search: O(n d), O(d log(n)) (T. Panayiotis 1999)

Based on C. Bohm 2000

Exact NNS (T. Panayiotis 1999)

Friedman et al. - Optimized k-d-tree
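For orientation, here is a small k-d tree with exact nearest neighbor search. It is a plain textbook sketch that splits at the median and cycles through the axes; it does not attempt to reproduce the optimized variant cited on the slide, and the dictionary node layout and function names are illustrative assumptions.

```python
import math


def build_kdtree(points, depth=0):
    """Recursively split the points at the median along a cycling axis."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def kdtree_nn(node, q, best=None):
    """Exact NN under L2; returns (distance, point), or None for an empty tree."""
    if node is None:
        return best
    d = math.dist(node["point"], q)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = q[node["axis"]] - node["point"][node["axis"]]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = kdtree_nn(near, q, best)
    # The far side can only help if the splitting plane is closer than best.
    if abs(diff) < best[0]:
        best = kdtree_nn(far, q, best)
    return best


tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
nearest = kdtree_nn(tree, (9, 2))
```

The pruning test is the step that degrades in high dimension: as distances concentrate, the splitting plane is almost always closer than the current best, so both subtrees get visited and the search approaches a full scan.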

Taken from E. Chavez 2001


[Figure: (Orthogonal). Taken from Indyk lecture]

Approx. NNS (T. Panayiotis 1999)

[Figure. Taken from Indyk lecture]

Approx. L2

[Figure. Taken from Indyk lecture]

SR-tree

Google?

Approx. L

[Figure. Taken from Indyk lecture]

A Few More Points

–Nearest neighbor by binary search of near neighbor.

–Continuous batch queries using FFT.

–2-Stable LSH in depth.

–Yianilos in depth.

–Empirical evaluation – graphs.
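As a pointer for the 2-stable LSH item, the basic hash for L2 projects a vector onto a random Gaussian direction, adds a random offset, and cuts the line into buckets of width w: h(v) = floor((a·v + b) / w). The sketch below shows only this hash and a k-fold concatenation of it; the parameter values, seeds, and function names are illustrative assumptions, and a usable index would also need multiple hash tables and a candidate verification step.

```python
import math
import random


def make_l2_hash(d, w, rng):
    """One 2-stable hash: a ~ N(0,1)^d, b ~ U[0, w), bucket width w."""
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, w)

    def h(v):
        projection = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((projection + b) / w)

    return h


def make_signature(d, w, k, seed=0):
    """Concatenate k independent hashes into one bucket key."""
    rng = random.Random(seed)
    hashes = [make_l2_hash(d, w, rng) for _ in range(k)]
    return lambda v: tuple(h(v) for h in hashes)


signature = make_signature(d=3, w=4.0, k=4, seed=1)
key = signature((0.5, 1.0, -2.0))
```

Nearby points fall into the same bucket with higher probability than distant ones, and concatenating k hashes sharpens that gap at the cost of more misses.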


Omitted Topics

• Lower Bounds

• Fingerprints, Signatures

• Query Languages for Similarity Search

• Metric Space Mapping into Vector Space

• Clustering

• Semi-Group, Partition-Graph, ...

• Geometric/Simplex/Half-Space/Rectangle Range-Searching (Counting/Reporting)

Further Research

• Schemes for Evaluations of Algorithms.

– Real Data

– Distributions

– Similarity Measures

• Better Exact Methods.

• Better Approx. Methods.