Geometry of Similarity Search (andoni/similaritysearch_simons.pdf)
Posted on 12-Aug-2020

TRANSCRIPT
Geometry of
Similarity Search
Alex Andoni (Columbia University)
Find pairs of similar images
How should we measure similarity?
Naïvely: about n^2 comparisons.
Can we do better?
Measuring similarity

[Figure: images encoded as binary vectors, e.g. 000000, 011100, 010100, 000100, …; image courtesy of Kristen Grauman]

objects → high-dimensional vectors
similarity → distance b/w vectors
{0,1}^d: Hamming dist.
R^d: Euclidean dist.
Sets of points: Earth-Mover Distance
Problem: Nearest Neighbor Search (NNS)
Preprocess: a set P of n points
Query: given a query point q, report a point
p* in P with the smallest distance to q
Primitive for: finding all similar pairs
But also clustering problems, and many other
problems on large sets of multi-feature objects
Applications:
speech/image/video/music recognition, signal
processing, bioinformatics, etc.
(n: number of points, d: dimension)
Preamble: how to check for an exact match?

Preprocess: just pre-sort! Sort the points.
Query: perform binary search.
Query time: O(log n); space: O(n).
Also works for NNS for 1-dimensional vectors…
Example (sorted): 1.1  2  3.5  5  6  7  7.3
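The pre-sort idea above fits in a few lines. A minimal illustrative sketch (function and variable names are my own), covering both the exact-match test and 1-dimensional NNS:

```python
import bisect

# Illustrative sketch of "pre-sort, then binary search" (names are my own).
points = [7.3, 2, 5, 1.1, 6, 7, 3.5]
points.sort()  # preprocess once: O(n log n)

def exact_match(q):
    """O(log n) membership test via binary search."""
    i = bisect.bisect_left(points, q)
    return i < len(points) and points[i] == q

def nearest_neighbor_1d(q):
    """O(log n) nearest neighbor for 1-dimensional points."""
    i = bisect.bisect_left(points, q)
    # The answer is one of the (at most) two points bracketing q.
    candidates = points[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda p: abs(p - q))
```

This is exactly why the 1-dimensional case is easy: sorting gives a total order that binary search can exploit, which is what fails in high dimensions.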
[Figure: the database of binary vectors before and after sorting; an exact-match query is answered by binary search over the sorted list.]
High-dimensional case ({0,1}^d, Hamming dist.)

Algorithm                  Query time   Space
No indexing                O(n·d)       O(n·d)
Full indexing              O(d)         2^d
A little better indexing?  n^0.99       O(n^2)
Best indexing?             O(d)         O(n·d)

No indexing is underprepared: no preprocessing at all.
Full indexing is overprepared: store an answer for every possible query,
which is unaffordable if d >> log n.
Curse of dimensionality: even the "little better" row (query time n^0.99
with O(n^2) space) would refute a (very) strong version of the P != NP
conjecture [Williams'04].
(Think n = 1,000,000,000 and d = 400.)
Relaxed problem: Approximate Near Neighbor Search

c-approximate r-near neighbor: given a query point q,
report a point p' in P s.t. ||p' − q|| <= cr,
as long as there is some point within distance r.

Remarks:
In practice: used as a filter.
Randomized algorithms: each point is reported with 90% probability.
Can be used to solve nearest neighbor too [HarPeled-Indyk-Motwani'12].

[Figure: query q with radii r and cr; points within r are "similar", points
beyond cr are "not similar", and points in between are "kind of" similar,
so either answer is acceptable.]
Approach: Locality-Sensitive Hashing [Indyk-Motwani'98]

Map: points → codes, s.t. "similar" → "exact match".
Map h on R^d s.t. for any points p, q:
for similar pairs (when ||p − q|| <= r): Pr[h(p) = h(q)] is not-too-low, = P1
for dissimilar pairs (when ||p − q|| > cr): Pr[h(p) = h(q)] is low, = P2

[Plot: Pr[h(p) = h(q)] as a function of ||p − q||: at least P1 up to
distance r, at most P2 beyond cr.]

Quality of h is measured by the exponent rho = log(1/P1) / log(1/P2).
How to construct good maps?
Use an index on g(p), for several indexes g (each g concatenating basic hashes h).

Space      Time    Exponent      c = 2
n^(1+rho)  n^rho   rho = 1/c     rho = 1/2
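To make "several indexes" concrete, here is a minimal, hypothetical sketch for Hamming space: the base hash h samples one random coordinate [IM'98], a map g concatenates k of them into an exact-match key, and the query probes L independent hash tables. All names and parameters are my own, not the paper's.

```python
import random
from collections import defaultdict

def build_index(points, d, k, L, seed=0):
    """Build L hash tables; each uses a map g = (h_1, ..., h_k),
    where every h samples one random coordinate."""
    rng = random.Random(seed)
    gs = [[rng.randrange(d) for _ in range(k)] for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for p in points:
        for g, table in zip(gs, tables):
            code = tuple(p[i] for i in g)  # g(p): an exact-match key
            table[code].append(p)
    return gs, tables

def query(q, gs, tables):
    """Collect every point colliding with q in at least one table."""
    candidates = set()
    for g, table in zip(gs, tables):
        candidates.update(table[tuple(q[i] for i in g)])
    return candidates

points = [(0, 1, 1, 1, 0, 0), (0, 0, 0, 1, 0, 0), (1, 1, 1, 1, 1, 1)]
gs, tables = build_index(points, d=6, k=2, L=8)
near = query((0, 1, 1, 1, 0, 1), gs, tables)
```

Larger k makes dissimilar collisions rarer (driving query cost toward n^rho), while larger L restores the collision probability for similar pairs, which is where the n^(1+rho) space comes from.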
Map #1: random grid [Datar-Indyk-Immorlica-Mirrokni'04]

Map h:
• partition space into a regular grid
• randomly shifted
• randomly rotated

Can we do better?
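A randomly shifted grid is easy to sketch. This is a simplified illustration (the random rotation is omitted, and the side length w and all names are my own choices):

```python
import random

def make_grid_hash(d, w, seed=0):
    """Hash a point in R^d to the cell of a randomly shifted grid
    of side length w (random rotation omitted for brevity)."""
    rng = random.Random(seed)
    shift = [rng.uniform(0, w) for _ in range(d)]
    def h(p):
        # Cell index along each axis, after the random shift.
        return tuple(int((p[i] + shift[i]) // w) for i in range(d))
    return h

h = make_grid_hash(d=2, w=1.0)
# Nearby points usually share a cell; distant points almost never do.
same = h((0.2, 0.2)) == h((0.25, 0.18))
far = h((0.2, 0.2)) == h((5.0, 5.0))
```

The random shift is what makes the guarantee probabilistic: a close pair is separated only when a cell boundary happens to fall between the two points.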
Map #2: ball carving [A.-Indyk'06]

Regular grid → grid of balls: a point can hit empty space, so take more such
grids until the point is in a ball. How many grids? About n^rho.
Start by projecting into dimension t. Choice of the reduced dimension t?
rho gets closer to the bound for higher t, but the number of grids is t^O(t).

Space      Time    Exponent       c = 2
n^(1+rho)  n^rho   rho ≈ 1/c^2    rho ≈ 1/4
Similar space partitions are ubiquitous:
Approximation algorithms [Goemans, Williamson 1995], [Karger, Motwani,
Sudan 1995], [Charikar, Chekuri, Goel, Guha, Plotkin 1998], [Chlamtac,
Makarychev, Makarychev 2006], [Louis, Makarychev 2014]
Spectral graph partitioning [Lee, Oveis Gharan, Trevisan 2012], [Louis,
Raghavendra, Tetali, Vempala 2012]
Spherical cubes [Kindler, O'Donnell, Rao, Wigderson 2008]
Metric embeddings [Fakcharoenphol, Rao, Talwar 2003], [Mendel, Naor 2005]
Communication complexity [Bogdanov, Mossel 2011], [Canonne, Guruswami,
Meka, Sudan 2015]
LSH algorithms for Euclidean space

Space      Time    Exponent       c = 2        Reference
n^(1+rho)  n^rho   rho = 1/c      rho = 1/2    [IM'98, DIIM'04]
                   rho ≈ 1/c^2    rho ≈ 1/4    [AI'06]

Is there an even better LSH map?
NO: any LSH map must satisfy rho >= 1/c^2
[Motwani-Naor-Panigrahy'06, O'Donnell-Wu-Zhou'11].

This is an instance of isoperimetry, the classical question being:
among bodies in R^d of volume 1, which has the lowest perimeter?
A ball!
Some other LSH algorithms

Hamming distance:
h: pick a random coordinate (or several) [IM'98]
Manhattan distance:
h: cell in a randomly shifted grid
Jaccard distance between sets, with similarity J(A,B) = |A ∩ B| / |A ∪ B|:
h: pick a random permutation pi on the words and set h(A) = min_{a in A} pi(a)
("min-wise hashing" [Broder'97, Christiani-Pagh'17])
[Figure: min-hash example. "To be or not to be" → {be, not, or, to};
"To search or not to search" → {not, or, to, search}. A random permutation
pi of the words {be, to, search, or, not} assigns each set a sketch, and
the two sketches collide with probability J(A, B).]
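The min-hash construction can be simulated directly. A small sketch (helper names are my own) estimating the collision probability of the two word sets over many random permutations:

```python
import random

def make_minhash(vocab, seed=0):
    """h(A) = the element of A that a random permutation pi ranks first."""
    pi = list(vocab)
    random.Random(seed).shuffle(pi)
    rank = {word: i for i, word in enumerate(pi)}
    return lambda A: min(A, key=lambda a: rank[a])

A = {"be", "not", "or", "to"}        # "To be or not to be"
B = {"not", "or", "to", "search"}    # "To search or not to search"
vocab = ["be", "to", "search", "or", "not"]

# Empirical Pr[h(A) = h(B)] over many independent permutations;
# it should approach J(A, B) = |A & B| / |A | B| = 3/5.
hits = sum(make_minhash(vocab, seed=s)(A) == make_minhash(vocab, seed=s)(B)
           for s in range(2000))
estimate = hits / 2000
```

The key property: h(A) = h(B) exactly when the permutation's first-ranked word of A ∪ B lies in A ∩ B, which happens with probability |A ∩ B| / |A ∪ B|.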
LSH is tight… what's next?

Space-time trade-offs [Panigrahy'06, A.-Indyk'06, Kapralov'15,
A.-Laarhoven-Razenshteyn-Waingarten'17]
Datasets with additional structure [Clarkson'99, Karger-Ruhl'02,
Krauthgamer-Lee'04, Beygelzimer-Kakade-Langford'06, Indyk-Naor'07,
Dasgupta-Sinha'13, Abdullah-A.-Krauthgamer-Kannan'14, …]
Are we really done with basic NNS algorithms?
Beyond Locality-Sensitive Hashing?

Non-example: define h(q) to be the identity of the closest point to q.
Computing h(q) is then as hard as the problem-to-be-solved!
("I'll tell you where to find The Origin of Species once you recite all
existing books.")

But we can get better, efficient maps if they are allowed to depend on the
dataset!

Space      Time    Exponent             c = 2       Reference
n^(1+rho)  n^rho   rho ≈ 1/c^2          rho ≈ 1/4   [AI'06] (best LSH algorithm)
                   rho ≈ 1/(2c^2 − 1)   rho = 1/7   [A.-Indyk-Nguyen-Razenshteyn'14,
                                                     A.-Razenshteyn'15]
New approach: data-dependent LSH [A.-Razenshteyn'15]

Two new ideas:
1) a nice point configuration has LSH with better quality rho
2) can always reduce to such a configuration
1) A nice point configuration

Points on a unit sphere, where cr ≈ sqrt(2), i.e., a dissimilar pair is
(near-)orthogonal, as if the vectors were chosen randomly from a Gaussian
distribution. Similar pair: r = sqrt(2)/c.

Map h: randomly slice out caps on the sphere's surface.
Like ball carving, but the curvature helps get a better-quality partition.
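One way to picture "slicing out caps" in code: sample random Gaussian directions and hash a unit vector to the first cap that contains it. This is a simplified sketch in the spirit of spherical LSH, not the paper's actual construction; the threshold eta, the cap count, and all names are my own assumptions.

```python
import math
import random

def make_cap_hash(d, num_caps, eta, seed=0):
    """Hash a unit vector p to the index of the first random direction
    g_i whose spherical cap contains it, i.e. <g_i, p> >= eta."""
    rng = random.Random(seed)
    dirs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(num_caps)]
    def h(p):
        for i, g in enumerate(dirs):
            if sum(gj * pj for gj, pj in zip(g, p)) >= eta:
                return i
        return -1  # fell outside every cap; in practice, sample more caps
    return h

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

h = make_cap_hash(d=25, num_caps=200, eta=1.0)
rng = random.Random(1)
p = normalize([rng.gauss(0, 1) for _ in range(25)])
cap = h(p)
# Nearby unit vectors tend to fall into the same cap; near-orthogonal
# vectors rarely do, which is the source of the better exponent.
```

For a unit vector p and a Gaussian direction g, the inner product is a standard Gaussian, so each cap catches p with a fixed probability and 200 caps cover it with overwhelming probability.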
2) Can always reduce to such a configuration

A worst-case to (pseudo-)random-case reduction; a form of "regularity lemma".
Lemma: any pointset P in R^d can be decomposed into clusters, where one
cluster is pseudo-random and the rest have smaller diameter.
Beyond Euclidean space

Data-dependent hashing gives:
Better algorithms for Hamming space.
Algorithms for distances where vanilla LSH does not work!
E.g.: the l_∞ distance ||x − y||_∞ = max_{i=1..d} |x_i − y_i| [Indyk'98, …]
Even more beyond?

Approach 3: metric embeddings, i.e., geometric reductions between different
spaces; a rich theory in Functional Analysis.
E.g.: sets of points under Earth-Mover Distance (Wasserstein space) can be
(approximately) embedded into ({0,1}^d, Hamming dist.) [Charikar'02,
Indyk-Thaper'04, Naor-Schechtman'06, A.-Indyk-Krauthgamer'08, …]
Summary: Similarity Search

Different applications lead to different geometries.
Connects to rich mathematical areas:
Space partitions and isoperimetry: what's the body with the least perimeter?
Metric embeddings: can we map some geometries into others well?
Only recently have we (we think) understood the Euclidean metric; the
properties of many other geometries remain unsolved!

objects → high-dimensional vectors
similarity → distance b/w vectors
Similarity Search → Nearest Neighbor Search → Geometry

To search or not to search…