geometry of similarity searchandoni/similaritysearch_simons.pdfmeasuring similarity 000000 011100...

Post on 12-Aug-2020

4 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Geometry of

Similarity Search

Alex Andoni(Columbia University)

Find pairs of similar images

2

how should we

measure similarity?

Naรฏvely: about ๐‘›2 comparisons

Can we do better?

Measuring similarity

000000

011100

010100

000100

010100

011111

000000

001100

000100

000100

110100

111111

objects โ‡’ high-dimensional vectors

similarity โ‡’ distance b/w vectors

Image courtesy of Kristen Grauman

0,1 ๐‘‘

Hamming dist.

๐‘…๐‘‘

Euclidean dist.

Sets of points

Earth-Mover

Distance

Problem: Nearest Neighbor Search (NNS)

Preprocess: a set ๐‘ƒ of points

Query: given a query point ๐‘ž, report a point

๐‘โˆ— โˆˆ ๐‘ƒ with the smallest distance to ๐‘ž

Primitive for: finding all similar pairs

But also clustering problems, and many other

problems on large set of multi-feature objects

Applications:

speech/image/video/music recognition, signal

processing, bioinformatics, etcโ€ฆ

๐‘ž

๐‘โˆ—

4

๐‘›: number of points

๐‘‘: dimension

Preamble: How to check for an exact match ?

5

Query time Space

๐‘‚(log ๐‘›) ๐‘‚(๐‘›)

just pre-sort !Preprocess:

Sort the points

Query:

Perform binary search

Also works for NNS for 1-dimensional vectorsโ€ฆ

1.1 2 5 6 7 7.33.5

000000

011100

010100

000100

010100

011111

000000

001100

000100

000100

110100

111111

000000

000000

000000

000000

000000

000000

โ€ฆ

001100

011100

001100

001100

001100

001100

High-dimensional case

6

Algorithm Query time Space

No indexing ๐‘‚(๐‘› โ‹… ๐‘‘) ๐‘‚(๐‘› โ‹… ๐‘‘)

Full indexing ๐‘‚(๐‘‘) 2๐‘‘

0,1 ๐‘‘

Hamming dist.

Overprepared: store an answer

for every possible query

Best indexing ? ๐‘‚(๐‘‘) ๐‘‚(๐‘› โ‹… ๐‘‘)

Underprepared: no

preprocessing

unaffordable if ๐‘‘ โ‰ซ log ๐‘›

A little better

indexing ?

๐‘›0.99 ๐‘‚(๐‘›2)

Curse of dimensionality:

would refute a (very) strong

version of ๐‘ท โ‰  ๐‘ต๐‘ท conjecture

[Williamsโ€™04]

๐‘› = 1,000,000,000

๐‘‘ = 400

Relaxed problem: Approximate Near Neighbor Search

๐‘Ÿ-near neighbor: given a query point ๐‘ž,

report a point ๐‘โ€ฒ โˆˆ ๐‘ƒ s.t. ๐‘โ€ฒ โˆ’ ๐‘ž โ‰ค ๐‘Ÿ

as long as there is some point within

distance ๐‘Ÿ

Remarks:

In practice: used as a filter

Randomized algorithms: each point

reported with 90% probability

Can use to solve nearest neighbor too

[HarPeled-Indyk-Motwaniโ€™12]

๐‘Ÿ

๐‘ž

๐‘โˆ—

๐‘โ€ฒ

๐‘๐‘Ÿ

๐‘๐‘Ÿ

7

similar

not similar

kind ofโ€ฆ

either way

Approach: Locality Sensitive Hashing

Map ๐‘” on ๐‘…๐‘‘ s.t. for any points ๐‘, ๐‘ž

for similar pairs (when ๐‘ž โˆ’ ๐‘ โ‰ค ๐‘Ÿ )

๐‘”(๐‘ž) = ๐‘”(๐‘)

for dissimilar pairs (when ๐‘ž โˆ’ ๐‘โ€ฒ > ๐‘๐‘Ÿ )

๐‘” ๐‘ž โ‰  ๐‘” ๐‘โ€ฒ

๐‘ž

๐‘

๐‘ž โˆ’ ๐‘

Pr[๐‘”(๐‘ž) = ๐‘”(๐‘)]

๐‘Ÿ ๐‘๐‘Ÿ

1

๐‘ƒ1

๐‘ƒ2

๐‘›๐œŒ, where

๐‘ž

Pr[๐‘”(๐‘ž) = ๐‘”(๐‘)] is not-too-low๐‘ƒ1 =

๐‘ƒ2 =

๐œŒ =log 1/๐‘ƒ1log 1/๐‘ƒ2

8

๐‘โ€ฒMap: points โ†’ codes: s.t.

โ€œsimilarโ€ โ‡” โ€œexact matchโ€

Pr[๐‘”(๐‘ž) = ๐‘”(๐‘)] is low

[Indyk-Motwaniโ€™98]

How to construct good maps?

several indexes

Use an index on ๐‘”(๐‘) for ๐‘ โˆˆ ๐‘ƒ

Space Time Exponent ๐’„ = ๐Ÿ

๐‘›1+๐œŒ ๐‘›๐œŒ ๐œŒ = 1/๐‘ ๐œŒ = 1/2

Map #1 : random grid

9

๐‘ž ๐‘

๐‘โ€ฒ

Map ๐‘”:

โ€ข partition in a regular grid

โ€ข randomly shifted

โ€ข randomly rotated

Can we do better?

[Datar-Indyk-Immorlica-Mirrokniโ€™04]

Regular grid โ†’ grid of balls ๐‘ can hit empty space, so take more such grids

until ๐‘ is in a ball

How many grids? about ๐‘‘๐‘‘

start by projecting in dimension ๐‘ก

Choice of reduced dimension ๐‘ก? ๐œŒ closer to bound for higher ๐‘ก Number of grids is ๐‘ก๐‘‚(๐‘ก)

Map #2 : ball carving

2D

๐‘

๐‘๐‘…๐‘ก

[A-Indykโ€™06]

Space Time Exponent ๐’„ = ๐Ÿ

๐‘›1+๐œŒ ๐‘›๐œŒ ๐œŒ โ†’ 1/๐‘2 ๐œŒ โ†’ 1/4

Similar space partitions ubiquitous:

11

Approximation algorithms [Goemans, Williamson 1995], [Karger, Motwani,

Sudan 1995], [Charikar, Chekuri, Goel, Guha, Plotkin 1998], [Chlamtac,

Makarychev, Makarychev 2006], [Louis, Makarychev 2014]

Spectral graph partitioning [Lee, Oveis Gharan, Trevisan 2012], [Louis,

Raghavendra, Tetali, Vempala 2012]

Spherical cubes [Kindler, Oโ€™Donnell, Rao, Wigderson 2008]

Metric embeddings [Fakcharoenphol, Rao, Talwar 2003], [Mendel, Naor 2005]

Communication complexity [Bogdanov, Mossel 2011], [Canonne, Guruswami,

Meka, Sudan 2015]

LSH Algorithms for Euclidean space

12

Space Time Exponent ๐’„ = ๐Ÿ Reference

๐‘›1+๐œŒ ๐‘›๐œŒ ๐œŒ = 1/๐‘ ๐œŒ = 1/2 [IMโ€™98, DIIMโ€™04]

๐œŒ โ‰ˆ 1/๐‘2 ๐œŒ = 1/4 [AIโ€™06]

Is there even better LSH map?

NO: any map must satisfy

๐œŒ โ‰ฅ 1/๐‘2

[Motwani-Naor-Panigrahyโ€™06, Oโ€™Donell-Wu-Zhouโ€™11]

Example of isoperimetry, example of which is question:

Among bodies in ๐‘…๐‘‘ of volume 1, which has the lowest perimeter?

A ball!

Some other LSH algorithms

Hamming distance

๐‘”: pick a random coordinate(s) [IMโ€™98]

Manhattan distance:

๐‘”: cell in a randomly shifted grid

Jaccard distance between sets:

๐ฝ ๐ด, ๐ต =๐ดโˆฉ๐ต

๐ดโˆช๐ต

๐‘”: pick a random permutation ๐œ‹ on the words

๐‘” ๐ด = min๐‘Žโˆˆ๐ด

๐œ‹(๐‘Ž)

min-wise hashing

[Broderโ€™97, Christiani-Paghโ€™17]

13

To be or

not to beTo search or

not to search

โ€ฆ21102โ€ฆ

be

toor

not

sketc

h

โ€ฆ01122โ€ฆ

be

toor

not

sear

ch

โ€ฆ11101โ€ฆ โ€ฆ01111โ€ฆ

{be,not,or,to} {not,or,to,search}

1 1

not

not

for ๐œ‹=be,to,search,or,not

be to

LSH is tightโ€ฆ whatโ€™s next?

14

Space-time trade-offsโ€ฆ[Panigrahyโ€™06, A.-Indykโ€™06, Kapralovโ€™15,

A.-Laarhoven-Razenshteyn-Waingartenโ€™17]

Datasets with additional structure[Clarksonโ€™99,

Karger-Ruhlโ€™02,

Krauthgamer-Leeโ€™04,

Beygelzimer-Kakade-Langfordโ€™06,

Indyk-Naorโ€™07,

Dasgupta-Sinhaโ€™13,

Abdullah-A.-Krauthgamer-Kannanโ€™14,โ€ฆ]

Are we really done with basic NNS algorithms?

Beyond Locality Sensitive Hashing?

15

Non-example:

define ๐‘” ๐‘ž to be the identity of closest point to ๐‘ž

computing ๐‘”(๐‘ž) is as hard as the problem-to-be-solved!

Can get better maps, if allowed to

depend on the dataset!

Can get better, efficient maps, if

depend on the dataset!

Space Time Exponent ๐’„ = ๐Ÿ Reference

๐‘›1+๐œŒ ๐‘›๐œŒ ๐œŒ โ‰ˆ 1/๐‘2 ๐œŒ = 1/4 [AIโ€™06]

๐œŒ โ‰ˆ1

2๐‘2 โˆ’ 1

๐œŒ = 1/7 [A.-Indyk-Nguyen-Razenshteynโ€™14,

A.-Razenshteynโ€™15]

best LSH

algorithm

โ€œ Iโ€™ll tell you where to find

The Origin of Species once

you recite all existing books

New Approach: Data-dependent LSH

16

Two new ideas:

has LSH with better quality ๐œŒ

data-dependent

[A-Razenshteynโ€™15]

1) a nice point configuration

2) can always reduce to such configuration

17

As if vectors chosen randomly from Gaussian distribution

Points on a unit sphere, where

๐‘๐‘Ÿ โ‰ˆ 2, i.e., dissimilar pair is (near) orthogonal

Similar pair: ๐‘Ÿ = 2/๐‘

Like ball carving

Curvature helps get better quality partition

๐‘๐‘Ÿ

1) a nice point configuration

Map ๐‘”:

โ€ข Randomly slice out caps

on sphere surface

18

A worst-case to (pseudo-)random-case reduction

a form of โ€œregularity lemmaโ€

Lemma: any pointset ๐‘ƒ โˆˆ ๐‘…๐‘‘ can be decomposed

into clusters, where one cluster is pseudo-random

and the rest have smaller diameter

1) a nice point configuration

2) can always reduce to such configuration

๐‘๐‘Ÿ

Beyond Euclidean space

19

Data-dependent hashing:

Better algorithms for Hamming space

Also algorithms for distances where vanilla LSH does not work!

E.g.: distance ||๐‘ฅ โˆ’ ๐‘ฆ||โˆž = max๐‘–=1..๐‘‘

|๐‘ฅ๐‘– โˆ’ ๐‘ฆ๐‘–| [Indykโ€™98, โ€ฆ]

Even more beyond?

Approach 3: metric embeddings

Geometric reduction b/w

different spaces

Rich theory in Functional Analysis

0,1 ๐‘‘

Hamming dist.

Sets of points

Earth-Mover Distance

(Wasserstein space)

[Charikarโ€™02,

Indyk-Thaperโ€™04,

Naor-Schechtmanโ€™06,

A-Indyk-Krauthgamerโ€™08โ€ฆ]

Summary: Similarity Search

20

Different applications lead to different geometries

Connects to rich mathematical areas: Space partitions and isoperimetry: whatโ€™s the body with least perimeter?

Metric embeddings: can we map some geometries into others well?

Only recently we (think we) understood the Euclidean metric Properties of many other geometries remain unsolved!

objects

similarity

high-dimensional vectors

distance b/w vectors

Similarity

Search

Nearest Neighbor

Search

Geometry

To search or

not to search

top related