TRANSCRIPT
Rasmus Pagh
IT University of Copenhagen · Google Research · BARC
WADS, Edmonton, August 5, 2019
Scalable Similarity Search
Set Similarity – a Survey
Before we start…
Let’s consider three internet technologies launched around 20 years ago
Recommendations
Advanced search
Wildcard operator: “Edm?nt?n map”
Before we start…
What happened to wildcard search and to boolean expressions?
It’s all about sets:
• Recommendations: the set of shopping carts containing “Canadian Train Ride” vs. the set of shopping carts containing “Trans-American Train Ride”.
• Wildcard search: “Edmonton” as a set of (position, character) pairs: (1,E) (2,d) (3,m) (4,o) (5,n) (6,t) (7,o) (8,n).
• Boolean search: the sets of web pages containing “ballroom”, “dance”, and “salsa”.
Setting of this talk
• We are given a collection of sets S1, …, Sn ⊆ U that we are allowed to preprocess.
• Seek answers to queries such as:
- Given i, j, what is the size of Si ∪ Sj? Of Si ∩ Sj?
- Given a set Q, is there an i such that Q ⊆ Si? Such that Q ⊇ Si?
- Given a set Q and an integer t, is there an i such that |Q ∩ Si| ≥ t?
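All of these queries have obvious brute-force answers over the raw collection, which is the baseline the data structures in this talk try to beat. A minimal Python sketch on toy data (the collection and helper names are ours, for illustration only):

```python
# Brute-force baselines for the three query types on a toy collection S_1, ..., S_n.
sets = [{1, 2, 3}, {2, 3, 4, 5}, {5, 6}]

def union_intersection_size(i, j):
    # Given i, j: sizes of S_i ∪ S_j and S_i ∩ S_j.
    return len(sets[i] | sets[j]), len(sets[i] & sets[j])

def superset_exists(Q):
    # Given Q: is there an i with Q ⊆ S_i?
    return any(Q <= S for S in sets)

def overlap_exists(Q, t):
    # Given Q and integer t: is there an i with |Q ∩ S_i| ≥ t?
    return any(len(Q & S) >= t for S in sets)

print(union_intersection_size(0, 1))  # (5, 2)
print(superset_exists({2, 3}))        # True
print(overlap_exists({1, 4, 6}, 2))   # False
```

Each query costs time linear in the total size of the collection, which is exactly what the results below show is hard to improve in the worst case.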
Talk outline
Bad news: (1) similarity computation, (2) similarity search.
Good news: (3) similarity computation, (4) similarity search.
Bad news
• Query: Given i, j, what is the size of Si ∩ Sj?
• [Pǎtraşcu ’10], [Kopelowitz et al. ’14]:
- Assume we can preprocess sets S1, …, Sn ⊆ [n], each of size at most n, in time O(n^1.99) such that it is possible to determine whether Si ∩ Sj = ∅ in time O(n^0.49).
- Then integer 3SUM can be solved in time O(n^1.991).
Suggests polylog(n) query time is not possible without essentially precomputing all answers.
Similarity computation
Bad news
• Given a set Q, is there an i such that Q ⊆ Si?
• [Williams ’04], [Alman & Williams ’15]:
- Assume we can preprocess sets S1, …, Sn ⊆ [n^0.01] in time poly(n) such that it is possible to determine whether ∃i : Q ⊆ Si in time O(n^0.99).
- Then there is a c < 2 such that k-SAT with n variables can be solved in time c^n.
Under the Strong Exponential Time Hypothesis, this is not possible!
Similarity search
The good news…
• We can now explain why nearly no progress on basic set processing problems has been made since the 1970s.
• More constructively, it justifies looking at c-approximate versions of these problems:
- Given i, j, what is the approximate size of Si ∪ Sj and Si ∩ Sj, up to a multiplicative error c > 1?
- Given a set Q and an integer t, is there an i such that |Q ∩ Si| ≥ t, or is |Q ∩ Si| ≤ t/c for all i?
Outline recap: next up is (3), good news for similarity computation.
Similarity estimation attempt 2: Coordinated sampling [Brewer et al. ’72]
• Sample U′ ⊆ U, including each x ∈ U independently with probability α, and let S′i = Si ∩ U′.
• Observe that μ = E[|S′i ∩ S′j|] = α|Si ∩ Sj|, and by Chernoff bounds |S′i ∩ S′j| ≈ μ with probability 1 − e^(−Ω(μ)).
• Can estimate |Si ∩ Sj| ≈ |S′i ∩ S′j| / α if the sampling rate α ≫ 1/|Si ∩ Sj|.
• Time to compute the estimate is |S′i| + |S′j| ≈ α(|Si| + |Sj|).
Drawbacks: Need to store the set U′; sample size is variable.
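A quick simulation of coordinated sampling (the universe size, sampling rate, and sets are our toy choices, not from the talk):

```python
import random

random.seed(1)

# Coordinated sampling on a toy universe: one shared sample U' drives the
# subsamples of both sets, so intersections survive sampling consistently.
U = range(100_000)
alpha = 0.1   # sampling rate

U_prime = {x for x in U if random.random() < alpha}   # x in U' w.p. alpha, indep.

Si = set(range(0, 60_000))
Sj = set(range(40_000, 100_000))   # |Si ∩ Sj| = 20_000

Si_p, Sj_p = Si & U_prime, Sj & U_prime   # S'_i = Si ∩ U', S'_j = Sj ∩ U'

# E[|S'_i ∩ S'_j|] = alpha * |Si ∩ Sj|, so dividing by alpha gives the estimate.
est = len(Si_p & Sj_p) / alpha
print(est)  # close to 20_000
```

Note the drawback from the slide is visible here: both parties must agree on (and store or re-generate) the same U′.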
A mystery of alpine flowers
Bulletin de la Société Vaudoise des Sciences Naturelles
Vol. XXXVII, No. 140, 1901
DISTRIBUTION DE LA FLORE ALPINE DANS LE Bassin des Dranses et dans quelques régions voisines
by Dr Paul JACCARD, professor.
I
In a previous memoir¹, the comparison of the alpine flora of three regions (Trient, Bagnes, and Wildhorn) led me to conclude that the richness in species, and above all the proportion of species particular to each of the compared regions, is appreciably proportional to the variety of their biological conditions. To what extent is this conclusion general? That is what I propose to establish in the present memoir, beginning with an apparent exception to the conclusion I have just recalled. It concerns the Grand Saint-Bernard and the Val d’Entremont.
¹ This work continues a memoir published in last year’s Bulletin de la Soc. vaudoise, vol. XXXVI, entitled Contribution au problème de l’immigration de la flore alpine. It reproduces, in expanded form, the two notes that appeared in the Archives des Sciences physiques et naturelles de Genève, vol. X, October 1900 (L’immigration post-glaciaire et la distribution actuelle de la flore alpine dans quelques régions des Alpes) and in the Comptes rendus du Congrès international de botanique de Paris, 1900, pp. 31–38 (Méthode de détermination de la distribution de la flore alpine).
1901–1996: 41 citations (Google Scholar)
1997–2019: ~2800 citations (Google Scholar)
Min-wise hashing (aka minhash) [Broder ’97]
• Pick a random hash function h : U → [n^10] and define: minhash_h(Si) = arg min_{x ∈ Si} h(x)
• Pr[minhash_h(Si) = minhash_h(Sj)] ≈ |Si ∩ Sj| / |Si ∪ Sj|
• Repeat s times to get a sample of size s. Advantages:
- Coordinated samples without storing a set U′.
- Storage requirement is fixed.
Minhash estimation [Broder ’97]
• Pick random hash functions h_t : U → [n^10], for t = 1, …, s.
• Create sketch vectors v(Si), where v(Si)_t = minhash_{h_t}(Si).
• Estimator: X = (1/s) Σ_t 1[v(Si)_t = v(Sj)_t]
• E[X] ≈ |Si ∩ Sj| / |Si ∪ Sj| = J(Si, Sj)
• Var[X] = J(1 − J)/s
[Figure: the sketch vectors v(Si) and v(Sj) are compared coordinate by coordinate.]
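The sketch-and-estimate pipeline above can be simulated in a few lines of Python. A toy sketch, where the s hash functions h_t are stand-ins implemented as random lookup tables over a small universe:

```python
import random

random.seed(42)

# Minhash sketching and the estimator X: each "hash function" h_t is a table of
# independent random values, and v(S)_t is the element of S minimizing h_t.
s = 400
U = range(1000)
hashes = [{x: random.random() for x in U} for _ in range(s)]   # stand-ins for h_t

def sketch(S):
    # v(S)_t = minhash_{h_t}(S) = argmin_{x in S} h_t(x)
    return [min(S, key=h.__getitem__) for h in hashes]

Si = set(range(0, 600))
Sj = set(range(300, 900))   # J(Si, Sj) = 300/900 = 1/3

vi, vj = sketch(Si), sketch(Sj)
X = sum(a == b for a, b in zip(vi, vj)) / s   # fraction of agreeing coordinates
print(X)  # ≈ 1/3, with standard deviation sqrt(J(1-J)/s) ≈ 0.024
```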
1-bit minhash [Li and König ’09]
• Idea: Compress the vector v(Si) ∈ U^s to v′(Si) ∈ {0,1}^s.
• Use hash functions g_t : U → {0,1}, and define: v′(Si)_t = g_t(v(Si)_t)
• Estimator for Jaccard similarity: X′ = (2/s)(Σ_t 1[v′(Si)_t = v′(Sj)_t]) − 1
• Var[X′] = (1 + J)(1 − J)/s, a factor (1 + J)/J larger than for minhash.
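The same toy setup extends to 1-bit minhash; here both the h_t and the g_t are our stand-in random tables:

```python
import random

random.seed(7)

# 1-bit minhash: each minhash coordinate is compressed to a single bit via
# g_t : U -> {0,1}; the estimator X' corrects for agreements of unequal minhashes,
# which happen with probability 1/2.
s = 500
U = range(1000)
hashes = [{x: random.random() for x in U} for _ in range(s)]       # h_t
bits = [{x: random.getrandbits(1) for x in U} for _ in range(s)]   # g_t

def one_bit_sketch(S):
    # v'(S)_t = g_t(argmin_{x in S} h_t(x))
    return [g[min(S, key=h.__getitem__)] for h, g in zip(hashes, bits)]

Si = set(range(0, 600))
Sj = set(range(300, 900))   # J = 1/3

bi, bj = one_bit_sketch(Si), one_bit_sketch(Sj)
agree = sum(a == b for a, b in zip(bi, bj)) / s
X1 = 2 * agree - 1   # E[X'] = J, Var[X'] = (1+J)(1-J)/s
print(X1)  # ≈ 1/3
```

The sketch is s bits instead of s universe elements; the price is the (1 + J)/J variance blow-up from the slide.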
Optimality of 1-bit minhash
• [P.-Stöckel-Woodruff ’14]: The variance of any estimator for Jaccard similarity based on s-bit summaries must be Ω_ε(1/s) for J ∈ (ε, 1 − ε).
• What happens when J is close to zero or one?
- Not much seems to be known about J ≈ 0.
- Experiments in [Li and König ’09] suggest that using b-bit minwise hashing is better for J ≈ 0.
Lower variance for low similarities [Christiani ’18]
(Assume for simplicity that all sets have size w.)
• Choose I_t ⊆ U, where Pr[k ∈ I_t] = p independently.
• The parameter p is chosen s.t. Pr[S ∩ I_t = ∅] = 1/2.
• Define v″(S)_t = 1[S ∩ I_t ≠ ∅] (“CP hash”).
• Variance improves by a factor of almost 2 for small J.
[Plot: normalized Hamming distance (Hamming distance / s) vs. |Si ∩ Sj|/w, comparing CP hash, 1-bit minwise, and 1-bit CP.]
Lower variance for high similarities [Mitzenmacher, P., Pham ’14]
• Start with minhash v(Si) ∈ U^(αs).
• 1-bit minhash: v′(Si)_t = g_t(v(Si)_t)
• Alternative binarization, “odd sketch”: Use a hash function g : U → {1, …, s} and define v″(S)_t = Σ_j 1[g(v(S)_j) = t] mod 2.
• Can estimate |Si △ Sj| = |Si \ Sj| + |Sj \ Si| from v″(Si) ⊕ v″(Sj), with error proportional to 1 − J(Si, Sj).
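A toy simulation of the odd sketch. For simplicity it is applied here directly to the sets rather than to their minhash vectors; g is our stand-in random table, and the estimator inverting the expected XOR weight follows [Mitzenmacher-P.-Pham ’14]:

```python
import math
import random

random.seed(5)

# Odd sketch: an s-bit vector storing, per bucket, the parity of the number of
# elements hashing to that bucket. XORing two sketches yields the odd sketch of
# the symmetric difference, whose size can then be estimated.
s = 1024
U = range(1000)
g = {x: random.randrange(s) for x in U}   # stand-in for g : U -> {1, ..., s}

def odd_sketch(S):
    v = [0] * s
    for x in S:
        v[g[x]] ^= 1   # flip the parity of bucket g(x)
    return v

Si = set(range(0, 120))
Sj = set(range(100, 220))   # |Si △ Sj| = 200

k = sum(a ^ b for a, b in zip(odd_sketch(Si), odd_sketch(Sj)))  # ones in the XOR
est = -s / 2 * math.log(1 - 2 * k / s)   # inverts E[k] = (s/2)(1 - (1 - 2/s)^d)
print(est)  # ≈ 200 = |Si △ Sj|
```

When the symmetric difference is small relative to s (the high-similarity regime of the slide), k ≈ |Si △ Sj| and the estimate is very accurate.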
Outline recap: next up is (4), good news for similarity search.
Minhash for searching [Indyk & Motwani ’98]
• Fix reals 1 > j1 > j2 > 0.
• Query: Given Q ⊆ U, find i such that J(Q, Si) ≥ j1, assuming that for all j ≠ i we have J(Q, Sj) ≤ j2.
• Data structure: Choose s such that j2^s ≈ 1/n. For each set Si, store v(Si) in a hash table, with a pointer to Si.
• Query: Look up v(Q) in the hash table, inspect the linked set(s).
• Analysis:
- Expected number of matching sets: E[Σ_i 1[v(Q) = v(Si)]] ≤ 2.
- Success probability j1^s ≈ n^(−log(j1)/log(j2)); repeat until success.
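A toy implementation of this scheme. The parameters s and R and the planted instance are our choices; building R independent tables up front plays the role of "repeat until success":

```python
import random

random.seed(9)

# Each table keys every set by s concatenated minhashes; a query looks up its own
# key and verifies any colliding candidates against the similarity threshold.
n, w, s, R = 100, 50, 4, 50
U = list(range(5000))
sets = [set(random.sample(U, w)) for _ in range(n)]

# Planted query: shares 40 of 50 elements with sets[0], so J(Q, S_0) = 40/60.
others = [x for x in U if x not in sets[0]]
Q = set(random.sample(sorted(sets[0]), 40)) | set(random.sample(others, 10))

def make_table():
    hs = [{x: random.random() for x in U} for _ in range(s)]
    key = lambda S: tuple(min(S, key=h.__getitem__) for h in hs)
    table = {}
    for idx, S in enumerate(sets):
        table.setdefault(key(S), []).append(idx)
    return key, table

tables = [make_table() for _ in range(R)]

def query(Q, j1=0.5):
    for key, table in tables:
        for idx in table.get(key(Q), []):
            if len(Q & sets[idx]) / len(Q | sets[idx]) >= j1:  # verify candidate
                return idx
    return None

print(query(Q))  # finds the planted near neighbour, index 0
```

Each table succeeds with probability about j1^s = (2/3)^4 ≈ 0.2, so 50 repetitions make failure very unlikely, matching the "repeat until success" analysis above.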
Is minhash search optimal?
Can we hope to beat O(n^(log(j1)/log(j2)))?
• [Christiani-P. ’17], [Ahle ’19]: Improvement of the exponent is possible!
• [Chen-Williams ’19], [Stausholm-P.-Thorup ’19]: Assuming the Strong Exponential Time Hypothesis, time n^(1−Ω(1)) requires that log(j1)/log(j2) < 1 − Ω(1).
(Assume for simplicity that all sets have size w.)
ChosenPath algorithm [Christiani-P. ’17]
(Start from the full collection X = {Si | i = 1, …, n}.)
• Choose I ⊆ U, where Pr[k ∈ I] = (1 + j1)/(2 j1 w).
• Create recursive data structures for the sets Xk = {Si | k ∈ Si}, for k ∈ I, until recursion depth ⌈log(n)/log((1 + j2)/(2 j2))⌉.
• Queries: For each k ∈ Q, recurse in subtree Xk (if it exists); perform exhaustive search at the leaves.
ChosenPath analysis
• Suppose |Si ∩ Q| / |Si ∪ Q| ≥ j1. Then the set of “good” recursive calls, k ∈ I ∩ Q ∩ Si, has expected size at least 1.
• In branching-process terminology: the expected number of offspring is at least 1 at each level of the recursion.
• The theory of branching processes [Agresti ’74] implies success probability 1/(λ + 1) at level λ.
• Repeat λ times for constant success probability.
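The branching-process claim can be checked by simulation. For a critical Galton-Watson process with Geometric offspring (P(k) = 2^(−(k+1)), mean exactly 1), the survival probability at level λ is exactly 1/(λ + 1); the sketch below estimates it empirically:

```python
import random

random.seed(11)

def offspring():
    # Geometric offspring count: P(k) = 2^-(k+1) for k = 0, 1, 2, ...; mean 1.
    k = 0
    while random.getrandbits(1):
        k += 1
    return k

def survives(levels):
    # Run the branching process for the given number of levels,
    # reporting whether the population is still alive at the end.
    z = 1
    for _ in range(levels):
        z = sum(offspring() for _ in range(z))
        if z == 0:
            return False
    return True

lam = 9
trials = 20_000
p_hat = sum(survives(lam) for _ in range(trials)) / trials
print(p_hat)  # ≈ 1/(lam + 1) = 0.1
```

This mirrors the analysis: a single recursion path dies out with high probability, but repeating λ times lifts the success probability to a constant.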
• [Ahle ’19] combines ChosenPath with an idea of “supermajorities”, inspired by angular LSH, to get improved results for asymmetric sets, |Q| ≠ |Si|.
Partial match
• A special case is “partial match” queries, |Q| = j1 |Si|.
Supermajorities for partial match
• Consider the case where minhash leads to search time ≈ n.
Beyond set similarity
In many research communities: Hashing = mapping to {0,1}^s.
Some open problems
1. Is there a single sketch that is simultaneously space/variance optimal for low and high Jaccard similarity?
2. Known s-bit sketches and estimators for Jaccard similarity are symmetric. Can asymmetry improve precision?
3. How many bits are needed to estimate Jaccard similarity up to a factor 1 + ε when J → 0?
Similarity estimation
bit.ly/2T3laP0
More open problems
4. We wish to choose h from an explicit family of functions such that Pr[minhash_h(Si) = minhash_h(Sj)] = (1 ± ε) |Si ∩ Sj| / |Si ∪ Sj|. Is there an explicit such family of size O(poly(1/ε) log|U|)?
5. Similarity search in Euclidean/Hamming space can be made faster using data-dependent LSH. What kind of speedup can be achieved for set similarity (maybe via embedding)?
6. Is the performance of Ahle’s supermajorities algorithm the best possible for LSH-based partial match?
Similarity search
bit.ly/2T3laP0
That’s not all, Folks!
Timothy Chan, Saladi Rahul and Jie Xue. Range closest-pair search in higher dimensions
Boris Aronov, Omrit Filtser, Michael Horton, Matthew Katz and Khadijeh Sheikhan. Efficient Nearest-Neighbor Query and Clustering of Planar Curves
Timothy M. Chan, Yakov Nekrich and Michiel Smid. Orthogonal Range Reporting and Rectangle Stabbing for Fat Rectangles
Matteo Ceccarello, Anne Driemel and Francesco Silvestri. FRESH: Fréchet Similarity with Hashing
…