TRANSCRIPT
Rasmus Pagh
IT University of Copenhagen · Google Research · BARC
WADS, Edmonton, August 5, 2019
Scalable Similarity Search
Set Similarity – a Survey
Before we start…
Let’s consider three internet technologies launched around 20 years ago
Recommendations
Advanced search
Wildcard operator: “Edm?nt?n map”
Before we start…
What happened to wildcard search and to boolean expressions?
It’s all about sets:
• Recommendations: the set of shopping carts containing “Canadian Train Ride” vs. the set of shopping carts containing “Trans-American Train Ride”.
• Wildcard search: “Edmonton” as a set of (position, character) pairs: (1,E) (2,d) (3,m) (4,o) (5,n) (6,t) (7,o) (8,n).
• Boolean search: the sets of web pages containing “ballroom”, “dance”, and “salsa”.
Setting of this talk
• We are given a collection of sets S1, …, Sn ⊆ U that we are allowed to preprocess.
• Seek answers to queries such as:
- Given i, j, what is the size of Si ∪ Sj? Of Si ∩ Sj?
- Given a set Q, is there an i such that Q ⊆ Si? Such that Q ⊇ Si?
- Given a set Q and an integer t, is there an i such that |Q ∩ Si| ≥ t?
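All of these queries have obvious brute-force answers over the raw collection, which is the baseline the data structures in this talk try to beat. A minimal Python sketch on toy data (the collection and helper names are ours, for illustration only):

```python
# Brute-force baselines for the three query types on a toy collection S_1, ..., S_n.
sets = [{1, 2, 3}, {2, 3, 4, 5}, {5, 6}]

def union_intersection_size(i, j):
    # Given i, j: sizes of S_i ∪ S_j and S_i ∩ S_j.
    return len(sets[i] | sets[j]), len(sets[i] & sets[j])

def superset_exists(Q):
    # Given Q: is there an i with Q ⊆ S_i?
    return any(Q <= S for S in sets)

def overlap_exists(Q, t):
    # Given Q and integer t: is there an i with |Q ∩ S_i| ≥ t?
    return any(len(Q & S) >= t for S in sets)

print(union_intersection_size(0, 1))  # (5, 2)
print(superset_exists({2, 3}))        # True
print(overlap_exists({1, 4, 6}, 2))   # False
```

Each query costs time linear in the total size of the collection, which is exactly what the results below show is hard to improve in the worst case.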
Talk outline
Bad news: (1) similarity computation, (2) similarity search.
Good news: (3) similarity computation, (4) similarity search.
Bad news
• Query: Given i, j, what is the size of Si ∩ Sj?
• [Pǎtraşcu ’10], [Kopelowitz et al. ’14]:
- Assume we can preprocess sets S1, …, Sn ⊆ [n], each of size at most n, in time O(n^1.99) such that it is possible to determine whether Si ∩ Sj = ∅ in time O(n^0.49).
- Then integer 3SUM can be solved in time O(n^1.991).
Suggests polylog(n) query time is not possible without essentially precomputing all answers.
Similarity computation
Bad news
• Given a set Q, is there an i such that Q ⊆ Si?
• [Williams ’04], [Alman & Williams ’15]:
- Assume we can preprocess sets S1, …, Sn ⊆ [n^0.01] in time poly(n) such that it is possible to determine whether ∃i : Q ⊆ Si in time O(n^0.99).
- Then there is a c < 2 such that k-SAT with n variables can be solved in time c^n.
Under the Strong Exponential Time Hypothesis, this is not possible!
Similarity search
The good news…
• We can now explain why nearly no progress on basic set processing problems has been made since the 1970s.
• More constructively, it justifies looking at c-approximate versions of these problems:
- Given i, j, what is the approximate size of Si ∪ Sj and Si ∩ Sj, up to a multiplicative error c > 1?
- Given a set Q and an integer t, is there an i such that |Q ∩ Si| ≥ t, or is |Q ∩ Si| ≤ t/c for all i?
Outline recap: next up is (3), good news for similarity computation.
Similarity estimation attempt 2: Coordinated sampling [Brewer et al. ’72]
• Sample U′ ⊆ U, including each x ∈ U independently with probability α, and let S′i = Si ∩ U′.
• Observe that μ = E[|S′i ∩ S′j|] = α|Si ∩ Sj|, and by Chernoff bounds |S′i ∩ S′j| ≈ μ with probability 1 − e^(−Ω(μ)).
• Can estimate |Si ∩ Sj| ≈ |S′i ∩ S′j| / α if the sampling rate α ≫ 1/|Si ∩ Sj|.
• Time to compute the estimate is |S′i| + |S′j| ≈ α(|Si| + |Sj|).
Drawbacks: Need to store the set U′; sample size is variable.
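A quick simulation of coordinated sampling (the universe size, sampling rate, and sets are our toy choices, not from the talk):

```python
import random

random.seed(1)

# Coordinated sampling on a toy universe: one shared sample U' drives the
# subsamples of both sets, so intersections survive sampling consistently.
U = range(100_000)
alpha = 0.1   # sampling rate

U_prime = {x for x in U if random.random() < alpha}   # x in U' w.p. alpha, indep.

Si = set(range(0, 60_000))
Sj = set(range(40_000, 100_000))   # |Si ∩ Sj| = 20_000

Si_p, Sj_p = Si & U_prime, Sj & U_prime   # S'_i = Si ∩ U', S'_j = Sj ∩ U'

# E[|S'_i ∩ S'_j|] = alpha * |Si ∩ Sj|, so dividing by alpha gives the estimate.
est = len(Si_p & Sj_p) / alpha
print(est)  # close to 20_000
```

Note the drawback from the slide is visible here: both parties must agree on (and store or re-generate) the same U′.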
A mystery of alpine flowers
Bulletin de la Société Vaudoise des Sciences Naturelles
Vol. XXXVII, No. 140, 1901
DISTRIBUTION DE LA FLORE ALPINE DANS LE Bassin des Dranses et dans quelques régions voisines
by Dr Paul JACCARD, professor.
I
In a previous memoir¹, the comparison of the alpine flora of three regions (Trient, Bagnes, and Wildhorn) led me to conclude that the richness in species, and above all the proportion of species particular to each of the compared regions, is appreciably proportional to the variety of their biological conditions. To what extent is this conclusion general? That is what I propose to establish in the present memoir, beginning with an apparent exception to the conclusion I have just recalled. It concerns the Grand Saint-Bernard and the Val d’Entremont.
¹ This work continues a memoir published in last year’s Bulletin de la Soc. vaudoise, vol. XXXVI, entitled Contribution au problème de l’immigration de la flore alpine. It reproduces, in expanded form, the two notes that appeared in the Archives des Sciences physiques et naturelles de Genève, vol. X, October 1900 (L’immigration post-glaciaire et la distribution actuelle de la flore alpine dans quelques régions des Alpes) and in the Comptes rendus du Congrès international de botanique de Paris, 1900, pp. 31–38 (Méthode de détermination de la distribution de la flore alpine).
1901–1996: 41 citations (Google Scholar)
1997–2019: ~2800 citations (Google Scholar)
Min-wise hashing (aka minhash) [Broder ’97]
• Pick a random hash function h : U → [n^10] and define: minhash_h(Si) = arg min_{x ∈ Si} h(x)
• Pr[minhash_h(Si) = minhash_h(Sj)] ≈ |Si ∩ Sj| / |Si ∪ Sj|
• Repeat s times to get a sample of size s. Advantages:
- Coordinated samples without storing a set U′.
- Storage requirement is fixed.
Minhash estimation [Broder ’97]
• Pick random hash functions h_t : U → [n^10], for t = 1, …, s.
• Create sketch vectors v(Si), where v(Si)_t = minhash_{h_t}(Si).
• Estimator: X = (1/s) Σ_t 1[v(Si)_t = v(Sj)_t]
• E[X] ≈ |Si ∩ Sj| / |Si ∪ Sj| = J(Si, Sj)
• Var[X] = J(1 − J)/s
[Figure: the sketch vectors v(Si) and v(Sj) are compared coordinate by coordinate.]
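The sketch-and-estimate pipeline above can be simulated in a few lines of Python. A toy sketch, where the s hash functions h_t are stand-ins implemented as random lookup tables over a small universe:

```python
import random

random.seed(42)

# Minhash sketching and the estimator X: each "hash function" h_t is a table of
# independent random values, and v(S)_t is the element of S minimizing h_t.
s = 400
U = range(1000)
hashes = [{x: random.random() for x in U} for _ in range(s)]   # stand-ins for h_t

def sketch(S):
    # v(S)_t = minhash_{h_t}(S) = argmin_{x in S} h_t(x)
    return [min(S, key=h.__getitem__) for h in hashes]

Si = set(range(0, 600))
Sj = set(range(300, 900))   # J(Si, Sj) = 300/900 = 1/3

vi, vj = sketch(Si), sketch(Sj)
X = sum(a == b for a, b in zip(vi, vj)) / s   # fraction of agreeing coordinates
print(X)  # ≈ 1/3, with standard deviation sqrt(J(1-J)/s) ≈ 0.024
```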
1-bit minhash [Li and König ’09]
• Idea: Compress the vector v(Si) ∈ U^s to v′(Si) ∈ {0,1}^s.
• Use hash functions g_t : U → {0,1}, and define: v′(Si)_t = g_t(v(Si)_t)
• Estimator for Jaccard similarity: X′ = (2/s)(Σ_t 1[v′(Si)_t = v′(Sj)_t]) − 1
• Var[X′] = (1 + J)(1 − J)/s, a factor (1 + J)/J larger than for minhash.
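The same toy setup extends to 1-bit minhash; here both the h_t and the g_t are our stand-in random tables:

```python
import random

random.seed(7)

# 1-bit minhash: each minhash coordinate is compressed to a single bit via
# g_t : U -> {0,1}; the estimator X' corrects for agreements of unequal minhashes,
# which happen with probability 1/2.
s = 500
U = range(1000)
hashes = [{x: random.random() for x in U} for _ in range(s)]       # h_t
bits = [{x: random.getrandbits(1) for x in U} for _ in range(s)]   # g_t

def one_bit_sketch(S):
    # v'(S)_t = g_t(argmin_{x in S} h_t(x))
    return [g[min(S, key=h.__getitem__)] for h, g in zip(hashes, bits)]

Si = set(range(0, 600))
Sj = set(range(300, 900))   # J = 1/3

bi, bj = one_bit_sketch(Si), one_bit_sketch(Sj)
agree = sum(a == b for a, b in zip(bi, bj)) / s
X1 = 2 * agree - 1   # E[X'] = J, Var[X'] = (1+J)(1-J)/s
print(X1)  # ≈ 1/3
```

The sketch is s bits instead of s universe elements; the price is the (1 + J)/J variance blow-up from the slide.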
Optimality of 1-bit minhash
• [P.-Stöckel-Woodruff ’14]: The variance of any estimator for Jaccard similarity based on s-bit summaries must be Ω_ε(1/s) for J ∈ (ε, 1 − ε).
• What happens when J is close to zero or one?
- Not much seems to be known about J ≈ 0.
- Experiments in [Li and König ’09] suggest that using b-bit minwise hashing is better for J ≈ 0.
Lower variance for low similarities [Christiani ’18]
(Assume for simplicity that all sets have size w.)
• Choose I_t ⊆ U, where Pr[k ∈ I_t] = p independently.
• The parameter p is chosen s.t. Pr[S ∩ I_t = ∅] = 1/2.
• Define v″(S)_t = 1[S ∩ I_t ≠ ∅] (“CP hash”).
• Variance improves by a factor of almost 2 for small J.
[Plot: normalized Hamming distance (Hamming distance / s) vs. |Si ∩ Sj|/w, comparing CP hash, 1-bit minwise, and 1-bit CP.]
Lower variance for high similarities [Mitzenmacher, P., Pham ’14]
• Start with minhash v(Si) ∈ U^(αs).
• 1-bit minhash: v′(Si)_t = g_t(v(Si)_t)
• Alternative binarization, “odd sketch”: Use a hash function g : U → {1, …, s} and define v″(S)_t = Σ_j 1[g(v(S)_j) = t] mod 2.
• Can estimate |Si △ Sj| = |Si \ Sj| + |Sj \ Si| from v″(Si) ⊕ v″(Sj), with error proportional to 1 − J(Si, Sj).
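A toy simulation of the odd sketch. For simplicity it is applied here directly to the sets rather than to their minhash vectors; g is our stand-in random table, and the estimator inverting the expected XOR weight follows [Mitzenmacher-P.-Pham ’14]:

```python
import math
import random

random.seed(5)

# Odd sketch: an s-bit vector storing, per bucket, the parity of the number of
# elements hashing to that bucket. XORing two sketches yields the odd sketch of
# the symmetric difference, whose size can then be estimated.
s = 1024
U = range(1000)
g = {x: random.randrange(s) for x in U}   # stand-in for g : U -> {1, ..., s}

def odd_sketch(S):
    v = [0] * s
    for x in S:
        v[g[x]] ^= 1   # flip the parity of bucket g(x)
    return v

Si = set(range(0, 120))
Sj = set(range(100, 220))   # |Si △ Sj| = 200

k = sum(a ^ b for a, b in zip(odd_sketch(Si), odd_sketch(Sj)))  # ones in the XOR
est = -s / 2 * math.log(1 - 2 * k / s)   # inverts E[k] = (s/2)(1 - (1 - 2/s)^d)
print(est)  # ≈ 200 = |Si △ Sj|
```

When the symmetric difference is small relative to s (the high-similarity regime of the slide), k ≈ |Si △ Sj| and the estimate is very accurate.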
Outline recap: next up is (4), good news for similarity search.
Minhash for searching [Indyk & Motwani ’98]
• Fix reals 1 > j1 > j2 > 0.
• Query: Given Q ⊆ U, find i such that J(Q, Si) ≥ j1, assuming that for all j ≠ i we have J(Q, Sj) ≤ j2.
• Data structure: Choose s such that j2^s ≈ 1/n. For each set Si, store v(Si) in a hash table, with a pointer to Si.
• Query: Look up v(Q) in the hash table, inspect the linked set(s).
• Analysis:
- Expected number of matching sets: E[Σ_i 1[v(Q) = v(Si)]] ≤ 2.
- Success probability j1^s ≈ n^(−log(j1)/log(j2)); repeat until success.
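A toy implementation of this scheme. The parameters s and R and the planted instance are our choices; building R independent tables up front plays the role of "repeat until success":

```python
import random

random.seed(9)

# Each table keys every set by s concatenated minhashes; a query looks up its own
# key and verifies any colliding candidates against the similarity threshold.
n, w, s, R = 100, 50, 4, 50
U = list(range(5000))
sets = [set(random.sample(U, w)) for _ in range(n)]

# Planted query: shares 40 of 50 elements with sets[0], so J(Q, S_0) = 40/60.
others = [x for x in U if x not in sets[0]]
Q = set(random.sample(sorted(sets[0]), 40)) | set(random.sample(others, 10))

def make_table():
    hs = [{x: random.random() for x in U} for _ in range(s)]
    key = lambda S: tuple(min(S, key=h.__getitem__) for h in hs)
    table = {}
    for idx, S in enumerate(sets):
        table.setdefault(key(S), []).append(idx)
    return key, table

tables = [make_table() for _ in range(R)]

def query(Q, j1=0.5):
    for key, table in tables:
        for idx in table.get(key(Q), []):
            if len(Q & sets[idx]) / len(Q | sets[idx]) >= j1:  # verify candidate
                return idx
    return None

print(query(Q))  # finds the planted near neighbour, index 0
```

Each table succeeds with probability about j1^s = (2/3)^4 ≈ 0.2, so 50 repetitions make failure very unlikely, matching the "repeat until success" analysis above.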
Is minhash search optimal?
Can we hope to beat O(n^(log(j1)/log(j2)))?
• [Christiani-P. ’17], [Ahle ’19]: Improvement of the exponent is possible!
• [Chen-Williams ’19], [Stausholm-P.-Thorup ’19]: Assuming the Strong Exponential Time Hypothesis, time n^(1−Ω(1)) requires that log(j1)/log(j2) < 1 − Ω(1).
(Assume for simplicity that all sets have size w.)
ChosenPath algorithm [Christiani-P. ’17]
(Start from the full collection X = {Si | i = 1, …, n}.)
• Choose I ⊆ U, where Pr[k ∈ I] = (1 + j1)/(2 j1 w).
• Create recursive data structures for the sets Xk = {Si | k ∈ Si}, for k ∈ I, until recursion depth ⌈log(n)/log((1 + j2)/(2 j2))⌉.
• Queries: For each k ∈ Q, recurse in subtree Xk (if it exists); perform exhaustive search at the leaves.
ChosenPath analysis
• Suppose |Si ∩ Q| / |Si ∪ Q| ≥ j1. Then the set of “good” recursive calls, k ∈ I ∩ Q ∩ Si, has expected size at least 1.
• In branching-process terminology: the expected number of offspring is at least 1 at each level of the recursion.
• The theory of branching processes [Agresti ’74] implies success probability 1/(λ + 1) at level λ.
• Repeat λ times for constant success probability.
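The branching-process claim can be checked by simulation. For a critical Galton-Watson process with Geometric offspring (P(k) = 2^(−(k+1)), mean exactly 1), the survival probability at level λ is exactly 1/(λ + 1); the sketch below estimates it empirically:

```python
import random

random.seed(11)

def offspring():
    # Geometric offspring count: P(k) = 2^-(k+1) for k = 0, 1, 2, ...; mean 1.
    k = 0
    while random.getrandbits(1):
        k += 1
    return k

def survives(levels):
    # Run the branching process for the given number of levels,
    # reporting whether the population is still alive at the end.
    z = 1
    for _ in range(levels):
        z = sum(offspring() for _ in range(z))
        if z == 0:
            return False
    return True

lam = 9
trials = 20_000
p_hat = sum(survives(lam) for _ in range(trials)) / trials
print(p_hat)  # ≈ 1/(lam + 1) = 0.1
```

This mirrors the analysis: a single recursion path dies out with high probability, but repeating λ times lifts the success probability to a constant.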
• [Ahle ’19] combines ChosenPath with an idea of “supermajorities”, inspired by angular LSH, to get improved results for asymmetric sets, |Q| ≠ |Si|.
Partial match
• A special case is “partial match” queries, |Q| = j1 |Si|.
Supermajorities for partial match
• Consider the case where minhash leads to search time ≈ n.
Beyond set similarity
In many research communities: Hashing = mapping to {0,1}^s.
Some open problems
1. Is there a single sketch that is simultaneously space/variance optimal for low and high Jaccard similarity?
2. Known s-bit sketches and estimators for Jaccard similarity are symmetric. Can asymmetry improve precision?
3. How many bits are needed to estimate Jaccard similarity up to a factor 1 + ε when J → 0?
Similarity estimation
bit.ly/2T3laP0
More open problems
4. We wish to choose h from an explicit family of functions such that Pr[minhash_h(Si) = minhash_h(Sj)] = (1 ± ε) |Si ∩ Sj| / |Si ∪ Sj|. Is there an explicit such family of size O(poly(1/ε) log|U|)?
5. Similarity search in Euclidean/Hamming space can be made faster using data-dependent LSH. What kind of speedup can be achieved for set similarity (maybe via embedding)?
6. Is the performance of Ahle’s supermajorities algorithm the best possible for LSH-based partial match?
Similarity search
bit.ly/2T3laP0
That’s not all, Folks!
Timothy Chan, Saladi Rahul and Jie Xue. Range closest-pair search in higher dimensions
Boris Aronov, Omrit Filtser, Michael Horton, Matthew Katz and Khadijeh Sheikhan. Efficient Nearest-Neighbor Query and Clustering of Planar Curves
Timothy M. Chan, Yakov Nekrich and Michiel Smid. Orthogonal Range Reporting and Rectangle Stabbing for Fat Rectangles
Matteo Ceccarello, Anne Driemel and Francesco Silvestri. FRESH: Fréchet Similarity with Hashing
…