Algorithms for Large Data Sets
Ziv Bar-Yossef
Lecture 11
June 1, 2005
http://www.ee.technion.ac.il/courses/049011
Sketching
Outline
• Syntactic clustering of the web
• Locality sensitive hash functions
• Resemblance and shingling
• Min-wise independent permutations
• The sketching model
• Hamming distance
• Edit distance
Motivation: Near-Duplicate Elimination
• Many web pages are duplicates or near-duplicates of other pages
  • Mirror sites
  • FAQs, manuals, legal documents
  • Different versions of the same document
  • Plagiarism
• Duplicates are bad for search engines
  • Increase index size
  • Harm quality of search results
• Question: How to efficiently process the repository of crawled pages and eliminate (near-)duplicates?
Syntactic Clustering of the Web [Broder, Glassman, Manasse, Zweig 97]
• $U$: space of all possible documents
• $S \subseteq U$: collection of documents
• $\mathrm{sim}: U \times U \to [0,1]$: a similarity measure among documents
  • If p,q are very similar, sim(p,q) is close to 1
  • If p,q are very dissimilar, sim(p,q) is close to 0
  • Usually: sim(p,q) = 1 - d(p,q), where d(p,q) is a normalized distance between p and q.
• G: a graph on S:
  • p,q are connected by an edge iff $\mathrm{sim}(p,q) \ge t$ (t = threshold)
• Goal: find the connected components of G
Challenges
• S is huge
  • The web has 10 billion pages
• Documents are not compressed
  • Many disks are needed to store S
  • Each sim computation is costly
• Documents in S should be processed in a stream
  • Main memory is tiny relative to |S|
  • Cannot afford more than O(|S|) time
• How to create the graph G?
  • Naively, requires |S| passes and $|S|^2$ similarity computations
Sketching Schemes
• T = a small set (|S| < |T| << |U|)
• A sketching scheme for sim:
  • Compression function: a randomized mapping $\phi: U \to T$
  • Reconstruction function: $\psi: T \times T \to [0,1]$
  • For every pair p,q, with high probability $\psi(\phi(p),\phi(q)) \approx \mathrm{sim}(p,q)$
Syntactic Clustering by Sketching
1. P ← empty table of size |S|
2. G ← empty graph on |S| nodes
3. for i = 1,…,|S|
4.     read document $p_i$ from the stream
5.     P[i] ← $\phi(p_i)$
6. for i = 1,…,|S|
7.     for j = 1,…,|S|
8.         if ($\psi(P[i],P[j]) \ge t$)
9.             add edge (i,j) to G
10. output connected components of G
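The steps above can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: `cluster_by_sketch`, `toy_phi` and `toy_psi` are hypothetical names, the compression and reconstruction functions are passed in as parameters, and the toy pair used here (word sets with Jaccard similarity) is only a stand-in for a real, much smaller sketch. Connected components are maintained with union-find instead of building G explicitly, and each unordered pair is compared once on the assumption that the reconstruction function is symmetric.

```python
def cluster_by_sketch(docs, phi, psi, t):
    """Return the connected components (lists of document indices) of the
    graph in which i and j are joined whenever psi(phi(p_i), phi(p_j)) >= t."""
    sketches = [phi(d) for d in docs]      # steps 3-5: one pass over the stream
    parent = list(range(len(docs)))        # union-find forest over the nodes

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    n = len(docs)
    for i in range(n):                     # steps 6-9: quadratic number of psi calls
        for j in range(i + 1, n):
            if psi(sketches[i], sketches[j]) >= t:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):                     # step 10: collect the components
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy stand-ins (assumptions for illustration only).
def toy_phi(doc):
    return frozenset(doc.lower().split())


def toy_psi(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0
```

Calling `cluster_by_sketch(["a b c d", "a b c e", "x y z"], toy_phi, toy_psi, 0.5)` groups the first two documents together and leaves the third on its own.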
Analysis
• Can compute sketches in one pass
• Table P can be stored in a single file on a single machine
• Creating G requires $|S|^2$ applications of the reconstruction function $\psi$
  • Easier than full-fledged computations of sim
  • Quadratic time is still a problem
• Connected components algorithm is heavy but feasible
Locality Sensitive Hashing (LSH) [Indyk, Motwani, 98]
• A special kind of sketching scheme
• $H = \{ h \mid h: U \to T \}$: a family of hash functions
• H is locality sensitive w.r.t. sim if for all $p,q \in U$, $\Pr[h(p) = h(q)] = \mathrm{sim}(p,q)$.
  • Probability is over the random choice of h from H
  • Probability of collision = similarity between p and q
Syntactic Clustering by LSH
1. P ← empty table of size |S|
2. G ← empty graph on |S| nodes
3. for i = 1,…,|S|
4.     read document $p_i$ from the stream
5.     P[i] ← $h(p_i)$
6. sort P and group by value
7. output groups
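A minimal Python sketch of the sort-and-group steps above, assuming the hash function is passed in as a parameter and that its values can be ordered. `cluster_by_lsh` is a hypothetical name, and the hash used in the usage note is a toy chosen only to make the grouping visible; it is not locality sensitive.

```python
from itertools import groupby


def cluster_by_lsh(docs, h):
    """Group document indices by hash value: one pass to hash, then sort and group."""
    table = sorted((h(d), i) for i, d in enumerate(docs))      # steps 3-6
    return [[i for _, i in group]                              # step 7
            for _, group in groupby(table, key=lambda entry: entry[0])]
```

For example, with the toy hash `lambda d: d.split()[0].lower()`, the documents `"Rose is red"` and `"rose is a rose"` land in the same group, while `"Violets are blue"` gets a group of its own.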
Analysis
• Can compute hash values in one pass
• Table P can be stored in a single file on a single machine
• Sorting and grouping takes O(|S| log |S|) simple comparisons
• Each group A consists of pages whose hash value is the same
  • By the LSH property, they are likely to be similar to each other
Shingling and Resemblance [Broder et al 97]
• tokens: words, numbers, HTML tags, etc.
• tokenization(p): sequence of tokens produced from document p
• w: a small integer
• $S_w(p)$ = w-shingling of p = set of all distinct contiguous subsequences of tokenization(p) of length w.
  • Ex: p = “a rose is a rose is a rose”, w = 4
  • $S_w(p)$ = { (a rose is a), (rose is a rose), (is a rose is) }
• $\mathrm{resemblance}_w(p,q) = \dfrac{|S_w(p) \cap S_w(q)|}{|S_w(p) \cup S_w(q)|}$
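The definitions above translate directly into a few lines of Python. This is a sketch under one assumption: tokenization is plain whitespace splitting, whereas the slide allows richer tokens (numbers, HTML tags). The function names are illustrative.

```python
def shingles(text, w):
    """w-shingling: the set of distinct contiguous token subsequences of length w."""
    tokens = text.split()                  # assumed tokenization: whitespace split
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(p, q, w):
    """|S_w(p) ∩ S_w(q)| / |S_w(p) ∪ S_w(q)|; two empty shinglings count as identical."""
    a, b = shingles(p, w), shingles(q, w)
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

On the slide's example, `shingles("a rose is a rose is a rose", 4)` contains exactly the three shingles listed above.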
LSH for Resemblance
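• $\mathrm{resemblance}_w(p,q) = \dfrac{|S_w(p) \cap S_w(q)|}{|S_w(p) \cup S_w(q)|}$
• $\pi$ = a random permutation on $\Sigma^w$ ($\Sigma$ = the set of tokens)
  • $\pi$ induces a random order on all length-w sequences of tokens
  • $\pi$ also induces a random order on any subset $X \subseteq \Sigma^w$
  • For each such subset and for each $x \in X$, $\Pr(\min(\pi(X)) = \pi(x)) = 1/|X|$
• LSH for resemblance: $h(p) = \min(\pi(S_w(p)))$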
LSH for Resemblance (cont.)
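Lemma: $\Pr_\pi[h(p) = h(q)] = \mathrm{resemblance}_w(p,q)$
Proof: Let $X = S_w(p) \cup S_w(q)$. Then h(p) = h(q) iff the minimum of $\pi(X)$ is attained at some $x \in S_w(p) \cap S_w(q)$. Each $x \in X$ attains the minimum with probability 1/|X|, so $\Pr[h(p) = h(q)] = |S_w(p) \cap S_w(q)| / |X| = \mathrm{resemblance}_w(p,q)$.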
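The collision probability of the min-hash can be checked empirically. The sketch below is an illustration under stated assumptions: the two shingle sets are given directly as non-empty Python sets of comparable items, and a truly random permutation is drawn only on their union, which suffices because only the relative order of those elements affects the two minima. `minhash_collision_rate` is a hypothetical name.

```python
import random


def minhash_collision_rate(A, B, trials=4000, seed=0):
    """Fraction of random permutations under which A and B have the same minimum."""
    rng = random.Random(seed)
    universe = sorted(A | B)
    hits = 0
    for _ in range(trials):
        order = universe[:]
        rng.shuffle(order)                               # a uniformly random permutation
        rank = {x: r for r, x in enumerate(order)}       # rank[x] plays the role of pi(x)
        if min(A, key=rank.__getitem__) == min(B, key=rank.__getitem__):
            hits += 1
    return hits / trials
```

For `A = set(range(0, 8))` and `B = set(range(4, 12))` the resemblance is 4/12, and the measured collision rate should come out close to 1/3.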
Min-Wise Independent Permutations [Broder, Charikar, Frieze, Mitzenmacher, 98]
• Usual problem: storing $\pi$ takes too much space
  • $O(|\Sigma|^w \log |\Sigma|^w)$ bits to represent $\pi$
• Use small families of permutations
• A family $\Pi = \{ \pi \mid \pi \text{ is a permutation on } \Sigma^w \}$ is min-wise independent if, for all subsets $X \subseteq \Sigma^w$ and for all $x \in X$, with $\pi$ chosen uniformly from $\Pi$,
  $\Pr(\min(\pi(X)) = \pi(x)) = 1/|X|$
• Explicit constructions of small families of “approximately” min-wise independent permutations [Indyk 98]
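To get a feel for what a small family looks like, the sketch below examines linear permutations v → (a·v + b) mod p on {0,…,p−1} for a prime p. This family is chosen here purely for illustration and is not the construction cited on the slide; it has only p(p−1) members instead of p!. The code computes the exact probability that a given element attains the minimum. For two-element sets the probability is exactly 1/2; for larger sets it may deviate from 1/|X|, which is why only approximate min-wise independence is claimed for small families.

```python
from fractions import Fraction


def min_probability(p, X, x):
    """Exact Pr[pi(x) = min pi(X)] over pi(v) = (a*v + b) mod p,
    with a in 1..p-1 and b in 0..p-1 chosen uniformly (p assumed prime)."""
    hits = total = 0
    for a in range(1, p):
        for b in range(p):
            total += 1
            images = {v: (a * v + b) % p for v in X}
            if images[x] == min(images.values()):
                hits += 1
    return Fraction(hits, total)


if __name__ == "__main__":
    for x in (0, 1, 3):
        print(x, min_probability(7, [0, 1, 3], x))   # compare with the ideal value 1/3
```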
The Sketching Model
• Alice holds x, Bob holds y; both have access to shared randomness
• Alice sends the sketch $\phi(x)$ and Bob sends the sketch $\phi(y)$ to the Referee
• The Referee solves the k vs. r Gap Problem (approximation):
  • Promise: d(x,y) ≤ k or d(x,y) ≥ r
  • Goal: decide which of the two holds.
Applications
• Large data sets
  • Clustering
  • Nearest Neighbor schemes
  • Data streams
• Management of Files over the Network
  • Differential backup
  • Synchronization
• Theory
  • Low distortion embeddings
  • Simultaneous messages communication complexity
Known Sketching Schemes
• Resemblance [Broder, Glassman, Manasse, Zweig 97], [Broder, Charikar, Frieze, Mitzenmacher 98]
• Hamming distance [Kushilevitz, Ostrovsky, Rabani 98], [Indyk, Motwani 98], [Feigenbaum, Ishai, Malkin, Nissim, Strauss, Wright 01]
• Cosine similarity [Charikar 02]
• Earth mover distance [Charikar 02]
• Edit distance [Bar-Yossef, Jayram, Krauthgamer, Kumar 04]
Sketching Algorithm for Hamming Distance [Kushilevitz, Ostrovsky, Rabani 98]
• x,y: binary strings of length n
• HD(x,y) = # of positions in which x,y differ
  • $HD(x,y) = |\{ i \mid x_i \ne y_i \}|$
  • Ex: x = 10101, y = 01010, HD(x,y) = 5
• Goal:
  • If HD(x,y) ≤ k, output “accept” w.p. $\ge 1 - \delta$
  • If HD(x,y) ≥ 2k, output “reject” w.p. $\ge 1 - \delta$
• KOR algorithm: $O(\log(1/\delta))$ size sketch.
The KOR Algorithm
• Shared randomness: n i.i.d. random bits $r_1,\ldots,r_n$, where $\Pr(r_i = 1) = 1/(2k)$
• Basic sketch: $h(x) = (\sum_i x_i r_i) \bmod 2$
• Full sketch: $\phi(x) = (h_1(x),\ldots,h_t(x))$
  • $t = O(\log(1/\delta))$
  • $h_1,\ldots,h_t$ are generated independently like h
• Reconstruction:
  1. for j = 1,…,t do
  2.     if ($h_j(x) = h_j(y)$) then
  3.         $z_j$ ← 1
  4.     else
  5.         $z_j$ ← 0
  6. if avg($z_1,\ldots,z_t$) > 11/18 output “accept”, else output “reject”
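A minimal Python sketch of the algorithm above, under a few assumptions: strings are lists of 0/1 integers, the shared randomness is modeled by a seeded pseudo-random generator that both parties could reproduce, and t is supplied by the caller rather than derived from δ. The function names are illustrative.

```python
import random


def shared_masks(n, k, t, seed=0):
    """t independent rows of n bits; each bit is 1 with probability 1/(2k)."""
    rng = random.Random(seed)              # the seed stands in for shared randomness
    p = 1 / (2 * k)
    return [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(t)]


def sketch(x, masks):
    """Full sketch: one parity bit h_j(x) = (sum_i x_i r_i) mod 2 per row."""
    return [sum(xi & ri for xi, ri in zip(x, row)) % 2 for row in masks]


def referee(sx, sy, threshold=11 / 18):
    """Accept iff the fraction of agreeing sketch bits exceeds the threshold."""
    z = sum(a == b for a, b in zip(sx, sy)) / len(sx)
    return "accept" if z > threshold else "reject"


if __name__ == "__main__":
    n, k, t = 400, 10, 600
    masks = shared_masks(n, k, t)
    x = [i % 2 for i in range(n)]
    y_far = [1 - b for b in x]             # Hamming distance n, far above 2k
    print(referee(sketch(x, masks), sketch(x, masks)))
    print(referee(sketch(x, masks), sketch(y_far, masks)))
```

Each party only needs the masks and its own string, so the referee sees t bits per string regardless of n.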
KOR: Analysis
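$\Pr[h(x) = h(y)] = \Pr\left[\sum_i x_i r_i \equiv \sum_i y_i r_i \pmod 2\right] = \Pr\left[\sum_{i:\, x_i \ne y_i} r_i \equiv 0 \pmod 2\right]$
• Note: # of terms in the sum = HD(x,y)
• Given HD(x,y) independent random bits, each with probability 1/(2k) to be 1, what is the probability that their parity is 0?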
KOR: Analysis (cont.)
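• $r_1,\ldots,r_m$: m independent random bits
• For each j, $\Pr(r_j = 1) = \epsilon$
• What is $\Pr[\sum_j r_j \equiv 0 \pmod 2]$?
• Can view the distribution of each bit as a mixture of two distributions:
  • Dist A (with probability $1 - 2\epsilon$): the bit 0 w.p. 1
  • Dist B (with probability $2\epsilon$): a uniformly chosen bit
• Note:
  • If all bits “choose” Dist A, then the parity is 0 w.p. 1
  • If at least one of the m bits “chooses” Dist B, then the parity is 0 w.p. ½
• Hence, $\Pr[\sum_j r_j \equiv 0 \pmod 2] = (1-2\epsilon)^m + \tfrac12\left(1 - (1-2\epsilon)^m\right) = \tfrac12 + \tfrac12 (1-2\epsilon)^m$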
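The mixture argument yields the closed form 1/2 + (1/2)(1 − 2ε)^m for the probability of even parity. As a sanity check, the sketch below compares that closed form with a brute-force sum over all 2^m bit patterns; it is only feasible for small m, and the function names are illustrative.

```python
from itertools import product


def parity_zero_prob(m, eps):
    """Exact probability that m independent bits, each 1 w.p. eps, have even parity."""
    total = 0.0
    for bits in product((0, 1), repeat=m):
        if sum(bits) % 2 == 0:
            prob = 1.0
            for b in bits:
                prob *= eps if b else 1 - eps
            total += prob
    return total


def closed_form(m, eps):
    """The formula obtained from the mixture argument."""
    return 0.5 + 0.5 * (1 - 2 * eps) ** m
```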
KOR Analysis (cont.)
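$\Pr[h(x) = h(y)] = \tfrac12 + \tfrac12 \left(1 - \tfrac1k\right)^{HD(x,y)}$ (the parity probability with $\epsilon = 1/(2k)$ and m = HD(x,y))
• Therefore, using $(1 - 1/k)^k \approx 1/e$:
  • If HD(x,y) ≤ k, then $\Pr[h(x) = h(y)] \ge \tfrac12 + \tfrac12\left(1 - \tfrac1k\right)^k \approx \tfrac12 + \tfrac1{2e} \approx 0.68 \ge 4/6 = 12/18$ (for sufficiently large k)
  • If HD(x,y) ≥ 2k, then $\Pr[h(x) = h(y)] \le \tfrac12 + \tfrac12\left(1 - \tfrac1k\right)^{2k} \le \tfrac12 + \tfrac1{2e^2} \approx 0.57$, which is roughly 10/18 and below 11/18
• Define: $Z = \mathrm{avg}(z_1,\ldots,z_t)$, so that $E[Z] = \Pr[h(x) = h(y)]$
  • If HD(x,y) ≤ k, then E[Z] ≥ 12/18
  • If HD(x,y) ≥ 2k, then E[Z] ≤ 0.57 ≈ 10/18
• By Chernoff, $t = O(\log(1/\delta))$ is enough to guarantee:
  • If HD(x,y) ≤ k, then Z > 11/18 w.p. $\ge 1 - \delta$
  • If HD(x,y) ≥ 2k, then Z ≤ 11/18 w.p. $\ge 1 - \delta$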
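The constants in this analysis can be checked numerically. The sketch below assumes the collision probability 1/2 + (1/2)(1 − 1/k)^HD obtained from the parity argument with bias 1/(2k), and evaluates it at the two boundary cases HD = k and HD = 2k. `collision_prob` is an illustrative name.

```python
import math


def collision_prob(hd, k):
    """Pr[h(x) = h(y)] for Hamming distance hd, with bit bias 1/(2k)."""
    return 0.5 + 0.5 * (1 - 1 / k) ** hd


if __name__ == "__main__":
    threshold = 11 / 18
    for k in (2, 6, 10, 100, 1000):
        near = collision_prob(k, k)        # worst case for HD <= k
        far = collision_prob(2 * k, k)     # worst case for HD >= 2k
        print(k, round(near, 4), round(far, 4), far < threshold < near)
```

The 11/18 threshold separates the two cases for every k ≥ 2, although the margin on the near side is smaller for small k than the asymptotic value 1/2 + 1/(2e) suggests.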
Edit Distance
• $x \in \Sigma^n$, $y \in \Sigma^m$
• ED(x,y): minimum number of character insertions, deletions and substitutions that transform x to y.
• Examples:
  • ED(00000, 1111) = 5
  • ED(01010, 10101) = 2
• Applications
  • Genomics
  • Text processing
  • Web search
• For simplicity: m = n, $\Sigma = \{0,1\}$.
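The two examples above can be verified with the textbook dynamic program for edit distance. This is the standard exact computation over full strings, shown only to make the definition concrete; it is not the sketching algorithm discussed later in the lecture.

```python
def edit_distance(x, y):
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    prev = list(range(len(y) + 1))         # distances from the empty prefix of x
    for i, cx in enumerate(x, 1):
        cur = [i]                          # deleting the first i characters of x
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                # delete cx
                           cur[j - 1] + 1,             # insert cy
                           prev[j - 1] + (cx != cy)))  # substitute or match
        prev = cur
    return prev[-1]
```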
Sketching Algorithm for Edit Distance [Bar-Yossef,Jayram,Krauthgamer,Kumar 04]
• x,y: binary strings of length n
• Goal:
  • If ED(x,y) ≤ k, output “accept” w.p. $\ge 1 - \delta$
  • If $ED(x,y) \ge \Omega((kn)^{2/3})$, output “reject” w.p. $\ge 1 - \delta$
• BJKK algorithm: $O(\log(1/\delta))$ size sketch.
Basic Framework
• Underlying Principle: ED(x,y) is small iff x and y share many common substrings at nearby positions.
• $S_x$ = set of pairs of the form $(\gamma, h(i))$
  • $\gamma$: a substring of x
  • h(i): a “locality sensitive” encoding of the substring’s position
• $S_y$ is defined in the same way for y
• ED(x,y) small iff the intersection $S_x \cap S_y$ is large (common substrings at nearby positions)
Basic Framework (cont.)
• ED(x,y) small iff the symmetric difference $S_x \,\Delta\, S_y$ is small
• Need to estimate the size of the symmetric difference
• Hamming distance computation of characteristic vectors
• Use $O(\log(1/\delta))$ size sketches [KOR]
• Reduced Edit Distance to Hamming Distance
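The reduction rests on a simple identity: the size of the symmetric difference of two sets equals the Hamming distance between their characteristic vectors. A small Python illustration follows, with an explicitly enumerated universe as an assumption; in the actual setting the universe of pairs is far too large to write out, which is why sketches are used.

```python
def characteristic_vector(s, universe):
    """0/1 vector with a 1 in every coordinate whose element belongs to s."""
    return [1 if u in s else 0 for u in universe]


def hamming(a, b):
    """Number of coordinates in which two equal-length vectors differ."""
    return sum(ai != bi for ai, bi in zip(a, b))
```

For `A = {1, 2, 3}` and `B = {2, 3, 4, 5}` over the universe 1..6, both `len(A ^ B)` and the Hamming distance of the two vectors equal 3.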
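Encoding Scheme
• Gap: k vs. $O((kn)^{2/3})$
• $B = n^{2/3}/k^{1/3}$, $W = n/B$
• B windows of size W each.
• $S_x = \{ (1,1), (2,1), (3,2), \ldots, (i, \mathrm{win}(i)), \ldots \}$
• $S_y = \{ (1,1), (2,1), (3,2), \ldots, (i, \mathrm{win}(i)), \ldots \}$
  • Here i denotes the i-th substring and win(i) the window it falls in
• [Figure: strings x and y with positions 1 to 14 marked, and substrings 1, 2, 3 of each.]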
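A toy Python sketch of such an encoding, under assumptions that the slides do not spell out: substrings have length B and start at every position (suggested by the "at most kB marked substrings" bound in the analysis), and win(i) is the index of the non-overlapping length-W window containing position i. The names `parameters` and `encode` are hypothetical, and the actual BJKK construction may differ in its details.

```python
def parameters(n, k):
    """B = n^(2/3) / k^(1/3) rounded to an integer, W = n / B (integer division)."""
    B = max(1, round(n ** (2 / 3) / k ** (1 / 3)))
    W = max(1, n // B)
    return B, W


def encode(x, B, W):
    """Set of pairs (substring of length B starting at i, window index of i)."""
    return {(x[i:i + B], i // W) for i in range(len(x) - B + 1)}


if __name__ == "__main__":
    x = "0110100110010110" * 4                       # n = 64
    y = x[:30] + ("1" if x[30] == "0" else "0") + x[31:]
    B, W = parameters(len(x), 1)
    print(B, W, len(encode(x, B, W) ^ encode(y, B, W)))
```

A single substitution touches at most B substrings, so the symmetric difference of the two encodings has size at most 2B.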
Analysis
Case 1: ED(x,y) ≤ k
• [Figure: substring i of x and its companion substring j of y, positions 1 to 14.]
• If i is “unmarked”, it has a matching “companion” j
• $(i, \mathrm{win}(i)) \in S_x \setminus S_y$ only if:
  • either i is “marked”
  • or i is unmarked, but $\mathrm{win}(i) \ne \mathrm{win}(j)$
• At most kB marked substrings
• At most k * n/W = kB companions with mismatched windows
• Therefore, $\mathrm{Ham}(S_x,S_y) \le 4kB$
Analysis (cont.)
Case 2: $\mathrm{Ham}(S_x,S_y) \le 8kB$
• [Figure: strings x and y with positions 1 to 14; labels 1, 2, B-1, B+1, 2B+1.]
• If i has a “companion” j and win(i) = win(j), can align i with j using at most W operations
• Otherwise, substitute first character of i
• At most 8kB substrings of x have no companion
• Therefore, $ED(x,y) \le 8kB + W \cdot n/B = O((kn)^{2/3})$
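The final bound follows from the parameter choice $B = n^{2/3}/k^{1/3}$, $W = n/B$ on the Encoding Scheme slide. Written out, the arithmetic is:

```latex
\begin{align*}
kB &= k \cdot \frac{n^{2/3}}{k^{1/3}} = k^{2/3} n^{2/3} = (kn)^{2/3}, \\
W \cdot \frac{n}{B} &= \frac{n^2}{B^2} = \frac{n^2 k^{2/3}}{n^{4/3}} = k^{2/3} n^{2/3} = (kn)^{2/3}, \\
8kB + W \cdot \frac{n}{B} &= 9\,(kn)^{2/3} = O\!\left((kn)^{2/3}\right).
\end{align*}
```

Both terms are balanced at $(kn)^{2/3}$, which is what this choice of B achieves.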
End of Lecture 11