
Page 1:

Algorithms for Large Data Sets

Ziv Bar-Yossef
Lecture 12

June 18, 2006

http://www.ee.technion.ac.il/courses/049011

Page 2:

Data Streams

Page 3:

Outline

The data stream model
Approximate counting
Distinct elements
Frequency moments

Page 4:

The Data Stream Model

f: A^n → B
A, B are arbitrary sets; n is a positive integer (think of n as large).
Given x ∈ A^n, each entry xi is called an "element".
Typically, A and B are "small" (constant-size) sets.

Goal: given x ∈ A^n, compute f(x).
Frequently, an approximation of f(x) suffices.
Usually, we will use randomization.

Streaming access to the input:
The algorithm reads the input in "sequential passes".
In each pass, x is read in the order x1, x2, …, xn.

Impossible: random access, going backwards.
Possible: storing portions of x (or other functions of x) in memory.

Page 5:

Complexity Measures

Space
Objective: use as little memory as possible.
Note: if we allow unlimited space, the data stream model is the same as the standard RAM model.
Ideally, up to O(log^c n) for some constant c.

Number of passes
Objective: use as few passes as possible.
Ideally, only a single pass.
Usually, no more than a constant number of passes.

Running time
Objective: use as little time as possible.
Ideally, up to O(n log^c n) for some constant c.

Page 6:

Motivation

Types of large data sets:
Pre-stored: kept on magnetic or optical media (tapes, disks, DVDs, …)
Generated on the fly: data feeds, streaming media, packet streams, …

Access to large data sets:
Random access: costly (if the data is pre-stored), infeasible (if the data is generated on the fly)
Streaming access: the only feasible option

Resources:
Memory: the primary bottleneck
Number of passes: a few (if the data is pre-stored), a single pass (if the data is generated on the fly)
Time: cannot be more than quasi-linear

Page 7:

Approximate Counting [Morris 77, Flajolet 85]

Input: a bitstring x ∈ {0,1}^n
Goal: find H = the number of 1's in x

Naïve solution: just count them!
O(log H) bits of space

Can we do better?
Answer 1: No!
Information theory implies an Ω(log H) lower bound.
Answer 2: Yes!
But only approximately: output the closest power of 1+ε to H.
Note: the number of possible outputs is O(log_{1+ε} H) = O((1/ε) log H).
Hence, only O(log log H + log(1/ε)) bits of space suffice.

Page 8:

Approximate Counting (ε = 1)

k ← 0
for i = 1 to n do
  if xi = 1, then with probability 1/2^k, set k ← k + 1
output 2^k - 1

General idea: the expected number of 1's needed to increment k to k + 1 is 2^k.
k = 0 → k = 1: after seeing 1 one
k = 1 → k = 2: after seeing 2 additional 1's
k = 2 → k = 3: after seeing 4 additional 1's
…
k = i-1 → k = i: after seeing 2^(i-1) additional 1's
Therefore, we expect k to become i after seeing 1 + 2 + 4 + … + 2^(i-1) = 2^i - 1 1's.
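
The pseudocode above translates almost directly into a runnable sketch. Below is a minimal Python version of this ε = 1 counter; the function name morris_count and the use of random.random() are illustrative choices, not part of the lecture.

```python
import random

def morris_count(bits):
    """Approximate the number of 1's in a 0/1 stream, storing only the exponent k.

    The returned estimate 2^k - 1 is (per the lemma on Page 10) an unbiased
    estimator of the true count H, using O(log log H) bits of memory.
    """
    k = 0
    for x in bits:
        if x == 1:
            # Increment the exponent with probability 1/2^k.
            if random.random() < 1.0 / (1 << k):
                k += 1
    return (1 << k) - 1

# Example: compare the estimate to the true number of 1's.
stream = [random.randint(0, 1) for _ in range(100000)]
print("true count:", sum(stream), "estimate:", morris_count(stream))
```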

Page 9:

Approximate Counting: Analysis

For m = 0,…,H, let K_m = the value of the counter after seeing m 1's.
For i = 0,…,m, let p_{m,i} = Pr[K_m = i].

Recursion:
p_{0,0} = 1
p_{m,0} = 0, for m = 1,…,H
p_{m,i} = p_{m-1,i} (1 - 1/2^i) + p_{m-1,i-1} (1/2^(i-1)), for m = 1,…,H and i = 1,…,m-1
p_{m,m} = p_{m-1,m-1} (1/2^(m-1)), for m = 1,…,H

Page 10:

Approximate Counting: Analysis

Define: C_m = 2^(K_m)

Lemma: E[C_m] = m + 1.
Therefore, C_H - 1 is an unbiased estimator for H.
One can show that Var[C_H] is small, and therefore, w.h.p.,
H/2 ≤ C_H - 1 ≤ 2H.

Proof of lemma: by induction on m.
Basis: E[C_0] = 1, E[C_1] = 2.
Induction step: suppose m ≥ 2 and E[C_{m-1}] = m.

Page 11:

Approximate Counting: Analysis
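
A sketch of the inductive step E[C_m] = m + 1 behind this slide, derived from the recursion on Page 9; the algebra below is a standard reconstruction, not a verbatim copy of the slide.

```latex
\begin{aligned}
E[C_m] &= \sum_{i} 2^i\, p_{m,i}
        = \sum_{i} 2^i \Big[ p_{m-1,i}\Big(1 - \frac{1}{2^i}\Big) + p_{m-1,i-1}\,\frac{1}{2^{i-1}} \Big] \\
       &= \sum_{i} 2^i\, p_{m-1,i} \;-\; \sum_{i} p_{m-1,i} \;+\; 2\sum_{i} p_{m-1,i-1}
        = E[C_{m-1}] - 1 + 2 \\
       &= m + 1 \qquad \text{(by the induction hypothesis } E[C_{m-1}] = m\text{)} .
\end{aligned}
```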

Page 12:

Better Approximation

So far, we have a factor-2 approximation. How do we obtain a (1+ε)-approximation?

k ← 0
for i = 1 to n do
  if xi = 1, then with probability 1/(1 + ε)^k, set k ← k + 1
output ((1 + ε)^k - 1)/ε
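
A minimal Python sketch of this (1+ε) variant, mirroring the pseudocode above; the function name approx_count is an illustrative choice.

```python
import random

def approx_count(bits, eps):
    """(1+eps)-approximate counter: only the exponent k is stored."""
    k = 0
    base = 1.0 + eps
    for x in bits:
        if x == 1:
            # Increment the exponent with probability 1/(1+eps)^k.
            if random.random() < base ** (-k):
                k += 1
    return (base ** k - 1.0) / eps

# Example: a smaller eps trades a little more space for a tighter estimate.
print(approx_count([1] * 10000, eps=0.1))
```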

Page 13:

Distinct Elements [Flajolet, Martin 85], [Alon, Matias, Szegedy 96], [Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan 02]

Input: a vector x ∈ {1,2,…,m}^n
Goal: find D = the number of distinct elements of x
Example: if x = (1,2,3,1,2,3), then D = 3

Naïve solution: use a bit vector of size m, and track the values that appeared at least once.
O(m) bits of space

Can we do better?
Answer 1: No!
If we want the exact number, we need Ω(m) bits of space.
Information theory gives only Ω(log m); the Ω(m) bound needs communication complexity arguments.
Answer 2: Yes!
But only approximately: use only O(log m) bits of space.

Page 14:

Estimating the Size of a Random Set

Suppose we choose D ≪ M^(1/2) elements uniformly and independently from {1,…,M}:
X1 is uniformly chosen from {1,…,M}
X2 is uniformly chosen from {1,…,M}
…
XD is uniformly chosen from {1,…,M}

For each k = 1,…,D, how many elements of {1,…,M} do we expect to be smaller than min{X1,…,Xk}?
k = 1: we expect M/2 elements to be less than X1
k = 2: we expect M/3 elements to be less than min{X1,X2}
k = 3: we expect M/4 elements to be less than min{X1,X2,X3}
…
k = D: we expect M/(D+1) elements to be less than min{X1,…,XD}

Conversely, suppose S is a set of randomly chosen elements from {1,…,M} whose size is unknown.
Then, if t = min S, we can estimate |S| as M/t - 1.
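
A quick simulation of this estimate; the parameter values below are illustrative, and a single run can be off by a constant factor, which is exactly what the Lemma on Page 16 quantifies.

```python
import random

M = 10**12                      # universe size, much larger than D^2
D = 1000                        # the "unknown" set size we want to recover

# Draw D elements uniformly and independently from {1, ..., M}.
S = [random.randint(1, M) for _ in range(D)]

t = min(S)
print("true size:", D, "estimate M/t - 1:", round(M / t - 1))
```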

Page 15:

Distinct Elements, 1st Attempt

Let M ≫ m^2.
Pick a random "hash function" h: {1,…,m} → {1,…,M}, i.e.,
h(1),…,h(m) are chosen uniformly and independently from {1,…,M}.
Since M ≫ m^2, the probability of collisions is tiny.

min ← M
for i = 1 to n do
  if h(xi) < min, min ← h(xi)
output M/min
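
A minimal Python sketch of this first attempt. The truly random hash function is simulated by memoizing fresh random values, which of course uses more than O(log m) space; that part is for illustration only, and Pages 21-24 replace it with a 2-universal family. The function name distinct_elements_estimate is an illustrative choice.

```python
import random

def distinct_elements_estimate(stream, m):
    """Track the minimum hash value seen; output M/min as the estimate of D."""
    M = 100 * m * m                 # M >> m^2, so hash collisions are unlikely
    h = {}                          # simulated truly random h: [m] -> [M] (illustration only)
    minimum = M
    for x in stream:
        if x not in h:
            h[x] = random.randint(1, M)
        minimum = min(minimum, h[x])
    return M / minimum

# Example: the stream (1,2,3,1,2,3) has D = 3 distinct elements.
print(distinct_elements_estimate([1, 2, 3, 1, 2, 3], m=3))
```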

Page 16:

Distinct Elements: Analysis

Space: O(log M) = O(log m)
(Not quite; we'll discuss this later.)

Correctness: let a1,…,aD be the distinct values of x1,…,xn.
S = { h(a1),…,h(aD) } is a set of D random and independent elements from {1,…,M}.
Note: min = min S, so the algorithm outputs M/(min S).

Lemma: With probability at least 2/3, D/6 ≤ M/min ≤ 6D.

Page 17:

Distinct Elements: Correctness

Part 1: show

Define for k = 1,…,D:

Define:

Note:

Page 18:

Markov’s Inequality

X ≥ 0: a non-negative random variable, t > 1. Then:
Pr[ X ≥ t·E[X] ] ≤ 1/t

Need to show: By Markov’s inequality,
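
The formulas for Part 1 (Pages 17-18) are not spelled out above; the following is a hedged reconstruction of the standard argument, with the indicator variables Z_k and the threshold M/(6D) chosen to match the factor 6 in the Lemma on Page 16.

```latex
% Part 1: show that  Pr[ M/min > 6D ] <= 1/6.
\begin{gather*}
\text{For } k = 1,\dots,D \text{ define } Z_k = \mathbf{1}\{\, h(a_k) < M/(6D) \,\},
\qquad Z = \sum_{k=1}^{D} Z_k , \\
M/\min > 6D \;\Longleftrightarrow\; \min S < M/(6D) \;\Longleftrightarrow\; Z \ge 1 , \\
E[Z] = \sum_{k=1}^{D} \Pr\big[h(a_k) < M/(6D)\big] \le D \cdot \frac{1}{6D} = \frac{1}{6} ,
\qquad
\Pr[Z \ge 1] \le E[Z] \le \frac{1}{6}
\quad \text{(Markov's inequality with } t = 1/E[Z]\text{)} .
\end{gather*}
```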

Page 19:

Distinct Elements: Correctness

Part 2: show

Define for k = 1,…,D:

Define:

Note:

Page 20:

Chebyshev’s Inequality X: an arbitrary random variable > 0 Then:

Need to show: By Chebyshev’s inequality,

By independence of Y1,…,YD:

Hence,
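
Similarly for Part 2 (Pages 19-20), a hedged reconstruction: the indicators Y_k (matching the Y1,…,YD mentioned above) and the threshold 6M/D are chosen so that, together with Part 1, the total failure probability is 1/6 + 1/6 = 1/3, as in the Lemma on Page 16.

```latex
% Part 2: show that  Pr[ M/min < D/6 ] <= 1/6.
\begin{gather*}
\text{For } k = 1,\dots,D \text{ define } Y_k = \mathbf{1}\{\, h(a_k) < 6M/D \,\},
\qquad Y = \sum_{k=1}^{D} Y_k , \\
M/\min < D/6 \;\Longleftrightarrow\; \min S > 6M/D \;\Longleftrightarrow\; Y = 0 , \\
E[Y] \approx D \cdot \frac{6M/D}{M} = 6 ,
\qquad
\operatorname{Var}[Y] = \sum_{k=1}^{D} \operatorname{Var}[Y_k] \le E[Y]
\quad \text{(by independence; pairwise independence suffices)} , \\
\Pr[Y = 0] \le \Pr\big[\, |Y - E[Y]| \ge E[Y] \,\big]
\le \frac{\operatorname{Var}[Y]}{E[Y]^2} \le \frac{1}{E[Y]} = \frac{1}{6}
\quad \text{(Chebyshev)} .
\end{gather*}
```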

Page 21:

How to Store the Hash Function?

How many bits are needed to represent a truly random hash function h: [m] → [M]?
O(m log M) = O(m log m) bits, which is more than the naïve algorithm!

Solution: use "small" families of hash functions.
H will be a family of functions h: [m] → [M].
|H| = O(m^c) for some constant c.
Each h ∈ H can be represented in O(log m) bits.
We need H to be "explicit": given the representation of h, we can compute h(x) efficiently, for any x.
How do we make sure H has the "random-like" properties of totally random hash functions?

Page 22:

Universal Hash Functions [Carter, Wegman 79]

H is a 2-universal family of hash functions if:
for all x ≠ y ∈ [m] and for all z, w ∈ [M],
when h is chosen from H at random, Pr[h(x) = z and h(y) = w] = 1/M^2.

Conclusions:
For each x, h(x) is uniform in [M].
For all x ≠ y, h(x) and h(y) are independent.
h(1),…,h(m) is a sequence of uniform, pairwise-independent random variables.

k-universal families: a straightforward generalization.

Page 23:

Construction of a Universal Family

Suppose m = M and m is a prime power. [m] can then be associated with the finite field F_m.
Every pair of elements a, b ∈ F_m defines one hash function in H, so |H| = |F_m|^2 = m^2.

h_{a,b}(x) = ax + b (operations in F_m)

Note: if x ≠ y ∈ [m] and z, w ∈ [m], then h_{a,b}(x) = z and h_{a,b}(y) = w iff
ax + b = z and ay + b = w.
Since x ≠ y, this system has a unique solution (a, b), and thus if we choose a, b at random, the probability of hitting that solution is exactly 1/m^2.
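
A minimal Python sketch of this construction, specialized to a prime m (so that F_m is just arithmetic mod m); the class name PairwiseHash is an illustrative choice.

```python
import random

class PairwiseHash:
    """h_{a,b}(x) = (a*x + b) mod m for a prime m: a 2-universal family.

    Storing a hash function costs only the pair (a, b), i.e. O(log m) bits.
    Values are taken in {0, ..., m-1} rather than {1, ..., m}.
    """
    def __init__(self, m):
        self.m = m                      # assumed prime, so Z_m is a field
        self.a = random.randrange(m)    # a, b chosen uniformly from F_m
        self.b = random.randrange(m)

    def __call__(self, x):
        return (self.a * x + self.b) % self.m

# Example: different random (a, b) pairs give different hash functions.
h = PairwiseHash(101)
print([h(x) for x in range(5)])
```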

Page 24:

Distinct Elements, 2nd Attempt

Use a random hash function from a 2-universal family of hash functions rather than a totally random hash function.

Space:
O(log m) for tracking the minimum
O(log m) for storing the hash function

Correctness:
Part 1:
h(a1),…,h(aD) are still uniform in [M].
Linearity of expectation holds regardless of whether Z1,…,Zk are independent or not.
Part 2:
h(a1),…,h(aD) are still uniform in [M].
Main point: the variance of pairwise-independent variables is additive:
Var[Y1 + … + YD] = Var[Y1] + … + Var[YD]

Page 25:

Distinct Elements, Better Approximation

So far we had a factor-6 approximation. How do we get a better one?

(1 + ε)-approximation algorithm:
Find the t = O(1/ε^2) smallest hash values, rather than just the smallest one.
If v is the largest among these, output tM/v.

Space: O((1/ε^2) log m)
A better algorithm: O(1/ε^2 + log m)
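
A minimal Python sketch of this refinement, again simulating a random hash function for illustration; the function name, the set-based bookkeeping, and the value of t in the example are illustrative assumptions (the lecture only specifies t = O(1/ε^2)).

```python
import random

def distinct_elements_t_smallest(stream, m, t):
    """Keep the t smallest distinct hash values; if v is the largest of them,
    output t*M/v.  Taking t = Theta(1/eps^2) gives a (1+eps)-approximation."""
    M = 100 * m * m                  # M >> m^2, so hash collisions are unlikely
    h = {}                           # simulated truly random hash (illustration only)
    kept = set()                     # the (at most) t smallest distinct hash values seen
    for x in stream:
        if x not in h:
            h[x] = random.randint(1, M)
        kept.add(h[x])
        if len(kept) > t:
            kept.discard(max(kept))  # keep only the t smallest
    if len(kept) < t:                # fewer than t distinct values: the count is exact
        return len(kept)
    v = max(kept)                    # the t-th smallest hash value
    return t * M / v

# Example: a long stream over roughly 500 distinct values.
stream = [random.randint(1, 500) for _ in range(20000)]
print(distinct_elements_t_smallest(stream, m=500, t=64))
```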

Page 26:

Frequency Moments [Alon, Matias, Szegedy 96]

Input: a vector x ∈ {1,2,…,m}^n
Goal: find Fk = the k-th frequency moment of x

For each j ∈ {1,…,m}, fj = # of occurrences of j in x, and Fk = f1^k + f2^k + … + fm^k.
Example: if x = (1,1,1,2,2,3) then f1 = 3, f2 = 2, f3 = 1.

Examples:
F1 = n (counting)
F0 = number of distinct elements
F2 = a measure of "pairwise collisions"
Fk = a measure of "k-wise collisions"
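
For concreteness, a naive O(m)-space computation of F_k that matches the definition above and reproduces the slide's example; the function name frequency_moment is an illustrative choice.

```python
from collections import Counter

def frequency_moment(x, k):
    """Naively compute F_k = sum over values j of (frequency of j)^k.

    This uses O(m) space, which is exactly what streaming algorithms avoid.
    """
    freqs = Counter(x)
    return sum(f ** k for f in freqs.values())

x = (1, 1, 1, 2, 2, 3)             # the slide's example: f1 = 3, f2 = 2, f3 = 1
print(frequency_moment(x, 0))      # F0 = 3  (distinct elements)
print(frequency_moment(x, 1))      # F1 = 6  (length of x)
print(frequency_moment(x, 2))      # F2 = 14 (3^2 + 2^2 + 1^2)
```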

Page 27:

Frequency Moments: Data Stream Algorithms

F0: O(1/ε^2 + log m)
F1: O(log log n + log(1/ε))
F2: O((1/ε^2)(log m + log n))
Fk, k > 2: O((1/ε^2) m^(1-2/k))

Page 28:

End of Lecture 12