# Sketching, Sampling and Other Sublinear Algorithms: Streaming

Sketching, Sampling and other Sublinear Algorithms: Streaming, by Alex Andoni (MSR SVC)

Post on 23-Feb-2016

Tags: sublinear space algorithm

## Transcript

Sketching, Sampling and other Sublinear Algorithms: Streaming

Alex Andoni (MSR SVC)

## A scenario

Stream of packets arriving at a router: 131.107.65.14, 131.107.65.14, 18.9.22.69, 18.9.22.69, 80.97.56.20, 80.97.56.20, 131.107.65.14, …

| IP | Frequency |
| --- | --- |
| 131.107.65.14 | 3 |
| 18.9.22.69 | 2 |
| 80.97.56.20 | 2 |
| 128.112.128.81 | 9 |
| 127.0.0.1 | 8 |
| 257.2.5.7 | 0 |
| 7.8.20.13 | 1 |

Challenge: compute something on the table, using small space.

Examples of "something":
• # distinct IPs
• max frequency
• other statistics…

## Sublinear: a panacea?

A sub-linear space algorithm for solving the Travelling Salesperson Problem? Sorry, perhaps a different lecture.

Even very simple problems are hard to solve sublinearly. Ex: what is the count of distinct IPs seen?

Will settle for:
• Approximate algorithms: 1+ε approximation, i.e. (true answer) ≤ output ≤ (1+ε) · (true answer)
• Randomized: the above holds with probability 95%

A quick and dirty way to get a sense of the data.

## Streaming data

• Data through a router
• Data stored on a hard drive, or streamed remotely
• More efficient to do a linear scan on a hard drive; working memory is the (smaller) main memory


## Application areas

Data can come from:
• Network logs, sensor data
• Real-time data
• Search queries, served ads
• Databases (query planning)
• …

## Problem 1: # distinct elements

Problem: compute the number of distinct elements in the stream.

Trivial solution: space proportional to the number of distinct elements. Will see: O(log n) space (approximate).

Stream: 2 5 7 5 5

| i | Frequency |
| --- | --- |
| 2 | 1 |
| 5 | 3 |
| 7 | 1 |
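As a point of comparison, the trivial exact solution can be written in a few lines of Python (my illustration, not part of the slides); its space grows with the number of distinct elements:

```python
def count_distinct_exact(stream):
    # Trivial solution: remember every distinct element seen so far.
    # Space is proportional to the number of distinct elements.
    seen = set()
    for x in stream:
        seen.add(x)
    return len(seen)

print(count_distinct_exact([2, 5, 7, 5, 5]))  # → 3
```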

## Distinct Elements: idea 1

Algorithm:
• Hash function h into [0,1]
• Compute minHash = min over the stream of h(i)
• Output 1/minHash − 1

"Analysis": repeats of the same element i don't matter; for m distinct elements, E[minHash] = 1/(m+1).

Algorithm DISTINCT:
```
Initialize:     minHash = 1
                hash function h into [0,1]
Process(int i): if (h(i) < minHash) minHash = h(i);
Output:         1/minHash - 1
```
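A runnable Python sketch of idea 1 (my illustration: the hash into [0,1] is simulated by seeding a PRNG per element, and several independent hash functions are averaged to reduce the variance, anticipating the "repeat with different hash functions" idea from the next slide):

```python
import random

def min_hash(stream, salt):
    # Simulated hash h: element -> [0,1); only the minimum is stored.
    m = 1.0
    for x in stream:
        hx = random.Random(hash((x, salt))).random()
        if hx < m:
            m = hx
    return m

def distinct_estimate(stream, reps=200):
    # For m distinct elements, E[minHash] = 1/(m+1); average several
    # independent minima, then invert to estimate m.
    stream = list(stream)
    avg = sum(min_hash(stream, salt) for salt in range(reps)) / reps
    return 1.0 / avg - 1.0
```

For a stream with 1000 distinct elements plus repeats (which do not affect the minimum), the estimate lands near 1000.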

(Figure: h(2), h(5), h(7) plotted on the interval [0,1]; the expected minimum of m uniform hash values is 1/(m+1).)

[Flajolet-Martin’85, Alon-Matias-Szegedy’96]

## Distinct Elements: idea 2

Store minHash approximately:
• Store just the count of leading zeros of minHash; need only O(log log n) bits
• Randomness: 2-wise independence is enough! O(log n) bits
• Better accuracy using more space: 1±ε error by repeating O(1/ε²) times with different hash functions
• HyperLogLog: can do it with just one hash function [FFGM'07]

ZEROS(x) = number of leading zeros in the binary expansion of x; e.g. for x = 0.0000001100101, ZEROS(x) = 6.


Algorithm DISTINCT (rounded):
```
Initialize:     minHash2 = 0
                hash function h into [0,1]
Process(int i): if (h(i) < 1/2^minHash2) minHash2 = ZEROS(h(i));
Output:         2^minHash2
```
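The rounded variant in the same style (my illustration; only the zero count is kept, and the output 2^minHash2 approximates 1/minHash within a factor of 2):

```python
import math
import random

def distinct_estimate_rounded(stream, salt=0):
    # Keep only ZEROS(minHash): the number of leading zero bits of the
    # smallest hash value seen so far (O(log log n) bits of state).
    z = 0
    for x in stream:
        hx = random.Random(hash((x, salt))).random()
        if 0.0 < hx < 2.0 ** (-z):
            z = int(-math.log2(hx))  # ZEROS(hx): leading zero bits
    return 2.0 ** z
```

A single hash function only gives constant-factor accuracy; taking a median over independent hash functions sharpens it.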

## Problem 2: max count

Problem: compute the maximum frequency of an element in the stream.

Bad news: it is hard to distinguish whether an element repeated (max = 1 vs 2).

Good news: can find the "heavy hitters", i.e. elements with frequency > (total frequency)/s, using space proportional to s.

Stream: 2 5 7 5 5

| i | Frequency |
| --- | --- |
| 2 | 1 |
| 5 | 3 (heavy hitter) |
| 7 | 1 |

## Heavy Hitters: CountMin

(Figure: an L × w array of counters, updated as the stream 2 5 7 5 5 arrives; each element increments one counter in every row.)

Algorithm CountMin:
```
Initialize(w, L): array Sketch[L][w]
                  L hash functions h[L], into {0,...,w-1}
Process(int i):   for (j = 0; j < L; j++)
                      Sketch[j][h[j](i)] += 1;
Output:           foreach i in PossibleIP {
                      freq[i] = int.MaxValue;
                      for (j = 0; j < L; j++)
                          freq[i] = min(freq[i], Sketch[j][h[j](i)]);
                  }   // freq[] is the frequency estimate
```
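A compact Python version of CountMin (my illustration; the row hash functions are simulated with seeded PRNGs, and names follow the pseudocode above):

```python
import random

class CountMin:
    def __init__(self, w, L, seed=0):
        self.w, self.L = w, L
        self.sketch = [[0] * w for _ in range(L)]
        self.salts = [(seed, j) for j in range(L)]  # one "hash" per row

    def _h(self, j, x):
        # Simulated hash h[j]: element -> {0, ..., w-1}
        return random.Random(hash((x, self.salts[j]))).randrange(self.w)

    def process(self, x):
        for j in range(self.L):
            self.sketch[j][self._h(j, x)] += 1

    def estimate(self, x):
        # Collisions only add mass, so every row overestimates;
        # the minimum over rows is the tightest estimate.
        return min(self.sketch[j][self._h(j, x)] for j in range(self.L))
```

After processing the stream 2 5 7 5 5, estimate(5) is at least the true count 3 and at most the total stream length 5.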

(Figure: element 2 hashed by h1, h2, h3 into one cell per row of the L × w table; freq[2] = min of the three counters.)

[Charikar-Chen-FarachColton’04, Cormode-Muthukrishnan’05]

## Heavy Hitters: analysis

freq[5] = frequency of 5, plus some "extra mass" from colliding elements:
• Expected "extra mass" ≤ total mass / w
• Chebyshev: the estimate is good with probability > 1/2; repeat over the L rows to get high probability (for all elements)
• Compute the heavy hitters from freq[]


## Problem 3: Moments

Problem: compute the k-th frequency moment F_k = Σ_i f_i^k, where f_i is the frequency of element i:
• k = 2: variance; higher moments for skewness (k = 3), kurtosis (k = 4), etc.
• large k: a different proxy for the max frequency

| i | Frequency f | f² | f⁴ |
| --- | --- | --- | --- |
| 2 | 1 | 1 | 1 |
| 5 | 3 | 9 | 81 |
| 7 | 2 | 4 | 16 |

F₂ = 1 + 9 + 4 = 14; F₄ = 1 + 81 + 16 = 98.
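The two moments in the table can be checked directly (a hypothetical one-line helper, not from the slides):

```python
def frequency_moment(freqs, k):
    # F_k = sum over elements of (frequency)^k
    return sum(f ** k for f in freqs)

freqs = [1, 3, 2]  # frequencies of elements 2, 5, 7 above
print(frequency_moment(freqs, 2))  # → 14
print(frequency_moment(freqs, 4))  # → 98
```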

## F₂ moment

Use the Johnson-Lindenstrauss lemma (2nd lecture)! Store the sketch Sx:
• x = frequency vector
• S = k × n matrix of Gaussian entries
• Update on element i: add the i-th column of S to the sketch

Guarantees: O(1/ε²) counters (words), O(1/ε²) time to update. Better: constant non-zero entries per column, constant update time [AMS'96, TZ'04].

F_k for k > 2: precision sampling => next.
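An illustrative Gaussian F₂ sketch along these lines (all names are my own; k rows of Gaussian entries, and the estimate is the squared norm of the sketch divided by k):

```python
import random

def f2_sketch(x, k, seed=0):
    rng = random.Random(seed)
    n = len(x)
    # S is a k x n matrix of Gaussian entries; only Sx is stored.
    S = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
    return [sum(S[r][i] * x[i] for i in range(n)) for r in range(k)]

def f2_estimate(sketch):
    # E[ ||Sx||^2 / k ] = ||x||^2 = F_2
    return sum(v * v for v in sketch) / len(sketch)
```

For the frequency vector [1, 3, 2] above (F₂ = 14), a few thousand rows already give a small relative error.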

## Scenario 2: distributed traffic

Statistics on the traffic difference/aggregate between two routers. E.g.: by how many packets does the traffic differ?

Linearity is the power!
• Sketch(data A) + Sketch(data B) = Sketch(data A + data B)
• Sketch(data A) − Sketch(data B) = Sketch(data A − data B)

Two sketches are sufficient to compute something on the difference or sum.
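Linearity is easy to check on a toy ±1 linear sketch (my illustration; both "routers" must use the same matrix, here enforced by sharing the seed, and integer entries keep the arithmetic exact):

```python
import random

def linear_sketch(x, k, seed=0):
    rng = random.Random(seed)
    n = len(x)
    # A fixed k x n matrix of +/-1 entries, shared via the seed.
    S = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]
    return [sum(S[r][i] * x[i] for i in range(n)) for r in range(k)]

a = [1, 1, 0, 1]  # router A's frequency vector
b = [1, 2, 0, 0]  # router B's frequency vector
diff = [ai - bi for ai, bi in zip(a, b)]

sa, sb = linear_sketch(a, k=5), linear_sketch(b, k=5)
# Sketch(A) - Sketch(B) equals Sketch(A - B), entry by entry:
assert [p - q for p, q in zip(sa, sb)] == linear_sketch(diff, k=5)
```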

(Figure: the two routers' traffic tables.)

| Router 1: IP | Frequency |
| --- | --- |
| 131.107.65.14 | 1 |
| 18.9.22.69 | 1 |
| 35.8.10.140 | 1 |

| Router 2: IP | Frequency |
| --- | --- |
| 131.107.65.14 | 1 |
| 18.9.22.69 | 2 |

Stream: 131.107.65.14, 18.9.22.69, 18.9.22.69, 35.8.10.140

## Common primitive: estimate a sum

Given: n quantities a₁, …, a_n in the range [0,1]. Goal: estimate S = a₁ + … + a_n "cheaply".

Standard sampling: pick a random set J of size m. Estimator: S̃ = (n/m) · Σ_{j∈J} a_j.

Chebyshev bound: with 90% success probability, the additive error is O(n/√m).
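The standard sampling estimator, written out in Python (my illustration; the error behaves like n/√m for values in [0,1]):

```python
import random

def sample_sum_estimate(a, m, seed=0):
    rng = random.Random(seed)
    n = len(a)
    sample = rng.sample(range(n), m)  # random subset J of size m
    # Scale the sampled partial sum back up by n/m.
    return (n / m) * sum(a[j] for j in sample)
```

With n = 10000 values in [0,1] and m = 1000 samples, the additive error is typically well below n/√m ≈ 316.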

(Figure: sample a₁ and a₃ out of a₁, a₂, a₃, a₄, and compute an estimate from the sampled values.)

## Precision Sampling Framework

Alternative "access" to the a_i's: for each term a_i, we get a (rough) estimate ã_i up to some precision u_i, chosen in advance: |ã_i − a_i| < u_i.

Challenge: achieve a good trade-off between:
• the quality of the approximation to S
• using only weak precisions u_i (minimize the "cost" of estimating S)

The game:
1. fix the precisions u₁, …, u_n
2. fix the ã_i's s.t. |ã_i − a_i| < u_i
3. given the ã_i's, output S̃ approximating S = Σ a_i

What is cost? Here, average cost = (1/n) · Σ 1/u_i: to achieve precision u_i, one uses "resources" 1/u_i. E.g., if a_i is itself a sum computed by subsampling, then one needs 1/u_i samples. For example, one can choose all u_i = 1/n; then the average cost ≈ n.

## Precision Sampling Lemma

Goal: estimate Σa_i from {ã_i} satisfying |ã_i − a_i| < u_i.

Precision Sampling Lemma: can get, with 90% success:
• O(1) additive error and 1.5 multiplicative error: S − O(1) < S̃ < 1.5·S + O(1)
• with average cost equal to O(log n)

Example: distinguish Σa_i = 3 vs Σa_i = 0. Consider two extreme cases:
• if three a_i = 1: it is enough to have a crude approximation for all of them (u_i = 0.1)
• if all a_i = 3/n: only a few need a good approximation u_i = 1/n, and the rest can have u_i = 1

Refined guarantee: ε additive and 1+ε multiplicative error, S − ε < S̃ < (1+ε)·S + ε, with average cost O(ε⁻³ log n).

[A-Krauthgamer-Onak'11]

## Precision Sampling Algorithm

(Recall the lemma: with 90% success, S − O(1) < S̃ < 1.5·S + O(1) at average cost O(log n).)

Algorithm:
• Choose each u_i ∈ [0,1] i.i.d. uniformly
• Estimator: S̃ = count of the i's s.t. ã_i / u_i > 6 (up to a normalization constant)

Proof of correctness:
• we use only those ã_i which are 1.5-approximations of a_i
• E[S̃] ≈ Σ Pr[a_i / u_i > 6] = Σ a_i / 6
• E[1/u_i] = O(log n) w.h.p.

For the ε version, S − ε < S̃ < (1+ε)·S + ε: the estimator becomes a function of [ã_i/u_i − 4/ε]⁺ and the u_i's, with the concrete distribution for u_i being the minimum of O(ε⁻³) uniform random variables; the average cost becomes O(ε⁻³ log n).
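A toy simulation of the basic estimator (my illustration; it feeds the true a_i in place of the rough estimates ã_i, and averages a few independent trials to tame the variance):

```python
import random

def precision_sample_estimate(a, seed=0):
    rng = random.Random(seed)
    # For u uniform in [0,1] and a_i in [0,1]:
    # Pr[a_i / u > 6] = a_i / 6, so 6 * count has expectation sum(a_i).
    count = sum(1 for ai in a if ai > 6 * rng.random())
    return 6 * count

a = [0.6] * 1000  # true sum S = 600
ests = [precision_sample_estimate(a, seed=s) for s in range(25)]
avg = sum(ests) / len(ests)
```

The average lands near the true sum 600, even though each term is only inspected through the crude threshold test against its precision u_i.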

## Moments (k > 2) via precision sampling

Theorem: a linear sketch for F_k with 1+ε approximation and O(n^{1−2/k} · poly(ε⁻¹ log n)) space (90% success probability).

Sketch:
• pick random precisions u_i, and let y_i = x_i / u_i^{1/k}
• throw the y_i into one hash table H with m cells

Estimator: apply the precision-sampling estimator to the (heavy) y_i's recovered from the table cells.

Randomness: bounded independence suffices.

(Figure: x = (x₁, …, x₆) hashed into H = [y₁+y₃, y₄, y₂+y₅+y₆].)

## Streaming++

LOTS of work in the area: