linear sketches - umass amherstmcgregor/stocworkshop/... · 2012-05-25 · linear sketches! answer...

Linear Sketches with Applications to Data Streams Andrew McGregor University of Massachusetts



Linear Sketches with Applications to Data Streams

Andrew McGregor, University of Massachusetts


Linear Sketches

• Random linear projection: $M: \mathbb{R}^n \to \mathbb{R}^k$ (where $k \ll n$) that preserves properties of any $v \in \mathbb{R}^n$ with high probability.

• Many Results: Estimating norms, entropy, support size, quantiles, heavy hitters, fitting histograms and polynomials, ...

• Rich Theory: Related to compressed sensing and sparse recovery, dimensionality reduction and metric embeddings, ...

\[
\begin{bmatrix} \; M \; \end{bmatrix}
\begin{bmatrix} \\ v \\ \\ \end{bmatrix}
=
\begin{bmatrix} Mv \end{bmatrix}
\;\longrightarrow\; \text{answer}
\]


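Why? Distributed Processing

Input: $v = v_1 + v_2 + v_3 + v_4$, with the parts $v_1, \dots, v_4$ held at separate sites.

Each site computes its own sketch: $Mv_1$, $Mv_2$, $Mv_3$, $Mv_4$.

Output: $Mv = Mv_1 + Mv_2 + Mv_3 + Mv_4$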
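The merge step relies only on linearity, which can be checked numerically. The following is a minimal sketch, assuming a Gaussian projection for M and made-up sizes n and k (neither is specified on the slide); any fixed matrix shared by all sites would behave the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 20                      # illustrative sizes with k << n (assumption)

# All sites must share the same random projection M.
# A Gaussian matrix is an assumption made for this illustration only.
M = rng.standard_normal((k, n))

# Four sites each hold one part of the input vector.
parts = [rng.integers(0, 5, size=n) for _ in range(4)]
v = sum(parts)

# Each site sketches its part locally; the coordinator adds k-dimensional sketches.
merged = sum(M @ p for p in parts)

# The merged sketch equals the sketch of the full vector.
assert np.allclose(merged, M @ v)
```

Only the k-dimensional sketches travel between sites, never the n-dimensional parts.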


Why? Data Streams

• Stream: m elements from some universe of size n

e.g., 3,5,3,7,5,4,8,5,3,7,5,4,8,6,3,2,6,4,7, ...

• Goal: Estimate properties of the stream, e.g., median, number of distinct elements, longest increasing subsequence.

• The Catch:

• i) Limited working memory, e.g., polylog(n,m)

• ii) Access data sequentially & process elements quickly

• Rich theory with links to communication complexity, pseudo-randomness... Very applicable to network monitoring, sensor network fusion, I/O efficiency in external memory...

• Sketches: Can maintain $Mf$ where $f_i$ is the frequency of $i$. On seeing $j$:

\[ Mf \leftarrow Mf + Me_j \]
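The update rule amounts to adding column j of M to the running sketch. Here is a small illustration, assuming a random sign matrix for M and a universe of size 8; M is stored explicitly only for clarity, whereas a small-space implementation would generate its entries from hash functions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 3                                   # universe {1..8}; sizes are illustrative
M = rng.choice([-1.0, 1.0], size=(k, n))      # any fixed k x n matrix works here (assumption)

stream = [3, 5, 3, 7, 5, 4, 8, 5, 3, 7]       # prefix of the example stream above
sketch = np.zeros(k)
for j in stream:
    sketch += M[:, j - 1]                     # Mf <- Mf + M e_j, i.e. add column j of M

# Exact frequency vector, computed only to check the sketch.
f = np.bincount(stream, minlength=n + 1)[1:]
assert np.allclose(sketch, M @ f)
```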


I: Classic Sketches II: Graph Sketches III: Other Things


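Classics: Count-Min

• Count-Min: In each column, place a “1” in a random row.

\[
\begin{bmatrix}
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 & 1
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \\ v_5 \\ v_6 \\ v_7 \\ v_8 \end{bmatrix}
=
\begin{bmatrix} v_3 + v_4 \\ v_1 + v_5 + v_7 \\ v_2 + v_6 + v_8 \end{bmatrix}
\]

• Point Queries: For example, use $v_2 + v_6 + v_8$ as estimate for $v_2$.

• Analysis: Over-estimate by $\le 2F_1/k$ with probability at least 1/2. Setting $k = O(\varepsilon^{-1})$ and repeating $O(\log n)$ times yields estimate $\tilde{v}$ of $v$ such that with high probability:

\[
\forall i \in [n]: \quad v_i \le \tilde{v}_i \le v_i + \varepsilon F_1 \quad \text{where } F_1 = \sum_i |v_i|
\]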
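A compact version of Count-Min can be written as follows. This is a sketch under stated assumptions: the class name `CountMin`, its parameters, and the explicit bucket tables are illustrative choices, and a real small-space implementation would replace the stored tables with pairwise-independent hash functions.

```python
import random

class CountMin:
    """reps independent copies of a k-row Count-Min matrix over universe {0..n-1}."""

    def __init__(self, k, reps, n, seed=0):
        rng = random.Random(seed)
        # bucket[r][i] is the random row holding the "1" of column i in copy r.
        # Stored explicitly for clarity (assumption); hashing would save space.
        self.bucket = [[rng.randrange(k) for _ in range(n)] for _ in range(reps)]
        self.table = [[0] * k for _ in range(reps)]

    def update(self, i, delta=1):
        # Linear update: add delta to one counter per copy.
        for r, row in enumerate(self.table):
            row[self.bucket[r][i]] += delta

    def query(self, i):
        # Each copy over-estimates v_i when v >= 0, so take the minimum.
        return min(row[self.bucket[r][i]] for r, row in enumerate(self.table))

# Usage on the example stream from the data-streams slide.
stream = [3, 5, 3, 7, 5, 4, 8, 5, 3, 7, 5, 4, 8, 6, 3, 2, 6, 4, 7]
cm = CountMin(k=4, reps=5, n=9)
for j in stream:
    cm.update(j)
estimate_for_5 = cm.query(5)   # never below the true count (4) on insert-only input
```

Taking the minimum over copies is valid only when all frequencies are non-negative; with deletions that drive entries negative, the lower bound no longer holds.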


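Classics: Count-Sketch

• Count-Sketch: Like Count-Min but each non-zero entry is chosen uniformly at random from $\{-1, 1\}$.

\[
\begin{bmatrix}
0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 \\
-1 & 0 & 0 & 0 & 1 & 0 & -1 & 0 \\
0 & 1 & 0 & 0 & 0 & -1 & 0 & 1
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \\ v_5 \\ v_6 \\ v_7 \\ v_8 \end{bmatrix}
=
\begin{bmatrix} v_3 - v_4 \\ -v_1 + v_5 - v_7 \\ v_2 - v_6 + v_8 \end{bmatrix}
\]

• Point Queries: For example, use $v_2 - v_6 + v_8$ as estimate for $v_2$.

• Analysis: Correct in expectation with variance at most $F_2/k$. Setting $k = O(\varepsilon^{-2})$ and repeating $O(\log n)$ times yields estimate $\tilde{v}$ of $v$ such that with high probability:

\[
\forall i \in [n]: \quad \tilde{v}_i = v_i \pm \varepsilon \sqrt{F_2} \quad \text{where } F_2 = \sum_i |v_i|^2
\]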
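Count-Sketch differs from Count-Min in two places: updates are multiplied by a random sign, and the copies are combined with a median instead of a minimum. The following is a minimal sketch; the class name, parameters, and explicit sign and bucket tables are assumptions made for illustration.

```python
import random
from statistics import median

class CountSketch:
    """reps independent copies of a k-row Count-Sketch matrix over universe {0..n-1}."""

    def __init__(self, k, reps, n, seed=0):
        rng = random.Random(seed)
        # Explicit tables stand in for hash functions (assumption, for clarity).
        self.bucket = [[rng.randrange(k) for _ in range(n)] for _ in range(reps)]
        self.sign = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(reps)]
        self.table = [[0] * k for _ in range(reps)]

    def update(self, i, delta=1):
        # Linear update: works for insertions (delta > 0) and deletions (delta < 0).
        for r, row in enumerate(self.table):
            row[self.bucket[r][i]] += self.sign[r][i] * delta

    def query(self, i):
        # Each copy is unbiased for v_i; the median suppresses outliers.
        return median(self.sign[r][i] * row[self.bucket[r][i]]
                      for r, row in enumerate(self.table))

cs = CountSketch(k=4, reps=5, n=10)
cs.update(3, 7)               # v_3 = 7, all other coordinates zero
only_item = cs.query(3)       # exact here, since nothing collides with item 3
cs.update(3, -7)              # delete everything again
after_delete = cs.query(3)
```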


New Classic: Lp-Sampling

• Goal: Return $(i, v_i)$ where $i$ is chosen with probability proportional to $(1 \pm \varepsilon)|v_i|^p$.

• Sketch: Count-Sketch on $u$ where $u_i = \lambda_i v_i$ and each $\lambda_i^{-2}$ is chosen uniformly at random from $[0,1]$:

\[
\begin{bmatrix} \text{Count-Sketch} \end{bmatrix}
\begin{bmatrix}
\lambda_1 & & & \\
 & \lambda_2 & & \\
 & & \ddots & \\
 & & & \lambda_8
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_8 \end{bmatrix}
\]

• Post-Processing: Estimate $\tilde{u}_i$ and return $(i, v_i)$ if $\tilde{u}_i^2 > F_2/\varepsilon$.

\[
\Pr\left[u_i^2 > F_2/\varepsilon\right] = \Pr\left[\lambda_i^{-2} < \varepsilon v_i^2 / F_2\right] = \varepsilon v_i^2 / F_2
\]

• Repeat $O(1/\varepsilon)$ times to find a sample.

• Analysis: $O(\varepsilon^{-1} \log^2 n)$-dimensional Count-Sketch suffices.
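The threshold probability can be checked by simulation. The snippet below is a sketch that works with the exact scaled vector $u$ rather than a Count-Sketch estimate of it, so it illustrates only the scaling step; the vector `v`, the value of `eps`, and the number of trials are made-up inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
v = np.array([3.0, -1.0, 2.0, 0.5])       # hypothetical input vector
eps = 0.1
F2 = np.sum(v ** 2)

trials = 200_000
# t plays the role of lambda_i^{-2}: uniform on (0, 1], independently per coordinate.
t = 1.0 - rng.random((trials, v.size))
u_sq = v ** 2 / t                         # u_i^2 = lambda_i^2 * v_i^2

# Empirical probability that coordinate i passes the threshold F2 / eps.
hit_rate = np.mean(u_sq > F2 / eps, axis=0)

# Prediction from the slide: eps * v_i^2 / F2, i.e. proportional to v_i^2.
predicted = eps * v ** 2 / F2
```

Since the pass probability of coordinate i is proportional to $v_i^2$, a coordinate that passes is distributed like an L2 sample.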


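L2 Sampling: Proof Sketch

Recall: $O(w \log n)$-dimensional Count-Sketch ensures

\[ \tilde{u}_i = u_i \pm \sqrt{F_2(u)/w} \]

Exercise: $F_2(u) = O(F_2(v) \log n)$ with probability 0.99.

Set $w = O(\varepsilon^{-1} \log n)$. If $\tilde{u}_i^2 > F_2(v)/\varepsilon$ then,

\[ \sqrt{F_2(u)/w} \le \sqrt{\varepsilon F_2(v)} \le \varepsilon |\tilde{u}_i| \]

Hence get $(1 \pm \varepsilon)$ approximation if value ≥ threshold.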


Lp Sampling: Applications

For p=2: Estimating $F_k = \sum_i |v_i|^k$ for $k > 2$.

Let $(i, v_i)$ be an L2 sample and $T = F_2 |v_i|^{k-2}$. Then,

\[
E[T] = F_2 \sum_i \left( \frac{v_i^2}{F_2} \right) |v_i|^{k-2} = F_k
\]

Repeat $O(\varepsilon^{-2} n^{1-2/k})$ times and return mean.

For p=1: Finding duplicates and estimating entropy.

For p=0: Corresponds to sampling from the support of v. Use for graph sketching...
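The estimator is easy to try out with an idealized L2 sampler. In the sketch below the sampler draws index i with probability exactly $v_i^2/F_2$ and uses the exact $F_2$, both of which are simplifications; the function name `fk_estimate` and the test vector are hypothetical.

```python
import numpy as np

def fk_estimate(v, k, samples, rng):
    """Mean of T = F2 * |v_i|^(k-2) over idealized L2 samples i ~ v_i^2 / F2."""
    v = np.asarray(v, dtype=float)
    F2 = np.sum(v ** 2)
    # Idealized sampler (assumption): exact probabilities, no (1 +/- eps) error.
    idx = rng.choice(v.size, size=samples, p=v ** 2 / F2)
    return np.mean(F2 * np.abs(v[idx]) ** (k - 2))

rng = np.random.default_rng(0)
v = np.array([4.0, -2.0, 1.0, 1.0, 3.0])
k = 3
exact = np.sum(np.abs(v) ** k)          # F_3 = 64 + 8 + 1 + 1 + 27 = 101
est = fk_estimate(v, k, samples=100_000, rng=rng)
```

The number of repetitions used here is far larger than needed for a five-dimensional vector; it is chosen only to make the agreement with $F_k$ visible.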


Algorithm Template

• Template: Combine classic sketch with some transformation:

\[
\begin{bmatrix} \text{“Classic” Sketch} \end{bmatrix}
\begin{bmatrix} \\ \text{“Transform Matrix”} \\ \\ \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_8 \end{bmatrix}
\]

• Examples:

a) Histograms: Count-Sketch + Transform to Haar Basis

b) Periodicity: AMS Sketch + Transform to Fourier Basis

c) Quantiles: Count-Min + Dyadic Interval Dictionary


I: Classic Sketches II: Graph Sketches III: Other Things


• Problem #1: Small-Space Dynamic Graph Connectivity

• Input: Observe stream of edge inserts/deletes on n nodes.

• Goal: Using Õ(n) space, maintain connected components.

• Note: Easy if there are no deletes; just use Union-Find.

Example stream of updates:

“Eduardo and Mark are now friends.”

“Lawyers are now friends with everyone.”

“Mark and Erica are now friends.”

“Mark and Erica are no longer friends.”

...
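For the insert-only case mentioned in the note, Union-Find keeps one parent pointer per node, which is O(n) words. A minimal version is sketched below; the class and method names are illustrative. It has no way to undo a `union`, which is why deletions call for a different approach.

```python
class UnionFind:
    """Connected components under edge insertions only."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.components = n

    def find(self, x):
        # Follow parent pointers to the root, halving the path along the way.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        # Process an inserted edge (a, b).
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb
            self.components -= 1

uf = UnionFind(5)
for a, b in [(0, 1), (3, 4), (1, 0)]:   # a small stream of edge inserts
    uf.union(a, b)
```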


• Problem #2: Communication Complexity

• Input: Each player knows neighborhood Γ(v) for a node v

• Goal: Simultaneously, each player sends O(polylog n) bits to a central player who then determines if graph is connected.

• Note: May assume players have access to public random bits.


It can’t be done!?

• Suppose there’s a bridge (u,v) in the graph, i.e., a special friendship that is essential to ensuring the graph is connected.

• Claim (?): At least one of the players needs to send Ω(n) bits.

a) Central player needs to know about the special friendship.

b) Participants don’t know which of their friendships are special.

c) Participants may have Ω(n) friends.


How to do it...

• Players send “appropriate” sketches of their address books.

• Main Idea: a) Sketch b) Run Algorithm in Sketch Space

• Catch: Sketch must be homomorphic for algorithm operations.

[Figure: “Sketch” maps the Original Graph to Sketch Space; the Algorithm run on either side leads to the same ANSWER.]


Ingredient 1: Basic Connectivity Algorithm

Basic Algorithm (Spanning Forest):

1. For each node: pick an incident edge.

2. For each connected component: pick an incident edge.

3. Repeat until there are no edges between connected components.

Lemma: Takes O(log n) steps and the selected edges include a spanning forest.
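The basic algorithm can be sketched directly on an explicit edge list, before any sketching is involved. The function below is an illustration with assumed names; it keeps only the picked edges that join two different components, so its output is exactly a spanning forest rather than a superset of one.

```python
def spanning_forest(n, edges):
    """Boruvka-style rounds: every current component picks one outgoing edge."""
    comp = list(range(n))              # comp[v] = label of v's current component
    forest, rounds = [], 0
    while True:
        pick = {}
        for u, v in edges:
            if comp[u] != comp[v]:
                # First outgoing edge seen for each endpoint's component.
                pick.setdefault(comp[u], (u, v))
                pick.setdefault(comp[v], (u, v))
        if not pick:                   # no edges between components remain
            return forest, rounds
        rounds += 1
        for u, v in pick.values():
            cu, cv = comp[u], comp[v]
            if cu != cv:               # skip picks that would close a cycle
                forest.append((u, v))
                comp = [cv if c == cu else c for c in comp]

edges = [(0, 1), (1, 2), (2, 3), (0, 2), (4, 5)]
forest, rounds = spanning_forest(7, edges)   # node 6 is isolated
```

Every component with an outgoing edge merges with at least one other component per round, so the number of such components at least halves each time.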


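Ingredient 2: Graph Representation

For node $i$, let $a_i$ be a vector indexed by node pairs. Non-zero entries: for each edge $\{i,j\}$, $a_i[i,j] = 1$ if $j > i$ and $a_i[i,j] = -1$ if $j < i$.

Example (graph on nodes 1, ..., 5; coordinates ordered {1,2} {1,3} {1,4} {1,5} {2,3} {2,4} {2,5} {3,4} {3,5} {4,5}):

\[
a_1 = \begin{pmatrix} 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}
\]

\[
a_2 = \begin{pmatrix} -1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}
\]

Lemma: For any subset of nodes $S \subset V$,

\[
\operatorname{support}\left( \sum_{i \in S} a_i \right) = E(S, V \setminus S)
\]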
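The lemma can be verified on a small graph: an edge with both endpoints in S contributes +1 and -1 to the same coordinate and cancels, while an edge leaving S contributes once. The edge list below is hypothetical; it was chosen to agree with the vectors $a_1$ and $a_2$ in the example, but the remaining edges of the pictured graph are not recoverable from the transcript.

```python
from itertools import combinations

def node_vector(i, n, edges):
    """a_i indexed by pairs (x, y) with x < y; non-zero only on edges at node i."""
    a = {pair: 0 for pair in combinations(range(1, n + 1), 2)}
    for x, y in edges:
        x, y = min(x, y), max(x, y)
        if i == x:
            a[(x, y)] = 1        # the other endpoint y is larger than i
        elif i == y:
            a[(x, y)] = -1       # the other endpoint x is smaller than i
    return a

def cut_support(S, n, edges):
    """Support of the sum of a_i over i in S."""
    total = {pair: 0 for pair in combinations(range(1, n + 1), 2)}
    for i in S:
        for pair, val in node_vector(i, n, edges).items():
            total[pair] += val
    return {pair for pair, val in total.items() if val != 0}

# Hypothetical edge set, consistent with a_1 and a_2 above.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
a1 = list(node_vector(1, 5, edges).values())
a2 = list(node_vector(2, 5, edges).values())
cut = cut_support({1, 2}, 5, edges)      # edges leaving S = {1, 2}
```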


Ingredient 3: L0-Sampling

Recall: There exists a random $M: \mathbb{R}^N \to \mathbb{R}^k$ with $k = O(\log^2 N)$ such that for any $a \in \mathbb{R}^N$,

\[ Ma \;\longrightarrow\; e \in \operatorname{support}(a) \]

with probability 9/10.


Recipe: Sketch & Compute on Sketches

Sketch: Apply L0-sketch matrix $M$ to each $a_j$.

Run Algorithm in Sketch Space:

Use $Ma_j$ to get incident edge on each node $j$.

For $i = 2$ to $t$:

To get incident edge on component $S \subset V$ use:

\[
\sum_{j \in S} M a_j = M \left( \sum_{j \in S} a_j \right) \;\longrightarrow\; e \in \operatorname{support}\left( \sum_{j \in S} a_j \right) = E(S, V \setminus S)
\]


Connectivity Results

• Thm: Can determine connectivity:

a) Of dynamic graph stream using O(n polylog n) memory.

b) Using simultaneous messages of length O(polylog n).

[Figure: “Sketch” maps the Original Graph to Sketch Space; the Algorithm run on either side leads to the same ANSWER.]


k-Connectivity

• A graph is k-connected if every cut has size ≥ k.

• Thm: Can determine k-connectivity:

a) Of dynamic graph stream using O(n k polylog n) space.

b) Using simultaneous messages of length O(k polylog n).

• Extension: A weighted subgraph is a cut-sparsifier if it preserves all cuts up to a factor $(1+\varepsilon)$. Can construct with $O(n \varepsilon^{-2}\, \mathrm{polylog}\, n)$ space or $O(\varepsilon^{-2}\, \mathrm{polylog}\, n)$-length messages.


Ingredient 1: Basic Algorithm

Algorithm (k-Connectivity):

1. Let $F_1$ be spanning forest of $G(V, E)$

2. For $i = 2$ to $k$:

2.1. Let $F_i$ be spanning forest of $G(V, E - F_1 - \dots - F_{i-1})$

Lemma: $G(V, F_1 + \dots + F_k)$ is k-connected iff $G(V, E)$ is.
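The peeling algorithm and its lemma can be tried on an explicit graph. The code below is a sketch with assumed function names; it takes a simple graph given as an edge list, and `min_cut` is a brute-force check that is only practical for very small n.

```python
from itertools import combinations

def spanning_forest(n, edges):
    """Greedy spanning forest using union-find."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    forest = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((u, v))
    return forest

def k_forests(n, edges, k):
    """F_1, ..., F_k where F_i is a spanning forest of E - F_1 - ... - F_{i-1}."""
    remaining, forests = list(edges), []
    for _ in range(k):
        f = spanning_forest(n, remaining)
        forests.append(f)
        chosen = set(f)
        remaining = [e for e in remaining if e not in chosen]
    return forests

def min_cut(n, edges):
    """Brute-force minimum cut size over all non-trivial node subsets."""
    best = len(edges)
    for r in range(1, n):
        for S in combinations(range(n), r):
            S = set(S)
            best = min(best, sum((u in S) != (v in S) for u, v in edges))
    return best

n, k = 4, 2
G = list(combinations(range(n), 2))            # complete graph K4
F = k_forests(n, G, k)
H = [e for f in F for e in f]                  # union F_1 + F_2
same_answer = (min_cut(n, G) >= k) == (min_cut(n, H) >= k)
```

On K4 the union keeps 5 of the 6 edges, and both graphs have every cut of size at least 2.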


Ingredient 2: Connectivity Sketches

Sketch: Simultaneously construct k independent sketches $\{M_1 G, M_2 G, \dots, M_k G\}$ for connectivity.

Run Algorithm in Sketch Space:

Use $M_1 G$ to find a spanning forest $F_1$ of $G$.

Use $M_2 G - M_2 F_1 = M_2 (G - F_1)$ to find $F_2$.

Use $M_3 G - M_3 F_1 - M_3 F_2 = M_3 (G - F_1 - F_2)$ to find $F_3$, etc.


k-Connectivity

[Figure: Original Graph and Sparsifier Graph]

• A graph is k-connected if every cut has size ≥ k.

• Thm: Can determine k-connectivity:

a) Of dynamic graph stream using O(n k polylog n) space.

b) Using simultaneous messages of length O(k polylog n).

• Extension: A cut-sparsifier is a weighted subgraph that preserves all cuts up to a $(1+\varepsilon)$ factor. Can construct in $O(n \varepsilon^{-2}\, \mathrm{polylog}\, n)$ space or $O(\varepsilon^{-2}\, \mathrm{polylog}\, n)$-length messages.


I: Classic Sketches II: Graph Sketches III: Other Things


Geometric Sketches

• Input: Set of points $p_1, p_2, \dots, p_n \in \{1, \dots, \Delta\}^d$

• Goal: Estimate geometric properties, e.g., diameter, width, clustering cost, Steiner tree weight, min-cost matchings, ...

• Basic Idea: Consider points at different resolutions and relate numerical properties of quantizations to geometric problem.


Other Stream Algorithms

• Order Dependent Functions: Longest-increasing subsequence, time-series data, well-balanced parentheses...

• Multi-Pass Streams: Space complexity vs. p-passes

a) Length k increasing subsequence: $\text{space} = \Theta\left(k^{1 + \frac{1}{2^p - 1}}\right)$

b) Median of length m stream: $\text{space} = \Theta\left(m^{1/p}\right)$

• Stochastic Streams: Space complexity vs. Sample complexity

a) Stream is sequence of iid samples from some unknown distribution. Goal: Infer parameters of distribution.

b) Stream is formed by randomly subsampling an original stream. Goal: Infer properties of original stream.


Summary

• Sketches: Linear projections that (approximately) preserve relevant properties of vectors, point-sets, graphs etc. Embarrassingly parallelizable and applicable to data streams.

• Further References:

• Courses: http://www.cs.dartmouth.edu/~ac/Teach/CS49-Fall11

• http://people.cs.umass.edu/~mcgregor/courses/CS711S12/

• http://stellar.mit.edu/S/course/6/fa07/6.895/

• Book: Data Stream Algorithms. McGregor and Muthukrishnan (forthcoming)

• Blog: http://polylogblog.wordpress.com/
