linear sketches - umass amherstmcgregor/stocworkshop/... · 2012-05-25 · linear sketches! answer...

Post on 19-Jun-2020

2 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Linear Sketcheswith Applications to Data Streams

Andrew McGregorUniversity of Massachusetts

• Random linear projection: M: ℝn→ℝk (where k≪n) that preserves properties of any v∈ℝn with high probability.

• Many Results: Estimating norms, entropy, support size, quantiles, heavy hitters, fitting histograms and polynomials, ...

• Rich Theory: Related to compressed sensing and sparse recovery, dimensionality reduction and metric embeddings, ...

Linear Sketches

�! answer

2

666666664

v

3

777777775

2

4 M

3

5 =

2

4Mv

3

5

v1 v2 v3 v4

Input: v

Output: Mv=Mv1+Mv2+Mv3+Mv4

Mv1 Mv2 Mv3 Mv4

Why? Distributed Processing

Why? Data Streams

• Stream: m elements from some universe of size n

e.g., 3,5,3,7,5,4,8,5,3,7,5,4,8,6,3,2,6,4,7, ...

• Goal: Estimate properties of the stream, e.g., median, number of distinct elements, longest increasing sequence.

• The Catch:

• i) Limited working memory, e.g., polylog(n,m)

• ii) Access data sequentially & process elements quickly

• Rich theory with links to communication complexity, pseudo-randomness... Very applicable to network monitoring, sensor network fusion, I/O efficiency in external memory...

• Sketches: Can maintain Mf where fi is freq of i. On seeing j:

Mf Mf +Mej

I: Classic Sketches II: Graph Sketches III: Other Things

• Count-Min: In each column, place a “1” in a random row.

• Point Queries: For example, use v2+v6+v8 as estimate for v2.

• Analysis: Over-estimate by ≤2F1/k with probability 1/2. Setting k=O(ε-1) and repeating O(log n) times yields estimate ṽ of v such that with high probability:

Classics: Count-Min

2

66666666664

v1v2v3v4v5v6v7v8

3

77777777775

2

40 0 1 1 0 0 0 01 0 0 0 1 0 1 00 1 0 0 0 1 0 1

3

5 =

2

4v3 + v4

v1 + v5 + v7v2 + v6 + v8

3

5

8i 2 [n] ; vi vi vi + ✏F1 where F1 =X

i

|vi |

• Count-Sketch: Like Count-Min but non-zero entries ∈R {-1,1}.

• Point Queries: For example, use v2-v6+v8 as estimate for v2.

• Analysis: Correct in expectation with variance F2/k. Setting k=O(ε-2) and repeating O(log n) times yields estimate ṽ of v such that with high probability:

Classics: Count-Sketch

2

40 0 1 �1 0 0 0 0�1 0 0 0 1 0 �1 00 1 0 0 0 �1 0 1

3

5 =

2

4v3 � v4

�v1 + v5 � v7v2 � v6 + v8

3

5

2

66666666664

v1v2v3v4v5v6v7v8

3

77777777775

8i 2 [n] ; vi = vi ± ✏p

F2 where F2 =X

i

|vi |2

• Goal: Return (i,vi) where i chosen proportional to (1±ε) |vi|p

• Sketch: Count-Sketch on u where ui=λivi and λi-2∈R [0,1]:

• Post-Processing: Estimate ũi and return (i,vi) if ũi2 > F2/ε

• Repeat O(1/ε) times to find a sample.

• Analysis: O(ε-1 log2 n)-dimensional Count-Sketch suffices.

New Classic: Lp-Sampling

2

66666666664

�1 0 0 0 0 0 0 00 �2 0 0 0 0 0 00 0 �3 0 0 0 0 00 0 0 �4 0 0 0 00 0 0 0 �5 0 0 00 0 0 0 0 �6 0 00 0 0 0 0 0 �7 00 0 0 0 0 0 0 �8

3

77777777775

2

66666666664

v1v2v3v4v5v6v7v8

3

77777777775

2

4Count-Sketch

3

5

Pr[u2i > F2/✏] = Pr⇥��2i < ✏v2

i /F2

⇤= ✏v2

i /F2

Recall: O(w log n)-dimension Count-Sketch ensures

Exercise: F2(u) = O(F2(v) log n) with probability .99Set w = O(ε−1 log n). If ũi2 > F2(v)/ε then,

Hence get (1±ε) approx if value ≥ threshold.

L2 Sampling: Proof Sketch

ui = ui ±p

F2(u)/w

pF2(u)/w

p✏F2(v) ✏ui

For p=2: Estimating Fk = ∑|vi|k for k>2.Let (i,vi) be an L2 sample and T=F2 |vi|k-2. Then,

Repeat O(ε−2 n1-2/k) times and return mean.For p=1: Finding duplicates and estimating entropy.For p=0: Corresponds to sampling from the support of v. Use for graph sketching...

Lp Sampling: Applications

E[T ] = F2

X

i

✓v2i

F2

◆|vi |k�2 = Fk

• Template: Combine classic sketch with some transformation:

• Examples:

a) Histograms: Count-Sketch + Transform to Haar Basis

b) Periodicity: AMS Sketch + Transform to Fourier Basis

c) Quantiles: Count-Min + Dyadic Interval Dictionary

Algorithm Template

2

66666666664

v1v2v3v4v5v6v7v8

3

77777777775

2

4 “Classic” Sketch

3

5

2

66666666664

“Transform Matrix”

3

77777777775

I: Classic Sketches II: Graph Sketches III: Other Things

• Problem #1: Small-Space Dynamic Graph Connectivity

• Input: Observe stream of edge inserts/deletes on n nodes.

• Goal: Using Õ(n) space, maintain connected components.

• Note: Easy if there are no deletes; just use Union-Find.

Eduardo and Mark are now friends.

Like · Add Friend

Lawyers are now friends with everyone.

Like · Add Friend

Mark and Erica are now friends.

Like · Add Friend

Mark and Erica are no longer friends.

Like · Add Friend

... ...

...

• Problem #2: Communication Complexity

• Input: Each player knows neighborhood Γ(v) for a node v

• Goal: Simultaneously, each player sends O(polylog n) bits to a central player who then determines if graph is connected.

• Note: May assume players have access to public random bits.

• Suppose there’s a bridge (u,v) in the graph, i.e., a special friendship that is essential to ensuring the graph is connected.

? Claim: At least one of the players needs to send Ω(n) bits.a) Central player needs to know about the special friendship.b) Participant don’t know which of their friendships are special.c) Participants may have Ω(n) friends.

It can’t be done!?

How to do it...• Players send “appropriate” sketches of their address books.

• Main Idea: a) Sketch b) Run Algorithm in Sketch Space

• Catch: Sketch must be homomorphic for algorithm operations.

Original Graph Sketch Space

AlgorithmAlgorithm ANSWER

Sketch

Basic Algorithm (Spanning Forest): 1. For each node: pick incident edge2.For each connected comp: pick incident edge3.Repeat until no edges between connected comp.

Lemma: Takes O(log n) steps and selected edges include spanning forest.

Ingredient 1: Basic Connectivity Algorithm

For node i, let ai be vector indexed by node pairs. Non-zero entries: ai[i,j]=1 if j>i and ai[i,j]=-1 if j<i.Example:

Lemma: For any subset of nodes S⊂V,

Ingredient 2: Graph Representation

1

2

3

5

4

{1,2} {1,3} {1,4} {1,5} {2,3} {2,4} {2,5} {3,4} {3,5} {4,5}

a1 =�1 1 0 0 0 0 0 0 0 0

a2 =��1 0 0 0 1 0 0 0 0 0

support

X

i2S

ai

!= E (S ,V \ S)

Ingredient 3: L0-Sampling

Recall: Exists random M: ℝN→ℝk with k=O(log2 N) such that for any a ∈ ℝN

with probability 9/10.Ma �! e 2 support(a)

Sketch: Apply L0-sketch matrix M to each aj

Run Algorithm in Sketch Space:Use Maj to get incident edge on each node jFor i=2 to t:

To get incident edge on component S⊂V use:

Recipe: Sketch & Compute on Sketches

�! e 2 support(

X

j2S

aj) = E (S ,V \ S)X

j2S

Maj = M

0

@X

j2S

aj

1

A

• Thm: Can determine connectivity:

a) Of dynamic graph stream using O(n polylog n) memory.

b) Using simultaneous messages of length O(polylog n).

Connectivity Results

Original Graph Sketch Space

AlgorithmAlgorithm ANSWER

Sketch

k-Connectivity• A graph is k-connected if every cut has size ≥ k.

• Thm: Can determine k-connectivity:

a) Of dynamic graph stream using O(n k polylog n) space.

b) Using simultaneous messages of length O(k polylog n).

• Extension: A weighted subgraph is a cut-sparsifier if it preserves all cuts up to a factor (1+ε). Can construct with O(n ε-2 polylog n) space or O(ε-2 polylog n)-length messages.

Algorithm (k-Connectivity): 1. Let F1 be spanning forest of G(V,E)2.For i=2 to k:

2.1. Let Fi be spanning forest of G(V,E-F1-...-Fi-1)Lemma: G(V,F1+...+Fk) is k-connected iff G(V,E) is.

Ingredient 1: Basic Algorithm

Ingredient 2: Connectivity Sketches

Sketch: Simultaneously construct k independent sketches {M1G, M2G, ... MkG} for connectivity.Run Algorithm in Sketch Space:

Use M1G to find a spanning forest F1 of G Use M2G-M2F1=M2(G-F1) to find F2

Use M3G-M3F1-M3F2=M3(G-F1-F2) to find F3 etc.

Original Graph Sparsifier Graph

k-Connectivity• A graph is k-connected if every cut has size ≥ k.

• Thm: Can determine k-connectivity:

a) Of dynamic graph stream using O(n k polylog n) space.

b) Using simultaneous messages of length O(k polylog n).

• Extension: A cut-sparsifier is a weighted subgraph that preserves all cuts up to a (1+ε) factor. Can construct in O(n ε-2 polylog n) space or O(ε-2 polylog n)-length messages.

I: Classic Sketches II: Graph Sketches III: Other Things

Geometric Sketches

• Input: Set of points p1, p2, ... , pn ∈ {1, ... , Δ}d

• Goal: Estimate geometric properties, e.g., diameter, width clustering cost, Steiner tree weight, min-cost matchings, ...

• Basic Idea: Consider points at different resolutions and relate numerical properties of quantizations to geometric problem.

x2

x1 x3

x6

Other Stream Algorithms• Order Dependent Functions: Longest-increasing subsequence,

time-series data, well-balanced parenthesis...

• Multi-Pass Streams: Space complexity vs. p-passes

a) Length k increasing subsequence:

b) Median of length m stream:

• Stochastic Streams: Space complexity vs. Sample complexity

a) Stream is sequence of iid samples from some unknown distribution. Goal: Infer parameters of distribution.

b) Stream is formed by randomly subsampling an original stream. Goal: Infer properties of original stream.

space = ⇥(k1+ 12p�1 )

space = ⇥(m1/p)

Summary• Sketches: Linear projections that (approximately) preserve

relevant properties of vectors, point-sets, graphs etc. Embarrassingly parallelizable and applicable to data streams.

• Further References:

• Courses: ! http://www.cs.dartmouth.edu/~ac/Teach/CS49-Fall11

• ! http://people.cs.umass.edu/~mcgregor/courses/CS711S12/

• ! http://stellar.mit.edu/S/course/6/fa07/6.895/

• Book: Data Stream Algorithms. McGregor and Muthukrishnan (forthcoming)

• Blog: http://polylogblog.wordpress.com/

top related