Page 1: The Data Stream Space Complexity of Cascaded Norms T.S. Jayram David Woodruff IBM Almaden

The Data Stream Space Complexity of Cascaded Norms

T.S. Jayram, David Woodruff

IBM Almaden

Page 2:

Data streams

Algorithms access data in a sequential fashion

One pass / small space

Need to be randomized and approximate [FM, MP, AMS]

[Figure: an algorithm with a small main memory reading the stream 2 3 4 16 0 100 5 4 501 200 401 2 3 6 0]

Page 3:

Frequency Moments and Norms

Stream defines updates to a set of items 1, 2, …, d

$f_i$ = weight of item i

Positive-only vs. turnstile model

k-th Frequency Moment: $F_k = \sum_i |f_i|^k$

p-th Norm: $L_p = \|f\|_p = (\sum_i |f_i|^p)^{1/p}$

Maximum frequency: $p = \infty$

Distinct Elements: $p = 0$

Heavy hitters

Assume length of stream and magnitude of updates is $\le \mathrm{poly}(d)$
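The definitions above can be checked on a tiny stream. The following is a minimal sketch, not a streaming algorithm: it stores every frequency exactly (linear space), which is precisely what the small-space algorithms avoid. The function names and the sample stream are illustrative assumptions.

```python
from collections import defaultdict

def frequencies(stream):
    """Accumulate item weights from (item, delta) updates (turnstile model)."""
    f = defaultdict(int)
    for item, delta in stream:
        f[item] += delta
    return f

def frequency_moment(f, k):
    """F_k = sum_i |f_i|^k; for k = 0 this counts the non-zero items."""
    if k == 0:
        return sum(1 for v in f.values() if v != 0)
    return sum(abs(v) ** k for v in f.values())

def lp_norm(f, p):
    """L_p = (sum_i |f_i|^p)^(1/p) for p > 0."""
    return frequency_moment(f, p) ** (1.0 / p)

def max_frequency(f):
    """The p = infinity case: the largest |f_i|."""
    return max(abs(v) for v in f.values())

# Hypothetical turnstile stream: item 2 is inserted and then fully deleted.
stream = [(1, 3), (2, 5), (1, -1), (3, 4), (2, -5)]
f = frequencies(stream)  # f_1 = 2, f_2 = 0, f_3 = 4
```

In the positive-only model every delta would be non-negative; the same code applies unchanged.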

Page 4:

Classical Results

Approximating $L_p$ and $F_p$ is the same problem

For $0 \le p \le 2$, $F_p$ is approximable in $\tilde{O}(1)$ space (AMS, FM, Indyk, …)

For $p > 2$, $F_p$ is approximable in $\tilde{O}(d^{1-2/p})$ space (IW)

This is best possible (BJKS, CKS)

Page 5:

Cascaded Aggregates

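Stream defines updates to pairs of items in {1, 2, …, n} x {1, 2, …, d}

$f_{ij}$ = weight of item (i, j)

Two aggregates P and Q

$$
\begin{pmatrix}
f_{11} & f_{12} & \cdots & f_{1d} \\
f_{21} & f_{22} & \cdots & f_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
f_{n1} & f_{n2} & \cdots & f_{nd}
\end{pmatrix}
\xrightarrow{\;Q\;}
\begin{pmatrix}
Q(\text{Row } 1) \\
Q(\text{Row } 2) \\
\vdots \\
Q(\text{Row } n)
\end{pmatrix}
\xrightarrow{\;P\;}
P \circ Q
$$

$P \circ Q$ = cascaded aggregate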
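As an illustration of the definition, a cascaded aggregate can be evaluated offline by applying Q to each row and then P to the resulting vector. This is only a sketch on a fully stored matrix; the helper names `cascade`, `f0`, and `fk`, and the example matrix, are assumptions made for this example.

```python
def f0(row):
    """F_0 of a row: the number of non-zero entries."""
    return sum(1 for v in row if v != 0)

def fk(k):
    """Return a function computing F_k = sum |v|^k of a vector, for k > 0."""
    def moment(vec):
        return sum(abs(v) ** k for v in vec)
    return moment

def cascade(P, Q, matrix):
    """Cascaded aggregate: apply Q to every row, then P to the vector of results."""
    return P([Q(row) for row in matrix])

# Hypothetical 3 x 3 matrix of weights f_ij.
M = [[1, 0, 2],
     [0, 0, 0],
     [3, 3, 3]]

f2_of_f0 = cascade(fk(2), f0, M)      # rows give (2, 0, 3), so 4 + 0 + 9
max_of_f0 = cascade(max, f0, M)       # largest number of non-zeros in a row
f2_of_f1 = cascade(fk(2), fk(1), M)   # rows give (3, 0, 9), so 9 + 0 + 81
```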

Page 6:

Motivation

Multigraph streams for analyzing IP traffic [Cormode-Muthukrishnan]

Corresponds to $P \circ F_0$ for different P's

$F_0$ returns #destinations accessed by each source

Also introduced the more general problem of estimating $P \circ Q$

Computing complex join estimates

Product metrics [Andoni-Indyk-Krauthgamer]

Stock volatility, computational geometry, operator norms

Page 7:

The Picture

Estimating $L_k \circ L_p$

[Figure: space complexity of estimating $L_k \circ L_p$, plotted over axes k and p with ticks 0, 1, 2, $\infty$ and the line k = p. Labels recoverable from the slide include $n^{1-2/k} d^{1-2/p}$, $n^{1-2/k}$, $d^{1-2/p}$, $d$, $n$, $n^{1-1/k}$, $\Theta(1)$, and "?".]

We give a 1-pass $\tilde{O}(n^{1-2/k} d^{1-2/p})$ space algorithm when $k \ge p$

We also provide a matching lower bound based on multiparty disjointness

We give the $\Omega(n^{1-1/k})$ bound for $L_k \circ L_0$ and $L_k \circ L_1$

$\tilde{O}(n^{1/2})$ for $L_2 \circ L_0$ without deletions [CM]

$\tilde{O}(n^{1-1/k})$ for $L_k \circ L_p$ for any p in {0, 1} in turnstile [MW]

Figure callouts: [Ganguly] (without deletions); Follows from techniques of [ADIW]; Our upper bound

Page 8:

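Our Problem: $F_k \circ F_p$

$$
F_k \circ F_p(M) = \sum_i \Big( \sum_j |f_{ij}|^p \Big)^k = \sum_i F_p(\text{Row } i)^k
$$

$$
M = \begin{pmatrix}
f_{11} & f_{12} & \cdots & f_{1d} \\
f_{21} & f_{22} & \cdots & f_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
f_{n1} & f_{n2} & \cdots & f_{nd}
\end{pmatrix}
$$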

Page 9:

High Level Ideas: $F_k \circ F_p$

1. We want the $F_k$-value of the vector ($F_p$(Row 1), …, $F_p$(Row n))

2. We try to sample a row i with probability $\propto F_p(\text{Row } i)$

3. Spend an extra pass to compute $F_p(\text{Row } i)$

4. Could then output $F_p(M) \cdot F_p(\text{Row } i)^{k-1}$

(can be seen as a generalization of [AMS])

How do we do the sampling efficiently?
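Steps 2 to 4 describe an unbiased estimator: if row i is drawn with probability $F_p(\text{Row } i)/F_p(M)$, the expected value of $F_p(M) \cdot F_p(\text{Row } i)^{k-1}$ is $\sum_i F_p(\text{Row } i)^k$. The sketch below checks this offline on a stored matrix. It assumes integer entries and integer k, p so that exact rational arithmetic applies, and it reads the exact row values directly, which a small-space streaming algorithm cannot do. All names are illustrative.

```python
import random
from fractions import Fraction

def fp(row, p):
    """F_p of a row for p > 0."""
    return sum(abs(v) ** p for v in row)

def cascaded_moment(M, k, p):
    """Exact F_k o F_p (M) = sum_i F_p(Row i)^k."""
    return sum(fp(row, p) ** k for row in M)

def expected_estimate(M, k, p):
    """Exact expectation of F_p(M) * F_p(Row i)^(k-1) when Pr[i] = F_p(Row i) / F_p(M)."""
    row_vals = [fp(row, p) for row in M]
    total = sum(row_vals)  # F_p(M), assumed non-zero
    return sum(Fraction(v, total) * total * v ** (k - 1) for v in row_vals)

def one_sample_estimate(M, k, p, rng):
    """Draw one row with probability proportional to F_p(Row i) and output the estimate."""
    row_vals = [fp(row, p) for row in M]
    total = sum(row_vals)
    i = rng.choices(range(len(M)), weights=row_vals, k=1)[0]
    return total * row_vals[i] ** (k - 1)

M = [[1, 0, 2],
     [0, 0, 0],
     [3, 3, 3]]
exact = cascaded_moment(M, 3, 2)
mean = expected_estimate(M, 3, 2)
sample = one_sample_estimate(M, 3, 2, random.Random(0))
```

A single sample can be far from the true value, so in practice many independent samples would be averaged; the number required is what the talk goes on to bound.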

Page 10:

Review – Estimating $F_p$ [IW]

Level sets: $S_t = \{ i \mid (1+\varepsilon)^t \le |f_i| \le (1+\varepsilon)^{t+1} \}$

Level t is good if $|S_t| (1+\varepsilon)^{2t} \ge F_2 / B$

Items from such level sets are also good
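To make the level-set definitions concrete, here is a small offline sketch that buckets items by magnitude and tests the "good level" condition. It assumes exact frequencies are available and assigns an item on a boundary to the higher level; the streaming construction in [IW] only approximates these sets. Function names and the sample frequencies are hypothetical.

```python
import math
from collections import defaultdict

def level_sets(f, eps):
    """Group non-zero items by t = floor(log_{1+eps} |f_i|)."""
    levels = defaultdict(list)
    for i, v in f.items():
        if v != 0:
            t = math.floor(math.log(abs(v)) / math.log(1 + eps))
            levels[t].append(i)
    return levels

def good_levels(f, eps, B):
    """Levels t with |S_t| * (1+eps)^(2t) >= F_2 / B."""
    f2 = sum(v * v for v in f.values())
    levels = level_sets(f, eps)
    return {t for t, items in levels.items()
            if len(items) * (1 + eps) ** (2 * t) >= f2 / B}

# Hypothetical frequencies; eps = 1 gives power-of-two buckets.
f = {"a": 3, "b": 5, "c": 6, "d": 100}
levels = level_sets(f, 1.0)
few_good = good_levels(f, 1.0, 4)        # only the heavy item's level qualifies
all_good = good_levels(f, 1.0, 10070)    # B = F_2 makes every non-empty level good
```

Larger B admits more levels as good, which matches the $\tilde{O}(B)$ space cost quoted for the histogram.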

Page 11:

ε-Histogram [IW]

Finds approximate sizes $s'_t$ of level sets

For all $S_t$, $s'_t \le (1+\varepsilon)|S_t|$

For good $S_t$, $s'_t \ge (1-\varepsilon)|S_t|$

Also provides $\tilde{O}(1)$ random samples from each good $S_t$

Space: $\tilde{O}(B)$

Page 12:

Sampling Rows According to $F_p$ value

Treat n x d matrix M as a vector:

Run ε-Histogram on M for certain B

Obtain (1±ε)-approximation $s'_t$ to $|S_t|$ for good t

$F_k \circ F_p(M') \ge (1-\varepsilon) F_k \circ F_p(M)$, where M' is M restricted to good items (Hölder's inequality)

To sample:

Choose a good t with probability $s'_t (1+\varepsilon)^{pt} / F'_p(M)$, where $F'_p(M) = \sum_{\text{good } t} s'_t (1+\varepsilon)^{pt}$

Choose random sample (i, j) from $S_t$

Let row i be the current sample

$$
\Pr[\text{row } i] = \sum_t \frac{s'_t (1+\varepsilon)^{pt}}{F'_p(M)} \cdot \frac{|S_t \cap \text{row } i|}{|S_t|} \approx \frac{F_p(\text{row } i)}{F_p(M)}
$$

Problems

1. High level algorithm requires many samples (up to $n^{1-1/k}$) from the $S_t$, but [IW] just gives $\tilde{O}(1)$. Can't afford to repeat in low space

2. Algorithm may misclassify a pair (i, j) into $S_t$ when it is in $S_{t-1}$
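The approximation $\Pr[\text{row } i] \approx F_p(\text{row } i)/F_p(M)$ can be checked numerically in an idealized setting. The sketch below assumes exact level-set sizes ($s'_t = |S_t|$), treats every non-empty level as good, and uses half-open levels; under those assumptions each entry is rounded down by less than a factor $(1+\varepsilon)^p$, so the sampling probability stays within that factor of the target. It sidesteps both problems listed above and is not the algorithm from the talk. Names are illustrative.

```python
import math
from collections import defaultdict

def fp(row, p):
    """F_p of a row for p > 0."""
    return sum(abs(v) ** p for v in row)

def level_of(v, eps):
    """Level t with (1+eps)^t <= |v| < (1+eps)^(t+1)."""
    return math.floor(math.log(abs(v)) / math.log(1 + eps))

def row_sampling_probs(M, p, eps):
    """Pr[row i] under level-set sampling with exact sizes s'_t = |S_t|."""
    size = defaultdict(int)                     # |S_t|
    in_row = [defaultdict(int) for _ in M]      # |S_t intersect row i|
    for i, row in enumerate(M):
        for v in row:
            if v != 0:
                t = level_of(v, eps)
                size[t] += 1
                in_row[i][t] += 1
    fp_prime = sum(s * (1 + eps) ** (p * t) for t, s in size.items())
    return [sum((size[t] * (1 + eps) ** (p * t) / fp_prime) * (c / size[t])
                for t, c in in_row[i].items())
            for i in range(len(M))]

M = [[1, 0, 2], [0, 0, 0], [3, 3, 3], [7, 1, 0]]
p, eps = 2, 0.5
probs = row_sampling_probs(M, p, eps)
total = sum(fp(row, p) for row in M)
targets = [fp(row, p) / total for row in M]   # F_p(row i) / F_p(M)
```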

Page 13:

High Level Ideas: $F_k \circ F_p$

1. We want the $F_k$-value of the vector ($F_p$(Row 1), …, $F_p$(Row n))

2. We try to sample a row i with probability $\propto F_p(\text{Row } i)$

3. Spend an extra pass to compute $F_p(\text{Row } i)$

4. Could then output $F_p(M) \cdot F_p(\text{Row } i)^{k-1}$

(can be seen as a generalization of [AMS])

How do we avoid an extra pass?

Page 14:

Avoiding an Extra Pass

Now we can sample a Row i with probability $\propto F_p(\text{Row } i)$

We design a new $F_k$-algorithm to run on ($F_p$(Row 1), …, $F_p$(Row n)) which only receives IDs i with probability $\propto F_p(\text{Row } i)$

For each $j \in [\log n]$, algorithm does:

1. Choose a random subset of $n/2^j$ rows

2. Sample a row i from this set with $\Pr[\text{Row } i] \propto F_p(\text{Row } i)$

We show that $\tilde{O}(n^{1-1/k})$ oracle samples are enough to estimate $F_k$ up to 1±ε
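The two sampling steps can be sketched as follows. This only illustrates the access pattern (one weighted sample per subsampling level j); it does not reproduce the $F_k$ estimator built on top of the samples, and it computes the weights exactly from a stored matrix, which the streaming setting does not allow. The function name `oracle_samples` and the rounding of $n/2^j$ are assumptions.

```python
import math
import random

def fp(row, p):
    """F_p of a row for p > 0."""
    return sum(abs(v) ** p for v in row)

def oracle_samples(M, p, rng):
    """For each j in 1..log2(n): pick a random subset of about n / 2^j rows,
    then draw one row from it with probability proportional to F_p(Row i)."""
    n = len(M)
    weights = [fp(row, p) for row in M]
    samples = {}
    for j in range(1, max(1, int(math.log2(n))) + 1):
        subset = rng.sample(range(n), max(1, n // 2 ** j))
        w = [weights[i] for i in subset]
        if sum(w) > 0:  # skip subsets that carry no mass
            samples[j] = rng.choices(subset, weights=w, k=1)[0]
    return samples

# Hypothetical 8 x 2 matrix with no all-zero rows.
M = [[i + 1, 2 * i + 1] for i in range(8)]
samples = oracle_samples(M, 2, random.Random(1))
```

Smaller subsets (larger j) give light rows a chance to be sampled, since heavy rows are likely to be absent from them.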

Page 15:

New Lower Bounds

Alice holds an n x d matrix A; Bob holds an n x d matrix B

NO instance: for all rows i, $\Delta(A_i, B_i) \le 1$

YES instance: there is a unique row j for which $\Delta(A_j, B_j) = d$, and for all $i \ne j$, $\Delta(A_i, B_i) \le 1$

We show distinguishing these cases requires $\Omega(n/d)$ randomized communication complexity (CC)

Implies estimating $L_k(L_0)$ or $L_k(L_1)$ needs $\Omega(n^{1-1/k})$ space
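One way to see why such instances are hard for cascaded norms is to look at the row-wise difference A - B. The sketch below rests on assumptions not stated on the slide: that $\Delta$ is Hamming distance, that the matrices are binary, that the stream encodes A - B, and that d is chosen on the order of $n^{1/k}$ or larger. Under these assumptions a NO instance has $F_k \circ F_0$ at most n, while a YES instance has it at least $d^k$. The instance generator is purely illustrative.

```python
import random

def hamming(x, y):
    """Number of coordinates where x and y differ (assumed meaning of Delta)."""
    return sum(a != b for a, b in zip(x, y))

def make_instance(n, d, yes, rng):
    """Binary n x d matrices A, B with Hamming distance <= 1 in every row;
    a YES instance additionally has exactly one row at distance d."""
    A = [[rng.randint(0, 1) for _ in range(d)] for _ in range(n)]
    B = [row[:] for row in A]
    for i in range(n):
        if rng.random() < 0.5:          # flip one coordinate: distance 1
            B[i][rng.randrange(d)] ^= 1
    if yes:
        j = rng.randrange(n)
        B[j] = [1 - a for a in A[j]]    # complement of A_j: distance d
    return A, B

def fk_f0_of_difference(A, B, k):
    """F_k(F_0) of A - B: sum over rows of (number of differing entries)^k."""
    return sum(hamming(a, b) ** k for a, b in zip(A, B))

rng = random.Random(0)
n, d, k = 16, 8, 2
no_value = fk_f0_of_difference(*make_instance(n, d, False, rng), k)
yes_value = fk_f0_of_difference(*make_instance(n, d, True, rng), k)
```

With d around $n^{1/k}$ the two values differ by a constant factor, and $n/d$ becomes $n^{1-1/k}$, which is consistent with the space bound quoted above.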

Page 16:

Information Complexity

Paradigm [CSWY, BJKS]: the information cost IC is the amount of information the transcript reveals about the inputs

For any function f, $CC(f) \ge IC(f)$

Using their direct sum theorem, it suffices to show an $\Omega(1/d)$ information cost of a protocol for deciding if $\Delta(x,y) = d$ or $\Delta(x,y) \le 1$

Caveat: distribution is only on instances where $\Delta(x,y) \le 1$

Page 17:

Working with Hellinger Distance

Given the prob. distribution vector $\pi(x,y)$ over transcripts of an input (x, y)

Let $\psi(x,y)_\tau = \pi(x,y)_\tau^{1/2}$ for all $\tau$

Information cost can be lower bounded by $\sum_{\Delta(u,v)=1} \|\psi(u,u) - \psi(u,v)\|^2$

Unlike previous work, we exploit the geometry of the squared Euclidean norm (useful in later work [AJP])

Short diagonals property:

$$
\sum_{\Delta(u,v)=1} \|\psi(u,u) - \psi(u,v)\|^2 \ge \Omega(1/d) \sum_{\Delta(u,v)=d} \|\psi(u,u) - \psi(u,v)\|^2
$$

[Figure: a quadrilateral with sides a, b, c, d and diagonals e, f]

$a^2 + b^2 + c^2 + d^2 \ge e^2 + f^2$
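The quadrilateral inequality holds for any four points in Euclidean space, because the gap $(a^2+b^2+c^2+d^2) - (e^2+f^2)$ equals $\|w - x + y - z\|^2$ for consecutive vertices w, x, y, z. The sketch below checks this numerically on square-root embeddings of random probability vectors, which are unit vectors. The random distributions stand in for transcript distributions and are purely illustrative.

```python
import math
import random

def psi(pi):
    """Entrywise square root of a probability vector; the result has unit Euclidean norm."""
    return [math.sqrt(q) for q in pi]

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def short_diagonals_gap(w, x, y, z):
    """(a^2 + b^2 + c^2 + d^2) - (e^2 + f^2) for the quadrilateral w-x-y-z."""
    sides = sq_dist(w, x) + sq_dist(x, y) + sq_dist(y, z) + sq_dist(z, w)
    diagonals = sq_dist(w, y) + sq_dist(x, z)
    return sides - diagonals

def random_distribution(m, rng):
    """A random probability vector of length m (hypothetical transcript distribution)."""
    raw = [rng.random() for _ in range(m)]
    s = sum(raw)
    return [r / s for r in raw]

rng = random.Random(0)
w, x, y, z = (psi(random_distribution(5, rng)) for _ in range(4))
gap = short_diagonals_gap(w, x, y, z)
identity = sum((a - b + c - e) ** 2 for a, b, c, e in zip(w, x, y, z))
```

Under the usual normalization, $\|\psi - \psi'\|^2$ is twice the squared Hellinger distance between the underlying distributions.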

Page 18:

Open Problems

$L_k \circ L_p$ estimation for k < p

Other cascaded aggregates, e.g. entropy

Cascaded aggregates with 3 or more stages