tight bounds for distributed functional monitoring

Tight Bounds for Distributed Functional Monitoring David Woodruff IBM Almaden Qin Zhang Aarhus University MADALGO


Post on 19-Mar-2016


TRANSCRIPT

Page 1: Tight Bounds for Distributed Functional Monitoring

Tight Bounds for Distributed Functional Monitoring

David Woodruff, IBM Almaden

Qin Zhang, Aarhus University, MADALGO

Page 2: Tight Bounds for Distributed Functional Monitoring

Distributed Functional Monitoring

[Figure: a coordinator C communicates with k sites P1, P2, P3, …, Pk; site Pi holds input xi, and updates arrive over time.]

• Inputs: $x_1, x_2, x_3, \ldots, x_k$

• Updates: $x_i \to x_i + e_j$

• Static case vs. dynamic case

• Problems on $x_1 + x_2 + \cdots + x_k$: sampling, p-norms, heavy hitters, compressed sensing, quantiles, entropy

• Authors: Can, Cormode, Huang, Muthukrishnan, Patt-Shamir, Shafrir, Tirthapura, Wang, Yi, Zhao, many others

Page 3: Tight Bounds for Distributed Functional Monitoring

Motivation

• Data distributed and stored in the cloud
  – Impractical to put data on a single device

• Sensor networks
  – Communication very power-intensive

• Network routers
  – Bandwidth limitations

Page 4: Tight Bounds for Distributed Functional Monitoring

Problems

• Which functions $f(x_1, \ldots, x_k)$ do we care about?

• $x_1, \ldots, x_k$ are non-negative length-$n$ vectors

• $x = \sum_{i=1}^{k} x_i$

• $f(x_1, \ldots, x_k) = |x|_p = \left(\sum_{i=1}^{n} x_i^p\right)^{1/p}$, where inside this sum $x_i$ denotes the $i$-th coordinate of $x$

• $|x|_0$ is the number of non-zero coordinates

What is the randomized communication cost of these problems? That is, the minimal cost of a protocol that, for every input, fails with probability $< 1/3$.

Static case, dynamic case
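As a concrete reading of the definitions on this slide, here is a minimal sketch that aggregates the sites' vectors and evaluates $|x|_p$ and $|x|_0$ centrally. The function names and the toy input are illustrative assumptions, not part of the talk; the point of the talk is to avoid exactly this kind of centralization.

```python
from typing import Sequence

def aggregate(site_vectors: Sequence[Sequence[int]]) -> list[int]:
    """Coordinate-wise sum x = x_1 + ... + x_k of the sites' vectors."""
    n = len(site_vectors[0])
    return [sum(v[j] for v in site_vectors) for j in range(n)]

def p_norm(x: Sequence[int], p: float) -> float:
    """|x|_p = (sum_j x_j^p)^(1/p) for p > 0 on a non-negative vector."""
    return sum(c ** p for c in x) ** (1.0 / p)

def zero_norm(x: Sequence[int]) -> int:
    """|x|_0: the number of non-zero coordinates."""
    return sum(1 for c in x if c != 0)

# Hypothetical toy input: k = 3 sites, n = 4 coordinates.
sites = [[1, 0, 2, 0], [0, 0, 1, 0], [3, 0, 0, 0]]
x = aggregate(sites)      # [4, 0, 3, 0]
print(p_norm(x, 2))       # 5.0
print(zero_norm(x))       # 2
```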

Page 5: Tight Bounds for Distributed Functional Monitoring

Exact Answers

• An $\Omega(n)$ communication bound for computing $|x|_p$, $p \neq 1$

• Reduction from 2-Player Set-Disjointness (DISJ)
  – Alice has a set $S \subseteq [n]$ of size $n/4$
  – Bob has a set $T \subseteq [n]$ of size $n/4$ with either $|S \cap T| = 0$ or $|S \cap T| = 1$
  – Is $S \cap T = \emptyset$?
  – $|X \cap Y| = 1 \Rightarrow \mathrm{DISJ}(X, Y) = 1$, $|X \cap Y| = 0 \Rightarrow \mathrm{DISJ}(X, Y) = 0$
  – [KS, R]: $\Omega(n)$ communication

• Prohibitive for applications
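The slide does not spell out the reduction. One standard way to see it, offered here as an assumption rather than the authors' exact construction, is to let Alice and Bob hold the indicator vectors of their sets: then $|x|_p^p = |S| + |T| + (2^p - 2)\,|S \cap T|$, which reveals the intersection exactly when $p \neq 1$. The sketch below restricts to $p > 0$; all names are hypothetical.

```python
def norm_pow(x, p):
    """Return |x|_p^p = sum_j x_j^p for a non-negative vector and p > 0."""
    return sum(c ** p for c in x)

def indicator(s, n):
    """0/1 indicator vector of a set s over the universe {0, ..., n-1}."""
    return [1 if j in s else 0 for j in range(n)]

def intersects_via_norm(S, T, n, p):
    """Decide whether S and T intersect from |x|_p^p alone, x = 1_S + 1_T."""
    x = [a + b for a, b in zip(indicator(S, n), indicator(T, n))]
    disjoint_value = len(S) + len(T)  # value of |x|_p^p when S, T are disjoint
    return abs(norm_pow(x, p) - disjoint_value) > 1e-9

n = 8  # sets of size n/4 = 2
print(intersects_via_norm({0, 1}, {2, 3}, n, p=2))  # False
print(intersects_via_norm({0, 1}, {1, 2}, n, p=2))  # True
print(intersects_via_norm({0, 1}, {1, 2}, n, p=1))  # False: p = 1 carries no signal
```

The last call illustrates why the bound excludes $p = 1$: the 1-norm of a sum of non-negative vectors is just the sum of the local 1-norms.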

Page 6: Tight Bounds for Distributed Functional Monitoring

Approximate Answers

$f(x_1, \ldots, x_k) = (1 \pm \varepsilon)\,|x|_p$

What is the randomized communication cost as a function of k, ε, and n?

Ignore log(nk/ε) factors

Page 7: Tight Bounds for Distributed Functional Monitoring

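Previous Results

Lower bounds in static model, upper bounds in dynamic model (underlying vectors are non-negative)

• $|x|_0$: $\Omega(k + \varepsilon^{-2})$ and $O(k \cdot \varepsilon^{-2})$

• $|x|_p$: $\Omega(k + \varepsilon^{-2})$

• $|x|_2$: $O(k^2/\varepsilon + k^{1.5}/\varepsilon^3)$

• $|x|_p$, $p > 2$: $O(k^{2p+1} n^{1-2/p} \cdot \mathrm{poly}(1/\varepsilon))$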

Page 8: Tight Bounds for Distributed Functional Monitoring

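Our Results

Lower bounds in static model, upper bounds in dynamic model (underlying vectors are non-negative)

• $|x|_0$: $\Omega(k + \varepsilon^{-2})$ and $O(k \cdot \varepsilon^{-2})$, improved to $\Omega(k \cdot \varepsilon^{-2})$

• $|x|_p$: $\Omega(k + \varepsilon^{-2})$, improved to $\Omega(k^{p-1} \cdot \varepsilon^{-2})$. Talk will focus on $p = 2$

• $|x|_2$: $O(k^2/\varepsilon + k^{1.5}/\varepsilon^3)$, improved to $O(k \cdot \mathrm{poly}(1/\varepsilon))$

• $|x|_p$, $p > 2$: $O(k^{2p+1} n^{1-2/p} \cdot \mathrm{poly}(1/\varepsilon))$, improved to $O(k^{p-1} \cdot \mathrm{poly}(1/\varepsilon))$

First lower bounds to depend on the product of $k$ and $\varepsilon^{-2}$

Upper bound doesn't depend polynomially on $n$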

Page 9: Tight Bounds for Distributed Functional Monitoring

Talk Outline

• Lower Bounds
  – Non-zero elements
  – Euclidean norm

• Upper Bounds
  – p-norm

Page 10: Tight Bounds for Distributed Functional Monitoring

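Previous Lower Bounds

• Lower bounds for any p-norm, $p \neq 1$

• [CMY]: $\Omega(k)$

• [ABC]: $\Omega(\varepsilon^{-2})$

• Reduction from Gap-Orthogonality (GAP-ORT)

• Alice, Bob have $u, v \in \{0,1\}^{\varepsilon^{-2}}$, respectively

• $|\Delta(u, v) - 1/(2\varepsilon^2)| < 1/\varepsilon$ or $|\Delta(u, v) - 1/(2\varepsilon^2)| > 2/\varepsilon$

• [CR, S]: $\Omega(\varepsilon^{-2})$ communication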
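To make the GAP-ORT promise concrete, here is a small sketch that classifies a pair of bit vectors. It assumes $\Delta(u, v)$ is the Hamming distance, which the slide does not state explicitly, and it takes $1/\varepsilon$ as an integer parameter to avoid floating-point thresholds. The labels "near" and "far" are hypothetical names for the two sides of the promise.

```python
def hamming(u, v):
    """Number of positions where u and v differ (assumed meaning of Delta)."""
    return sum(a != b for a, b in zip(u, v))

def gap_ort_case(u, v, inv_eps):
    """Classify (u, v) of length 1/eps^2; inv_eps is the integer 1/eps.

    Returns "near" if |Delta - 1/(2 eps^2)| < 1/eps,
            "far"  if |Delta - 1/(2 eps^2)| > 2/eps,
            None   if the pair falls outside the promise.
    """
    m = inv_eps * inv_eps
    assert len(u) == len(v) == m
    dev = abs(hamming(u, v) - m / 2)
    if dev < inv_eps:
        return "near"
    if dev > 2 * inv_eps:
        return "far"
    return None

u = [0] * 100                                   # 1/eps = 10
print(gap_ort_case(u, [1] * 50 + [0] * 50, 10))  # near
print(gap_ort_case(u, [0] * 100, 10))            # far
print(gap_ort_case(u, [1] * 35 + [0] * 65, 10))  # None
```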

Page 11: Tight Bounds for Distributed Functional Monitoring

Talk Outline

• Lower Bounds
  – Non-zero elements
  – Euclidean norm

• Upper Bounds
  – p-norm

Page 12: Tight Bounds for Distributed Functional Monitoring

Lower Bound for Distinct Elements

• Improve bound to optimal $\Omega(k \cdot \varepsilon^{-2})$

• Simpler problem: k-GAP-THRESH
  – Each site $P_i$ holds a bit $Z_i$
  – The $Z_i$ are i.i.d. Bernoulli($\beta$)
  – Decide if $\sum_{i=1}^{k} Z_i > \beta k + (\beta k)^{1/2}$ or $\sum_{i=1}^{k} Z_i < \beta k - (\beta k)^{1/2}$; otherwise don't care

• Rectangle property: for any correct protocol transcript $\tau$, $Z_1, Z_2, \ldots, Z_k$ are independent conditioned on $\tau$
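The k-GAP-THRESH promise can be written down directly. The sketch below samples the sites' bits and reports which side of the gap the sum falls on; the labels "high" and "low" and the helper names are assumptions for illustration, since the slide only says "decide".

```python
import math
import random

def sample_bits(k, beta, rng):
    """Draw Z_1, ..., Z_k i.i.d. Bernoulli(beta), one bit per site."""
    return [1 if rng.random() < beta else 0 for _ in range(k)]

def gap_thresh(z, beta):
    """Return "high"/"low" for the two promise cases, None for don't-care."""
    k = len(z)
    total = sum(z)
    mean, dev = beta * k, math.sqrt(beta * k)
    if total > mean + dev:
        return "high"
    if total < mean - dev:
        return "low"
    return None

rng = random.Random(0)  # fixed seed only to make the demo repeatable
z = sample_bits(100, 0.5, rng)
print(sum(z), gap_thresh(z, 0.5))
```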

Page 13: Tight Bounds for Distributed Functional Monitoring

A Key Lemma

• Lemma: For any protocol $\Pi$ which succeeds w.pr. $> 0.9999$, the transcript $\tau$ is such that w.pr. $> 1/2$, for at least $k/2$ different $i$, $H(Z_i \mid \tau) < H(0.01\beta)$

• Proof: Suppose $\tau$ does not satisfy this
  – With large probability, $\beta k - O((\beta k)^{1/2}) < \mathbf{E}\left[\sum_{i=1}^{k} Z_i \mid \tau\right] < \beta k + O((\beta k)^{1/2})$
  – Since the $Z_i$ are independent given $\tau$, $\sum_{i=1}^{k} Z_i \mid \tau$ is a sum of independent Bernoullis
  – Since most $H(Z_i \mid \tau)$ are large, by anti-concentration, both events occur with constant probability: $\sum_{i=1}^{k} Z_i \mid \tau > \beta k + (\beta k)^{1/2}$ and $\sum_{i=1}^{k} Z_i \mid \tau < \beta k - (\beta k)^{1/2}$
  – So $\Pi$ can't succeed with large probability
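The anti-concentration step can be checked numerically in the simplest setting. The sketch below computes exact binomial tails for the unconditioned i.i.d. case, which is only a stand-in for the conditioned sum in the proof (that sum has independent but not necessarily identical Bernoullis). The parameter choice $k = 100$, $\beta = 0.5$ is a hypothetical example.

```python
import math

def binom_pmf(k, beta, s):
    """P[Bin(k, beta) = s]."""
    return math.comb(k, s) * beta ** s * (1 - beta) ** (k - s)

def promise_tail_probs(k, beta):
    """Exact probabilities of the low and high events of k-GAP-THRESH."""
    mean, dev = beta * k, math.sqrt(beta * k)
    high = sum(binom_pmf(k, beta, s) for s in range(k + 1) if s > mean + dev)
    low = sum(binom_pmf(k, beta, s) for s in range(k + 1) if s < mean - dev)
    return low, high

low, high = promise_tail_probs(100, 0.5)
print(low, high)  # both about 0.067: each side occurs with constant probability
```

Because both tails have constant mass, a transcript that leaves the sum this spread out cannot support a confident answer either way, which is the contradiction the proof uses.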

Page 14: Tight Bounds for Distributed Functional Monitoring

Composition Idea

[Figure: the coordinator C runs a separate DISJ instance with each of the sites P1, P2, P3, …, Pk; the i-th instance determines Zi. C can be thought of as a player.]

The input to $P_i$ in k-GAP-THRESH, denoted $Z_i$, is the output of a 2-party Disjointness (DISJ) instance between $C$ and $S_i$

- Let $X$ be a random set of size $1/(4\varepsilon^2)$ from $\{1, 2, \ldots, 1/\varepsilon^2\}$
- For each $i$, if $Z_i = 1$, then choose $Y_i$ so that $\mathrm{DISJ}(X, Y_i) = 1$, else choose $Y_i$ so that $\mathrm{DISJ}(X, Y_i) = 0$
- Distributional complexity $\Omega(1/\varepsilon^2)$ [Razborov]

Page 15: Tight Bounds for Distributed Functional Monitoring

Putting it All Together

• Key Lemma $\Rightarrow$ for most $i$, $H(Z_i \mid \tau) < H(0.01\beta)$

• Since $H(Z_i) = H(\beta)$ for all $i$, for most $i$ protocol $\Pi$ solves $\mathrm{DISJ}(X, Y_i)$ with constant probability

• Since the $Z_i \mid \tau$ are independent, solving DISJ requires communication $\Omega(\varepsilon^{-2})$ on each of $k/2$ copies

• Total communication is $\Omega(k \cdot \varepsilon^{-2})$

• Can show a reduction:
  – $|x|_0 > 1/(2\varepsilon^2) + 1/\varepsilon$ if $\sum_{i=1}^{k} Z_i > \beta k + (\beta k)^{1/2}$
  – $|x|_0 < 1/(2\varepsilon^2) - 1/\varepsilon$ if $\sum_{i=1}^{k} Z_i < \beta k - (\beta k)^{1/2}$

Page 16: Tight Bounds for Distributed Functional Monitoring

Talk Outline

• Lower Bounds
  – Non-zero elements
  – Euclidean norm

• Upper Bounds
  – p-norm

Page 17: Tight Bounds for Distributed Functional Monitoring

Lower Bound for Euclidean Norm

• Improve $\Omega(k + \varepsilon^{-2})$ bound to optimal $\Omega(k \cdot \varepsilon^{-2})$

• Base problem: Gap-Orthogonality (GAP-ORT(X, Y))
  – Consider uniform distribution on (X, Y)

• We observe an information lower bound for GAP-ORT

• Sherstov's lower bound for GAP-ORT holds for uniform distribution on (X, Y)

• [BBCR] + [Sherstov] $\Rightarrow$ for any protocol $\Pi$ and $t > 0$, $I(X, Y; \Pi) = \Omega(1/(\varepsilon^2 \log t))$ or $\Pi$ uses $t$ communication

Page 18: Tight Bounds for Distributed Functional Monitoring

Information Implications

• By chain rule, $I(X, Y ; \Pi) = \sum_{i=1}^{1/\varepsilon^2} I(X_i, Y_i ; \Pi \mid X_{<i}, Y_{<i}) = \Omega(\varepsilon^{-2})$

• For most $i$, $I(X_i, Y_i ; \Pi \mid X_{<i}, Y_{<i}) = \Omega(1)$

• Maximum Likelihood Principle: non-trivial advantage in guessing $(X_i, Y_i)$
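One way to fill in the step from the chain-rule sum to individual indices is a plain averaging argument. The sketch below assumes each $X_i$ and $Y_i$ is a single bit, so every term is at most 2, and writes $c$ for a hypothetical constant hidden in the $\Omega(\varepsilon^{-2})$ bound; neither detail is stated on the slide.

```latex
\begin{align*}
m &= 1/\varepsilon^2, \qquad
a_i = I(X_i, Y_i ; \Pi \mid X_{<i}, Y_{<i}) \in [0, 2], \qquad
\sum_{i=1}^{m} a_i \ge c\,m, \\
G &= \{\, i : a_i \ge c/2 \,\}, \\
c\,m &\le \sum_{i \in G} a_i + \sum_{i \notin G} a_i
      \le 2\,|G| + \tfrac{c}{2}\,m
\quad\Longrightarrow\quad
|G| \ge \tfrac{c}{4}\,m .
\end{align*}
```

Under these assumptions a constant fraction of the indices carry $\Omega(1)$ information, which is the loose sense in which "most $i$" is read here.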

Page 19: Tight Bounds for Distributed Functional Monitoring

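2-BIT k-Party DISJ

• Choose a random $j \in [k^2]$; one of four cases holds:
  – $j$ doesn't occur in any $T_i$
  – $j$ occurs only in $T_1, \ldots, T_{k/2}$
  – $j$ occurs only in $T_{k/2+1}, \ldots, T_k$
  – $j$ occurs in $T_1, \ldots, T_k$

• All $j' \neq j$ occur in at most one set $T_i$ (assume $k \geq 4$)

• We show $\Omega(k)$ information cost

[Figure: sites $P_1, P_2, P_3, \ldots, P_k$ hold sets $T_1, T_2, T_3, \ldots, T_k \subseteq [k^2]$.]

We compose GAP-ORT with a variant of k-Party DISJ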
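A small generator can make the four cases concrete. This is a sketch under assumptions: the two bits are read as "the special element is in the first half of the sets" and "in the second half", and every other element is placed in one uniformly random set or in none. The slide only requires that other elements occur in at most one set, so that placement rule and all names are hypothetical.

```python
import random

def two_bit_disj_instance(k, in_first_half, in_second_half, rng):
    """Build sets T_1..T_k over {1..k^2} with one special element j.

    The two bits choose which half (or both, or neither) of the sets contain j.
    """
    universe = range(1, k * k + 1)
    j = rng.choice(universe)
    sets = [set() for _ in range(k)]
    half = k // 2
    if in_first_half:
        for i in range(half):
            sets[i].add(j)
    if in_second_half:
        for i in range(half, k):
            sets[i].add(j)
    for other in universe:
        if other == j:
            continue
        owner = rng.randrange(k + 1)  # value k means "in no set" (assumed rule)
        if owner < k:
            sets[owner].add(other)
    return j, sets

j, sets = two_bit_disj_instance(4, True, False, random.Random(0))
print(j, [j in t for t in sets])  # special element only in the first two sets
```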

Page 20: Tight Bounds for Distributed Functional Monitoring

Rough Composition Idea

[Figure: one GAP-ORT instance sits on top of $1/\varepsilon^2$ 2-BIT k-party DISJ instances.]

Bits $X_i$ and $Y_i$ in GAP-ORT determine output of $i$-th 2-BIT k-party DISJ instance

An algorithm for approximating Euclidean norm solves GAP-ORT, therefore solves most 2-BIT k-party DISJ instances

Show $\Omega(k/\varepsilon^2)$ overall information is revealed

- Information adds (if we condition on enough “helper” variables)
- $P_i$ participates in all instances

Page 21: Tight Bounds for Distributed Functional Monitoring

Talk Outline

• Lower Bounds
  – Non-zero elements
  – Euclidean norm

• Upper Bounds
  – p-norm

Page 22: Tight Bounds for Distributed Functional Monitoring

Algorithm for p-norm

• We get $k^{p-1} \cdot \mathrm{poly}(1/\varepsilon)$, improving $k^{2p+1} n^{1-2/p} \cdot \mathrm{poly}(1/\varepsilon)$ for general $p$ and $O(k^2/\varepsilon + k^{1.5}/\varepsilon^3)$ for $p = 2$

• Our protocol is the first 1-way protocol, that is, all communication is from sites to coordinator

• Focus on Euclidean norm (p = 2) in talk

• Non-negative vectors

• Just determine if Euclidean norm exceeds a threshold θ

Page 23: Tight Bounds for Distributed Functional Monitoring

The Most Naïve Thing to Do

• $x_i$ is Site $i$'s current vector

• $x = \sum_{i=1}^{k} x_i$

• Suppose Site $i$ sees an update $x_i \to x_i + e_j$

• Send $j$ to Coordinator with a certain probability that only depends on $k$ and $\theta$?

Page 24: Tight Bounds for Distributed Functional Monitoring

Sample and Send

[Figure: sites P1, P2, P3, …, Pk report to coordinator C. Each site holds a block of ones on coordinates disjoint from the other sites' blocks, labeled $|x|_2^2 = k^2$; with one additional coordinate set to 1 at every site, the label is $|x|_2^2 = 2k^2$.]

• Send each update with probability at least $1/k$

• Communication = $O(k)$, so okay

Suppose $x$ has $k^4$ coordinates that are 1, and may have a unique coordinate which is $k^2$, occurring $k$ times on each site

- Send update with probability $1/k^2$

- Will find the large coordinate

- But communication is $\Omega(k^2)$
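The arithmetic behind the second example is short enough to check directly. The sketch below counts updates and multiplies by the sampling probability, using exact fractions; the function names and the choice $k = 10$ are illustrative assumptions.

```python
from fractions import Fraction

def expected_messages(num_updates, prob):
    """Expected messages when each update is forwarded independently."""
    return num_updates * prob

def heavy_example(k):
    """k^4 light coordinates of value 1, plus one coordinate of value k^2
    that arrives as k updates on each of the k sites."""
    light_updates = k ** 4
    heavy_updates = k * k
    prob = Fraction(1, k * k)
    return {
        "heavy_samples": expected_messages(heavy_updates, prob),
        "total_messages": expected_messages(light_updates + heavy_updates, prob),
    }

result = heavy_example(10)
print(result["heavy_samples"], result["total_messages"])  # 1 101
```

For $k = 10$ the heavy coordinate is sampled once in expectation while about $k^2$ messages are sent overall, almost all of them for light coordinates.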

Page 25: Tight Bounds for Distributed Functional Monitoring

What Is Happening?

• Sampling with probability $\approx 1/k^2$ is good to get a few samples from the heavy item

• But all the light coordinates are in the way, making the communication $\Omega(k^2)$

• Suppose we put a barrier of $k$, that is, sample with probability $\approx 1/k^2$ but only send an item if it has occurred at least $k$ times on a site

• Now communication is $O(1)$ and we have found the heavy coordinate

• But light coordinates also contribute to the overall $|x|_2$ value

Page 26: Tight Bounds for Distributed Functional Monitoring

Algorithm for Euclidean Norm

• Sample at different scales with different barriers

• Use public coin to create $O(\log n)$ groups $T_1, \ldots, T_{\log n}$ of the $n$ input coordinates

• $T_z$ contains $n/2^z$ random coordinates

• Suppose Site $i$ sees the update $x_i \to x_i + e_j$

• For each $T_z$ containing $j$:
  – If $x_{ij} > (\theta/2^z)^{1/2}/k$, then with probability $(2^z/\theta)^{1/2} \cdot \mathrm{poly}(\varepsilon^{-1} \log n)$, send $(j, z)$ to the coordinator

• Expected communication $\tilde{O}(k)$

• If a group of coordinates contributes to $|x|_2$, there is a $z$ for which a few coordinates in the group are sampled multiple times
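The site-side rule above can be sketched as follows. Several pieces are assumptions made only so the code runs: a shared seed stands in for the public coin, the unspecified $\mathrm{poly}(\varepsilon^{-1} \log n)$ factor is a caller-supplied `boost`, the probability is capped at 1, and the coordinator's estimation step (not described on the slide) is omitted.

```python
import math
import random

def make_groups(n, seed):
    """Groups T_z of about n / 2^z random coordinates, z = 1..floor(log2 n)."""
    rng = random.Random(seed)  # shared seed plays the role of the public coin
    levels = int(math.log2(n))
    return {z: set(rng.sample(range(n), max(1, n >> z)))
            for z in range(1, levels + 1)}

class Site:
    def __init__(self, n, k, theta, groups, boost, rng):
        self.x = [0] * n          # this site's current vector x_i
        self.k, self.theta = k, theta
        self.groups, self.boost, self.rng = groups, boost, rng

    def update(self, j):
        """Apply x_i -> x_i + e_j; return the (j, z) messages to send."""
        self.x[j] += 1
        messages = []
        for z, members in self.groups.items():
            if j not in members:
                continue
            barrier = math.sqrt(self.theta / 2 ** z) / self.k
            if self.x[j] > barrier:
                prob = min(1.0, math.sqrt(2 ** z / self.theta) * self.boost)
                if self.rng.random() < prob:
                    messages.append((j, z))
        return messages

groups = make_groups(n=8, seed=42)
site = Site(n=8, k=2, theta=64.0, groups=groups, boost=1.0, rng=random.Random(7))
for _ in range(5):
    print(site.update(3))
```

Lower levels have higher barriers and lower sampling probabilities, so heavy coordinates are reported at coarse scales while light ones only surface at fine scales.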

Page 27: Tight Bounds for Distributed Functional Monitoring

Conclusions

• Improved communication lower and upper bounds for estimating $|x|_p$

• Implies tight lower bounds for estimating entropy, heavy hitters, quantiles

• Implications for data stream model
  – First lower bound for $|x|_0$ without Gap-Hamming
  – Useful information cost lower bound for Gap-Hamming, or protocol has very large communication
  – Improve $\Omega(n^{1-2/p}/\varepsilon^{2/p})$ bound for estimating $|x|_p$ in a stream to $\Omega(n^{1-2/p}/\varepsilon^{4/p})$