
Page 1:

1

Basics of information theory and information complexity

June 1, 2013

Mark Braverman

Princeton University

a tutorial

Page 2:

Part I: Information theory

• Information theory, in its modern form, was introduced in the 1940s to study the problem of transmitting data over physical channels.

2

[Diagram: Alice and Bob connected by a communication channel.]

Page 3:

Quantifying “information”

• Information is measured in bits.

• The basic notion is Shannon’s entropy.

• The entropy of a random variable is the (typical) number of bits needed to remove the uncertainty of the variable.

• For a discrete variable: 𝐻(𝑋) ≔ ∑𝑥 Pr[𝑋 = 𝑥] log(1/Pr[𝑋 = 𝑥]).
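A minimal sketch (not part of the original slides) of this definition in Python; the `entropy` helper is hypothetical and just evaluates 𝐻(𝑋) = ∑𝑥 Pr[𝑋 = 𝑥] log(1/Pr[𝑋 = 𝑥]) on a few of the examples discussed on the next slide.

```python
# A minimal sketch (not from the slides): Shannon entropy of a discrete
# distribution, H(X) = sum_x Pr[X=x] * log2(1 / Pr[X=x]), measured in bits.
import math

def entropy(probs):
    """Entropy in bits of a distribution given as a list of probabilities."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))          # 1.0        (a fair bit)
print(entropy([1/3, 1/3, 1/3]))     # log2(3) ~= 1.585 (a uniform trit)
print(entropy([1.0]))               # 0.0        (a constant)
```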

3

Page 4:

Shannon’s entropy

• Important examples and properties:

– If 𝑋 = 𝑥 is a constant, then 𝐻(𝑋) = 0.

– If 𝑋 is uniform on a finite set 𝑆 of possible values, then 𝐻(𝑋) = log |𝑆|.

– If 𝑋 is supported on at most 𝑛 values, then 𝐻(𝑋) ≤ log 𝑛.

– If 𝑌 is a random variable determined by 𝑋, then 𝐻(𝑌) ≤ 𝐻(𝑋).

4

Page 5:

Conditional entropy

• For two (potentially correlated) variables 𝑋, 𝑌, the conditional entropy of 𝑋 given 𝑌 is the amount of uncertainty left in 𝑋 given 𝑌:

𝐻(𝑋|𝑌) ≔ 𝐸𝑦∼𝑌[𝐻(𝑋|𝑌 = 𝑦)].

• One can show 𝐻(𝑋𝑌) = 𝐻(𝑌) + 𝐻(𝑋|𝑌).

• This important fact is known as the chain rule.

• If 𝑋 ⊥ 𝑌, then 𝐻(𝑋𝑌) = 𝐻(𝑋) + 𝐻(𝑌|𝑋) = 𝐻(𝑋) + 𝐻(𝑌).

5

Page 6:

Example

• 𝑋 = 𝐵1, 𝐵2, 𝐵3

• 𝑌 = 𝐵1 ⊕𝐵2 , 𝐵2 ⊕𝐵4 , 𝐵3 ⊕𝐵4 , 𝐵5

• Where 𝐵1, 𝐵2, 𝐵3, 𝐵4, 𝐵5 ∈𝑈 {0,1}.

• Then

– 𝐻(𝑋) = 3; 𝐻(𝑌) = 4; 𝐻(𝑋𝑌) = 5;

– 𝐻(𝑋|𝑌) = 1 = 𝐻(𝑋𝑌) − 𝐻(𝑌);

– 𝐻(𝑌|𝑋) = 2 = 𝐻(𝑋𝑌) − 𝐻(𝑋).
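A quick way to check these numbers (not from the slides): enumerate all 32 equally likely values of (𝐵1, … , 𝐵5) and compute the entropies of 𝑋, 𝑌 and (𝑋, 𝑌) from the exact distribution. The helper names below are hypothetical.

```python
# A minimal sketch (not from the slides): verify the example by enumerating
# all 2^5 equally likely values of (B1, ..., B5).
import itertools, math
from collections import Counter

def H(samples):
    """Entropy in bits of the (here: exact) distribution of the listed outcomes."""
    counts = Counter(samples)
    n = len(samples)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

X, Y, XY = [], [], []
for b1, b2, b3, b4, b5 in itertools.product([0, 1], repeat=5):
    x = (b1, b2, b3)
    y = (b1 ^ b2, b2 ^ b4, b3 ^ b4, b5)
    X.append(x); Y.append(y); XY.append((x, y))

print(H(X), H(Y), H(XY))            # 3.0 4.0 5.0
print(H(XY) - H(Y), H(XY) - H(X))   # H(X|Y) = 1.0, H(Y|X) = 2.0
```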

6

Page 7:

Mutual information

• 𝑋 = 𝐵1, 𝐵2, 𝐵3

• 𝑌 = 𝐵1 ⊕𝐵2 , 𝐵2 ⊕𝐵4 , 𝐵3 ⊕𝐵4 , 𝐵5

7

[Venn diagram: 𝐻(𝑋) and 𝐻(𝑌) as overlapping regions; the overlap is 𝐼(𝑋; 𝑌), the remainders are 𝐻(𝑋|𝑌) and 𝐻(𝑌|𝑋). Regions are labeled by the bits 𝐵1, 𝐵1 ⊕ 𝐵2, 𝐵4, 𝐵5, etc.]

Page 8:

Mutual information

• The mutual information is defined as 𝐼(𝑋; 𝑌) = 𝐻(𝑋) − 𝐻(𝑋|𝑌) = 𝐻(𝑌) − 𝐻(𝑌|𝑋).

• “By how much does knowing 𝑋 reduce the entropy of 𝑌?”

• Always non-negative: 𝐼(𝑋; 𝑌) ≥ 0.

• Conditional mutual information: 𝐼(𝑋; 𝑌|𝑍) ≔ 𝐻(𝑋|𝑍) − 𝐻(𝑋|𝑌𝑍).

• Chain rule for mutual information: 𝐼(𝑋𝑌; 𝑍) = 𝐼(𝑋; 𝑍) + 𝐼(𝑌; 𝑍|𝑋).

• Simple intuitive interpretation.

8

Page 9:

Example – a biased coin

• A coin that is either 𝜀-biased towards Heads or 𝜀-biased towards Tails is tossed several times.

• Let 𝐵 ∈ {𝐻, 𝑇} be the bias, and suppose that a priori both options are equally likely: 𝐻(𝐵) = 1.

• How many tosses are needed to determine 𝐵?

• Let 𝑇1, … , 𝑇𝑘 be a sequence of tosses.

• Start with 𝑘 = 2.

9

Page 10:

What do we learn about 𝐵?

• 𝐼(𝐵; 𝑇1𝑇2) = 𝐼(𝐵; 𝑇1) + 𝐼(𝐵; 𝑇2|𝑇1)
  = 𝐼(𝐵; 𝑇1) + 𝐼(𝐵𝑇1; 𝑇2) − 𝐼(𝑇1; 𝑇2)
  ≤ 𝐼(𝐵; 𝑇1) + 𝐼(𝐵𝑇1; 𝑇2)
  = 𝐼(𝐵; 𝑇1) + 𝐼(𝐵; 𝑇2) + 𝐼(𝑇1; 𝑇2|𝐵)
  = 𝐼(𝐵; 𝑇1) + 𝐼(𝐵; 𝑇2) = 2 ⋅ 𝐼(𝐵; 𝑇1)
  (the last steps use that the tosses are independent given 𝐵, so 𝐼(𝑇1; 𝑇2|𝐵) = 0, and symmetry).

• Similarly, 𝐼(𝐵; 𝑇1…𝑇𝑘) ≤ 𝑘 ⋅ 𝐼(𝐵; 𝑇1).

• To determine 𝐵 with constant accuracy, we need 𝐼(𝐵; 𝑇1…𝑇𝑘) ≥ 𝑐 for some constant 𝑐 > 0, hence 𝑐 ≤ 𝑘 ⋅ 𝐼(𝐵; 𝑇1).

• 𝑘 = Ω(1/𝐼(𝐵; 𝑇1)).

10
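A small numeric check (not from the slides) of the quantity this bound depends on: since 𝐻(𝐵) = 1, we have 𝐼(𝐵; 𝑇1) = 𝐻(𝑇1) − 𝐻(𝑇1|𝐵) = 1 − 𝐻(1/2 + 𝜀). The snippet below, with a hypothetical binary-entropy helper `h`, shows this behaves like Θ(𝜀²), so 𝑘 = Ω(1/𝜀²) tosses are needed.

```python
# A minimal sketch (not from the slides): I(B; T1) for the biased-coin example,
# I(B; T1) = H(T1) - H(T1 | B) = 1 - h(1/2 + eps), which is Theta(eps^2).
import math

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for eps in [0.1, 0.01, 0.001]:
    info = 1 - h(0.5 + eps)           # T1 is uniform on average, so H(T1) = 1
    print(eps, info, info / eps**2)   # the ratio approaches 2/ln(2) ~= 2.885
```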

Page 11:

Kullback–Leibler (KL)-Divergence

• A (non-symmetric) measure of distance between distributions on the same space.

• Plays a key role in information theory.

𝐷(𝑃 ∥ 𝑄) ≔ ∑𝑥 𝑃[𝑥] log(𝑃[𝑥]/𝑄[𝑥]).

• 𝐷(𝑃 ∥ 𝑄) ≥ 0, with equality when 𝑃 = 𝑄.

• Caution: 𝐷(𝑃 ∥ 𝑄) ≠ 𝐷(𝑄 ∥ 𝑃)!

11

Page 12:

Properties of KL-divergence

• Connection to mutual information: 𝐼(𝑋; 𝑌) = 𝐸𝑦∼𝑌[𝐷(𝑋|𝑌=𝑦 ∥ 𝑋)], where 𝑋|𝑌=𝑦 denotes the distribution of 𝑋 conditioned on 𝑌 = 𝑦.

• If 𝑋 ⊥ 𝑌, then 𝑋|𝑌=𝑦 = 𝑋, and both sides are 0.

• Pinsker’s inequality: ‖𝑃 − 𝑄‖1 = 𝑂(√𝐷(𝑃 ∥ 𝑄)).

• Tight!

𝐷(𝐵1/2+𝜀 ∥ 𝐵1/2) = Θ(𝜀²).
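A minimal sketch (not from the slides) of the divergence computation, with a hypothetical `kl` helper, checking numerically that 𝐷(𝐵1/2+𝜀 ∥ 𝐵1/2) = Θ(𝜀²).

```python
# A minimal sketch (not from the slides): KL divergence between two distributions,
# and a numeric check that D(B_{1/2+eps} || B_{1/2}) = Theta(eps^2).
import math

def kl(P, Q):
    """D(P || Q) = sum_x P[x] * log2(P[x] / Q[x]), in bits."""
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

for eps in [0.1, 0.01, 0.001]:
    d = kl([0.5 + eps, 0.5 - eps], [0.5, 0.5])
    print(eps, d, d / eps**2)     # the ratio d / eps^2 approaches 2/ln(2) ~= 2.885
```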

12

Page 13:

Back to the coin example

• 𝐼(𝐵; 𝑇1) = 𝐸𝑏∼𝐵[𝐷(𝑇1|𝐵=𝑏 ∥ 𝑇1)] = 𝐷(𝐵1/2±𝜀 ∥ 𝐵1/2) = Θ(𝜀²).

• 𝑘 = Ω(1/𝐼(𝐵; 𝑇1)) = Ω(1/𝜀²).

• “Follow the information learned from the coin tosses”

• Can be done using combinatorics, but the information-theoretic language is more natural for expressing what’s going on.

Page 14:

Back to communication

• The reason information theory is so important for communication is that information-theoretic quantities readily operationalize.

• Can attach operational meaning to Shannon’s entropy: 𝐻 𝑋 ≈ “the cost of transmitting 𝑋”.

• Let 𝐶 𝑋 be the (expected) cost of transmitting a sample of 𝑋.

14

Page 15:

𝐻 𝑋 = 𝐶(𝑋)?

• Not quite.

• Let 𝑇 ∈𝑈 {1,2,3} be a uniformly random trit.

• 𝐶(𝑇) = 5/3 ≈ 1.67.

• 𝐻(𝑇) = log 3 ≈ 1.58.

• It is always the case that 𝐶 𝑋 ≥ 𝐻(𝑋).

15

[Encoding used: 1 ↦ 0, 2 ↦ 10, 3 ↦ 11.]

Page 16:

But 𝐻 𝑋 and 𝐶(𝑋) are close

• Huffman’s coding: 𝐶(𝑋) ≤ 𝐻(𝑋) + 1.

• This is a compression result: “an uninformative message turned into a short one”.

• Therefore: 𝐻(𝑋) ≤ 𝐶(𝑋) ≤ 𝐻(𝑋) + 1.
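A minimal sketch (not from the slides) of Huffman coding, checking the bound 𝐶(𝑋) ≤ 𝐻(𝑋) + 1 on the trit example; `huffman_cost` is a hypothetical helper that computes only the expected codeword length, not the code itself.

```python
# A minimal sketch (not from the slides): expected Huffman codeword length,
# to check H(X) <= C(X) <= H(X) + 1 on the uniform trit.
import heapq, math

def huffman_cost(probs):
    """Expected codeword length (in bits) of a Huffman code for probs."""
    heap = list(probs)
    heapq.heapify(heap)
    cost = 0.0
    while len(heap) > 1:
        p1 = heapq.heappop(heap)
        p2 = heapq.heappop(heap)
        cost += p1 + p2          # each merge adds one bit to every symbol beneath it
        heapq.heappush(heap, p1 + p2)
    return cost

probs = [1/3, 1/3, 1/3]
H = sum(p * math.log2(1 / p) for p in probs)
print(H, huffman_cost(probs))    # ~1.585 and ~1.667: H(T) <= C(T) <= H(T) + 1
```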

16

Page 17:

Shannon’s noiseless coding

• The cost of communicating many copies of 𝑋 scales as 𝐻(𝑋).

• Shannon’s source coding theorem:

– Let 𝐶(𝑋𝑛) be the cost of transmitting 𝑛 independent copies of 𝑋. Then the amortized transmission cost lim𝑛→∞ 𝐶(𝑋𝑛)/𝑛 = 𝐻(𝑋).

• This equation gives 𝐻(𝑋) operational meaning.

17

Page 18:

𝐻(𝑋) operationalized

[Diagram: Alice holds 𝑋1, … , 𝑋𝑛, …; transmitting the 𝑋’s over the communication channel costs 𝐻(𝑋) per copy.]

18

Page 19:

𝐻(𝑋) is nicer than 𝐶(𝑋)

• 𝐻 𝑋 is additive for independent variables.

• Let 𝑇1, 𝑇2 ∈𝑈 {1,2,3} be independent trits.

• 𝐻 𝑇1𝑇2 = log 9 = 2 log 3.

• 𝐶(𝑇1𝑇2) = 29/9 < 𝐶(𝑇1) + 𝐶(𝑇2) = 2 × 5/3 = 30/9.

• Works well with concepts such as channel capacity.

19

Page 20:

“Proof” of Shannon’s noiseless coding

• 𝑛 ⋅ 𝐻(𝑋) = 𝐻(𝑋𝑛) ≤ 𝐶(𝑋𝑛) ≤ 𝐻(𝑋𝑛) + 1.

• Therefore lim𝑛→∞ 𝐶(𝑋𝑛)/𝑛 = 𝐻(𝑋).

[The equality on the left is additivity of entropy; the upper bound is compression (Huffman).]

20

Page 21:

Operationalizing other quantities

• Conditional entropy 𝐻(𝑋|𝑌):

• (cf. Slepian-Wolf Theorem).

[Diagram: Alice holds 𝑋1, … , 𝑋𝑛, …, Bob holds 𝑌1, … , 𝑌𝑛, …; transmitting the 𝑋’s over the communication channel costs 𝐻(𝑋|𝑌) per copy.]

Page 22:

Operationalizing other quantities

• Mutual information 𝐼(𝑋; 𝑌):

[Diagram: Alice holds 𝑋1, … , 𝑋𝑛, …; over the communication channel, Bob obtains samples of the 𝑌’s at a cost of 𝐼(𝑋; 𝑌) per copy.]

Page 23:

Information theory and entropy

• Allows us to formalize intuitive notions.

• Operationalized in the context of one-way transmission and related problems.

• Has nice properties (additivity, chain rule…)

• Next, we discuss extensions to more interesting communication scenarios.

23

Page 24:

Communication complexity

• Focus on the two party randomized setting.

24

[Diagram: Alice (A) holds X, Bob (B) holds Y, with shared randomness R; together they implement a functionality 𝐹(𝑋, 𝑌) and output F(X,Y). E.g. 𝐹(𝑋, 𝑌) = “𝑋 = 𝑌?”.]

Page 25:

Communication complexity

Goal: implement a functionality 𝐹(𝑋, 𝑌). A protocol 𝜋(𝑋, 𝑌) computing 𝐹(𝑋, 𝑌):

[Diagram: Alice (X) and Bob (Y), with shared randomness R, alternate messages m1(X,R), m2(Y,m1,R), m3(X,m1,m2,R), …, and output F(X,Y).]

Communication cost = # of bits exchanged.

Page 26:

Communication complexity

• Numerous applications/potential applications (some will be discussed later today).

• Considerably more difficult to obtain lower bounds than for transmission (still much easier than for other models of computation!).

26

Page 27:

Communication complexity

• (Distributional) communication complexity with input distribution 𝜇 and error 𝜀: 𝐶𝐶 𝐹, 𝜇, 𝜀 . Error ≤ 𝜀 w.r.t. 𝜇.

• (Randomized/worst-case) communication complexity: 𝐶𝐶(𝐹, 𝜀). Error ≤ 𝜀 on all inputs.

• Yao’s minimax: 𝐶𝐶(𝐹, 𝜀) = max𝜇 𝐶𝐶(𝐹, 𝜇, 𝜀).

27

Page 28:

Examples

• 𝑋, 𝑌 ∈ 0,1 𝑛.

• Equality 𝐸𝑄 𝑋, 𝑌 ≔ 1𝑋=𝑌.

• 𝐶𝐶(𝐸𝑄, 𝜀) ≈ log(1/𝜀).

• 𝐶𝐶(𝐸𝑄, 0) ≈ 𝑛.

28

Page 29:

Equality

• 𝐹 is “𝑋 = 𝑌?”.

• 𝜇 is a distribution where w.p. ½, 𝑋 = 𝑌, and w.p. ½, (𝑋, 𝑌) are random.

[Protocol: Alice sends MD5(X) [128 bits]; Bob replies “X=Y?” [1 bit]. An error occurs only if the hashes collide.]

• Shows that 𝐶𝐶(𝐸𝑄, 𝜇, 2^−129) ≤ 129.
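A minimal sketch (not from the slides) of this protocol in Python, using `hashlib.md5` as the 128-bit hash as on the slide; the helper names are hypothetical. Total communication is 128 + 1 = 129 bits, and an error can occur only on a hash collision.

```python
# A minimal sketch (not from the slides): the hashing protocol for Equality.
# Alice sends a 128-bit hash of X; Bob answers whether it matches his hash of Y.
import hashlib

def alice_message(x: bytes) -> bytes:
    return hashlib.md5(x).digest()          # 128 bits = 16 bytes

def bob_answer(y: bytes, alices_hash: bytes) -> bool:
    return hashlib.md5(y).digest() == alices_hash   # 1 more bit of communication

x = b"some long input"
y = b"some long input"
print(bob_answer(y, alice_message(x)))      # True; wrong only on a hash collision
```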

Page 30:

Examples

• 𝑋, 𝑌 ∈ 0,1 𝑛.

• Inner product: 𝐼𝑃(𝑋, 𝑌) ≔ ∑𝑖 𝑋𝑖 ⋅ 𝑌𝑖 (mod 2).

• 𝐶𝐶 𝐼𝑃, 0 = 𝑛 − 𝑜(𝑛).

In fact, using information complexity:

• 𝐶𝐶 𝐼𝑃, 𝜀 = 𝑛 − 𝑜𝜀(𝑛).

30

Page 31:

Information complexity

• Information complexity 𝐼𝐶(𝐹, 𝜀) is to communication complexity 𝐶𝐶(𝐹, 𝜀)

as

• Shannon’s entropy 𝐻(𝑋) is to transmission cost 𝐶(𝑋).

31

Page 32:

Information complexity

• The smallest amount of information Alice and Bob need to exchange to solve 𝐹.

• How is information measured?

• Communication cost of a protocol?

– Number of bits exchanged.

• Information cost of a protocol?

– Amount of information revealed.

32

Page 33:

Basic definition 1: The information cost of a protocol

• Prior distribution: 𝑋, 𝑌 ∼ 𝜇.

[Diagram: Alice (with input X) and Bob (with input Y) run the protocol 𝜋, producing the transcript Π.]

𝐼𝐶(𝜋, 𝜇) = 𝐼(Π; 𝑌|𝑋) + 𝐼(Π; 𝑋|𝑌)

what Alice learns about Y + what Bob learns about X

Page 34:

Example

• 𝐹 is “𝑋 = 𝑌?”.

• 𝜇 is a distribution where w.p. ½, 𝑋 = 𝑌, and w.p. ½, (𝑋, 𝑌) are random.

[Protocol: Alice sends MD5(X) [128 bits]; Bob replies “X=Y?” [1 bit].]

𝐼𝐶(𝜋, 𝜇) = 𝐼(Π; 𝑌|𝑋) + 𝐼(Π; 𝑋|𝑌) ≈ 1 + 64.5 = 65.5 bits

(what Alice learns about Y + what Bob learns about X)

Page 35:

Prior 𝜇 matters a lot for information cost!

• If 𝜇 = 1(𝑥,𝑦) is a singleton (a point mass on a single input pair), then 𝐼𝐶(𝜋, 𝜇) = 0.

35

Page 36:

Example

• 𝐹 is “𝑋 = 𝑌?”.

• 𝜇 is a distribution where (𝑋, 𝑌) are just uniformly random.

[Protocol: Alice sends MD5(X) [128 bits]; Bob replies “X=Y?” [1 bit].]

𝐼𝐶(𝜋, 𝜇) = 𝐼(Π; 𝑌|𝑋) + 𝐼(Π; 𝑋|𝑌) ≈ 0 + 128 = 128 bits

(what Alice learns about Y + what Bob learns about X)

Page 37:

Basic definition 2: Information complexity

• Communication complexity:

𝐶𝐶(𝐹, 𝜇, 𝜀) ≔ min over protocols 𝜋 computing 𝐹 with error ≤ 𝜀 of |𝜋| (the communication cost of 𝜋).

• Analogously:

𝐼𝐶(𝐹, 𝜇, 𝜀) ≔ inf over protocols 𝜋 computing 𝐹 with error ≤ 𝜀 of 𝐼𝐶(𝜋, 𝜇).

37

(The infimum, rather than a minimum, is needed here!)

Page 38:

Prior-free information complexity

• Using minimax can get rid of the prior.

• For communication, we had:

𝐶𝐶(𝐹, 𝜀) = max𝜇 𝐶𝐶(𝐹, 𝜇, 𝜀).

• For information:

𝐼𝐶(𝐹, 𝜀) ≔ inf over protocols 𝜋 computing 𝐹 with error ≤ 𝜀 of max𝜇 𝐼𝐶(𝜋, 𝜇).

38

Page 39:

Ex: The information complexity of Equality

• What is 𝐼𝐶(𝐸𝑄, 0)?

• Consider the following protocol.

39

[Diagram: Alice holds X ∈ {0,1}ⁿ, Bob holds Y ∈ {0,1}ⁿ; 𝐴 is a non-singular matrix in 𝒁ⁿ×ⁿ.]

[Protocol: Alice sends A1·X, Bob replies A1·Y; Alice sends A2·X, Bob replies A2·Y; … Continue for n steps, or until a disagreement is discovered.]
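A minimal sketch (not from the slides) of one plausible reading of this protocol: instead of the rows 𝐴𝑖 of a fixed non-singular matrix, it uses fresh public random rows over GF(2), and the function names are hypothetical. If X ≠ Y it stops after O(1) rounds in expectation, matching the analysis on the next slide.

```python
# A minimal sketch (not from the slides), over GF(2): in round i the players
# exchange <A_i, X> and <A_i, Y> for a public random row A_i and stop as soon as
# the two bits disagree. (The slide uses rows of a fixed non-singular matrix A;
# fresh random rows are used here for simplicity.)
import random

def equality_protocol(x, y, rng):
    """Returns (answer, bits_exchanged). x, y are lists of 0/1 of equal length."""
    n = len(x)
    bits = 0
    for _ in range(n):                                  # at most n rounds
        a = [rng.randint(0, 1) for _ in range(n)]       # public random row A_i
        ax = sum(ai * xi for ai, xi in zip(a, x)) % 2    # Alice's bit
        ay = sum(ai * yi for ai, yi in zip(a, y)) % 2    # Bob's bit
        bits += 2
        if ax != ay:
            return False, bits                           # disagreement: X != Y
    return True, bits                                    # no disagreement seen

rng = random.Random(1)
x = [rng.randint(0, 1) for _ in range(64)]
y = list(x); y[10] ^= 1
print(equality_protocol(x, y, rng))   # typically stops after O(1) rounds
print(equality_protocol(x, x, rng))   # runs all n rounds, learns only "X = Y"
```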

Page 40:

Analysis (sketch)

• If X≠Y, the protocol will terminate in O(1) rounds on average, and thus reveal O(1) information.

• If X=Y… the players only learn the fact that X=Y (≤1 bit of information).

• Thus the protocol has O(1) information complexity for any prior 𝜇.

40

Page 41:

Operationalizing IC: Information equals amortized communication

• Recall [Shannon]: lim𝑛→∞ 𝐶(𝑋𝑛)/𝑛 = 𝐻(𝑋).

• Turns out: lim𝑛→∞ 𝐶𝐶(𝐹𝑛, 𝜇𝑛, 𝜀)/𝑛 = 𝐼𝐶(𝐹, 𝜇, 𝜀), for 𝜀 > 0. [Error 𝜀 allowed on each copy.]

• For 𝜀 = 0: lim𝑛→∞ 𝐶𝐶(𝐹𝑛, 𝜇𝑛, 0+)/𝑛 = 𝐼𝐶(𝐹, 𝜇, 0).

• [lim𝑛→∞ 𝐶𝐶(𝐹𝑛, 𝜇𝑛, 0)/𝑛 is an interesting open problem.]

41

Page 42:

Information = amortized communication

• lim𝑛→∞ 𝐶𝐶(𝐹𝑛, 𝜇𝑛, 𝜀)/𝑛 = 𝐼𝐶(𝐹, 𝜇, 𝜀).

• Two directions: “≤” and “≥”.

• Compare: 𝑛 ⋅ 𝐻(𝑋) = 𝐻(𝑋𝑛) ≤ 𝐶(𝑋𝑛) ≤ 𝐻(𝑋𝑛) + 1.

[As before: the equality is additivity of entropy; the upper bound is compression (Huffman).]

Page 43:

The “≤” direction

• lim𝑛→∞ 𝐶𝐶(𝐹𝑛, 𝜇𝑛, 𝜀)/𝑛 ≤ 𝐼𝐶(𝐹, 𝜇, 𝜀).

• Start with a protocol 𝜋 solving 𝐹, whose 𝐼𝐶 𝜋, 𝜇 is close to 𝐼𝐶 𝐹, 𝜇, 𝜀 .

• Show how to compress many copies of 𝜋 into a protocol whose communication cost is close to its information cost.

• More on compression later.

43

Page 44:

The “≥” direction

• lim𝑛→∞ 𝐶𝐶(𝐹𝑛, 𝜇𝑛, 𝜀)/𝑛 ≥ 𝐼𝐶(𝐹, 𝜇, 𝜀).

• Use the fact that 𝐶𝐶(𝐹𝑛, 𝜇𝑛, 𝜀)/𝑛 ≥ 𝐼𝐶(𝐹𝑛, 𝜇𝑛, 𝜀)/𝑛.

• Additivity of information complexity: 𝐼𝐶(𝐹𝑛, 𝜇𝑛, 𝜀)/𝑛 = 𝐼𝐶(𝐹, 𝜇, 𝜀).

44

Page 45:

Proof: Additivity of information complexity

• Let 𝑇1(𝑋1, 𝑌1) and 𝑇2(𝑋2, 𝑌2) be two two-party tasks.

• E.g. “Solve 𝐹(𝑋, 𝑌) with error ≤ 𝜀 w.r.t. 𝜇”

• Then

𝐼𝐶(𝑇1 × 𝑇2, 𝜇1 × 𝜇2) = 𝐼𝐶(𝑇1, 𝜇1) + 𝐼𝐶(𝑇2, 𝜇2)

• “≤” is easy.

• “≥” is the interesting direction.

Page 46:

𝐼𝐶(𝑇1, 𝜇1) + 𝐼𝐶(𝑇2, 𝜇2) ≤ 𝐼𝐶(𝑇1 × 𝑇2, 𝜇1 × 𝜇2)

• Start from a protocol 𝜋 for 𝑇1 × 𝑇2 with prior 𝜇1 × 𝜇2, whose information cost is 𝐼.

• Show how to construct two protocols 𝜋1 for 𝑇1 with prior 𝜇1 and 𝜋2 for 𝑇2 with prior 𝜇2, with information costs 𝐼1 and 𝐼2, respectively, such that 𝐼1 + 𝐼2 = 𝐼.

46

Page 47:

𝜋((𝑋1, 𝑋2), (𝑌1, 𝑌2))

𝜋1(𝑋1, 𝑌1):
• Publicly sample 𝑋2 ∼ 𝜇2.
• Bob privately samples 𝑌2 ∼ 𝜇2|𝑋2.
• Run 𝜋((𝑋1, 𝑋2), (𝑌1, 𝑌2)).

𝜋2(𝑋2, 𝑌2):
• Publicly sample 𝑌1 ∼ 𝜇1.
• Alice privately samples 𝑋1 ∼ 𝜇1|𝑌1.
• Run 𝜋((𝑋1, 𝑋2), (𝑌1, 𝑌2)).

Page 48:

𝜋1(𝑋1, 𝑌1):
• Publicly sample 𝑋2 ∼ 𝜇2.
• Bob privately samples 𝑌2 ∼ 𝜇2|𝑋2.
• Run 𝜋((𝑋1, 𝑋2), (𝑌1, 𝑌2)).

Analysis - 𝜋1

• Alice learns about 𝑌1: 𝐼(Π; 𝑌1|𝑋1𝑋2).

• Bob learns about 𝑋1: 𝐼(Π; 𝑋1|𝑌1𝑌2𝑋2).

• 𝐼1 = 𝐼(Π; 𝑌1|𝑋1𝑋2) + 𝐼(Π; 𝑋1|𝑌1𝑌2𝑋2).

48

Page 49:

𝜋2(𝑋2, 𝑌2):
• Publicly sample 𝑌1 ∼ 𝜇1.
• Alice privately samples 𝑋1 ∼ 𝜇1|𝑌1.
• Run 𝜋((𝑋1, 𝑋2), (𝑌1, 𝑌2)).

Analysis - 𝜋2

• Alice learns about 𝑌2: 𝐼(Π; 𝑌2|𝑋1𝑋2𝑌1).

• Bob learns about 𝑋2: 𝐼(Π; 𝑋2|𝑌1𝑌2).

• 𝐼2 = 𝐼(Π; 𝑌2|𝑋1𝑋2𝑌1) + 𝐼(Π; 𝑋2|𝑌1𝑌2).

49

Page 50:

Adding 𝐼1 and 𝐼2

𝐼1 + 𝐼2 = 𝐼(Π; 𝑌1|𝑋1𝑋2) + 𝐼(Π; 𝑋1|𝑌1𝑌2𝑋2) + 𝐼(Π; 𝑌2|𝑋1𝑋2𝑌1) + 𝐼(Π; 𝑋2|𝑌1𝑌2)

= [𝐼(Π; 𝑌1|𝑋1𝑋2) + 𝐼(Π; 𝑌2|𝑋1𝑋2𝑌1)] + [𝐼(Π; 𝑋2|𝑌1𝑌2) + 𝐼(Π; 𝑋1|𝑌1𝑌2𝑋2)]

= 𝐼(Π; 𝑌1𝑌2|𝑋1𝑋2) + 𝐼(Π; 𝑋1𝑋2|𝑌1𝑌2) = 𝐼   (by the chain rule for mutual information).

50

Page 51:

Summary

• Information complexity is additive.

• Operationalized via “Information = amortized communication”.

• lim𝑛→∞ 𝐶𝐶(𝐹𝑛, 𝜇𝑛, 𝜀)/𝑛 = 𝐼𝐶(𝐹, 𝜇, 𝜀).

• Seems to be the “right” analogue of entropy for interactive computation.

51

Page 52:

Entropy vs. Information Complexity

                  Entropy                       IC
Additive?         Yes                           Yes
Operationalized   lim𝑛→∞ 𝐶(𝑋𝑛)/𝑛               lim𝑛→∞ 𝐶𝐶(𝐹𝑛, 𝜇𝑛, 𝜀)/𝑛
Compression?      Huffman: 𝐶(𝑋) ≤ 𝐻(𝑋) + 1     ???!

Page 53:

Can interactive communication be compressed?

• Is it true that 𝐶𝐶 𝐹, 𝜇, 𝜀 ≤ 𝐼𝐶 𝐹, 𝜇, 𝜀 + 𝑂(1)?

• Less ambitiously:

𝐶𝐶 𝐹, 𝜇, 𝑂(𝜀) = 𝑂 𝐼𝐶 𝐹, 𝜇, 𝜀 ?

• (Almost) equivalently: Given a protocol 𝜋 with 𝐼𝐶 𝜋, 𝜇 = 𝐼, can Alice and Bob simulate 𝜋 using 𝑂 𝐼 communication?

• Not known in general…

53

Page 54:

Direct sum theorems

• Let 𝐹 be any functionality.

• Let 𝐶(𝐹) be the cost of implementing 𝐹.

• Let 𝐹𝑛 be the functionality of implementing 𝑛 independent copies of 𝐹.

• The direct sum problem:

“Does 𝐶(𝐹𝑛) ≈ 𝑛 ∙ 𝐶(𝐹)?”

• In most cases it is obvious that 𝐶(𝐹𝑛) ≤ 𝑛 ∙ 𝐶(𝐹).

54

Page 55:

Direct sum – randomized communication complexity

• Is it true that

𝐶𝐶(𝐹𝑛, 𝜇𝑛, 𝜀) = Ω(𝑛 ∙ 𝐶𝐶(𝐹, 𝜇, 𝜀))?

• Is it true that 𝐶𝐶(𝐹𝑛, 𝜀) = Ω(𝑛 ∙ 𝐶𝐶(𝐹, 𝜀))?

55

Page 56:

Direct product – randomized communication complexity

• Direct sum

𝐶𝐶(𝐹𝑛, 𝜇𝑛, 𝜀) = Ω(𝑛 ∙ 𝐶𝐶(𝐹, 𝜇, 𝜀))?

• Direct product

𝐶𝐶(𝐹𝑛, 𝜇𝑛, (1 − 𝜀)𝑛) = Ω(𝑛 ∙ 𝐶𝐶(𝐹, 𝜇, 𝜀))?

56

Page 57:

Direct sum for randomized CC and interactive compression

Direct sum:

• 𝐶𝐶(𝐹𝑛, 𝜇𝑛, 𝜀) = Ω(𝑛 ∙ 𝐶𝐶(𝐹, 𝜇, 𝜀))?

In the limit:

• 𝑛 ⋅ 𝐼𝐶(𝐹, 𝜇, 𝜀) = Ω(𝑛 ∙ 𝐶𝐶(𝐹, 𝜇, 𝜀))?

Interactive compression:

• 𝐶𝐶 𝐹, 𝜇, 𝜀 = 𝑂(𝐼𝐶 𝐹, 𝜇, 𝜀) ?

Same question!

57

Page 58:

The big picture

[Diagram relating four quantities: 𝐶𝐶(𝐹, 𝜇, 𝜀), 𝐶𝐶(𝐹𝑛, 𝜇𝑛, 𝜀)/𝑛, 𝐼𝐶(𝐹, 𝜇, 𝜀), and 𝐼𝐶(𝐹𝑛, 𝜇𝑛, 𝜀)/𝑛. Additivity (= direct sum) for information connects the two 𝐼𝐶 quantities; “information = amortized communication” connects 𝐼𝐶(𝐹, 𝜇, 𝜀) to 𝐶𝐶(𝐹𝑛, 𝜇𝑛, 𝜀)/𝑛; the open questions are the direct sum for communication and interactive compression.]

Page 59:

Current results for compression

A protocol 𝜋 that has 𝐶 bits of communication, conveys 𝐼 bits of information over prior 𝜇, and works in 𝑟 rounds can be simulated:

• Using 𝑂(𝐼 + 𝑟) bits of communication.

• Using 𝑂(√(𝐼 ⋅ 𝐶)) bits of communication.

• Using 2^𝑂(𝐼) bits of communication.

• If 𝜇 = 𝜇𝑋 × 𝜇𝑌, then using 𝑂(𝐼 polylog 𝐶) bits of communication.

59

Page 60:

Their direct sum counterparts

• 𝐶𝐶(𝐹𝑛, 𝜇𝑛, 𝜀) = Ω(√𝑛 ∙ 𝐶𝐶(𝐹, 𝜇, 𝜀)).

• 𝐶𝐶(𝐹𝑛, 𝜀) = Ω(√𝑛 ∙ 𝐶𝐶(𝐹, 𝜀)).

For product distributions 𝜇 = 𝜇𝑋 × 𝜇𝑌,

• 𝐶𝐶(𝐹𝑛, 𝜇𝑛, 𝜀) = Ω(𝑛 ∙ 𝐶𝐶(𝐹, 𝜇, 𝜀)).

When the number of rounds is bounded by 𝑟 ≪ 𝑛, a direct sum theorem holds.

60

Page 61:

Direct product

• The best one can hope for is a statement of the type:

𝐶𝐶(𝐹𝑛, 𝜇𝑛, 1 − 2^−𝑂(𝑛)) = Ω(𝑛 ∙ 𝐼𝐶(𝐹, 𝜇, 1/3)).

• Can prove:

𝐶𝐶(𝐹𝑛, 𝜇𝑛, 1 − 2^−𝑂(𝑛)) = Ω(√𝑛 ∙ 𝐶𝐶(𝐹, 𝜇, 1/3)).

61

Page 62:

Proof 2: Compressing a one-round protocol

• Say Alice speaks: 𝐼𝐶(𝜋, 𝜇) = 𝐼(𝑀; 𝑋|𝑌).

• Recall the connection to KL-divergence: 𝐼(𝑀; 𝑋|𝑌) = 𝐸𝑋𝑌[𝐷(𝑀𝑋𝑌 ∥ 𝑀𝑌)] = 𝐸𝑋𝑌[𝐷(𝑀𝑋 ∥ 𝑀𝑌)], where 𝑀𝑋 is the distribution of Alice’s message given 𝑋, and 𝑀𝑌 is its distribution given only 𝑌.

• Bottom line:

– Alice has 𝑀𝑋; Bob has 𝑀𝑌;

– Goal: sample from 𝑀𝑋 using ∼ 𝐷(𝑀𝑋 ∥ 𝑀𝑌) communication.

62

Page 63:

The dart board

• Interpret the public randomness as random points in 𝑈 × [0,1], where 𝑈 is the universe of all possible messages.

• The first message under the histogram of 𝑀 is distributed ∼ 𝑀.

[Figure: random darts u1, u2, u3, … at heights q1, q2, q3, … scattered over 𝑈 × [0,1], compared against the histogram of 𝑀.]
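A minimal sketch (not from the slides) of the second bullet: choosing public random darts (𝑢, height) in 𝑈 × [0,1] and taking the first one that lands under the histogram of 𝑀 yields a sample distributed exactly according to 𝑀. The helper below is hypothetical.

```python
# A minimal sketch (not from the slides): the "dart board" view. Interpreting
# public randomness as random points in U x [0,1], the first dart that lands
# under the histogram of M is distributed exactly according to M.
import random
from collections import Counter

def first_dart_under(M, rng):
    """M: dict mapping messages to probabilities; returns the first accepted dart."""
    universe = list(M)
    while True:
        u = rng.choice(universe)          # dart's horizontal position
        if rng.random() < M[u]:           # dart's height lies under the histogram of M
            return u

rng = random.Random(0)
M = {'a': 0.5, 'b': 0.3, 'c': 0.2}
counts = Counter(first_dart_under(M, rng) for _ in range(100000))
print({u: counts[u] / 100000 for u in M})   # empirical frequencies close to M
```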

Page 64:

64

Proof Idea

• Sample using 𝑂(log(1/𝜀) + 𝐷(𝑀𝑋 ∥ 𝑀𝑌)) communication with statistical error 𝜀.

[Figure: ∼|U| public random darts u1, u2, u3, … are compared against the histograms of 𝑀𝑋 (on Alice’s side) and 𝑀𝑌 (on Bob’s side).]

Page 65:

Proof Idea

• Sample using 𝑂(log(1/𝜀) + 𝐷(𝑀𝑋 ∥ 𝑀𝑌)) communication with statistical error 𝜀.

[Figure: Alice’s first dart under 𝑀𝑋 is u4; she sends hashes h1(u4), h2(u4); Bob checks them against his candidates under 𝑀𝑌 (e.g. u2).]

65

Page 66:

66

Proof Idea

• Sample using 𝑂(log(1/𝜀) + 𝐷(𝑀𝑋 ∥ 𝑀𝑌)) communication with statistical error 𝜀.

[Figure: Bob repeatedly doubles his threshold (𝑀𝑌, 2𝑀𝑌, …) while Alice sends further hashes h3(u4), h4(u4), …, h_{log 1/𝜀}(u4), until a single candidate (u4) survives.]

Page 67:

67

Analysis

• If 𝑀𝑋(𝑢4) ≈ 2^𝑘 𝑀𝑌(𝑢4), then the protocol will reach round 𝑘 of doubling.

• There will be ≈ 2^𝑘 candidates.

• About 𝑘 + log(1/𝜀) hashes to narrow down to one.

• The contribution of 𝑢4 to the cost: 𝑀𝑋(𝑢4) ⋅ (log(𝑀𝑋(𝑢4)/𝑀𝑌(𝑢4)) + log(1/𝜀)).

• Done! Recall 𝐷(𝑀𝑋 ∥ 𝑀𝑌) ≔ ∑𝑢 𝑀𝑋(𝑢) log(𝑀𝑋(𝑢)/𝑀𝑌(𝑢)).

Page 68:

External information cost

• (𝑋, 𝑌)~𝜇.

[Diagram: Alice (X) and Bob (Y) run protocol 𝜋, producing transcript Π; an external observer Charlie (C) watches the transcript.]

𝐼𝐶𝑒𝑥𝑡(𝜋, 𝜇) = 𝐼(Π; 𝑋𝑌)

(what Charlie learns about (X,Y))

Page 69:

Example

69

• F is “X=Y?”.

• μ is a distribution where w.p. ½ X=Y and w.p. ½ (X,Y) are random.

[Protocol: Alice sends MD5(X); Bob replies “X=Y?”.]

𝐼𝐶𝑒𝑥𝑡(𝜋, 𝜇) = 𝐼(Π; 𝑋𝑌) = 129 bits

(what Charlie learns about (X,Y))

Page 70:

External information cost

• It is always the case that 𝐼𝐶𝑒𝑥𝑡 𝜋, 𝜇 ≥ 𝐼𝐶 𝜋, 𝜇 .

• If 𝜇 = 𝜇𝑋 × 𝜇𝑌 is a product distribution, then

𝐼𝐶𝑒𝑥𝑡 𝜋, 𝜇 = 𝐼𝐶 𝜋, 𝜇 .

70

Page 71:

External information complexity

• 𝐼𝐶𝑒𝑥𝑡(𝐹, 𝜇, 𝜀) ≔ inf over protocols 𝜋 computing 𝐹 with error ≤ 𝜀 of 𝐼𝐶𝑒𝑥𝑡(𝜋, 𝜇).

• Can it be operationalized?

71

Page 72:

Operational meaning of 𝐼𝐶𝑒𝑥𝑡?

• Conjecture: Zero-error communication scales like external information:

lim𝑛→∞ 𝐶𝐶(𝐹𝑛, 𝜇𝑛, 0)/𝑛 = 𝐼𝐶𝑒𝑥𝑡(𝐹, 𝜇, 0)?

• Recall:

lim𝑛→∞ 𝐶𝐶(𝐹𝑛, 𝜇𝑛, 0+)/𝑛 = 𝐼𝐶(𝐹, 𝜇, 0).

72

Page 73:

Example – transmission with a strong prior

• 𝑋, 𝑌 ∈ {0,1}.

• 𝜇 is such that 𝑋 ∈𝑈 {0,1}, and 𝑋 = 𝑌 with very high probability (say 1 − 1/√𝑛).

• 𝐹(𝑋, 𝑌) = 𝑋 is just the “transmit 𝑋” function.

• Clearly, 𝜋 should just have Alice send 𝑋 to Bob.

• 𝐼𝐶(𝐹, 𝜇, 0) = 𝐼𝐶(𝜋, 𝜇) = 𝐻(1/√𝑛) = o(1).

• 𝐼𝐶𝑒𝑥𝑡(𝐹, 𝜇, 0) = 𝐼𝐶𝑒𝑥𝑡(𝜋, 𝜇) = 1.

73
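A minimal sketch (not from the slides) of this computation for the one-message protocol “Alice sends 𝑋”, with a generic flip probability `delta` standing in for the slide’s small probability that 𝑋 ≠ 𝑌; the helper names are hypothetical.

```python
# A minimal sketch (not from the slides): internal vs. external information cost
# of the protocol "Alice sends X" under a strongly correlated prior mu, where X
# is a uniform bit and Y = X except with probability delta.
import math

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

delta = 1e-4                   # think of delta as the slide's vanishing flip probability
# The transcript is Pi = X, so:
ic_internal = h(delta)         # I(Pi; X | Y) = H(X | Y) = h(delta); I(Pi; Y | X) = 0
ic_external = 1.0              # I(Pi; XY) = H(X) = 1
print(ic_internal, ic_external)   # ~0.0015 vs 1.0: IC is o(1), IC_ext is 1
```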

Page 74:

Example – transmission with a strong prior

• 𝐼𝐶(𝐹, 𝜇, 0) = 𝐼𝐶(𝜋, 𝜇) = 𝐻(1/√𝑛) = o(1).

• 𝐼𝐶𝑒𝑥𝑡(𝐹, 𝜇, 0) = 𝐼𝐶𝑒𝑥𝑡(𝜋, 𝜇) = 1.

• 𝐶𝐶(𝐹𝑛, 𝜇𝑛, 0+) = 𝑜(𝑛).

• 𝐶𝐶(𝐹𝑛, 𝜇𝑛, 0) = Ω(𝑛).

Other examples, e.g. the two-bit AND function fit into this picture.

74

Page 75:

Additional directions

75

Information complexity

Interactive coding

Information theory in TCS

Page 76:

Interactive coding theory

• So far we have focused the discussion on noiseless coding.

• What if the channel has noise?

• [What kind of noise?]

• In the non-interactive case, each channel has a capacity 𝐶.

76

Page 77:

Channel capacity

• The amortized number of channel uses needed to send 𝑋 over a noisy channel of capacity 𝐶 is 𝐻(𝑋)/𝐶.

• Decouples the task from the channel!

77

Page 78:

Example: Binary Symmetric Channel

• Each bit gets independently flipped with probability 𝜀 < 1/2.

• One-way capacity: 1 − 𝐻(𝜀).

78

[Diagram: binary symmetric channel; each bit is transmitted correctly w.p. 1 − 𝜀 and flipped w.p. 𝜀.]
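A minimal sketch (not from the slides) combining the last two slides: the one-way capacity 1 − 𝐻(𝜀) of the binary symmetric channel, and the amortized number of channel uses 𝐻(𝑋)/𝐶 needed to send a uniform trit across it. The numbers are illustrative.

```python
# A minimal sketch (not from the slides): capacity of BSC(eps), C = 1 - H(eps),
# and the amortized number of channel uses H(X)/C to send a uniform trit.
import math

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

eps = 0.11
C = 1 - h(eps)              # capacity of BSC(0.11), roughly 0.5 bits per channel use
H_trit = math.log2(3)       # H(X) for a uniform trit
print(C, H_trit / C)        # ~0.5 and ~3.17 channel uses per trit
```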

Page 79:

Interactive channel capacity

• Not clear one can decouple channel from task in such a clean way.

• Capacity much harder to calculate/reason about.

• Example: Binary symmetric channel.

• One-way capacity: 1 − 𝐻(𝜀).

• Interactive (for simple pointer jumping, [Kol-Raz’13]): 1 − Θ(√𝐻(𝜀)).

[Diagram: binary symmetric channel with crossover probability 𝜀.]

Page 80:

Information theory in communication complexity and beyond

• A natural extension would be to multi-party communication complexity.

• Some success in the number-in-hand case.

• What about the number-on-forehead?

• Explicit bounds for ≥ log 𝑛 players would imply explicit 𝐴𝐶𝐶0 circuit lower bounds.

80

Page 81:

Naïve multi-party information cost

𝐼𝐶(𝜋, 𝜇) = 𝐼(Π; 𝑋|𝑌𝑍) + 𝐼(Π; 𝑌|𝑋𝑍) + 𝐼(Π; 𝑍|𝑋𝑌)

81

[Diagram: players A, B, C; A sees (Y, Z), B sees (X, Z), C sees (X, Y).]

Page 82:

Naïve multi-party information cost

𝐼𝐶(𝜋, 𝜇) = 𝐼(Π; 𝑋|𝑌𝑍) + 𝐼(Π; 𝑌|𝑋𝑍) + 𝐼(Π; 𝑍|𝑋𝑌)

• Doesn’t seem to work.

• Secure multi-party computation [Ben-Or, Goldwasser, Wigderson] means that anything can be computed at near-zero information cost.

• Although these constructions require the players to share private channels/randomness.

82

Page 83:

Communication and beyond…

• The rest of today:

– Data structures;

– Streaming;

– Distributed computing;

– Privacy.

• Exact communication complexity bounds.

• Extended formulations lower bounds.

• Parallel repetition?

• …

83

Page 84:

84

Thank You!

Page 85:

Open problem: Computability of IC

• Given the truth table of 𝐹(𝑋, 𝑌), 𝜇, and 𝜀, compute 𝐼𝐶(𝐹, 𝜇, 𝜀).

• Via 𝐼𝐶(𝐹, 𝜇, 𝜀) = lim𝑛→∞ 𝐶𝐶(𝐹𝑛, 𝜇𝑛, 𝜀)/𝑛 one can compute a sequence of upper bounds.

• But the rate of convergence as a function of 𝑛 is unknown.

85

Page 86:

Open problem: Computability of IC

• Can compute the 𝑟-round information complexity 𝐼𝐶𝑟(𝐹, 𝜇, 𝜀) of 𝐹.

• But the rate of convergence as a function of 𝑟 is unknown.

• Conjecture:

𝐼𝐶𝑟(𝐹, 𝜇, 𝜀) − 𝐼𝐶(𝐹, 𝜇, 𝜀) = 𝑂𝐹,𝜇,𝜀(1/𝑟²).

• This is the relationship for the two-bit AND.

86