
LECTURE NOTES ON INFORMATION THEORY

Preface

“There is a whole book of readymade, long and convincing, lavishly composed telegrams for all occasions. Sending such a telegram costs only twenty-five cents. You see, what gets transmitted over the telegraph is not the text of the telegram, but simply the number under which it is listed in the book, and the signature of the sender. This is quite a funny thing, reminiscent of Drugstore Breakfast #2. Everything is served up in a ready form, and the customer is totally freed from the unpleasant necessity to think, and to spend money on top of it.”

Little Golden America. Travelogue by I. Ilf and E. Petrov, 1937.

[Pre-Shannon encoding, courtesy of M. Raginsky]

These notes provide a graduate-level introduction to the mathematics of Information Theory. They were created by Yury Polyanskiy and Yihong Wu, who used them to teach at MIT (2012, 2013 and 2016), UIUC (2013, 2014) and Yale (2017). The core structure and flow of material is largely due to Prof. Sergio Verdú, whose wonderful class at Princeton University [Ver07] shaped our own perception of the subject. Specifically, we follow Prof. Verdú's style in relying on single-shot results, Feinstein's lemma and information-spectrum methods. We have added a number of technical refinements and new topics that correspond to our own interests (e.g., modern aspects of finite-blocklength results and applications of information-theoretic methods to statistical decision theory and combinatorics).

Compared to the more popular “typicality” and “method of types” approaches (as in Cover-Thomas [CT06] and Csiszár-Körner [CK81b]), these notes prepare the reader to consider delay constraints (“non-asymptotics”) and to treat continuous and discrete sources/channels simultaneously.

We are especially thankful to Dr. O. Ordentlich, who contributed a lecture on lattice codes. The initial version was typed by Qingqing Huang and Austin Collins, who also created many of the graphics. Rachel Cohen has also edited various parts. Aolin Xu, Pengkun Yang, Ganesh Ajjanagadde, Anuran Makur, Jason Klusowski, and Sheng Xu have contributed suggestions and corrections to the content. We are indebted to all of them.

Y. Polyanskiy <[email protected]>

Y. Wu <[email protected]>

18 Aug 2016

This version: August 18, 2017


Contents

Notations

I Information measures

1 Information measures: entropy and divergence
  1.1 Entropy
  1.2 Entropy: axiomatic characterization
  1.3 History of entropy
  1.4* Entropy: submodularity
  1.5* Entropy: Han's inequality and Shearer's Lemma
  1.6 Divergence
  1.7 Differential entropy

2 Information measures: mutual information
  2.1 Divergence: main inequality
  2.2 Conditional divergence
  2.3 Mutual information
  2.4 Conditional mutual information and conditional independence
  2.5 Strong data-processing inequalities
  2.6* How to avoid measurability problems?

3 Sufficient statistic. Continuity of divergence and mutual information
  3.1 Sufficient statistics and data-processing
  3.2 Geometric interpretation of mutual information
  3.3 Variational characterizations of divergence: Donsker-Varadhan
  3.4 Variational characterizations of divergence: Gelfand-Yaglom-Perez
  3.5 Continuity of divergence. Dependence on σ-algebra.
  3.6 Variational characterizations and continuity of mutual information

4 Extremization of mutual information: capacity saddle point
  4.1 Convexity of information measures
  4.2* Local behavior of divergence
  4.3* Local behavior of divergence and Fisher information
  4.4 Extremization of mutual information
  4.5 Capacity = information radius
  4.6 Existence of caod (general case)
  4.7 Gaussian saddle point

5 Single-letterization. Probability of error. Entropy rate.
  5.1 Extremization of mutual information for memoryless sources and channels
  5.2* Gaussian capacity via orthogonal symmetry
  5.3 Information measures and probability of error
  5.4 Entropy rate
  5.5 Entropy and symbol (bit) error rate
  5.6 Mutual information rate
  5.7* Toeplitz matrices and Szegő's theorem

6 f-divergences: definition and properties
  6.1 f-divergences
  6.2 Data processing inequality
  6.3 Total variation and hypothesis testing
  6.4 Motivating example: Hypothesis testing with multiple samples
  6.5 Inequalities between f-divergences

7 Inequalities between f-divergences via joint range
  7.1 Inequalities via joint range
  7.2 Examples
  7.3 Joint range between various divergences

II Lossless data compression

8 Variable-length Lossless Compression
  8.1 Variable-length, lossless, optimal compressor
  8.2 Uniquely decodable codes, prefix codes and Huffman codes

9 Fixed-length (almost lossless) compression. Slepian-Wolf.
  9.1 Fixed-length code, almost lossless, AEP
  9.2 Linear Compression
  9.3 Compression with side information at both compressor and decompressor
  9.4 Slepian-Wolf (Compression with side information at decompressor only)
  9.5 Multi-terminal Slepian-Wolf
  9.6* Source-coding with a helper (Ahlswede-Körner-Wyner)

10 Compressing stationary ergodic sources
  10.1 Bits of ergodic theory
  10.2 Proof of Shannon-McMillan
  10.3* Proof of Birkhoff-Khintchine
  10.4* Sinai's generator theorem

11 Universal compression
  11.1 Arithmetic coding
  11.2 Combinatorial construction of Fitingof
  11.3 Optimal compressors for a class of sources. Redundancy.
  11.4* Approximate minimax solution: Jeffreys prior
  11.5 Sequential probability assignment: Krichevsky-Trofimov
  11.6 Individual sequence and universal prediction
  11.7 Lempel-Ziv compressor

III Binary hypothesis testing

12 Binary hypothesis testing
  12.1 Binary Hypothesis Testing
  12.2 Neyman-Pearson formulation
  12.3 Likelihood ratio tests
  12.4 Converse bounds on R(P,Q)
  12.5 Achievability bounds on R(P,Q)
  12.6 Asymptotics

13 Hypothesis testing asymptotics I
  13.1 Stein's regime
  13.2 Chernoff regime
  13.3 Basics of Large deviation theory

14 Information projection and Large deviation
  14.1 Large-deviation exponents
  14.2 Information Projection
  14.3 Interpretation of Information Projection
  14.4 Generalization: Sanov's theorem

15 Hypothesis testing asymptotics II
  15.1 (E0, E1)-Tradeoff
  15.2 Equivalent forms of Theorem 15.1
  15.3* Sequential Hypothesis Testing

IV Channel coding

16 Channel coding
  16.1 Channel Coding
  16.2 Basic Results
  16.3 General (Weak) Converse Bounds
  16.4 General achievability bounds: Preview

17 Channel coding: achievability bounds
  17.1 Information density
  17.2 Shannon's achievability bound
  17.3 Dependence-testing bound
  17.4 Feinstein's Lemma

18 Linear codes. Channel capacity
  18.1 Linear coding
  18.2 Channels and channel capacity
  18.3 Bounds on Cε; Capacity of Stationary Memoryless Channels
  18.4 Examples of DMC
  18.5* Information Stability

19 Channels with input constraints. Gaussian channels.
  19.1 Channel coding with input constraints
  19.2 Capacity under input constraint C(P) ?= Ci(P)
  19.3 Applications
  19.4* Non-stationary AWGN
  19.5* Stationary Additive Colored Gaussian noise channel
  19.6* Additive White Gaussian Noise channel with Intersymbol Interference
  19.7* Gaussian channels with amplitude constraints
  19.8* Gaussian channels with fading

20 Lattice codes (by O. Ordentlich)
  20.1 Lattice Definitions
  20.2 First Attempt at AWGN Capacity
  20.3 Nested Lattice Codes/Voronoi Constellations
  20.4 Dirty Paper Coding
  20.5 Construction of Good Nested Lattice Pairs

21 Channel coding: energy-per-bit, continuous-time channels
  21.1 Energy per bit
  21.2 What is N0?
  21.3 Capacity of the continuous-time band-limited AWGN channel
  21.4 Capacity of the continuous-time band-unlimited AWGN channel
  21.5 Capacity per unit cost

22 Advanced channel coding. Source-Channel separation.
  22.1 Strong Converse
  22.2 Stationary memoryless channel without strong converse
  22.3 Channel Dispersion
  22.4 Normalized Rate
  22.5 Joint Source Channel Coding

23 Channel coding with feedback
  23.1 Feedback does not increase capacity for stationary memoryless channels
  23.2* Alternative proof of Theorem 23.1 and Massey's directed information
  23.3 When is feedback really useful?

24 Capacity-achieving codes via Forney concatenation
  24.1 Error exponents
  24.2 Achieving polynomially small error probability
  24.3 Concatenated codes
  24.4 Achieving exponentially small error probability

V Lossy data compression

25 Rate-distortion theory
  25.1 Scalar quantization
  25.2 Information-theoretic vector quantization
  25.3* Converting excess distortion to average

26 Rate distortion: achievability bounds
  26.1 Recap
  26.2 Shannon's rate-distortion theorem
  26.3* Covering lemma

27 Evaluating R(D). Lossy Source-Channel separation.
  27.1 Evaluation of R(D)
  27.2* Analog of saddle-point property in rate-distortion
  27.3 Lossy joint source-channel coding
  27.4 What is lacking in classical lossy compression?

VI Advanced topics

28 Applications to statistical decision theory
  28.1 Fano, LeCam and minimax risks
  28.2 Mutual information method

29 Multiple-access channel
  29.1 Problem motivation and main results
  29.2 MAC achievability bound
  29.3 MAC capacity region proof

30 Examples of MACs. Maximal Pe and zero-error capacity.
  30.1 Recap
  30.2 Orthogonal MAC
  30.3 BSC MAC
  30.4 Adder MAC
  30.5 Multiplier MAC
  30.6 Contraction MAC
  30.7 Gaussian MAC
  30.8 MAC Peculiarities

31 Random number generators
  31.1 Setup
  31.2 Converse
  31.3 Elias' construction of RNG from lossless compressors
  31.4 Peres' iterated von Neumann's scheme
  31.5 Bernoulli factory
  31.6 Related problems

32 Entropy method in combinatorics and geometry
  32.1 Binary vectors of average weights
  32.2 Shearer's lemma & counting subgraphs
  32.3 Bregman's Theorem
  32.4 Euclidean geometry: Bollobás-Thomason and Loomis-Whitney

Bibliography


Notations

• a ∧ b = min{a, b} and a ∨ b = max{a, b}.

• Leb: Lebesgue measure on Euclidean spaces.

• For p ∈ [0, 1], p̄ ≜ 1 − p.

• x+ = max{x, 0}.

• int(·), cl(·), co(·) denote interior, closure and convex hull.

• N = {1, 2, . . .}, Z+ = {0, 1, . . .}, R+ = {x : x ≥ 0}.

• Standard big-O notations are used: e.g., for any positive sequences an and bn, an = O(bn) if there is an absolute constant c > 0 such that an ≤ c·bn; an = Ω(bn) if bn = O(an); an = Θ(bn) if both an = O(bn) and an = Ω(bn); an = o(bn) or bn = ω(an) if an ≤ εn·bn for some εn → 0.


Part I

Information measures


§ 1. Information measures: entropy and divergence

Review: Random variables

• Two methods to describe a random variable (RV) X:

1. a function X : Ω→ X from the probability space (Ω,F ,P) to a target space X .

2. a distribution PX on some measurable space (X ,F).

• Convention: capital letter – RV (e.g. X); small letter – realization (e.g. x0).

• A RV X is discrete if there exists a countable set X = {xj : j ≥ 1} such that ∑_{j≥1} PX(xj) = 1. The set X is called the alphabet of X, the x ∈ X are atoms, and PX(x) is the probability mass function (pmf).

• For a discrete RV, the support is supp(PX) = {x : PX(x) > 0}.

• Vector RVs: Xn1 ≜ (X1, . . . , Xn), also denoted by just Xn.

• For a vector RV Xn and S ⊂ {1, . . . , n} we denote XS = {Xi : i ∈ S}.

1.1 Entropy

Definition 1.1 (Entropy). Let X be a discrete RV with distribution PX . The entropy (or Shannon entropy) of X is

H(X) = E[ log(1/PX(X)) ] = ∑_{x∈X} PX(x) log(1/PX(x)) .

Definition 1.2 (Joint entropy). Let Xn = (X1, X2, . . . , Xn) be a random vector with n components.

H(Xn) = H(X1, . . . , Xn) = E[ log(1/PX1,...,Xn(X1, . . . , Xn)) ].

Note: This is not really a new definition: Definition 1.2 is consistent with Definition 1.1 by treating Xn as a RV taking values on the product space.

Definition 1.3 (Conditional entropy).

H(X|Y ) = Ey∼PY [H(PX|Y=y)] = E[ log(1/PX|Y (X|Y )) ],

i.e., the entropy of PX|Y=y averaged over PY .
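To make these definitions concrete, here is a minimal numerical sketch in Python/NumPy (not part of the original notes): it computes H(X), H(Y) and H(X,Y) for a small joint pmf and recovers H(X|Y) as H(X,Y) − H(Y), which is the chain rule proved below (Theorem 1.1).

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a pmf given as an array of probabilities."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]                          # convention: 0 log(1/0) = 0
    return float(np.sum(p * np.log2(1.0 / p)))

# Joint pmf P_{XY} as a matrix: rows indexed by x, columns by y.
P_XY = np.array([[1/4, 1/4],
                 [1/2, 0.0]])

H_XY = entropy(P_XY)                      # joint entropy H(X,Y)
H_X  = entropy(P_XY.sum(axis=1))          # marginal entropy H(X)
H_Y  = entropy(P_XY.sum(axis=0))          # marginal entropy H(Y)
H_X_given_Y = H_XY - H_Y                  # chain rule: H(X|Y) = H(X,Y) - H(Y)

print(H_X, H_Y, H_XY, H_X_given_Y)
assert H_X_given_Y <= H_X + 1e-12         # "conditioning reduces entropy"
```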


Note:

• Q: Why such a definition, why log, why “entropy”? The name comes from thermodynamics. The definition is justified by theorems in this course (e.g. operationally by compression), but also by a number of experiments. For example, one can measure the time it takes for ant scouts to describe the location of food to ant workers. It was found that when the nest is placed at the root of a full binary tree of depth d and the food at one of the leaves, the time was proportional to log 2^d = d, the entropy of the random variable describing the food location. It was estimated that ants communicate at about 0.7 to 1 bit/min. Furthermore, communication time reduces if there are some regularities in the path description (e.g., paths like “left, right, left, right, left, right” were described faster). See [RZ86] for more.

• We agree that 0 · log(1/0) = 0 (by continuity of x ↦ x log(1/x)).

• Also write H(PX) instead of H(X) (abuse of notation, as customary in information theory).

• Base of log determines the units:

  log2 ↔ bits
  loge ↔ nats
  log256 ↔ bytes
  log ↔ arbitrary units; the base always matches that of exp

Example (Bernoulli): X ∈ {0, 1}, P[X = 1] = PX(1) ≜ p and P[X = 0] = PX(0) ≜ p̄. Then

H(X) = h(p) ≜ p log(1/p) + p̄ log(1/p̄),

where h(·) is called the binary entropy function.

Proposition 1.1. h(·) is continuous, concave on [0, 1], and

h′(p) = log(p̄/p),

with infinite slope at 0 and 1.

[Plot: h(p) on [0, 1], symmetric about p = 1/2 where it attains its maximum.]

Example (Geometric): X ∈ {0, 1, 2, . . .}, P[X = i] = PX(i) = p · p̄^i. Then

H(X) = ∑_{i=0}^∞ p·p̄^i log(1/(p·p̄^i)) = ∑_{i=0}^∞ p·p̄^i ( i log(1/p̄) + log(1/p) ) = log(1/p) + p · log(1/p̄) · p̄/p² = h(p)/p .

Example (Infinite entropy): Can H(X) = +∞? Yes: take P[X = k] = c/(k ln² k), k = 2, 3, . . ., with c the normalizing constant.
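As a quick numerical sanity check of the geometric example above (a sketch, not from the notes; the infinite sum is truncated at a large index):

```python
import numpy as np

def h_bits(p):
    """Binary entropy function in bits."""
    return p * np.log2(1/p) + (1 - p) * np.log2(1/(1 - p))

p = 0.3
i = np.arange(0, 1000)                   # truncate the tail, which is negligible here
pmf = p * (1 - p)**i                     # geometric pmf P[X = i]
pmf = pmf[pmf > 0]
H_geom = np.sum(pmf * np.log2(1/pmf))    # direct computation of H(X)

print(H_geom, h_bits(p)/p)               # both equal h(p)/p
assert abs(H_geom - h_bits(p)/p) < 1e-9
```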


Review: Convexity

• Convex set: A subset S of some vector space is convex if x, y ∈ S ⇒ αx + ᾱy ∈ S for all α ∈ [0, 1]. (Notation: ᾱ ≜ 1 − α.)

  e.g., the unit interval [0, 1]; S = {probability distributions on X}; S = {PX : E[X] = 0}.

• Convex function: f : S → R is

  – convex if f(αx + ᾱy) ≤ αf(x) + ᾱf(y) for all x, y ∈ S, α ∈ [0, 1];

  – strictly convex if f(αx + ᾱy) < αf(x) + ᾱf(y) for all x ≠ y ∈ S, α ∈ (0, 1);

  – (strictly) concave if −f is (strictly) convex.

  e.g., x ↦ x log x is strictly convex; the mean P ↦ ∫ x dP is convex but not strictly convex; variance is concave (Q: is it strictly concave? Think of zero-mean distributions.).

• Jensen's inequality: For any S-valued random variable X,

  – f is convex ⇒ f(EX) ≤ Ef(X);

  – f is strictly convex ⇒ f(EX) < Ef(X) unless X is a constant (X = EX a.s.).

  [Figure: a convex f, for which f(EX) lies below Ef(X).]

Famous puzzle: A man says, “I am the average height and average weight of the population. Thus, I am an average man.” However, he is still considered a little overweight. Why?

Answer: The weight is roughly proportional to the volume, which, for us three-dimensional beings, is roughly proportional to the third power of the height. Let PX denote the distribution of height in the population. Since x ↦ x³ is strictly convex on x > 0, Jensen's inequality gives (EX)³ < E[X³], regardless of the distribution of X. Source: [Yos03, Puzzle 94] or online [Har].

Theorem 1.1 (Properties of H).

1. (Positivity) H(X) ≥ 0 with equality iff X is a constant (no randomness).

2. (Uniform distribution maximizes entropy) For finite X , H(X) ≤ log |X |, with equality iff X is uniform on X .

3. (Invariance under relabeling) H(X) = H(f(X)) for any bijective f .


4. (Conditioning reduces entropy)

H(X|Y ) ≤ H(X), with equality iff X and Y are independent.

5. (Small chain rule) H(X,Y ) = H(X) + H(Y |X) ≤ H(X) + H(Y )

6. (Entropy under functions) H(X) = H(X, f(X)) ≥ H(f(X)), with equality iff f is one-to-one on the support of PX .

7. (Full chain rule)

H(X1, . . . , Xn) = ∑_{i=1}^n H(Xi|Xi−1) ≤ ∑_{i=1}^n H(Xi),      (1.1)

with equality iff X1, . . . , Xn are mutually independent.

Proof. 1. Expectation of positive function

2. Jensen’s inequality

3. H only depends on the values of PX , not on their locations: permuting the masses of the pmf leaves H unchanged. [Figure: two pmfs that are permutations of each other have equal entropy.]

4. Later (Lecture 2)

5. E[ log(1/PXY (X,Y )) ] = E[ log( 1/(PX(X)·PY |X(Y |X)) ) ] = E[ log(1/PX(X)) ] + E[ log(1/PY |X(Y |X)) ] = H(X) + H(Y |X)

6. Intuition: (X, f(X)) contains the same amount of information as X. Indeed, x ↦ (x, f(x)) is one-to-one. Thus by 3 and 5:

H(X) = H(X, f(X)) = H(f(X)) + H(X|f(X)) ≥ H(f(X))

The bound is attained iff H(X|f(X)) = 0, which in turn happens iff X is a constant given f(X).

7. Telescoping: PX1X2···Xn = PX1 PX2|X1 · · · PXn|Xn−1 , then take the log.

Note: To give a preview of the operational meaning of entropy, let us play the game of 20 Questions. We are allowed to make queries about some unknown discrete RV X by asking yes-no questions. The objective of the game is to guess the realized value of the RV X. For example, X ∈ {a, b, c, d} with P[X = a] = 1/2, P[X = b] = 1/4, and P[X = c] = P[X = d] = 1/8. In this case, we can ask “X = a?”. If not, proceed by asking “X = b?”. If not, ask “X = c?”, after which we will know for sure the realization of X. The resulting average number of questions is 1/2 + 1/4 × 2 + 1/8 × 3 + 1/8 × 3 = 1.75, which equals H(X) in bits. An alternative strategy is to ask whether X ∈ {a, b} or X ∈ {c, d} in the first round and then determine the value in the second round, which always requires two questions and does worse on average.

It turns out (Section 8.2) that the minimal average number of yes-no questions to pin down the value of X is always between H(X) bits and H(X) + 1 bits. In this special case the above scheme is optimal because (intuitively) it always splits the probability in half.
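The 1.75-question computation can be reproduced in a couple of lines (an illustrative sketch, not part of the notes):

```python
import numpy as np

probs = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/8}

# Entropy H(X) in bits.
H = sum(p * np.log2(1/p) for p in probs.values())

# Strategy from the text: ask "X = a?", then "X = b?", then "X = c?".
# Number of questions asked for each possible realization:
questions = {'a': 1, 'b': 2, 'c': 3, 'd': 3}
avg_questions = sum(probs[s] * questions[s] for s in probs)

print(H, avg_questions)   # both equal 1.75 for this dyadic distribution
```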


1.2 Entropy: axiomatic characterization

One might wonder why entropy is defined as H(P) = ∑ pi log(1/pi) and whether there are other possible definitions. Indeed, the information-theoretic definition of entropy is related to entropy in statistical physics. Also, it arises as the answer to specific operational problems, e.g., the minimum average number of bits needed to describe a random variable, as discussed above. Therefore it is fair to say that it is not pulled out of thin air.

Shannon, in his 1948 paper, also showed that entropy can be defined axiomatically, as a function satisfying several natural conditions. Denote a probability distribution on m letters by P = (p1, . . . , pm) and consider a functional Hm(p1, . . . , pm). If Hm obeys the following axioms:

a) Permutation invariance

b) Expansible: Hm(p1, . . . , pm−1, 0) = Hm−1(p1, . . . , pm−1).

c) Normalization: H2(1/2, 1/2) = log 2.

d) Subadditivity: H(X,Y ) ≤ H(X) + H(Y ). Equivalently, Hmn(r11, . . . , rmn) ≤ Hm(p1, . . . , pm) + Hn(q1, . . . , qn) whenever ∑_{j=1}^n rij = pi and ∑_{i=1}^m rij = qj .

e) Additivity: H(X,Y ) = H(X) + H(Y ) if X ⊥⊥ Y . Equivalently, Hmn(p1q1, . . . , pmqn) = Hm(p1, . . . , pm) + Hn(q1, . . . , qn).

f) Continuity: H2(p, 1− p)→ 0 as p→ 0.

then Hm(p1, . . . , pm) = ∑_{i=1}^m pi log(1/pi) is the only possibility. The interested reader is referred to [CT06, p. 53] and the references therein.

1.3 History of entropy

In the early days of the industrial age, engineers wondered whether it was possible to construct a perpetual motion machine. After many failed attempts, a law of conservation of energy was postulated: a machine cannot produce more work than the amount of energy it consumes from the ambient world (this is also called the first law of thermodynamics). The next round of attempts was then to construct a machine that would draw energy in the form of heat from a warm body and convert it to an equal (or approximately equal) amount of work. An example would be a steam engine. However, again it was observed that all such machines were highly inefficient, that is, the amount of work produced by absorbing heat Q was far less than Q. The remainder of the energy was dissipated to the ambient world in the form of heat. Again, after many rounds of attempting various designs, Clausius and Kelvin proposed another law:

Second law of thermodynamics: There does not exist a machine that operates in a cycle (i.e. returns to its original state periodically), produces useful work and whose only other effect on the outside world is drawing heat from a warm body. (That is, every such machine should expend some amount of heat to some cold body too!)1

An equivalent formulation is: there does not exist a cyclic process that transfers heat from a cold body to a warm body (that is, every such process needs to be helped by expending some amount of external work).

1Note that the reverse effect (that is converting work into heat) is rather easy: friction is an example.


Notice that there is something annoying about the second law as compared to the first law. In the first law there is a quantity that is conserved, and this is somehow logically easy to accept. The second law seems a bit harder to believe in (and some engineers did not believe it, and only their recurrent failures to circumvent it finally convinced them). So Clausius, building on the ingenious work of S. Carnot, figured out that there is an “explanation” for why any cyclic machine should expend heat. He proposed that there must be some hidden quantity associated to the machine, its entropy (translated as “transformative content”), whose value must return to its original state. Furthermore, under any reversible (i.e. quasi-stationary, or “very slow”) process operated on this machine the change of entropy is proportional to the ratio of absorbed heat and the temperature of the machine:

∆S = ∆Q / T .      (1.2)

If heat Q is absorbed at temperature Thot then to return to the original state, one must return some amount of heat Q′, where Q′ can be significantly smaller than Q but never zero if Q′ is returned at a temperature 0 < Tcold < Thot.2 Further logical arguments can convince one that for an irreversible cyclic process the change of entropy at the end of the cycle can only be positive, and hence entropy cannot decrease.

There were a great many experimentally verified consequences that the second law produced. However, what is surprising is that the mysterious entropy did not have any formula for it (unlike, say, energy), and thus had to be computed indirectly on the basis of relation (1.2). This changed with the revolutionary work of Boltzmann and Gibbs, who provided a microscopic explanation of the second law based on statistical physics principles and showed that, e.g., for a system of n independent particles (as in an ideal gas) the entropy of a given macro-state can be computed as

S = k n ∑_{j=1}^{ℓ} pj log(1/pj),

where k is the Boltzmann constant, we assume that each particle can only be in one of ℓ molecular states (e.g. spin up/down, or if we quantize the phase volume into ℓ subcubes), and pj is the fraction of particles in the j-th molecular state.

1.4* Entropy: submodularity

Recall that [n] denotes the set {1, . . . , n}, (S choose k) denotes the collection of subsets of S of size k, and 2^S denotes all subsets of S. A set function f : 2^S → R is called submodular if for any T1, T2 ⊂ S

f(T1 ∪ T2) + f(T1 ∩ T2) ≤ f(T1) + f(T2) .      (1.3)

Submodularity is similar to concavity, in the sense that “adding elements gives diminishing returns”. Indeed, consider T′ ⊂ T and b ∉ T. Then

f(T ∪ {b}) − f(T) ≤ f(T′ ∪ {b}) − f(T′) .

Theorem 1.2. Let Xn be a discrete RV. Then T ↦ H(XT ) is submodular.

Proof. Let A = XT1\T2 , B = XT1∩T2 , C = XT2\T1 . Then we need to show

H(A,B,C) + H(B) ≤ H(A,B) + H(B,C) .

2See for example https://en.wikipedia.org/wiki/Carnot_heat_engine#Carnot.27s_theorem.


This follows from the simple chain

H(A,B,C) + H(B) = H(A,C|B) + 2H(B)      (1.4)
                ≤ H(A|B) + H(C|B) + 2H(B)      (1.5)
                = H(A,B) + H(B,C) .      (1.6)
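Theorem 1.2 is easy to check numerically. The following sketch (not from the notes) draws a random joint pmf of three binary coordinates and verifies (1.3) for every pair of subsets.

```python
import numpy as np
from itertools import chain, combinations

rng = np.random.default_rng(0)
n = 3
p = rng.random((2,) * n)
p /= p.sum()                      # random joint pmf of (X1, X2, X3), each binary

def H(T):
    """Entropy in bits of the sub-vector X_T, with T a tuple of coordinate indices."""
    marg = p.sum(axis=tuple(i for i in range(n) if i not in T)).ravel()
    marg = marg[marg > 0]
    return float(np.sum(marg * np.log2(1/marg)))

subsets = list(chain.from_iterable(combinations(range(n), k) for k in range(n + 1)))
for T1 in subsets:
    for T2 in subsets:
        union = tuple(sorted(set(T1) | set(T2)))
        inter = tuple(sorted(set(T1) & set(T2)))
        # Submodularity (1.3) of T -> H(X_T):
        assert H(union) + H(inter) <= H(T1) + H(T2) + 1e-10
```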

Note that entropy is not only submodular, but also monotone:

T1 ⊂ T2 =⇒ H(XT1) ≤ H(XT2) .

So, fixing n, let us denote by Γn the set of all non-negative, monotone, submodular set-functions on [n]. Note that via an obvious enumeration of all non-empty subsets of [n], Γn is a closed convex cone in R+^(2^n − 1). Similarly, let us denote by Γ*n the set of all set-functions corresponding to distributions on Xn, and by cl(Γ*n) its closure. It is not hard to show, cf. [ZY97], that cl(Γ*n) is also a closed convex cone and that

Γ*n ⊂ cl(Γ*n) ⊂ Γn .

The astonishing result of [ZY98] is that

Γ*2 = cl(Γ*2) = Γ2      (1.7)
Γ*3 ⊊ cl(Γ*3) = Γ3      (1.8)
Γ*n ⊊ cl(Γ*n) ⊊ Γn,  n ≥ 4 .      (1.9)

This follows from a fundamentally new information inequality not implied by the submodularity of entropy (and thus called a non-Shannon inequality). Namely, [ZY98] showed that for any 4-tuple of discrete random variables:

I(X3;X4) − I(X3;X4|X1) − I(X3;X4|X2) ≤ (1/2) I(X1;X2) + (1/4) I(X1;X3,X4) + (1/4) I(X2;X3,X4) .

(To translate this into an entropy inequality, see Theorem 2.4.)

1.5* Entropy: Han’s inequality and Shearer’s Lemma

Theorem 1.3 (Han's inequality). Let Xn be a discrete n-dimensional RV and denote by Hk(Xn) = (n choose k)^{-1} ∑_{T: |T|=k} H(XT ) the average entropy of a k-subset of coordinates. Then Hk/k is decreasing in k:

(1/n) Hn ≤ · · · ≤ (1/k) Hk ≤ · · · ≤ H1 .      (1.10)

Furthermore, the sequence Hk is increasing and concave in the sense of decreasing slope:

Hk+1 − Hk ≤ Hk − Hk−1 .      (1.11)

Proof. Denote for convenience H0 = 0. Note that Hm/m is an average of differences:

(1/m) Hm = (1/m) ∑_{k=1}^m (Hk − Hk−1) .


Thus, it is clear that (1.11) implies (1.10), since increasing m by one adds a smaller element to the average. To prove (1.11), observe that from submodularity

H(X1, . . . , Xk+1) + H(X1, . . . , Xk−1) ≤ H(X1, . . . , Xk) + H(X1, . . . , Xk−1, Xk+1) .

Now average this inequality over all n! permutations of indices 1, . . . , n to get

Hk+1 + Hk−1 ≤ 2 Hk ,

as claimed by (1.11).

Alternative proof: Notice that by “conditioning decreases entropy” we have

H(Xk+1|X1, . . . , Xk) ≤ H(Xk+1|X2, . . . , Xk) .

Averaging this inequality over all permutations of indices yields (1.11).
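A small numerical check of (1.10) for a random joint pmf (a sketch, not part of the notes):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n = 4
p = rng.random((2,) * n)
p /= p.sum()                       # random joint pmf of 4 binary coordinates

def H(T):
    """Entropy in bits of X_T for a tuple of coordinate indices T."""
    marg = p.sum(axis=tuple(i for i in range(n) if i not in T)).ravel()
    marg = marg[marg > 0]
    return float(np.sum(marg * np.log2(1/marg)))

# Average entropy of a k-subset of coordinates, as in Theorem 1.3.
Hbar = [np.mean([H(T) for T in combinations(range(n), k)]) for k in range(1, n + 1)]

ratios = [Hbar[k - 1] / k for k in range(1, n + 1)]
print(ratios)                      # Han's inequality (1.10): this list is non-increasing
assert all(ratios[k] <= ratios[k - 1] + 1e-10 for k in range(1, n))
```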

Theorem 1.4 (Shearer's Lemma). Let Xn be a discrete n-dimensional RV and let S ⊂ [n] be a random variable independent of Xn and taking values in subsets of [n]. Then

H(XS |S) ≥ H(Xn) · min_{i∈[n]} P[i ∈ S] .      (1.12)

Remark 1.1. In the special case where S is uniform over all subsets of cardinality k, (1.12) reduces to Han's inequality (1/n) H(Xn) ≤ (1/k) Hk. The case of n = 3 and k = 2 can be used to give an entropy proof of the following well-known geometric result relating the size of a 3-D object to the sizes of its 2-D projections: place N points in R³ arbitrarily, and let N1, N2, N3 denote the number of distinct points projected onto the xy-, xz- and yz-planes, respectively. Then N1 N2 N3 ≥ N².
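The projection inequality in Remark 1.1 can be spot-checked by brute force on random point sets (a sketch, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(2)

for _ in range(100):
    # A random set of points on a small integer grid in R^3 (duplicates are merged).
    pts = {tuple(p) for p in rng.integers(0, 4, size=(rng.integers(1, 50), 3))}
    N = len(pts)
    N1 = len({(x, y) for x, y, z in pts})   # distinct projections onto the xy-plane
    N2 = len({(x, z) for x, y, z in pts})   # distinct projections onto the xz-plane
    N3 = len({(y, z) for x, y, z in pts})   # distinct projections onto the yz-plane
    assert N1 * N2 * N3 >= N**2
```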

Proof. We will prove an equivalent (by taking a suitable limit) version: if C = (S1, . . . , SM) is a list (possibly with repetitions) of subsets of [n], then

∑_j H(XSj ) ≥ H(Xn) · min_i deg(i) ,      (1.13)

where deg(i) ≜ #{j : i ∈ Sj}. Let us call C a chain if all subsets can be rearranged so that S1 ⊆ S2 ⊆ · · · ⊆ SM. For a chain, (1.13) is trivial, since the minimum on the right-hand side is either zero (if SM ≠ [n]) or equals the multiplicity of SM in C,3 in which case we have

∑_j H(XSj ) ≥ H(XSM ) · #{j : Sj = SM} = H(Xn) · min_i deg(i) .

For the case when C is not a chain, consider a pair of sets S1, S2 that are not related by inclusion and replace them in the collection with S1 ∩ S2, S1 ∪ S2. Submodularity (1.3) implies that the sum on the left-hand side of (1.13) does not decrease under this replacement, while the values deg(i) are not changed. Since the total number of pairs not related by inclusion strictly decreases with each replacement, we must eventually arrive at a chain, for which (1.13) has already been shown.

3Note that, consequently, for Xn without constant coordinates, and if C is a chain, (1.13) is only tight if C consists of only ∅ and [n] (with multiplicities). Thus if the degrees deg(i) are known and non-constant, then (1.13) can be improved, cf. [MT10].


Note: Han's inequality holds for any submodular set-function. Shearer's lemma holds for any submodular set-function that is also non-negative.

Example: Another submodular set-function is

S ↦ I(XS ;XSc) .

Han's inequality for this one reads

0 = (1/n) In ≤ · · · ≤ (1/k) Ik ≤ · · · ≤ I1 ,

where Ik = (n choose k)^{-1} ∑_{S: |S|=k} I(XS ;XSc) measures the amount of k-subset coupling in the random vector Xn.

1.6 Divergence

Review: Measurability

In this course we will assume that all alphabets are standard Borel spaces. Some of the nice properties of standard Borel spaces:

• All complete separable metric spaces, endowed with Borel σ-algebras, are standard Borel. In particular, countable alphabets, Rn and R∞ (the space of sequences) are standard Borel.

• If Xi, i = 1, . . . are standard Borel, then so is ∏_{i=1}^∞ Xi.

• Singletons {x} are measurable sets.

• The diagonal {(x, x) : x ∈ X} is measurable in X × X .

• (Most importantly) for any probability distribution PXY on X × Y there exists a transition probability kernel (also called a regular branch of a conditional distribution) PY |X s.t.

PXY [E] = ∫_X PX(dx) ∫_Y PY |X=x(dy) 1{(x, y) ∈ E} .

Intuition: Divergence (also known as information divergence, Kullback-Leibler (KL) divergence, or relative entropy) D(P‖Q) gauges the dissimilarity between P and Q.

Definition 1.4 (Divergence). Let P, Q be distributions on:

• A = discrete alphabet (finite or countably infinite):

D(P‖Q) ≜ ∑_{a∈A} P(a) log( P(a)/Q(a) ),

where we agree:

(1) 0 · log(0/0) = 0;


(2) if ∃a : Q(a) = 0, P(a) > 0, then D(P‖Q) = ∞.

• A = Rk, P and Q have densities p and q:

D(P‖Q) = ∫_{Rk} p(x) log( p(x)/q(x) ) dx   if Leb{p > 0, q = 0} = 0, and D(P‖Q) = +∞ otherwise.

• A = general measurable space:

D(P‖Q) = EP[ log(dP/dQ) ] = EQ[ (dP/dQ) log(dP/dQ) ]   if P ≪ Q, and D(P‖Q) = +∞ otherwise.
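For discrete alphabets the definition translates directly into code. Here is a minimal sketch (in Python, not part of the notes) implementing the two conventions above; the two probability vectors used below reappear in the asymmetry example that follows.

```python
import numpy as np

def D(P, Q):
    """KL divergence D(P||Q) in bits between two pmfs on the same finite alphabet."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    if np.any((Q == 0) & (P > 0)):
        return np.inf                 # convention (2): P is not absolutely continuous w.r.t. Q
    mask = P > 0                      # convention (1): 0 * log(0/0) = 0
    return float(np.sum(P[mask] * np.log2(P[mask] / Q[mask])))

P = [1/2, 1/2]     # fair coin
Q = [1.0, 0.0]     # always heads
print(D(P, Q), D(Q, P))   # inf and 1.0 bit: divergence is asymmetric
```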

Notes:

• (Radon-Nikodym theorem) Recall that for two measures P and Q, we say P is absolutely continuous w.r.t. Q (denoted by P ≪ Q) if Q(E) = 0 implies P(E) = 0 for all measurable E. If P ≪ Q, then there exists a function f : X → R+ such that for any measurable set E,

P(E) = ∫_E f dQ .   [change of measure]

Such an f is called a relative density (or a Radon-Nikodym derivative) of P w.r.t. Q, denoted by dP/dQ. Usually, dP/dQ is simply the likelihood ratio:

  – For discrete distributions, we can just take dP/dQ(x) to be the ratio of the pmfs.

  – For continuous distributions, we can take dP/dQ(x) to be the ratio of the pdfs.

• (Infinite values) D(P‖Q) can be ∞ even when P ≪ Q, but the two cases of D(P‖Q) = +∞ are consistent since D(P‖Q) = sup_Π D(PΠ‖QΠ), where Π is a finite partition of the underlying space A (Theorem 3.7).

• (Asymmetry) D(P‖Q) ≠ D(Q‖P ). Therefore divergence is not a distance. However, asymmetry can be very useful. Example: P(H) = P(T) = 1/2, Q(H) = 1. Upon observing HHHHHHH, one tends to believe it is Q but can never be absolutely sure; upon observing HHT, one knows for sure it is P. Indeed, D(P‖Q) = ∞, D(Q‖P ) = 1 bit (see Lecture 13).

• (Pinsker's inequality) There are many other measures of dissimilarity, e.g., total variation (L1-distance)

TV(P,Q) ≜ sup_E { P[E] − Q[E] } = (1/2) ∫ |dP − dQ|      (1.14)
        = (1/2) ∑_x |P(x) − Q(x)|   (discrete)
        = (1/2) ∫ |p(x) − q(x)| dx   (continuous)

Total variation is symmetric and is in fact a distance. The famous Pinsker (or Pinsker-Csiszár) inequality relates D and TV (see Theorem 6.5):

TV(P,Q) ≤ √( D(P‖Q) / (2 log e) ) .      (1.15)
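A quick randomized check of Pinsker's inequality (1.15), with D computed in bits so that the log e factor equals log2 e (a sketch, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(3)

def D_bits(P, Q):
    mask = P > 0
    return np.sum(P[mask] * np.log2(P[mask] / Q[mask]))

for _ in range(1000):
    P = rng.random(5); P /= P.sum()
    Q = rng.random(5); Q /= Q.sum()
    TV = 0.5 * np.abs(P - Q).sum()
    # (1.15): TV <= sqrt( D / (2 log e) ), with log e = log2(e) when D is in bits.
    assert TV <= np.sqrt(D_bits(P, Q) / (2 * np.log2(np.e))) + 1e-12
```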


• (Other divergences) A general class of divergence-like measures was proposed by Csiszár. Fixing a convex function f : R+ → R with f(1) = 0, we define the f-divergence Df as

Df(P‖Q) ≜ EQ[ f( dP/dQ ) ] .      (1.16)

This encompasses total variation, χ²-distance, Hellinger distance, Tsallis divergence, etc. Finding inequalities between various f-divergences, such as (1.15), was once an active field of research. A complete solution was obtained by Harremoës and Vajda [HV11], who gave a simple method for obtaining the best possible inequalities between any pair of f-divergences.

Theorem 1.5 (H vs. D). If distribution P is supported on a finite set A, then

H(P) = log |A| − D(P‖UA),

where UA is the uniform distribution on A.

Example (Binary divergence): A = {0, 1}; P = [p, p̄]; Q = [q, q̄]. Then

D(P‖Q) = d(p‖q) ≜ p log(p/q) + p̄ log(p̄/q̄) .

Here is how d(p‖q) depends on p and q: [two plots, d(p‖q) as a function of q for fixed p and as a function of p for fixed q; note d(p‖q) → −log q as p → 1 and d(p‖q) → −log q̄ as p → 0].

Quadratic lower bound (homework):

d(p‖q) ≥ 2(p − q)² log e .

Example (Real Gaussian): A = R.

D(N(m1, σ1²)‖N(m0, σ0²)) = (1/2) log(σ0²/σ1²) + (1/2) [ (m1 − m0)²/σ0² + σ1²/σ0² − 1 ] log e      (1.17)

Example (Vector Gaussian): A = Rk. Assuming det Σ0 ≠ 0,

D(N(m1,Σ1)‖N(m0,Σ0)) = (1/2) ( log det Σ0 − log det Σ1 + (m1 − m0)^T Σ0^{-1} (m1 − m0) log e + tr(Σ0^{-1} Σ1 − I) log e )      (1.18)
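The scalar formula (1.17) is easy to validate against direct numerical integration of EP[log(dP/dQ)] (a sketch, not from the notes; it assumes SciPy is available):

```python
import numpy as np
from scipy.integrate import quad

def gauss_pdf(x, m, var):
    return np.exp(-(x - m)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

m1, v1 = 1.0, 2.0   # P = N(m1, v1), v1 = variance sigma_1^2
m0, v0 = 0.0, 3.0   # Q = N(m0, v0)

# Closed form (1.17), in nats (natural log, so the "log e" factor equals 1).
D_formula = 0.5 * np.log(v0 / v1) + 0.5 * ((m1 - m0)**2 / v0 + v1 / v0 - 1)

# Direct numerical integration of E_P[log(dP/dQ)].
integrand = lambda x: gauss_pdf(x, m1, v1) * np.log(gauss_pdf(x, m1, v1) / gauss_pdf(x, m0, v0))
D_numeric, _ = quad(integrand, -np.inf, np.inf)

print(D_formula, D_numeric)   # the two agree up to integration error
```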


Example (Complex Gaussian): A = C. The pdf of Nc(m, σ²) is (1/(πσ²)) e^(−|x−m|²/σ²), or equivalently

Nc(m, σ²) = N( [Re(m), Im(m)], [σ²/2, 0; 0, σ²/2] ),

and

D(Nc(m1, σ1²)‖Nc(m0, σ0²)) = log(σ0²/σ1²) + ( |m1 − m0|²/σ0² + σ1²/σ0² − 1 ) log e,

which follows from (1.18).

More generally, for the vector space A = Ck, assuming det Σ0 ≠ 0,

D(Nc(m1,Σ1)‖Nc(m0,Σ0)) = log det Σ0 − log det Σ1 + (m1 − m0)^H Σ0^{-1} (m1 − m0) log e + tr(Σ0^{-1} Σ1 − I) log e .

Note: The definition of D(P‖Q) extends verbatim to measures P and Q (not necessarily probability measures), in which case D(P‖Q) can be negative. A sufficient condition for D(P‖Q) ≥ 0 is that P is a probability measure and Q is a sub-probability measure, i.e., ∫ dQ ≤ 1 = ∫ dP.

1.7 Differential entropy

The notion of differential entropy is simply the divergence with respect to the Lebesgue measure:

Definition 1.5. The differential entropy of a random vector Xk is

h(Xk) = h(PXk) ≜ −D(PXk‖Leb).      (1.19)

In particular, if Xk has probability density function (pdf) p, then h(Xk) = E[ log(1/p(Xk)) ]; otherwise h(Xk) = −∞. Conditional differential entropy: h(Xk|Y ) ≜ E[ log(1/pXk|Y (Xk|Y )) ], where pXk|Y is a conditional pdf.

Example: Gaussian. For X ∼ N(µ, σ²),

h(X) = (1/2) log(2πeσ²).      (1.20)

More generally, for X ∼ N(µ,Σ) in Rd,

h(X) = (1/2) log((2πe)^d det Σ).      (1.21)
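A Monte Carlo spot-check of (1.20) (a sketch, not from the notes): estimate h(X) = E[log(1/p(X))] by sampling.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 1.0, 2.5

# Closed form (1.20), in nats.
h_formula = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

# Monte Carlo estimate of h(X) = E[log 1/p(X)] for X ~ N(mu, sigma^2).
x = rng.normal(mu, sigma, size=1_000_000)
log_pdf = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
h_mc = -log_pdf.mean()

print(h_formula, h_mc)
assert abs(h_formula - h_mc) < 0.01   # agree up to Monte Carlo error
```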

Warning: Even for a continuous random variable X, h(X) can be positive, negative, take values of ±∞ or even be undefined.4

Nevertheless, differential entropy shares many properties with the usual Shannon entropy:

Theorem 1.6 (Properties of differential entropy). Assume that all differential entropies appearingbelow exist and are finite (in particular all RVs have pdfs and conditional pdfs).

1. (Uniform distribution maximizes differential entropy) If P[Xn ∈ S] = 1 then h(Xn) ≤ log Leb(S),5 with equality iff Xn is uniform on S.

4For an example, consider a piecewise-constant pdf taking value e^((−1)^n n) on the n-th interval of width ∆n = c n^(−2) e^(−(−1)^n n).

5Here Leb(S) is the same as the volume vol(S).


2. (Scaling and shifting) h(Xk + x) = h(Xk), h(αXk) = h(Xk) + k log |α|, and for invertible A, h(AXk) = h(Xk) + log |det A|

3. (Conditioning reduces differential entropy) h(X|Y ) ≤ h(X) (here Y could be arbitrary, e.g.discrete)

4. (Chain rule)

h(Xn) = ∑_{k=1}^n h(Xk|Xk−1) .

5. (Submodularity) The set-function T 7→ h(XT ) is submodular.

6. (Han's inequality) The function k ↦ (1/k) (n choose k)^{-1} ∑_{T: |T|=k} h(XT ) is decreasing in k.


§ 2. Information measures: mutual information

2.1 Divergence: main inequality

Theorem 2.1 (Information Inequality).

D(P‖Q) ≥ 0 ; D(P‖Q) = 0 iff P = Q

Proof. Let ϕ(x) ≜ x log x, which is strictly convex, and use Jensen's inequality:

D(P‖Q) = ∑_{x∈X} P(x) log( P(x)/Q(x) ) = ∑_{x∈X} Q(x) ϕ( P(x)/Q(x) ) ≥ ϕ( ∑_{x∈X} Q(x) · P(x)/Q(x) ) = ϕ(1) = 0 .

2.2 Conditional divergence

The main objects in our course are random variables. The main operation for creating new random variables, and also for defining relations between random variables, is that of a random transformation:

Definition 2.1. A conditional probability distribution (a.k.a. random transformation, transition probability kernel, Markov kernel, channel) K(·|·) has two arguments: the first argument is a measurable subset of Y, the second argument is an element of X . It must satisfy:

1. For any x ∈ X : K( · |x) is a probability measure on Y

2. For any measurable set A: x 7→ K(A|x) is a measurable function on X .

In this case we will say that K acts from X to Y. In fact, we will abuse notation and write PY |X instead of K to suggest what the spaces X and Y are.1 Furthermore, if X and Y are connected by the random transformation PY |X , we will write X −PY |X→ Y .

Remark 2.1. (Very technical!) Unfortunately, condition 2 (standard in probability textbooks) will frequently not be strong enough for the purposes of this course. The main reason is that we want Radon-Nikodym derivatives such as (dPY |X=x/dQY )(y) to be jointly measurable in (x, y). See Section 2.6* for more.

Example:

1. deterministic system: Y = f(X)⇔ PY |X=x = δf(x)

1Another reason for writing PY |X is that from any joint distribution PXY (on standard Borel spaces) one can extract a random transformation by conditioning on X.


2. decoupled system: Y ⊥⊥ X ⇔ PY |X=x = PY

3. additive noise (convolution): Y = X + Z with Z ⊥⊥ X ⇔ PY |X=x = Px+Z .

We will use the following notations extensively:

• Multiplication: combine an input distribution PX with a kernel PY |X (pictorially, PX feeds into the channel PY |X ) to get the joint distribution PXY = PX PY |X :

PXY (x, y) = PY |X(y|x) PX(x) .

• Composition (marginalization): PY = PY |X ◦ PX , that is, PY |X acts on PX to produce PY :

PY (y) = ∑_{x∈X} PY |X(y|x) PX(x) .

We will also write PX −PY |X→ PY .

Definition 2.2 (Conditional divergence).

D(PY |X‖QY |X |PX) = Ex∼PX [D(PY |X=x‖QY |X=x)]
                  = ∑_{x∈X} PX(x) D(PY |X=x‖QY |X=x)   (X discrete)
                  = ∫ pX(x) D(PY |X=x‖QY |X=x) dx   (X continuous)

Theorem 2.2 (Properties of Divergence).

1. D(PY |X‖QY |X |PX) = D(PXPY |X‖PXQY |X)

2. (Simple chain rule) D(PXY ‖QXY ) = D(PY |X‖QY |X |PX) +D(PX‖QX)

3. (Monotonicity) D(PXY ‖QXY ) ≥ D(PY ‖QY )

4. (Full chain rule)

D(PX1···Xn‖QX1···Xn) = ∑_{i=1}^n D(PXi|Xi−1‖QXi|Xi−1 |PXi−1) .

In the special case of QXn = ∏_i QXi we have

D(PX1···Xn‖QX1 · · ·QXn) = D(PX1···Xn‖PX1 · · ·PXn) + ∑_{i=1}^n D(PXi‖QXi) .

5. (Conditioning increases divergence) Let PY |X and QY |X be two kernels, and let PY = PY |X ◦ PX and QY = QY |X ◦ PX . Then

D(PY ‖QY ) ≤ D(PY |X‖QY |X |PX),

with equality iff D(PX|Y ‖QX|Y |PY ) = 0.

Pictorially: a single input PX is passed through the two kernels PY |X and QY |X , producing PY and QY respectively.


6. (Data-processing for divergences) Let PY = PY |X ◦ PX and QY = PY |X ◦ QX , i.e.,

PY = ∫ PY |X(·|x) dPX ,   QY = ∫ PY |X(·|x) dQX .   Then D(PY ‖QY ) ≤ D(PX‖QX) .      (2.1)

Pictorially: passing the pair (PX , QX) through a common channel PY |X to obtain (PY , QY ) can only decrease divergence: D(PX‖QX) ≥ D(PY ‖QY ).

Proof. We only illustrate these results for the case of finite alphabets. The general case follows by a careful analysis of Radon-Nikodym derivatives, introduction of regular branches of conditional probability, etc. For certain cases (e.g. separable metric spaces), however, we can simply discretize the alphabets and take the granularity of discretization to zero. This method will become clearer in Lecture 3, once we understand continuity of D.

1. Ex∼PX [D(PY |X=x‖QY |X=x)] = EXY∼PXPY |X [ log( (PY |X · PX)/(QY |X · PX) ) ], which is exactly D(PXPY |X‖PXQY |X).

2. Disintegration: EXY [ log(PXY /QXY ) ] = EXY [ log(PY |X/QY |X) + log(PX/QX) ]

3. Apply 2. with X and Y interchanged and use D(·‖·) ≥ 0.

4. Telescoping: PXn = ∏_{i=1}^n PXi|Xi−1 and QXn = ∏_{i=1}^n QXi|Xi−1 .

5. The inequality follows from monotonicity. To get the conditions for equality, notice that by the chain rule for D:

D(PXY ‖QXY ) = D(PY |X‖QY |X |PX) + D(PX‖PX)   (the last term being 0)
             = D(PX|Y ‖QX|Y |PY ) + D(PY ‖QY ),

and hence we get the claimed result from positivity of D.

6. This again follows from monotonicity.
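A numerical illustration of properties 1 and 5 for finite alphabets (a sketch, not part of the notes):

```python
import numpy as np

rng = np.random.default_rng(5)

def D(P, Q):
    """D(P||Q) in bits for discrete distributions given as arrays of the same shape."""
    P, Q = P.ravel(), Q.ravel()
    m = P > 0
    return float(np.sum(P[m] * np.log2(P[m] / Q[m])))

nx, ny = 3, 4
PX = rng.random(nx); PX /= PX.sum()
PY_X = rng.random((nx, ny)); PY_X /= PY_X.sum(axis=1, keepdims=True)   # kernel P_{Y|X}
QY_X = rng.random((nx, ny)); QY_X /= QY_X.sum(axis=1, keepdims=True)   # kernel Q_{Y|X}

# Conditional divergence D(P_{Y|X} || Q_{Y|X} | P_X) as in Definition 2.2.
D_cond = sum(PX[x] * D(PY_X[x], QY_X[x]) for x in range(nx))

# Property 1: it equals D(P_X P_{Y|X} || P_X Q_{Y|X}).
assert abs(D_cond - D(PX[:, None] * PY_X, PX[:, None] * QY_X)) < 1e-10

# Property 5 ("conditioning increases divergence"): D(P_Y || Q_Y) <= D_cond,
# where P_Y, Q_Y are the output marginals under the two kernels.
PY, QY = PX @ PY_X, PX @ QY_X
assert D(PY, QY) <= D_cond + 1e-10
```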

Corollary 2.1.

D(PX1···Xn‖QX1 · · ·QXn) ≥ ∑_i D(PXi‖QXi),

with equality iff PXn = ∏_{j=1}^n PXj .


Note: In general we can have D(PXY ‖QXY ) ≶ D(PX‖QX) + D(PY ‖QY ). For example, if X = Y under P and Q, then D(PXY ‖QXY ) = D(PX‖QX) < 2D(PX‖QX) (whenever D(PX‖QX) > 0). Conversely, if PX = QX and PY = QY but PXY ≠ QXY , we have D(PXY ‖QXY ) > 0 = D(PX‖QX) + D(PY ‖QY ).

Corollary 2.2. Y = f(X)⇒ D(PY ‖QY ) ≤ D(PX‖QX), with equality if f is 1-1.

Note: D(PY ‖QY ) = D(PX‖QX) does not imply that f is 1-1. Example: PX = Gaussian, QX = Laplace, Y = |X|.

Corollary 2.3 (Large deviations estimate). For any subset E ⊂ X we have

d(PX [E]‖QX [E]) ≤ D(PX‖QX)

Proof. Consider Y = 1{X ∈ E}.

Note: This method will be very useful in studying large deviations (Section 13.1 and Section 14.1) when applied to an event E which is highly likely under P but highly unlikely under Q.

2.3 Mutual information

Definition 2.3 (Mutual information).

I(X;Y ) = D(PXY ‖PXPY )

Note:

• Intuition: I(X;Y ) measures the dependence between X and Y , or, the information about X (resp. Y ) provided by Y (resp. X).

• Defined by Shannon (in a different form), in this form by Fano.

• This definition is not restricted to discrete RVs.

• I(X;Y ) is a functional of the joint distribution PXY , or equivalently, the pair (PX , PY |X).

Theorem 2.3 (Properties of I).

1. I(X;Y ) = D(PXY ‖PXPY ) = D(PY |X‖PY |PX) = D(PX|Y ‖PX |PY )

2. Symmetry: I(X;Y ) = I(Y ;X)

3. Positivity: I(X;Y ) ≥ 0; I(X;Y ) = 0 iff X ⊥⊥ Y

4. I(f(X);Y ) ≤ I(X;Y ); f one-to-one ⇒ I(f(X);Y ) = I(X;Y )

5. “More data ⇒ More info”: I(X1, X2;Z) ≥ I(X1;Z)

Proof. 1. I(X;Y ) = E log( PXY /(PXPY ) ) = E log( PY |X/PY ) = E log( PX|Y /PX ) .

2. Apply the data-processing inequality twice to the map (x, y) → (y, x) to get D(PXY ‖PXPY ) = D(PY X‖PY PX).

3. By definition and Theorem 2.1.


4. We will use the data-processing property of mutual information (to be proved shortly, see Theorem 2.5). Consider the chain of data processing: (x, y) ↦ (f(x), y) ↦ (f−1(f(x)), y). Then I(X;Y ) ≥ I(f(X);Y ) ≥ I(f−1(f(X));Y ) = I(X;Y ).

5. Consider f(X1, X2) = X1.

Theorem 2.4 (I v.s. H).

1. I(X;X) =

H(X) X discrete

+∞ otherwise

2. If X is discrete, thenI(X;Y ) = H(X)−H(X|Y ).

If both X and Y are discrete, then

I(X;Y ) = H(X) +H(Y )−H(X,Y ).

3. If X, Y are real-valued vectors, have joint pdf and all three differential entropies are finitethen

I(X;Y ) = h(X) + h(Y )− h(X,Y )

If X has marginal pdf pX and conditional pdf pX|Y (x|y) then

I(X;Y ) = h(X)− h(X|Y ) .

4. If X or Y are discrete then I(X;Y ) ≤ min (H(X), H(Y )), with equality iff H(X|Y ) = 0 orH(Y |X) = 0, i.e., one is a deterministic function of the other.

Proof. 1. By definition, I(X;X) = D(PX|X‖PX |PX) = Ex∼XD(δx‖PX). If PX is discrete, then

D(δx‖PX) = log 1PX(x) and I(X;X) = H(X). If PX is not discrete, then let A = x :

PX(x) > 0 denote the set of atoms of PX . Let ∆ = (x, x) : x 6∈ A ⊂ X × X . ThenPX,X(∆) = PX(Ac) > 0 but since

(PX × PX)(E) ,∫

XPX(dx1)

XPX(dx2)1(x1, x2) ∈ E

we have by taking E = ∆ that (PX × PX)(∆) = 0. Thus PX,X 6 PX × PX and thus

I(X;X) = D(PX,X‖PXPX) = +∞ .

2. Telescoping: E log PXYPXPY

= E[log 1

PX+ log 1

PY− log 1

PXY

].

3. Similarly, when PXY and PXPY have densities pXY and pXpY we have

D(PXY ‖PXPY ) , E[log

pXYpXpY

]= h(X) + h(Y )− h(X,Y )

4. Follows from 2.

27

Page 28: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Corollary 2.4 (Conditioning reduces entropy). X discrete: H(X|Y ) ≤ H(X), with equality iffX ⊥⊥ Y .Intuition: The amount of entropy reduction = mutual information

Example: It is important to note that conditioning reduces entropy on average, not per reliazation.

X = U ORY , where U, Yi.i.d.∼ Bern(1

2). Then X ∼ Bern(34) and H(X) = h(1

4) < 1 bit = H(X|Y = 0),i.e., conditioning on Y = 0 increases entropy. But on average, H(X|Y ) = P [Y = 0]H(X|Y =0) + P [Y = 1]H(X|Y = 1) = 1

2 bits < H(X), by the strong concavity of h(·).Note: Information, entropy and Venn diagrams:

1. The following Venn diagram illustrates the relationship between entropy, conditional entropy,joint entropy, and mutual information.

I(X;Y )H(Y |X) H(X|Y )

H(X,Y )

H(Y ) H(X)

2. If you do the same for 3 variables, you will discover that the triple intersection corresponds to

H(X1) +H(X2) +H(X3)−H(X1, X2)−H(X2, X3)−H(X1, X3) +H(X1, X2, X3) (2.2)

which is sometimes denoted by I(X;Y ;Z). It can be both positive and negative (why?).

3. In general, one can treat random variables as sets (so that the Xi corresponds to set Ei andthe pair (X1, X2) corresponds to E1 ∪ E2). Then we can define a unique signed measure µon the finite algebra generated by these sets so that every information quantity is found byreplacing

I/H → µ ;→ ∩ ,→ ∪ | → \ .As an example, we have

H(X1|X2, X3) = µ(E1 \ (E2 ∪ E3)) , (2.3)

I(X1, X2;X3|X4) = µ(((E1 ∪ E2) ∩ E3) \ E4) . (2.4)

By inclusion-exclusion, quantity (2.2) corresponds to µ(E1 ∩ E2 ∩ E3), which explains why µis not necessarily a positive measure. For an extensive discussion, see [CK81a, Chapter 1.3].

Example: Bivariate Gaussian. Let X,Y be jointly Gaus-sian. Then

I(X;Y ) =1

2log

1

1− ρ2XY

where ρXY ,E[(X−EX)(Y−EY )]

σXσY∈ [−1, 1] is the correlation

coefficient. -1 0 1ρ

I(X;Y )

28

Page 29: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Proof. WLOG, by shifting and scaling if necessary, we can assume EX = EY = 0 and EX2 =EY 2 = 1. Then ρ = EXY . By joint Gaussianity, Y = ρX + Z for some Z ∼ N (0, 1 − ρ2) ⊥⊥ X.Then using the divergence formula for Gaussians (1.17), we get

I(X;Y ) = D(PY |X‖PY |PX)

= ED(N (ρX, 1− ρ2)‖N (0, 1))

= E[

1

2log

1

1− ρ2+

log e

2

((ρX)2 + 1− ρ2 − 1

)]

=1

2log

1

1− ρ2.

Alternatively, use the differential entropy representation Theorem 2.4 and the entropy formula(1.20) for Gaussians:

I(X;Y ) = h(Y )− h(Y |X)

= h(Y )− h(Z)

=1

2log(2πe)− 1

2log(2πe(1− ρ2)) =

1

2log

1

1− ρ2

Note: Similar to the role of mutual information, the correlation coefficient also measures thedependency between random variables which are real-valued (more generally, on an inner-productspace) in certain sense. However, mutual information is invariant to bijections and more general: itcan be defined not just for numerical random variables, but also for apples and oranges.

Example: Additive white Gaussian noise (AWGN) channel. X ⊥⊥ N — independent Gaussian

+X Y

N

I(X;X +N) = 12 log

(1 +

σ2X

σ2N︸︷︷︸

signal-to-noise ratio (SNR)

)

Example: Gaussian vectors. X ∈ Rm,Y ∈ Rn — jointly Gaussian

I(X; Y) =1

2log

det ΣX det ΣY

det Σ[X,Y]

where ΣX , E [(X− EX)(X− EX)′] denotes the covariance matrix of X ∈ Rm, and Σ[X,Y] denotesthe covariance matrix of the random vector [X,Y] ∈ Rm+n.

In the special case of additive noise: Y = X + N for N ⊥⊥ X, we have

I(X; X + N) =1

2log

det(ΣX + ΣN)

det ΣN

since det Σ[X,X+N] = det(

ΣX ΣXΣX ΣX+ΣN

)why?= det ΣX det ΣN.

Example: Binary symmetric channel (BSC).

+X Y

N

X Y

1

0

1

01− δ

1− δ

δ

29

Page 30: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

X ∼ Bern(1

2

), N ∼ Bern(δ)

Y = X +N

I(X;Y ) = log 2− h(δ)

Example: Addition over finite groups. X is uniform on G and independent of Z. Then

I(X;X + Z) = log |G| −H(Z)

Proof. Show that X + Z is uniform on G regardless of Z.

2.4 Conditional mutual information and conditionalindependence

Definition 2.4 (Conditional mutual information).

I(X;Y |Z) = D(PXY |Z‖PX|ZPY |Z |PZ) (2.5)

= Ez∼PZ [I(X;Y |Z = z)] . (2.6)

where the product of two random transformations is (PX|Z=zPY |Z=z)(x, y) , PX|Z(x|z)PY |Z(y|z),under which X and Y are independent conditioned on Z.

Note: I(X;Y |Z) is a functional of PXY Z .

Remark 2.2 (Conditional independence). A family of distributions can be represented by a directedacyclic graph. A simple example is a Markov chain (path graph), which represents distributionsthat factor as PXY Z : PXY Z = PXPY |XPZ|Y .

Cond. indep.notation

X → Y → Z ⇔ PXZ|Y = PX|Y · PZ|Y⇔ PZ|XY = PZ|Y

⇔ PXY Z = PX · PY |X · PZ|Y⇔ X,Y, Z form a Markov chain

⇔ X ⊥⊥ Z|Y⇔ X ← Y → Z,PXY Z = PY · PX|Y · PZ|Y⇔ Z → Y → X

Theorem 2.5 (Further properties of Mutual Information).

1. I(X;Z|Y ) ≥ 0, with equality iff X → Y → Z

2. (Kolmogorov identity or small chain rule)

I(X,Y ;Z) = I(X;Z) + I(Y ;Z|X)

= I(Y ;Z) + I(X;Z|Y )

3. (Data Processing) If X → Y → Z, then

a) I(X;Z) ≤ I(X;Y ), with equality iff X → Z → Y .

30

Page 31: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

b) I(X;Y |Z) ≤ I(X;Y )

4. If X → Y → Z →W , then I(X;W ) ≤ I(Y ;Z)

5. (Full chain rule)

I(Xn;Y ) =

n∑

k=1

I(Xk;Y |Xk−1)

6. f and g are one-to-one ⇒ I(f(X); g(Y )) = I(X;Y )

Proof. 1. By definition and Theorem 2.3.3.

2.PXY ZPXY PZ

=PXZPXPZ

·PY |XZ

PY |X

3. Apply Kolmogorov identity to I(Y,Z;X):

I(Y,Z;X) = I(X;Y ) + I(X;Z|Y )︸ ︷︷ ︸=0

= I(X;Z) + I(X;Y |Z)

4. I(X;W ) ≤ I(X;Z) ≤ I(Y ;Z)

5. Recursive application of Kolmogorov identity.

Note: In general, I(X;Y |Z) ≷ I(X;Y ). Examples:a) “>”: Conditioning does not always decrease M.I. To find counterexamples when X,Y, Z do

not form a Markov chain, notice that there is only one directed acyclic graph non-isomorphic toX → Y → Z, namely X → Y ← Z. Then a counterexample is

X,Zi.i.d.∼ Bern(

1

2) Y = X ⊕ Z

I(X;Y ) = 0 since X ⊥⊥ YI(X;Y |Z) = I(X;X ⊕ Z|Z) = H(X) = 1 bit

b) “<”: Z = Y . Then I(X;Y |Y ) = 0.Note: (Chain rule for I ⇒ Chain rule for H) Set Y = Xn. Then H(Xn) = I(Xn;Xn) =∑n

k=1 I(Xk;Xn|Xk−1) =

∑nk=1H(Xk|Xk−1), since H(Xk|Xn, Xk−1) = 0.

Remark 2.3 (Data processing for mutual information via data processing of divergence). Weproved data processing for mutual information in Theorem 2.5 using Kolmogorov’s identity. In fact,data processing for mutual information is implied by the data processing for divergence:

I(X;Z) = D(PZ|X‖PZ |PX) ≤ D(PY |X‖PY |PX) = I(X;Y ),

where note that for each x, we have PY |X=x

PZ|Y−−−→ PZ|X=x and PYPZ|Y−−−→ PZ . Therefore if we have a

bi-variate functional of distributions D(P‖Q) which satisfies data processing, then we can define an“M.I.-like” quantity via ID(X;Y ) , D(PY |X‖PY |PX) , Ex∼PXD(PY |X=x‖PY ) which will satisfy dataprocessing on Markov chains. A rich class of examples arises by taking D = Df (an f -divergence,defined in (1.16)). That f -divergence satisfies data-processing will be proved in Remark 4.2.

31

Page 32: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

2.5 Strong data-processing inequalities

For many random transformations PY |X , it is possible to improve the data-processing inequality (2.1):For any PX , QX we have

D(PY ‖QY ) ≤ ηKLD(PX‖QX) ,

where ηKL < 1 and depends on the channel PY |X only. Similarly, this gives an improvement in thedata-processing inequality for mutual information: For any PU,X we have

U → X → Y =⇒ I(U ;Y ) ≤ ηKLI(U ;X) .

For example, for PY |X = BSC(δ) we have ηKL = (1 − 2δ)2. Strong data-processing inequalitiesquantify the intuitive observation that noise inside the channel PY |X must reduce the informationthat Y carries about the data U , regardless of how smart the mapping U 7→ X is.

This is an active area of research, see [PW17] for a short summary.

2.6* How to avoid measurability problems?

As we mentioned in Remark 2.1 conditions imposed by Definition 2.1 on PY |X are insufficient.Namely, we get the following two issues:

1. Radon-Nikodym derivatives such asdPY |X=x

dQY(y) may not be jointly measurable in (x, y)

2. Set x : PY |X=x QY may not be measurable.

The easiest way to avoid all such problems is the following:

Agreement A1: All conditional kernels PY |X : X → Y in these notes will be assumedto be defined by choosing a σ-finite measure µ2 on Y and measurable function ρ(y|x) ≥ 0on X × Y such that

PY |X(A|x) =

Aρ(y|x)µ2(dy)

for all x and measurable sets A and∫Y ρ(y|x)µ2(dy) = 1 for all x.

Notes:

1. Given another kernel QY |X specified via ρ′(y|x) and µ′2 we may first replace µ2 and µ′2 viaµ′′2 = µ2 + µ′2 and thus assume that both PY |X and QY |X are specified in terms of the same

dominating measure µ′′2. (This modifies ρ(y|x) to ρ(y|x) dµ2

dµ′′2(y).)

2. Given two kernels PY |X and QY |X specified in terms of the same dominating measure µ2 andfunctions ρP (y|x) and ρQ(y|x), respectively, we may set

dPY |X

dQY |X,ρP (y|x)

ρQ(y|x)

outside of ρQ = 0. When PY |X=x QY |X=x the above gives a version of the Radon-Nikodymderivative, which is automatically measurable in (x, y).

32

Page 33: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

3. Given QY specified asdQY = q(y)dµ2

we may set

A0 = x :

q=0ρ(y|x)dµ2 = 0

This set plays a role of x : PY |X=x QY . Unlike the latter A0 is guaranteed to bemeasurable by Fubini theorem [C11, Prop. 6.9]. By “plays a role” we mean that it allows toprove statements like: For any PX

PXY PXQY ⇐⇒ PX [A0] = 1 .

So, while our agreement resolves the two measurability problems above, it introduces a newone. Indeed, given a joint distribution PXY on standard Borel spaces, it is always true that onecan extract a conditional distribution PY |X satisfying Definition 2.1 (this is called disintegration).However, it is not guaranteed that PY |X will satisfy Agreement A1. To work around this issue aswell, we add another agreement:

Agreement A2: All joint distributions PXY are specified by means of data: µ1,µ2 –σ-finite measures on X and Y, respectively, and measurable function λ(x, y) such that

PXY (E) ,∫

Eλ(x, y)µ1(dx)µ2(dy) .

Notes:

1. Again, given a finite or countable collection of joint distributions PXY , QX,Y , . . . satisfying A2we may without loss of generality assume they are defined in terms of a common µ1, µ2.

2. Given PXY satisfying A2 we can disintegrate it into conditional (satisfying A1) and marginal:

PY |X(A|x) =

Aρ(y|x)µ2(dy) ρ(y|x) ,

λ(x, y)

p(x)(2.7)

PX(A) =

Ap(x)µ1(dx) p(x) ,

Yλ(x, η)µ2(dη) (2.8)

with ρ(y|x) defined arbitrarily for those x, for which p(x) = 0.

Remark 2.4. The first problem can also be resolved with the help of Doob’s version of Radon-Nikodym theorem [C11, Chapter V.4, Theorem 4.44]: If the σ-algebra on Y is separable (satisfiedwhenever Y is a Polish space, for example) and PY |X=x QY |X=x then there exists a jointlymeasurable version of Radon-Nikodym derivative

(x, y) 7→dPY |X=x

dQY |X=x(y)

33

Page 34: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

§ 3. Sufficient statistic. Continuity of divergence and mutual information

3.1 Sufficient statistics and data-processing

Definition 3.1 (Sufficient Statistic). Let

• P θX be a collection of distributions of X parameterized by θ

• PT |X be some probability kernel. Let P θT , PT |X P θX be the induced distribution on T foreach θ.

We say that T is a sufficient statistic (s.s.) of X for θ if there exists a transition probability kernelPX|T so that P θXPT |X = P θTPX|T , i.e., PX|T can be chosen to not depend on θ.

Note:

• With T known, one can forget X (T contains all the information that is sufficient to makeinference about θ). This is because X can be simulated from T alone with knowing θ, henceX is useless.

• Obviously any one-to-one transformation of X is sufficient. Therefore the interesting case iswhen T is a low-dimensional recap of X (dimensionality reduction)

• θ need not be a random variable (the definition does not involve any distribution on θ)

Theorem 3.1. Let θ → X → T . Then the following are equivalent

1. T is a s.s. of X for θ.

2. ∀Pθ, θ → T → X.

3. ∀Pθ, I(θ;X|T ) = 0.

4. ∀Pθ, I(θ;X) = I(θ;T ), i.e., data processing inequality for M.I. holds with equality.

Proof. Apply Theorem 2.5.

Theorem 3.2 (Fisher’s factorization criterion). For all θ ∈ Θ, let P θX have a density pθ with respectto a measure µ (e.g., discrete – pmf, continuous – pdf). Let T = T (X) be a deterministic functionof X. Then T is a s.s. of X for θ iff

pθ(x) = gθ(T (x))h(x)

for some measurable functions gθ and h, ∀θ ∈ Θ.

34

Page 35: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Proof. We only give the proof in the discrete case (continuous case∑→

∫dµ). Let t = T (x).

“⇒”: Suppose T is a s.s. of X for θ. Then pθ(x) = Pθ(X = x) = Pθ(X = x, T = t) = Pθ(X =x|T = t)Pθ(T = t) = P (X = x|T = T (x))︸ ︷︷ ︸

h(x)

Pθ(T = T (x))︸ ︷︷ ︸gθ(T (x))

“⇐”: Suppose the factorization holds. Then

Pθ(X = x|T = t) =pθ(x)∑

x 1T (x)=tpθ(x)=

gθ(t)h(x)∑x 1T (x)=tgθ(t)h(x)

=h(x)∑

x 1T (x)=th(x),

free of θ.

Example:

1. Normal mean model. Let θ ∈ R and observations Xiind∼ N (θ, 1), i ∈ [n]. Then the sample

mean X = 1n

∑j Xj is a s.s. of Xn for θ. [Exercise: verify that P θXn factorizes.]

2. Coin flips. Let Bii.i.d.∼ Bern(θ). Then

∑ni=1Bi is a s.s. of Bn for θ.

3. Uniform distribution. Let Uii.i.d.∼ uniform[0, θ]. Then maxi∈[n] Ui is a s.s. of Un for θ.

Example: Binary hypothesis testing. θ = 0, 1. Given θ = 0 or 1, X ∼ PX or QX . Then Y – theoutput of PY |X – is a s.s. of X for θ iff D(PX|Y ‖QX|Y |PY ) = 0, i.e., PX|Y = QX|Y holds PY -a.s.Indeed, the latter means that for kernel QX|Y we have

PXPY |X = PYQX|Y and QXPY |X = QYQX|Y ,

which is precisely the definition of s.s. when θ ∈ 0, 1. This example explains the condition forequality in the data-processing for divergence:

X Y

P X

Q Y

P Y

Q X

P Y | X

Then assuming D(PY ‖QY ) <∞ we have:

D(PX‖QX) = D(PY ‖QY ) ⇐⇒ Y is a s.s. for testing PX vs. QX

Proof. Let QXY = QXPY |X , PXY = PXPY |X , then

D(PXY ‖QXY ) = D(PY |X‖QY |X |PX)︸ ︷︷ ︸

=0

+D(PX‖QX)

= D(PX|Y ‖QX|Y |PY ) +D(PY ‖QY )

≥ D(PY ‖QY )

with equality iff D(PX|Y ‖QX|Y |PY ) = 0, which is equivalent to Y being a s.s. for testing PX vs QXas desired.

35

Page 36: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

3.2 Geometric interpretation of mutual information

Mutual information can be understood as the “weighted distance” from the conditional distributionto the output distribution. For example, for discrete X:

I(X;Y ) = D(PY |X‖PY |PX) =∑

x

D(PY |X=x‖PY )PX(x)

Theorem 3.3 (Golden formula). ∀QY such that D(PY ‖QY ) <∞

I(X;Y ) = D(PY |X‖QY |PX)−D(PY ‖QY )

Proof. I(X;Y ) = E logPY |XQYPY QY

, group PY |X and QY .

Corollary 3.1 (Mutual information as center of gravity).

I(X;Y ) = minQ

D(PY |X‖Q|PX),

achieved at Q = PY .

Note: This representation is useful to bound mutual information from above by choosing some Q.

Theorem 3.4 (mutual information as distance to product distributions).

I(X;Y ) = minQX ,QY

D(PXY ‖QXQY )

Proof. I(X;Y ) = E log PXY QXQYPXPY QXQY

, group PXY and QXQY and bound marginal divergencesD(PX‖QX) and D(PY ‖QY ) by zero.

Note: Generalization to conditional mutual information.

I(X;Z|Y ) = minQXY Z :X→Y→Z

D(PXY Z‖QXY Z)

Proof. By chain rule,

D(PXY Z‖QXQY |XQZ|Y )

=D(PXY Z‖PXPY |XPZ|Y ) +D(PX‖QX) +D(PY |X‖QY |X |PX) +D(PZ|Y ‖QZ|Y |PY )

=D(PXY Z‖PY PX|Y PZ|Y ) + . . .

=D(PXZ|Y ‖PX|Y PZ|Y |PY )︸ ︷︷ ︸

I(X;Z|Y )

+ . . .

Interpretation: The most general graphical model for the triplet (X,Y, Z) is a 3-clique (triangle).What is the information flow on the edge X → Z? To answer, notice that removing this edgerestricts possible joint distributions to a Markov chain X → Y → Z. Thus, it is natural to ask whatis the minimum distance between a given PX,Y,Z and the set of all distributions QX,Y,Z satisfyingthe Markov chain constraint. By the above calculation, optimal QX,Y,Z = PY PX|Y PZ|Y and hencethe distance is I(X;Z|Y ). It is natural to interpret this number as the information flowing on theedge X → Z.

36

Page 37: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

3.3 Variational characterizations of divergence:Donsker-Varadhan

Why variational characterization (sup- or inf-representation): F (x) = supλ∈Λ fλ(x)

1. Regularity, e.g., recall

a) Pointwise supremum of convex functions is convex

b) Pointwise supremum of lower semicontinuous (lsc) functions is lsc

2. Give bounds by choosing a (suboptimal) λ

Theorem 3.5 (Donsker-Varadhan). Let P,Q be probability measures on X and let C denote theset of functions f : X → R such that EQ[expf(X)] <∞. If D(P‖Q) <∞ then for every f ∈ Cexpectation EP [f(X)] exists and furthermore

D(P‖Q) = supf∈C

EP [f(X)]− logEQ[expf(X)] . (3.1)

Proof. “≤”: take f = log dPdQ .

“≥”: Fix f ∈ C and define a probability measure Qf (tilted version of Q) via Qf (dx) ,expf(x)Q(dx)∫X expf(x)Q(dx)

, or equivalently,

Qf (dx) = expf(x)− ZfQ(dx) , Zf , logEQ[expf(X)] .

Then, obviously Qf Q and we have

EP [f(X)]− Zf = EP[log

dQf

dQ

]= EP

[log

dPdQf

dQdP

]= D(P‖Q)−D(P‖Qf ) ≤ D(P‖Q) .

Remark 3.1. 1. What is Donsker-Varadhan good for? By setting f(x) = ε · g(x) with ε 1and linearizing exp and log we can see that when D(P‖Q) is small, expectations under P canbe approximated by expectations over Q (change of measure): EP [g(X)] ≈ EQ[g(X)]. Thisholds for all functions g with finite exponential moment under Q. Total variation distanceprovides a similar bound, but for a narrower class of bounded functions:

|EP [g(X)]− EQ[g(X)]| ≤ ‖g‖∞TV(P,Q) .

2. More formally, inequality EP [f(X)] ≤ logEQ[exp f(X)] + D(P‖Q) is useful in estimatingEP [f(X)] for complicated distribution P (e.g. over large-dimensional vector Xn with lots ofweak inter-coordinate dependencies) by making a smart choice of Q (e.g. with iid components).

3. In the next lecture we will show that P 7→ D(P‖Q) is convex. A general method of obtainingvariational formulas like (3.1) is by Young-Fenchel duality. Indeed, (3.1) is exactly thisinequality since the Fenchel-Legendre conjugate of D(·‖Q) is given by a convex map f 7→ Zf .

Theorem 3.6 (Weak lower-semicontinuity of divergence). Let X be a metric space with Borelσ-algebra H. If Pn and Qn converge weakly (in distribution) to P , Q, then

D(P‖Q) ≤ lim infn→∞

D(Pn‖Qn) . (3.2)

37

Page 38: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Proof. First method: On a metric space X bounded continuous functions (Cb) are dense in the setof all integrable functions. Then in Donsker-Varadhan (3.1) we can replace C by Cb to get

D(Pn‖Qn) = supf∈Cb

EPn [f(X)]− logEQn [expf(X)] .

Recall Pn → P weakly if and only if EPnf(X)→ EP f(X) for all f ∈ Cb. Taking the limit concludesthe proof.

Second method (less mysterious): Let A be the algebra of Borel sets E whose boundary haszero (P +Q) measure, i.e.

A = E ∈ H : (P +Q)(∂E) = 0 .By the property of weak convergence Pn and Qn converge pointwise on A. Thus by (3.10) we have

D(PA‖QA) ≤ limn→∞

D(Pn,A‖Qn,A)

If we show A is (P +Q)-dense in H, we are done by (3.9). To get an idea, consider X = R. Thenopen sets are (P +Q)-dense in H (since finite measures are regular), while the algebra F generatedby open intervals is (P +Q)-dense in the open sets. Since there are at most countably many pointsa ∈ X with P (a) +Q(a) > 0, we may further approximate each interval (a, b) whose boundary hasnon-zero (P +Q) measure by a slightly larger interval from A.

Note: In general, D(P‖Q) is not continuous in either P or Q. Example: Let B1, . . . , Bni.i.d.∼ ±1

equiprobably. Then by central limit theorem, Sn = 1√n

∑ni=1Bi

D−→N (0, 1). But

D( PSn︸︷︷︸discrete

‖N (0, 1)︸ ︷︷ ︸cont’s

) =∞

for all n. Note that this is an example for strict inequality in (3.2).Note: Why do we care about continuity of information measures? Let’s take divergence as anexample.

1. Computation. For complicated P and Q direct computation of D(P‖Q) might be hard. Instead,one may want to discretize them compute numerically. Question: Is this procedure stable,i.e., as the quantization becomes finer, does this procedure guarantee to converge to the truevalue? Yes! Continuity w.r.t. discretization is guaranteed by the next theorem.

2. Estimating information measures. In many statistical setups, oftentimes we do not know P orQ, if we estimate the distribution from data (e.g., estimate P by empirical distribution Pnfrom n samples) and then plug in, does D(Pn‖Q) provide a good estimator for D(P‖Q)? Well,note from the first example that this is a bad idea if Q is continuous, since D(Pn‖Q) = ∞for n. In fact, if one convolves the empirical distribution with a tiny bit of, say, Gaussiandistribution, then it will always have a density. If we allow the variance of the Gaussian tovanish with n appropriately, we will have convergence. This leads to the idea of kernel densityestimators. All these need regularity properties of divergence.

3.4 Variational characterizations of divergence:Gelfand-Yaglom-Perez

The point of the following theorem is that divergence on general alphabets can be defined viadivergence on finite alphabets and discretization. Moreover, as the quantization becomes finer, weapproach the value of divergence.

38

Page 39: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Theorem 3.7 (Gelfand-Yaglom-Perez [GKY56]). Let P,Q be two probability measures on X withσ-algebra F . Then

D(P‖Q) = supE1,...,En

n∑

i=1

P [Ei] logP [Ei]

Q[Ei], (3.3)

where the supremum is over all finite F-measurable partitions:⋃nj=1Ej = X , Ej ∩ Ei = ∅, and

0 log 10 = 0 and log 1

0 =∞ per our usual convention.

Remark 3.2. This theorem, in particular, allows us to prove all general identities and inequalitiesfor the cases of discrete random variables.

Proof. “≥”: Fix a finite partition E1, . . . En. Define a function (quantizer/discretizer) f : X →1, . . . , n as follows: For any x, let f(x) denote the index j of the set Ej to which X belongs. LetX be distributed according to either P or Q and set Y = f(X). Applying data processing inequalityfor divergence yields

D(P‖Q) = D(PX‖QX)

≥ D(PY ‖QY ) (3.4)

=∑

i

P [Ei] logP [Ei]

Q[Ei].

“≤”: To show D(P‖Q) is indeed achievable, first note that if P 6 Q, then by definition, thereexists B such that Q(B) = 0 < P (B). Choosing the partition E1 = B and E2 = Bc, we have

D(P‖Q) =∞ =∑2

i=1 P [Ei] log P [Ei]Q[Ei]

. In the sequel we assume that P Q, hence the likelihood

ratio dPdQ is well-defined. Let us define a partition of X by partitioning the range of log dP

dQ : Ej = x :

log dPdQ ∈ ε · [j − n/2, j + 1− n/2), j = 1, . . . , n− 1 and En = x : log dP

dQ < ε(1− n/2) or log dPdQ ≥

εn/2).1Note that on Ej , log dPdQ ≤ ε(j + 1 − n/2) ≤ log

P (Ej)Q(Ej)

+ ε. Hence∑n−1

j=1

∫EjdP log dP

dQ ≤∑n−1j=1 εP (Ej) + P (Ej) log

P (Ej)Q(Ej)

≤ ε+∑n

j=1 εP (Ej) + P (Ej) logP (Ej)Q(Ej)

+ P (En) log 1P (En) . In other

words,∑n

j=1 P (Ej) logP (Ej)Q(Ej)

≥∫EcndP log dP

dQ − ε − P (En) log 1P (En) . Let n → ∞ and ε → 0 be

such that nε → ∞ (e.g., ε = 1/√n). The proof is complete by noting that P (En) → 0 and∫

1| log dP

dQ|≤εn

dP log dPdQ

εn↑∞−−−→∫dP log dP

dQ = D(P‖Q).

3.5 Continuity of divergence. Dependence on σ-algebra.

For a finite alphabet X it is easy to establish the continuity of entropy and divergence:

Proposition 3.1. Let X be finite. Fix a distribution Q on X with Q(x) > 0 for all x ∈ X . Thenthe fmap

P 7→ D(P‖Q)

is continuous. In particular,P 7→ H(P ) (3.5)

is continuous.

1Intuition: The main idea is to note that the loss in the inequality (3.4) is in fact D(PX‖QX) = D(PY ‖QY ) +D(PX|Y ‖QX|Y |PY ), and we want to show that the conditional divergence is small. Note that PX|Y=j = PX|X∈Ej

and QX|Y=j = QX|X∈Ej. Hence

dPX|Y =j

dQX|Y =j= dP

dQ

Q(Ej)

P (Ej)1Ej . Once we partitioned the likelihood ratio sufficiently finely,

these two conditional distribution are very close to each other.

39

Page 40: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Warning: Divergence is never continuous in the pair, even for finite alphabets, e.g.: d( 1n‖2−n) 6→

0.

Proof. Notice that

D(P‖Q) =∑

x

P (x) logP (x)

Q(x)

and each term is a continuous function of P (x).

Our next goal is to study continuity properties of divergence for general alphabets. First,however, we need to understand dependence on the σ-algebra of the space. Indeed, divergenceD(P‖Q) implicitly depends on the σ-algebra F defining the measurable space (X ,F). To emphasizethe dependence on F we will write

D(PF‖QF ) .

Shortly, we will prove that divergence is continuous under monotone limits:

Fn F =⇒ D(PFn‖QFn) D(PF‖QF ) (3.6)

Fn F =⇒ D(PFn‖QFn) D(PF‖QF ) (3.7)

For establishing the first result, it will be convenient to extend the definition of the divergenceD(PF‖QF ) to any algebra of sets F and two positive additive set-functions P,Q on F . For this wetake (3.3) as the definition. Note that when F is not a σ-algebra or P,Q are not σ-additive, we donot have Radon-Nikodym theorem and thus our original definition is not applicable.

Corollary 3.2 (Measure-theoretic properties of divergence). Let P,Q be probability measures onthe measurable space (X ,H). Assume all algebras below are sub-algebras of H. Then:

• (Monotonicity) If F ⊆ G then

D(PF‖QF ) ≤ D(PG‖QG) . (3.8)

• Let F1 ⊆ F2 . . . be an increasing sequence of algebras and let F =⋃nFn be their limit, then

D(PFn‖QFn) D(PF‖QF ) .

• If F is (P +Q)-dense in G then2

D(PF‖QF ) = D(PG‖QG) . (3.9)

• (Monotone convergence theorem) Let F1 ⊆ F2 . . . be an increasing sequence of algebras and letF =

∨nFn be the σ-algebra generated by them, then

D(PFn‖QFn) D(PF‖QF ) .

In particular,D(PX∞‖QX∞) = lim

n→∞D(PXn‖QXn) .

2Note: F is µ-dense in G if ∀E ∈ G, ε > 0∃E′ ∈ F s.t. µ[E∆E′] ≤ ε.

40

Page 41: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

• (Lower-semicontinuity of divergence) If Pn → P and Qn → Q pointwise on the algebra F ,then3

D(PF‖QF ) ≤ lim infn→∞

D(Pn,F‖Qn,F ) . (3.10)

Proof. Straightforward applications of (3.3) and the observation that any algebra F is µ-dense inthe σ-algebra σF it generates, for any µ on (X ,H).4

Note: Pointwise convergence on H is weaker than convergence in total variation and stronger thanconvergence in distribution (aka “weak convergence”). However, (3.10) can be extended to thismode of convergence (see Theorem 3.6).

Finally, we address the continuity under the decreasing σ-algebra, i.e. (3.7).

Proposition 3.2. Let Fn F be a sequence of decreasing σ-algebras and P,Q two probabilitymeasures on F0. If D(PF0‖QF0) <∞ then we have

D(PFn‖QFn) D(PF‖QF ) (3.11)

The condition D(PF0‖QF0) <∞ can not be dropped, cf. the example after (3.16).

Proof. Let X−n = dPdQ

∣∣∣Fn

. Since X−n = EQ[dPdQ

∣∣∣Fn], we have that (. . . , X−1, X0) is a uniformly

integrable martingale. By the martingale convergence theorem in reversed time, cf. [C11, Theorem5.4.17], we have almost surely

X−n → X−∞ ,dP

dQ

∣∣∣∣F. (3.12)

We need to prove thatEQ[X−n logX−n]→ EQ[X−∞ logX−∞] .

We will do so by decomposing x log x as follows

x log x = x log+ x+ x log− x ,

where log+ x = max(log x, 0) and log− x = min(log x, 0). Since x log− x is bounded, we have fromthe bounded convergence theorem:

EQ[X−n log−X−n]→ EQ[X−∞ log−X−∞]

To prove a similar convergence for log+ we need to notice two things. First, the function

x 7→ x log+ x

is convex. Second, for any non-negative convex function φ s.t. E[φ(X0)] < ∞ the collectionZn = φ(E[X0|Fn]), n ≥ 0 is uniformly integrable. Indeed, we have from Jensen’s inequality

P[Zn > c] ≤ 1

cE[φ(E[X0|Fn])] ≤ E[φ(X0)]

c3Pn → P pointwise on some algebra F if ∀E ∈ F : Pn[E]→ P [E].4This may be shown by transfinite induction: to each ordinal ω associate an algebra Fω generated by monotone

limits of sets from Fω′ with ω′ < ω. Then σF = Fω0 , where ω0 is the first ordinal for which Fω is a monotone class.But F is µ-dense in each Fω by transfinite induction.

41

Page 42: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

and thus P[Zn > c]→ 0 as c→∞. Therefore, we have again by Jensen’s

E[Zn1Zn > c] ≤ E[φ(X0)1Zn > c]→ 0 c→∞ .

Finally, since X−n log+X−n is uniformly integrable, we have from (3.12)

EQ[X−n log−X−n]→ EQ[X−∞ log−X−∞]

and this concludes the proof.

3.6 Variational characterizations and continuity of mutualinformation

Again, similarly to Proposition 3.1, it is easy to show that in the case of finite alphabets mutualinformation is continuous in the distribution:

Proposition 3.3. Let X and Y be finite alphabets. Then

PX,Y 7→ I(X;Y )

is continuous.

Proof. Apply representation

I(X;Y ) = H(X) +H(Y )−H(X,Y )

and (3.5).

Further properties of mutual information follow from I(X;Y ) = D(PXY ‖PXPY ) and correspond-ing properties of divergence, e.g.

1.I(X;Y ) = sup

fE[f(X,Y )]− logE[expf(X, Y )] ,

where Y is a copy of Y , independent of X and supremum is over bounded, or even boundedcontinuous functions.

2. If (Xn, Yn)d→ (X,Y ) converge in distribution, then

I(X;Y ) ≤ lim infn→∞

I(Xn;Yn) . (3.13)

• Good example of strict inequality: Xn = Yn = 1nZ. In this case (Xn, Yn)

d→ (0, 0) butI(Xn;Yn) = H(Z) > 0 = I(0; 0).

• Even crazier example: Let (Xp, Yp) be uniformly distributed on the unit `p-ball on the

plane: x, y : |x|p + |y|p ≤ 1. Then as p → 0, (Xp, Yp)d→ (0, 0), but I(Xp;Yp) → ∞!

(Homework)

42

Page 43: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

3.

I(X;Y ) = supEi×Fj

i,j

PXY [Ei × Fj ] logPXY [Ei × Fj ]PX [Ei]PY [Fj ]

,

where supremum is over finite partitions of spaces X and Y.5

4. (Monotone convergence I):

I(X∞;Y ) = limn→∞

I(Xn;Y ) (3.14)

I(X∞;Y∞) = limn→∞

I(Xn;Y n) (3.15)

This implies that all mutual information between two-processes X∞ and Y∞ is contained intheir finite-dimensional projections, leaving nothing for the tail σ-algebra.

5. (Monotone convergence II): Let Xtail be a random variable such that σ(Xtail) =⋂n≥1 σ(X∞n ).

ThenI(Xtail;Y ) = lim

n→∞I(X∞n ;Y ) , (3.16)

whenever the right-hand side is finite. This is a consequence of Prop. 3.2. Without the finiteness

assumption the statement is incorrect. Indeed, consider Xji.i.d.∼ Bern(1/2) and Y = X∞0 . Then

each I(X∞n ;Y ) =∞, but Xtail = const a.e. by Kolmogorov’s 0-1 law, and thus the left-handside of (3.16) is zero.

5To prove this from (3.3) one needs to notice that algebra of measurable rectangles is dense in the productσ-algebra.

43

Page 44: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

§ 4. Extremization of mutual information: capacity saddle point

4.1 Convexity of information measures

Theorem 4.1. (P,Q) 7→ D(P‖Q) is convex.

Proof. First proof : Let X ∼ Bern(λ). Define two conditional kernels:

PY |X=0 = P0, PY |X=1 = P1

QY |X=0 = Q0, QY |X=1 = Q1

Conditioning increases divergence, hence

λD(P0‖Q0) + λD(P1‖Q1) = D(PY |X‖QY |X |PX) ≥ D(PY ‖QY ) = D(λP0 + λP1‖λQ0 + λQ1).

Second proof : (p, q) 7→ p log pq is convex on R2

+ [Verify by computing the Hessian matrix and

showing that it is positive semidefinite].1

Third proof : By the Donsker-Varadhan variational representation,

D(P‖Q) = supf∈C

EP [f(X)]− logEQ[expf(X)] .

where for fixed f , P 7→ EP [f(X)] is affine, Q 7→ logEQ[expf(X)] is concave. Therefore (P,Q) 7→D(P‖Q) is a pointwise supremum of convex functions, hence convex.

Remark 4.1. The first proof shows that for an arbitrary measure of similarity D(P‖Q) convexityof (P,Q) 7→ D(P‖Q) is equivalent to “conditioning increases divergence” property of D. Convexitycan also be understood as “mixing decreases divergence”.

Remark 4.2 (f -divergences). Any f -divergence, cf. (1.16), satisfies all the key properties of theusual divergence: positivity, monotonicity, data processing (DP), conditioning increases divergence(CID) and convexity in the pair. Indeed, by previous remark the last two are equivalent. Furthermore,proof of Theorem 2.2 showed that DP and CID are implied by monotonicity. Thus, consider PXYand QXY and note

Df (PXY ‖QXY ) = EQXY

[f

(PXYQXY

)](4.1)

= EQY EQX|Y

[f

(PYQY·PX|Y

QX|Y

)](4.2)

≥ EQY

[f

(PYQY

)], (4.3)

where inequality follows by applying Jensen’s inequality to the convex function f . Finally, positivityfollows from Jensen’s inequality and recalling that f(1) = 0 by assumption and EQY [ PYQY ] = 1.

1This is a general phenomenon: for a convex f(·) the perspective function (p, q) 7→ qf(pq

)is convex too.

44

Page 45: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Theorem 4.2 (Entropy). PX 7→ H(PX) is concave.

Proof. If PX is on a finite alphabet, then proof is complete by H(X) = log |X | − D(PX‖UX).Otherwise, set

PX|Y =

P0 Y = 0P1 Y = 1

, P (Y = 0) = λ

Then apply H(X|Y ) ≤ H(X).

Recall that I(X,Y ) is a function of PXY , or equivalently, (PX , PY |X). Denote I(PX , PY |X) =I(X;Y ).

Theorem 4.3 (Mutual Information).

• For fixed PY |X , PX 7→ I(PX , PY |X) is concave.

• For fixed PX , PY |X 7→ I(PX , PY |X) is convex.

Proof.

• First proof : Introduce θ ∈ Bern(λ). Define PX|θ=0 = P 0X and PX|θ=1 = P 1

X . Then θ → X → Y .Then PX = λP 0

X + λP 1X . I(X;Y ) = I(X, θ;Y ) = I(θ;Y ) + I(X;Y |θ) ≥ I(X;Y |θ), which is

our desired I(λP 0X + λP 1

X , PY |X) ≥ λI(P 0X , PY |X) + λI(P 0

X , PY |X).

Second proof : I(X;Y ) = minQD(PY |X‖Q|PX) – pointwise minimum of affine functions isconcave.

Third proof : Pick a Q and use the golden formula: I(X;Y ) = D(PY |X‖Q|PX)−D(PY ‖Q),where PX 7→ D(PY ‖Q) is convex, as the composition of the PX 7→ PY (affine) and PY 7→D(PY ‖Q) (convex).

• I(X;Y ) = D(PY |X‖PY |PX) and D is convex in the pair.

4.2* Local behavior of divergence

Due to the smoothness of the function (p, q) 7→ p log pq at (1, 1) it is natural to expect that the

functionalP 7→ D(P‖Q)

should also be smooth as P → Q. Due to non-negativity and convexity, it is then also natural toexpect that this functional decays quadratically. Next, we show that the decay is always sublinear;furthermore, under the assumption that χ2(P‖Q) <∞ it is indeed quadratic.

Proposition 4.1. When D(P‖Q) <∞, the one-sided derivative in λ = 0 vanishes:

d

∣∣∣λ=0

D(λP + λQ‖Q) = 0

If we exchange the arguments, the criterion is even simpler:

d

∣∣∣λ=0

D(Q‖λP + λQ) = 0 ⇐⇒ P Q (4.4)

45

Page 46: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Proof.1

λD(λP + λQ‖Q) = EQ

[1

λ(λf + λ) log(λf + λ)

]

where f =dP

dQ. As λ→ 0 the function under expectation decreases to (f − 1) log e monotonically.

Indeed, the functionλ 7→ g(λ) , (λf + λ) log(λf + λ)

is convex and equals zero at λ = 0. Thus g(λ)λ is increasing in λ. Moreover, by the convexity of

x 7→ x log x:1

λ(λf + λ)(log(λf + λ)) ≤ 1

λ(λf log f + λ1 log 1) = f log f

and by assumption f log f is Q-integrable. Thus the Monotone Convergence Theorem applies.To prove (4.4) first notice that if P 6 Q then there is a set E with p = P [E] > 0 = Q[E].

Applying data-processing for divergence to X 7→ 1E(X), we get

D(Q‖λP + λQ) ≥ d(0‖λp) = log1

1− λp

and derivative is non-zero. If P Q, then let f = dPdQ and notice simple inequalities

log λ ≤ log(λ+ λf) ≤ λ(f − 1) log e .

Dividing by λ and assuming λ < 12 we get for some absolute constants c1, c2:

∣∣∣∣1

λlog(λ+ λf)

∣∣∣∣ ≤ c1f + c2 .

Thus, by the dominated convergence theorem we get

1

λD(Q‖λP + λQ) = −

∫dQ

(1

λlog(λ+ λf)

)λ→0−−−→

∫dQ(1− f) = 0 .

Remark 4.3. More generally, under suitable technical conditions,

d

∣∣∣λ=0

D(λP + λQ‖R) = EP[log

dQ

dR

]−D(Q‖R)

and

d

∣∣∣λ=0

D(λP1 + λQ1‖λP0 + λQ0) = EQ1

[log

dP1

dP0

]−D(P1‖P0) + EP1

[1− dQ0

dP0

]log e.

The message of Proposition 4.1 is that the function

λ 7→ D(λP + λQ‖Q) ,

is o(λ) as λ→ 0. In fact, in most cases it is quadratic in λ. To make a precise statement, we needto define the concept of χ2-divergence – a version of f -divergence (1.16):

χ2(P‖Q) ,∫dQ

(dP

dQ− 1

)2

.

46

Page 47: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

This is a very popular measure of distance between P and Q, frequently used in statistics. It hasmany important properties, but we will only mention that χ2 dominates KL-divergence:

D(P‖Q) ≤ log(1 + χ2(P‖Q)) .

Our second result about local properties of KL-divergence is the following:

Proposition 4.2 (KL is locally χ2-like). We have

lim infλ→0

1

λ2D(λP + λQ‖Q) =

log e

2χ2(P‖Q) , (4.5)

where both sides are finite or infinite simultaneously.

Proof. First, we assume that χ2(P‖Q) <∞ and prove

D(λP + λQ‖Q) =λ2 log e

2χ2(P‖Q) + o(λ2) , λ→ 0 .

To that end notice that

D(P‖Q) = EQ[g

(dP

dQ

)],

whereg(x) , x log x− (x− 1) log e .

Note that x 7→ g(x)(x−1)2 log e

=∫ 1

0sds

x(1−s)+s is decreasing in x on (0,∞). Therefore

0 ≤ g(x) ≤ (x− 1)2 log e ,

and hence

0 ≤ 1

λ2g

(λ+ λ

dP

dQ

)≤(dP

dQ− 1

)2

log e.

By the dominated convergence theorem (which is applicable since χ2(P‖Q) <∞) we have

limλ→0

1

λ2EQ[g

(λ+ λ

dP

dQ

)]=g′′(1)

2EQ

[(dP

dQ− 1

)2]

=log e

2χ2(P‖Q) .

Second, we show that unconditionally

lim infλ→0

1

λ2D(λP + λQ‖Q) ≥ log e

2χ2(P‖Q) . (4.6)

Indeed, this follows from Fatou’s lemma:

lim infλ→0

EQ[

1

λ2g

(λ+ λ

dP

dQ

)]≥ EQ

[lim infλ→0

g

(λ+ λ

dP

dQ

)]=

log e

2χ2(P‖Q) .

Therefore, from (4.6) we conclude that if χ2(P‖Q) =∞ then so is the LHS of (4.5).

47

Page 48: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

4.3* Local behavior of divergence and Fisher information

Consider a parameterized set of distributions Pθ, θ ∈ Θ and assume Θ is an open subset of Rd.Furthermore, suppose that distribution Pθ are all given in the form of

Pθ(dx) = f(x|θ)µ(dx) ,

where µ is some common dominating measure (e.g. Lebesgue or counting). If for a fixed x functionsθ → f(x|θ) are smooth, one can define Fisher information matrix with respect to parameter θ as

JF (θ) , EX∼Pθ[V V T

], V , ∇θ log f(X|θ) . (4.7)

Under suitable regularity conditions, Fisher information matrix has several equivalent expressions:

JF (θ) = covX∼Pθ [∇θ log f(X|θ)] (4.8)

= (4 log e)

∫µ(dx)(∇θ

√f(x|θ))(∇θ

√f(x|θ))T (4.9)

= −(log e)Eθ[Hessθ(log f(X|θ))] , (4.10)

where the latter is obtained by differentiating

0 =

∫µ(dx)f(x|θ) ∂

∂θilog f(x|θ)

in θj .Trace of this matrix is called Fisher information and similarly can be expressed in a variety of

forms:

tr JF (θ) =

∫µ(dx)

‖∇θf(x|θ)‖2f(x|θ) (4.11)

= 4

∫µ(dx)‖∇θ

√f(x|θ)‖2 (4.12)

= −(log e) · EX∼Pθ

[d∑

i=1

∂2

∂θi∂θilog f(X|θ)

], (4.13)

Significance of Fisher information matrix arises from the fact that it gauges the local behaviourof divergence for smooth parametric families. Namely, we have (again under suitable technicalconditions):2

D(Pθ0‖Pθ0+ξ) =1

2 log eξTJF (θ0)ξ + o(‖ξ‖2) , (4.14)

which is obtained by integrating the Taylor expansion:

log f(x|θ0 + ξ) = log f(x|θ0) + ξT∇θ log f(x|θ0) +1

2ξTHessθ(log f(x|θ0))ξ + o(‖ξ‖2) .

Property (4.14) is of paramount importance in statistics. We should remember it as: Divergenceis locally quadratic on the parameter space, with Hessian given by the Fisher information matrix.

2To illustrate the subtlety here, consider a scalar location family, i.e. f(x|θ) = f0(x− θ) with f0 – some density.

In this case Fisher information JF (θ0) = (log e)2∫ (f ′0)2

f0does not depend on θ0 and is well-defined even for compactly

supported f0, provided f ′0 vanishes at the endpoints sufficiently fast. But at the same time the left-hand side of (4.14)is infinite for any ξ > 0. In such cases, a better interpretation for Fisher information is as the coefficient of the

expansion D(Pθ0‖ 12Pθ0 + 1

2Pθ0+ξ) = ξ2

8 log eJF + o(ξ2).

48

Page 49: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Remark 4.4. It can be seen that if one introduces another parametrization θ ∈ Θ by means of asmooth invertible map Θ→ Θ, then Fisher information matrix changes as

JF (θ) = ATJF (θ)A , (4.15)

where A = dθdθ

is the Jacobian of the map. So we can see that JF transforms similarly to the metrictensor in Riemannian geometry. This idea can be used to define a Riemannian metric on the spaceof parameters Θ, called the Fisher-Rao metric. This is explored in a field known as informationgeometry [AN07].

Example: Consider Θ to be the interior of a simplex of all distributions on a finite alphabet0, . . . , d. We will take θ1, . . . , θd as free parameters and set θ0 = 1−∑d

i=1 θi. So all derivativesare with respect to θ1, . . . , θd only. Then we have

Pθ(x) = f(x|θ) =

θx, x = 1, . . . , d

1−∑x 6=0 θx, x = 0

and for Fisher information matrix we get

JF (θ) = (log2 e)

diag(

1

θ1, . . . ,

1

θd) +

1

1−∑di=1 θi

1 · 1T, (4.16)

where 1 · 1T is the d× d matrix of all ones. For future reference, we also compute determinant ofJF (θ). To that end notice that det(A + xyT ) = detA · det(I + A−1xyT ) = detA · (1 + yTA−1x),where we used the identity det(I +AB) = det(I +BA). Thus, we have

det JF (θ) = (log e)2dd∏

x=0

1

θx= (log e)2d 1

1−∑dx=1 θx

d∏

x=1

1

θx. (4.17)

4.4 Extremization of mutual information

Two problems of interest

• Fix PY |X → maxPX

I(X;Y ) — channel coding (Part IV)

Note: This maximum is called “capacity” of a set of distributions PY |X=x, x ∈ X.

• Fix PX → minPY |X

I(X;Y ) — lossy compression (Part V)

Theorem 4.4 (Saddle point). Let P be a convex set of distributions on X . Suppose there existsP ∗X ∈ P such that

supPX∈P

I(PX , PY |X) = I(P ∗X , PY |X) , C

and let P ∗XPY |X−−−→ P ∗Y . Then for all PX ∈ P and for all QY , we have

D(PY |X‖P ∗Y |PX) ≤ D(PY |X‖P ∗Y |P ∗X) ≤ D(PY |X‖QY |P ∗X). (4.18)

Note: P ∗X (resp., P ∗Y ) is called a capacity-achieving input (resp., output) distribution, or a caid(resp., the caod).

49

Page 50: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Proof. Right inequality: obvious from C = I(P ∗X , PY |X) = minQY D(PY |X‖QY |P ∗X).Left inequality: If C =∞, then trivial. In the sequel assume that C <∞, hence I(PX , PY |X) <

∞ for all PX ∈ P. Let PXλ = λPX + λP ∗X ∈ P by convexity of P, and introduce θ ∼ Bern(λ), sothat PXλ|θ=0 = P ∗X , PXλ|θ=1 = PX , and θ → Xλ → Yλ. Then

C ≥ I(Xλ;Yλ) = I(θ,Xλ;Yλ) = I(θ;Yλ) + I(Xλ;Yλ|θ)= D(PYλ|θ‖PYλ |Pθ) + λI(PX , PY |X) + λC

= λD(PY ‖PYλ) + λD(P ∗Y ‖PYλ) + λI(PX , PY |X) + λC

≥ λD(PY ‖PYλ) + λI(PX , PY |X) + λC.

Since I(PX , PY |X) <∞, we can subtract it to obtain

λ(C − I(PX , PY |X)) ≥ λD(PY ‖PYλ).

Dividing both sides by λ, taking the lim inf and using lower semicontinuity of D, we have

C − I(PX , PY |X) ≥ lim infλ→0

D(PY ‖PYλ) ≥ D(PY ‖P ∗Y )

=⇒ C ≥ I(PX , PY |X) +D(PY ‖P ∗Y ) = D(PY |X‖PY |PX) +D(PY ‖P ∗Y ) = D(PY |X‖P ∗Y |PX).

Here is an even shorter proof:

C ≥ I(Xλ;Yλ) = D(PY |X‖PYλ |PXλ)

= λD(PY |X‖PYλ |PX) + λD(PY |X‖PYλ |P ∗X)

≥ λD(PY |X‖PYλ |PX) + λC

= λD(PX,Y ‖PXPYλ) + λC ,

where inequality is by the right part of (4.18) (already shown). Thus, subtracting λC and dividingby λ we get

D(PX,Y ‖PXPYλ) ≤ Cand the proof is completed by taking lim infλ→0 and applying the lower semincontinuity of divergence.

Corollary 4.1. In addition to the assumptions of Theorem 4.4, suppose C <∞. Then caod P ∗Y isunique. It satisfies the property that for any PY induced by some PX ∈ P (i.e. PY = PY |X PX) wehave

D(PY ‖P ∗Y ) ≤ C <∞ (4.19)

and in particular PY P ∗Y .

Proof. The statement is: I(PX , PY |X) = C ⇒ PY = P ∗Y . Indeed:

C = D(PY |X‖PY |PX) = D(PY |X‖P ∗Y |PX)−D(PY ‖P ∗Y )

≤ D(PY |X‖P ∗Y |P ∗X)−D(PY ‖P ∗Y )

= C −D(PY ‖P ∗Y )⇒ PY = P ∗Y

Statement (4.19) follows from the left inequality in (4.18) and “conditioning increases divergence”.

50

Page 51: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Remark 4.5. • Finiteness of C is necessary. Counterexample: Consider the identity channelY = X, where X takes values on integers. Then any distribution with infinite entropy is caidor caod.

• Non-uniqueness of caid. Unlike the caod, caid need not be unique. Let Z1 ∼ Bern(12).

Consider Y1 = X1 ⊕ Z1 and Y2 = X2. Then maxPX1X2I(X1, X2;Y1, Y2) = log 4, achieved by

PX1X2 = Bern(p)×Bern(12) for any p. Note that the caod is unique: P ∗Y1Y2

= Bern(12)×Bern(1

2).

Review: Minimax and saddlepoint

Suppose we have a bivariate function f . Then we always have the minimax inequality :

infy

supxf(x, y) ≥ sup

xinfyf(x, y).

When does it hold with equality?

1. It turns out minimax equality is implied by the existence of a saddle point (x∗, y∗),i.e.,

f(x, y∗) ≤ f(x∗, y∗) ≤ f(x∗, y) ∀x, yFurthermore, minimax equality also implies existence of saddle point if inf and supare achieved c.f. [BNO03, Section 2.6]) for all x, y [Straightforward to check. Seeproof of corollary below].

2. There are a number of known criteria establishing

infy

supxf(x, y) = sup

xinfyf(x, y)

They usually require some continuity of f , compactness of domains and concavityin x and convexity in y. One of the most general version is due to M. Sion [Sio58].

3. The mother result of all this minimax theory is a theorem of von Neumann onbilinear functions: Let A and B have finite alphabets, and g(a, b) be arbitrary, then

minPA

maxPB

E[g(A,B)] = maxPB

minPA

E[g(A,B)]

Here (x, y)↔ (PA, PB) and f(x, y)↔∑a,b PA(a)PB(b)g(a, b).

4. A more general version is: if X and Y are compact convex domains in Rn, f(x, y)continuous in (x, y), concave in x and convex in y then

maxx∈X

miny∈Y

f(x, y) = miny∈Y

maxx∈X

f(x, y)

Applying Theorem 4.4 to conditional divergence gives the following result.

Corollary 4.2 (Minimax). Under assumptions of Theorem 4.4, we have

maxPX∈P

I(X;Y ) = maxPX∈P

minQY

D(PY |X‖QY |PX)

= minQY

maxPX∈P

D(PY |X‖QY |PX)

51

Page 52: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Proof. This follows from saddle-point trivially: Maximizing/minimizing the leftmost/rightmostsides of (4.18) gives

minQY

maxPX∈P

D(PY |X‖QY |PX) ≤ maxPX∈P

D(PY |X‖P ∗Y |PX) = D(PY |X‖P ∗Y |P ∗X)

≤ minQY

D(PY |X‖QY |P ∗X) ≤ maxPX∈P

minQY

D(PY |X‖QY |PX).

but by definition min max ≥ max min.

4.5 Capacity = information radius

Review: Radius and diameter

Let (X, d) be a metric space. Let A be a bounded subset.

1. Radius (aka Chebyshev radius) of A: the radius of the smallest ball that covers A,i.e., rad (A) = infy∈X supx∈A d(x, y).

2. Diameter of A: diam (A) = supx,y∈A d(x, y).

3. Note that the radius and the diameter both measure how big/rich a set is.

4. From definition and triangle inequality we have

1

2diam (A) ≤ rad (A) ≤ diam (A)

5. In fact, the rightmost upper bound can frequently be improved:

• A result of Bohnenblust [Boh38] shows that in Rn equipped with any norm wealways have rad (A) ≤ n

n+1diam (A).

• For Rn with Euclidean distance Jung proved rad (A) ≤√

n2(n+1)diam (A),

attained by simplex. The best constant is sometimes called the Jung constantof the space.

• For Rn with `∞-norm the situation is even simpler: rad (A) = 12diam (A); such

spaces are called centrable.

The next simple corollary shows that capacity is just the radius of the set of distributions PY |X=x, x ∈X when distances are measured by divergence (although, we remind, divergence is not a metric).

Corollary 4.3. For fixed kernel PY |X , let P = all distributions on X and X is finite, then

maxPX

I(X;Y ) = maxx

D(PY |X=x‖P ∗Y )

= D(PY |X=x‖P ∗Y ) ∀x : P ∗X(x) > 0 .

The last corollary gives a geometric interpretation to capacity: it equals the radius of the smallestdivergence-“ball” that encompasses all distributions PY |X=x : x ∈ X. Moreover, P ∗Y is a convexcombination of some PY |X=x and it is equidistant to those.

52

Page 53: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

The following is the information-theoretic version of “radius ≤ diameter” (in KL divergence) forarbitrary input space:

Corollary 4.4. Let PY |X=x : x ∈ X be a set of distributions. Prove that

C = supPX

I(X;Y ) ≤ infQ

supx∈X

D(PY |X=x‖Q)

︸ ︷︷ ︸radius

≤ supx,x′∈X

D(PY |X=x‖PY |X=x′)

︸ ︷︷ ︸diameter

Proof. By the golden formula Corollary 3.1, we have

I(X;Y ) = infQD(PY |X‖Q|PX) ≤ inf

Qsupx∈X

D(PY |X=x‖Q) ≤ infx′∈X

supx∈X

D(PY |X=x‖PY |X=x′).

4.6 Existence of caod (general case)

We have shown above that the solution to

C = supPX∈P

I(X;Y )

can be a) interpreted as a saddle point; b) written in the minimax form and c) that caod P ∗Y isunique. This was all done under the extra assumption that supremum over PX is attainable. Itturns out, properties b) and c) can be shown without that extra assumption.

Theorem 4.5 (Kemperman). For any PY |X and a convex set of distributions P such that

C = supPX∈P

I(PX , PY |X) <∞ (4.20)

there exists a unique P ∗Y with the property that

C = supPX∈P

D(PY |X‖P ∗Y |PX) . (4.21)

Furthermore,

C = supPX∈P

minQY

D(PY |X‖QY |PX) (4.22)

= minQY

supPX∈P

D(PY |X‖QY |PX) (4.23)

= minQY

supx∈X

D(PY |X=x‖QY ) , (if P = all PX.) (4.24)

Note: Condition (4.20) is automatically satisfied if there is any QY such that

supPX∈P

D(PY |X‖QY |PX) <∞ . (4.25)

Example: Non-existence of caid. Let Z ∼ N (0, 1) and consider the problem

C = sup

PX :E[X]=0,E[X2]=P

E[X4]=s

I(X;X + Z) . (4.26)

53

Page 54: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

If we remove the constraint E[X4] = s the unique caid is PX = N (0, P ), see Theorem 4.6. Whens 6= 3P 2 then such PX is no longer inside the constraint set P . However, for s > 3P 2 the maximum

C =1

2log(1 + P )

is still attainable. Indeed, we can add a small “bump” to the gaussian distribution as follows:

PX = (1− p)N (0, P ) + pδx ,

where p→ 0, px2 → 0 but px4 → s− 3P 2 > 0. This shows that for the problem (4.26) with s > 3P 2

the caid does not exist, the caod P ∗Y = N (0, 1 + P ) exists and unique as Theorem 4.5 postulates.

Proof of Theorem 4.5. Let P ′Xn be a sequence of input distributions achieving C, i.e., I(P ′Xn , PY |X)→C. Let Pn be the convex hull of P ′X1

, . . . , P ′Xn. Since Pn is a finite-dimensional simplex, theconcave function PX 7→ I(PX , PY |X) attains its maximum at some point PXn ∈ Pn, i.e.,

In , I(PXn , PY |X) = maxPX∈Pn

I(PX , PY |X) .

Denote by PYn be the output distribution induced by PXn . We have then:

D(PYn‖PYn+k) = D(PY |X‖PYn+k

|PXn)−D(PY |X‖PYn |PXn) (4.27)

≤ I(PXn+k, PY |X)− I(PXn , PY |X) (4.28)

≤ C − In , (4.29)

where in (4.28) we applied Theorem 4.4 to (Pn+k, PYn+k). By the Pinsker-Csiszar inequality (1.15)

and since In C, we conclude that the sequence PYn is Cauchy in total variation:

supk≥1

TV(PYn , PYn+k)→ 0 , n→∞ .

Since the space of probability distributions is complete in total variation, the sequence must have alimit point PYn → P ∗Y . By taking a limit as k →∞ in (4.29) and applying the lower semi-continuityof divergence (Theorem 3.6) we get

D(PYn‖P ∗Y ) ≤ limk→∞

D(PYn‖PYn+k) ≤ C − In ,

and therefore, PYn → P ∗Y in the (stronger) sense of D(PYn‖P ∗Y )→ 0. By Theorem 3.3,

D(PY |X‖P ∗Y |PXn) = In +D(PYn‖P ∗Y )→ C . (4.30)

Take any PX ∈⋃k≥1 Pk. Then PX ∈ Pn for all sufficiently large n and thus by Theorem 4.4

D(PY |X‖PYn |PX) ≤ In ≤ C , (4.31)

which, by the lower semi-continuity of divergence, implies

D(PY |X‖P ∗Y |PX) ≤ C . (4.32)

To prove that (4.32) holds for arbitrary PX ∈ P, we may repeat the argument above with Pnreplaced by Pn = conv(PX ∪ Pn), denoting the resulting sequences by PXn , PYn and the limitpoint by P ∗Y , and obtain:

D(PYn‖PYn) = D(PY |X‖PYn |PXn)−D(PY |X‖PYn |PXn) (4.33)

≤ C − In , (4.34)

54

Page 55: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

where (4.34) follows from (4.32) since PXn ∈ Pn. Hence taking limit as n→∞ we have P ∗Y = P ∗Yand therefore (4.32) holds.

Finally, to see (4.23), note that by definition capacity as a max-min is at most the min-max, i.e.,

C = supPX∈P

minQY

D(PY |X‖QY |PX) ≤ minQY

supPX∈P

D(PY |X‖QY |PX) ≤ supPX∈P

D(PY |X‖P ∗Y |PX) = C

in view of (4.30) and (4.31).

Corollary 4.5. Let X be countable and P a convex set of distributions on X . If supPX∈P H(X) <∞then

supPX∈P

H(X) = minQX

supPX∈P

x

PX(x) log1

QX(x)<∞

and the optimizer Q∗X exists and is unique. If Q∗X ∈ P, then it is also the unique maximizer ofH(X).

Proof. Just apply Kemperman’s Theorem 4.5 to the identity channel Y = X.

Example: Max entropy. Assume that f : Z → R is such that∑

n∈Z exp−λf(n) < ∞ for allλ > 0. Then

maxX:E[f(X)]≤a

H(X) ≤ infλ>0

λa+ log∑

n

exp−λf(n) .

This follows from taking Q(n) = c exp−λf(n). This bound is often tight and achieved byPX(n) = c exp−λ∗f(n) with λ∗ being the minimizer, known as the Gibbs distribution for theenergy function f .

4.7 Gaussian saddle point

For additive noise, there is also a different kind of saddle point, between P_X and the distribution of the noise:

Theorem 4.6. Let X_g ~ N(0, σ²_X), N_g ~ N(0, σ²_N), X_g ⊥⊥ N_g. Then:

1. "Gaussian capacity":

    C = I(X_g; X_g + N_g) = ½ log(1 + σ²_X/σ²_N) .

2. "Gaussian input is the best for Gaussian noise": for all X ⊥⊥ N_g with var X ≤ σ²_X,

    I(X; X + N_g) ≤ I(X_g; X_g + N_g) ,

with equality iff X =^D X_g.

3. "Gaussian noise is the worst for Gaussian input": for all N such that E[X_g N] = 0 and E[N²] ≤ σ²_N,

    I(X_g; X_g + N) ≥ I(X_g; X_g + N_g) ,

with equality iff N =^D N_g and N is independent of X_g.

Interpretations:


1. For the AWGN channel, Gaussian input is the most favorable. Indeed, immediately from the second statement we have

    max_{X: var X ≤ σ²_X} I(X; X + N_g) = ½ log(1 + σ²_X/σ²_N) ,

which is the capacity formula for the AWGN channel (see the numerical sketch below).

2. For a Gaussian source, additive Gaussian noise is the worst in the sense that it minimizes the mutual information provided by the noisy version.
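To make interpretation 1 concrete, here is a small numerical check (our own illustration, with assumed parameter values): it compares I(X; X+N_g) for a binary (±√P) input against the Gaussian-input value ½log(1+P) at the same power, computing the binary-input mutual information by one-dimensional numerical integration.

```python
import numpy as np

def awgn_capacity(P, sigma2=1.0):
    """Gaussian-input mutual information 0.5*log2(1 + P/sigma2), in bits."""
    return 0.5 * np.log2(1 + P / sigma2)

def bpsk_mi(P, sigma2=1.0):
    """I(X; X+N) in bits for X uniform on {+sqrt(P), -sqrt(P)}, N ~ N(0, sigma2).
    Uses I = h(Y) - h(Y|X), with h(Y) computed by numerical integration."""
    s, sd = np.sqrt(P), np.sqrt(sigma2)
    y = np.linspace(-s - 10 * sd, s + 10 * sd, 20001)
    phi = lambda t: np.exp(-t**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    py = 0.5 * (phi(y - s) + phi(y + s))                 # output density under BPSK input
    h_y = -np.trapz(py * np.log2(py + 1e-300), y)        # differential entropy of Y
    h_y_given_x = 0.5 * np.log2(2 * np.pi * np.e * sigma2)
    return h_y - h_y_given_x

for P in [0.5, 1.0, 4.0]:
    print(P, round(bpsk_mi(P), 4), "<=", round(awgn_capacity(P), 4))
```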

Proof. WLOG, assume all random variables have zero mean. Let Y_g = X_g + N_g. Define

    f(x) = D(P_{Y_g|X_g=x} ‖ P_{Y_g}) = D(N(x, σ²_N) ‖ N(0, σ²_X + σ²_N))
         = ½ log(1 + σ²_X/σ²_N) + (log e / 2) · (x² − σ²_X)/(σ²_X + σ²_N) ,

where the first term equals C.

1. Compute I(X_g; X_g + N_g) = E[f(X_g)] = C.

2. Recall the inf-representation (Corollary 3.1): I(X;Y) = min_Q D(P_{Y|X} ‖ Q | P_X). Then

    I(X; X + N_g) ≤ D(P_{Y_g|X_g} ‖ P_{Y_g} | P_X) = E[f(X)] ≤ C < ∞ .

Furthermore, if I(X; X + N_g) = C, then uniqueness of the caod, cf. Corollary 4.1, implies P_Y = P_{Y_g}. But P_Y = P_X * N(0, σ²_N). Then it must be that X ~ N(0, σ²_X), simply by considering characteristic functions:

    Ψ_X(t) · e^{−½σ²_N t²} = e^{−½(σ²_X+σ²_N) t²}  ⇒  Ψ_X(t) = e^{−½σ²_X t²}  ⇒  X ~ N(0, σ²_X) .

3. Let Y = X_g + N and let P_{Y|X_g} be the respective kernel. [Note that here we only assume that N is uncorrelated with X_g, i.e., E[N X_g] = 0, not necessarily independent.] Then

    I(X_g; X_g + N) = D(P_{X_g|Y} ‖ P_{X_g} | P_Y)
                    = D(P_{X_g|Y} ‖ P_{X_g|Y_g} | P_Y) + E_{X_g,Y} log [ P_{X_g|Y_g}(X_g|Y) / P_{X_g}(X_g) ]
                    ≥ E log [ P_{X_g|Y_g}(X_g|Y) / P_{X_g}(X_g) ]      (4.35)
                    = E log [ P_{Y_g|X_g}(Y|X_g) / P_{Y_g}(Y) ]      (4.36)
                    = C + (log e/2) · E[ Y²/(σ²_X + σ²_N) − N²/σ²_N ]      (4.37)
                    = C + (log e/2) · σ²_X/(σ²_X + σ²_N) · (1 − E[N²]/σ²_N)      (4.38)
                    ≥ C ,      (4.39)

where

• (4.36): P_{X_g|Y_g}/P_{X_g} = P_{Y_g|X_g}/P_{Y_g};

• (4.38): E[X_g N] = 0 and E[Y²] = E[N²] + E[X_g²];

• (4.39): E[N²] ≤ σ²_N.

Finally, the conditions for equality in (4.35) say

    D(P_{X_g|Y} ‖ P_{X_g|Y_g} | P_Y) = 0 .

Thus P_{X_g|Y} = P_{X_g|Y_g}, i.e., X_g is conditionally Gaussian: P_{X_g|Y=y} = N(by, c²) for some constants b and c. In other words, under P_{X_g Y} we have

    X_g = bY + cZ ,   Z Gaussian, Z ⊥⊥ Y .

But then Y must be Gaussian itself, by Cramér's theorem or simply by considering characteristic functions:

    Ψ_Y(t) · e^{ct²} = e^{c′t²}  ⇒  Ψ_Y(t) = e^{c″t²}  ⇒  Y is Gaussian.

Therefore (X_g, Y) must be jointly Gaussian, and hence N = Y − X_g is Gaussian. Thus we conclude that it is only possible to attain I(X_g; X_g + N) = C if N is Gaussian of variance σ²_N and independent of X_g.


§ 5. Single-letterization. Probability of error. Entropy rate.

5.1 Extremization of mutual information for memoryless sources and channels

Theorem 5.1 (Joint M.I. vs. marginal M.I.).

(1) If P_{Yⁿ|Xⁿ} = ∏ P_{Y_i|X_i}, then

    I(Xⁿ; Yⁿ) ≤ Σ_i I(X_i; Y_i)      (5.1)

with equality iff P_{Yⁿ} = ∏ P_{Y_i}. Consequently,

    max_{P_{Xⁿ}} I(Xⁿ; Yⁿ) = Σ_{i=1}^n max_{P_{X_i}} I(X_i; Y_i) .

(2) If X₁ ⊥⊥ ⋯ ⊥⊥ X_n, then

    I(Xⁿ; Yⁿ) ≥ Σ_i I(X_i; Y_i)      (5.2)

with equality iff P_{Xⁿ|Yⁿ} = ∏ P_{X_i|Y_i} P_{Yⁿ}-almost surely.¹ Consequently,

    min_{P_{Yⁿ|Xⁿ}} I(Xⁿ; Yⁿ) = Σ_{i=1}^n min_{P_{Y_i|X_i}} I(X_i; Y_i) .

Proof. (1) Use I(Xⁿ;Yⁿ) − Σ_j I(X_j;Y_j) = D(P_{Yⁿ|Xⁿ} ‖ ∏ P_{Y_i|X_i} | P_{Xⁿ}) − D(P_{Yⁿ} ‖ ∏ P_{Y_i}).

(2) Reverse the roles of X and Y: I(Xⁿ;Yⁿ) − Σ_j I(X_j;Y_j) = D(P_{Xⁿ|Yⁿ} ‖ ∏ P_{X_i|Y_i} | P_{Yⁿ}) − D(P_{Xⁿ} ‖ ∏ P_{X_i}).

Note: The moral of this result is that

1. for a product channel, the MI-maximizing input is a product distribution;

2. for a product source, the MI-minimizing channel is a product channel.

This type of result is often known as single-letterization in information theory; it tremendously simplifies the optimization by reducing a high-dimensional (multi-letter) problem to a scalar (single-letter) problem. For example, in the simplest case where Xⁿ, Yⁿ are binary vectors, optimizing I(Xⁿ;Yⁿ) over P_{Xⁿ} and P_{Yⁿ|Xⁿ} entails optimizing over 2ⁿ-dimensional vectors and 2ⁿ×2ⁿ matrices, whereas optimizing each I(X_i;Y_i) individually is easy.

Examples:

¹ That is, if P_{Xⁿ,Yⁿ} = P_{Yⁿ} ∏ P_{X_i|Y_i} as measures.


1. (5.1) fails for non-product channels. Let X₁ ⊥⊥ X₂ ~ Bern(1/2) on {0,1} = F₂ and

    Y₁ = X₁ + X₂ ,   Y₂ = X₁ .

Then I(X₁;Y₁) = I(X₂;Y₂) = 0, but I(X²;Y²) = 2 bits.

2. Strict inequality in (5.1): ∀k, Y_k = X_k = U ~ Bern(1/2) ⇒ I(X_k;Y_k) = 1 bit, yet

    I(Xⁿ;Yⁿ) = 1 bit < Σ_k I(X_k;Y_k) = n bits.

3. Strict inequality in (5.2): X₁ ⊥⊥ ⋯ ⊥⊥ X_n and Y₁ = X₂, Y₂ = X₃, ..., Y_n = X₁ ⇒ I(X_k;Y_k) = 0, yet

    I(Xⁿ;Yⁿ) = Σ_i H(X_i) > 0 = Σ_k I(X_k;Y_k) .

5.2* Gaussian capacity via orthogonal symmetry

Multi-dimensional case (WLOG assume X₁ ⊥⊥ ⋯ ⊥⊥ X_n i.i.d.): if Z₁, ..., Z_n are independent, then

    max_{E[ΣX_k²] ≤ nP} I(Xⁿ; Xⁿ + Zⁿ) ≤ max_{E[ΣX_k²] ≤ nP} Σ_{k=1}^n I(X_k; X_k + Z_k) .

Given a distribution P_{X₁} ⋯ P_{X_n} satisfying the constraint, form the "average of marginals" distribution P̄_X = (1/n) Σ_{k=1}^n P_{X_k}, which also satisfies the single-letter constraint E[X²] = (1/n) Σ_k E[X_k²] ≤ P. Then, from concavity of P_X ↦ I(P_X, P_{Y|X}),

    I(P̄_X, P_{Y|X}) ≥ (1/n) Σ_{k=1}^n I(P_{X_k}, P_{Y|X}) .

So P̄_X gives the same or better MI, which shows that the extremization above ought to have the form nC(P), where C(P) is the single-letter capacity. Now suppose Yⁿ = Xⁿ + Zⁿ_G where Zⁿ_G ~ N(0, I_n). Since an isotropic Gaussian is rotationally symmetric, for any orthogonal transformation U ∈ O(n) the additive noise has the same distribution, Zⁿ_G ~ UZⁿ_G, so that P_{UYⁿ|UXⁿ} = P_{Yⁿ|Xⁿ}, and

    I(P_{Xⁿ}, P_{Yⁿ|Xⁿ}) = I(P_{UXⁿ}, P_{UYⁿ|UXⁿ}) = I(P_{UXⁿ}, P_{Yⁿ|Xⁿ}) .

From the "average of marginals" argument above, averaging over many rotations of Xⁿ can only make the mutual information larger. Therefore the optimal input distribution P_{Xⁿ} can be chosen to be invariant under orthogonal transformations. Consequently, the (unique!) capacity-achieving output distribution P*_{Yⁿ} must be rotationally invariant. Furthermore, from the conditions for equality in (5.1) we conclude that P*_{Yⁿ} must have independent components. Since the only product distribution satisfying the power constraints and having rotational symmetry is an isotropic Gaussian, we conclude that P*_{Yⁿ} = (P*_Y)ⁿ with P*_Y = N(0, 1 + P).

For the other direction in the Gaussian saddle-point problem,

    min_{P_N: E[N²]=1} I(X_G; X_G + N) ,

the same trick is used, except that here the input distribution is automatically invariant under orthogonal transformations.


5.3 Information measures and probability of error

Let W be a random variable and Ŵ be our prediction. There are three types of problems:

1. Random guessing: W ⊥⊥ Ŵ.

2. Guessing with data: W → X → Ŵ.

3. Guessing with noisy data: W → X → Y → Ŵ.

We want to draw converse statements, e.g., if the uncertainty of W is too high, or if the information provided by the data is too scarce, then it is difficult to guess the value of W.

Theorem 5.2. Let |X| = M < ∞ and P_max ≜ max_{x∈X} P_X(x). Then

    H(X) ≤ (1 − P_max) log(M − 1) + h(P_max) ≜ F_M(P_max) ,      (5.3)

with equality iff P_X = (P_max, (1−P_max)/(M−1), ..., (1−P_max)/(M−1)), where the value (1−P_max)/(M−1) is repeated M−1 times.

Proof. First proof: write RHS − LHS as a divergence. Let P = (P_max, P₂, ..., P_M) and introduce Q = (P_max, (1−P_max)/(M−1), ..., (1−P_max)/(M−1)). Then RHS − LHS = D(P‖Q) ≥ 0, with equality iff P = Q.

Second proof: given any P = (P_max, P₂, ..., P_M), apply a random permutation π to the last M−1 atoms to obtain the distribution P_π. Averaging P_π over all permutations π gives Q. Then use concavity of entropy, or "conditioning reduces entropy": H(Q) ≥ H(P_π|π) = H(P).

Third proof: directly solve the convex optimization max{H(P) : 0 ≤ p_i ≤ P_max, i = 1, ..., M, Σ_i p_i = 1}.

Fourth proof: data processing inequality. Later.

Note: Similar to the Shannon entropy H, P_max is also a reasonable measure of the randomness of P. In fact, log(1/P_max) is known as the Rényi entropy of order ∞, denoted H_∞(P). Note that H_∞(P) = log M iff P is uniform, and H_∞(P) = 0 iff P is a point mass.

Note: The function F_M on the RHS of (5.3) looks as follows.

[Figure: plot of F_M(p) for p ∈ [0,1]; the curve starts at log(M−1) at p = 0, peaks at log M at p = 1/M, and decreases to 0 at p = 1.]

F_M is concave with maximum log M attained at the maximizer 1/M, but it is not monotone. However, P_max ≥ 1/M and F_M is decreasing on [1/M, 1]. Therefore (5.3) gives a lower bound on P_max in terms of entropy (see also the numerical check below).
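A quick sanity check of (5.3) on random pmfs (our own sketch; the Dirichlet sampling is just a convenient way to generate pmfs):

```python
import numpy as np

def F_M(p, M):
    """F_M(p) = (1-p)*log2(M-1) + h(p), in bits."""
    h = -p * np.log2(p) - (1 - p) * np.log2(1 - p) if 0 < p < 1 else 0.0
    return (1 - p) * np.log2(M - 1) + h

rng = np.random.default_rng(0)
M = 8
for _ in range(5):
    P = rng.dirichlet(np.ones(M))               # random pmf on [M]
    H = -np.sum(P * np.log2(P))
    print(f"H = {H:.4f}  <=  F_M(Pmax) = {F_M(P.max(), M):.4f}")
```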


Interpretation: Suppose one is trying to guess the value of X without any information. Then the best bet is obviously the most likely outcome (the mode), i.e., the maximal probability of success among all estimators is

    max_{X̂ ⊥⊥ X} P[X = X̂] = P_max .      (5.4)

Thus (5.3) means: it is hard to predict something of large entropy.

Conceptual question: is it true (for every predictor X̂ ⊥⊥ X) that

    H(X) ≤ F_M(P[X = X̂]) ?      (5.5)

This is not obvious from (5.3) and (5.4), since p ↦ F_M(p) is not monotone. To show (5.5), consider the data processor (X, X̂) ↦ 1{X = X̂} applied to P_{XX̂} = P_X P_X̂ versus Q_{XX̂} = U_X P_X̂, where U_X is uniform. Under P the output is Bern(P[X = X̂]); under Q it is Bern(1/M), since Q[X = X̂] = 1/M regardless of P_X̂. Data processing for divergence then gives

    d(P[X = X̂] ‖ 1/M) ≤ D(P_{XX̂} ‖ Q_{XX̂}) = D(P_X ‖ U_X) = log M − H(X) ,

and since d(p ‖ 1/M) = log M − F_M(p), this is exactly (5.5).

The benefit of this proof is that it trivially generalizes to (possibly randomized) estimators X̂(Y) which may depend on some observation Y correlated with X. It is clear that the best predictor of X given Y is the maximum a posteriori (MAP) rule, i.e., the posterior mode: X̂(y) = argmax_x P_{X|Y}(x|y).

Theorem 5.3 (Fano's inequality). Let |X| = M < ∞ and X → Y → X̂. Then

    H(X|Y) ≤ F_M(P[X = X̂]) = P[X ≠ X̂] log(M − 1) + h(P[X ≠ X̂]) .      (5.6)

Thus, if in addition X is uniform, then

    I(X;Y) = log M − H(X|Y) ≥ P[X = X̂] log M − h(P[X ≠ X̂]) .      (5.7)

Proof. This is a direct corollary of Theorem 5.2: average H(X|Y=y) ≤ F_M(P[X = X̂ | Y = y]) over P_Y and use the concavity of F_M.

For a standalone proof, apply data processing to P_{XYX̂} = P_X P_{Y|X} P_{X̂|Y} vs. Q_{XYX̂} = U_X P_Y P_{X̂|Y} and the data processor (kernel) (X, X̂) ↦ 1{X ≠ X̂} (note that P_{X̂|Y} is fixed).

Remark: We can also derive Fano's inequality as follows. Let ε = P[X ≠ X̂] and apply data processing for mutual information:

    I(X;Y) ≥ I(X; X̂) ≥ min{ I(P_X, P_{Z|X}) : P[X = Z] ≥ 1 − ε } .

This minimum is not zero: if we force X and Z to agree with some probability, then I(X;Z) cannot be too small. It remains to compute the minimum, which is a nice convex optimization problem. (Hint: look for invariants that the matrix P_{Z|X} must satisfy under permutations (X,Z) ↦ (π(X), π(Z)), then apply the convexity of I(P_X, ·).)

Theorem 5.4 (Fano's inequality: general). Let X, Y ∈ X with |X| = M and let Q_{XY} = P_X P_Y. Then

    I(X;Y) ≥ d(P[X = Y] ‖ Q[X = Y])
           ≥ P[X = Y] log (1/Q[X = Y]) − h(P[X = Y])
           ( = P[X = Y] log M − h(P[X = Y]) if P_X or P_Y is uniform).


Proof. Apply data processing to P_{XY} and Q_{XY} with the processor (X,Y) ↦ 1{X = Y}. Note that if P_X or P_Y is uniform, then Q[X = Y] = 1/M always.

The following result is useful in providing converses for statistics and data transmission.

Corollary 5.1 (Lower bound on average probability of error). Let W → X → Y → Ŵ with W uniform on [M] ≜ {1, ..., M}. Then

    P_e ≜ P[W ≠ Ŵ] ≥ 1 − (I(X;Y) + h(P_e)) / log M      (5.8)
                     ≥ 1 − (I(X;Y) + log 2) / log M .      (5.9)

Proof. Apply Theorem 5.3 and data processing for mutual information: I(W; Ŵ) ≤ I(X;Y).
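Since (5.8) is implicit in P_e through the h(P_e) term, extracting a numerical lower bound takes one line of root finding; the sketch below (ours, with assumed values of I(X;Y) and M) does this by bisection and also prints the weaker explicit bound (5.9).

```python
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def fano_lower_bound(I, M, tol=1e-10):
    """Smallest Pe consistent with (5.8): Pe >= 1 - (I + h(Pe))/log2(M)."""
    g = lambda pe: pe - 1.0 + (I + h2(pe)) / np.log2(M)
    if g(0.0) >= 0:
        return 0.0                                  # the bound is vacuous
    lo, hi = 0.0, 1.0
    while hi - lo > tol:                            # g changes sign exactly once on [0,1]
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) < 0 else (lo, mid)
    return hi

I, M = 2.0, 64
print("Fano bound (5.8):", fano_lower_bound(I, M))
print("weaker (5.9)    :", 1 - (I + 1.0) / np.log2(M))   # log 2 = 1 bit
```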

5.4 Entropy rate

Definition 5.1. The entropy rate of a process X = (X₁, X₂, ...) is

    H(X) ≜ lim_{n→∞} (1/n) H(Xⁿ)      (5.10)

provided the limit exists.

Stationarity is a sufficient condition for the entropy rate to exist. Essentially, stationarity means invariance w.r.t. time shifts. Formally, X is stationary if (X_{t₁}, ..., X_{t_n}) =^D (X_{t₁+k}, ..., X_{t_n+k}) for any t₁, ..., t_n, k ∈ N.

Theorem 5.5. For any stationary process X = (X₁, X₂, ...):

1. H(X_n | X^{n−1}) ≤ H(X_{n−1} | X^{n−2}).

2. (1/n) H(Xⁿ) ≥ H(X_n | X^{n−1}).

3. (1/n) H(Xⁿ) ≤ (1/(n−1)) H(X^{n−1}).

4. H(X) exists and H(X) = lim_{n→∞} (1/n) H(Xⁿ) = lim_{n→∞} H(X_n | X^{n−1}).

5. For a double-sided process X = (..., X_{−1}, X₀, X₁, X₂, ...), H(X) = H(X₁ | X^0_{−∞}), provided that H(X₁) < ∞.

Proof.

1. Further conditioning + stationarity: H(X_n | X^{n−1}) ≤ H(X_n | X^{n−1}_2) = H(X_{n−1} | X^{n−2}).

2. Chain rule: (1/n) H(Xⁿ) = (1/n) Σ_i H(X_i | X^{i−1}) ≥ H(X_n | X^{n−1}).

3. H(Xⁿ) = H(X^{n−1}) + H(X_n | X^{n−1}) ≤ H(X^{n−1}) + (1/n) H(Xⁿ).

4. n ↦ (1/n) H(Xⁿ) is a decreasing sequence bounded below by zero, hence has a limit H(X). Moreover, by the chain rule, (1/n) H(Xⁿ) = (1/n) Σ_{i=1}^n H(X_i | X^{i−1}). Then H(X_n | X^{n−1}) → H(X). Indeed, from part 1 the limit lim_n H(X_n | X^{n−1}) = H′ exists. Next, recall from calculus: if a_n → a, then the Cesàro mean (1/n) Σ_{i=1}^n a_i → a as well. Thus H′ = H(X).


5. Assuming H(X₁) < ∞, we have from (3.14):

    lim_{n→∞} [ H(X₁) − H(X₁ | X^0_{−n}) ] = lim_{n→∞} I(X₁; X^0_{−n}) = I(X₁; X^0_{−∞}) = H(X₁) − H(X₁ | X^0_{−∞}) .

Examples (stationary processes):

1. X an i.i.d. source ⇒ H(X) = H(X₁).

2. X a mixed source: flip a coin with bias p at time t = 0; if heads, let X = Y, if tails, let X = Z. Then H(X) = pH(Y) + (1−p)H(Z).

3. X a stationary Markov chain X₁ → X₂ → X₃ → ⋯:

    H(X_n | X^{n−1}) = H(X_n | X_{n−1})  ⇒  H(X) = H(X₂ | X₁) = Σ_{a,b} μ(a) P_{b|a} log (1/P_{b|a}) ,

where μ is an invariant measure (possibly non-unique; unique if the chain is ergodic). See the numerical sketch after this list.

4. X a hidden Markov chain: let X₁ → X₂ → X₃ → ⋯ be a Markov chain, fix P_{Y|X}, and let Y_i be the output of X_i through P_{Y|X}. Then Y = (Y₁, ...) is a stationary process. Therefore H(Y) exists, but it is very difficult to compute (no closed-form solution to date), even if X is a binary Markov chain and P_{Y|X} is a BSC.
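The formula in item 3 is easy to evaluate; the sketch below (ours, with an arbitrary 2-state kernel) computes the stationary distribution and the entropy rate H(X) = H(X₂|X₁).

```python
import numpy as np

# Transition matrix P[b|a] of an ergodic 2-state chain (arbitrary example).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Stationary distribution mu: left eigenvector of P with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
mu /= mu.sum()

# Entropy rate H(X) = sum_a mu(a) sum_b P[b|a] log2(1/P[b|a])  (bits per symbol).
H = -np.sum(mu[:, None] * P * np.log2(P))
print("stationary mu:", mu, " entropy rate:", H, "bits/symbol")
```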

5.5 Entropy and symbol (bit) error rate

In this section we show that the entropy rates of two processes X and Y are close whenever they can be "coupled". Coupling two processes means defining them on a common probability space so that the average distance between their realizations is small. In our case we require that the symbol error rate be small, i.e.,

    (1/n) Σ_{j=1}^n P[X_j ≠ Y_j] ≤ ε .      (5.11)

Notice that if we define the Hamming distance as

    d_H(xⁿ, yⁿ) ≜ Σ_{j=1}^n 1{x_j ≠ y_j} ,

then (5.11) indeed corresponds to requiring

    E[d_H(Xⁿ, Yⁿ)] ≤ nε .

Before showing our main result, we show that Fano's inequality (Theorem 5.3) can be tensorized:

Proposition 5.1. Let X_k take values in a finite alphabet X. Then

    H(Xⁿ | Yⁿ) ≤ n F_{|X|}(1 − δ) ,      (5.12)

where

    δ = (1/n) E[d_H(Xⁿ, Yⁿ)] = (1/n) Σ_{j=1}^n P[X_j ≠ Y_j] .


Proof. For each j ∈ [n] consider the estimator X̂_j(Yⁿ) = Y_j. Then from (5.6) we get

    H(X_j | Yⁿ) ≤ F_M(P[X_j = Y_j]) ,      (5.13)

where we denoted M = |X|. Upper-bounding the joint entropy by the sum of marginals, cf. (1.1), and combining with (5.13), we get

    H(Xⁿ | Yⁿ) ≤ Σ_{j=1}^n H(X_j | Yⁿ)      (5.14)
               ≤ Σ_{j=1}^n F_M(P[X_j = Y_j])      (5.15)
               ≤ n F_M( (1/n) Σ_{j=1}^n P[X_j = Y_j] ) ,      (5.16)

where in the last step we used concavity of F_M and Jensen's inequality. Noticing that

    (1/n) Σ_{j=1}^n P[X_j = Y_j] = 1 − δ

concludes the proof.

Corollary 5.2. Consider two processes X and Y with entropy rates H(X) and H(Y). If

    P[X_j ≠ Y_j] ≤ ε

for every j, and if X takes values in a finite alphabet of size M, then

    H(X) − H(Y) ≤ F_M(1 − ε) .

If both processes have alphabets of size M, then

    |H(X) − H(Y)| ≤ ε log M + h(ε) → 0   as ε → 0 .

Proof. There is almost nothing to prove:

    H(Xⁿ) ≤ H(Xⁿ, Yⁿ) = H(Yⁿ) + H(Xⁿ | Yⁿ) ,

and apply (5.12). For the last statement, just recall the expression for F_M.

5.6 Mutual information rate

Definition 5.2 (Mutual information rate).

    I(X; Y) = lim_{n→∞} (1/n) I(Xⁿ; Yⁿ)

provided the limit exists.


Example: Gaussian processes. Consider two stationary Gaussian processes X and N, independent of each other. Assume their auto-covariance functions are absolutely summable, so that continuous power spectral density functions f_X and f_N exist. Without loss of generality, assume all means are zero. Let c_X(k) = E[X₁ X_{k+1}]. Then f_X is the Fourier transform of the auto-covariance function c_X, i.e., f_X(ω) = Σ_{k=−∞}^∞ c_X(k) e^{iωk}. Finally, assume f_N ≥ δ > 0. Then recall from Lecture 2:

    I(Xⁿ; Xⁿ + Nⁿ) = ½ log [ det(Σ_{Xⁿ} + Σ_{Nⁿ}) / det Σ_{Nⁿ} ] = ½ Σ_{i=1}^n log σ_i − ½ Σ_{i=1}^n log λ_i ,

where σ_j, λ_j are the eigenvalues of the covariance matrices Σ_{Yⁿ} = Σ_{Xⁿ} + Σ_{Nⁿ} and Σ_{Nⁿ}, which are all Toeplitz matrices, e.g., (Σ_{Xⁿ})_{ij} = E[X_i X_j] = c_X(i − j). By Szegő's theorem (see Section 5.7*):

    (1/n) Σ_{i=1}^n log σ_i → (1/2π) ∫_0^{2π} log f_Y(ω) dω .

Note that c_Y(k) = E[(X₁ + N₁)(X_{k+1} + N_{k+1})] = c_X(k) + c_N(k), and hence f_Y = f_X + f_N. Thus we have

    (1/n) I(Xⁿ; Xⁿ + Nⁿ) → I(X; X + N) = (1/4π) ∫_0^{2π} log [ (f_X(ω) + f_N(ω)) / f_N(ω) ] dω .

Maximizing this over f_X(ω) subject to a power constraint leads to the famous water-filling solution f*_X(ω) = |T − f_N(ω)|⁺.
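The water-filling recipe is easy to carry out on a discretized spectrum; the sketch below (our own illustration, with an arbitrary noise spectrum and power budget) finds the water level T by bisection and evaluates the resulting rate (1/4π)∫log(1 + f_X/f_N)dω.

```python
import numpy as np

omega = np.linspace(0, 2 * np.pi, 2049)
f_N = 1.0 + 0.8 * np.cos(omega) ** 2        # assumed noise spectrum, bounded away from 0
power_budget = 2.0                          # required (1/2pi) * integral of f_X

def allocated_power(T):
    f_X = np.maximum(T - f_N, 0.0)          # water-filling: f_X* = |T - f_N|^+
    return np.trapz(f_X, omega) / (2 * np.pi)

lo, hi = f_N.min(), f_N.max() + power_budget + 10.0
for _ in range(100):                        # bisection on the water level T
    T = 0.5 * (lo + hi)
    lo, hi = (T, hi) if allocated_power(T) < power_budget else (lo, T)

f_X = np.maximum(T - f_N, 0.0)
rate = np.trapz(np.log2(1 + f_X / f_N), omega) / (4 * np.pi)   # bits per symbol
print("water level T =", T, " rate =", rate)
```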

5.7* Toeplitz matrices and Szego’s theorem

Theorem 5.6 (Szegő). Let f: [0, 2π) → R be the Fourier transform of a summable sequence {a_k}, that is,

    f(ω) = Σ_{k=−∞}^∞ e^{ikω} a_k ,   Σ_k |a_k| < ∞ .

Then, for any φ: R → R continuous on the closure of the range of f, we have

    lim_{n→∞} (1/n) Σ_{j=1}^n φ(σ_{n,j}) = (1/2π) ∫_0^{2π} φ(f(ω)) dω ,

where σ_{n,j}, j = 1, ..., n, are the eigenvalues of the Toeplitz matrix T_n = {a_{ℓ−m}}_{ℓ,m=1}^n.

Proof sketch. The idea is to approximate φ by polynomials; for polynomials the statement can be checked directly. An alternative interpretation of the strategy is the following: roughly speaking, we want to show that the empirical distribution of the eigenvalues, (1/n) Σ_{j=1}^n δ_{σ_{n,j}}, converges weakly to the distribution of f(W), where W is uniformly distributed on [0, 2π]. To this end we check that all moments converge. Usually this does not imply weak convergence, but on compact intervals it does.


For example, for φ(x) = x² we have

    (1/n) Σ_{j=1}^n σ²_{n,j} = (1/n) tr T_n²
                             = (1/n) Σ_{ℓ,m=1}^n (T_n)_{ℓ,m} (T_n)_{m,ℓ}
                             = (1/n) Σ_{ℓ,m} a_{ℓ−m} a_{m−ℓ}
                             = (1/n) Σ_{ℓ=−(n−1)}^{n−1} (n − |ℓ|) a_ℓ a_{−ℓ}
                             = Σ_{x ∈ (−1,1) ∩ (1/n)Z} (1 − |x|) a_{nx} a_{−nx} .

Substituting a_ℓ = (1/2π) ∫_0^{2π} f(ω) e^{iωℓ} dω we get

    (1/n) Σ_{j=1}^n σ²_{n,j} = (1/(2π)²) ∫∫ f(ω) f(ω′) θ_n(ω − ω′) dω dω′ ,      (5.17)

where

    θ_n(u) = Σ_{x ∈ (−1,1) ∩ (1/n)Z} (1 − |x|) e^{−inux}

is a Fejér kernel and converges to a δ-function: θ_n(u) → 2πδ(u) (in the sense of convergence of Schwartz distributions). Thus from (5.17) we get

    (1/n) Σ_{j=1}^n σ²_{n,j} → (1/(2π)²) ∫∫ f(ω) f(ω′) 2πδ(ω − ω′) dω dω′ = (1/2π) ∫_0^{2π} f²(ω) dω ,

as claimed.
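A quick numerical illustration of Theorem 5.6 (ours, with the assumed choice a_k = ρ^{|k|} and φ = log): compare the eigenvalue average of T_n with the spectral integral; the two agree up to a finite-n error.

```python
import numpy as np
from scipy.linalg import toeplitz

rho, n = 0.5, 400
a = rho ** np.abs(np.arange(n))               # a_k = rho^{|k|}, absolutely summable
Tn = toeplitz(a)                              # (Tn)_{l,m} = a_{|l-m|}
sig = np.linalg.eigvalsh(Tn)                  # eigenvalues sigma_{n,j} (all positive here)

lhs = np.mean(np.log(sig))                    # (1/n) sum_j phi(sigma_{n,j}) with phi = log

omega = np.linspace(0, 2 * np.pi, 20001)
f = np.real(sum(rho ** abs(k) * np.exp(1j * k * omega) for k in range(-200, 201)))
rhs = np.trapz(np.log(f), omega) / (2 * np.pi)

print(lhs, "vs", rhs)
```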


§ 6. f-divergences: definition and properties

6.1 f-divergences

In Section 1.6 we introduced the KL divergence, which measures the dissimilarity between two distributions. This turns out to be a special case of the family of f-divergences between probability distributions, introduced by Csiszár [Csi67]. Roughly speaking, all f-divergences quantify the difference between a pair of distributions, each with a different operational meaning.

Definition 6.1 (f-divergence). Let f: (0,∞) → R be a convex function that is strictly convex¹ at 1 and satisfies f(1) = 0. Let P and Q be two probability distributions on a measurable space (X, F). The f-divergence between P and Q is defined as

    D_f(P‖Q) ≜ E_Q[ f( (dP/dμ) / (dQ/dμ) ) ]      (6.1)

where μ is any dominating probability measure (e.g., μ = (P+Q)/2), i.e., both P ≪ μ and Q ≪ μ, with the understanding that

• f(0) = f(0+),

• 0·f(0/0) = 0, and

• 0·f(a/0) = lim_{x↓0} x f(a/x) for a > 0.

¹ By strict convexity at 1, we mean that for all s, t ∈ (0,∞) and α ∈ (0,1) such that αs + (1−α)t = 1, we have αf(s) + (1−α)f(t) > f(1).

Remark 6.1. It is useful to consider the case when P ⋪ Q, in which case P and Q are "very dissimilar." For example, we will show later that TV(P,Q) = 1 iff P ⊥ Q. When P ≪ Q, we have

    D_f(P‖Q) ≜ E_Q[ f(dP/dQ) ] .      (6.2)

Similar to Definition 1.4, when X is discrete, D_f(P‖Q) = Σ_{x∈X} Q(x) f(P(x)/Q(x)); when X is Euclidean space (with densities p, q), D_f(P‖Q) = ∫_X q(x) f(p(x)/q(x)) dx.

The following are the most common f-divergences:

• Kullback–Leibler (KL) divergence (relative entropy): f(x) = x log x,

    D(P‖Q) ≜ E_Q[ (P/Q) log (P/Q) ] = E_P[ log (P/Q) ] .

This has been extensively discussed in Section 1.6. It is worth noting that in general D(P‖Q) ≠ D(Q‖P). When f(x) = −log x, we obtain D_f(P‖Q) = E_Q[−log(P/Q)] = D(Q‖P).

• Total variation: f(x) = ½|x − 1|,

    TV(P,Q) ≜ ½ E_Q[ |P/Q − 1| ] = ½ ∫ |dP − dQ| .

Moreover, TV(·,·) is a metric on the space of probability distributions, and hence it is a symmetric function of P and Q.

• χ²-divergence: f(x) = (x − 1)²,

    χ²(P‖Q) ≜ E_Q[ (P/Q − 1)² ] = ∫ (P − Q)²/Q = ∫ P²/Q − 1 .

Note that we can also choose f(x) = x² − 1; indeed, different f can lead to the same divergence.

• Squared Hellinger distance: f(x) = (1 − √x)²,

    H²(P,Q) ≜ E_Q[ (1 − √(P/Q))² ] = ∫ (√P − √Q)² .

Note that H²(P,Q) = H²(Q,P).

• Le Cam distance [LC86, p. 47]: f(x) = (1−x)²/(2x+2),

    LC(P‖Q) = ½ ∫ (P − Q)²/(P + Q) .

Note that this is also a metric.

• Jensen–Shannon divergence (symmetrized KL): f(x) = x log (2x/(x+1)) + log (2/(x+1)),

    JS(P‖Q) = D(P ‖ (P+Q)/2) + D(Q ‖ (P+Q)/2) .
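For finite alphabets all of the above are one-line sums; the sketch below (ours) evaluates them for two strictly positive pmfs, so that the P ⋪ Q edge cases do not arise. KL and JS are returned in nats.

```python
import numpy as np

def f_divergences(P, Q):
    """KL, TV, chi^2, squared Hellinger, Le Cam, and JS for strictly positive pmfs."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    M = (P + Q) / 2
    return {
        "KL(P||Q)": np.sum(P * np.log(P / Q)),
        "TV":       0.5 * np.sum(np.abs(P - Q)),
        "chi2":     np.sum((P - Q) ** 2 / Q),
        "H2":       np.sum((np.sqrt(P) - np.sqrt(Q)) ** 2),
        "LeCam":    0.5 * np.sum((P - Q) ** 2 / (P + Q)),
        "JS":       np.sum(P * np.log(P / M)) + np.sum(Q * np.log(Q / M)),
    }

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]
for name, val in f_divergences(P, Q).items():
    print(f"{name:9s} = {val:.4f}")
```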

Theorem 6.1 (Properties of f-divergences).

• Non-negativity: D_f(P‖Q) ≥ 0, with equality if and only if P = Q.

• Joint convexity: (P,Q) ↦ D_f(P‖Q) is a jointly convex function. Consequently, P ↦ D_f(P‖Q) and Q ↦ D_f(P‖Q) are also convex functions.

• Conditioning increases f-divergence: define the conditional f-divergence

    D_f(P_{Y|X} ‖ Q_{Y|X} | P_X) ≜ E_{X∼P_X}[ D_f(P_{Y|X} ‖ Q_{Y|X}) ] ,

and let P_Y and Q_Y denote the outputs of the channels P_{Y|X} and Q_{Y|X}, respectively, when both are driven by the same input P_X. Then

    D_f(P_Y ‖ Q_Y) ≤ D_f(P_{Y|X} ‖ Q_{Y|X} | P_X) .

Note: For the last property, one can view P_Y and Q_Y as the output distributions obtained by passing P_X through the channel transition matrices P_{Y|X} and Q_{Y|X}, respectively. The above relation tells us that the average f-divergence between the corresponding channel transition rows is at least the f-divergence between the output distributions.

Proof.

• D_f(P‖Q) = E_Q[f(P/Q)] ≥ f(E_Q[P/Q]) = f(1) = 0, where the inequality follows from Jensen's inequality. By strict convexity at 1, equality holds if and only if P = Q.

• For any convex function f on R₊, the map (p,q) ↦ q f(p/q) is convex on R²₊ (the perspective of f). Since D_f(P‖Q) = E_Q[f(P/Q)], D_f(P‖Q) is jointly convex.

• This follows directly from the joint convexity of D_f(P‖Q) and Jensen's inequality.

Recall the definition of f-divergences from last time. If a function f: R₊ → R satisfies the following properties:

• f is a convex function;

• f(1) = 0;

• f is strictly convex at x = 1, i.e., for all x, y, α such that αx + (1−α)y = 1, the inequality f(1) < αf(x) + (1−α)f(y) is strict;

then the functional mapping pairs of distributions to R₊ defined by

    D_f(P‖Q) ≜ E_Q[ f(dP/dQ) ]

is an f-divergence.

6.2 Data processing inequality

The data processing inequality for KL divergence (Theorem 2.2) extends to all f-divergences.

Theorem 6.2. Consider a channel that produces Y given X based on the law P_{Y|X}. If P_Y is the distribution of Y when X is generated by P_X, and Q_Y is the distribution of Y when X is generated by Q_X, then for any f-divergence D_f(·‖·),

    D_f(P_Y ‖ Q_Y) ≤ D_f(P_X ‖ Q_X) .


One interpretation of this result is that processing the observation x makes it more difficult to determine whether it came from P_X or Q_X.

Proof.

    D_f(P_X‖Q_X) = E_{Q_X}[ f(P_X/Q_X) ]
                 (a)= E_{Q_{XY}}[ f(P_{XY}/Q_{XY}) ]
                 = E_{Q_Y}[ E_{Q_{X|Y}}[ f(P_{XY}/Q_{XY}) ] ]
                 ≥ E_{Q_Y}[ f( E_{Q_{X|Y}}[ P_{XY}/Q_{XY} ] ) ]      (Jensen's inequality)
                 (b)= E_{Q_Y}[ f(P_Y/Q_Y) ]
                 = D_f(P_Y‖Q_Y) .

Note that (a) means D_f(P_X‖Q_X) = D_f(P_{XY}‖Q_{XY}); (b) can be understood by noting that E_Q[P_{XY}/Q_{XY} | Y] is precisely the relative density P_Y/Q_Y, by checking the definition of change of measure, i.e., E_P[g(Y)] = E_Q[g(Y) P_{XY}/Q_{XY}] = E_Q[ g(Y) E_Q[P_{XY}/Q_{XY} | Y] ] for any g.

Remark 6.2. P_{Y|X} can be a deterministic map, so that Y = f(X). More specifically, if f(X) = 1_E(X) for an event E, then Y is Bernoulli with parameter P(E) or Q(E), and the data processing inequality gives

    D_f(P_X‖Q_X) ≥ D_f(Bern(P(E)) ‖ Bern(Q(E))) .      (6.3)

This is how we will prove the converse direction of large deviations (see Lecture 14).

Example: If X = (X₁, X₂) and f(X) = X₁, then we have

    D_f(P_{X₁X₂} ‖ Q_{X₁X₂}) ≥ D_f(P_{X₁} ‖ Q_{X₁}) .

As seen from the proof of Theorem 6.2, this is in fact equivalent to the data processing inequality.

Remark 6.3. If D_f(P‖Q) is an f-divergence, then D_{f̃}(P‖Q) with f̃(x) ≜ x f(1/x) is also an f-divergence, and D_{f̃}(P‖Q) = D_f(Q‖P). Example: if D_f(P‖Q) = D(P‖Q), then D_{f̃}(P‖Q) = D(Q‖P).

Proof. First we verify that f̃ has all three properties required for D_{f̃}(·‖·) to be an f-divergence.

• For x, y ∈ R₊ and α ∈ [0,1], write ᾱ = 1 − α and define c = αx + ᾱy, so that (αx)/c + (ᾱy)/c = 1. Observe that

    f̃(αx + ᾱy) = c f(1/c) = c f( (αx/c)(1/x) + (ᾱy/c)(1/y) ) ≤ c (αx/c) f(1/x) + c (ᾱy/c) f(1/y) = α f̃(x) + ᾱ f̃(y) .

Thus f̃: R₊ → R is a convex function.

• f̃(1) = f(1) = 0.

• For x, y ∈ R₊ and α ∈ [0,1], if αx + ᾱy = 1, then by strict convexity of f at 1,

    0 = f̃(1) = f(1) = f( αx·(1/x) + ᾱy·(1/y) ) < αx f(1/x) + ᾱy f(1/y) = α f̃(x) + ᾱ f̃(y) .

So f̃ is strictly convex at 1, and thus D_{f̃}(·‖·) is a valid f-divergence.

Finally,

    D_{f̃}(P‖Q) = E_Q[ f̃(P/Q) ] = E_P[ (Q/P) f̃(P/Q) ] = E_P[ f(Q/P) ] = D_f(Q‖P) .


6.3 Total variation and hypothesis testing

Recall that the choice f(x) = ½|x − 1| gives rise to the total variation distance,

    D_f(P‖Q) = ½ E_Q| P/Q − 1 | = ½ ∫ |P − Q| ,

where ∫|P − Q| is a shorthand understood in the usual sense, namely ∫ |dP/dμ − dQ/dμ| dμ, where μ is a dominating measure, e.g., μ = P + Q; the value of the integral does not depend on μ. We denote the total variation by TV(P,Q).

Theorem 6.3. The following definitions of total variation are equivalent:

1.
    TV(P,Q) = sup_E { P(E) − Q(E) } ,      (6.4)

where the supremum is over all measurable sets E.

2. 1 − TV(P,Q) is the minimal sum of Type-I and Type-II error probabilities for testing P versus Q, and²

    TV(P,Q) = 1 − ∫ P ∧ Q .      (6.5)

3. Provided the diagonal {(x,x): x ∈ X} is measurable,

    TV(P,Q) = inf_{P_{XY}: P_X = P, P_Y = Q} P[X ≠ Y] .      (6.6)

4. Let F = {f: X → R, ‖f‖_∞ ≤ 1}. Then

    TV(P,Q) = ½ sup_{f∈F} { E_P f(X) − E_Q f(X) } .      (6.7)

Remark 6.4 (Variational representation). Equations (6.4) and (6.7) provide sup-representations of total variation, which will be extended to general f-divergences (later). Note that (6.6) is an inf-representation of total variation in terms of couplings, meaning that total variation is the Wasserstein distance with respect to Hamming distance. The benefit of variational representations is that choosing a particular coupling in (6.6) gives an upper bound on TV(P,Q), and choosing a particular f in (6.7) yields a lower bound.

Remark 6.5 (Operational meaning). In the binary hypothesis test H₀: X ~ P vs. H₁: X ~ Q, Theorem 6.3 shows that 1 − TV(P,Q) is the sum of the false-alarm and missed-detection probabilities. This can be seen either from (6.4), where E is the decision region for deciding P, or from (6.5), since the optimal test (for average probability of error) is the likelihood-ratio test {dP/dQ > 1}. In particular,

• TV(P,Q) = 1 ⇔ P ⊥ Q: the probability of error is zero, since essentially P and Q have disjoint supports;

• TV(P,Q) = 0 ⇔ P = Q: the minimal sum of error probabilities is one, meaning the best thing to do is to flip a coin.

² Here again ∫ P ∧ Q is a shorthand understood in the usual sense, namely ∫ (dP/dμ ∧ dQ/dμ) dμ, where μ is any dominating measure.
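The equivalent forms (6.4)–(6.5) are easy to verify numerically on a tiny alphabet (our own sketch; brute-forcing the supremum over events is only feasible for very small alphabets).

```python
import itertools
import numpy as np

P = np.array([0.5, 0.25, 0.15, 0.10])
Q = np.array([0.25, 0.25, 0.25, 0.25])

tv_half_l1 = 0.5 * np.sum(np.abs(P - Q))              # (1/2) * integral |P - Q|
tv_min     = 1.0 - np.sum(np.minimum(P, Q))           # 1 - integral P ∧ Q,  (6.5)
tv_sup     = max(P[list(E)].sum() - Q[list(E)].sum()  # sup_E P(E) - Q(E),   (6.4)
                 for r in range(len(P) + 1)
                 for E in itertools.combinations(range(len(P)), r))

print(tv_half_l1, tv_min, tv_sup)                     # all three agree
```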


6.4 Motivating example: hypothesis testing with multiple samples

Observations:

1. Different f-divergences have different operational significance. For example, as we saw in Section 6.3, testing between two hypotheses boils down to total variation, which determines the fundamental limit (minimum average probability of error). For estimation under quadratic loss, the Le Cam divergence LC(P‖Q) = ½∫(P−Q)²/(P+Q) is useful.

2. Some f-divergences are easier to evaluate than others. For example, for product distributions, the Hellinger distance and the χ²-divergence tensorize, in the sense that they are easily expressible in terms of the one-dimensional marginals; by contrast, computing the total variation between product measures is frequently difficult. Another example: computing the χ²-divergence from a mixture of distributions to a simple distribution is convenient.

Therefore, the punchline is that it is often fruitful to bound one f-divergence by another, and this sometimes leads to tight characterizations. In this section we consider a specific useful example to drive this point home. In the next lecture we develop inequalities between f-divergences systematically.

Consider a binary hypothesis test where the data X = (X₁, ..., X_n) are drawn i.i.d. from either P or Q, and the goal is to test

    H₀: X ~ P^⊗n    vs.    H₁: X ~ Q^⊗n .

As mentioned before, 1 − TV(P^⊗n, Q^⊗n) gives the minimal sum of Type-I and Type-II probabilities of error, achieved by the maximum-likelihood test. By the data processing inequality, TV(P^⊗m, Q^⊗m) ≤ TV(P^⊗n, Q^⊗n) for m < n. From this we see that TV(P^⊗n, Q^⊗n) is an increasing sequence in n (bounded by 1 by definition) and hence converges. One would hope that as n → ∞, TV(P^⊗n, Q^⊗n) converges to 1 and consequently the probability of error in the hypothesis test converges to zero. It turns out that if the distributions P, Q are independent of n, then large deviation theory (see Corollary 15.1) gives

    TV(P^⊗n, Q^⊗n) = 1 − exp(−nC(P,Q) + o(n)) ,      (6.8)

where the constant C(P,Q) = −log inf_{0≤α≤1} ∫ P^α Q^{1−α} is the Chernoff information of P, Q. It is clear from this that TV(P^⊗n, Q^⊗n) → 1 as n → ∞, and in fact exponentially fast.

However, as frequently encountered in high-dimensional statistical problems, if the distributions P = P_n and Q = Q_n depend on n, then the large-deviations approach leading to (6.8) is no longer valid. In such a situation total variation is still relevant for hypothesis testing, but its behavior as n → ∞ is neither obvious nor easy to compute. In this case, understanding how a more computationally tractable f-divergence relates to total variation may give insight into hypothesis testing without directly computing the total variation. It turns out the Hellinger distance is precisely suited for this task – see Theorem 6.4 below.

Recall that the squared Hellinger distance, H²(P,Q) = E_Q[(1 − √(P/Q))²], is an f-divergence with f(x) = (1 − √x)², and it provides a sandwich bound for total variation:

    0 ≤ ½ H²(P,Q) ≤ TV(P,Q) ≤ H(P,Q) √(1 − H²(P,Q)/4) ≤ 1 .      (6.9)

This can be proved by elementary manipulations; a systematic proof will be explained in the next lecture. Direct consequences of this bound are:

72

Page 73: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

• H²(P,Q) = 2 if and only if TV(P,Q) = 1.

• H²(P,Q) = 0 if and only if TV(P,Q) = 0.

• Hellinger consistency ⟺ TV consistency, namely H²(P_n, Q_n) → 0 ⟺ TV(P_n, Q_n) → 0.

Theorem 6.4. For any sequence of distributions P_n and Q_n, as n → ∞,

    TV(P_n^⊗n, Q_n^⊗n) → 0  ⟺  H²(P_n, Q_n) = o(1/n) ,
    TV(P_n^⊗n, Q_n^⊗n) → 1  ⟺  H²(P_n, Q_n) = ω(1/n) .

Proof. Because the observations X = (X₁, ..., X_n) are i.i.d., the joint distribution factorizes:

    H²(P_n^⊗n, Q_n^⊗n) = 2 − 2 E_{Q_n^⊗n}[ √( ∏_{i=1}^n (P_n/Q_n)(X_i) ) ]
                        = 2 − 2 ∏_{i=1}^n E_{Q_n}[ √( (P_n/Q_n)(X_i) ) ]      (by independence)
                        = 2 − 2 ( E_{Q_n}[ √(P_n/Q_n) ] )ⁿ
                        = 2 − 2 ( 1 − ½ H²(P_n, Q_n) )ⁿ .

Then TV(P_n^⊗n, Q_n^⊗n) → 0 if and only if H²(P_n^⊗n, Q_n^⊗n) → 0, which happens precisely when H²(P_n, Q_n) = o(1/n). Similarly, TV(P_n^⊗n, Q_n^⊗n) → 1 if and only if H²(P_n^⊗n, Q_n^⊗n) → 2, which happens precisely when H²(P_n, Q_n) = ω(1/n).
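To see the 1/n threshold concretely, one can combine the tensorization identity from the proof with the sandwich bound (6.9); the sketch below (ours, using Bernoulli pairs whose Hellinger separation scales like Θ(1/n)) brackets TV(P_n^⊗n, Q_n^⊗n) between two constants bounded away from 0 and 1.

```python
import numpy as np

def H2_bern(p, q):
    """Squared Hellinger distance between Bern(p) and Bern(q)."""
    return (np.sqrt(p) - np.sqrt(q)) ** 2 + (np.sqrt(1 - p) - np.sqrt(1 - q)) ** 2

def tv_bracket_product(h2_single, n):
    """Lower/upper bounds on TV(P^{⊗n}, Q^{⊗n}) via Hellinger tensorization + (6.9)."""
    H2n = 2 - 2 * (1 - h2_single / 2) ** n       # exact Hellinger tensorization
    lo = H2n / 2
    hi = np.sqrt(H2n) * np.sqrt(max(0.0, 1 - H2n / 4))
    return lo, hi

for n in [10, 100, 1000, 10000]:
    p, q = 0.5, 0.5 + 1.0 / np.sqrt(n)           # H2(Bern(p), Bern(q)) = Theta(1/n)
    print(n, tv_bracket_product(H2_bern(p, q), n))
```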

Remark 6.6. The proof of Theorem 6.4 relies on two ingredients:

1. The sandwich bound (6.9).

2. The tensorization property of Hellinger:

    H²( ∏_{i=1}^n P_i , ∏_{i=1}^n Q_i ) = 2 − 2 ∏_{i=1}^n ( 1 − H²(P_i, Q_i)/2 ) .      (6.10)

Note that there are other f-divergences that also tensorize, e.g., the χ²-divergence:

    χ²( ∏_{i=1}^n P_i ‖ ∏_{i=1}^n Q_i ) = ∏_{i=1}^n ( 1 + χ²(P_i‖Q_i) ) − 1 ;      (6.11)

however, no sandwich inequality like (6.9) exists between TV and χ², and hence there is no χ²-version of Theorem 6.4. Asserting the non-existence of such inequalities requires understanding the relationship between these two f-divergences (see next lecture).

however, no sandwich inequality like (6.9) exists for TV and χ2 and hence there is no χ2-version ofTheorem 6.4. Asserting the non-existence of such inequalities requires understanding the relationshipbetween these two f -divergences (see next lecture).

73

Page 74: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

6.5 Inequalities between f-divergences

We will discuss two methods for finding inequalities between f-divergences:

• the ad hoc approach: case-by-case proofs using results like Jensen's inequality, min ≤ mean ≤ max, Cauchy–Schwarz, etc.;

• the systematic approach: the joint range of f-divergences.

Definition 6.2. The joint range of two f-divergences D_f(·‖·) and D_g(·‖·) is the range of the mapping (P,Q) ↦ (D_f(P‖Q), D_g(P‖Q)), i.e., the set R ⊂ R₊ × R₊ such that (x,y) ∈ R if there exist distributions P, Q on some common measurable space with x = D_f(P‖Q) and y = D_g(P‖Q).

[Figure: a schematic joint range region between D_f (horizontal axis) and D_g (vertical axis).]

The shaded region in the figure above shows what a joint range between D_f(·‖·) and D_g(·‖·) might look like. By definition of R, the lower boundary gives the sharpest lower bound of D_f in terms of D_g, namely

    D_f(P‖Q) ≥ V(D_g(P‖Q)) ,   where  V(t) ≜ inf{ D_f(P‖Q) : D_g(P‖Q) = t } ;

similarly, the upper boundary gives the best upper bound. As will be discussed in the next lecture, the sandwich bound (6.9) corresponds precisely to the lower and upper boundaries of the joint range of H² and TV, and is therefore not improvable. It is important to note, however, that R may be an unbounded region and some of the boundaries may not exist, meaning it is impossible to bound one divergence by the other, as is the case for χ² versus TV.

To gain some intuition, we start with the ad hoc approach by proving Pinsker's inequality, which bounds total variation from above in terms of the KL divergence.


Theorem 6.5 (Pinsker's inequality).

    D(P‖Q) ≥ 2 log e · TV²(P,Q) .      (6.12)

Proof. First we show that, by the data processing inequality, it suffices to prove the result for Bernoulli distributions. For any event E, let Y = 1{X ∈ E}, which is Bernoulli with parameter P(E) or Q(E). By the data processing inequality, D(P‖Q) ≥ d(P(E)‖Q(E)). If Pinsker's inequality holds for all Bernoulli distributions, we have

    √( ½ D(P‖Q) ) ≥ TV(Bern(P(E)), Bern(Q(E))) = |P(E) − Q(E)| .

Taking the supremum over E gives √(½ D(P‖Q)) ≥ sup_E |P(E) − Q(E)| = TV(P,Q), in view of Theorem 6.3.

The binary case follows easily from Taylor's theorem (with integral remainder):

    d(p‖q) = ∫_q^p (p − t)/(t(1 − t)) dt ≥ 4 ∫_q^p (p − t) dt = 2(p − q)² ,

and TV(Bern(p), Bern(q)) = |p − q|.

Remark 6.7. Pinsker's inequality is known to be sharp in the sense that the constant "2" in (1.15) is not improvable, i.e., there exist P_n, Q_n such that D(P_n‖Q_n)/TV²(P_n, Q_n) → 2 log e as n → ∞. (Why? Think about the local quadratic behavior in Proposition 4.2.) Nevertheless, this does not mean that (1.15) itself is not improvable, because it might be possible to subtract some higher-order term from the RHS. This is indeed the case, and there are many refinements of Pinsker's inequality. But what is the best inequality? Settling this question rests on characterizing the joint range and its lower boundary. This is the topic of the next lecture.


§ 7. Inequalities between f-divergences via joint range

In the last lecture we proved Pinsker's inequality, D(P‖Q) ≥ 2TV²(P,Q), in an ad hoc manner. The downside of ad hoc approaches is that it is hard to tell whether the resulting inequalities can be improved. However, the key step in our proof of Pinsker's inequality, the reduction to Bernoulli random variables, is inspiring: is it possible to reduce inequalities between any two f-divergences to the binary case?

7.1 Inequalities via joint range

A systematic method is to prove such inequalities via the joint range. For example, to lower bound D(P‖Q) by a function of TV(P,Q), say D(P‖Q) ≥ F(TV(P,Q)) for some F: [0,1] → [0,∞], the best choice, by definition, is

    F(ε) ≜ inf_{(P,Q): TV(P,Q)=ε} D(P‖Q) .

The problem thus boils down to characterizing the region {(TV(P,Q), D(P‖Q)) : P, Q} ⊆ R², their joint range, whose lower boundary is the function F.

[Figure 7.1: Joint range of TV and D.]


Definition 7.1 (Joint range). Consider two f-divergences D_f(P‖Q) and D_g(P‖Q). Their joint range is a subset of R² defined by

    R   ≜ {(D_f(P‖Q), D_g(P‖Q)) : P, Q are probability measures on some measurable space} ,
    R_k ≜ {(D_f(P‖Q), D_g(P‖Q)) : P, Q are probability measures on [k]} .

The region R seems difficult to characterize, since we need to consider P, Q over all measurable spaces; on the other hand, the region R_k for small k is easy to obtain. The main theorem we will prove is the following [HV11]:

Theorem 7.1 (Harremoës–Vajda '11).

    R = co(R₂) .

It is easy to obtain a parametric formula for R₂. By Theorem 7.1, the region R is no more than the convex hull of R₂.

Theorem 7.1 implies that R is a convex set; however, as a warmup, it is instructive to prove convexity of R directly, which simply follows from the arbitrariness of the alphabet size of the distributions. Given any two points (D_f(P₀‖Q₀), D_g(P₀‖Q₀)) and (D_f(P₁‖Q₁), D_g(P₁‖Q₁)) in the joint range, it is easy to construct another pair of distributions (P,Q) by alphabet extension that produces any convex combination of these two points.

Theorem 7.2. R is convex.

Proof. Given any two pairs of distributions (P₀, Q₀) and (P₁, Q₁) on some space X and any α ∈ [0,1], write ᾱ = 1 − α and define another pair of distributions (P, Q) on X × {0,1} by

    P = ᾱ (P₀ × δ₀) + α (P₁ × δ₁) ,
    Q = ᾱ (Q₀ × δ₀) + α (Q₁ × δ₁) .

In other words, we construct a random variable Z = (X, B) with B ~ Bern(α), where P_{X|B=i} = P_i and Q_{X|B=i} = Q_i. Then

    D_f(P‖Q) = E_Q[ f(P/Q) ] = E_B[ E_{Q_{Z|B}}[ f(P/Q) ] ] = ᾱ D_f(P₀‖Q₀) + α D_f(P₁‖Q₁) ,
    D_g(P‖Q) = E_Q[ g(P/Q) ] = E_B[ E_{Q_{Z|B}}[ g(P/Q) ] ] = ᾱ D_g(P₀‖Q₀) + α D_g(P₁‖Q₁) .

Therefore ᾱ (D_f(P₀‖Q₀), D_g(P₀‖Q₀)) + α (D_f(P₁‖Q₁), D_g(P₁‖Q₁)) ∈ R, and thus R is convex.

Theorem 7.1 is proved via the following two lemmas.

Lemma 7.1 (non-constructive/existential). R = R₄.

Lemma 7.2 (constructive/algorithmic).

    R_{k+1} = co(R₂ ∪ R_k)   for any k ≥ 2,

and hence

    R_k = co(R₂)   for any k ≥ 3.


7.1.1 Representation of f-divergences

To prove Lemma 7.1 and Lemma 7.2, we first express f-divergences as expectations over the likelihood ratio.

Lemma 7.3. Given two f-divergences D_f(·‖·) and D_g(·‖·), their joint range is

    R   = { ( E[f(X)] + f(0)(1 − E[X]) , E[g(X)] + g(0)(1 − E[X]) ) : X ≥ 0, E[X] ≤ 1 } ,
    R_k = { ( E[f(X)] + f(0)(1 − E[X]) , E[g(X)] + g(0)(1 − E[X]) ) :
            X ≥ 0, E[X] ≤ 1, X takes at most k − 1 values,
            or X ≥ 0, E[X] = 1, X takes at most k values } ,

where f(0) ≜ lim_{x→0} x f(1/x) and g(0) ≜ lim_{x→0} x g(1/x).

In the statement of Lemma 7.3, we remark that f(0) and g(0) are both well defined (possibly +∞) by the convexity of x ↦ x f(1/x) and x ↦ x g(1/x) (from the last lecture).

Before proving the lemma, we look at two examples to understand the correspondence between a point in the joint range and a random variable. The first example is the simple case P ≪ Q, when the likelihood ratio of P and Q (the Radon–Nikodym derivative, defined on the union of the supports of P and Q) is well defined.

Example: Consider the following two distributions P, Q on [3]:

    x    1     2     3
    P    0.34  0.34  0.32
    Q    0.85  0.1   0.05

Then D_f(P‖Q) = 0.85 f(0.4) + 0.1 f(3.4) + 0.05 f(6.4), which is E[f(X)], where X is the likelihood ratio of P and Q, taking 3 values with pmf

    x     0.4   3.4   6.4
    μ(x)  0.85  0.1   0.05

In the other direction, given the above pmf of a non-negative, unit-mean random variable X ~ μ taking 3 values, we can construct a pair of distributions by Q(x) = μ(x) and P(x) = x μ(x).

In general P is not necessarily absolutely continuous w.r.t. Q, and the likelihood ratio X may not always exist. However, it is still well defined on the event {Q > 0}.

Example: Consider the following two distributions P, Q on [2]:

    x   1    2
    P   0.4  0.6
    Q   0    1

Then D_f(P‖Q) = f(0.6) + 0·f(0.4/0), where 0·f(p/0) is understood as

    0·f(p/0) = lim_{x→0} x f(p/x) = p lim_{x→0} (x/p) f(p/x) = p f(0) .

Therefore D_f(P‖Q) = f(0.6) + 0.4 f(0) = E[f(X)] + f(0)(1 − E[X]), where X is defined on {Q > 0}:

    x     0.6
    μ(x)  1


In the other direction, given the above pmf of a non-negative random variable X ~ μ with E[X] ≤ 1 taking one value, we let Q(x) = μ(x), let P(x) = x μ(x) on {Q > 0}, and let P have an extra point mass 1 − E[X].

Proof of Lemma 7.3. We first prove it for R. Given any pair of distributions (P,Q) producing a point of R, let p, q denote the densities of P, Q under some dominating measure μ, respectively. Let

    X = p/q on {q > 0} ,   with X distributed according to Q ,      (7.1)

so that X ≥ 0 and E[X] = P[q > 0] ≤ 1. Then

    D_f(P‖Q) = ∫_{q>0} f(p/q) dQ + ∫_{q=0} (q/p) f(p/q) dP = ∫_{q>0} f(p/q) dQ + f(0) P[q = 0]
             = E[f(X)] + f(0)(1 − E[X]) ,

and analogously

    D_g(P‖Q) = E[g(X)] + g(0)(1 − E[X]) .

In the other direction, given any random variable X ≥ 0 with E[X] ≤ 1 and X ~ μ, let

    dQ = dμ ,   dP = X dμ + (1 − E[X]) δ_* ,      (7.2)

where * is an arbitrary symbol outside the support of X. Then

    ( D_f(P‖Q), D_g(P‖Q) ) = ( E[f(X)] + f(0)(1 − E[X]) , E[g(X)] + g(0)(1 − E[X]) ) .

Now consider R_k. Given two probability measures P, Q on [k], the likelihood ratio defined in (7.1) takes at most k values. If P ≪ Q then E[X] = 1; if P ⋪ Q then X takes at most k − 1 values.

In the other direction, if E[X] = 1 then the construction of P, Q in (7.2) lives on the same support as X; if E[X] < 1 then the support of P is increased by one point.

7.1.2 Proof of Theorem 7.1

Aside (Fenchel–Eggleston–Carathéodory theorem): Let S ⊆ R^d and x ∈ co(S). Then there exists a set of d+1 points S′ = {x₁, x₂, ..., x_{d+1}} ⊆ S such that x ∈ co(S′). If S is connected, then d points are enough.

Proof of Lemma 7.1. It suffices to prove that

    R ⊆ R₄ .

Let S ≜ {(x, f(x), g(x)) : x ≥ 0}, which is a connected set. For any pair of distributions (P,Q) producing a point of R, construct a random variable X as in (7.1); then (E[X], E[f(X)], E[g(X)]) ∈ co(S). By the Fenchel–Eggleston–Carathéodory theorem,¹ there exist points (x_i, f(x_i), g(x_i)) and corresponding weights α_i, i = 1, 2, 3, such that

    (E[X], E[f(X)], E[g(X)]) = Σ_{i=1}^3 α_i (x_i, f(x_i), g(x_i)) .

¹ To prove Theorem 7.1, it suffices to invoke the basic Carathéodory theorem, which proves a weaker version of Lemma 7.1, namely R = R₅.


We construct another random variable X′ that takes value x_i with probability α_i. Then X′ takes 3 values and

    (E[X], E[f(X)], E[g(X)]) = (E[X′], E[f(X′)], E[g(X′)]) .      (7.3)

By Lemma 7.3 and (7.3),

    ( D_f(P‖Q), D_g(P‖Q) ) = ( E[f(X)] + f(0)(1 − E[X]) , E[g(X)] + g(0)(1 − E[X]) )
                           = ( E[f(X′)] + f(0)(1 − E[X′]) , E[g(X′)] + g(0)(1 − E[X′]) ) ∈ R₄ .

We observe from Lemma 7.3 that D_f(P‖Q) only depends on the distribution of X, for some X ≥ 0 with E[X] ≤ 1. To find a pair of distributions producing a point in R_k, it suffices to find a random variable X ≥ 0 taking k values with E[X] = 1, or taking k − 1 values with E[X] ≤ 1. In Example 7.1.1, where (P,Q) produces a point in R₃, we want to show that the point also belongs to co(R₂). The decomposition of a point in R₃ is equivalent to a decomposition of the likelihood ratio X as

    μ_X = α μ₁ + (1 − α) μ₂ .

One such decomposition is μ_X = 0.5 μ₁ + 0.5 μ₂, where μ₁, μ₂ have the following pmfs:

    x      0.4  3.4          x      0.4  6.4
    μ₁(x)  0.8  0.2          μ₂(x)  0.9  0.1

Then by (7.2) we obtain two pairs of distributions,

    P₁ = (0.32, 0.68), Q₁ = (0.8, 0.2)    and    P₂ = (0.36, 0.64), Q₂ = (0.9, 0.1) ,

and we obtain

    ( D_f(P‖Q), D_g(P‖Q) ) = 0.5 ( D_f(P₁‖Q₁), D_g(P₁‖Q₁) ) + 0.5 ( D_f(P₂‖Q₂), D_g(P₂‖Q₂) ) .

Proof of Lemma 7.2. It suffices to prove the first statement, namely R_{k+1} = co(R₂ ∪ R_k) for any k ≥ 2. By the same argument as in the proof of Theorem 7.2 we have co(R_k) ⊆ R_{k+1}, and note that R₂ ∪ R_k = R_k. We only need to prove that

    R_{k+1} ⊆ co(R₂ ∪ R_k) .

Given any pair of distributions (P,Q) producing a point (D_f(P‖Q), D_g(P‖Q)) ∈ R_{k+1}, construct a random variable X as in (7.1); it takes at most k + 1 values. Let μ denote the distribution of X. We consider the two cases E_μ[X] < 1 and E_μ[X] = 1 separately.

• E_μ[X] < 1. Then X takes at most k values (since otherwise P ≪ Q). Denote the smallest value of X by x; then x < 1. Suppose μ(x) = q, so that μ can be represented as

    μ = q δ_x + (1 − q) μ′ ,

where μ′ is supported on at most k − 1 values of X other than x. Let μ₂ = δ_x. We need to construct another probability measure μ₁ such that

    μ = (1 − α) μ₁ + α μ₂ :

    – If E_{μ′}[X] ≤ 1: let μ₁ = μ′ and let α = q.


    – If E_{μ′}[X] > 1: let μ₁ = p δ_x + (1 − p) μ′ where p = (E_{μ′}[X] − 1)/(E_{μ′}[X] − x), so that E_{μ₁}[X] = 1, and let 1 − α = (E_μ[X] − x)/(1 − x).

• E_μ[X] = 1.² Denote the smallest value of X by x and the largest value by y; then x ≤ 1 and y ≥ 1. Suppose μ(x) = r and μ(y) = s, so that μ can be represented as

    μ = r δ_x + (1 − r − s) μ′ + s δ_y ,

where μ′ is supported on at most k − 1 values of X other than x, y. Let μ₂ = β δ_x + (1 − β) δ_y, where β = (y − 1)/(y − x), so that E_{μ₂}[X] = 1. We need to construct another probability measure μ₁ such that

    μ = (1 − α) μ₁ + α μ₂ :

    – If E_{μ′}[X] ≤ 1: let μ₁ = p δ_y + (1 − p) μ′ where p = (1 − E_{μ′}[X])/(y − E_{μ′}[X]), so that E_{μ₁}[X] = 1, and let α = r/β.

    – If E_{μ′}[X] > 1: let μ₁ = p δ_x + (1 − p) μ′ where p = (E_{μ′}[X] − 1)/(E_{μ′}[X] − x), so that E_{μ₁}[X] = 1, and let α = s/(1 − β).

Applying the construction in (7.2) with μ₁ and μ₂, we obtain two pairs of distributions (P₁, Q₁), supported on k values, and (P₂, Q₂), supported on two values, respectively. Then

    ( D_f(P‖Q), D_g(P‖Q) ) = ( E_μ[f(X)] + f(0)(1 − E_μ[X]) , E_μ[g(X)] + g(0)(1 − E_μ[X]) )
        = (1 − α) ( E_{μ₁}[f(X)] + f(0)(1 − E_{μ₁}[X]) , E_{μ₁}[g(X)] + g(0)(1 − E_{μ₁}[X]) )
          + α ( E_{μ₂}[f(X)] + f(0)(1 − E_{μ₂}[X]) , E_{μ₂}[g(X)] + g(0)(1 − E_{μ₂}[X]) )
        = (1 − α) ( D_f(P₁‖Q₁), D_g(P₁‖Q₁) ) + α ( D_f(P₂‖Q₂), D_g(P₂‖Q₂) ) .

Remark 7.1. Theorem 7.1 can be viewed as a consequence of the Krein–Milman theorem. Consider the space {P_X : X ≥ 0, E[X] ≤ 1}, which has only two types of extreme points:

1. X = x deterministically, for 0 ≤ x ≤ 1;

2. X takes two values x₁, x₂ with probabilities α₁, α₂, respectively, and E[X] = 1.

For the first case, let P = Bern(x) and Q = δ₁; for the second case, let P = Bern(α₂x₂) and Q = Bern(α₂).

7.2 Examples

7.2.1 Hellinger distance versus total variation

The upper and lower bounds mentioned in the last lecture are the following [Tsy09, Sec. 2.4]:

    ½ H²(P,Q) ≤ TV(P,Q) ≤ H(P,Q) √(1 − H²(P,Q)/4) .      (7.4)

² Many thanks to Pengkun Yang for correcting the error in the original proof.


Their joint range R₂ has the parametric formula

    { ( 2(1 − √(pq) − √((1−p)(1−q))) , |p − q| ) : 0 ≤ p ≤ 1, 0 ≤ q ≤ 1 }

and is the gray region in Fig. 7.2. The joint range R is the convex hull of R₂ (the gray region, which itself is non-convex) and is exactly described by (7.4); so (7.4) is not improvable. Indeed, as t ranges from 0 to 1,

• the upper boundary is achieved by P = Bern((1+t)/2), Q = Bern((1−t)/2);

• the lower boundary is achieved by P = (1−t, t, 0), Q = (1−t, 0, t).

[Figure 7.2: Joint range of TV and H².]
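The parametric description of R₂ can be explored numerically (our own sketch): sample Bernoulli pairs, compute (H², TV), and confirm that every sampled point obeys (7.4).

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = rng.uniform(size=100000), rng.uniform(size=100000)

H2 = 2 * (1 - np.sqrt(p * q) - np.sqrt((1 - p) * (1 - q)))   # H^2(Bern(p), Bern(q))
TV = np.abs(p - q)

lower = H2 / 2
upper = np.sqrt(H2) * np.sqrt(1 - H2 / 4)
print("violations of (7.4):", np.sum((TV < lower - 1e-12) | (TV > upper + 1e-12)))
```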

7.2.2 KL divergence versus total variation

Pinsker's inequality states that

    D(P‖Q) ≥ 2 log e · TV²(P,Q) .      (7.5)

There are various improvements of Pinsker's inequality. We now know that the best lower bound is the lower boundary of Fig. 7.1, which is exactly the boundary of R₂. Although no closed-form expression is known, a parametric formula for the lower boundary (see Fig. 7.1) is not hard to write down [FHT03, Theorem 1]:

    TV_t = ½ t ( 1 − (coth(t) − 1/t)² ) ,
    D_t  = log(t · csch(t)) + t coth(t) − t² csch²(t) ,   t ≥ 0 .      (7.6)

Here is a corollary (a weaker bound) that we will use later:

    D(P‖Q) ≥ TV(P,Q) log [ (1 + TV(P,Q)) / (1 − TV(P,Q)) ] .

Consequences:

• The original Pinsker inequality shows that D → 0 ⇒ TV → 0.

• TV → 1 ⇒ D → ∞. Thus D = O(1) ⇒ TV is bounded away from one. This is not obtainable from Pinsker's inequality.

Also, from Fig. 7.1 we see that it is impossible to upper bound D(P‖Q) by a function of TV(P,Q), due to the lack of an upper boundary.

For more examples see [Tsy09, Sec. 2.4].

7.3 Joint range between various divergences

Joint ranges between various pairs of f-divergences are summarized below in terms of inequalities, each of which is tight.

• KL vs. TV: see (7.6). There is a partial comparison in the other direction ("reverse Pinsker"):

    D(μ‖π) ≤ log( 1 + (2/π*) TV(μ,π)² ) ≤ (2 log e / π*) TV(μ,π)² ,   π* = min_x π(x) .

• KL vs. Hellinger:

    D(P‖Q) ≥ 2 log [ 2 / (2 − H²(P,Q)) ] .      (7.7)

There is a partial result in the opposite direction (log-Sobolev inequality for the Bonami–Beckner semigroup):

    D(μ‖π) ≤ [ log(1/π* − 1) / (1 − 2π*) ] · ( H²(μ,π) − H⁴(μ,π)/4 ) ,   π* = min_x π(x) .

• KL vs. χ²:

    0 ≤ D(P‖Q) ≤ log( 1 + χ²(P‖Q) ) ,      (7.8)

i.e., no lower bound on KL in terms of χ² is possible.

• TV and Hellinger: see (7.4). Another bound [Gil10]:

    TV(P,Q) ≤ √( −2 ln( 1 − H²(P,Q)/2 ) ) .

• Le Cam and Hellinger [LC86, p. 48]:

    ½ H²(P,Q) ≤ LC(P,Q) ≤ H²(P,Q) .      (7.9)

• Le Cam and Jensen–Shannon [Top00]:

    LC(P‖Q) · log e ≤ JS(P‖Q) ≤ LC(P‖Q) · 2 log 2 .      (7.10)

• χ² and TV [HV11, Eq. (12)]:

    χ²(P‖Q) ≥ 4 TV(P,Q)² .      (7.11)


Part II

Lossless data compression


§ 8. Variable-length Lossless Compression

The principal engineering goal of data compression is to represent a given sequence a₁, a₂, ..., a_n produced by a source as a sequence of bits of the minimal possible length (possibly subject to algorithmic constraints). Of course, reducing the number of bits is generally impossible unless the source satisfies certain restrictions, that is, only a small subset of all sequences actually occurs in practice. Is this the case for real-world data?

As a simple demonstration, one may take two English novels and compute the empirical frequencies of each letter. They will turn out to be (approximately) the same for both novels. Thus we can see that there is some underlying structure in English texts restricting the possible output sequences. The structure goes beyond empirical frequencies, of course, as further experimentation (involving digrams, word frequencies, etc.) may reveal. Thus, the main reason for the possibility of data compression is the experimental (empirical) law: real-world sources produce very restricted sets of sequences.

How do we model these restrictions? Further experimentation (with language, music, images) reveals that frequently the structure is well described if we assume that sequences are generated probabilistically [Sha48, Sec. III]. This is one of the main contributions of Shannon: another empirical law states that real-world sources may be described probabilistically with increasing precision, starting from i.i.d., 1st-order Markov, 2nd-order Markov, etc. Note that sometimes one needs to find an appropriate basis in which this "law" holds – this is the case for images (a rasterized sequence of pixels won't appear to have local probabilistic laws, because the 2-D structure is forgotten; wavelets and local Fourier transforms provide much better bases).¹

So our initial investigation will be about representing one random variable X ~ P_X in terms of bits efficiently. Types of compression:

• Lossless: P(X ≠ X̂) = 0. Variable-length codes, uniquely decodable codes, prefix codes, Huffman codes.

• Almost lossless: P(X ≠ X̂) ≤ ε. Fixed-length codes.

• Lossy: X → W → X̂ such that E[(X − X̂)²] ≤ distortion.

8.1 Variable-length, lossless, optimal compressor

Coding paradigm:

Remark 8.1.

• Codeword: f(x) ∈ {0,1}*; codebook: {f(x) : x ∈ X} ⊂ {0,1}*.

¹ Of course, one should not take these "laws" too far. In regards to language modeling, a (finite-state) Markov assumption is too simplistic to truly generate all proper sentences, cf. Chomsky [Cho56].


[Diagram: X → (Compressor f: X → {0,1}*) → (Decompressor g: {0,1}* → X̂).]

• Since {0,1}* = {∅, 0, 1, 00, 01, ...} is countable, lossless compression is only possible for discrete random variables;

• if we want g∘f = 1_X (lossless), then f must be injective;

• WLOG, we can relabel X such that X = N = {1, 2, ...} and order the pmf decreasingly: P_X(i) ≥ P_X(i+1);

• note that at this point we do not impose any conditions on the codebook (such as prefix-freeness or unique decodability). This is sometimes called the single-shot compression setting. Original results for this setting can be found in [KV14].

Length function:l : 0, 1∗ → N

e.g., l(01001) = 5.Objectives: Find the best compressor f to minimize

E[l(f(X))]

sup l(f(X))

median l(f(X))

It turns out that there is a compressor f∗ that minimizes all objectives simultaneously!Main idea: Assign longer codewords to less likely symbols, and reserve the shorter codewords

for more probable symbols.Aside: It is useful to introduce the partial order of stochastic dominance: For real-valued RV X

and Y , we say Y stochastically dominates (or, is stochastically larger than) X, denoted by Xst.≤ Y ,

if P [Y ≤ t] ≤ P [X ≤ t] for all t ∈ R. In other words, Xst.≤ Y iff the CDF of X is larger than the

CDF of Y pointwise. In particular, if X is dominated by Y stochastically, so are their means,medians, supremum, etc.

Theorem 8.1 (optimal f∗). Consider the compressor f∗ defined by

Then

1. length of codeword:l(f∗(i)) = blog2 ic

86

Page 87: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

2. l(f∗(X)) is stochastically the smallest: for any lossless f ,

l(f∗(X))st.≤ l(f(X))

i.e., for any k, P[l(f(X)) ≤ k] ≤ P[l(f∗(X)) ≤ k].

Proof. Note that

|Ak| , |x : l(f(x)) ≤ k| ≤k∑

i=0

2i = 2k+1 − 1 = |x : l(f∗(x)) ≤ k| , |A∗k|.

where the inequality is because of f is lossless and hence |Ak| must be at least the total number ofbinary strings of length at most k. Then

P[l(f(X)) ≤ k] =∑

x∈Ak

PX(x) ≤∑

x∈A∗k

PX(x) = P[l(f∗(X)) ≤ k],

since |Ak| ≤ |A∗k| and A∗k contains all 2k+1 − 1 most likely symbols.

The following lemma (homework) is useful in bounding the expected code length of f∗. It saysif the random variable is integer-valued, then its entropy can be controlled using its mean.

Lemma 8.1. For any Z ∈ N s.t. E[Z] <∞, H(Z) ≤ E[Z]h( 1E[Z]), where h(·) is the binary entropy

function.

Theorem 8.2 (Optimal average code length: exact expression). Suppose X ∈ N and PX(1) ≥PX(2) ≥ . . .. Then

E[l(f∗(X))] =∞∑

k=1

P[X ≥ 2k].

Proof. Recall that expectation of U ∈ Z+ can be written as E [U ] =∑

k≥1 P [U ≥ k]. Then byTheorem 8.1, E[l(f∗(X))] = E [blog2Xc] =

∑k≥1 P [blog2Xc ≥ k] =

∑k≥1 P [log2X ≥ k].

Theorem 8.3 (Optimal average code length v.s. entropy).

H(X) bits− log2[e(H(X) + 1)] ≤ E[l(f∗(X))] ≤ H(X) bits

Note: Theorem 8.3 is the first example of a coding theorem, which relates the fundamental limitE[l(f∗(X))] (operational quantity) to the entropy H(X) (information measure).

Proof. Define L(X) = l(f∗(X))).RHS: observe that since the pmf are ordered decreasingly by assumption, PX(m) ≤ 1/m, so

L(m) ≤ log2m ≤ log2(1/PX(m)). Taking expectation yields E[L(X)] ≤ H(X).

87

Page 88: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

LHS:

H(X) = H(X,L) = H(X|L) +H(L)

≤ E[L] + h

(1

1 + E[L]

)(1 + E[L]) (Lemma 8.1)

= E[L] + log2(1 + E[L]) + E[L] log

(1 +

1

E[L]

)

≤ E[L] + log2(1 + E[L]) + log2 e (x log(1 + 1/x) ≤ log e,∀x > 0)

≤ E[L] + log2(e(1 +H(X))) (by RHS)

where we have used H(X|L = k) ≤ k bits, since given l(f∗(X))) = k, X has at most 2k choices.

Note: (Memoryless source) If X = Sn is an i.i.d. sequence, then

nH(S) ≥ E[l(f∗(Sn))] ≥ nH(S)− log n+O(1).

For iid sources, the exact asymptotic behavior is found in [SV11, Theorem 4] as:

E[`(f∗(Sn))] = nH(S)− 1

2log n+O(1) ,

unless the source is uniform (in which case it is nH(S) +O(1).Theorem 8.3 relates the mean of l(f∗(X)) ≤ k to that of log2

1PX(X) (entropy). The next result

relates their CDFs.

Theorem 8.4 (Code length distribution of f∗). ∀τ > 0, k ≥ 0,

P[log2

1

PX(X)≤ k

]≤ P [l(f∗(X)) ≤ k] ≤ P

[log2

1

PX(X)≤ k + τ

]+ 2−τ+1

Proof. LHS (achievability): Use PX(m) ≤ 1/m. Then similarly as in Theorem 8.3, L(m) =blog2mc ≤ log2m ≤ log2

1PX(m) . Hence L(X) ≤ log2

1PX(X) a.s.

RHS (converse): By truncation,

P [L ≤ k] = P[L ≤ k, log2

1

PX(X)≤ k + τ

]+ P

[L ≤ k, log2

1

PX(X)> k + τ

]

≤ P[log2

1

PX(X)≤ k + τ

]+∑

x∈XPX(x)1l(f∗(x))≤k1PX(x)≤2−k−τ

≤ P[log2

1

PX(X)≤ k + τ

]+ (2k+1 − 1) · 2−k−τ

So far our discussion applies to an arbitrary random variable X. Next we consider the source asa random process (S1, S2, . . .) and introduce blocklength n. We apply our results to X = Sn, that is,by treating the first n symbols as a supersymbol. The following corollary states that the limitingbehavior of l(f∗(Sn)) and log 1

PSn (Sn) always coincide.

Corollary 8.1. Let (S1, S2, . . .) be some random process and U be some random variable. Then

1

nlog2

1

PSn(Sn)

D−→U ⇔ 1

nl(f∗(Sn))

D−→U (8.1)

and1√n

(log2

1

PSn(Sn)−H(Sn)

)D−→V ⇔ 1√

n(l(f∗(Sn))−H(Sn))

D−→V (8.2)

88

Page 89: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Proof. The proof is simply logic. First recall: convergence in distribution is equivalent to convergence

of CDF at all continuity point, i.e., UnD−→U ⇔ P [Un ≤ u]→ P [U ≤ u] for all u at which point the

CDF of U is continuous (i.e., not an atom of U).To get (8.1), apply Theorem 8.4 with k = un and τ =

√n:

P[

1

nlog2

1

PX(X)≤ u

]≤ P

[1

nl(f∗(X)) ≤ u

]≤ P

[1

nlog2

1

PX(X)≤ u+

1√n

]+ 2−

√n+1.

To get (8.2), apply Theorem 8.4 with k = H(Sn) +√nu and τ = n1/4:

P[

1√n

(log

1

PSn(Sn)−H(Sn)

)≤ u

]≤ P

[l(f∗(Sn))−H(Sn)√

n≤ u

]

≤ P[

1√n

(log

1

PSn(Sn)−H(Sn)

)≤ u+ n−1/4

]+ 2−n

1/4+1

Remark 8.2 (Memoryless source). Now let us consider Sn that are i.i.d. Then the importantobservation is that the log likihood becomes an i.i.d. sum:

log1

PSn(Sn)=

n∑

i=1

log1

PS(Si)︸ ︷︷ ︸i.i.d.

.

1. By the Law of Large Numbers (LLN), we know that 1n log 1

PSn (Sn)

P−→E log 1PS(S) = H(S).

Therefore in (8.1) the limiting distribution U is degenerate, i.e., U = H(S), and we have1n l(f

∗(Sn))P−→E log 1

PS(S) = H(S). [Note: convergence in distribution to a constant ⇔ conver-

gence in probability to a constant]

2. By the Central Limit Theorem (CLT), if V (S) , Var[

log 1PS(S)

]<∞,2 then we know that V

in (8.2) is Gaussian, i.e.,

1√nV (S)

(log

1

PSn(Sn)− nH(S)

)D−→N (0, 1).

Consequently, we have the following Gaussian approximation for the probability law of theoptimal code length

1√nV (S)

(l(f∗(Sn))− nH(S))D−→N (0, 1),

or, in shorthand,

l(f∗(Sn)) ∼ nH(S) +√nV (S)N (0, 1) in distribution.

Gaussian approximation tells us the speed of 1n l(f

∗(Sn)) to entropy and give us a goodapproximation at finite n. In the next section we apply our bounds to approximate thedistribution of l(f∗(Sn)) in a concrete example:

2V (S) is often known as the varentropy of S.

89

Page 90: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

8.1.1 Compressing iid ternary source

Consider the source outputing n ternary letters each independent and distributed as

PX = [.445 .445 .11] .

For iid source it can be shown

E[`(f∗(Xn))] = nH(X)− 1

2log(2πeV n) +O(1) ,

where we denoted the varentropy of X by

V (X) , Var

[log

1

PX(X)

].

The Gaussian approximation to `(f∗(X)) is defined as

nH(X)− 1

2log 2πeV n+

√nV Z ,

where Z ∼ N (0, 1).On Fig. 8.1, 8.2, 8.3 we plot the distribution of the length of the optimal compressor for different

values of n and compare with the Gaussian approximation.Upper/lower bounds on the expectation:

H(Xn)− log(H(Xn) + 1)− log e ≤ E[`(f∗(Xn))] ≤ H(Xn)

Here are the numbers for different n

n = 20 21.5 ≤ 24.3 ≤ 27.8n = 100 130.4 ≤ 134.4 ≤ 139.0n = 500 684.1 ≤ 689.2 ≤ 695.0

In all cases above E[`(f∗(X))] is close to a midpoint between the two.

90

Page 91: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

0.8 0.9 1 1.1 1.2 1.3 1.4 1.50

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Optimal compression: CDF, n = 20, PX = [0.445 0.445 0.110]

Rate

P

True CDFLower boundUpper boundGaussian approximation

0.8 0.9 1 1.1 1.2 1.3 1.4 1.50

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

Optimal compression: PMF, n = 20, PX = [0.445 0.445 0.110]

Rate

P

True PMFGaussian approximation

Figure 8.1: CDF and PMF of optimal compressor

91

Page 92: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

1.2 1.25 1.3 1.35 1.4 1.45 1.5 1.550

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Optimal compression: CDF, n = 100, PX = [0.445 0.445 0.110]

Rate

P

True CDFLower boundUpper boundGaussian approximation

1.15 1.2 1.25 1.3 1.35 1.4 1.45 1.5 1.550

0.01

0.02

0.03

0.04

0.05

0.06

Optimal compression: PMF, n = 100, PX = [0.445 0.445 0.110]

Rate

P

True PMFGaussian approximation

Figure 8.2: CDF and PMF, Gaussian is shifted to the true E[`(f∗(X))]

92

Page 93: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

1.3 1.35 1.4 1.45 1.5 1.550

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Optimal compression: CDF, n = 500, PX = [0.445 0.445 0.110]

Rate

P

True CDFLower boundUpper boundGaussian approximation

1.3 1.35 1.4 1.45 1.5 1.550

0.005

0.01

0.015

0.02

0.025

0.03

Optimal compression: PMF, n = 500, PX = [0.445 0.445 0.110]

Rate

P

True PMFGaussian approximation

Figure 8.3: CDF and PMF of optimal compressor

93

Page 94: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

8.2 Uniquely decodable codes, prefix codes and Huffman codes

Huffman

code

all lossless codes

uniquely decodable codes

prefix codes

We have studied f∗, which achieves the stochastically smallest code length among all variable-length compressors. Note that f∗ is obtained by ordering the pmf and assigning shorter codewords tomore likely symbols. In this section we focus on a specific class of compressors with good algorithmicproperties which lead to low complexity and short delay when decoding from a stream of compressedbits. This part is more combinatorial in nature.

We start with a few definitions. Let A+ =⋃n≥1An denotes all non-empty finite-length strings

consisting of symbols from the alphabet A. Throughout this lecture A is a countable set.

Definition 8.1 (Extension of a code). The (symbol-by-symbol) extension of f : A → 0, 1∗ isf : A+ → 0, 1∗ where f(a1, . . . , an) = (f(a1), . . . , f(an)) is defined by concatenating the bits.

Definition 8.2 (Uniquely decodable codes). f : A → 0, 1∗ is uniquely decodable if its extensionf : A+ → 0, 1∗ is injective.

Definition 8.3 (Prefix codes). f : A → 0, 1∗ is a prefix code3 if no codeword is a prefix of another(e.g., 010 is a prefix of 0101).

Example: A = a, b, c.

• f(a) = 0, f(b) = 1, f(c) = 10 – not uniquely decodable, since f(ba) = f(c) = 10.

• f(a) = 0, f(b) = 10, f(c) = 11 – uniquely decodable and prefix.

• f(a) = 0, f(b) = 01, f(c) = 011, f(d) = 0111 – uniquely decodable but not prefix, since as longas 0 appears, we know that the previous codeword has terminated.

Remark 8.3.

1. Prefix codes are uniquely decodable.

3Also known as prefix-free/comma-free/self-punctuatingf/instantaneous code.

94

Page 95: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

2. Similar to prefix-free codes, one can define suffix-free codes. Those are also uniquely decodable(one should start decoding in reverse direction).

3. By definition, any uniquely decodable code does not have the empty string as a codeword.Hence f : X → 0, 1+ in both Definition 8.2 and Definition 8.3.

4. Unique decodability means that one can decode from a stream of bits without ambiguity, butone might need to look ahead in order to decide the termination of a codeword. (Think of thelast example). In contrast, prefix codes allow the decoder to decode instantaneously withoutlooking ahead.

5. Prefix code ↔ binary tree (codewords are leaves) ↔ strategy to ask “yes/no” questions

Theorem 8.5 (Kraft-McMillan).

1. Let f : A → 0, 1∗ be uniquely decodable. Set la = l(f(a)). Then f satisfies the Kraftinequality ∑

a∈A2−la ≤ 1. (8.3)

2. Conversely, for any set of code length la : a ∈ A satisfying (8.3), there exists a prefix codef , such that la = l(f(a)). Moreover, such an f can be computed efficiently.

Note: The consequence of Theorem 8.5 is that as far as compression efficiency is concerned, we canforget about uniquely decodable codes that are not prefix codes.

Proof. We prove the Kraft inequality for prefix codes and uniquely decodable codes separately.The purpose for doing a separate proof for prefix codes is to illustrate the powerful technique ofprobabilistic method. The idea is from [AS08, Exercise 1.8, p. 12].

Let f be a prefix code. Let us construct a probability space such that the LHS of (8.3) is theprobability of some event, which cannot exceed one. To this end, consider the following scenario:Generate independent Bern(1

2) bits. Stop if a codeword has been written, otherwise continue. Thisprocess terminates with probability

∑a∈A 2−la . The summation makes sense because the events

that a given codeword is written are mutually exclusive, thanks to the prefix condition.Now let f be a uniquely decodable code. The proof uses generating function as a device for

counting. (The analogy in coding theory is the weight enumerator function.) First assume A isfinite. Then L = maxa∈A la is finite. Let Gf (z) =

∑a∈A z

la =∑L

l=0Al(f)zl, where Al(f) denotesthe number of codewords of length l in f . For k ≥ 1, define fk : Ak → 0, 1+ as the symbol-by-

symbol extension of f . Then Gfk(z) =∑

ak∈Ak zl(fk(ak)) =

∑a1· · ·∑ak

zla1+···+lak = [Gf (z)]k =∑kLl=0Al(f

k)zl. By the unique decodability of f , fk is lossless. Hence Al(fk) ≤ 2l. Therefore we have

Gf (1/2)k = Gfk(1/2) ≤ kL for all k. Then∑

a∈A 2−la = Gf (1/2) ≤ limk→∞(kL)1/k → 1. If A is

countably infinite, for any finite subset A′ ⊂ A, repeating the same argument gives∑

a∈A′ 2−la ≤ 1.

The proof is complete by the arbitrariness of A′.Conversely, given a set of code lengths la : a ∈ A s.t.

∑a∈A 2−la ≤ 1, construct a prefix code

f as follows: First relabel A to N and assume that 1 ≤ l1 ≤ l2 ≤ . . .. For each i, define

ai ,i−1∑

k=1

2−lk

95

Page 96: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

with a1 = 0. Then ai < 1 by Kraft inequality. Thus we define the codeword f(i) ∈ 0, 1+ as thefirst li bits in the binary expansion of ai. Finally, we prove that f is a prefix code by contradiction:Suppose for some j > i, f(i) is the prefix of f(j), since lj ≥ li. Then aj − ai ≤ 2−li , since they agreeon the most significant li bits. But aj − ai = 2−li + 2−li+1 + . . . > 2−li , which is a contradiction.

Open problems:

1. Find a probabilistic proof of Kraft inequality for uniquely decodable codes.

2. There is a conjecture of Ahslwede that for any sets of lengths for which∑

2−la ≤ 34 there exists

a fix-free code (i.e. one which is simultaneously prefix-free and suffix-free). So far, existencehas only been shown when the Kraft sum is ≤ 5

8 , cf. [Yek04].

In view of Theorem 8.5, the optimal average code length among all prefix (or uniquely decodable)codes is given by the following optimization problem

L∗(X) , min∑

a∈APX(a)la (8.4)

s.t.∑

a∈A2−la ≤ 1

la ∈ N

This is an integer programming (IP) problem, which, in general, is computationally hard to solve.It is remarkable that this particular IP can be solved in near-linear time, thanks to the Huffmanalgorithm. Before describing the construction of Huffman codes, let us give bounds to L∗(X) interms of entropy:

Theorem 8.6.H(X) ≤ L∗(X) ≤ H(X) + 1 bit. (8.5)

Proof. “≤” Consider the following length assignment la =⌈log2

1PX(a)

⌉,4 which satisfies Kraft

since∑

a∈A 2−la ≤ ∑a∈A PX(a) = 1. By Theorem 8.5, there exists a prefix code f such that

l(f(a)) =⌈log2

1PX(a)

⌉and El(f(X)) ≤ H(X) + 1.

“≥” We give two proofs for the converse. One of the commonly used ideas to deal withcombinatorial optimization is relaxation. Our first idea is to drop the integer constraints in (8.4)and relax it into the following optimization problem, which obviously provides a lower bound

L∗(X) , min∑

a∈APX(a)la (8.6)

s.t.∑

a∈A2−la ≤ 1 (8.7)

This is a nice convex programming problem, since the objective function is affine and the feasible setis convex. Solving (8.6) by Lagrange multipliers (Exercise!) yields the minimum is equal to H(X)(achieved at la = log2

1PX(a)).

4Such a code is called a Shannon code.

96

Page 97: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Another proof is the following: For any f satisfying Kraft inequality, define a probability measureQ(a) = 2−la∑

a∈A 2−la . Then

El(f(X))−H(X) = D(P‖Q)− log∑

a∈A2−la

≥ 0

Next we describe the Huffman code, which achieves the optimum in (8.4). In view of the factthat prefix codes and binary trees are one-to-one, the main idea of Huffman code is to build thebinary tree bottom-up: Given a pmf PX(a) : a ∈ A,

1. Choose the two least-probable symbols in the alphabet

2. Delete the two symbols and add a new symbol (with combined probabilities). Add the newsymbol as the parent node of the previous two symbols in the binary tree.

The algorithm terminates in |A| − 1 steps. Given the binary tree, the code assignment can beobtained by assigning 0/1 to the branches. Therefore the time complexity is O(|A|) (sorted pmf) orO(|A| log |A|) (unsorted pmf).Example: A = a, b, c, d, e, PX = 0.25, 0.25, 0.2, 0.15, 0.15.

Huffman tree:

0.45

cb

0 10.55

0.3

ed

0 1a

0 1

0 1

Codebook:

f(a) = 00

f(b) = 10

f(c) = 11

f(d) = 010

f(e) = 011

Theorem 8.7 (Optimality of Huffman codes). The Huffman code achieves the minimal averagecode length (8.4) among all prefix (or uniquely decodable) codes.

Proof. [CT06, Sec. 5.8].

Remark 8.4 (Drawbacks of Huffman codes).

1. Does not exploit memory. Solution: block Huffman coding. Shannon’s original idea from1948 paper: in compressing English text, instead of dealing with letters and exploiting thenonequiprobability of the English alphabet, working with pairs of letters to achieve morecompression (more generally, n-grams). Indeed, compressing the block (S1, . . . , Sn) using itsHuffman code achieves H(S1, . . . , Sn) within one bit, but the complexity is |A|n!

2. Non-universal (constructing the Huffman code needs to know the source distribution). Thisbrings us the question: Is it possible to design universal compressor which achieves entropy fora class of source distributions? And what is the price to pay? See Homework and Lecture 11.

There are much more elegant solutions, e.g.,

1. Arithmetic coding: sequential encoding, linear complexity in compressing (S1, . . . , Sn) –Section 11.1.

97

Page 98: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

2. Lempel-Ziv algorithm: low-complexity, universal, provably optimal in a very strong sense –Section 11.7.

To sum up: Comparison of average code length (in bits):

H(X)− log2[e(H(X) + 1)] ≤ E[l(f∗(X))] ≤ H(X) ≤ E[l(fHuffman(X))] ≤ H(X) + 1.

98

Page 99: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

§ 9. Fixed-length (almost lossless) compression. Slepian-Wolf.

9.1 Fixed-length code, almost lossless, AEP

Coding paradigm:

Compressorf : X→0,1k

Decompressorg: 0,1k→X∪e

X X ∪ e0, 1k

Note: If we want g f = 1X , then k ≥ log2 |X |. But, the transmission link is erroneous anyway...and it turns out that by tolerating a little error probability ε, we gain a lot in terms of code length!

Indeed, the key idea is to allow errors: Instead of insisting on g(f(x)) = x for all x ∈ X ,consider only lossless decompression for a subset S ⊂ X :

g(f(x)) =

x x ∈ Se x 6∈ S

and the probability of error:

P [g(f(X)) 6= X] = P [g(f(X)) = e] = P [X ∈ S] .

Definition 9.1. A compressor-decompressor pair (f, g) is called a (k, ε)-code if:

f : X → 0, 1k

g : 0, 1k → X ∪ e

such that g(f(x)) ∈ x, e and P [g(f(X)) = e] ≤ ε.Fundamental limit:

ε∗(X, k) , infε : ∃(k, ε)-code for XThe following result connects the respective fundamental limits of fixed-length almost lossless

compression and variable-length lossless compression (Lecture 8):

Theorem 9.1 (Fundamental limit of error probabiliy).

ε∗(X, k) = P [l(f∗(X)) ≥ k] = 1− sum of 2k − 1 largest masses of X.

Proof. The proof is essentially tautological. Note 1 + 2 + · · · + 2k−1 = 2k − 1. Let S = 2k −1 most likely realizations of X. Then

ε∗(X, k) = P [X 6∈ S] = P [l(f∗(X)) ≥ k] .

Optimal codes:

99

Page 100: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

• Variable-length: f∗ encodes the 2k−1 symbols with the highest probabilities to φ, 0, 1, 00, . . . , 1k−1.

• Fixed-length: The optimal compressor f maps the elements of S into (00 . . . 00), . . . , (11 . . . 10)and the rest to (11 . . . 11). The decompressor g decodes perfectly except for outputting e uponreceipt of (11 . . . 11).

Note: In Definition 9.1 we require that the errors are always detectable, i.e., g(f(x)) = x or e.Alternatively, we can drop this requirement and allow undetectable errors, in which case we can ofcourse do better since we have more freedom in designing codes. It turns out that we do not gainmuch by this relaxation. Indeed, if we define

ε∗(X, k) = infP [g(f(X)) 6= X] : f : X → 0, 1k, g : 0, 1k → X ∪ e,

then ε∗(X, k) = 1−sum of 2k largest masses of X. This follows immediately from P [g(f(X)) = X] =∑x∈S PX(x) where S , x : g(f(x)) = x satisfies |S| ≤ 2k, because f takes no more than 2k values.

Compared to Theorem 9.1, we see that ε∗(X, k) and ε∗(X, k) do not differ much. In particular,ε∗(X, k + 1) ≤ ε∗(X, k) ≤ ε∗(X, k).

Corollary 9.1 (Shannon). Let Sn be i.i.d. Then

limn→∞

ε∗(Sn, nR) =

0 R > H(S)1 R < H(S)

limn→∞

ε∗(Sn, nH(S) +√nV (S)γ) = 1− Φ(γ).

where Φ(·) is the CDF of N (0, 1), H(S) = E[log 1PS(S) ] is the entropy, V (S) = Var[log 1

PS(S) ] is thevarentropy which is assumed to be finite.

Proof. Combine Theorem 9.1 with Corollary 8.1.

Theorem 9.2 (Converse).

ε∗(X, k) ≥ ε∗(X, k) ≥ P[log2

1

PX(X)> k + τ

]− 2−τ , ∀τ > 0.

Proof. Identical to the converse of Theorem 8.4. Let S = x : g(f(x)) = x. Then |S| ≤ 2k and

P [X ∈ S] ≤ P[log2

1PX(X) ≤ k + τ

]+ P

[X ∈ S, log2

1

PX(X)> k + τ

]

︸ ︷︷ ︸≤2−τ

Two achievability bounds

Theorem 9.3.

ε∗(X, k) ≤ P[log2

1

PX(X)≥ k

]. (9.1)

100

Page 101: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Proof. Construction: use those 2k − 1 symbols with the highest probabilities.The analysis is essentially the same as the lower bound in Theorem 8.4 from Lecture 8. Note

that the mth largest mass PX(m) ≤ 1m . Therefore

ε∗(X, k) =∑

m≥2k

PX(m) =∑

1m≥2kPX(m) ≤∑

11

PX (m)≥2k

PX(m) = E1log2

1PX (X)

≥k.

Theorem 9.4.

ε∗(X, k) ≤ P[log2

1

PX(X)> k − τ

]+ 2−τ , ∀τ > 0. (9.2)

Note: In fact, Theorem 9.3 is always stronger than Theorem 9.4. Still, we present the proof ofTheorem 9.4 and the technology behind it – random coding – a powerful technique for provingexistence (achievability) which we heavily rely on in this course. To see that Theorem 9.3 givesa better bound, note that even the first term in (9.2) exceeds (9.1). Nevertheless, the method ofproof for this weaker bound will be useful for generalizations.

Proof. Construction: random coding (Shannon’s magic). For a given compressor f , the optimaldecompressor which minimizes the error probability is the maximum a posteriori (MAP) decoder,i.e.,

g∗(w) = argmaxx

PX|f(X)(x|w) = argmaxx:f(x)=w

PX(x),

which can be hard to analyze. Instead, let us consider the following (suboptimal) decompressor g:

g(w) =

x, ∃! x ∈ X s.t. f(x) = w and log21

PX(x) ≤ k − τ,(exists unique high-probability x that is mapped to w)

e, o.w.

Note that log21

PX(x) ≤ k − τ ⇐⇒ PX(x) ≥ 2−(k−τ). We call those x “high-probability”.

Denote f(x) = cx and the codebook C = cx : x ∈ X ⊂ 0, 1k. It is instructive to think of Cas a hashing table.

Error probability analysis: There are two ways to make an error ⇒ apply union bound. Beforeproceeding, define

J(x, C) ,x′ ∈ X : cx′ = cx, x

′ 6= x, log2

1

PX(x′)≤ k − τ

to be the set of high-probability inputs whose hashes collide with that of x. Then we have thefollowing estimate for probability of error:

P [g(f(X)) = e] = P[

log2

1

PX(X)> k − τ

∪ J(X, C) 6= ∅

]

≤ P[log2

1

PX(X)> k − τ

]+ P [J(X, C) 6= φ]

The first term does not depend on the codebook C, while the second term does. The idea nowis to randomize over C and show that when we average over all possible choices of codebook, thesecond term is smaller than 2−τ . Therefore there exists at least one codebook that achieves thedesired bound. Specifically, let us consider C which is uniformly distributed over all codebooks and

101

Page 102: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

independently of X. Equivalently, since C can be represented by an |X | × k binary matrix, whoserows correspond to codewords, we choose each entry to be independent fair coin flips.

Averaging the error probability (over C and over X), we have

EC [P [J(X, C) 6= φ]] = EC,X[1∃x′ 6=X:log2

1PX (x′)≤k−τ,cx′=cX

]

≤ EC,X

x′ 6=X1

log21

PX (x′)≤k−τ1cx′=cX

(union bound)

= 2−kEX

x′ 6=X1PX(x′)≥2−k+τ

≤ 2−k∑

x′∈X1PX(x′)≥2−k+τ

≤ 2−k2k−τ = 2−τ .

Remark 9.1 (Why the proof works). Compressor f(x) = cx can be thought as hashing x ∈ X to arandom k-bit string cx ∈ 0, 1k.

Here: high-probability x ⇔ log21

PX(x) ≤ k − τ ⇔ PX(x) ≥ 2−k+τ . Therefore the cardinality of

high-probability x’s is at most 2k−τ 2k = number of strings. Hence the chance of collision issmall.

Remark 9.2. The random coding argument is a canonical example of probabilistic method : Toprove the existence of an object with certain property, we construct a probability distribution(randomize) and show that on average the property is satisfied. Hence there exists at least onerealization with the desired property. The downside of this argument is that it is not constructive,i.e., does not give us an algorithm to find the object.

Remark 9.3. This is a subtle point: Notice that in the proof we choose the random codebookto be uniform over all possible codebooks. In other words, C = cx : x ∈ X consists of iid k-bitstrings. In fact, in the proof we only need pairwise independence, i.e., cx ⊥⊥ cx′ for any x 6= x′

(Why?). Now, why should we care about this? In fact, having access to external randomness isalso a lot of resources. It is more desirable to use less randomness in the random coding argument.Indeed, if we use zero randomness, then it is a deterministic construction, which is the best situation!Using pairwise independent codebook requires significantly less randomness than complete randomcoding which needs |X |k bits. To see this intuitively, note that one can use 2 independent randombits to generate 3 random bits that is pairwise independent but not mutually independent, e.g.,b1, b2, b1 ⊕ b2. This observation is related to linear compression studied in the next section, wherethe codeword we generated are not iid, but related through a linear mapping.

102

Page 103: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Remark 9.4 (AEP for memoryless sources). Consider iid Sn. By WLLN,

1

nlog

1

PSn(Sn)

P−→H(S). (9.3)

For any δ > 0, define the set

T δn =

sn :

∣∣∣∣1

nlog

1

PSn(sn)−H(S)

∣∣∣∣ ≤ δ.

For example: Sni.i.d.∼ Bern(p), since PSn(sn) = pw(sn)qn−w(sn), the typical set corresponds to those

sequences whose Hamming is close to the expectation: T δn = sn ∈ 0, 1n : w(sn) ∈ [p± δ′]n,where δ′ is a constant depending on δ.

As a consequence of (9.3),

1. P[Sn ∈ T δn

]→ 1 as n→∞.

2. |T δn | ≤ 2(H(S)+δ)n |S|n.

In other words, Sn is concentrated on the set T δn which is exponentially smaller than the wholespace. In almost loss compression we can simply encode this set losslessly. Although this is differentthan the optimal encoding, Corollary 9.1 indicates that in the large-n limit the optimal compressoris no better.

The property (9.3) is often referred as the Asymptotic Equipartition Property (AEP), in thesense that the random vector is concentrated on a set wherein each reliazation is roughly equallylikely up to the exponent. Indeed, Note that for any sn ∈ T δn , its likelihood is concentrated aroundPSn(sn) ∈ 2−(H(S)±δ)n, called δ-typical sequences.

103

Page 104: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Next we still consider fixed-blocklength code and study the fundamental limit of error probabilityε∗(X, k) for the following coding paradigms:

• Linear Compression

• Compression with Side Information

– side info available at both sides

– side info available only at decompressor

– multi-terminal compressor, single decompressor

9.2 Linear Compression

From Shannon’s theorem:

ε∗(X,nR) −→ 0 or 1 R ≶ H(S)

Our goal is to find compressor with structures. The simplest one can think of is probably linearoperation, which is also highly desired for its simplicity (low complexity). But of course, we have tobe on a vector space where we can define linear operations. In this part, we assume X = Sn, whereeach coordinate takes values in a finite field (Galois Field), i.e., Si ∈ Fq, where q is the cardinalityof Fq. This is only possible if q = pn for some prime p and n ∈ N. So Fq = Fpn .

Definition 9.2 (Galois Field). F is a finite set with operations (+, ·) where

• a+ b associative and commutative

• a · b associative and commutative

• 0, 1 ∈ F s.t. 0 + a = 1 · a = a.

• ∀a,∃ − a, s.t. a+ (−a) = 0

• ∀a 6= 0,∃a−1, s.t. a−1a = 1

• distributive: a · (b+ c) = (a · b) + (a · c)

Example:

• Fp = Z/pZ, where p is prime

• F4 = 0, 1, x, x+ 1 with addition and multiplication as polynomials mod (x2 + x+ 1) overF2[x].

Linear Compression Problem: x ∈ Fnq , w = Hx where H : Fnq → Fkq is linear represented by a

matrix H ∈ Fk×nq .

w1...wk

=

h11 . . . h1n

......

hk1 . . . hkn

x1...xn

Compression is achieved if k ≤ n, i.e., H is a fat matrix. Of course, we have to tolerate some error(almost lossless). Otherwise, lossless compression is only possible with k ≥ n, which not interesting.

104

Page 105: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Theorem 9.5 (Achievability). Let X ∈ Fnq be a random vector. ∀τ > 0, ∃ linear compressor

H : Fnq → Fkq and decompressor g : Fkq → Fnq ∪ e, s.t.

P [g(HX) 6= X] ≤ P[logq

1

PX(X)> k − τ

]+ q−τ

Remark 9.5. Consider the Hamming space q = 2. In comparison with Shannon’s random codingachievability, which uses k2n bits to construct a completely random codebook, here for linear codeswe need kn bits to randomly generate the matrix H, and the codebook is a k-dimensional linearsubspace of 6the Hamming space.

Proof. Fix τ . As pointed in the proof of Shannon’s random coding theorem (Theorem 9.4), giventhe compressor H, the optimal decompressor is the MAP decoder, i.e., g(w) = argmaxx:Hx=w PX(x),which outputs the most likely symbol that is compatible with the codeword received. Instead, let usconsider the following (suboptimal) decoder for its ease of analysis:

g(w) =

x ∃!x ∈ Fnq : w = Hx, x− h.p.e otherwise

where we used the short-hand:

x− h.p. (high probability) ⇔ logq1

PX(x)< k − τ ⇔ PX(x) ≥ q−k+τ .

Note that this decoder is the same as in the proof of Theorem 9.4. The proof is also mostly thesame, except now hash collisions occur under the linear map H. By union bound,

P [g(f(X)) = e] ≤ P[logq

1

PX(x)> k − τ

]+ P

[∃x′ − h.p. : x′ 6= X,Hx′ = HX

]

(union bound) ≤ P[logq

1

PX(x)> k − τ

]+∑

x

PX(x)∑

x′−h.p.,x′ 6=x1Hx′ = Hx

Now we use random coding to average the second term over all possible choices of H. Specifically,choose H as a matrix independent of X where each entry is iid and uniform on Fq. For distinct x0

and x1, the collision probability is

PH [Hx1 = Hx0] = PH [Hx2 = 0] (x2 , x1 − x0 6= 0)

= PH [H1 · x2 = 0]k (iid rows)

where H1 is the first row of the matrix H, and each row of H is independent. This is the probabilitythat Hi is in the orthogonal complement of x2. On Fnq , the orthogonal complement of a givennon-zero vector has cardinality qn−1. So the probability for the first row to lie in this subspace isqn−1/qn = 1/q, hence the collision probability 1/qk. Averaging over H gives

EH∑

x′−h.p.,x′ 6=x1Hx′ = Hx =

x′−h.p.,x′ 6=xPH [Hx′ = Hx] = |x′ : x′ − h.p., x′ 6= x|q−k ≤ qk−τq−k = q−τ

Thus the bound holds.

Remark 9.6. 1. Compared to Theorem 9.4, which is obtained by randomizing over all possiblecompressors, Theorem 9.5 is obtained by randomizing over only linear compressors, and thebound we obtained is identical. Therefore restricting on linear compression almost does notlose anything.

105

Page 106: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

2. Note that in this case it is not possible to make all errors detectable.

3. Can we loosen the requirement on Fq to instead be a commutative ring? In general, no, sincezero divisors in the commutative ring ruin the key proof ingredient of low collision probabilityin the random hashing. E.g. in Z/6Z

P

H

10...0

= 0

= 6−k but P

H

20...0

= 0

= 3−k,

since 0 · 2 = 3 · 2 = 0 in Z/6Z.

9.3 Compression with side information at both compressor anddecompressor

Compressor DecompressorX X ∪ e0, 1k

Y

Definition 9.3 (Compression wih Side Information). Given PXY ,

• f : X × Y → 0, 1k

• g : 0, 1k × Y → X ∪ e

• P[g(f(X,Y ), Y ) 6= X] < ε

• Fundamental Limit: ε∗(X|Y, k) = infε : ∃(k, ε)− S.I. code

Note: The side information Y need not be discrete. The source X is, of course, discrete.Note that conditioned on Y = y, the problem reduces to compression without side information

where the source X is distributed according to PX|Y=y. Since Y is known to both the compressorand decompressor, they can use the best code tailored for this distribution. Recall ε∗(X, k) definedin Definition 9.1, the optimal probability of error for compressing X using k bits, which can also bedenoted by ε∗(PX , k). Then we have the following relationship

ε∗(X|Y, k) = Ey∼PY [ε∗(PX|Y=y, k)],

which allows us to apply various bounds developed before.

Theorem 9.6.

P[log

1

PX|Y (X|Y )> k + τ

]− 2−τ ≤ ε∗(X|Y, k) ≤ P

[log2

1

PX|Y (X|Y )> k − τ

]+ 2−τ , ∀τ > 0

106

Page 107: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Corollary 9.2. (X,Y ) = (Sn, Tn) where the pairs (Si, Ti)i.i.d.∼ PST

limn→∞

ε∗(Sn|Tn, nR) =

0 R > H(S|T )

1 R < H(S|T )

Proof. Using the converse Theorem 9.2 and achievability Theorem 9.4 (or Theorem 9.3) for com-pression without side information, we have

P[log

1

PX|Y (X|y)> k + τ

∣∣∣Y = y

]− 2−τ ≤ ε∗(PX|Y=y, k) ≤ P

[log

1

PX|Y (X|y)> k

∣∣∣Y = y

]

By taking the average over all y ∼ PY , we get the theorem. For the corollary

1

nlog

1

PSn|Tn(Sn|Tn)=

1

n

n∑

i=1

log1

PS|T (Si|Ti)P−→H(S|T )

as n→∞, using the WLLN.

9.4 Slepian-Wolf (Compression with side information atdecompressor only)

Consider the compression with side information problem, except now the compressor has no accessto the side information.

Compressor DecompressorX X ∪ e0, 1k

Y

Definition 9.4 (S.W. code). Given PXY ,

• f : X → 0, 1k

• g : 0, 1k × Y → X ∪ e

• P[g(f(X), Y ) 6= X] ≤ ε

• Fundamental Limit: ε∗SW(X|Y, k) = infε : ∃(k, ε)-S.W. code

Now the very surprising result: Even without side information at the compressor, we can stillcompress down to the conditional entropy!

Theorem 9.7 (Slepian-Wolf, ’73).

ε∗(X|Y, k) ≤ ε∗SW(X|Y, k) ≤ P[log

1

PX|Y (X|Y )≥ k − τ

]+ 2−τ

107

Page 108: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Corollary 9.3.

limn→∞

ε∗SW(Sn|Tn, nR) =

0 R > H(S|T )

1 R < H(S|T )

Remark 9.7. Definition 9.4 does not include the zero-undected-error condition (that is g(f(x), y) =x or e). In other words, we allow for the possibility of undetected errors. Indeed, if we require thiscondition, the side-information savings will be mostly gone. Indeed, assuming PX,Y (x, y) > 0 forall (x, y) it is clear that under zero-undetected-error condition, if f(x1) = f(x2) = c then g(c) = e.Thus except for c all other elements in 0, 1k must have unique preimages. Similarly, one canshow that Slepian-Wolf theorem does not hold in the setting of variable-length lossless compression(i.e. average length is H(X) not H(X|Y ).)

Proof. LHS is obvious, since side information at the compressor and decompressor is better thanonly at the decompressor.

For the RHS, first generate a random codebook with iid uniform codewords: C = cx ∈ 0, 1k :x ∈ X independently of (X,Y ), then define the compressor and decoder as

f(x) = cx

g(w, y) =

x ∃!x : cx = w, x− h.p.|y0 o.w.

where we used the shorthand x − h.p.|y ⇔ log21

PX|Y (x|y) < k − τ . The error probability of this

scheme, as a function of the code book C, is

E(C) = P[log

1

PX|Y (X|Y )≥ k − τ or J(X,C|Y ) 6= ∅

]

≤ P[log

1

PX|Y (X|Y )≥ k − τ

]+ P [J(X,C|Y ) 6= ∅]

= P[log

1

PX|Y (X|Y )≥ k − τ

]+∑

x,y

PXY (x, y)1J(x,C|y)6=∅.

where J(x,C|y) , x′ 6= x : x′ − h.p.|y, cx = cx′.Now averaging over C and applying the union bound: use |x′ : x′ − h.p.|y| ≤ 2k−τ and

P[cx′ = cx] = 2−k for any x 6= x′,

PC [J(x,C|y) 6= ∅] ≤ EC

x′ 6=x1x′−h.p.|y1cx′=cx

= 2k−τP[cx′ = cx]

= 2−τ

Hence the theorem follows as usual from two terms in the union bound.

9.5 Multi-terminal Slepian Wolf

Distributed compression: Two sources are correlated. Compress individually, decompress jointly.What are those rate pairs that guarantee successful reconstruction?

108

Page 109: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Dec

om

pre

ssorg

Compressor f1

Compressor f2

0, 1k1

0, 1k2

X

Y

(X, Y )

Definition 9.5. Given PXY ,

• (f1, f2, g) is (k1, k2, ε)-code if f1 : X → 0, 1k1 , f2 : Y → 0, 1k2 , g : 0, 1k1 × 0, 1k2 →X ×Y, s.t. P[(X, Y ) 6= (X,Y )] ≤ ε, where (X, Y ) = g(f1(X), f2(Y )).

• Fundamental limit: ε∗SW(X,Y, k1, k2) = infε : ∃(k1, k2, ε)-code.

Theorem 9.8. (X,Y ) = (Sn, Tn) - iid pairs

limn→∞

ε∗SW(Sn, Tn, nR1, nR2) =

0 (R1, R2) ∈ int(RSW)

1 (R1, R2) 6∈ RSW

where RSW denotes the Slepian-Wolf rate region

RSW =

a ≥ H(S|T )

(a, b) : b ≥ H(T |S)

a+ b ≥ H(S, T )

Note: The rate region RSW typically looks like:

R1

R2

H(S)H(S|T )

H(T |S)

H(T )Achievable

Region

Since H(T )−H(T |S) = H(S)−H(S|T ) = I(S;T ), the slope is −1.

Proof. Converse: Take (R1, R2) 6∈ RSW. Then one of three cases must occur:

1. R1 < H(S|T ). Then even if encoder and decoder had full Tn, still can’t achieve this (fromcompression with side info result – Corollary 9.2).

2. R2 < H(T |S) (same).

3. R1 +R2 < H(S, T ). Can’t compress below the joint entropy of the pair (S, T ).

109

Page 110: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Achievability: First note that we can achieve the two corner points. The point (H(S), H(T |S))can be approached by almost lossless compressing S at entropy and compressing T with side infor-mation S at the decoder. To make this rigorous, let k1 = n(H(S) + δ) and k2 = n(H(T |S) + δ). ByCorollary 9.1, there exist f1 : Sn → 0, 1k1 and g1 : 0, 1k1 → Sn s.t. P [g1(f1(Sn)) 6= Sn] ≤εn → 0. By Theorem 9.7, there exist f2 : T n → 0, 1k2 and g2 : 0, 1k1 × Sn → T ns.t. P [g2(f2(Tn), Sn) 6= Tn] ≤ εn → 0. Now that Sn is not available, feed the S.W. decompressorwith g(f(Sn)) and define the joint decompressor by g(w1, w2) = (g1(w1), g2(w2, g1(w1))) (see below):

f1 g1

f2 g2

Sn Sn

Tn Tn

Apply union bound:

P [g(f1(Sn), f2(Tn)) 6= (Sn, Tn)]

= P [g1(f1(Sn)) 6= Sn] + P [g2(f2(Tn), g(f1(Sn))) 6= Tn, g1(f1(Sn)) = Sn]

≤ P [g1(f1(Sn)) 6= Sn] + P [g2(f2(Tn), Sn) 6= Tn]

≤ 2εn → 0.

Similarly, the point (H(S), H(T |S)) can be approached.To achieve other points in the region, use the idea of time sharing: If you can achieve with

vanishing error probability any two points (R1, R2) and (R′1, R′2), then you can achieve for λ ∈ [0, 1],

(λR1 + λR′1, λR2 + λR′2) by dividing the block of length n into two blocks of length λn and λn andapply the two codes respectively

(Sλn1 , T λn1 )→[λnR1

λnR2

]using (R1, R2) code

(Snλn+1, Tnλn+1)→

[λnR′1λnR′2

]using (R′1, R

′2) code

(Exercise: Write down the details rigorously yourself!) Therefore, all convex combinations of pointsin the achievable regions are also achievable, so the achievable region must be convex.

9.6* Source-coding with a helper (Ahlswede-Korner-Wyner)

Yet another variation of distributed compression problem is compressing X with a helper, seefigure below. Note that the main difference from the previous section is that decompressor is onlyrequired to produce the estimate of X, using rate-limited help from an observer who has access toY . Characterization of rate pairs R1, R2 is harder than in the previous section.

Theorem 9.9 (Ahlswede-Korner-Wyner). Consider i.i.d. source (Xn, Y n) ∼ PX,Y with X discrete.

If rate pair (R1, R2) is achievable with vanishing probability of error P[Xn 6= Xn]→ 0, then thereexists an auxiliary random variable U taking values on alphabet of cardinality |Y| + 1 such thatPX,Y,U = PX,Y PU |X,Y and

R1 ≥ H(X|U), R2 ≥ I(Y ;U) . (9.4)

110

Page 111: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Dec

om

pre

ssorg

Compressor f1

Compressor f2

0, 1k1

0, 1k2

X

Y

X

Furthermore, for every such random variable U the rate pair (H(X|U), I(Y ;U)) is achievable withvanishing error.

Proof. We only sketch some crucial details.First, note that iterating over all possible random variables U (without cardinality constraint)

the set of pairs (R1, R2) satisfying (9.4) is convex. Next, consider a compressor W1 = f1(Xn) andW2 = f2(Y n). Then from Fano’s inequality (5.7) assuming P[Xn 6= Xn] = o(1) we have

H(Xn|W1,W2)) = o(n) .

Thus, from chain rule and conditioning-decreases-entropy, we get

nR1 ≥ I(Xn;W1|W2) ≥ H(Xn|W2)− o(n) (9.5)

=n∑

k=1

H(Xk|W2, Xk−1)− o(n) (9.6)

≥n∑

k=1

H(Xk|W2, Xk−1, Y k−1

︸ ︷︷ ︸,Uk

)− o(n) (9.7)

On the other hand, from (5.2) we have

nR2 ≥ I(W2;Y n) =

n∑

k=1

I(W2;Yk|Y k−1) (9.8)

=

n∑

k=1

I(W2, Xk−1;Yk|Y k−1) (9.9)

=

n∑

k=1

I(W2, Xk−1, Y k−1;Yk) (9.10)

where (9.9) follows from I(W2, Xk−1;Yk|Y k−1) = I(W2;Yk|Y k−1) + I(Xk−1;Yk|W2, Y

k−1) and thefact that (W2, Yk) ⊥⊥ Xk−1|Y k−1; and (9.10) from Y k−1 ⊥⊥ Yk. Comparing (9.7) and (9.10) wenotice that denoting Uk = (W2, X

k−1, Y k−1) we have

(R1, R2) ≥ 1

n

n∑

k=1

(H(Xk|Uk), I(Uk;Yk))

and thus (from convexity) the rate pair must belong to the region spanned by all pairs (H(X|U), I(U ;Y )).

111

Page 112: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

To show that without loss of generality the auxiliary random variable U can be taken to be|Y|+ 1 valued, one needs to invoke Caratheodory’s theorem on convex hulls. We omit the details.

Finally, showing that for each U the mentioned rate-pair is achievable, we first notice that if therewere side information at the decompressor in the form of the i.i.d. sequence Un correlated to Xn,then Slepian-Wolf theorem implies that only rate R1 = H(X|U) would be sufficient to reconstructXn. Thus, the question boils down to creating a correlated sequence Un at the decompressor byusing the minimal rate R2. This is the content of the so called covering lemma, see Theorem 26.5below: It is sufficient to use rate I(U ;Y ) to do so. We omit further details.

112

Page 113: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

§ 10. Compressing stationary ergodic sources

We have studyig the compression of i.i.d. sequence Si, for which

1

nl(f∗(Sn))

P−→H(S) (10.1)

limn→∞

ε∗(Sn, nR) =

0 R > H(S)1 R < H(S)

(10.2)

In this lecture, we shall examine similar results for ergodic processes and we first state the maintheory as follows:

Theorem 10.1 (Shannon-McMillan). Let S1, S2, . . . be a stationary and ergodic discrete process,then

1

nlog

1

PSn(Sn)

P−→H, also a.s. and in L1 (10.3)

where H = limn→∞1nH(Sn) is the entropy rate.

Corollary 10.1. For any stationary and ergodic discrete process S1, S2, . . . , (10.1) – (10.2) holdwith H(S) replaced by H.

Proof. Shannon-McMillan (we only need convergence in probability) + Theorem 8.4 + Theorem 9.1which tie together the respective CDF of the random variable l(f∗(Sn)) and log 1

PSn (sn) .

In Lecture 9 we learned the asymptotic equipartition property (AEP) for iid sources. Here wegeneralize it to stationary ergodic sources thanks to Shannon-McMillan.

Corollary 10.2 (AEP for stationary ergodic sources). Let S1, S2, . . . be a stationary and ergodicdiscrete process. For any δ > 0, define the set

T δn =

sn :

∣∣∣∣1

nlog

1

PSn(sn)−H

∣∣∣∣ ≤ δ.

Then

1. P[Sn ∈ T δn

]→ 1 as n→∞.

2. 2n(H−δ)(1 + o(1)) ≤ |T δn | ≤ 2(H+δ)n(1 + o(1)).

Note:

• Convergence in probability for stationary ergodic Markov chains [Shannon 1948]

• Convergence in L1 for stationary ergodic processes [McMillan 1953]

113

Page 114: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

• Convergence almost surely for stationary ergodic processes [Breiman 1956] (Either of the lasttwo results implies the convergence Theorem 10.1 in probability.)

• For a Markov chain, existence of typical sequences can be understood by thinking of Markovprocess as sequence of independent decisions regarding which transitions to take. It is thenclear that Markov process’s trajectory is simply a transformation of trajectories of an i.i.d.process, hence must similarly concentrate similarly on some typical set.

10.1 Bits of ergodic theory

Let’s start with a dynamic system view and introduce a few definitions:

Definition 10.1 (Measure preserving transformation). τ : Ω → Ω is measure preserving (moreprecisely, probability preserving) if

∀E ∈ F , P (E) = P (τ−1E).

The set E is called τ -invariant if E = τ−1E. The set of all τ -invariant sets forms a σ-algrebra(check!) denoted Finv.Definition 10.2 (stationary process). A process Sn, n = 0, . . . is stationary if there exists ameasure preserving transformation τ : Ω→ Ω such that:

Sj = Sj−1 τ = S0 τ j

Therefore a stationary process can be described by the tuple (Ω,F ,P, τ, S0) and Sk = S0 τk.Notes:

1. Alternatively, a random process (S0, S1, S2, . . . ) is stationary if its joint distribution is invariantwith respect to shifts in time, i.e., PSmn = PSm+t

n+t, ∀n,m, t. Indeed, given such a process we

can define a m.p.t. as follows:

(s0, s1, . . . )τ−→ (s1, s2, . . . ) (10.4)

So τ is a shift to the right.

2. An event E ∈ F is shift-invariant if

(s1, s2, . . . ) ∈ E ⇒ (s0, s1, s2, . . . ) ∈ E, s0

or equivalently E = τ−1E (check!). Thus τ -invariant events are also called shift-invariant,when τ is interpreted as (10.4).

3. Some examples of shift-invariant events are ∃n : xi = 0∀i ≥ n, lim supxi < 1 etc. A nonshift-invariant event is A = x0 = x1 = · · · = 0, since τ(1, 0, 0, . . .) ∈ A but (1, 0, . . .) 6∈ A.

4. Also recall that the tail σ-algebra is defined as

Ftail ,⋂

n≥1

σSn, Sn+1, . . . .

It is easy to check that all shift-invariant events belong to Ftail. The inclusion is strict, as forexample the event

∃n : xi = 0,∀ odd i ≥ nis in Ftail but not shift-invariant.

114

Page 115: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Proposition 10.1 (Poincare recurrence). Let τ be measure-preserving for (Ω,F ,P). Then for anymeasurable A with P[A] > 0 we have

P[⋃

k≥1

τ−kA|A] = P[τk(ω) ∈ A occurs infinitely often|A] = 1 .

Proof. Let B =⋃k≥1 τ

−kA. It is sufficient to show that P[A ∩B] = P[A] or equivalently

P[A ∪B] = P[B] . (10.5)

To that end notice that τ−1A ∪ τ−1B = B and thus

P[τ−1(A ∪B)] = P[B] ,

but the left-hand side equals P[A ∪B] by the measure-preservation of τ , proving (10.5).

Note: Consider τ mapping initial state of the conservative (Hamiltonian) mechanical system to itsstate after passage of a given unit of time. It is known that τ preserves Lebesgue measure in phasespace (Liouville’s theorem). Thus Poincare recurrence leads to rather counter-intuitive conclusions.For example, opening the barrier separating two gases in a cylinder allows them to mix. Poincarerecurrence says that eventually they will return back to the original separated state (with each gasoccupying roughly its half of the cylinder).

Definition 10.3 (Ergodicity). A transformation τ is ergodic if ∀E ∈ Finv we have P[E] = 0 or 1.A process Si is ergodic if all shift invariant events are deterministic, i.e., for any shift invariantevent E, P [S∞1 ∈ E] = 0 or 1.

Example:

• Sk = k2: ergodic but not stationary

• Sk = S0: stationary but not ergodic (unless S0 is a constant). Note that the singleton setE = (s, s, . . .) is shift invariant and P [S∞1 ∈ E] = P [S0 = s] ∈ (0, 1) – not deterministic.

• Sk i.i.d. is stationary and ergodic (by Kolmogorov’s 0-1 law, tail events have no randomness)

• (Sliding-window construction of ergodic processes)If Si is ergodic, then Xi = f(Si, Si+1, . . . ) is also ergodic. It is called a B-process if Siis i.i.d.Example, Si ∼ Bern(1

2) i.i.d., Xk =∑∞

n=0 2−n−1Sk+n = 2Xk−1 mod 1. The marginaldistribution of Xi is uniform on [0, 1]. Note that Xk’s behavior is completely deterministic:given X0, all the future Xk’s are determined exactly. This example shows that certaindeterministic maps exhibit ergodic/chaotic behavior under iterative application: althoughthe trajectory is completely deterministic, its time-averages converge to expectations and ingeneral “look random”.

• There are also stronger conditions than ergodicity. Namely, we say that τ is mixing (or strongmixing) if

P[A ∩ τ−nB]→ P[A]P[B] .

We say that τ is weakly mixing if

n∑

k=1

1

n

∣∣P[A ∩ τ−nB]− P[A]P[B]∣∣→ 0 .

Strong mixing implies weak mixing, which implies ergodicity (check!).

115

Page 116: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

• Si: finite irreducible Markov chain with recurrent states is ergodic (in fact strong mixing),regardless of initial distribution.Toy example: kernel P (0|1) = P (1|0) = 1 with initial dist. P (S0 = 0) = 0.5. This process onlyhas two sample paths: P [S∞1 = (010101 . . .)] = P [S∞1 = (101010 . . .)] = 1

2 . It is easy to verifythis process is ergodic (in the sense defined above!). Note however, that in Markov-chainliterature a chain is called ergodic if it is irreducible, aperiodic and recurrent. This exampledoes not satisfy this definition (this clash of terminology is a frequent source of confusion).

• (optional) Si: stationary zero-mean Gaussian process with autocovariance function R(n) =E[S0S

∗n].

limn→∞

1

n+ 1

n∑

t=0

R[t] = 0⇔ Si ergodic⇔ Si weakly mixing

limn→∞

R[n] = 0⇔ Si mixing

Intuitively speaking, an ergodic process can have infinite memory in general, but the memoryis weak. Indeed, we see that for a stationary Gaussian process ergodicity means the correlationdies (in the Cesaro-mean sense).

The spectral measure is defined as the (discrete time) Fourier transform of the autocovariancesequence R(n), in the sense that there exists a unique probability measure µ on [−1

2 ,12 ] such

that R(n) = E exp(i2nπX) where X ∼ µ. The spectral criteria can be formulated as follows:

Si ergodic⇔ spectral measure has no atoms (CDF is continuous)

Si B-process⇔ spectral measure has density

Detailed exposition on stationary Gaussian processes can be found in [Doo53, Theorem 9.3.2,pp. 474, Theorem 9.7.1, pp. 493–494].1

10.2 Proof of Shannon-McMillan

We shall show the convergence in L1, which implies convergence in probability automatically. Inorder to prove Shannon-McMillan, let’s first introduce the Birkhoff-Khintchine’s convergence theoremfor ergodic processes, the proof of which is presented in the next subsection.

Theorem 10.2 (Birkhoff-Khintchine’s Ergodic Theorem). If Si stationary and ergodic, ∀ functionf ∈ L1, i.e., E |f(S1, . . . )| <∞,

limn→∞

1

n

n∑

k=1

f(Sk, . . . ) = E f(S1, . . . ). a.s. and in L1

In the special case where f depends on finitely many coordinates, say, f = f(S1, . . . , Sm), we have

limn→∞

1

n

n∑

k=1

f(Sk, . . . , Sk+m−1) = E f(S1, . . . , Sm). a.s. and in L1

1Thanks Prof. Bruce Hajek for the pointer.

116

Page 117: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Interpretation: time average converges to ensemble average.Example: Consider f = f(S1)

• Si is iid. Then Theorem 10.2 is SLLN (strong LLN).

• Si is such that Si = S1 for all i – non-ergodic. Then Theorem 10.2 fails unless S1 is aconstant.

Definition 10.4. Si : i ∈ N is an mth order Markov chain if PSt+1|St1 = PSt+1|Stt−m+1for all t ≥ m.

It is called time homogeneous if PSt+1|Stt−m+1= PSm+1|Sm1 .

Remark 10.1. Showing (10.3) for an mth order time homogeneous Markov chain Si is a directapplication of Birkhoff-Khintchine.

1

nlog

1

PSn(Sn)=

1

n

n∑

t=1

log1

PSt|St−1(St|St−1)

=1

nlog

1

PSm(Sm)+

1

n

n∑

t=m+1

log1

PSt|St−1t−m

(Sl|Sl−1l−m)

=1

nlog

1

PS1(Sm1 )︸ ︷︷ ︸→0

+1

n

n∑

t=m+1

log1

PSm+1|Sm1 (St|St−1t−m)

︸ ︷︷ ︸→H(Sm+1|Sm1 ) by Birkhoff-Khintchine

, (10.6)

where we applied Theorem 10.2 with f(s1, s2, . . .) = log 1PSm+1|Sm1

(sm+1|sm1 ) .

Now let’s prove (10.3) for a general stationary ergodic process Si which might have infinitememory. The idea is to approximate the distribution of that ergodic process by an m-th orderMC (finite memory) and make use of (10.6); then let m→∞ to make the approximation accurate(Markov approximation).

Proof of Theorem 10.1 in L1. To show that (10.3) converges in L1, we want to show that

E∣∣∣ 1n

log1

PSn(Sn)−H

∣∣∣→ 0, n→∞.

To this end, fix an m ∈ N. Define the following auxiliary distribution for the process:

Q(m)(S∞1 ) = PSm1 (Sm1 )

∞∏

t=m+1

PSt|St−1t−m

(St|St−1t−m)

stat.= PSm1 (Sm1 )

∞∏

t=m+1

PSm+1|Sm1 (St|St−1t−m)

Note that under Q(m), Si is an mth-order time-homogeneous Markov chain.By triangle inequality,

E∣∣∣ 1n

log1

PSn(Sn)−H

∣∣∣ ≤E∣∣∣ 1n

log1

PSn(Sn)− 1

nlog

1

Q(m)Sn (Sn)

∣∣∣︸ ︷︷ ︸

,A

+ E∣∣∣ 1n

log1

Q(m)Sn (Sn)

−Hm

∣∣∣︸ ︷︷ ︸

,B

+ |Hm −H|︸ ︷︷ ︸,C

117

Page 118: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

where Hm , H(Sm+1|Sm1 ).Now

• C = |Hm − H| → 0 as m → ∞ by Theorem 5.4 (Recall that for stationary processes:H(Sm+1|Sm1 )→ H from above).

• As shown in Remark 10.1, for any fixed m, B → 0 in L1 as n → ∞, as a consequence ofBirkhoff-Khintchine. Hence for any fixed m, EB → 0 as n→∞.

• For term A,

E[A] =1

nEP∣∣∣ log

dPSn

dQ(m)Sn

∣∣∣ ≤ 1

nD(PSn‖Q(m)

Sn ) +2 log e

en

where

1

nD(PSn‖Q(m)

Sn ) =1

nE

[log

PSn(Sn)

PSm(Sm)∏nt=m+1 PSm+1|S1

m(St|St−1t−m)

]

stat.=

1

n(−H(Sn) +H(Sm) + (n−m)Hm)

→ Hm −H as n→∞

and the next Lemma 10.1.

Combining all three terms and sending n→∞, we obtain for any m,

lim supn→∞

E∣∣∣ 1n

log1

PSn(Sn)−H

∣∣∣ ≤ 2(Hm −H).

Sending m→∞ completes the proof of L1-convergence.

Lemma 10.1.

EP[∣∣∣∣log

dP

dQ

∣∣∣∣]≤ D(P‖Q) +

2 log e

e.

Proof. |x log x| − x log x ≤ 2 log ee , ∀x > 0, since LHS is zero if x ≥ 1, and otherwise upper bounded

by 2 sup0≤x≤1 x log 1x = 2 log e

e .

10.3* Proof of Birkhoff-Khintchine

Proof of Theorem 10.2. ∀ function f ∈ L1, ∀ε, there exists a decomposition f = f + h such that fis bounded, and h ∈ L1, ‖h‖1 ≤ ε.Let us first focus on the bounded function f . Note that in the bounded domain L1 ⊂ L2, thusf ∈ L2. Furthermore, L2 is a Hilbert space with inner product (f, g) = E[f(S∞1 )g(S∞1 )].

For the measure preserving transformation τ that generates the stationary process Si, definethe operator T (f) = f τ . Since τ is measure preserving, we know that ‖Tf‖22 = ‖f‖22, thus T is aunitary and bounded operator.

Define the operator

An(f) =1

n

n∑

k=1

f τk

118

Page 119: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Intuitively:

An =1

n

n∑

k=1

T k =1

n(I − Tn)(I − T )−1

Then, if f ⊥ ker(I − T ) we should have Anf → 0, since only components in the kernel can blow up.This intuition is formalized in the proof below.

Let’s further decompose f into two parts f = f1 +f2, where f1 ∈ ker(I−T ) and f2 ∈ ker(I−T )⊥.Observations:

• if g ∈ ker(I − T ), g must be a constant function. This is due to the ergodicity. Considerindicator function 1A, if 1A = 1A τ = 1τ−1A, then P[A] = 0 or 1. For a general case, supposeg = Tg and g is not constant, then at least some set g ∈ (a, b) will be shift-invariant andhave non-trivial measure, violating ergodicity.

• ker(I − T ) = ker(I − T ∗). This is due to the fact that T is unitary:

g = Tg ⇒ ‖g‖2 = (Tg, g) = (g, T ∗g)⇒ (T ∗g, g) = ‖g‖‖T ∗g‖ ⇒ T ∗g = g

where in the last step we used the fact that Cauchy-Schwarz (f, g) ≤ ‖f‖ · ‖g‖ only holds withequality for g = cf for some constant c.

• ker(I − T )⊥ = ker(I − T ∗)⊥ = [Im(I − T )], where [Im(I − T )] is an L2 closure.

• g ∈ ker(I − T )⊥ ⇐⇒ E[g] = 0. Indeed, only zero-mean functions are orthogonal to constants.

With these observations, we know that f1 = m is a const. Also, f2 ∈ [Im(I − T )] so we furtherapproximate it by f2 = f0 + h1, where f0 ∈ Im(I − T ), namely f0 = g − g τ for some functiong ∈ L2, and ‖h1‖1 ≤ ‖h1‖2 < ε. Therefore we have

Anf1 = f1 = E[f ]

Anf0 =1

n(g − g τn)→ 0 a.s. and L1

since E[∑

n≥1(gτn

n )2] = E[g2]∑ 1

n2 <∞ =⇒ 1ng τn

a.s.−−→0.The proof is completed by showing

P[lim sup

nAn(h+ h1) ≥ δ

]≤ 2ε

δ. (10.7)

Indeed, then by taking ε→ 0 we will have shown

P[lim sup

nAn(f) ≥ E[f ] + δ

]= 0

as required.

Proof of (10.7) makes use of the Maximal Ergodic Lemma stated as follows:

Theorem 10.3 (Maximal Ergodic Lemma). Let (P, τ) be a probability measure and a measure-preserving transformation. Then for any f ∈ L1(P) we have

P[supn≥1

Anf > a

]≤

E[f1supn≥1 Anf>a]a

≤ ‖f‖1a

where Anf = 1n

∑n−1k=0 f τk.

119

Page 120: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Note: This is a so-called “weak L1” estimate for a sublinear operator supnAn(·). In fact, thistheorem is exactly equivalent to the following result:

Lemma 10.2 (Estimate for the maximum of averages). Let Zn, n = 1, . . . be a stationary processwith E[|Z|] <∞ then

P[supn≥1

|Z1 + . . .+ Zn|n

> a

]≤ E[|Z|]

a∀a > 0

Proof. The argument for this Lemma has originally been quite involved, until a dramatically simpleproof (below) was found by A. Garcia.

Define

Sn =n∑

k=1

Zk (10.8)

Ln = max0, Z1, . . . , Z1 + · · ·+ Zn (10.9)

Mn = max0, Z2, Z2 + Z3, . . . , Z2 + · · ·+ Zn (10.10)

Z∗ = supn≥1

Snn

(10.11)

It is sufficient to show thatE[Z11Z∗>0] ≥ 0 . (10.12)

Indeed, applying (10.12) to Z1 = Z1 − a and noticing that Z∗ = Z∗ − a we obtain

E[Z11Z∗>a] ≥ aP[Z∗ > a] ,

from which Lemma follows by upper-bounding the left-hand side with E[|Z1|].In order to show (10.12) we first notice that Ln > 0 Z∗ > 0. Next we notice that

Z1 +Mn = maxS1, . . . , Sn

and furthermoreZ1 +Mn = Ln on Ln > 0

Thus, we haveZ11Ln>0 = Ln −Mn1Ln>0

where we do not need indicator in the first term since Ln = 0 on Ln > 0c. Taking expectation weget

E[Z11Ln>0] = E[Ln]− E[Mn1Ln>0] (10.13)

≥ E[Ln]− E[Mn] (10.14)

= E[Ln]− E[Ln−1] = E[Ln − Ln−1] ≥ 0 , (10.15)

where we used Mn ≥ 0, the fact that Mn has the same distribution as Ln−1, and Ln ≥ Ln−1,respectively. Taking limit as n→∞ in (10.15) we obtain (10.12).

120

Page 121: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

10.4* Sinai’s generator theorem

It turns out there is a way to associate to every probability-preserving transformation (p.p.t.) τa number, called Kolmogorov-Sinai entropy. This number is invariant to isomorphisms of p.p.t.’s(appropriately defined).

Definition 10.5. Fix a probability-preserving transformation τ acting on probability space (Ω,F ,P).Kolmogorov-Sinai entropy of τ is defined as

H(τ) , supX0

limn→∞

1

nH(X0, X0 τ, . . . , X0 τn−1) ,

where supremum is taken over all finitely-valued random variables X0 : Ω → X and measurablewith respect to F .

Note that every random variable X0 generates a stationary process adapted to τ , that is

Xk , X0 τk .

In this way, Kolmogorov-Sinai entropy of τ equals the maximal entropy rate among all stationaryprocesses adapted to τ . This quantity may be extremely hard to evaluate, however. One help comesin the form of the famous criterion of Y. Sinai. We need to elaborate on some more concepts before:

• σ-algebra G ⊂ F is P-dense in F , or sometimes we also say G = F mod P or even G = Fmod 0, if for every E ∈ F there exists E′ ∈ G s.t.

P[E∆E′] = 0 .

• Partition A = Ai, i = 1, 2, . . . measurable with respect to F is called generating if

∞∨

n=0

στ−nA = F mod P .

• Random variable Y : Ω→ Y with a countable alphabet Y is called a generator of (Ω,F ,P, τ) if

σY, Y τ, . . . , Y τn, . . . = F mod P

Theorem 10.4 (Sinai’s generator theorem). Let Y be the generator of a p.p.t. (Ω,F ,P, τ). LetH(Y) be the entropy rate of the process Y = Yk = Y τk, k = 0, . . .. If H(Y) is finite, thenH(τ) = H(Y).

Proof. Notice that since H(Y) is finite, we must have H(Y n0 ) <∞ and thus H(Y ) <∞. First, we

argue that H(τ) ≥ H(Y). If Y has finite alphabet, then it is simply from the definition. Otherwiselet Y be Z+-valued. Define a truncated version Ym = min(Y,m), then since Ym → Y as m→∞ wehave from lower semicontinuity of mutual information, cf. (3.13), that

limm→∞

I(Y ; Ym) ≥ H(Y ) ,

and consequently for arbitrarily small ε and sufficiently large m

H(Y |Y ) ≤ ε ,

121

Page 122: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Then, consider the chain

H(Y n0 ) = H(Y n

0 , Yn

0 ) = H(Y n0 ) +H(Y n

0 |Y n0 )

= H(Y n0 ) +

n∑

i=0

H(Yi|Y n0 , Y

i−10 )

≤ H(Y n0 ) +

n∑

i=0

H(Yi|Yi)

= H(Y n0 ) + nH(Y |Y ) ≤ H(Y n

0 ) + nε

Thus, entropy rate of Y (which has finite-alphabet) can be made arbitrarily close to the entropyrate of Y, concluding that H(τ) ≥ H(Y).

The main part is showing that for any stationary process X adapted to τ the entropy rate isupper bounded by H(Y). To that end, consider X : Ω→ X with finite X and define as usual theprocess X = X τk, k = 0, 1, . . .. By generating property of Y we have that X (perhaps aftermodification on a set of measure zero) is a function of Y∞0 . So are all Xk. Thus

H(X0) = I(X0;Y∞0 ) = limn→∞

I(X0;Y n0 ) ,

where we used the continuity-in-σ-algebra property of mutual information, cf. (3.14). Rewriting thelatter limit differently, we have

limn→∞

H(X0|Y n0 ) = 0 .

Fix ε > 0 and choose m so that H(X0|Y m0 ) ≤ ε. Then consider the following chain:

H(Xn0 ) ≤ H(Xn

0 , Yn

0 ) = H(Y n0 ) +H(Xn

0 |Y n0 )

≤ H(Y n0 ) +

n∑

i=0

H(Xi|Y ni )

= H(Y n0 ) +

n∑

i=0

H(X0|Y n−i0 )

≤ H(Y n0 ) +m log |X |+ (n−m)ε ,

where we used stationarity of (Xk, Yk) and the fact that H(X0|Y n−i0 ) < ε for i ≤ n −m. After

dividing by n and passing to the limit our argument implies

H(X) ≤ H(Y) + ε .

Taking here ε→ 0 completes the proof.

Alternative proof: Suppose X0 is taking values on a finite alphabet X and X0 = f(Y∞0 ). Then(this is a measure-theoretic fact) for every ε > 0 there exists m = m(ε) and a function fε : Ym+1 → Xs.t.

P[f(Y∞0 ) 6= fε(Ym

0 )] ≤ ε .(This is just another way to say that

⋃n σY n

0 is P-dense in σ(Y∞0 ).) Define a stationary processX as

Xj , fε(Ym+jj ) .

122

Page 123: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Notice that since Xn0 is a function of Y n+m

0 we have

H(Xn0 ) ≤ H(Y n+m

0 ) .

Dividing by m and passing to the limit we obtain that for entropy rates

H(X) ≤ H(Y) .

Finally, to relate X to X notice that by construction

P[Xj 6= Xj ] ≤ ε .Since both processes take values on a fixed finite alphabet, from Corollary 5.2 we infer that

|H(X)−H(X)| ≤ ε log |X |+ h(ε) .

Altogether, we have shown that

H(X) ≤ H(Y) + ε log |X |+ h(ε) .

Taking ε→ 0 we conclude the proof.

Examples:

• Let Ω = [0, 1], F–Borel σ-algebra, P = Leb and

τ(ω) = 2ω mod 1 =

2ω, ω < 1/2

2ω − 1, ω ≥ 1/2

It is easy to show that Y (ω) = 1ω < 1/2 is a generator and that Y is an i.i.d. Bernoulli(1/2)process. Thus, we get that Kolmogorov-Sinai entropy is H(τ) = log 2.

• Let Ω be the unit circle S1, F – Borel σ-algebra, P be the normalized length and

τ(ω) = ω + γ

i.e. τ is a rotation by the angle γ. (When γ2π is irrational, this is known to be an ergodic

p.p.t.). Here Y = 1|ω| < 2πε is a generator for arbitrarily small ε and hence

H(τ) ≤ H(X) ≤ H(Y0) = h(ε)→ 0 as ε→ 0 .

This is an example of a zero-entropy p.p.t.

Remark 10.2. Two p.p.t.’s (Ω1, τ1,P1) and (Ω0, τ0,P0) are called isomorphic if there exists fi :Ωi → Ω1−i defined Pi-almost everywhere and such that 1) τ1−i fi = f1−i τi; 2) fi f1−i is identityon Ωi (a.e.); 3) Pi[f−1

1−iE] = P1−i[E]. It is easy to see that Kolmogorov-Sinai entropies of isomorphicp.p.t.s are equal. This observation was made by Kolmogorov in 1958. It was revoluationary, since itallowed to show that p.p.t.s corresponding shifts of iid Bern(1/2) and iid Bern(1/3) procceses arenot isomorphic. Before, the only invariants known were those obtained from studying the spectrumof a unitary operator

Uτ : L2(Ω,P)→ L2(Ω,P) (10.16)

φ(x) 7→ φ(τ(x)) . (10.17)

However, the spectrum of τ corresponding to any non-constant i.i.d. process consists of the entireunit circle, and thus is unable to distinguish Bern(1/2) from Bern(1/3).2

2To see the statement about the spectrum, let Xi be iid with zero mean and unit variance. Then consider φ(x∞1 )defined as 1√

m

∑mk=1 e

iωkxk. This φ has unit energy and as m→∞ we have ‖Uτφ− eiωφ‖L2 → 0. Hence every eiω

belongs to the spectrum of Uτ .

123

Page 124: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

§ 11. Universal compression

In this lecture we will discuss how to produce compression schemes that do not require aprioriknowledge of the distribution. Here, compressor is a map X n → 0, 1∗. Now, however, there is noone fixed probability distribution PXn on X n. The plan for this lecture is as follows:

1. We will start by discussing the earliest example of a universal compression algorithm (ofFitingof). It does not talk about probability distributions at all. However, it turns out to beasymptotically optimal simulatenously for all i.i.d. distributions and with small modificationsfor all finite-order Markov chains.

2. Next class of universal compressors is based on assuming that a the true distribution PXn

belongs to a given class. These methods proceed by choosing a good model distribution QXn

serving as the minimax approximation to each distribution in the class. The compressionalgorithm for a single distribution QXn is then designed as in previous chapters.

3. Finally, an entirely different idea are algorithms of Lempel-Ziv type. These automaticallyadapt to the distribution of the source, without any prior assumptions required.

Throughout this section instead of describing each compression algorithm, we will merely specifysome distribution QXn and apply one of the following constructions:

• Sort all xn in the order of decreasing QXn(xn) and assign values from 0, 1∗ as in Theorem 8.1,this compressor has lengths satisfying

`(f(xn)) ≤ log1

QXn(xn).

• Set lengths to be

`(f(xn)) , dlog1

QXn(xn)e

and apply Kraft’s inequality Theorem 8.5 to construct a prefix code.

• Use arithmetic coding (see next section).

The important conclusion is that in all these cases we have

`(f(xn)) ≤ log1

QXn(xn)+ const ,

and in this way we may and will always replace lengths with log 1QXn (xn) . In this way, the only job

of a universal compression algorithm is to specify QXn.

124

Page 125: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Remark 11.1. Furthermore, if we only restrict attention to prefix codes, then any code f : X n →0, 1∗ defines a distribution QXn(xn) = 2−`(f(xn)) (we assume the code’s tree is full). In thisway, for prefix-free codes results on redundancy, stated in terms of optimizing the choice of QXn ,imply tight converses too. For one-shot codes without prefix constraints the optimal answersare slightly different, however. (For example, the optimal universal code for all i.i.d. sources

satisfies E[`(f(Xn))] ≈ H(Xn) + |X |−32 log n in contrast with |X |−1

2 log n for prefix-free codes,cf. [BF14, KS14].)

11.1 Arithmetic coding

Constructing an encoder table from QXn may require a lot of resources if n is large. Arithmeticcoding provides a convenient workaround by allowing the encoder to output bits sequentially. Noticethat to do so, it requires that not only QXn but also its marginalizations QX1 , QX2 , · · · be easilycomputable. (This is not the case, for example, for Shtarkov distributions (11.10)-(11.11), which arenot compatible for different n.)

Let us agree upon some ordering on the alphabet of X (e.g. a < b < · · · < z) and extend thisorder lexicographically to X n (that is for x = (x1, . . . , xn) and y = (y1, . . . , yn), we say x < y ifxi < yi for the first i such that xi 6= yi, e.g., baba < babb). Then let

Fn(xn) =∑

yn<xn

QXn(yn) .

Associate to each xn an interval Ixn = [Fn(xn), Fn(xn) + QXn(xn)). These intervals are disjointsubintervals of [0, 1). Now encode

xn 7→ largest dyadic interval contained in Ixn .

Recall that dyadic intervals are intervals of the type [m2−k, (m+ 1)2−k] where a is an odd integer.Clearly each dyadic interval can be associated with a binary string in 0, 1∗. We set f(xn) to bethat string. The resulting code is a prefix code satisfying

`(f(xn)) ≤⌈

log2

1

QXn(xn)

⌉+ 1 .

(This is an exercise.)Observe that

Fn(xn) = Fn−1(xn−1) +QXn−1(xn−1)∑

y<xn

QXn|Xn−1(y|xn−1)

and thus Fn(xn) can be computed sequentially if QXn−1 and QXn|Xn−1 are easy to compute.This method is the method of choice in many modern compression algorithms because it allowsto dynamically incorporate the learned information about the stream, in the form of updatingQXn|Xn−1 (e.g. if the algorithm detects that an executable file contains a long chunk of English text,it may temporarily switch to QXn|Xn−1 modeling the English language).

11.2 Combinatorial construction of Fitingof

Fitingof suggested that a sequence xn ∈ X n should be prescribed information Φ0(xn) equal tothe logarithm of the number of all possible permutations obtainable from xn (i.e. log-size of the

125

Page 126: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

type-class containing xn). From Stirling’s approximation this can be shown to be

Φ0(xn) = nH(xT ) +O(log n) T ∼ Unif[n] (11.1)

= nH(Pxn) +O(log n) , (11.2)

where Pxn is the empirical distribution of the sequence xn:

Pxn(a) ,1

n

n∑

i=1

1xi = a . (11.3)

Then Fitingof argues that it should be possible to produce a prefix code with

`(f(xn)) = Φ0(xn) +O(log n) . (11.4)

This can be done in many ways. In the spirit of what we will do next, let us define

QXn(xn) , exp−Φ0(xn)cn , (11.5)

where cn is a normalization constant cn. Counting the number of different possible empiricaldistributions (types), we get

cn = O(n−(|X |−1)) ,

and thus, by Kraft inequality, there must exist a prefix code with lengths satisfying (11.4). Now

taking expectation over Xni.i.d.∼ PX we get

E[`(f(Xn))] = nH(PX) + (|X | − 1) log n+O(1) ,

for every i.i.d. source on X .

11.2.1 Universal compressor for all finite-order Markov chains

Fitingof’s idea can be extended as follows. Define now the 1-st order information content Φ1(xn)to be the log of the number of all sequences, obtainable by permuting xn with extra restrictionthat the new sequence should have the same statistics on digrams. Asymptotically, Φ1 is just theconditional entropy

Φ1(xn) = nH(xT |xT−1 mod n) +O(log n), T ∼ Unif[n] .

Again, it can be shown that there exists a code such that lengths

`(f(xn)) = Φ1(xn) +O(log n) .

This implies that for every 1-st order stationary Markov chain X1 → X2 → · · · → Xn we have

E[`(f(Xn))] = nH(X2|X1) +O(log n) .

This can be further continued to define Φ2(xn) and build a universal code, asymptoticallyoptimal for all 2-nd order Markov chains etc.

126

Page 127: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

11.3 Optimal compressors for a class of sources. Redundancy.

So we have seen that we can construct compressor f : X n → 0, 1∗ that achieves

E[`(f(Xn))] ≤ H(Xn) + o(n) ,

simultaneously for all i.i.d. sources (or even all r-th order Markov chains). What should we do next?Krichevsky suggested that the next barrier should be to optimize regret, or redundancy :

E[`(f(Xn))]−H(Xn)→ min

simultaneously for a class of sources. We proceed to rigorous definitions.Given a collection PXn|θ : θ ∈ Θ of sources, and a compressor f : X n → 0, 1∗ we define its

redundancy assupθ0

E[`(f(Xn))|θ = θ0]−H(Xn|θ = θ0) .

Replacing here lengths with log 1QXn

we define redundancy of the distribution QXn as

supθ0

D(PXn|θ=θ0‖QXn) .

Thus, the question of designing the best universal compressor (in the sense of optimizing worst-casedeviation of the average length from the entropy) becomes the question of finding solution of:

Q∗Xn = argminQXn

supθ0

D(PXn|θ=θ0‖QXn) .

We therefore get to the following definition

Definition 11.1 (Redundancy in universal compression). Given a class of sources PXn|θ=θ0 , θ0 ∈Θ, n = 1, . . . we define its minimax redundancy as

R∗n , minQXn

supθ0

D(PXn|θ=θ0‖QXn) . (11.6)

Note that under condition of finiteness of R∗n, Theorem 4.5 gives the maximin and capacityrepresentation

R∗n = supPθ

minQXn

D(PXn|θ‖QXn |Pθ) (11.7)

= supPθ

I(θ;Xn) . (11.8)

Thus redundancy is simply the capacity of the channel θ → Xn. This result, obvious in hindsight,was rather surprising in the early days of universal compression.

Finding exact QXn-minimizer in (11.6) is a daunting task even for the simple class of all i.i.d.Bernoulli sources (i.e. Θ = [0, 1], PXn|θ = Bernn(θ)). It turns out, however, that frequently theapproximate minimizer has a rather nice structure: it matches the Jeffreys prior.

Remark 11.2. (Shtarkov, Fitingof and individual sequence approach) There is a connection betweenthe combinatorial method of Fitingof and the method of optimality for a class. Indeed, following

Shtarkov we may want to choose distribution Q(S)Xn so as to minimize the worst-case redundancy for

each realization xn (not average!):

minQXn

maxxn

supθ0

logPXn|θ(x

n|θ0)

QXn(xn)(11.9)

127

Page 128: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

This leads to Shtarkov’s distribution (also known as the normalized maximal likehood (NML) code):

Q(S)Xn(xn) = c sup

θ0

PXn|θ(xn|θ0) , (11.10)

where c is the normalization constant. If class PXn|θ, θ ∈ Θ is chosen to be all i.i.d. distributionson X then

i.i.d. Q(S)Xn(xn) = c exp−nH(Pxn) , (11.11)

and thus compressing w.r.t. Q(S)Xn recovers Fitingof’s construction Φ0 up to O(log n) differences

between nH(Pxn) and Φ0(xn). If we take PXn|θ to be all 1-st order Markov chains, then we getconstruction Φ1 etc. Note also, that the problem (11.9) can also be written as minimization of theregret for each individual sequence (under log-loss, with respect to a parameter class PXn|θ):

minQXn

maxxn

log

1

QXn(xn)− inf

θ0log

1

PXn|θ(xn|θ0)

. (11.12)

The gospel is that if there is a reason to believe that real-world data xn is likely to be generated byone of the models PXn|θ, then using minimizer of (11.12) will resu lt in the compressor that bothlearns the right model and compresses with respect to it.

11.4* Approximate minimax solution: Jeffreys prior

In this section we will only consider the simple setting of a class of sources consisting of alli.i.d. distributions on a given finite alphabet. We will show that the prior, asymptoticall solvingcapacity question (11.8), is given by the Dirichlet-distribution with parameters set to 1/2, namelythe pdf

P ∗θ = const1√∏dj=0 θj

.

First, we give the formal setting as follows:

• Fix X – finite alphabet of size |X | = d+ 1, which we will enumerate as X = 0, . . . , d.

• Θ = (θj , j = 1, . . . , d) :∑d

j=1 θj ≤ 1, θj ≥ 0 – is the collection of all probability distributionson X . Note that Θ is a d-dimensional simplex. We will also define

θ0 , 1−d∑

j=1

θj .

• The source class is

PXn|θ(xn|θ) ,

n∏

j=1

θxj = exp

−n

a∈Xθa log

1

Pxn(a)

,

where as before Pxn is the empirical distribution of xn, cf. (11.3).

128

Page 129: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

In order to derive the caod Q∗Xn we first propose a guess that the caid Pθ in (11.8) is somedistribution with smooth density on Θ (this can only be justified by an apriori belief that the caidin such a natural problem should be something that involves all θ’s). Then, we define

QXn(xn) ,∫

ΘPXn|θ(x

n|θ′)Pθ(θ′)dθ′ . (11.13)

Before proceeding further, we recall the following method of approximating exponential integrals(called Laplace method). Suppose that f(θ) has a unique minimum at the interior point θ of Θand that Hessian Hessf is uniformly lower-bounded by a multiple of identity (in particular, f(θ) isstrongly convex). Then taking Taylor expansion of π and f we get

Θπ(θ)e−nf(θ)dθ =

∫(π(θ) +O(‖t‖))e−n(f(θ)− 1

2tTHessf(θ)t+o(‖t‖2))dt (11.14)

= π(θ)e−nf(θ)

Rde−x

THessf(θ)x dx√nd

(1 +O(n−1/2)) (11.15)

= π(θ)e−nf(θ)

(2π

n

) d2 1√

det Hessf(θ)(1 +O(n−1/2)) (11.16)

where in the last step we computed Gaussian integral.Next, we notice that

PXn|θ(xn|θ′) = e−n(D(Pxn‖PX|θ=θ′ )+H(Pxn )) log e ,

and therefore, denotingθ(xn) , Pxn

we get from applying (11.16) to (11.13)

logQXn(xn) = −nH(θ) +d

2log

n log e+ log

Pθ(θ)√det JF (θ)

+O(n−12 ) ,

where we used the fact that Hessθ′D(P‖PX|θ=θ′) = 1log eJF (θ′) with JF – Fisher information matrix,

see (4.14). From here, using the fact that under Xn ∼ PXn|θ=θ′ the random variable θ = θ′+O(n−1/2)we get by linearizing JF (·) and Pθ(·)

D(PXn|θ=θ′‖QXn) = n(E[H(θ)]−H(X|θ = θ′)) +d

2log n− log

Pθ(θ′)√

det JF (θ′)+ const +O(n−

12 ) ,

(11.17)where const is some constant (independent of prior Pθ or θ′). The first term is handled by the nextLemma.

Lemma 11.1. Let Xni.i.d.∼ P on finite alphabet X and let P be the empirical type of Xn then

E[D(P‖P )] =|X | − 1

2nlog e+ o

(1

n

).

Proof. Notice that√n(P − P ) converges in distribution to N (0,Σ), where Σ = diag(P ) − PP T ,

where P is an |X |-by-1 column vector. Thus, computing second-order Taylor expansion of D(·‖P ),cf. (4.16), we get the result.

129

Page 130: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Continuing (11.17) we get in the end

D(PXn|θ=θ′‖QXn) =d

2log n− log

Pθ(θ′)√

det JF (θ′)+ const +O(n−

12 ) (11.18)

under the assumption of smoothness of prior Pθ and that θ′ is not too close to the boundary.Consequently, we can see that in order for the prior Pθ be the saddle point solution, we should have

Pθ(θ′) ∼

√det JF (θ′) ,

provided that such density is normalizable. Prior proportional to square-root of the determinant ofFisher information matrix is known as Jeffreys prior. In our case, using the explicit expression forFisher information (4.17) we get

P ∗θ = Beta(1/2, 1/2, · · · , 1/2) = cd1√∏dj=0 θj

, (11.19)

where cd is the normalization constant. The corresponding redundancy is then

R∗n =d

2log

n

2πe− log cd + o(1) . (11.20)

Remark 11.3. In statistics Jeffreys prior is justified as being invariant to smooth reparametrization,as evidenced by (4.15). For example, in answering “will the sun rise tomorrow”, Laplace proposedto estimate the probability by modeling sunrise as i.i.d. Bernoulli process with a uniform prior onθ ∈ [0, 1]. However, this is clearly not very logical, as one may equally well postulate uniformity ofα = θ10 or β =

√θ. Jeffreys prior θ ∼ 1√

θ(1−θ)is invariant to reparametrization in the sense that if

one computed√

det JF (α) under α-parametrization the result would be exactly the pushforward ofthe 1√

θ(1−θ)along the map θ 7→ θ10.

Making the arguments in this subsection rigorous is far from trivial, see [CB90, CB94] for details.

11.5 Sequential probability assignment: Krichevsky-Trofimov

From (11.19) it is not hard to derive the (asymptotically) optimal universal probability assignmentQXn . For simplicity we consider Bernoulli case, i.e. d = 1 and θ ∈ [0, 1] is the 1-dimensionalparameter. Then,1

P ∗θ =1

π√θ(1− θ)

(11.21)

Q(KT )Xn (xn) =

(2t0 − 1)!! · (2t1 − 1)!!

2nn!, ta = #j ≤ n : xj = a (11.22)

This assignment can now be used to create a universal compressor via one of the methods outlinedin the beginning of this lecture. However, what is remarkable is that it has a very nice sequentialinterpretation (as does any assignment obtained via QXn =

∫PθPXn|θ with Pθ not depending on n).

1This is obtained from the identity∫ 1

0

θa(1−θ)b√θ(1−θ)

dθ = π 1·3···(2a−1)·1·3···(2b−1)

2a+b(a+b)!for integer a, b ≥ 0. This identity can

be derived by change of variable z = θ1−θ and using the standard keyhole contour on the complex plain.

130

Page 131: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Q(KT )Xn|Xn−1(1|xn−1) =

t1 + 12

n, t1 = #j ≤ n− 1 : xj = 1 (11.23)

Q(KT )Xn|Xn−1(0|xn−1) =

t0 + 12

n, t0 = #j ≤ n− 1 : xj = 0 (11.24)

This is the famous “add 1/2” rule of Krichevsky and Trofimov. Note that this sequential assignmentis very convenient for use in prediction as well as in implementing an arithmetic coder.

Remark 11.4 (Laplace “add 1” rule). A slightly less optimal choice of QXn results from Laplaceprior: just take Pθ to be uniform on [0, 1]. Then, in the Bernoulli (d = 1) case we get

Q(Lap)Xn =

1(nw

)(n+ 1)

, w = #j : xj = 1 . (11.25)

The corresponding successive probability is given by

Q(Lap)Xn|Xn−1(1|xn−1) =

t1 + 1

n+ 1, t1 = #j ≤ n− 1 : xj = 1 .

We notice two things. First, the distribution (11.25) is exactly the same as Fitingof’s (11.5). Second,this distribution “almost” attains the optimal first-order term in (11.20). Indeed, when Xn is iidBern(θ) we have for the redundancy:

E

[log

1

Q(Lap)Xn (Xn)

]−H(Xn) = log(n+ 1) + E

[log

(n

W

)]− nh(θ) , W ∼ Bino(n, θ) . (11.26)

From Stirling’s expansion we know that as n → ∞ this redundancy evaluates to 12 log n + O(1),

uniformly in θ over compact subsets of (0, 1). However, for θ = 0 or θ = 1 the Laplace redun-dancy (11.26) clearly equals log(n+1). Thus, supremum over θ ∈ [0, 1] is achieved close to endpointsand results in suboptimal redundancy log n+O(1). Jeffrey’s prior (11.21) fixes the problem at theendpoints.

11.6 Individual sequence and universal prediction

The problem of selecting one QXn serving as good prior for a whole class of distributions can alsobe interpreted in terms of so-called “universal prediction”. We discuss this connection next.

Consider the following problem: a sequence xn is observed sequentially and our goal is to predictprobability distribution of the next letter given the past observations. The experiment proceeds asfollows:

1. A string xn ∈ X n is selected by the nature.

2. Having observed samples x1, . . . , xt−1 we are requested to output probability distributionQt(·|xt−1) on X n.

3. After that nature reveals the next sample xt and our loss for t-th prediction is evaluated as

log1

Qt(xt|xt−1).

131

Page 132: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Goal (informal): Come up with the algorithm to minimize the average per-letter loss:

`(Qt, xn) ,1

n

n∑

t=1

log1

Qt(xt|xt−1).

Note that to make this goal formal, we need to explain how xn is generated. Consider first anaive requirement that the worst-case loss is minimized:

minQtnt=1

maxxn

`(Qt, xn) .

This is clearly trivial. Indeed, at any step t the distribution Pt must have at least one atom withweight ≤ 1

|X | , and hence for any predictor

maxxn

`(Qt, xn) ≥ log |X | ,

which is clearly achieved iff Qt(·) ≡ 1|X | , i.e. if predictor makes absolutely no prediction. This is of

course very natural: in the absence of whatsoever prior information on xn it is impossible to predictanything.

The exciting idea, originated by Feder, Merhav and Gutman, cf. [FMG92, MF98], is to replaceloss with regret, i.e. the gap to the best possible static oracle. More exactly, suppose a non-causaloracle can look at the entire string xn and output a constant Qt = Q1. From non-negativity ofdivergence this non-causal oracle achieves:

`oracle(xn) = min

Q

1

n

n∑

t=1

log1

Q(xt)= H(Pxn) .

Can causal (but time-varying) predictor come close to this performance? In other words, we defineregret as

reg(Qt, xn) , `(Qt, xn)−H(Pxn)

and ask to minimize the worst-case regret:

reg∗n , minQt

maxxn

reg(Qt, xn) . (11.27)

Excitingly, non-trivial predictors emerge as solutions to the above problem, which furthermoredo not rely on any assumptions on the prior distribution of xn.

We next consider the case of X = 0, 1 for simplicity. To solve (11.27), first notice that designinga sequence Qt(·|xt−1 is equivalent to defining one joint distribution QXn and then factorizing thelatter as QXn(xn) =

∏tQt(xt|xt−1). (At this point, one should recall the worst-case redundancy of

Shtarkov, cf. Remark 11.2.) Then the problem (11.27) becomes simply

reg∗n = minQX

nmaxxn

1

nlog

1

QXn(xn)−H(Pxn) .

We may lower-bound the max over xn with the average over the Xn ∼ Bern(θ)n and obtain (alsoapplying Lemma 11.1):

reg∗n ≥1

nR∗n +

|X | − 1

2n+ o(

1

n) ,

where R∗n is the universal compression redundancy defined in (11.6), whose asymptotics we derivedin (11.20).

132

Page 133: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

On the other hand, taking Q(KT )Xn from Krichevsky-Trofimov (11.22) we find after some algebra

and Stirling’s expansions:

maxxn− logQ

(KT )Xn (xn)− nH(Pxn) =

1

2log n+O(1) .

In all, we conclude that,

reg∗n =1

nR∗n +O(1) =

|X | − 1

2

log n

n+O(1/n) ,

and remarkable, the regret converges to zero. i.e. the causal predictor can approach the perforamnceof a non-causal oracle. Explicit (asymptotically optimal) sequential prediction rules are given byKrichevsky-Trofimov’s “add 1/2” rules (11.24). We note that the resulting rules are also independentof n (“horizon-free”). This is a very desirable property not shared by the sequential predictors (alsoasymptotically optimal) derived from factorizing the Shtarkov’s distribution (11.10).

11.7 Lempel-Ziv compressor

So given a class of sources PXn|θ, θ ∈ Θ we have shown how to produce an asymptotically optimalcompressors by using Jeffreys’ prior. Although we have done so only for i.i.d. class, it can beextended to handle a class of all r-th order Markov chains with minimal modifications. However,the resulting sequential probability becomes rather complex. Can we do something easier at theexpense of losing optimal redundancy?

In principle, the problem is rather straightforward: as we observe a stationary process, we mayestimate with better and better precision the conditional probability PXn|Xn−1

n−rand then use it as

the basis for arithmetic coding. As long as P converges to the actual conditional probability, wewill get to the entropy rate of H(Xn|Xn−1

n−r ). Note that Krichevsky-Trofimov assignment (11.24)is clearly learning the distribution too: as n grows, the estimator QXn|Xn−1 converges to the truePX (provided sequence is i.i.d.). So in some sense the converse is also true: any good universalcompression scheme is inherently learning the true distribution.

The main drawback of the learn-then-compress approach is the following. Once we extend theclass of sources to include those with memory, we invariably are lead to the problem of learning thejoint distribution PXr−1

0of r-blocks. However, the number of samples required to obtain a good

estimate of PXr−10

is exponential in r. Thus learning may proceed rather slowly. Lempel-Ziv family

of algorithms works around this in an ingeniously elegant way:

• First, estimating probabilities of rare substrings takes longest, but it is also the least useful,as these substrings almost never appear at the input.

• Second, and most crucial, observation is that a great estimate of PXr(xr) is given by thereciprocal of the time till the last observation of xr in the incoming stream.

• Third, there is a prefix code2 mapping any integer n to binary string of length roughly log2 n:

fint : Z+ → 0, 1+, `(fint(n)) = log2 n+O(log log n) . (11.28)

Thus, by encoding the pointer to the last observation of xr via such a code we get a string oflength roughly logPXr(xr) automatically.

2For this just notice that∑k≥1 2− log2 k−2 log2 log(k+1) <∞ and use Kraft’s inequality.

133

Page 134: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

There are a number of variations of these basic ideas, so we will only attempt to give a roughexplanation of why it works, without analyzing any particular algorithm.

We proceed to formal details. First, we need to establish a Kac’s lemma.

Lemma 11.2 (Kac). Consider a finite-alphabet stationary ergodic process . . . , X−1, X0, X1 . . .. LetL = inft > 0 : X−t = X0 be the last appearance of symbol X0 in the sequence X−1

−∞. Then for anyu such that P[X0 = u] > 0 we have

E[L|X0 = u] =1

P[X0 = u].

In particular, mean recurrence time E[L] = |supp(PX)|.

Proof. Note that from stationarity the following probability

P[∃t ≥ k : Xt = u]

does not depend on k ∈ Z. Thus by continuity of probability we can take k = −∞ to get

P[∃t ≥ 0 : Xt = u] = P[∃t ∈ Z : Xt = u] .

However, the last event is shift-invariant and thus must have probability zero or one by ergodicassumption. But since P[X0 = u] > 0 it cannot be zero. So we conclude

P[∃t ≥ 0 : Xt = u] = 1 . (11.29)

Next, we have

E[L|X0 = u] =∑

t≥1

P[L ≥ t|X0 = u] (11.30)

=1

P[X0 = u]

t≥1

P[L ≥ t,X0 = u] (11.31)

=1

P[X0 = u]

t≥1

P[X−t+1 6= u, . . . ,X−1 6= u,X0 = u] (11.32)

=1

P[X0 = u]

t≥1

P[X0 6= u, . . . ,Xt− 2 6= u,Xt−1 = u] (11.33)

=1

P[X0 = u]P[∃t ≥ 0 : Xt = u] (11.34)

=1

P[X0 = u], (11.35)

where (11.30) is the standard expression for the expectation of a Z+-valued random variable, (11.33)is from stationarity, (11.34) is because the events corresponding to different t are disjoint, and (11.35)is from (11.29).

The following proposition serves to explain the basic principle behind operation of Lempel-Ziv:

Theorem 11.1. Consider a finite-alphabet stationary ergodic process . . . , X−1, X0, X1 . . . withentropy rate H. Suppose that X−1

−∞ is known to the decoder. Then there exists a sequence ofprefix-codes fn(xn−1

0 , x−1−∞) with expected length

1

nE[`(fn(Xn−1

0 , X−1∞ ))]→ H ,

134

Page 135: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Proof. Let Ln be the last occurence of the block xn−10 in the string x−1

−∞ (recall that the latter isknown to decoder), namely

Ln = inft > 0 : x−t+n−1−t = xn−1

0 .

Then, by Kac’s lemma applied to the process Y(n)t = Xt+n−1

t we have

E[Ln|Xn−10 = xn−1

0 ] =1

P[Xn−10 = xn−1

0 ].

We know encode Ln using the code (11.28). Note that there is crucial subtlety: even if Ln < n andthus [−t,−t+ n− 1] and [0, n− 1] overlap, the substring xn−1

0 can be decoded from the knowledgeof Ln.

We have, by applying Jensen’s inequality twice and noticing that 1nH(Xn−1

0 ) H and1n logH(Xn−1

0 )→ 0 that

1

nE[`(fint(Ln))] ≤ 1

nE[log

1

PXn−10

(Xn−10 )

] + o(1)→ H .

From Kraft’s inequality we know that for any prefix code we must have

1

nE[`(fint(Ln))] ≥ 1

nH(Xn−1

0 |X−1−∞) = H .

135

Page 136: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Part III

Binary hypothesis testing

136

Page 137: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

§ 12. Binary hypothesis testing

12.1 Binary Hypothesis Testing

Two possible distributions on a space X

H0 : X ∼ PH1 : X ∼ Q

Where under hypothesis H0 (the null hypothesis) X is distributed according to P , and under H1

(the alternative hypothesis) X is distributed according to Q. A test between two distributionschooses either H0 or H1 based on an observation of X

• Deterministic test: f : X → 0, 1

• Randomized test: PZ|X : X → 0, 1, so that PZ|X(0|x) ∈ [0, 1].

Let Z = 0 denote that the test chooses P , and Z = 1 when the test chooses Q.Remark: This setting is called “testing simple hypothesis against simple hypothesis”. Simple

here refers to the fact that under each hypothesis there is only one distribution that could generatethe data. Composite hypothesis is when X ∼ P and P is only known to belong to some class ofdistributions.

12.1.1 Performance Metrics

In order to determine the “effectiveness” of a test, we look at two metrics. Let πi|j denote theprobability of the test choosing i when the correct hypothesis is j. With this

α = π0|0 = P [Z = 0] (Probability of success given H0 true)

β = π0|1 = Q[Z = 0] (Probability of error given H1 true)

Morever, π1|1 (true positive) is called the power of a test.

Remark 12.1. P [Z = 0] is a slight abuse of notation; more accurately, it means P [Z = 0] =∑x∈X P (x)PZ|X(0|x) = E[1− f(X)], where f(x) = P [reject|X = x]. Also, the choice of these two

metrics to judge the test is not unique, we can use many other pairs from π0|0, π0|1, π1|0, π1|1.

So for any test PZ|X there is an associated (α, β). There are a few ways to determine the “besttest”

• Bayesian: Assume prior distributions P[H0] = π0 and P[H1] = π1, minimize the expected error

P ∗b = mintests

π0π1|0 + π1π0|1

137

Page 138: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

• Minimax: Assume there is a prior distribution but it is unknown, so choose the test thatpreforms the best for the worst case priors

P ∗m = mintests

maxπ0

π0π1|0 + π1π0|1

• Neyman-Pearson: Minimize error β subject to success probability at least α.

In this course, the Neyman-Pearson formulation will play a vital role.

12.2 Neyman-Pearson formulation

Definition 12.1. Given that we require P [Z = 0] ≥ α,

βα(P,Q) , infP [Z=0]≥α

Q[Z = 0]

Definition 12.2. Given (P,Q), the region of achievable points for all randomized tests is

R(P,Q) =⋃

PZ|X

(P [Z = 0], Q[Z = 0]) ⊂ [0, 1]2 (12.1)

R(P,Q)

α

β

βα(P,Q)

Remark 12.2. This region encodes a lot of useful information about the relationship between Pand Q. For example,1

P = Q⇔R(P,Q) = P ⊥ Q⇔R(P,Q) =

Moreover, TV(P,Q) = maximal length of vertical line intersecting the lower half of R(P,Q) (HW).

Theorem 12.1 (Properties of R(P,Q)).

1. R(P,Q) is a closed, convex subset of [0, 1]2.

1Recall that P is mutually singular w.r.t. Q, denoted by P ⊥ Q, if P [E] = 0 and Q[E] = 1 for some E.

138

Page 139: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

2. R(P,Q) contains the diagonal.

3. Symmetry: (α, β) ∈ R(P,Q)⇔ (1− α, 1− β) ∈ R(P,Q).

Proof. 1. For convexity, suppose (α0, β0), (α1, β1) ∈ R(P,Q), then each specifies a test PZ0|X , PZ1|Xrespectively. Randomize between these two test to get the test λPZ0|X + λPZ1|X for λ ∈ [0, 1],which achieves the point (λα0 + λα1, λβ0 + λβ1) ∈ R(P,Q).

Closedness will follow from the explicit determination of all boundary points via Neyman-Pearson Lemma – see Remark 12.3. In more complicated situations (e.g. in testing againstcomposite hypothesis) simple explicit solutions similar to Neyman-Pearson Lemma are notavailable but closedness of the region can frequently be argued still. The basic reason is thatthe collection of functions g : X → [0, 1] forms a weakly-compact set and hence its imageunder a linear functional g 7→ (

∫gdP,

∫gdQ) is closed.

2. Test by blindly flipping a coin, i.e., let Z ∼ Bern(1− α) ⊥⊥ X. This achieves the point (α, α).

3. If (α, β) ∈ R(P,Q), then form the test that chooses P whenever PZ|X choses Q, and choosesQ whenever PZ|X choses P , which gives (1− α, 1− β) ∈ R(P,Q).

The region R(P,Q) consists of the operating points of all randomized tests, which includedeterministic tests as special cases. The achievable region of deterministic tests are denoted by

Rdet(P,Q) =⋃

E

(P (E), Q(E). (12.2)

One might wonder the relationship between these two regions. It turns out that R(P,Q) is given bythe closed convex hull of Rdet(P,Q).

We first recall a couple of notations:

• Closure: cl(E) , the smallest closed set containing E.

• Convex hull: co(E) , the smallest convex set containing E = ∑ni=1 αixi : αi ≥ 0,

∑ni=1 αi =

1, xi ∈ E,n ∈ N. A useful example: if (f(x), g(x)) ∈ E,∀x, then (E [f(X)] ,E [g(X)]) ∈cl(co(E)).

Theorem 12.2 (Randomized test v.s. deterministic tests).

R(P,Q) = cl(co(Rdet(P,Q))).

Consequently, if P and Q are on a finite alphabet X , then R(P,Q) is a polygon of at most 2|X |

vertices.

Proof. “⊃”: Comparing (12.1) and (12.2), by definition, R(P,Q) ⊃ Rdet(P,Q)). By Theorem 12.1,R(P,Q) is closed convex, and we are done with the ⊃ direction.

“⊂”: Given any randomized test PZ|X , put g(x) = PZ=0|X=x. Then g is a measurable function.Let’s recall the following lemma:

Lemma 12.1 (Area rule). For any positive random variable U ≥ 0, E[U ] =∫R+

P [U ≥ u] du.

Proof. By Fubini, E[U ] = E∫ U

0 du = E∫

1U≥udu =∫E1U≥udu.

139

Page 140: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Thus,

P [Z = 0] =∑

x

g(x)P (x) = EP [g(X)] =

∫ 1

0P [g(X) ≥ t]dt

Q[Z = 0] =∑

x

g(x)Q(x) = EQ[g(X)] =

∫ 1

0Q[g(X) ≥ t]dt

where we applied the formula E[U ] =∫P [U ≥ t] dt for U ≥ 0. Therefore the point (P [Z = 0], Q[Z =

0]) ∈ R is a mixture of points (P [g(X) ≥ t], Q[g(X) ≥ t]) ∈ Rdet, averaged according to t uniformlydistributed on the unit interval. Hence R ⊂ cl(co(Rdet)).

The last claim follows because there are at most 2|X | subsets in (12.2).

Example: Testing Bern(p) versus Bern(q), p < 12 < q. Using Theorem 12.2, note that there are

22 = 4 events E = ∅, 0, 1, 0, 1. Then

0

1

β

R(Ber

n(p),B

ern(q)

)(p, q)

(p, q)

12.3 Likelihood ratio tests

Definition 12.3. The log likelihood ratio (LLR) is T = log dPdQ : X → R ∪ ±∞. The likelihood

ratio test (LRT) with threshold τ ∈ R is 1log dPdQ ≤ τ. Formally, we assume that dP = p(x)dµ

and dQ = q(x)dµ (one can take µ = P +Q, for example) and set

T (x) ,

log p(x)q(x) , p(x) > 0, q(x) > 0

+∞, p(x) > 0, q(x) = 0

−∞, p(x) = 0, q(x) > 0

undefined, p(x) = 0, q(x) = 0

Notes:

• LRT is a deterministic test. The intuition is that upon observing x, if Q(x)P (x) exceeds a certain

threshold, suggesting Q is more likely, one should reject the null hypothesis and declare Q.

140

Page 141: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

• The rationale for defining extended values ±∞ of T (x) are the following observations:

∀x,∀τ ∈ R : (p(x)− expτq(x))1T (x) > τ ≥ 0

(p(x)− expτq(x))1T (x) ≥ τ ≥ 0

(q(x)− exp−τp(x))1T (x) < τ ≥ 0

(q(x)− exp−τp(x))1T (x) ≤ τ ≥ 0

This leads to the following useful consequence: For any g ≥ 0 and any τ ∈ R (note: τ = ±∞is excluded) we have

EP [g(X)1T ≥ τ] ≥ expτ · EQ[g(X)1T ≥ τ] (12.3)

EQ[g(X)1T ≤ τ] ≥ exp−τ · EP [g(X)1T ≤ τ] (12.4)

Below, these and similar inequalities are only checked for the cases of T taking real (notextended) values, but from this remark it should be clear how to treat the general case.

• Another useful observation:

Q[T = +∞] = P [T = −∞] = 0 . (12.5)

Theorem 12.3.

1. T is a sufficient statistic for testing H0 vs H1.

2. (Change of measure) For discrete alphabet X and when Q P we have

Q[T = t] = exp(−t)P [T = t] ∀f ∈ R ∪ +∞

More generally, we have for any g : R ∪ ±∞ → R

EQ[g(T )] = g(−∞)Q[T = −∞] + EP [exp−Tg(T )] (12.6)

EP [g(T )] = g(+∞)P [T = +∞] + EQ[expTg(T )] (12.7)

Proof. (2)

QT (t) =∑

XQ(x)1log

P (x)

Q(x)= t =

XQ(x)1etQ(x) = P (x)

= e−t∑

XP (x)1log

P (x)

Q(x)= t = e−tPT (t)

To prove the general version (12.6), note that

EQ[g(T )] =

−∞<T (x)<∞dµ q(x)g(T (x)) + g(−∞)Q[T = −∞]

=

−∞<T (x)<∞dµ p(x) exp−T (x)g(T (x)) + g(−∞)Q[T = −∞]

= EP [exp−Tg(T )] + g(−∞)Q[T = −∞] ,

141

Page 142: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

where we used (12.5) to justify restriction to finite values of T .(1) To show T is a s.s, we need to show PX|T = QX|T . For the discrete case we have:

PX|T (x|t) =PX(x)PT |X(t|x)

PT (t)=P (x)1P (x)

Q(x) = etPT (t)

=efQ(x)1P (x)

Q(x) = etPT (t)

=QXT (xt)

e−tPT (t)

(2)=

QXTQT

= QX|T (x|t).

The general argument is done similarly to the proof of (12.6).

From Theorem 12.2 we know that to obtain the achievable region R(P,Q), one can iterate overall subsets and compute the region Rdet(P,Q) first, then take its closed convex hull. But this is aformidable task if the alphabet is huge or infinite. But we know that the LLR log dP

dQ is a sufficient

statistic. Next we give bounds to the region R(P,Q) in terms of the statistics of log dPdQ . As usual,

there are two types of statements:

• Converse (outer bounds): any point in R(P,Q) must satisfy ...

• Achievability (inner bounds): the following point belong to R(P,Q) ...

12.4 Converse bounds on R(P,Q)Theorem 12.4 (Weak Converse). ∀(α, β) ∈ R(P,Q),

d(α‖β) ≤ D(P‖Q)

d(β‖α) ≤ D(Q‖P )

where d(·‖·) is the binary divergence.

Proof. Use data processing for KL divergence with PZ|X .

Lemma 12.2 (Deterministic tests). ∀E, ∀γ > 0 : P [E]− γQ[E] ≤ P[

log dPdQ > log γ

]

Proof. (Discrete version)

P [E]− γQ[E] =∑

x∈Ep(x)− γq(x) ≤

x∈E(p(x)− γq(x))1p(x)>γq(x)

= P[

logdP

dQ> log γ,X ∈ E

]− γQ

[log

dP

dQ> log γ,X ∈ E

]≤ P

[log

dP

dQ> log γ

].

(General version) WLOG, suppose P,Q µ for some measure µ (since we can always takeµ = P +Q). Then dP = p(x)dµ, dQ = q(x)dµ. Then

P [E]− γQ[E] =

Edµ(p(x)− γq(x)) ≤

Edµ(p(x)− γq(x))1p(x)>γq(x)

= P[

logdP

dQ> log γ,X ∈ E

]−Q

[log

dP

dQ> log γ,X ∈ E

]≤ P

[log

dP

dQ> log γ

].

where the second line follows from pq =

dPdµdQdµ

= dPdQ .

[So we see that the only difference between the discrete and the general case is that the countingmeasure is replaced by some other measure µ.]

142

Page 143: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Note: In this case, we do not need P Q, since ±∞ is a reasonable and meaningful value for thelog likelihood ratio.

Lemma 12.3 (Randomized tests). P [Z = 0]− γQ[Z = 0] ≤ P[

log dPdQ > log γ

].

Proof. Almost identical to the proof of the previous Lemma 12.2:

P [Z = 0]− γQ[Z = 0] =∑

x

PZ|X(0|x)(p(x)− γq(x)) ≤∑

x

PZ|X(0|x)(p(x)− γq(x))1p(x)>γq(x)

= P[

logdP

dQ> log γ, Z = 0

]−Q

[log

dP

dQ> log γ, Z = 0

]

≤ P[

logdP

dQ> log γ

].

Theorem 12.5 (Strong Converse). ∀(α, β) ∈ R(P,Q), ∀γ > 0,

α− γβ ≤ P[

logdP

dQ> log γ

](12.8)

β − 1

γα ≤ Q

[log

dP

dQ< log γ

](12.9)

Proof. Apply Lemma 12.3 to (P,Q, γ) and (Q,P, 1/γ).

Note: Theorem 12.5 provides an outer bound for the region R(P,Q) in terms of half-spaces. To seethis, suppose one fixes γ > 0 and looks at the line α− γβ = c and slowing increases c from zero,there is going to be a maximal c, say c∗, at which point the line touches the lower boundary of theregion. Then (12.8) says that c∗ cannot exceed P [log dP

dQ > log γ]. Hence R must lie to the left ofthe line. Similarly, (12.9) provides bounds for the upper boundary. Altogether Theorem 12.5 statesthat R(P,Q) is contained in the intersection of a collection of half-spaces indexed by γ.Note: To apply the strong converse Theorem 12.5, we need to know the CDF of the LLR, whereasto apply the weak converse Theorem 12.4 we need only to know the expectation of the LLR, i.e.,divergence.

12.5 Achievability bounds on R(P,Q)Since we know that the set R(P,Q) is convex, it is natural to try to find all of its supporting lines(hyperplanes), as it is well known that closed convex set equals the intersection of the halfspacescorreposponding to all supporting hyperplanes. So thus, we are naturally lead to solving the problem

maxα− tβ : (α, β) ∈ R(P,Q) .This can be done rather simply:

α∗ − tβ∗ = max(α,β)∈R

(α− tβ) = maxPZ|X

x∈X(P (x)− tQ(x))PZ|X(0|x) =

x∈X|P (x)− tQ(x)|+

where the last equality follows from the fact that we are free to choose PZ|X(0|x), and the bestchoice is obvious:

PZ|X(0|x) = 1

log

P (x)

Q(x)≥ log t

.

Thus, we have shown that all supporting hyperplanes are parameterized by LLR-tests. Thiscompletely recovers the region R(P,Q) except for the points corresponding to the faces (flat pieces)of the region. To be precise, we state the following result.

143

Page 144: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Theorem 12.6 (Neyman-Pearson Lemma: “LRT is optimal”). For any α, βα is attained by thefollowing test:

PZ|X(0|x) =

1 log dPdQ > τ

λ log dPdQ = τ

0 log dPdQ < τ

(12.10)

where τ ∈ R and λ ∈ [0, 1] are the unique solutions to α = P [log dPdQ > τ ] + λP [log dP

dQ = τ ].

Proof of Theorem 12.6. Let t = exp(τ). Given any test PZ|X , let g(x) = PZ|X(0|x) ∈ [0, 1]. Wewant to show that

α = P [Z = 0] = EP [g(X)] = P[dP

dQ> t]

+ λP[dP

dQ= t]

(12.11)

⇒ β = Q[Z = 0] = EQ[g(X)]goal≥ Q

[dP

dQ> t]

+ λQ[dP

dQ= t]

(12.12)

Using the simple fact that EQ[f(X)1dPdQ≤t

] ≥ t−1EP [f(X)1dPdQ≤t

] for any f ≥ 0 twice, we have

β = EQ[g(X)1dPdQ≤t

] + EQ[g(X)1dPdQ>t

]

≥ 1

tEP [g(X)1

dPdQ≤t

]

︸ ︷︷ ︸+EQ[g(X)1

dPdQ>t

]

(12.11)=

1

t

(EP [(1− g(X))1

dPdQ>t

] + λP[dP

dQ= t]

︸ ︷︷ ︸

)+ EQ[g(X)1

dPdQ>t

]

≥ EQ[(1− g(X))1dPdQ>t

] + λQ[dP

dQ= t]

+ EQ[g(X)1dPdQ>t

]

= Q[dP

dQ> t]

+ λQ[dP

dQ= t].

Remark 12.3. As a consequence of the Neyman-Pearson lemma, all the points on the boundary ofthe region R(P,Q) are attainable. Therefore

R(P,Q) = (α, β) : βα ≤ β ≤ 1− β1−α.

Since α 7→ βα is convex on [0, 1], hence continuous, the region R(P,Q) is a closed convex set.Consequently, the infimum in the definition of βα is in fact a minimum.

Furthermore, the lower half of the region R(P,Q) is the convex hull of the union of the followingtwo sets:

α = P[

log dPdQ > τ

]

β = Q[

log dPdQ > τ

] τ ∈ R ∪ ±∞.

and α = P

[log dP

dQ ≥ τ]

β = Q[

log dPdQ ≥ τ

] τ ∈ R ∪ ±∞.

Therefore it does not lose optimality to restrict our attention on tests of the form 1log dPdQ ≥ τ or

1log dPdQ > τ. The convex combination (randomization) of the above two styles of tests lead to the

achievability of the Neyman-Pearson lemma (Theorem 12.6).

144

Page 145: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Remark 12.4. The test (12.10) is related to LRT2 as follows:

t

1

P [log dPdQ > t]

α

τt

1

P [log dPdQ > t]

α

τ

1. Left figure: If α = P [log dPdQ > τ ] for some τ , then λ = 0, and (12.10) becomes the LRT

Z = 1log dP

dQ≤τ

.

2. Right figure: If α 6= P [log dPdQ > τ ] for any τ , then we have λ ∈ (0, 1), and (12.10) is equivalent

to randomize over tests: Z = 1log dP

dQ≤τ

with probability λ or 1log dP

dQ<τ

with probability λ.

Corollary 12.1. ∀τ ∈ R, there exists (α, β) ∈ R(P,Q) s.t.

α = P[

logdP

dQ> τ

]

β ≤ exp(−τ)P[

logdP

dQ> τ

]≤ exp(−τ)

Proof.

Q[

logdP

dQ> τ

]=∑

Q(x)1P (x)

Q(x)> eτ

≤∑

P (x)e−τ1P (x)

Q(x)> eτ

= e−τP

[log

dP

dQ> τ

].

12.6 Asymptotics

Now we have many samples from the underlying distribution

H0 : X1, . . . , Xni.i.d.∼ P

H1 : X1, . . . , Xni.i.d.∼ Q

We’re interested in the asymptotics of the error probabilities π0|1 and π1|0. There are two maintypes of tests, both which the convergence rate to zero error is exponential.

1. Stein Regime: What is the best exponential rate of convergence for π0|1 when π1|0 has to be≤ ε?

π1|0 ≤ επ0|1 → 0

2Note that it so happens that in Definition 12.3 the LRT is defined with an ≤ instead of <.

145

Page 146: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

2. Chernoff Regime: What is the trade off between exponents of the convergence rates of π1|0and π0|1 when we want both errors to go to 0?

π1|0 → 0

π0|1 → 0

146

Page 147: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

§ 13. Hypothesis testing asymptotics I

Setup:

H0 : Xn ∼ PXn H1 : Xn ∼ QXn

test PZ|Xn : X n → 0, 1specification 1− α = π1|0 β = π0|1

13.1 Stein’s regime

false alarm: 1− α = π1|0 ≤ εmissed detection: β = π0|1 → 0 at the rate 2−nVε

Note: Motivation of this objective: usually a “miss”(0|1) is much worse than a “false alarm” (1|0).

Definition 13.1 (ε-optimal exponent). Vε is called an ε-optimal exponent in Stein’s regime if

Vε = supE : ∃n0, ∀n ≥ n0, ∃PZ|Xn s.t. α > 1− ε, β < 2−nE ,

⇔ Vε = lim infn→∞

1

nlog

1

β1−ε(PXn , QXn)

where βα(P,Q) = minPZ|X ,P (Z=0)≥αQ(Z = 0).

Exercise: Check the equivalence.

Definition 13.2 (Stein’s exponent).

V = limε→0

Vε.

Theorem 13.1 (Stein’s lemma). Let PXn = PnX i.i.d. and QXn = QnX i.i.d. Then

Vε = D(P‖Q), ∀ε ∈ (0, 1).

Consequently,V = D(P‖Q).

Example: If it is required that α ≥ 1−10−3, and β ≤ 10−40, what’s the number of samples needed?

Stein’s lemma provides a rule of thumb: n & − log 10−40

D(P‖Q) .

147

Page 148: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Proof. Denote F = log dPdQ , and Fn = log dPXn

dQXn=∑n

i=1 log dPdQ(Xi) – iid sum.

Recall Neyman Pearson’s lemma on optimal tests (likelihood ratio test): ∀τ ,

α = P (F > τ), β = Q(F > τ) ≤ e−τ

Also notice that by WLLN, under P , as n→∞,

1

nFn =

1

n

n∑

i=1

logdP (Xi)

dQ(Xi)

P−→EP[log

dP

dQ

]= D(P‖Q). (13.1)

Alternatively, under Q, we have

1

nFn

P−→EQ[log

dP

dQ

]= −D(Q‖P ) (13.2)

1. Show Vε ≥ D(P‖Q) = D.Pick τ = n(D − δ), for some small δ > 0. Then the optimal test achieves:

α = P (Fn > n(D − δ))→ 1, by (13.1)

β ≤ e−n(D−δ)

then pick n large enough (depends on ε, δ) such that α ≥ 1−ε, we have the exponent E = D−δachievable, Vε ≥ E. Further let δ → 0, we have that Vε ≥ D.

2. Show Vε ≤ D(P‖Q) = D.

a) (weak converse) ∀(α, β) ∈ R(PXn , QXn), we have

−h(α) + α log1

β≤ d(α‖β) ≤ D(PXn‖QXn) (13.3)

where the first inequality is due to

d(α‖β) = α logα

β+ α log

α

β= −h(α) + α log

1

β+ α log

1

β︸ ︷︷ ︸≥ 0 and ≈ 0 for small β

and the second is due to the weak converse Theorem 12.4 proved in the last lecture (dataprocessing inequality for divergence).∀ achievable exponent E < Vε, by definition, there exists a sequence of tests PZ|Xn such

that αn ≥ 1− ε and βn ≤ 2−nE . Plugging it in (13.3) and using h ≤ log 2, we have

− log 2 + (1− ε)nE ≤ nD(P‖Q)⇒ E ≤ D(P‖Q)

1− ε +log 2

n(1− ε)︸ ︷︷ ︸→0, as n→∞

.

Therefore

Vε ≤D(P‖Q)

1− εNotice that this is weaker than what we hoped to prove, and this weak converse result istight for ε→ 0, i.e., for Stein’s exponent we did have the desired result V = limε→0 Vε ≥D(P‖Q).

148

Page 149: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

b) (strong converse) In proving the weak converse, we only made use of the expectation ofFn in (13.3), we need to make use of the entire distribution (CDF) in order to obtainstronger results.

Recall the strong converse result which we showed in the last lecture:

∀(α, β) ∈ R(P,Q),∀γ, α− γβ ≤ P (F > log γ)

Here, suppose there exists a sequence of tests PZ|Xn which achieve αn ≥ 1 − ε and

βn ≤ 2−nE . Then

1− ε− γ2−nE ≤ αn − γβn ≤ PXn [Fn > log γ].

Pick log γ = n(D + δ), by (13.1) the RHS goes to 0, and we have

1− ε− 2n(D+δ)2−nE ≤ o(1)

⇒D + δ − E ≥ 1

nlog(1− ε+ o(1))→ 0

⇒E ≤ D as δ → 0

⇒Vε ≤ D

Remark 13.1 (Ergodic). Just like in last section of data compression. Ergodic assumptions onPXn and QXn allow one to show that

Vε = limn→∞

1

nD(PXn‖QXn)

the counterpart of (13.3), which is the key for picking the appropriate τ , for ergodic sequence Xn isthe Birkhoff-Khintchine convergence theorem.

Remark 13.2. The theoretical importance of knowing the Stein’s exponents is that:

∀E ⊂ X n, PXn [E] ≥ 1− ε ⇒ QXn [E] ≥ 2−nVε+o(n)

Thus knowledge of Stein’s exponent Vε allows one to prove exponential bounds on probabilities ofarbitrary sets, the technique is known as “change of measure”.

13.2 Chernoff regime

We are still considering i.i.d. sequence Xn, and binary hypothesis

H0 : Xn ∼ PnX H1 : Xn ∼ QnX

But our objective in this section is to have both types of error probability to vanish exponentiallyfast simultaneously. We shall look at the following specification:

1− α = π1|0 → 0 at the rate 2−nE0

β = π0|1 → 0 at the rate 2−nE1

149

Page 150: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Apparently, E0 (resp. E1) can be made arbitrarily big at the price of making E1 (resp. E0) arbitrarilysmall. So the problem boils down to the optimal tradeoff, i.e., what’s the achievable region of(E0, E1)? This problem is solved by [Hoe65, Bla74].

The optimal tests give the explict error probability:

αn = P

[1

nFn > τ

], βn = Q

[1

nFn > τ

]

and we are interested in the asymptotics when n→∞, in which scenario we know (13.1) and (13.2)occur.

Stein’s regime corresponds to the corner points. Indeed, Theorem 13.1 tells us that when fixingαn = 1 − ε, namely E0 = 0, picking τ = D(P‖Q) − δ (δ → 0) gives the exponential convergencerate of βn as E1 = D(P‖Q). Similarly, exchanging the role of P and Q, we can achieves the point(E0, E1) = (D(Q‖P ), 0). More generally, to achieve the optimal tradeoff between the two cornerpoints, we need to introduce a powerful tool – Large Deviation Theory.Note: Here is a roadmap of the upcoming 2 lectures:

1. basics of large deviation (ψX , ψ∗X , tilted distribution Pλ)

2. information projection problem

minQ:EQ[X]≥γ

D(Q‖P ) = ψ∗(γ)

3. use information projection to prove tight Chernoff bound

P

[1

n

n∑

k=1

Xk ≥ γ]

= 2−nψ∗(γ)+o(n)

4. apply the above large deviation theorem to (E0, E1) to get

(E0(θ) = ψ∗P (θ), E1(θ) = ψ∗P (θ)− θ) characterize the achievable boundary.

13.3 Basics of Large deviation theory

Let Xn be an i.i.d. sequence and Xi ∼ P . Large deviation focuses on the following inequality:

P

[n∑

i=1

Xi ≥ nγ]

= 2−nE(γ)+o(n)

150

Page 151: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

what is the rate function E(γ) = − limn→∞1n logP

[∑ni=1 Xin ≥ γ

]? (Chernoff’s ineq.)

To motivate, let us recall the usual Chernoff bound: For iid Xn, for any λ ≥ 0,

P

[n∑

i=1

Xi ≥ nγ]

= P

[exp

n∑

i=1

Xi

)≥ exp(nλγ)

]

Markov≤ exp(−nλγ)E

[exp

n∑

i=1

Xi

)]

= exp −nλγ + n logE [exp(λX)] .

Optimizing over λ ≥ 0 gives the non-asymptotic upper bound (concentration inequality) whichholds for any n:

P

[n∑

i=1

Xi ≥ nγ]≤ exp

− n sup

λ≥0(λγ − logE [exp(λX)]︸ ︷︷ ︸

log MGF

).

Of course we still need to show the lower bound.Let’s first introduce the two key quantities: log MGF (also known as the cumulant generating

function) ψX(λ) and tilted distribution Pλ.

13.3.1 Log MGF

Definition 13.3 (Log MGF).

ψX(λ) = log(E[exp(λX)]), λ ∈ R.

Per the usual convention, we will also denote ψP (λ) = ψX(λ) if X ∼ P .

Assumptions: In this section, we shall restrict to the distribution PX such that

1. MGF exists, i.e., ∀λ ∈ R, ψX(λ) <∞, which, in particular, implies all moments exist.

2. X 6=const.

Example:

• Gaussian: X ∼ N (0, 1)⇒ ψX(λ) = λ2

2 .

• Example of R.V. such that ψX(λ) does not exist: X = Z3 with Z ∼ Gaussian. ThenψX(λ) =∞,∀λ] 6= 0.

Theorem 13.2 (Properties of ψX).

1. ψX is convex;

2. ψX is continuous;

3. ψX is infinitely differentiable and

ψ′X(λ) =E[XeλX ]

E[eλX ]= e−ψX(λ)E[XeλX ].

In particular, ψX(0) = 0, ψ′X(0) = E [X].

151

Page 152: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

4. If a ≤ X ≤ b a.s., then a ≤ ψ′X ≤ b;

5. Conversely, ifA = inf

λ∈Rψ′X(λ), B = sup

λ∈Rψ′X(λ),

then A ≤ X ≤ B a.s.;

6. ψX is strictly convex, and consequently, ψ′X is strictly increasing.

7. Chernoff bound:P (X ≥ γ) ≤ exp(−λγ + ψX(λ)), λ ≥ 0.

Remark 13.3. The slope of log MGF encodes the range of X. Indeed, 4) and 5) of Theorem 13.2together show that the smallest closed interval containing the support of PX equals (closure of) therange of ψ′X . In other words, A and B coincide with the essential infimum and supremum (min andmax of RV in the probabilistic sense) of X respectively,

A = essinf X , supa : X ≥ a a.s.B = esssupX , infb : X ≤ b a.s.

Proof. Note: 1–4 can be proved right now. 7 is the usual Chernoff bound. The proof of 5–6 relieson Theorem 13.4, which can be skipped for now.

1. Fix θ ∈ (0, 1). Recall Holder’s inequality:

E[|UV |] ≤ ‖U‖p‖V ‖q, for p, q ≥ 1,1

p+

1

q= 1

where the Lp-norm of RV is defined by ‖U‖p = (E|U |p)1/p. Applying to E[e(θλ1+θλ2)X ] withp = 1/θ, q = 1/θ, we get

E[exp((λ1/p+ λ2/q)X)] ≤ ‖ exp(λ1X/p)‖p‖ exp(λ2X/q)‖q = E[exp(λ1X)]θE[exp(λ2X)]θ,

i.e., eψX(θλ1+θλ2) ≤ eψX(λ1)θeψX(λ2)θ.

2. By our assumptions on X, domain of ψX is R, and by the fact that convex function must becontinuous on the interior of its domain, we have that ψX is continuous on R.

3. Be careful when exchanging the order of differentiation and expectation.

Assume λ > 0 (similar for λ ≤ 0).First, we show that E[|XeλX |] exists. Since

e|X| ≤ eX + e−X

|XeλX | ≤ e|(λ+1)X| ≤ e(λ+1)X + e−(λ+1)X

by assumption on X, both of the summands are absolutely integrable in X. Therefore bydominated convergence theorem (DCT), E[|XeλX |] exists and is continuous in λ.

152

Page 153: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Second, by the existence and continuity of E[|XeλX |], u 7→ E[|XeuX |] is integrable on [0, λ],we can switch order of integration and differentiation as follows:

eψX(λ) = E[eλX ] = E[1 +

∫ λ

0XeuXdu

]Fubini

= 1 +

∫ λ

0E[XeuX

]du

⇒ ψ′X(λ)eψX(λ) = E[XeλX ]

thus ψ′X(λ) = e−ψX(λ)E[XeλX ] exists and is continuous in λ on R.

Furthermore, using similar application of DCT we can extend to λ ∈ C and show thatλ 7→ E[eλX ] is a holomorphic function. Thus it is infinitely differentiable.

4.

a ≤ X ≤ b⇒ ψ′X(λ) =E[XeλX ]

E[eλX ]∈ [a, b].

5. Suppose PX [X > B] > 0 (for contradiction), then PX [X > B + 2ε] > 0 for some small ε > 0.But then Pλ[X ≤ B + ε]→ 0 for λ→∞ (see Theorem 13.4.3 below). On the other hand, weknow from Theorem 13.4.2 that EPλ [X] = ψ′X(λ) ≤ B. This is not yet a contradiction, sincePλ might still have some very small mass at a very negative value. To show that this cannothappen, we first assume that B − ε > 0 (otherwise just replace X with X − 2B). Next notethat

B ≥ EPλ [X] = EPλ [X1X<B−ε] + EPλ [X1B−ε≤X≤B+ε] + EPλ [X1X>B+ε]

≥ EPλ [X1X<B−ε] + EPλ [X1X>B+ε]

≥ − EPλ [|X|1X<B−ε] + (B + ε)Pλ[X > B + ε]︸ ︷︷ ︸→1

(13.4)

therefore we will obtain a contradiction if we can show that EPλ [|X|1X<B−ε]→ 0 as λ→∞.To that end, notice that convexity of ψX implies that ψ′X B. Thus, for all λ ≥ λ0 we haveψ′X(λ) ≥ B − ε

2 . Thus, we have for all λ ≥ λ0

ψX(λ) ≥ ψX(λ0) + (λ− λ0)(B − ε

2) = c+ λ(B − ε

2) , (13.5)

153

Page 154: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

for some constant c. Then,

EPλ [|X|1X < B − ε] = E[|X|eλX−ψX(λ)1X < B − ε] (13.6)

≤ E[|X|eλX−c−λ(B− ε2

)1X < B − ε] (13.7)

≤ E[|X|eλ(B−ε)−c−λ(B− ε2

)] (13.8)

= E[|X|]e−λ ε2−c → 0 λ→∞ (13.9)

where the first inequality is from (13.5) and the second from X < B − ε. Thus, the first termin (13.4) goes to 0 implying the desired contradiction.

6. Suppose ψX is not strictly convex. Since we know that ψX is convex, then ψX must be“flat” (affine) near some point, i.e., there exists a small neighborhood of some λ0 such thatψX(λ0 + u) = ψX(λ0) + ur for some r ∈ R. Then ψPλ(u) = ur for all u in small neighborhoodof zero, or equivalently EPλ [eu(X−r)] = 1 for u small. The following Lemma 13.1 impliesPλ[X = r] = 1, but then P [X = r] = 1, contradicting the assumption X 6= const.

Lemma 13.1. E[euS ] = 1 for all u ∈ (−ε, ε) then S = 0.

Proof. Expand in Taylor series around u = 0 to obtain E[S] = 0, E[S2] = 0. Alternatively, we canextend the argument we gave for differentiating ψX(λ) to show that the function z 7→ E[ezS ] isholomorphic on the entire complex plane1. Thus by uniqueness, E[euS ] = 1 for all u.

Definition 13.4 (Rate function). The rate function ψ∗X : R→ R∪ +∞ is given by the Legendre-Fenchel transform of the log MGF:

ψ∗X(γ) = supλ∈R

λγ − ψX(λ) (13.10)

Note: The maximization (13.10) is a nice convex optimization problem since ψX is strictly convex,so we are maximizing a strictly concave function. So we can find the maximum by taking thederivative and finding the stationary point. In fact, ψ∗X is the dual of ψX in the sense of convexanalysis.

Theorem 13.3 (Properties of ψ∗X).

1More precisely, if we only know that E[eλS ] is finite for |λ| ≤ 1 then the function z 7→ E[ezS ] is holomorphic inthe vertical strip z : |Rez| < 1.

154

Page 155: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

1. Let A = essinf X and B = esssupX. Then

ψ∗X(γ) =

λγ − ψX(λ) for some λ s.t. γ = ψ′X(λ), A < γ < Blog 1

P (X=γ) γ = A or B

+∞, γ < A or γ > B

2. ψ∗X is strictly convex and strictly positive except ψ∗X(E[X]) = 0.

3. ψ∗X is decreasing when γ ∈ (A,E[X]), and increasing when γ ∈ [E[X], B)

Proof. By Theorem 13.2.4, since A ≤ X ≤ B a.s., we have A ≤ ψ′X ≤ B. When γ ∈ (A,B), thestrictly concave function λ 7→ λγ − ψX(λ) has a single stationary point which achieves the uniquemaximum. When γ > B (resp. < A), λ 7→ λγ − ψX(λ) increases (resp. decreases) without bounds.When γ = B, since X ≤ B a.s., we have

ψ∗X(B) = supλ∈R

λB − log(E[exp(λX)]) = − log infλ∈R

E[exp(λ(X −B))]

= − log limλ→∞

E[exp(λ(X −B))] = − logP (X = B),

by monotone convergence theorem.By Theorem 13.2.6, since ψX is strictly convex, the derivative of ψX and ψ∗X are inverse to each

other. Hence ψ∗X is strictly convex. Since ψX(0) = 0, we have ψ∗X(γ) ≥ 0. Moreover, ψ∗X(E[X]) = 0follows from E[X] = ψ′X(0).

13.3.2 Tilted distribution

As early as in Lecture 3, we have already introduced tilting in the proof of Donsker-Varadhan’svariational characterization of divergence (Theorem 3.5). Let us formally define it now.

Definition 13.5 (Tilting). Given X ∼ P , the tilted measure Pλ is defined by

Pλ(dx) =eλx

E[eλX ]P (dx) = eλx−ψX(λ)P (dx) (13.11)

In other words, if P has a pdf p, then the pdf of Pλ is given by pλ(x) = eλx−ψX(λ)p(x).

Note: The set of distributions Pλ : λ ∈ R parametrized by λ is called a standard exponentialfamily, a very useful model in statistics. See [Bro86, p. 13].Example:

• Gaussian: P = N (0, 1) with density p(x) = 1√2π

exp(−x2/2). Then Pλ has densityexp(λx)

exp(λ2/2)1√2π

exp(−x2/2) = 1√2π

exp(−(x− λ)2/2). Hence Pλ = N (λ, 1).

• Binary : P is uniform on ±1. Then Pλ(1) = eλ

eλ+e−λ which puts more (resp. less) mass on 1

if λ > 0 (resp. < 0). Moreover, PλD−→δ1 if λ→∞ or δ−1 if λ→ −∞.

• Uniform: P is uniform on [0, 1]. Then Pλ is also supported on [0, 1] with pdf pλ(x) = λ exp(λx)eλ−1

.Therefore as λ increases, Pλ becomes increasingly concentrated near 1, and Pλ → δ1 as λ→∞.Similarly, Pλ → δ0 as λ→ −∞.

155

Page 156: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

So we see that Pλ shifts the mean of P to the right (resp. left) when λ > 0 (resp. < 0). Indeed, thisis a general property of tilting.

Theorem 13.4 (Properties of Pλ).

1. Log MGF:ψPλ(u) = ψX(λ+ u)− ψX(λ)

2. Tilting trades mean for divergence:

EPλ [X] = ψ′X(λ) ≷ EP [X] if λ ≷ 0. (13.12)

D(Pλ‖P ) = ψ∗X(ψ′X(λ)) = ψ∗X(EPλ [X]). (13.13)

3.

P (X > b) > 0⇒ ∀ε > 0, Pλ(X ≤ b− ε)→ 0 as λ→∞P (X < a) > 0⇒ ∀ε > 0, Pλ(X ≥ a+ ε)→ 0 as λ→ −∞

Therefore if Xλ ∼ Pλ, then XλD−→ essinf X = A as λ → −∞ and Xλ

D−→ esssupX = B asλ→∞.

Proof. 1. By definition. (DIY)

2. EPλ [X] = E[X exp(λX)]E[exp(λX)] = ψ′X(λ), which is strictly increasing in λ, with ψ′X(0) = EP [X].

D(Pλ‖P ) = EPλ log dPλdP = EPλ log exp(λX)

E[exp(λX)] = λEPλ [X] − ψX(λ) = λψ′X(λ) − ψX(λ) =

ψ∗X(ψ′X(λ)), where the last equality follows from Theorem 13.3.1.

3.

Pλ(X ≤ b− ε) = EP [eλX−ψX(λ)1X≤b−ε]

≤ EP [eλ(b−ε)−ψX(λ)1X≤b−ε]

≤ e−λεeλb−ψX(λ)

≤ e−λε

P [X > b]→ 0 as λ→∞

where the last inequality is due to the usual Chernoff bound (Theorem 13.2.7): P [X > b] ≤exp(−λb+ ψX(λ)).

156

Page 157: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

§ 14. Information projection and Large deviation

14.1 Large-deviation exponents

Large deviations problems deal with rare events by making statements about the tail probabilitiesof a sequence of distributions. We’re interested in the speed of decay for probabilities such asP[

1n

∑nk=1Xk ≥ γ

]for iid Xk.

In the last lecture we used Chernoff bound to obtain an upper bound on the exponent via thelog-MGF and tilting. Next we use a different method to give a formula for the exponent as a convexoptimization problem involving the KL divergence (information projection). Later in Section 14.3we shall revisit the Chernoff bound after we have computed the value of the information projection.

Theorem 14.1. Let X1, X2, . . .i.i.d.∼ P . Then for any γ ∈ R,

limn→∞

1

nlog

1

P[

1n

∑nk=1Xk > γ

] = infQ : EQ[X]>γ

D(Q‖P ) (14.1)

limn→∞

1

nlog

1

P[

1n

∑nk=1Xk ≥ γ

] = infQ : EQ[X]≥γ

D(Q‖P ) (14.2)

Furthermore, the upper bound on the probabilities hold for every n.

Example: [Binomial tail] Applying Theorem 14.1 to Xii.i.d.∼ Bern(p), we get the following bounds on

the binomial tail:

P(Bin(n, p) ≥ k) ≤ exp(−nd(k/n‖p)), k

n> p

P(Bin(n, p) ≤ k) ≤ exp(−nd(k/n‖p)), k

n< p

where we used the fact that minQ:EQ[X]≥k/nD(Q‖Bern(p)) = minq≥k/n d(q‖p) = d( kn‖p).

Proof. First note that if the events have zero probability, then both sides coincide with infinity.Indeed, if P

[1n

∑nk=1Xk > γ

]= 0, then P [X > γ] = 0. Then EQ[X] > γ ⇒ Q[X > γ] > 0⇒ Q 6

P ⇒ D(Q‖P ) =∞ and hence (14.1) holds trivially. The case for (14.2) is similar.In the sequel we assume both probabilities are nonzero. We start by proving (14.1). Set

P [En] = P[

1n

∑nk=1Xk > γ

].

Lower Bound on P [En]: Fix a Q such that EQ[X] > γ. Let Xn be iid. Then by WLLN,

Q[En] = Q

[n∑

k=1

Xk > nγ

]LLN= 1− o(1).

Now the data processing inequality gives

d(Q[En]‖P [En]) ≤ D(QXn‖PXn) = nD(Q‖P )

157

Page 158: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

And a lower bound for the binary divergence is

d(Q[En]‖P [En]) ≥ −h(Q[En]) +Q[En] log1

P [En]

Combining the two bounds on d(Q[En]‖P [En]) gives

P [En] ≥ exp

(−nD(Q‖P )− log 2

Q[En]

)(14.3)

Optimizing over Q to give the best bound:

lim supn→∞

1

nlog

1

P [En]≤ inf

Q:EQ[X]>γD(Q‖P ).

Upper Bound on P [En]: The key observation is that given any X and any event E, PX(E) > 0can be expressed via the divergence between the conditional and unconditional distribution as:log 1

PX(E) = D(PX|X∈E‖PX). Define PXn = PXn|∑Xi>nγ , under which

∑Xi > nγ holds a.s. Then

log1

P [En]= D(PXn‖PXn) ≥ inf

QXn :EQ[∑Xi]>nγ

D(QXn‖PXn) (14.4)

We now show that the last problem “single-letterizes”, i.e., reduces n = 1. Consider the followingtwo steps:

D(QXn‖PXn) ≥n∑

j=1

D(QXj‖P ) (14.5)

≥ nD(Q‖P ) , Q ,1

n

n∑

j=1

QXj , (14.6)

where the first step follows from Corollary 2.1 after noticing that PXn = Pn, and the second step isby convexity of divergence Theorem 4.1. From this argument we conclude that

infQXn :EQ[

∑Xi]>nγ

D(QXn‖PXn) = n · infQ:EQ[X]>γ

D(Q‖P ) (14.7)

infQXn :EQ[

∑Xi]≥nγ

D(QXn‖PXn) = n · infQ:EQ[X]≥γ

D(Q‖P ) (14.8)

In particular, (14.4) and (14.7) imply the required lower bound in (14.1).Next we prove (14.2). First, notice that the lower bound argument (14.4) applies equally well,

so that for each n we have

1

nlog

1

P[

1n

∑nk=1Xk ≥ γ

] ≥ infQ : EQ[X]≥γ

D(Q‖P ) .

To get a matching upper bound we consider two cases:

• Case I: P [X > γ] = 0. If P [X ≥ γ] = 0, then both sides of (14.2) are +∞. If P [X = γ] > 0,then P [

∑Xk ≥ nγ] = P [X1 = . . . = Xn = γ] = P [X = γ]n. For the right-hand side, since

D(Q‖P ) < ∞ =⇒ Q P =⇒ Q(X ≤ γ) = 1, the only possibility for EQ[X] ≥ γ is thatQ(X = γ) = 1, i.e., Q = δγ . Then infEQ[X]≥γ D(Q‖P ) = log 1

P (X=γ) .

158

Page 159: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

• Case II: P [X > γ] > 0. Since P[∑Xk ≥ γ] ≥ P[

∑Xk > γ] from (14.1) we know that

lim supn→∞

1

nlog

1

P[

1n

∑nk=1Xk ≥ γ

] ≤ infQ : EQ[X]>γ

D(Q‖P ) .

We next show that in this case

infQ : EQ[X]>γ

D(Q‖P ) = infQ : EQ[X]≥γ

D(Q‖P ) (14.9)

Indeed, let P = PX|X>γ which is well defined since P [X > γ] > 0. For any Q such

that EQ[X] ≥ γ, set Q = εQ + εP satisfies EQ[X] > γ. Then by convexity, D(Q‖P ) ≤εD(Q‖P ) + εD(P‖P ) = εD(Q‖P ) + ε log 1

P [X>γ] . Sending ε → 0, we conclude the proof

of (14.9).

14.2 Information Projection

The results of Theorem 14.1 motivate us to study the following general information projectionproblem: Let E be a convex set of distributions on some abstract space Ω, then for the distributionP on Ω, we want

infQ∈E

D(Q‖P )

Denote the minimizing distribution Q by Q∗. The next result shows that intuitively the “line”between P and optimal Q∗ is “orthogonal” to E .

E

P

Q∗

Distributions on X

Q

Theorem 14.2. Suppose ∃Q∗ ∈ E such that D(Q∗‖P ) = minQ∈E D(Q‖P ), then ∀Q ∈ ED(Q‖P ) ≥ D(Q‖Q∗) +D(Q∗‖P )

Proof. If D(Q‖P ) =∞, then we’re done, so we can assume that D(Q‖P ) <∞, which also impliesthat D(Q∗‖P ) <∞. For θ ∈ [0, 1], form the convex combination Q(θ) = θQ∗ + θQ ∈ E . Since Q∗ isthe minimizer of D(Q‖P ), then1

0 ≤ ∂

∂θ

∣∣∣∣θ=0

D(Q(θ)‖P ) = D(Q‖P )−D(Q‖Q∗)−D(Q∗‖P )

and we’re done.1This can be found by taking the derivative and matching terms (Exercise). Be careful with exchanging derivatives

and integrals. Need to use dominated convergence theorem similar as in the “local behavior of divergence” inProposition 4.1.

159

Page 160: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Remark 14.1. If we view the picture above in the Euclidean setting, the “triangle” formed by P ,Q∗ and Q (for Q∗, Q in a convex set, P outside the set) is always obtuse, and is a right triangleonly when the convex set has a “flat face”. In this sense, the divergence is similar to the squaredEuclidean distance, and the above theorem is sometimes known as a “Pythagorean” theorem.

The interesting set of Q’s that we will focus next is the “half-space” of distributions E = Q :EQ[X] ≥ γ, where X : Ω → R is some fixed function (random variable). This is justified byrelation (to be established) with the large deviation exponent in Theorem 14.1. First, we solve thisI-projection problem explicitly.

Theorem 14.3. Given a distribution P on Ω and X : Ω→ R let

A = inf ψ′X = essinf X = supa : X ≥ a P -a.s. (14.10)

B = supψ′X = esssupX = infb : X ≤ b P -a.s. (14.11)

1. The information projection problem over E = Q : EQ[X] ≥ γ has solution

minQ : EQ[X]≥γ

D(Q‖P ) =

0 γ < EP [X]

ψ∗P (γ) EP [X] ≤ γ < B

log 1P (X=B) γ = B

+∞ γ > B

(14.12)

= ψ∗P (γ)1γ ≥ EP [X] (14.13)

2. Whenever the minimum is finite, minimizing distribution is unique and equal to tilting of Palong X, namely2

dPλ = expλX − ψ(λ) · dP (14.14)

3. For all γ ∈ [EP [X], B) we have

minEQ[X]≥γ

D(Q‖P ) = infEQ[X]>γ

D(Q‖P ) = minEQ[X]=γ

D(Q‖P ) .

Note: An alternative expression is

minQ:EQ[X]≥γ

D(Q‖P ) = supλ≥0

λγ − ψX(λ) .

Proof. First case: Take Q = P .Fourth case: If EQ[X] > B, then Q[X ≥ B + ε] > 0 for some ε > 0, but P [X ≥ B + ε] = 0, since

P (X ≤ B) = 1, by Theorem 13.2.5. Hence Q 6 P =⇒ D(Q‖P ) =∞.Third case: If P (X = B) = 0, then X < B a.s. under P , and Q 6 P for any Q s.t. EQ[X] ≥ B.

Then the minimum is ∞. Now assume P (X = B) > 0. Since D(Q‖P ) < ∞ =⇒ Q P =⇒Q(X ≤ B) = 1. Therefore the only possibility for EQ[X] ≥ B is that Q(X = B) = 1, i.e., Q = δB.Then D(Q‖P ) = log 1

P (X=B) .

Second case: Fix EP [X] ≤ γ < B, and find the unique λ such that ψ′X(λ) = γ = EPλ [X] wheredPλ = exp(λX − ψX(λ))dP . This corresponds to tilting P far enough to the right to increase its

2Note that unlike previous Lecture, here P and Pλ are measures on an abstract space Ω, not necessarily on thereal line.

160

Page 161: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

mean from EPX to γ, in particular λ ≥ 0. Moreover, ψ∗X(γ) = λγ − ψX(λ). Take any Q such thatEQ[X] ≥ γ, then

D(Q‖P ) = EQ[log

dQdPλdPdPλ

](14.15)

= D(Q‖Pλ) + EQ[logdPλdP

]

= D(Q‖Pλ) + EQ[λX − ψX(λ)]

≥ D(Q‖Pλ) + λγ − ψX(λ)

= D(Q‖Pλ) + ψ∗X(γ)

≥ ψ∗X(γ), (14.16)

where the last inequality holds with equality if and only if Q = Pλ. In addition, this shows theminimizer is unique, proving the second claim. Note that even in the corner case of γ = B (assumingP (X = B) > 0) the minimizer is a point mass Q = δB, which is also a tilted measure (P∞), sincePλ → δB as λ→∞, cf. Theorem 13.4.3.

Another version of the solution, given by expression (14.13), follows from Theorem 13.3.For the third claim, notice that there is nothing to prove for γ < EP [X], while for γ ≥ EP [X]

we have just shownψ∗X(γ) = min

Q:EQ[X]≥γD(Q‖P )

while from the next corollary we have

infQ:EQ[X]>γ

D(Q‖P ) = infγ′>γ

ψ∗X(γ′) .

The final step is to notice that ψ∗X is increasing and continuous by Theorem 13.3, and hence theright-hand side infimum equalis ψ∗X(γ). The case of minQ:EQ[X]=γ is handled similarly.

Corollary 14.1. ∀Q with EQ[X] ∈ (A,B), there exists a unique λ ∈ R such that the tilteddistribution Pλ satisfies

EPλ [X] = EQ[X]

D(Pλ‖P ) ≤ D(Q‖P )

and furthermore the gap in the last inequality equals D(Q‖Pλ) = D(Q‖P )−D(Pλ‖P ).

Proof. Proceed as in the proof of Theorem 14.3, and find the unique λ s.t. EPλ [X] = ψ′X(λ) = EQ[X].Then D(Pλ‖P ) = ψ∗X(EQ[X]) = λEQ[X] − ψX(λ). Repeat the steps (14.15)-(14.16) obtainingD(Q‖P ) = D(Q‖Pλ) +D(Pλ‖P ).

Remark: For any Q, this allows us to find a tilted measure Pλ that has the same mean yetsmaller (or equal) divergence.

14.3 Interpretation of Information Projection

The following picture describes many properties of information projections.

161

Page 162: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Q 6≪ P

Q 6≪ P

bP

b

b

Q∗

=Pλ

Q

D(Pλ||P )= ψ∗(λ)

One Parameter Family

γ = A

γ = B

EQ[X ] = γλ = 0

λ > 0

Space of distributions on R

• Each set Q : EQ[X] = γ corresponds to a slice. As γ varies from A to B, the curves fill theentire space except for the corner regions.

• When γ < A or γ > B, Q 6 P .

• As γ varies, the Pλ’s trace out a curve via ψ∗(γ) = D(Pλ‖P ). This set of distributions iscalled a one parameter family, or exponential family.

Key Point: The one parameter family curve intersects each γ-slice E = Q : EQ[X] = γ“orthogonally” at the minimizing Q∗ ∈ E , and the distance from P to Q∗ is given by ψ∗(λ). To seethis, note that applying Theorem 14.2 to the convex set E gives us D(Q‖P ) ≥ D(Q‖Q∗) +D(Q∗‖P ).Now thanks to Corollary 14.1, we in fact have equality D(Q‖P ) = D(Q‖Q∗) + D(Q∗‖P ) andQ∗ = Pλ for some tilted measure.

Chernoff bound revisited: The proof of upper bound in Theorem 14.1 is via the definition ofinformation projection. Theorem 14.3 shows that the value of the information projection coincideswith the rate function (conjugate of log-MGF). This shows the optimality of the Chernoff bound(recall Theorem 13.2.7). Indeed, we directly verify this for completeness: For all λ ≥ 0,

P

[n∑

k=1

Xk ≥ nγ]≤ e−nγλ(EP [eλX ])n = e−n(λγ−ψX(λ))

where we used iid Xk’s in the expectation. Optimizing over λ ≥ 0 to get the best upper bound:

supλ≥0

λγ − ψX(λ) = supλ∈R

λγ − ψX(λ) = ψ∗X(γ)

where the first equality follows since γ ≥ EP [X], therefore λ 7→ λγ−ψX(λ) is increasing when λ ≤ 0.Hence

P

[n∑

k=1

Xk ≥ nγ]≤ e−nψ∗X(γ). (14.17)

Remark: The Chernoff bound is tight precisely because, from information projection, the lowerbound showed that the best change of measure is to change to the tilted measure Pλ.

162

Page 163: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

14.4 Generalization: Sanov’s theorem

Theorem 14.4 (Sanov’s Theorem). Consider observing n samples X1, . . . , Xn ∼ iid P . Let P bethe empirical distribution, i.e., P = 1

n

∑nj=1 δXj . Let E be a convex set of distributions. Then under

regularity conditions on E and P we have

P[P ∈ E ] = e−nminQ∈E D(Q‖P )+o(n)

Note: Examples of regularity conditions: space X is finite and E is closed with non-empty interior;space X is Polish and the set E is weakly closed and has non-empty interior.

Proof sketch. The lower bound is proved as in Theorem 14.1: Just take an arbitrary Q ∈ E andapply a suitable version of WLLN to conclude Qn[P ∈ E ] = 1 + o(1).

For the upper bound we can again adapt the proof from Theorem 14.1. Alternatively, we canwrite the convex set E as an intersection of half spaces. Then we’ve already solved the problemfor half-spaces Q : EQ[X] ≥ γ. The general case follows by the following consequence ofTheorem 14.2: if Q∗ is projection of P onto E1 and Q∗∗ is projection of Q∗ on E2, then Q∗∗ is alsoprojection of P onto E1 ∩ E2:

D(Q∗∗‖P ) = minQ∈E1∩E2

D(Q‖P )⇐D(Q∗‖P ) = minQ∈E1 D(Q‖P )

D(Q∗∗‖Q∗) = minQ∈E2 D(Q‖Q∗)

(Repeated projection property)Indeed, by first tilting from P to Q∗ we find

P [P ∈ E1 ∩ E2] ≤ 2−nD(Q∗‖P )Q∗[P ∈ E1 ∩ E2]

≤ 2−nD(Q∗‖P )Q∗[P ∈ E2]

and from here proceed by tilting from Q∗ to Q∗∗ and note that D(Q∗‖P )+D(Q∗∗‖Q∗) = D(Q∗∗‖P ).

Remark: Sanov’s theorem tells us the probability that, after observing n iid samples of adistribution, the empirical distribution is still far away from the true distribution, is exponentiallysmall.

163

Page 164: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

§ 15. Hypothesis testing asymptotics II

Setup:

H0 : Xn ∼ PXn H1 : Xn ∼ QXn (i.i.d.)

test PZ|Xn : X n → 0, 1specification: 1− α = π

(n)1|0 ≤ 2−nE0 β = π

(n)0|1 ≤ 2−nE1

Bounds:

• achievability (Neyman Pearson)

α = 1− π1|0 = P [Fn > τ ], β = π0|1 = Q[Fn > τ ]

• converse (strong): from Theorem 12.5:

∀(α, β) achievable, α− γβ ≤ P [Fn > log γ] (15.1)

where

F = logdPXn

dQXn(Xn),

15.1 (E0, E1)-Tradeoff

Goal:π1|0 = 1− α ≤ 2−nE0 , π0|1 = β ≤ 2−nE1 .

Our goal in the Chernoff regime is to find the best tradeoff, which we formally define as follows(compare to Stein’s exponent in Lecture 13)

E∗1(E0) , supE1 : ∃n0, ∀n ≥ n0,∃PZ|Xn s.t. α > 1− 2−nE0 , β < 2−nE1

= lim infn→∞

1

nlog

1

β1−2−nE0 (Pn, Qn)

Define

T = logdQ

dP(X), Tk = log

dQ

dP(Xk), thus log

dQn

dPn(Xn) =

n∑

k=1

Tk

Log MGF of T under P (again assumed to be finite and also T 6= const since P 6= Q):

ψP (λ) = logEP [eλT ]

= log∑

x

P (x)1−λQ(x)λ = log

∫(dP )1−λ(dQ)λ

ψ∗P (θ) = supλ∈R

θλ− ψP (λ)

164

Page 165: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Note that since ψP (0) = ψP (1) = 0 from convexity ψP (λ) is finite on 0 ≤ λ ≤ 1. Furthermore,assuming P Q and Q P we also have that λ 7→ ψP (λ) continuous everywhere on [0, 1] (on (0, 1) it follows from convexity, but for boundary points we need more detailed arguments).Consequently, all the results in this section apply under just the conditions of P Q and Q P .However, since in previous lecture we were assuming that log-MGF exists for all λ, we will onlypresent proofs under this extra assumption.

Theorem 15.1. Let P Q, Q P , then

E0(θ) = ψ∗P (θ), E1(θ) = ψ∗P (θ)− θ (15.2)

parametrized by −D(P‖Q) ≤ θ ≤ D(Q‖P ) characterizes the best exponents on the boundary ofachievable (E0, E1).

Note: The geometric interpretation of the above theorem is shown in the following picture, which relyon the properties of ψP (λ) and ψ∗P (θ). Note that ψP (0) = ψP (1) = 0. Moreover, by Theorem 13.3(Properties of ψ∗X), θ 7→ E0(θ) is increasing, θ 7→ E1(θ) is decreasing.

Remark 15.1 (Renyi divergence). Renyi defined a family of divergence indexed by λ 6= 1

Dλ(P‖Q) ,1

λ− 1logEQ

[(dP

dQ

)λ]≥ 0,

which generalizes Kullback-Leibler divergence since Dλ(P‖Q)λ→1−−−→ D(P‖Q). Note that ψP (λ) =

(λ− 1)Dλ(Q‖P ) = −λD1−λ(P‖Q). This provides another explanation that ψP is negative between0 and 1, and the slope at endpoints is: ψ′P (0) = −D(P‖Q) and ψ′P (1) = D(Q‖P ).

Corollary 15.1 (Bayesian criterion). Fix a prior (π0, π1) such that π0 + π1 = 1 and 0 < π0 < 1.Denote the optimal Bayesian (average) error probability by

P ∗e (n) , infPZ|Xn

π0π1|0 + π1π0|1

with exponent

E , limn→∞

1

nlog

1

P ∗e (n).

ThenE = max

θmin(E0(θ), E1(θ)) = ψ∗P (0) = − inf

λ∈[0,1]ψP (λ),

regardless of the prior, and ψ∗P (0) , C(P,Q) is called the Chernoff exponent.

165

Page 166: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Remark 15.2 (Bhattacharyya distance). There is an important special case in which Chernoffexponent simplifies. Instead of i.i.d. observations, consider independent, but not identicallydistributed observations. Namely, suppose that two hypotheses correspond to two different stringsxn and xn over a finite alphabet X . The hypothesis tester observes Y n = (Yj , j = 1, . . . n) obtainedby applying one of the strings to the input of the memoryless channel PY |X (the alphabet Y doesnot need to be finite, but we assume this below). Extending Corollary it can be shown, that in thiscase optimal probability P ∗e (xn, xn) has (Chernoff) exponent1

E = − infλ∈[0,1]

1

n

n∑

t=1

log∑

y∈YPY |X(y|xt)λPY |X(y|xt)1−λ .

If |X | = 2 and if compositions of xn and xn are equal (!), the expression is invariant under λ↔ (1−λ)and thus from convexity in λ we infer that λ = 1

2 is optimal, yielding E = 1ndB(xn, xn), where

dB(xn, xn) = −n∑

t=1

log∑

y∈Y

√PY |X(y|xt)PY |X(y|xt)

is known as Bhattacharyya distance between codewords xn and xn. Without the two assumptionsstated, dB(·, ·) does not necessarily give the optimal error exponent. We do, however, always havethe bounds

1

42−2dB(xn,xn) ≤ P ∗e (xn, xn) ≤ 2−dB(xn,xn)

with the upper bound being the more tight the more joint composition of (xn, xn) resembles that of(xn, xn).

Proof of Theorem 15.1. The idea is to apply the large deviation theory to iid sum∑n

k=1 Tk. Specif-ically, let’s rewrite the bounds in terms of T :

• Achievability (Neyman Pearson)

let τ = −nθ, π(n)1|0 = P

[n∑

k=1

Tk ≥ nθ]

π(n)0|1 = Q

[n∑

k=1

Tk < nθ

]

• Converse (strong): from (15.1)

let γ = 2−nθ, π1|0 + 2−nθπ0|1 ≥ P[

n∑

k=1

Tk ≥ nθ]

Achievability: Using Neyman Pearson test, for fixed τ = −nθ, apply the large deviationtheorem:

1− α = π(n)1|0 = P

[n∑

k=1

Tk ≥ nθ]

= 2−nψ∗P (θ)+o(n), for θ ≥ EPT = −D(P‖Q)

β = π(n)0|1 = Q

[n∑

k=1

Tk < nθ

]= 2−nψ

∗Q(θ)+o(n), for θ ≤ EQT = D(Q‖P )

1In short, the tilting parameter λ does not need to change between coordinates t corresponding to different valuesof (xt, xt).

166

Page 167: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Notice that by the definition of T we have

ψQ(λ) = logEQ[eλ log(Q/P )] = logEP [e(λ+1) log(Q/P )] = ψP (λ+ 1)

⇒ ψ∗Q(θ) = supλ∈R

θλ− ψP (λ+ 1) = ψ∗P (θ)− θ

thus (E0, E1) in (15.2) is achievable.Converse: We want to show that any achievable (E0, E1) pair must be below the curve

(E0(θ), E1(θ)) in the above Neyman-Pearson test with parameter θ. Apply the strong conversebound we have:

2−nE0 + 2−nθ2−nE1 ≥ 2−nψ∗P (θ)+o(n)

⇒min(E0, E1 + θ) ≤ ψ∗P (θ), ∀n, θ,−D(P‖Q) ≤ θ ≤ D(Q‖P )

⇒ either E0 ≤ ψ∗P (θ) or E1 ≤ ψ∗P (θ)− θ

15.2 Equivalent forms of Theorem 15.1

Alternatively, the optimal (E0, E1)-tradeoff can be stated in the following equivalent forms:

Theorem 15.2. 1. The optimal exponents are given (parametrically) in terms of λ ∈ [0, 1] as

E0 = D(Pλ‖P ), E1 = D(Pλ‖Q) (15.3)

where the distribution Pλ2 is tilting of P along T , cf. (14.14), which moves from P0 = P to

P1 = Q as λ ranges from 0 to 1:

dPλ = (dP )1−λ(dQ)λ exp−ψP (λ).

2. Yet another characterization of the boundary is

E∗1(E0) = minQ′:D(Q′‖P )≤E0

D(Q′‖Q) , 0 ≤ E0 ≤ D(Q‖P ) (15.4)

Proof. The first part is verified trivially. Indeed, if we fix λ and let θ(λ) , EPλ [T ], then from (13.13)we have

D(Pλ‖P ) = ψ∗P (θ) ,

whereas

D(Pλ‖Q) = EPλ

[log

dPλdQ

]= EPλ

[log

dPλdP

dP

dQ

]= D(Pλ‖P )− EPλ [T ] = ψ∗P (θ)− θ .

Also from (13.12) we know that as λ ranges in [0, 1] the mean θ = EPλ [T ] ranges from −D(P‖Q) toD(Q‖P ).

To prove the second claim (15.4), the key observation is the following: Since Q is itself a tiltingof P along T (with λ = 1), the following two families of distributions

dPλ = expλT − ψP (λ) · dPdQλ′ = expλ′T − ψQ(λ′) · dQ

2This is called a geometric mixture of P and Q.

167

Page 168: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

are in fact the same family with Qλ′ = Pλ′+1.Now, suppose that Q∗ achieves the minimum in (15.4) and that Q∗ 6= Q, Q∗ 6= P (these cases

should be verified separately). Note that we have not shown that this minimum is achieved, but itwill be clear that our argument can be extended to the case of when Q′n is a sequence achieving theinfimum. Then, on one hand, obviously

D(Q∗‖Q) = minQ′:D(Q′‖P )≤E0

D(Q′‖Q) ≤ D(P‖Q)

On the other hand, since E0 ≤ D(Q‖P ) we also have

D(Q∗‖P ) ≤ D(Q‖P ) .

Therefore,

EQ∗ [T ] = EQ∗[log

dQ∗

dP

dQ

dQ∗

]= D(Q∗‖P )−D(Q∗‖Q) ∈ [−D(P‖Q), D(Q‖P )] . (15.5)

Next, we have from Corollary 14.1 that there exists a unique Pλ with the following three properties:3

EPλ [T ] = EQ∗ [T ]

D(Pλ‖P ) ≤ D(Q∗‖P )

D(Pλ‖Q) ≤ D(Q∗‖Q)

Thus, we immediately conclude that minimization in (15.4) can be restricted to Q∗ belonging to thefamily of tilted distributions Pλ, λ ∈ R. Furthermore, from (15.5) we also conclude that λ ∈ [0, 1].Hence, characterization of E∗1(E0) given by (15.3) coincides with the one given by (15.4).

3Small subtlety: In Corollary 14.1 we ask EQ∗ [T ] ∈ (A,B). But A,B – the essential range of T – depend on thedistribution under which the essential range is computed, cf. (14.10). Fortunately, we have Q P and P Q, soessential range is the same under both P and Q. And furthermore (15.5) implies that EQ∗ [T ] ∈ (A,B).

168

Page 169: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Note: Geometric interpretation of (15.4) is as follows: As λ increases from 0 to 1, or equivalently,θ increases from −D(P‖Q) to D(Q‖P ), the optimal distribution traverses down the curve. Thiscurve is in essense a geodesic connecting P to Q and exponents E0,E1 measure distances to P andQ. It may initially sound strange that the sum of distances to endpoints actually varies along thegeodesic, but it is a natural phenomenon: just consider the unit circle with metric induced by theambient Euclidean metric. Than if p and q are two antipodal points, the distance from intermediatepoint to endpoints do not sum up to d(p, q) = 2.

15.3* Sequential Hypothesis Testing

Review: Filtrations, stopping times

• A sequence of nested σ-algebras F0 ⊂ F1 ⊂ F2 · · · ⊂ Fn · · · ⊂ F is called a filtrationof F .

• A random variable τ is called a stopping time of a filtration Fn if a) τ is valued inZ+ and b) for every n ≥ 0 the event τ ≤ n ∈ Fn.

• The σ-algebra Fτ consists of all events E such that E ∩ τ ≤ n ∈ Fn for all n ≥ 0.

• When Fn = σX1, . . . , Xn the interpretation is that τ is a time that can bedetermined by causally observing the sequence Xj , and random variables measurablewith respect to Fτ are precisely those whose value can be determined on the basisof knowing (X1, . . . , Xτ ).

• Let Mn be a martingale adapted to Fn, i.e. Mn is Fn-measurable and E[Mn|Fk] =Mmin(n,k). Then Mn = Mmin(n,τ) is also a martingale. If collection Mn is uniformlyintegrable then

E[Mτ ] = E[M0] .

• For more details, see [C11, Chapter V].

169

Page 170: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Different realizations of Xk are informative to different levels, the total “information” we receivefollows a random process. Therefore, instead of fixing the sample size n, we can make n a stoppingtime τ , which gives a “better” (E0, E1) tradeoff. Solution is the concept of sequential test:

• Informally: Sequential test Z at each step declares either “H0”, “H1” or “give me one moresample”.

• Rigorous definition is as follows: A sequential hypothesis test is a stopping time τ of thefiltration Fn , σX1, . . . , Xn and a random variable Z ∈ 0, 1 measurable with respect toFτ .

• Each sequential test has the following performance metrics:

α = P[Z = 0], β = Q[Z = 0] (15.6)

l0 = EP[τ ], l1 = EQ[τ ] (15.7)

The easiest way to see why sequential tests may be dramatically superior to fixed-sample-sizetests is the following example: Consider P = 1

2δ0 + 12δ1 and Q = 1

2δ0 + 12δ−1. Since P 6⊥ Q, we

also have Pn 6⊥ Qn. Consequently, no finite-sample-size test can achieve zero error under bothhypotheses. However, an obvious sequential test (wait for the first appearance of ±1) achieves zeroerror probability with finite average number of samples (2) under both hypotheses. This advantageis also very clear in the achievable error exponents as the next figure shows.

Theorem 15.3 (Wald). Assume bounded LLR:4

∣∣∣ logP (x)

Q(x)

∣∣∣ ≤ c0,∀x

where c0 is some positive constant. If the error probabilities satisfy:

π1|0 ≤ 2−l0E0 , π0|1 ≤ 2−l1E1

for large l0, l1, then the following inequality for the exponents holds

E0E1 ≤ D(P‖Q)D(Q‖P ).

4This assumption is satisfied for discrete distributions on finite spaces.

170

Page 171: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

with optimal boundary achieved by the sequential probability ratio test SPRT(A,B) (A, B are largepositive numbers) defined as follows:

τ = infn : Sn ≥ B or Sn ≤ −A

Z =

0, if Sτ ≥ B1, if Sτ < −A

where

Sn =n∑

k=1

logP (Xk)

Q(Xk)

is the log likelihood function of the first k observations.

Note: (Intuition on SPRT) Under the usual hypothesis testing setup, we collect n samples, evaluatethe LLR Sn, and compare it to the threshold to give the optimal test. Under the sequential setupwith iid data, Sn : n ≥ 1 is a random walk, which has positive (resp. negative) drift D(P‖Q)(resp. −D(Q‖P )) under the null (resp. alternative)! SPRT test simply declare P if the random walkcrosses the upper boundary B, or Q if the random walk crosses the upper boundary −A.

Proof. As preparation we show two useful identities:

• For any stopping time with EP [τ ] <∞ we have

EP [Sτ ] = EP [τ ]D(P‖Q) (15.8)

and similarly, if EQ[τ ] <∞ then

EQ[Sτ ] = −EQ[τ ]D(Q‖P ) .

To prove these, notice thatMn = Sn − nD(P‖Q)

is clearly a martingale w.r.t. Fn. Consequently,

Mn ,Mmin(τ,n)

is also a martingale. ThusE[Mn] = E[M0] = 0 ,

or, equivalently,E[Smin(τ,n)] = E[min(τ, n)]D(P‖Q) . (15.9)

This holds for every n ≥ 0. From boundedness assumption we have |Sn| ≤ nc and thus|Smin(n,τ)| ≤ nτ , implying that collection Smin(n,τ), n ≥ 0 is uniformly integrable. Thus, wecan take n→∞ in (15.9) and interchange expectation and limit safely to conclude (15.8).

• Let τ be a stopping time. The Radon-Nikodym derivative of P w.r.t. Q on σ-algebra Fτ isgiven by

dP|FτdQ|Fτ

= expSτ .

Indeed, what we need to verify is that for every event E ∈ Fτ we have

EP [1E ] = EQ[expSτ1E ] (15.10)

171

Page 172: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

To that end, consider a decomposition

1E =∑

n≥0

1E∩τ=n .

By monotone convergence theorem applied to (15.10) it is sufficient to verify that for every n

EP [1E∩τ=n] = EQ[expSτ1E∩τ=n] . (15.11)

This, however, follows from the fact that E ∩ τ = n ∈ Fn anddP|FndQ|Fn

= expSn by the very

definition of Sn.

We now proceed to the proof. For achievability we apply (15.10) to infer

π1|0 = P[Sτ ≤ −A]

= EQ[expSτ1Sτ ≤ −A]≤ e−A

Next, we denot τ0 = infn : Sn ≥ B and observe that τ ≤ τ0, whereas expectation of τ0 we estimatefrom (15.8):

EP [τ ] ≤ EP [τ0] = EP [Sτ0 ] ≤ B + c0 ,

where in the last step we used the boundedness assumption to infer

Sτ0 ≤ B + c0

Thus

l0 = EP[τ ] ≤ EP[τ0] ≤ B + c0

D(P‖Q)≈ B

D(P‖Q)for large B

Similarly we can show π0|1 ≤ e−B and l1 ≤ AD(Q‖P ) for large A. Take B = l0D(P‖Q), A =

l1D(Q‖P ), this shows the achievability.

Converse: Assume (E0, E1) achievable for large l0, l1 and apply data processing inequality ofdivergence:

d(P(Z = 1)‖Q(Z = 1)) ≤ D(P‖Q)∣∣Fτ

= EP [Sτ ] = EP[τ ]D(P‖Q) from (15.8)

= l0D(P‖Q)

notice that for l0E0 and l1E1 large, we have d(P(Z = 1)‖Q(Z = 1)) ≈ l1E1, therefore l1E1 .l0D(P‖Q). Similarly we can show that l0E0 . l1D(Q‖P ), finally we have

E0E1 ≤ D(P‖Q)D(Q‖P ), as l0, l1 →∞

172

Page 173: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Part IV

Channel coding

173

Page 174: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

§ 16. Channel coding

Objects we have studied so far:

1. PX - Single distribution, data compression

2. PX vs QX - Comparing two distributions, Hypothesis testing

3. Now: PY |X : X → Y (called a channel, dealing with a collection of distributions).

16.1 Channel Coding

Definition 16.1. An M -code for PY |X is an encoder/decoder pair (f, g) of (randomized) functions1

• encoder f : [M ]→ X

• decoder g : Y → [M ] ∪ e

Here [M ] , 1, . . . ,M.In most cases f and g are deterministic functions, in which case we think of them (equivalently)

in terms of codewords, codebooks, and decoding regions

• ∀i ∈ [M ] : ci = f(i) are codewords, the collection C = c1, . . . , cM is called a codebook.

• ∀i ∈ [M ], Di = g−1(i) is the decoding region for i.

b

bb

b

b

b

bb

bb

b

c1

cM

D1

DM

Figure 16.1: When X = Y, the decoding regions can be pictured as a partition of the space, eachcontaining one codeword.

Note: The underlying probability space for channel coding problems will always be

Wf−→ X

PY |X−→ Yg−→ W

1For randomized encoder/decoders, we identify f and g as probability transition kernels PX|W and PW |Y .

174

Page 175: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

When the source alphabet is [M ], the joint distribution is given by:

(general) PWXY W (m, a, b, m) =1

MPX|W (a|m)PY |X(b|a)PW |Y (m|b)

(deterministic f, g) PWXY W (m, cm, b, m) =1

MPY |X(b|cm)1b ∈ Dm

Throughout the notes, these quantities will be called:

• W - Original message

• X - (Induced) Channel input

• Y - Channel output

• W - Decoded message

16.1.1 Performance Metrics

Three ways to judge the quality of a code in terms of error probability:

1. Average error probability: Pe , P[W 6= W ].

2. Maximum error probability: Pe,max , maxm∈[M ] P[W 6= m|W = m].

3. Bit error rate: in the special case when M = 2k, we identify W with a k-bit string Sk ∈ Fk2.

Then the bit error rate is Pb ,1k

∑kj=1 P[Sj 6= Sj ], which means the average fraction of errors

in a k-bit block. It is also convenient to introduce in this case the Hamming distance

dH(Sk, Sk) , #i : Si 6= Sj .

Then, the bit-error rate becomes the normalized expected Hamming distance:

Pb =1

kE[dH(Sk, Sk)] .

To distinguish the bit error rate Pb from the previously defined Pe (resp. Pe,max), we will alsocall the latter the average (resp. max) block error rate.

The most typical metric is average probability of error, but the others will be used occasionally inthe course as well. By definition, Pe ≤ Pe,max. Therefore the maximum error probability is a morestringent criterion which offers uniform protection for all codewords.

16.1.2 Fundamental Limit for a given PY |X

Definition 16.2. A code (f, g) is an (M, ε)-code for PY |X if f : [M ]→ X , g : Y → [M ] ∪ e, andPe ≤ ε. Similarly, an (M, ε)max-code must satisfy Pe,max ≤ ε.

Then the fundamental limits of channel codes are defined as

M∗(ε) = maxM : ∃(M, ε)-codeM∗max(ε) = maxM : ∃(M, ε)max-code

Remark: log2M∗ gives the maximum number of bits that we can pump through a channel PY |X

while still guaranteeing the error probability (in the appropriate sense) is at most ε.

175

Page 176: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Example: The random transformation BSC(n, δ) (binary symmetric channel) is defined as

X = 0, 1nY = 0, 1n

where the input Xn is contaminated by additive noise Zn ⊥⊥ Xn and the channel outputs

Y n = Xn ⊕ Zn

where Zni.i.d.∼ Bern(δ). Pictorially, the BSC(n, δ) channel takes a binary sequence length n and flips

the bits independently with probability δ:

PY n|Xn

0 1 0 0 1 1 0 0 1 1

1 1 0 1 0 1 0 0 0 1

Question: When δ = 0.11, n = 1000, what is the maximum number of bits you can send withPe ≤ 10−3?

Ideas:

0. Can one send 1000 bits with Pe ≤ 10−3? No and apparently the probability that at least onebit is flipped is Pe = 1− (1− δ)n ≈ 1. This implies that uncoded transmission does not meetour objective and coding is necessary – tradeoff: reduce the number of bits to send, increasethe probability of success.

1. Take each bit and repeat it l times (l-repetition code).

0 0 1 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0

f

l

Decoding by majority vote, the probability of error of this scheme is Pe ≈ kP[Binom(l, δ) > l/2]and kl ≤ n = 1000, which for Pe ≤ 10−3 gives l = 21, k = 47 bits.

2. Reed-Muller Codes (1, r). Interpret a message a0, . . . , ar−1 ∈ Fr2 as the polynomial (in thiscase, a degree-1 and (r − 1)-variate polynomial)

∑r−1i=1 aixi + a0. Then codewords are formed

by evaluating the polynomial at all possible xr−1 ∈ Fr−12 . This code, which maps r bits to

2r−1 bits, has minimum distance 2r−2. For r = 7, there is a [64, 7, 32] Reed-Muller code andit can be shown that the MAP decoder of this code passed over the BSC(n = 64, δ = 0.11)achieves probability of error ≤ 6 · 10−6. Thus, we can use 16 such blocks (each carrying 7data bits and occupying 64 bits on the channel) over the BSC(1024, δ), and still have (unionbound) overall Pe . 10−4. This allows us to send 7 · 16 = 112 bits in 1024 channel uses, morethan double that of the repetition code.

176

Page 177: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

3. Shannon’s theorem (to be shown soon) tells us that over memoryless channel of blocklength nthe fundamental limit satisfies

logM∗ = nC + o(n) (16.1)

as n → ∞ and for arbitrary ε ∈ (0, 1). Here C = maxX I(X1;Y1) is the capacity of thesingle-letter channel. In our case we have

I(X;Y ) = maxPX

I(X;X + Z) = log 2− h(δ) ≈ 1

2bit

So Shannon’s expansion (16.1) can be used to predict (non-rigorously, of course) that it shouldbe possible to send around 500 bits reliably. As it turns out, for this blocklength this is notquite possible.

4. Even though calculating logM∗ is not computationally feasible (searching over all codebooksis doubly exponential in block length n), we can find bounds on logM∗ that are easy tocompute. We will show later in the course that in fact, for BSC(1000, .11)

414 ≤ logM∗ ≤ 416

5. The first codes to approach the bounds on logM∗ are called Turbo codes (after the turbochargerengine, where the exhaust is fed back in to power the engine). This class of codes is known assparse graph codes, of which LDPC codes are particularly well studied. As a rule of thumb,these codes typically approach 80 . . . 90% of logM∗ when n ≈ 103 . . . 104.

16.2 Basic Results

Recall that the object of our study is M∗(ε) = maxM : ∃(M, ε)-code.

16.2.1 Determinism

1. Given any encoder f : [M ]→ X , the decoder that minimizes Pe is the Maximum A Posteriori(MAP) decoder, or equivalently, the Maximal Likelihood (ML) decoder, since the codewordsare equiprobable (W is uniform):

g∗(y) = argmaxm∈[M ]

P [W = m|Y = y]

= argmaxm∈[M ]

P [Y = y|W = m]

Furthermore, for a fixed f , the MAP decoder g is deterministic

2. For given M , PY |X , the Pe-minimizing encoder is deterministic.

Proof. Let f : [M ]→ X be a random transformation. We can always represent randomizedencoder as deterministic encoder with auxiliary randomness. So instead of f(a|m), considerthe deterministic encoder f(m,u), that receives external randomness u. Then looking at allpossible values of the randomness,

Pe = P [W 6= W ] = EU [P[W 6= W |U ] = EU [Pe(U)].

Each u in the expectation gives a deterministic encoder, hence there is a deterministic encoderthat is at least as good as the average of the collection, i.e., ∃u0 s.t. Pe(u0) ≤ P[W 6= W ]

177

Page 178: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Remark: If instead we use maximal probability of error as our performance criterion, thenthese results don’t hold; randomized encoders and decoders may improve performance. Example:consider M = 2 and we are back to the binary hypotheses testing setup. The optimal decoder (test)that minimizes the maximal Type-I and II error probability, i.e., max1−α, β, is not deterministic,if max1− α, β is not achieved at a vertex of the region R(P,Q).

16.2.2 Bit Error Rate vs Block Error Rate

Now we give a bound on the average probability of error in terms of the bit error probability.

Theorem 16.1. For all (f, g), M = 2k =⇒ Pb ≤ Pe ≤ kPbNote: The most often used direction Pb ≥ 1

kPe is rather loose for large k.

Proof. Recall that M = 2k gives us the interpretation of W = Sk sequence of bits.

1

k

k∑

i=1

1Si 6= Si ≤ 1Sk 6= Sk ≤k∑

i=1

1Si 6= Si,

where the first inequality is obvious and the second follow from the union bound. Taking expectationof the above expression gives the theorem.

Theorem 16.2 (Assouad). If M = 2k then

Pb ≥ minP[W = c′|W = c] : c, c′ ∈ Fk2, dH(c, c′) = 1 .Proof. Let ei be a length k vector that is 1 in the i-th position, and zero everywhere else. Then

k∑

i=1

1Si 6= Si ≥k∑

i=1

1Sk = Sk + ei

Dividing by k and taking expectation gives

Pb ≥1

k

k∑

i=1

P[Sk = Sk + ei]

≥ minP[W = c′|W = c] : c, c′ ∈ Fk2, dH(c, c′) = 1 .

Similarly, we can prove the following generalization:

Theorem 16.3. If A,B ∈ Fk2 (with arbitrary marginals!) then for every r ≥ 1 we have

Pb =1

kE[dH(A,B)] ≥

(k − 1

r − 1

)Pr,min (16.2)

Pr,min , minP[B = c′|A = c] : c, c′ ∈ Fk2, dH(c, c′) = r (16.3)

Proof. First, observe that

P[dH(A,B) = r|A = a] =∑

b:dH(a,b)=r

PB|A(b|a) ≥(k

r

)Pr,min .

Next, noticedH(x, y) ≥ r1dH(x, y) = r

and take the expectation with x ∼ A, y ∼ B.

178

Page 179: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Remark: In statistics, Assouad’s Lemma is a useful tool for obtaining lower bounds on theminimax risk of an estimator. Say the data X is distributed according to Pθ parameterized by θ ∈ Rkand let θ = θ(X) be an estimator for θ. The goal is to minimize the maximal risk supθ∈Θ Eθ[‖θ− θ‖1].

A lower bound (Bayesian) to this worst-case risk is the average risk E[‖θ− θ‖1], where θ is distributedto any prior. Consider θ uniformly distributed on the hypercube 0, εk with side length ε embeddedin the space of parameters. Then

infθ

supθ∈0,εk

E[‖θ − θ‖1] ≥ kε

4min

dH(θ,θ′)=1(1− TV(Pθ, Pθ′)) . (16.4)

This can be proved using similar ideas to Theorem 16.2. WLOG, assume that ε = 1.

E[‖θ − θ‖1](a)

≥ 1

2E[‖θ − θdis‖1] =

1

2E[dH(θ, θdis)]

≥ 1

2

k∑

i=1

minθi=θi(X)

P[θi 6= θi](b)=

1

4

k∑

i=1

(1− TV(PX|θi=0, PX|θi=1))

(c)

≥ k

4min

dH(θ,θ′)=1(1− TV(Pθ, Pθ′)) .

Here θdis is the discretized version of θ, i.e. the closest point on the hypercube to θ and so(a) follows from |θi − θi| ≥ 1

21|θi−θi|>1/2 = 121θi 6=θdis,i, (b) follows from the optimal binary

hypothesis testing for θi given X, (c) follows from the convexity of TV: TV(PX|θi=0, PX|θi=1) =

TV( 12k−1

∑θ:θi=0 PX|θ,

12k−1

∑θ:θi=1 PX|θ) ≤ 1

2k−1

∑θ:θi=0 TV(PX|θ, PX|θ⊕ei) ≤ maxdH(θ,θ′)=1 TV(Pθ, Pθ′).

Alternatively, (c) also follows from by providing the extra information θ\i and allowing θi = θi(X, θ\i)

in the second line.

16.3 General (Weak) Converse Bounds

Theorem 16.4 (Weak converse).

1. Any M -code for PY |X satisfies

logM ≤ supX I(X;Y ) + h(Pe)

1− Pe

2. When M = 2k

logM ≤ supX I(X;Y )

log 2− h(Pb)

Proof. 1. Since W → X → Y → W , we have the following chain of inequalities, cf. Fano’sinequality (Theorem 5.4):

supXI(X;Y ) ≥ I(X;Y ) ≥ I(W ; W )

dpi≥ d(P[W = W ]‖ 1

M)

≥ −h(P[W 6= W ]) + P[W = W ] logM

Plugging in Pe = P[W 6= W ] finishes the proof of the first part.

179

Page 180: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

2. Now Sk → X → Y → Sk. Recall from Theorem 5.1 that for iid Sn,∑I(Si; Si) ≤ I(Sk; Sk).

This gives us

supXI(X;Y ) ≥ I(X;Y ) ≥

k∑

i=1

I(Si, Si)

≥ k 1

k

∑d

(P[Si = Si]

∥∥∥1

2

)

≥ kd(

1

k

k∑

i=1

P[Si = Si]∥∥∥1

2

)

= kd

(1− Pb

∥∥∥1

2

)= k(log 2− h(Pb))

where the second line used Fano’s inequality (Theorem 5.4) for binary random variable (ordivergence data processing), and the third line used the convexity of divergence.

16.4 General achievability bounds: Preview

Remark: Regarding differences between information theory and statistics: in statistics, there isa parametrized set of distributions on a space (determined by the model) from which we try toestimate the underlying distribution or parameter from samples. In data transmission, the challengeis to choose the structure on the parameter space (channel coding) such that, upon observing asample, we can estimate the correct parameter with high probability. With this in mind, it is naturalto view

logPY |X=x

PY

as an LLR of a binary hypothesis test, where we compare the hypothesis X = x to the distributioninduced by our codebook: PY = PY |X PX (so compare ci to “everything else”). To decode, weask M different questions of this form. This motivates importance of the random variable (calledinformation density):

i(X;Y ) = logPY |X(Y |X)

PY (Y ),

where PY = PY |X PX . Note that

I(X;Y ) = E[i(X;Y )],

which is what the weak converse is based on.Shortly, we will show a result (Shannon’s Random Coding Theorem), that states: ∀PX ,

∀τ , ∃(M, ε)-code with

ε ≤ P[i(X;Y ) ≤ logM + τ ] + e−τ

Details in the next lecture.

180

Page 181: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

§ 17. Channel coding: achievability bounds

Notation: in the following proofs, we shall make use of the independent pairs (X,Y ) ⊥⊥ (X,Y )

X → Y (X : sent codeword)

X → Y (X : unsent codeword)

The joint distribution is given by:

PXYXY (a, b, a, b) = PX(a)PY |X(b|a)PX(a)PY |X(b|a).

17.1 Information density

Definition 17.1 (Information density). Given joint distribution PX,Y we define

iPXY (x; y) = logPY |X(y|x)

PY (y)= log

dPY |X=x(y)

dPY (y)(17.1)

and we define iPXY (x; y) = +∞ for all y in the singular set where PY |X=x is not absolutely continuousw.r.t. PY . We also define iPXY (x; y) = −∞ for all y such that dPY |X=x/dPY equals zero. We willalmost always abuse notation and write i(x; y) dropping the subscript PX,Y , assuming that the jointdistribution defining i(·; ·) is clear from the context.Notice that i(x; y) depends on the underlying PX and PY |X , which should be understood from thecontext.

Remark 17.1 (Intuition). Information density is a useful concept in understanding decoding. Indiscriminating between two codewords, one concerns with (as we learned in binary hypothesis

testing) the LLR, logPY |X=c1PY |X=c2

. In M -ary hypothesis testing, a similar role is played by information

density i(c1; y), which, loosely speaking, evaluates the likelihood of c1 against the average likelihood,or “everything else”, which we model by PY .

Remark 17.2 (Joint measurability). There is a measure-theoretic subtlety in (17.1): The so-definedfunction i(·; ·) may not be a measurable function on the product space X × Y . For a resolution, seeSection 2.6* and Remark 2.4 in particular.

Remark 17.3 (Alternative definition). Observe that for discrete X ,Y , (17.1) is equivalently writtenas

i(x; y) = logPX,Y (x, y)

PX(x)PY (y)= log

PX|Y (x|y)

PX(x)

For the general case, we often use the alternative definition, which is symmetric in X and Y and ismeasurable w.r.t. X × Y:

i(x; y) = logdPX,Y

dPX × PY(x, y) (17.2)

181

Page 182: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Notice a subtle difference between (17.1) and (17.2) for the continuous case: In (17.2) the Radon-Nikodym derivative is only defined up to sets of measure zero, therefore whenever PX(x) = 0 thevalue of PY (i(x, Y ) > t) is undefined. This problem does not occur with definition (17.1), and thatis why we prefer it. In any case, for discrete X , Y, or under other regularity conditions, all thedefinitions are equivalent.

Proposition 17.1 (Properties of information density).

1. E[i(X;Y )] = I(X;Y ). This justifies the name “(mutual) information density”.

2. If there is a bijective transformation (X,Y ) → (A,B), then almost surely iPXY (X;Y ) =iPAB (A;B) and in particular, distributions of i(X;Y ) and i(A;B) coincide.

3. (Conditioning and unconditioning trick) Suppose that f(y) = 0 and g(x) = 0 wheneveri(x; y) = −∞, then1

E[f(Y )] = E[exp−i(x;Y )f(Y )|X = x], ∀x (17.3)

E[g(X)] = E[exp−i(X; y)g(X)|Y = y], ∀y (17.4)

4. Suppose that f(x, y) = 0 whenever i(x; y) = −∞, then

E[f(X,Y )] = E[exp−i(X;Y )f(X,Y )] (17.5)

E[f(X,Y )] = E[exp−i(X;Y )f(X,Y )]. (17.6)

Proof. The proof is simply a change of measure. For example, to see (17.3), note

Ef(Y ) =∑

y∈YPY (y)f(y) =

y∈YPY |X(y|x)

PY (y)

PY |X(y|x)f(y)

notice that by the assumption on f(·), the summation is valid even if for some y we have thatPY |X(y|x) = 0. Similarly, E[f(x, Y )] = E[exp−i(x;Y )f(x, Y )|X = x]. Integrating over x ∼ PXgives (17.5). The rest are by interchanging X and Y .

Corollary 17.1.

P[i(x;Y ) > t] ≤ exp(−t), ∀x (17.7)

P[i(X;Y ) > t] ≤ exp(−t) (17.8)

Proof. Pick f(Y ) = 1 i(x;Y ) > t in (17.3).

Remark 17.4. We have used this trick before: For any probability measure P and any measure Q,

Q[

logdP

dQ≥ t]≤ exp(−t). (17.9)

for example, in hypothesis testing (Corollary 12.1). In data compression, we frequently used the factthat |x : logPX(x) ≥ t| ≤ exp(−t), which is also of the form (17.9) with Q = counting measure.

1Note that (17.3) holds when i(x; y) is defined as i = logdPY |XPY

, and (17.4) holds when i(x; y) is defined as

i = logdPX|YPX

. (17.5) and (17.6) hold under either of the definitions. Since in the following we shall only make use of

(17.3) and (17.5), this is another reason we adopted definition (17.1).

182

Page 183: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

17.2 Shannon’s achievability bound

Theorem 17.1 (Shannon’s achievability bound). For a given PY |X , ∀PX , ∀τ > 0, ∃(M, ε)-codewith

ε ≤ P[i(X;Y ) ≤ logM + τ ] + exp(−τ). (17.10)

Proof. Recall that for a given codebook c1, . . . , cM, the optimal decoder is MAP, or equivalently,ML, since the codewords are equiprobable:

g∗(y) = argmaxm∈[M ]

PX|Y (cm|y)

= argmaxm∈[M ]

PY |X(y|cm)

= argmaxm∈[M ]

i(cm; y).

The step of maximizing the likelihood can make analyzing the error probability difficult. Similar towhat we did in almost loss compression (e.g., Theorem 7.4), the magic in showing the followingtwo achievability bounds is to consider a suboptimal decoder. In Shannon’s bound, we consider athreshold-based suboptimal decoder g(y) as follows:

g(y) =

m, ∃!cm s.t. i(cm; y) ≥ logM + τe, o.w.

Interpretationi(cm; y) ≥ logM + τ ⇔ PX|Y (cm|y) ≥M exp(τ)PX(cm),

i.e., the likelihood of cm being the transmitted codeword conditioned on receiving y exceeds somethreshold.

For a given codebook (c1, . . . , cM ), the error probability is:

Pe(c1, . . . , cM ) = P[i(cW ;Y ) ≤ logM + τ ∪ ∃m 6= W, i(cm;Y ) > logM + τ]

where W is uniform on [M ].We generate the codebook (c1, . . . , cM ) randomly with cm ∼ PX i.i.d. for m ∈ [M ]. By symmetry,

the error probability averaging over all possible codebooks is given by:

E[Pe(c1, . . . , cM )]

= E[Pe(c1, . . . , cM )|W = 1]

= P[i(c1;Y ) ≤ logM + τ ∪ ∃m 6= 1, i(cm, Y ) > logM + τ|W = 1]

≤ P[i(c1;Y ) ≤ logM + τ |W = 1] +

M∑

m=2

P[i(cm;Y ) > logM + τ |W = 1] (union bound)

= P [i(X;Y ) ≤ logM + τ ] + (M − 1)P[i(X;Y ) > logM + τ

](random codebook)

≤ P [i(X;Y ) ≤ logM + τ ] + (M − 1) exp(−(logM + τ)) (by Corollary 17.1)

≤ P [i(X;Y ) ≤ logM + τ ] + τ) + exp(−τ)

Finally, since the error probability averaged over the random codebook satisfies the upper bound,there must exist some code allocation whose error probability is no larger than the bound.

183

Page 184: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Remark 17.5 (Typicality).

• The property of a pair (x, y) satisfying the condition i(x; y) ≥ γ can be interpreted as “jointtypicality”. Such version of joint typicality is useful when random coding is done in productspaces with cj ∼ PnX (i.e. coordinates of the codeword are iid).

• A popular alternative to the definition of typicality is to require that the empirical jointdistribution is close to the true joint distribution, i.e., Pxn,yn ≈ PXY , where

Pxn,yn(a, b) =1

n·#j : xj = a, yj = b .

This definition is natural for cases when random coding is done with cj ∼ uniform on the setxn : Pxn ≈ PX (type class).

17.3 Dependence-testing bound

The following result is a refinement of Theorem 17.1:

Theorem 17.2 (DT bound). ∀PX , ∃(M, ε)-code with

ε ≤ E[exp

−(i(X;Y )− log

M − 1

2

)+]

(17.11)

where x+ , max(x, 0).

Proof. For a fixed γ, consider the following suboptimal decoder:

g(y) =

m, for the smallest m s.t. i(cm; y) ≥ γe, o/w

Note that given a codebook c1, . . . , cM, we have by union bound

P[W 6= j|W = j] = P[i(cj ;Y ) ≤ γ|W = j] + P[i(cj ;Y ) > γ, ∃k ∈ [j − 1], s.t. i(ck;Y ) > γ]

≤ P[i(cj ;Y ) ≤ γ|W = j] +

j−1∑

k=1

P[i(ck;Y ) > γ|W = j].

Averaging over the randomly generated codebook, the expected error probability is upper bounded

184

Page 185: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

by:

E[Pe(c1, . . . , cM )] =1

M

M∑

j=1

P[W 6= j|W = j]

≤ 1

M

M∑

j=1

(P[i(X;Y ) ≤ γ] +

j−1∑

k=1

P[i(X;Y ) > γ])

= P[i(X;Y ) ≤ γ] +M − 1

2P[i(X;Y ) > γ]

= P[i(X;Y ) ≤ γ] +M − 1

2E[exp(−i(X;Y ))1 i(X;Y ) > γ] (by (17.3))

= E[1 i(X;Y ) ≤ γ+

M − 1

2exp(−i(X;Y ))1 i(X,Y ) > γ

]

= E[

min(

1,M − 1

2exp(−i(X;Y ))

)](γ = log

M − 1

2minimizes the upper bound)

= E[exp

−(i(X;Y )− log

M − 1

2

)+]

.

To optimize over γ, note the simple observation that U1E + V 1Ec ≥ minU, V , with equality iff

U ≥ V on E. Therefore for any x, y, 1[i(x; y) ≤ γ]+M−12 e−i(x;y)1[i(x; y) > γ] ≥ min(1, M−1

2 e−i(x;y)),achieved by γ = log M−1

2 regardless of x, y.

Note: Dependence-testing: The RHS of (17.11) is equivalent to the minimum error probability ofthe following Bayesian hypothesis testing problem:

H0 : X,Y ∼ PX,Y versus H1 : X,Y ∼ PXPY

prior prob.: π0 =2

M + 1, π1 =

M − 1

M + 1.

Note that X,Y ∼ PX,Y and X,Y ∼ PXPY , where X is the sent codeword and X is the unsentcodeword. As we know from binary hypothesis testing, the best threshold for the LRT to minimizethe weighted probability of error is log π1

π0.

Note: In Theorem 17.2 we have avoided minimizing over τ in Shannon’s bound (17.10) to get theminimum upper bound in Theorem 17.1. Moreover, DT bound is stronger than the best Shannon’sbound (with optimized τ).Note: Similar to the random coding achievability bound of almost lossless compression (Theorem7.4), in Theorem 17.1 and Theorem 17.2 we only need the random codewords to be pairwiseindependent.

17.4 Feinstein’s Lemma

The previous achievability results are obtained using probabilistic methods (random coding). Incontrast, the following achievability due to Feinstein uses a greedy construction. Moreover,Feinstein’s construction holds for maximal probability of error.

Theorem 17.3 (Feinstein’s lemma). ∀PX , ∀γ > 0, ∀ε ∈ (0, 1), ∃(M, ε)max-code such that

M ≥ γ(ε− P[i(X;Y ) < log γ]) (17.12)

185

Page 186: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Remark 17.6 (Comparison with Shannon’s bound). We can also interpret (17.12) as for fixed M ,there exists an (M, ε)max-code that achieves the maximal error probability bounded as follows:

ε ≤ P[i(X;Y ) < log γ] +M

γ

Take log γ = logM + τ , this gives the bound of exactly the same form in (17.10). However, thetwo are proved in seemingly quite different ways: Shannon’s bound is by random coding, whileFeinstein’s bound is by greedily selecting the codewords. Nevertheless, Feinstein’s bound is strongerin the sense that it concerns about the maximal error probability instead of the average.

Proof. Recall the goal is to find codewords c1, . . . , cM ∈ X and disjoint subsets (decoding regions)D1, . . . , DM ⊂ Y, s.t.

PY |X(Di|ci) ≥ 1− ε, ∀i ∈ [M ].

The idea is to construct a codebook of size M in a greedy way.For every x ∈ X , associate it with a preliminary decode region defined as follows:

Ex , y : i(x; y) ≥ log γNotice that the preliminary decoding regions Ex may be overlapping, and we will trim them intofinal decoding regions Dx, which will be disjoint.

We can assume that P[i(X;Y ) < log γ] ≤ ε, for otherwise the R.H.S. of (17.12) is negative andthere is nothing to prove. We first claim that there exists some c such that PY [Ec|X = c] ≥ 1− ε.Show by contradiction. Assume that ∀c ∈ X , P[i(c;Y ) ≥ log γ|X = c] < 1− ε, then average overc ∼ PX , we have P[i(X;Y ) ≥ log γ] < 1− ε, which is a contradiction.

Then we construct the codebook in the following greedy way:

1. Pick c1 to be any codeword such that PY [Ec1 |X = c1] ≥ 1− ε, and set D1 = Ec1 ;

2. Pick c2 to be any codeword such that PY [Ec2\D1|X = c2] ≥ 1− ε, and set D2 = Ec2\D1;. . .

3. Pick cM to be any codeword such that PY [EcM \ ∪M−1j=1 Dj |X = cM ] ≥ 1− ε, and set DM =

EcM \ ∪M−1j=1 Dj . We stop if no more codeword can be found, i.e., M is determined by the

stopping condition:∀c ∈ X , PY [Ex0\ ∪Mj=1 Dj |X = c] < 1− ε

Averaging the stopping condition over c ∼ PX , we have

P(i(X;Y ) ≥ log γ\Y ∈ ∪Mj=1Dj) < 1− εby union bound P (A\B) ≥ P (A)− P (B), we have

P(i(X;Y ) ≥ log γ)−M∑

j=1

PY (Dj) < 1− ε

⇒ P(i(X;Y ) ≥ log γ)− M

γ< 1− ε

where the last step makes use of the following key observation:

PY (Dj) ≤ PY (Ecj ) = PY (i(cj ;Y ) ≥ log γ) <1

γ, (by Corollary 17.1).

186

Page 187: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

§ 18. Linear codes. Channel capacity

Recall that last time we showed the following achievability bounds:

Shannon’s: Pe ≤ P [i(X;Y ) ≤ logM + τ ] + exp−τ⇑

DT: Pe ≤ E[exp

−(i(X;Y )− log

M − 1

2

)+]

Feinstein’s: Pe,max ≤ P [i(X;Y ) ≤ logM + τ ] + exp−τ

This time we shall use a shortcut to prove the above bounds and in which case Pe = Pe,max.

18.1 Linear coding

Recall the definition of Galois field from Section 9.2.

Definition 18.1 (Linear code). Let X = Y = Fnq , M = qk. Denote the codebook by C , cu : u ∈Fkq. A code f : Fkq → Fnq is a linear code if ∀u ∈ Fkq , cu = uG (row-vector convention), where

G ∈ Fk×nq is a generator matrix.

Proposition 18.1.

c ∈ C⇔ c ∈ row span of G

⇔ c ∈ KerH, for some H ∈ F(n−k)×nq s.t. HGT = 0.

Note: For linear codes, the codebook is a k-dimensional linear subspace of Fnq (ImG or KerH). Thematrix H is called a parity check matrix.Example: (Hamming code) The [7, 4, 3]2 Hamming code over F2 is a linear code with G = [I;P ]and H = [−P T ; I] is a parity check matrix.

G =

1 0 0 0 1 1 00 1 0 0 1 0 10 0 1 0 0 1 10 0 0 1 1 1 1

H =

1 1 0 1 1 0 01 0 1 1 0 1 00 1 1 1 0 0 1

Note:

• Parity check: all four bits in the same circle sum up to zero.

• The minimum distance of this code is 3. Hence it can correct 1bit of error and detect 2 bits of error.

x5

x6 x7

x4x1x2

x3

187

Page 188: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Linear codes are almost always examined with channels of additive noise, a precise definition ofwhich is given below:

Definition 18.2 (Additive noise). PY |X is additive-noise over Fnq if

PY |X(y|x) = PZn(y − x)⇔ Y = X + Zn where Zn ⊥⊥ X

Now: Given a linear code and an additive-noise channel PY |X , what can we say about thedecoder?

Theorem 18.1. Any [k, n]Fq linear code over an additive-noise PY |X has a maximum likelihooddecoder g : Fnq → Fnq such that:

1. g(y) = y − gsynd(HyT ), i.e., the decoder is a function of the “syndrome” HyT only

2. Decoding regions are translates: Du = cu +D0, ∀u

3. Pe,max = Pe,

where gsynd : Fn−kq → Fnq , defined by gsynd(s) = argmaxz:HxT=s PZ(z), is called the “syndromedecoder”, which decodes the most likely realization of the noise.

Proof. 1. The maximum likelihood decoder for linear code is

g(y) = argmaxc∈C

PY |X(y|c) = argmaxc:HcT=0

PZ(y − c) = y − argmaxz:HzT=HyT

PZ(z)

︸ ︷︷ ︸,gsynd(HyT )

,

2. For any u, the decoding region

Du = y : g(y) = cu = y : y−gsynd(HyT ) = cu = y : y−cu = gsynd(H(y−cu)T ) = cu+D0,

where we used HcTu = 0 and c0 = 0.

3. For any u,

P[W 6= u|W = u] = P[g(cu+Z) 6= cu] = P[cu+Z−gsynd(HcTu+HZT ) 6= cu] = P[gsynd(HZT ) 6= Z].

Remark 18.1. The advantages of linear codes include at least

1. Low-complexity encoding

2. Slightly lower complexity ML decoding (syndrome decoding)

3. Under ML decoding, maximum probability of error = average probability of error. This is aconsequence of the symmetry of the codes. Note that this holds as long as the decoder is afunction of the syndrome only. As shown in Theorem 18.1, syndrome is a sufficient statisticfor decoding a linear code.

Theorem 18.2 (DT bounds for linear codes). Let PY |X be additive noise over Fnq . ∀k, ∃ a linear code f :

Fkq → Fnq with the error probability:

Pe,max = Pe ≤ E[q−(n−k−logq

1PZn (Zn)

)+](18.1)

188

Page 189: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Remark 18.2. The analogy between Theorem 17.2 and Theorem 18.2 is the same as Theorem 9.4and Theorem 9.5 (full random coding vs random linear codes).

Proof. Recall that in proving the Shannon’s achievability bound or the DT bound (Theorem 17.1and 17.2), we select the code words c1, . . . , cM i.i.d ∼ PX and showed that

E[Pe(c1, . . . , cM )] ≤ P[i(X;Y ) ≤ γ] +M − 1

2P[i(X;Y ) ≥ γ]

As noted after the proof of the DT bound, we only need the random codewords to be pairwiseindependent. Here we will adopt a similar approach. Note that M = qk.

Let’s first do a quick check of the capacity achieving input distribution for PY |X with additivenoise over Fnq :

maxPX

I(X;Y ) = maxPX

H(Y )−H(Y |X) = maxPX

H(Y )−H(Zn) = n log q −H(Zn)⇒ P ∗X uniform on Fnq

We shall use the uniform distribution PX in the “random coding” trick.Moreover, the optimal (MAP) decoder with uniform input is the ML decoder, whose decoding

regions are translational invariant by Theorem 18.1, namely Du = cu +D0,∀u, and therefore:

Pe,max = Pe = P[W 6= u|W = u], ∀u.

Step 1: Random linear coding with dithering:

cu = uG+ h, ∀u ∈ Fkq

G and h are drawn from the new ensemble, where the k × n entries of G and the 1 × nentries of h are i.i.d. uniform over Fq. We add the dithering to eliminate the special rolethat the all-zero codeword plays (since it is contained in any linear codebook).

Step 2: Claim that the codewords are pairwise independent and uniform: ∀u 6= u′, (cu, cu′) ∼ (X,X),where PX,X(x, x) = 1/q2n. To see this:

cu ∼ uniform on Fnqcu′ = u′G+ h = uG+ h+ (u′ − u)G = cu + (u′ − u)G

We claim that cu ⊥⊥ G because conditioned on the generator matrix G = G0, cu ∼uniform on Fnq due to the dithering h.We also claim that cu ⊥⊥ cu′ because conditioned on cu, (u′ − u)G ∼ uniform on Fnq .Thus random linear coding with dithering indeed gives codewords cu, cu′ pairwise independentand are uniformly distributed.

Step 3: Repeat the same argument in proving DT bound for the symmetric and pairwise independentcodewords, we have

E[Pe(c1, . . . , cM )] ≤ P[i(X;Y ) ≤ γ] +M − 1

2P(i(X,Y ) ≥ γ)

⇒Pe ≤ E[exp−(i(X;Y )− log

M − 1

2

)+] = E[q−(i(X;Y )−logq

qk−12

)+

] ≤ E[q−(i(X;Y )−k

)+

]

where we used M = qk and picked the base of log to be q.

189

Page 190: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Step 4: compute i(X;Y ):

i(a; b) = logqPZn(b− a)

q−n= n− logq

1

PZn(b− a)

therefore

Pe ≤ E[q−(n−k−logq

1PZn (Zn)

)+

] (18.2)

Step 5: Kill h. We claim that there exists a linear code without dithering such that (18.2) issatisfied. Indeed shifting a codebook has no impact on its performance. We modify thecoding scheme with G, h which achieves the bound in the following way: modify thedecoder input Y ′ = Y − h, then when cu is sent, the additive noise PY ′|X becomes thenY ′ = uG+ h+ Zn − h = uG+ Zn, which is equivalent to that the linear code generated byG is used. Note that this is possible thanks to the additivity of the noisy channel.

Remark 18.3. • The ensemble cu = uG+h has the pairwise independence property. The jointentropy H(c1, . . . , cM ) = H(G) +H(h) = (nk+n) log q is significantly smaller than Shannon’s“fully random” ensemble we used in the previous lecture. Recall that in that ensemble each cjwas selected independently uniform over Fnq , implying H(c1, . . . , cM ) = qkn log q. Question:

minH(c1, . . . , cM ) =??

where minimum is over all distributions with P [ci = a, cj = b] = q−2n when i 6= j (pairwiseindependent, uniform codewords). Note that H(c1, . . . , cM ) ≥ H(c1, c2) = 2n log q. Similarly,we may ask for (ci, cj) to be uniform over all pairs of distinct elements. In this case Wozencraftensemble for the case of n = 2k achieves H(c1, . . . , cqk) ≈ 2n log q.

• There are many different ensembles of random codebooks:

– Shannon ensemble: C = c1, . . . , cMi.i.d.∼ PX – fully random

– Elias ensemble [Eli55]: C = uG : u ∈ Fkq, with the k × n generator matrix G uniformlydrawn at random.

– Gallager ensemble: C = c : HcT = 0, with the (n − k) × n parity-check matrix Huniformly drawn at random.

• With some non-zero probability G may fail to be full rank [Exercise: Find P [rank(G) < k] as afunction of n, k, q!]. In such a case, there are two identical codewords and hence Pe,max ≥ 1/2.There are two alternative ensembles of codes which do not contain such degenerate codebooks:

1. G ∼ uniform on all full rank matrices

2. search codeword cu ∈ KerH where H ∼ uniform on all n× (n− k) full row rank matrices.(random parity check construction)

Analysis of random coding over such ensemble is similar, except that this time (X, X) havedistribution

PX,X =1

q2n − qn1X 6=X′

uniform on all pairs of distinct codewords and not pairwise independent.

190

Page 191: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

18.2 Channels and channel capacity

Basic question of data transmission: How many bits can one transmit reliably if one is allowed touse the channel n times?

• Rate = # of bits per channel use

• Capacity = highest achievable rate

Next we formalize these concepts.

Definition 18.3 (Channel). A channel is specified by:

• input alphabet A

• output alphabet B

• a sequence of random transformation kernels PY n|Xn : An → Bn, n = 1, 2, . . . .

• The parameter n is called the blocklength.

Note: we do not insist on PY n|Xn to have any relation for different n, but it is common thatthe conditional distribution of the first k letters of the n-th transformation is in fact a function ofonly the first k letters of the input and this function equals PY k|Xk – the k-th transformation. Suchchannels, in particular, are non-anticipatory: channel outputs are causal functions of channel inputs.

Channel characteristics:

• A channel is discrete if A and B are finite.

• A channel is additive-noise if A = B are abelian group, and

Pyn|xn = PZn(yn − xn)⇔ Y n = Xn + Zn.

• A channel is memoryless if there exists a sequence PXk|Yk , k = 1, . . . of transformationsacting A → B such that PY n|Xn =

∏nk=1 PYk|Xk (in particular, the channels are compatible at

different blocklengths).

• A channel is stationary memoryless if PY n|Xn =∏nk=1 PY1|X1

.

• DMC (discrete memoryless stationary channel): A DMC can be specified in two ways:

– an |A| × |B|-dimensional (row-stochastic) matrix PY |X where elements specify the transi-tion probabilities

– a bipartite graph with edge weight specifying the transition probabilities.

Example:

191

Page 192: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Definition 18.4 (Fundamental Limits). For any channel,

• An (n,M, ε)-code is an (M, ε)-code for the n-th random transformation PY n|Xn .

• An (n,M, ε)max-code is analogously defined for maximum probability of error.

The non-asymptotic fundamental limits are

M∗(n, ε) = maxM : ∃ (n,M, ε)-code (18.3)

M∗max(n, ε) = maxM : ∃ (n,M, ε)max-code (18.4)

Definition 18.5 (Channel capacity). The ε-capacity Cε and Shannon capacity C are

Cε , lim infn→∞

1

nlogM∗(n, ε)

C = limε→0+

Cε.

Remark 18.4.

• This operational definition of the capacity represents the maximum achievable rate at whichone can communicate through a channel with probability of error less than ε. In other words,for any R < C, there exists an (n, exp(nR), εn)-code, such that εn → 0.

• Typically, the ε-capacity behaves like the plot below on the left-hand side, where C0 is calledthe zero-error capacity, which represents the maximal achievable rate with no error. Oftentimes C0 = 0, meaning without tolerating any error zero information can be transmitted. If Cεis constant for all ε (see plot on the right-hand side), then we say that the strong converseholds (more on this later).

ǫǫ

CǫCǫ

b

11

C0

00

Zero errorCapacity

strong converseholds

192

Page 193: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Proposition 18.2 (Equivalent definitions of Cε and C).

Cε = supR : ∀δ > 0,∃n0(δ),∀n ≥ n0(δ),∃(n, 2n(R−δ), ε) codeC = supR : ∀ε > 0, ∀δ > 0,∃n0(δ, ε),∀n ≥ n0(δ, ε), ∃(n, 2n(R−δ), ε) code

Proof. This trivially follows from applying the definitions of M∗(n, ε) (DIY).

Question: Why do we define capacity Cε and C with respect to average probability of error, say,

C(max)ε and C(max)? Why not maximal probability of error? It turns out that these two definitions

are equivalent, as the next theorem shows.

Theorem 18.3. ∀τ ∈ (0, 1),

τM∗(n, ε(1− τ)) ≤M∗max(n, ε) ≤M∗(n, ε)

Proof. The second inequality is obvious, since any code that achieves a maximum error probabilityε also achieves an average error probability of ε.

For the first inequality, take an (n,M, ε(1− τ))-code, and define the error probability for the jth

codeword as

λj , P[W 6= j|W = j]

ThenM(1− τ)ε ≥

∑λj =

∑λj1λj≤ε +

∑λj1λj>ε ≥ ε|j : λj > ε|.

Hence |j : λj > ε| ≤ (1 − τ)M . [Note that this is exactly Markov inequality!] Now byremoving those codewords1 whose λj exceeds ε, we can extract an (n, τM, ε)max-code. Finally, takeM = M∗(n, ε(1− τ)) to finish the proof.

Corollary 18.1 (Capacity under maximal probability of error). C(max)ε = Cε for all ε > 0 such

that Cε = Cε−. In particular, C(max) = C.2

Proof. Using the definition of M∗ and the previous theorem, for any fixed τ > 0

Cε ≥ C(max)ε ≥ lim inf

n→∞

1

nlog τM∗(n, ε(1− τ)) ≥ Cε(1−τ)

Sending τ → 0 yields Cε ≥ C(max)ε ≥ Cε−.

18.3 Bounds on Cε; Capacity of Stationary Memoryless Channels

Now that we have the basic definitions for Cε, we define another type of capacity, and show thatfor a stationary memoryless channels, the two notions (“operational” and “information” capacity)coincide.

Definition 18.6. The information capacity of a channel is

Ci = lim infn→∞

1

nsupPXn

I(Xn;Y n)

1This operation is usually referred to as expurgation which yields a smaller code by killing part of the codebook toreach a desired property.

2Notation: f(x−) , limyx f(y).

193

Page 194: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Remark: This quantity is not the same as the Shannon capacity, and has no direct operationalinterpretation as a quantity related to coding. Rather, it is best to think of this only as taking then-th random transformation in the channel, maximizing over input distributions, then normalizingand looking at the limit of this sequence.

Next we give coding theorems to relate information capacity (information measures) toShannon capacity (operational quantity).

Theorem 18.4 (Upper Bound for Cε). For any channel, ∀ε ∈ [0, 1), Cε ≤ Ci1−ε and C ≤ Ci.

Proof. Recall the general weak converse bound, Theorem 16.4:

logM∗(n, ε) ≤ supPXn I(Xn;Y n) + h(ε)

1− εNormalizing this by n the taking the lim inf gives

Cε = lim infn→∞

1

nlogM∗(n, ε) ≤ lim inf

n→∞

1

n

supPXn I(Xn;Y n) + h(ε)

1− ε =Ci

1− ε

Next we give an achievability bound:

Theorem 18.5 (Lower Bound for Cε). For a stationary memoryless channel, Cε ≥ Ci, for anyε ∈ (0, 1].

The following result follows from pairing the upper and lower bounds on Cε.

Theorem 18.6 (Shannon ’1948). For a stationary memoryless channel,

C = Ci = supPX

I(X;Y ). (18.5)

Remark 18.5. The above result, known as Shannon’s Noisy Channel Theorem, is perhapsthe most significant result in information theory. For communications engineers, the major surprisewas that C > 0, i.e. communication over a channel is possible with strictly positive rate for anyarbitrarily small probability of error. This result influenced the evolution of communication systemsto block architectures that used bits as a universal currency for data, along with encoding anddecoding procedures.

Before giving the proof of Theorem 18.5, we show the second equality in (18.5). Notice thatCi for stationary memoryless channels is easy to compute: Rather than solving an optimizationproblem for each n and taking the limit of n→∞, computing Ci boils down to maximizing mutualinformation for n = 1. This type of result is known as “single-letterization” in information theory.

Proposition 18.3 (Memoryless input is optimal for memoryless channels).

• For memoryless channels,

supPXn

I(Xn;Y n) =

n∑

i=1

supPXi

I(Xi;Yi).

• For stationary memoryless channels,

Ci = supPX

I(X;Y ).

194

Page 195: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Proof. Recall that for product kernels PY n|Xn =∏PYi|Xi , we have I(Xn;Y n) ≤∑n

k=1 I(Xk;Yk),with equality when Xi’s are independent. Then

Ci = lim infn→∞

1

nsupPXn

I(Xn;Y n) = lim infn→∞

supPX

I(X;Y ) = supPX

I(X;Y ).

Proof of Theorem 18.5. ∀PX , and let PXn = PnX (iid product). Recall Shannon’s or Feinstein’sachievability bound (Theorem 17.1 or 17.3): For any n,M and any τ > 0, there exists (n,M, εn)-code,s.t.

εn ≤ P[i(Xn;Y n) ≤ logM + τ ] + exp(−τ)

Here the information density is defined as

i(Xn;Y n) = logdPY n|Xn

dPY n(Y n|Xn) =

n∑

k=1

logdPY |X

dPY(Yk|Xk) =

n∑

k=1

i(Xk;Yk),

which is a sum of iid r.v.’s with mean I(X;Y ). Set logM = n(I(X;Y )− 2δ) for δ > 0, and takingτ = δn in Shannon’s bound, we have

εn ≤ P[ n∑

k=1

i(Xk;Yk) ≤ nI(X;Y )− δn]

+ exp(−δn)n→∞−−−→ 0,

where the first term goes to zero by WLLN.Therefore, ∀PX , ∀δ > 0, there exists a sequence of (n,Mn, εn)-codes with εn → 0 (where

logMn = n(I(X;Y )− 2δ)). Hence, for all n such that εn ≤ εlogM∗(n, ε) ≥ n(I(X;Y )− 2δ)

And so

Cε = lim infn→∞

1

nlogM∗(n, ε) ≥ I(X;Y )− 2δ ∀PX ,∀δ

Since this holds for all PX and all δ, we conclude Cε ≥ supPX I(X;Y ) = Ci.

Remark 18.6. Shannon’s noisy channel theorem (Theorem 18.6) shows that by employing codesof large blocklength, we can approach the channel capacity arbitrarily close. Given the asymptoticnature of this result (or any other asymptotic result), two natural questions are in order dealingwith the price to pay for reaching capacity:

1. The complexity of achieving capacity: Is it possible to find low-complexity encoders anddecoders with polynomial number of operations in the blocklength n which achieve thecapacity? This question is resolved by Forney in 1966 who showed that this is possible inlinear time with exponentially small error probability. His main idea is concatenated codes.We will study the complexity question in detail later.

Note that if we are content with polynomially small probability of error, e.g., Pe = O(n−100),then we can construct polynomial-time decodable codes as follows. First, it can be shown thatwith rate strictly below capacity, the error probability of optimal codes decays exponentiallyw.r.t. the blocklenth. Now divide the block of length n into shorter block of length C log nand apply the optimal code for blocklength C log n with error probability n−101. The by theunion bound, the whole block is has error with probability at most n−100. The encoding andexhaustive-search decoding are obviously polynomial time.

195

Page 196: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

2. The speed of achieving capacity: Suppose we want to achieve 90% of the capacity, we wantto know how long do we need wait? The blocklength is a good proxy for delay. In otherwords, we want to know how fast the gap to capacity vanish as blocklength grows. Shannon’stheorem shows that the gap C− 1

n logM∗(n, ε) = o(1). Next theorem shows that under properconditions, the o(1) term is in fact O( 1√

n).

The main tool in the proof of Theorem 18.5 is the WLLN. The lower bound Cε ≥ Ci inTheorem 18.5 shows that logM∗(n, ε) ≥ nC + o(n) (since normalizing by n and taking the liminfmust result in something ≥ C). If instead we do a more refined analysis using the CLT, we find

Theorem 18.7. For any stationary memoryless channel with C = maxPX I(X;Y ) (i.e. ∃P ∗X =argmaxPX I(X;Y )) such that V = Var[i(X∗;Y ∗)] <∞, then

logM∗(n, ε) ≥ nC −√nV Q−1(ε) + o(

√n),

where Q(·) is the complementary Gaussian CDF and Q−1(·) is its functional inverse.

Proof. Writing the little-o notation in terms of lim inf, our goal is

lim infn→∞

logM∗(n, ε)− nC√nV

≥ −Q−1(ε) = Φ−1(ε),

where Φ(t) is the standard normal CDF.Recall Feinstein’s bound

∃(n,M, ε)max : M ≥ β (ε− P[i(Xn;Y n) ≤ log β])

Take log β = nC +√nV t, then applying the CLT gives

logM ≥ nC +√nV t+ log

(ε− P

[∑i(Xk;Yk) ≤ nC +

√nV t

])

=⇒ logM ≥ nC +√nV t+ log (ε− Φ(t)) ∀Φ(t) < ε

=⇒ logM − nC√nV

≥ t+log(ε− Φ(t))√

nV

Where Φ(t) is the standard normal CDF. Taking the liminf of both sides

lim infn→∞

logM∗(n, ε)− nC√nV

≥ t ∀t s.t. Φ(t) < ε

Taking t Φ−1(ε), and writing the liminf in little o form completes the proof

logM∗(n, ε) ≥ nC −√nV Q−1(ε) + o(

√n)

196

Page 197: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

18.4 Examples of DMC

Binary symmetric channels

0

1

0

1

δ

δ

δ

δ

Y = X + Z, Z ∼ Bern(δ) ⊥⊥ X

0 12

1 bit

C

Capacity of BSC:C = sup

PX

I(X;Y ) = 1− h(δ)

Proof. I(X;X +Z) = H(X +Z)−H(X +Z|X) = H(X +Z)−H(Z) ≤ 1− h(δ), with equality iffX ∼ Bern(1/2).

Note: More generally, for all additive-noise channel over a finite abelian groupG, C = supPX I(X;X+Z) = log |G| −H(Z), achieved by uniform X.

Binary erasure channels

0

1

0

1

e

δ

δ

δ

δ

BEC is a multiplicative channel: If we think about the in-put X ∈ ±1, and output Y ∈ ±1, 0. Then equivalentlywe can write Y = XZ with Z ∼ Bern(δ) ⊥⊥ X.

0 1δ

1 bit

C

Capacity of BEC:C = sup

PX

I(X;Y ) = 1− δ bits

Note: Without evaluating Shannon’s formula, it is clear thta C ≤ 1− δ, because δ-fraction of themessage are lost, i.e., even if the encoder knows a priori where the erasures are going to occur, therate still cannot exceed 1− δ.

Proof. Note that P (X = 0|Y = e) = P (X=0)δδ = P (X = 0). Therefore I(X;Y ) = H(X)−H(X|Y ) =

H(X)−H(X|Y = e) ≤ (1− δ)H(X) ≤ 1− δ, with equality iff X ∼ Bern(1/2).

197

Page 198: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

18.5* Information Stability

We saw that C = Ci for stationary memoryless channels, but what other channels does this holdfor? And what about non-stationary channels? To answer this question, we introduce the notion ofinformation stability.

Definition 18.7. A channel is called information stable if there exists a sequence of input distribu-tion PXn , n = 1, 2, . . . such that

1

ni(Xn;Y n)

P−→Ci .

For example, we can pick PXn = (P ∗X)n for stationary memoryless channels. Therefore stationarymemoryless channels are information stable.

The purpose for defining information stability is the following theorem.

Theorem 18.8. For an information stable channel, C = Ci.

Proof. Like the stationary, memoryless case, the upper bound comes from the general converse Theo-rem 16.4, and the lower bound uses a similar strategy as Theorem 18.5, except utilizing the definitionof information stability in place of WLLN.

The next theorem gives conditions to check for information stability in memoryless channelswhich are not necessarily stationary.

Theorem 18.9. A memoryless channel is information stable if there exists X∗k : k ≥ 1 such thatboth of the following hold:

1

n

n∑

k=1

I(X∗k ;Y ∗k )→ Ci (18.6)

∞∑

n=1

1

n2V ar[i(X∗n;Y ∗n )] <∞ . (18.7)

In particular, this is satisfied if

|A| <∞ or |B| <∞ (18.8)

Proof. To show the first part, it is sufficient to prove

P

[1

n

∣∣∣∣∣n∑

k=1

i(X∗k ;Y ∗k )− I(X∗k , Y∗k )

∣∣∣∣∣ > δ

]→ 0

So that 1n i(X

n;Y n)→ Ci in probability. We bound this by Chebyshev’s inequality

P

[1

n

∣∣∣∣∣n∑

k=1

i(X∗k ;Y ∗k )− I(X∗k , Y∗k )

∣∣∣∣∣ > δ

]≤

1n2

∑nk=1 Var[i(X∗k ;Y ∗k )]

δ2→ 0 ,

where convergence to 0 follows from Kronecker lemma (Lemma 18.1 to follow) applied withbn = n2, xn = Var[i(X∗n;Y ∗n )]/n2.

The second part follows from the first. Indeed, notice that

Ci = lim infn→∞

1

n

n∑

k=1

supPXk

I(Xk;Yk) .

198

Page 199: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Now select PX∗k such that

I(X∗k ;Y ∗k ) ≥ supPXk

I(Xk;Yk)− 2−k .

(Note that each supPXkI(Xk;Yk) ≤ log min|A|, |B| <∞.) Then, we have

n∑

k=1

I(X∗k ;Y ∗k ) ≥n∑

k=1

supPXk

I(Xk;Yk)− 1 ,

and hence normalizing by n we get (18.6). We next show that for any joint distribution PX,Y wehave

Var[i(X;Y )] ≤ 2 log2(min(|A|, |B|)) . (18.9)

The argument is symmetric in X and Y , so assume for concreteness that |B| <∞. Then

E[i2(X;Y )]

,∫

AdPX(x)

y∈BPY |X(y|x)

[log2 PY |X(y|x) + log2 PY (y)− 2 logPY |X(y|x) · logPY (y)

]

≤∫

AdPX(x)

y∈BPY |X(y|x)

[log2 PY |X(y|x) + log2 PY (y)

](18.10)

=

AdPX(x)

y∈BPY |X(y|x) log2 PY |X(y|x)

+

y∈BPY (y) log2 PY (y)

≤∫

AdPX(x)g(|B|) + g(|B|) (18.11)

=2g(|B|) ,

where (18.10) is because 2 logPY |X(y|x) · logPY (y) is always non-negative, and (18.11) followsbecause each term in square-brackets can be upper-bounded using the following optimizationproblem:

g(n) , supaj≥0:

∑nj=1 aj=1

n∑

j=1

aj log2 aj . (18.12)

Since the x log2 x has unbounded derivative at the origin, the solution of (18.12) is always in theinterior of [0, 1]n. Then it is straightforward to show that for n > e the solution is actually aj = 1

n .For n = 2 it can be found directly that g(2) = 0.5629 log2 2 < log2 2. In any case,

2g(|B|) ≤ 2 log2 |B| .

Finally, because of the symmetry, a similar argument can be made with |B| replaced by |A|.

Lemma 18.1 (Kronecker Lemma). Let a sequence 0 < bn ∞ and a non-negative sequence xnsuch that

∑∞n=1 xn <∞, then

1

bn

n∑

j=1

bjxj → 0

199

Page 200: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Proof. Since bn’s are strictly increasing, we can split up the summation and bound them from above

n∑

k=1

bkxk ≤ bmm∑

k=1

xk +n∑

k=m+1

bkxk

Now throw in the rest of the xk’s in the summation

=⇒ 1

bn

n∑

k=1

bkxk ≤bmbn

∞∑

k=1

xk +

n∑

k=m+1

bkbnxk ≤

bmbn

∞∑

k=1

xk +

∞∑

k=m+1

xk

=⇒ limn→∞

1

bn

n∑

k=1

bkxk ≤∞∑

k=m+1

xk → 0

Since this holds for any m, we can make the last term arbitrarily small.

Important example: For jointly Gaussian (X,Y ) we always have bounded variance:

Var[i(X;Y )] = ρ2(X,Y ) log2 e ≤ log2 e , ρ(X,Y ) =cov[X,Y ]√

Var[X] Var[Y ]. (18.13)

Indeed, first notice that we can always represent Y = X + Z with X = aX ⊥⊥ Z. On the otherhand, we have

i(x; y) =log e

2

[x2 + 2xz

σ2Y

− σ2

σ2Y σ

2Z

z2

], z , y − x .

From here by using Var[·] = Var[E[·|X]] + Var[·|X] we need to compute two terms separately:

E[i(X;Y )|X] =log e

2

X2 − σ2

X

σ2Z

σ2Y

,

and hence

Var[E[i(X;Y )|X]] =2 log2 e

4σ4Y

σ4X.

On the other hand,

Var[i(X;Y )|X] =2 log2 e

4σ4Y

[4σ2Xσ2Z + 2σ4

X] .

Putting it all together we get (18.13). Inequality (18.13) justifies information stability of all sorts ofGaussian channels (memoryless and with memory), as we will see shortly.

200

Page 201: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

§ 19. Channels with input constraints. Gaussian channels.

19.1 Channel coding with input constraints

Motivations: Let us look at the additive Gaussian noise. Then the Shannon capacity is infinite,since supPX I(X;X + Z) = ∞ achieved by X ∼ N (0, P ) and P → ∞. But this is at the price ofinfinite second moment. In reality, limitation of transmission power ⇒ constraints on the encodingoperations ⇒ constraints on input distribution.

Definition 19.1. An (n,M, ε)-code satisfies the input constraint Fn ⊂ An if the encoder isf : [M ]→ Fn. (Without constraint, the encoder maps into An).

b

b

bb

b

bb

b

b

b

b

bb

Fn

An

Codewords all land in a subset of An

Definition 19.2 (Separable cost constraint). A channel with separable cost constraint is specifiedas follows:

1. A,B: input/output spaces

2. PY n|Xn : An → Bn, n = 1, 2, . . .

3. Cost function c : A → R

Input constraint: average per-letter cost of a codeword xn (with slight abuse of notation)

c(xn) =1

n

n∑

k=1

c(xk) ≤ P

Example: A = B = R

• Average power constraint (separable):

1

n

n∑

i=1

|xi|2 ≤ P ⇔ ‖xn‖2 ≤√nP

• Peak power constraint (non-separable):1

max1≤i≤n

|xi| ≤ A ⇔ ‖xn‖∞ ≤ A1In fact, this reduces to the case without cost with input space replaced by [−A,A].

201

Page 202: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Definition 19.3. Some basic definitions in parallel with the channel capacity without inputconstraint.

• A code is an (n,M, ε, P )-code if it is an (n,M, ε)-code satisfying input constraint Fn , xn :1n

∑c(xk) ≤ P

• Finite-n fundamental limits:

M∗(n, ε, P ) = maxM : ∃(n,M, ε, P )-codeM∗max(n, ε, P ) = maxM : ∃(n,M, ε, P )max-code

• ε-capacity and Shannon capacity

Cε(P ) = lim infn→∞

1

nlogM∗(n, ε, P )

C(P ) = limε↓0

Cε(P )

• Information capacity

Ci(P ) = lim infn→∞

1

nsup

PXn :E[∑nk=1 c(Xk)]≤nP

I(Xn;Y n)

• Information stability: Channel is information stable if for all (admissible) P , there exists asequence of channel input distributions PXn such that the following two properties hold:

1

niPXn,Y n (Xn;Y n)

P−→Ci(P ) (19.1)

P[c(Xn) > P + δ]→ 0 ∀δ > 0 . (19.2)

Note: These are the usual definitions, except that in Ci(P ), we are permitted to maximizeI(Xn;Y n) using input distributions from the constraint set PXn : E[

∑nk=1 c(Xk)] ≤ nP instead

of the distributions supported on Fn.

Definition 19.4 (Admissible constraint). We say P is an admissible constraint if ∃x0 ∈ As.t. c(x0) ≤ P , or equivalently, ∃PX : E[c(X)] ≤ P . The set of admissible P ’s is denoted byDc, and can be either in the form (P0,∞) or [P0,∞), where P0 , infx∈A c(x).

Clearly, if P /∈ Dc, then there is no code (even a useless one, with 1 codeword) satisfying theinput constraint. So in the remaining we always assume P ∈ Dc.

Proposition 19.1. Define f(P ) = supPX :E[c(X)]≤P I(X;Y ). Then

1. f is concave and non-decreasing. The domain of f is dom f , x : f(x) > −∞ = Dc.

2. One of the following is true: f(P ) is continuous and finite on (P0,∞), or f =∞ on (P0,∞).

Furthermore, both properties hold for the function P 7→ Ci(P ).

202

Page 203: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Proof. In the first part all statements are obvious, except for concavity, which follows from theconcavity of PX 7→ I(X;Y ). For any PXi such that E [c(Xi)] ≤ Pi, i = 0, 1, let X ∼ λPX0 + λPX1 .Then E [c(X)] ≤ λP0 + λP1 and I(X;Y ) ≥ λI(X0;Y0) + λI(X1;Y1). Hence f(λP0 + λP1) ≥λf(P0) + λf(P1). The second claim follows from concavity of f(·).

To extend these results to Ci(P ) observe that for every n

P 7→ 1

nsup

PXn :E[c(Xn)]≤PI(Xn;Y n)

is concave. Then taking lim infn→∞ the same holds for Ci(P ).

An immediate consequence is that memoryless input is optimal for memoryless channel withseparable cost, which gives us the single-letter formula of the information capacity:

Corollary 19.1 (Single-letterization). Information capacity of stationary memoryless channel withseparable cost:

Ci(P ) = f(P ) = supE[c(X)]≤P

I(X;Y ).

Proof. Ci(P ) ≥ f(P ) is obvious by using PXn = (PX)n. For “≤”, use the concavity of f(·), we havethat for any PXn ,

I(Xn;Y n) ≤n∑

j=1

I(Xj ;Yj) ≤n∑

j=1

f(E[c(Xj)])≤nf

1

n

n∑

j=1

E[c(Xj)]

≤ nf(P ).

19.2 Capacity under input constraint C(P )?= Ci(P )

Theorem 19.1 (General weak converse).

Cε(P ) ≤ Ci(P )

1− ε

Proof. The argument is the same as before: Take any (n,M, ε, P )-code, W → Xn → Y n → W .Apply Fano’s inequality, we have

−h(ε) + (1− ε) logM ≤ I(W ; W ) ≤ I(Xn;Y n) ≤ supPXn :E[c(Xn)]≤P

I(Xn;Y n) ≤ nf(P )

Theorem 19.2 (Extended Feinstein’s Lemma). Fix a random transformation PY |X . ∀PX ,∀F ⊂X ,∀γ > 0,∀M , there exists an (M, ε)max-code with:

• Encoder satisfies the input constraint: f : [M ]→ F ⊂ X ;

• Probability of error bound:

εPX(F ) ≤ P[i(X;Y ) < log γ] +M

γ

203

Page 204: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Note: when F = X , it reduces to the original Feinstein’s Lemma.

Proof. Similar to the proof of the original Feinstein’s Lemma, define the preliminary decodingregions Ec = y : i(c; y) ≥ log γ for all c ∈ X . Sequentially pick codewords c1, . . . , cM fromthe set F and the final decoding region D1, . . . , DM where Dj , Ecj\ ∪j−1

k=1 Dk. The stoppingcriterion is that M is maximal, i.e.,

∀x0 ∈ F, PY [Ex0\ ∪Mj=1 Dj

∣∣X = x0] < 1− ε⇔ ∀x0 ∈ X , PY [Ex0\ ∪Mj=1 Dj

∣∣X = x0] < (1− ε)1[x0 ∈ F ] + 1[x0 ∈ F c]⇒ average over x0 ∼ PX , P[i(X;Y ) ≥ log γ\ ∪Mj=1 Dj ] ≤ (1− ε)PX(F ) + PX(F c) = 1− εPX(F )

From here, we can complete the proof by following the same steps as in the proof of Feinstein’slemma (Theorem 17.3).

Theorem 19.3 (Achievability). For any information stable channel with input constraints andP > P0 we have

C(P ) ≥ Ci(P ) (19.3)

Proof. Let us consider a special case of the stationary memoryless channel (the proof for generalinformation stable channel follows similarly). So we assume PY n|Xn = (PY |X)n.

Fix n ≥ 1. Since the channel is stationary memoryless, we have PY n|Xn = (PY |X)n. Choose aPX such that E[c(X)] < P , Pick logM = n(I(X;Y )− 2δ) and log γ = n(I(X;Y )− δ).

With the input constraint set Fn = xn : 1n

∑c(xk) ≤ P, and iid input distribution PXn = PnX ,

we apply the extended Feinstein’s Lemma, there exists an (n,M, εn, P )max-code with the encodersatisfying input constraint F and the error probability

εn PX(F )︸ ︷︷ ︸→1

≤ P (i(Xn;Y n) ≤ n(I(X;Y )− δ))︸ ︷︷ ︸→0 as n→∞ by WLLN and stationary memoryless assumption

+ exp(−nδ)︸ ︷︷ ︸→0

Also, since E[c(X)] < P , by WLLN, we have PXn(Fn) = P ( 1n

∑c(xk) ≤ P )→ 1.

εn(1 + o(1)) ≤ o(1)

⇒ εn → 0 as n→∞⇒ ∀ε, ∃n0, s.t. ∀n ≥ n0, ∃(n,M, εn, P )max-code, with εn ≤ ε

Therefore

Cε(P ) ≥ 1

nlogM = I(X;Y )− 2δ, ∀δ > 0,∀PX s.t. E[c(X)] < P

⇒ Cε(P ) ≥ supPX :E[c(X)]<P

limδ→0

(I(X;Y )− 2δ)

⇒ Cε(P ) ≥ supPX :E[c(X)]<P

I(X;Y ) = Ci(P−) = Ci(P )

where the last equality is from the continuity of Ci on (P0,∞) by Proposition 19.1. Noticethat for general information stable channel, we just need to use the definition to show thatP (i(Xn;Y n) ≤ n(Ci − δ))→ 0, and all the rest follows.

204

Page 205: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Theorem 19.4 (Shannon capacity). For an information stable channel with cost constraint andfor any admissible constraint P we have

C(P ) = Ci(P ).

Proof. The case of P = P0 is treated in the homework. So assume P > P0. Theorem 19.1shows Cε(P ) ≤ Ci(P )

1−ε , thus C(P ) ≤ Ci(P ). On the other hand, from Theorem 19.3 we haveC(P ) ≥ Ci(P ).

Note: In homework, you will show that C(P0) = Ci(P0) also holds, even though Ci(P ) may bediscontinuous at P0.

19.3 Applications

19.3.1 Stationary AWGN channel

+

Z ∼ N (0, σ2)

X Y

Definition 19.5 (AWGN). The additive Gaussian noise (AWGN) channel is a stationary memorylessadditive-noise channel with separable cost constraint: A = B = R, c(x) = x2, PY |X is given byY = X + Z, where Z ∼ N (0, σ2) ⊥⊥ X, and average power constraint EX2 ≤ P .

In other words, Y n = Xn + Zn, where Zn ∼ N (0, In).

Note: Here “white” = uncorrelatedGaussian

= independent.Note: Complex AWGN channel is similarly defined: A = B = C, c(x) = |x|2, and Zn ∼ CN (0, In)

Theorem 19.5. For stationary (C)-AWGN channel, the channel capacity is equal to informationcapacity, and is given by:

C(P ) = Ci(P ) =1

2log

(1 +

P

σ2

)for AWGN

C(P ) = Ci(P ) = log

(1 +

P

σ2

)for C-AWGN

Proof. By Corollary 19.1,Ci = sup

PX :EX2≤PI(X;X + Z)

Then use Theorem 4.6 (Gaussian saddle point) to conclude X ∼ N (0, P ) (or CN (0, P )) is theunique caid.

205

Page 206: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Note: Since Zn ∼ N (0, σ2), then with high probability,‖Zn‖2 concentrates around

√nσ2. Similarly, due the power

constraint and the fact that Zn ⊥⊥ Xn, we have E[‖Y n‖2

]=

E[‖Y n‖2

]+E

[‖Zn‖2

]≤ n(P + σ2) and the received vector

Y n lies in an `2-ball of radius approximately√n(P + σ2).

Since the noise can at most perturb the codeword by√nσ2

in Euclidean distance, if we can pack M balls of radius√nσ2

into the `2-ball of radius√n(P + σ2) centered at the origin,

then this gives a good codebook and decision regions. Thepacking number is related to the volume ratio. Note thatthe volume of an `2-ball of radius r in Rn is given by cnr

n

for some constant cn. Then cn(n(P+σ2))n/2

cn(nσ2)n/2=(1 + P

σ2

)n/2.

Take the log and divide by n, we get 12 log

(1 + P

σ2

).

c1

c2

c3

c4

c5

c6 c7

c8

cM

· · ·

√n(P

+σ 2)

√nσ2

Why the above is not a proof for either achievability or converse?

• Packing number is not necessarily given by the volume ratio.

• Codewords need not correspond to centers of disjoint `2-balls.

Theorem 19.5 applies to Gaussian noise. What if the noise is non-Gaussian and how sensitive isthe capacity formula 1

2 log(1 + SNR) to the Gaussian assumption? Recall the Gaussian saddlepointresult we have studied in Lecture 4 where we showed that for the same variance, Gaussian noiseis the worst which shows that the capacity of any non-Gaussian noise is at least 1

2 log(1 + SNR).Conversely, it turns out the increase of the capacity can be controlled by how non-Gaussian thenoise is (in terms of KL divergence). The following result is due to Ihara.

Theorem 19.6 (Additive Non-Gaussian noise). Let Z be a real-valued random variable independentof X and EZ2 <∞. Let σ2 = VarZ. Then

1

2log

(1 +

P

σ2

)≤ sup

PX :EX2≤PI(X;X + Z) ≤ 1

2log

(1 +

P

σ2

)+D(PZ‖N (EZ, σ2)).

Proof. Homework.

Note: The quantity D(PZ‖N (EZ, σ2)) is sometimes called the non-Gaussianness of Z, whereN (EZ, σ2) is a Gaussian with the same mean and variance as Z. So if Z has a non-Gaussian density,say, Z is uniform on [0, 1], then the capacity can only differ by a constant compared to AWGN,which still scales as 1

2 log SNR in the high-SNR regime. On the other hand, if Z is discrete, thenD(PZ‖N (EZ, σ2)) =∞ and indeed in this case one can show that the capacity is infinite becausethe noise is “too weak”.

19.3.2 Parallel AWGN channel

Definition 19.6 (Parallel AWGN). A parallel AWGN channel with L branches is defined as follows:A = B = RL; c(x) =

∑Lk=1 |xk|2; PY L|XL : Yk = Xk + Zk, for k = 1, . . . , L, and Zk ∼ N (0, σ2

k) areindependent for each branch.

206

Page 207: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Theorem 19.7 (Waterfilling). The capacity of the L-parallel AWGN channel is given by

    C = (1/2) ∑_{j=1}^L log⁺(T/σ_j²),

where log⁺(x) ≜ max(log x, 0), and T ≥ 0 is determined by

    P = ∑_{j=1}^L |T − σ_j²|⁺.

Proof.

    Ci(P) = sup_{P_{X^L} : ∑ E[X_k²] ≤ P} I(X^L; Y^L)
          ≤ sup_{∑ P_k ≤ P, P_k ≥ 0} ∑_{k=1}^L sup_{E[X_k²] ≤ P_k} I(X_k; Y_k)
          = sup_{∑ P_k ≤ P, P_k ≥ 0} ∑_{k=1}^L (1/2) log(1 + P_k/σ_k²),

with equality if the X_k ∼ N(0, P_k) are independent. So the question boils down to the last maximization problem, the power allocation. Denote the Lagrangian multiplier for the constraint ∑ P_k ≤ P by λ and for the constraints P_k ≥ 0 by µ_k. We want to solve

    max ∑_k [ (1/2) log(1 + P_k/σ_k²) + µ_k P_k ] + λ(P − ∑_k P_k).

The first-order condition in P_k gives

    (1/2) · 1/(σ_k² + P_k) = λ − µ_k,   µ_k P_k = 0,

therefore the optimal solution is

    P_k = |T − σ_k²|⁺, where T is chosen such that P = ∑_{k=1}^L |T − σ_k²|⁺.

Note: The figure illustrates the power allocation via water-filling. In this particular case, the second branch is too noisy (σ² too large), so it is better to discard it, i.e., the assigned power is zero.
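The waterfilling allocation is easy to compute numerically. The sketch below is ours (variable names and the example noise variances are assumptions, not from the notes): it finds the water level T by bisection and evaluates the resulting capacity in nats.

```python
import numpy as np

def waterfilling(sigma2, P, tol=1e-12):
    """Return (T, powers, capacity_nats) for the L-parallel AWGN channel."""
    sigma2 = np.asarray(sigma2, dtype=float)
    # P(T) = sum_j |T - sigma_j^2|^+ is nondecreasing in T, so bisect on T.
    lo, hi = 0.0, sigma2.max() + P
    while hi - lo > tol:
        T = 0.5 * (lo + hi)
        if np.maximum(T - sigma2, 0.0).sum() < P:
            lo = T
        else:
            hi = T
    T = 0.5 * (lo + hi)
    powers = np.maximum(T - sigma2, 0.0)
    capacity = 0.5 * np.sum(np.log(np.maximum(T / sigma2, 1.0)))  # (1/2) sum log^+(T/sigma_j^2)
    return T, powers, capacity

# Example: three branches; the noisiest branch ends up with zero power.
T, powers, C = waterfilling([0.5, 4.0, 1.0], P=2.0)
print(T, powers, C)
```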

Note: [Significance of the waterfilling theorem] In the high-SNR regime, the capacity of a single AWGN channel is approximately (1/2) log P, while the capacity of the L-parallel AWGN channel is approximately (L/2) log(P/L) ≈ (L/2) log P for large P. This L-fold increase in capacity at high SNR leads to the powerful technique of spatial multiplexing in MIMO.

Also notice that this gain does not come from multipath diversity. Consider the scheme where a single stream of data is sent through every parallel channel simultaneously; with multipath diversity the effective noise level is reduced to 1/L, and the capacity is approximately (1/2) log(LP), which is much smaller than (L/2) log(P/L) for large P.

19.4* Non-stationary AWGN

Definition 19.7 (Non-stationary AWGN). A non-stationary AWGN channel is defined as follows: A = B = R, c(x) = x², P_{Y_j|X_j}: Y_j = X_j + Z_j, where Z_j ∼ N(0, σ_j²).

Theorem 19.8. Assume that for every T the following limits exist:

    Ci(T) = lim_{n→∞} (1/n) ∑_{j=1}^n (1/2) log⁺(T/σ_j²),
    P(T)  = lim_{n→∞} (1/n) ∑_{j=1}^n |T − σ_j²|⁺.

Then the capacity of the non-stationary AWGN channel is given in parametric form: C(T) = Ci(T) under the input power constraint P(T).

Proof. Fix T > 0. Then it is clear from the waterfilling solution that

    sup I(Xⁿ; Yⁿ) = ∑_{j=1}^n (1/2) log⁺(T/σ_j²),   (19.4)

where the supremum is over all P_{Xⁿ} such that

    E[c(Xⁿ)] ≤ (1/n) ∑_{j=1}^n |T − σ_j²|⁺.   (19.5)

Now, by assumption, the right-hand side of (19.5) converges to P(T). Thus, for every δ > 0,

    Ci(P(T) − δ) ≤ Ci(T),   (19.6)
    Ci(P(T) + δ) ≥ Ci(T).   (19.7)

Taking δ → 0 and invoking the continuity of P ↦ Ci(P), we get that the information capacity satisfies

    Ci(P(T)) = Ci(T).

The channel is information stable. Indeed, from (18.13)

    Var(i(X_j; Y_j)) = (log² e / 2) · P_j/(P_j + σ_j²) ≤ (log² e)/2,

and thus

    ∑_{j=1}^∞ (1/j²) Var(i(X_j; Y_j)) < ∞.

From here information stability follows via Theorem 18.9.

Note: Non-stationary AWGN is primarily interesting due to its relationship to the stationaryAdditive Colored Gaussian noise channel in the following discussion.


19.5* Stationary Additive Colored Gaussian noise channel

Definition 19.8 (Additive colored Gaussian noise channel). An additive colored Gaussian noise channel is defined as follows: A = B = R, c(x) = x², P_{Y_j|X_j}: Y_j = X_j + Z_j, where Z_j is a stationary Gaussian process with spectral density f_Z(ω) > 0, ω ∈ [−π, π].

Theorem 19.9. The capacity of the stationary ACGN channel is given in parametric form:

    C(T) = (1/2π) ∫₀^{2π} (1/2) log⁺(T/f_Z(ω)) dω,
    P(T) = (1/2π) ∫₀^{2π} |T − f_Z(ω)|⁺ dω.

Proof. Take n ≥ 1 and diagonalize the covariance matrix of Zⁿ:

    Cov(Zⁿ) = Σ = U*ΛU, with Λ = diag(σ₁², . . . , σ_n²),

where U is a unitary matrix, since Cov(Zⁿ) is symmetric and positive semi-definite. Define X̃ⁿ = UXⁿ and Ỹⁿ = UYⁿ; the channel between X̃ⁿ and Ỹⁿ is thus

    Ỹⁿ = X̃ⁿ + UZⁿ,   Cov(UZⁿ) = U Cov(Zⁿ) U* = Λ.

Therefore we have the equivalent channel

    Ỹⁿ = X̃ⁿ + Z̃ⁿ,   Z̃_j ∼ N(0, σ_j²) independent across j.

By Theorem 19.8, we have

    C̃ = lim_{n→∞} (1/n) ∑_{j=1}^n (1/2) log⁺(T/σ_j²) = (1/2π) ∫₀^{2π} (1/2) log⁺(T/f_Z(ω)) dω   (by Szegő, Theorem 5.6),
    lim_{n→∞} (1/n) ∑_{j=1}^n |T − σ_j²|⁺ = P(T).

Finally, since U is unitary, C = C̃.

Note: Noise is born white; colored noise is essentially the result of filtering.
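The parametric formulas in Theorem 19.9 are straightforward to evaluate numerically. The sketch below is ours, with a made-up spectrum f_Z(ω) = 1 + 0.9 cos ω and an arbitrary water level T chosen only for illustration; the (1/2π)∫₀^{2π} dω integrals are approximated by averages over a uniform grid.

```python
import numpy as np

def acgn_capacity(f_Z, T, num=200_000):
    """Parametric capacity/power of the stationary ACGN channel (Theorem 19.9), in nats."""
    w = np.linspace(0.0, 2 * np.pi, num, endpoint=False)
    fz = f_Z(w)
    C = np.mean(0.5 * np.log(np.maximum(T / fz, 1.0)))   # (1/2) log^+(T / f_Z(w))
    P = np.mean(np.maximum(T - fz, 0.0))                  # |T - f_Z(w)|^+
    return C, P

C, P = acgn_capacity(lambda w: 1.0 + 0.9 * np.cos(w), T=1.5)
print(f"water level T=1.5 gives P={P:.3f}, C={C:.3f} nats per channel use")
```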


19.6* Additive White Gaussian Noise channel with Intersymbol Interference

Definition 19.9 (AWGN with ISI). An AWGN channel with ISI is defined as follows: A = B = R, c(x) = x², and the channel law P_{Yⁿ|Xⁿ} is given by

    Y_k = ∑_{j=1}^n h_{k−j} X_j + Z_k,   k = 1, . . . , n,

where Z_k ∼ N(0, 1) is white Gaussian noise and {h_k, k = −∞, . . . , ∞} are the coefficients of a discrete-time channel filter.

Theorem 19.10. Suppose that the sequence {h_k} is the inverse Fourier transform of a frequency response H(ω):

    h_k = (1/2π) ∫₀^{2π} e^{iωk} H(ω) dω.

Assume also that H(ω) is a continuous function on [0, 2π]. Then the capacity of the AWGN channel with ISI is given by

    C(T) = (1/2π) ∫₀^{2π} (1/2) log⁺(T |H(ω)|²) dω,
    P(T) = (1/2π) ∫₀^{2π} |T − 1/|H(ω)|²|⁺ dω.

Proof. (Sketch) At the decoder apply the inverse filter with frequency response ω ↦ 1/H(ω). The equivalent channel then becomes a stationary colored-noise Gaussian channel:

    Y_j = X_j + Z_j,

where Z_j is a stationary Gaussian process with spectral density

    f_Z(ω) = 1/|H(ω)|².

Then apply Theorem 19.9 to the resulting channel.

Remark: to make the above argument rigorous one must carefully analyze the non-zero error introduced by truncating the deconvolution filter to finite n.

19.7* Gaussian channels with amplitude constraints

We have examined some classical results on additive Gaussian noise channels. In the following, we list some more recent results without proof.

Theorem 19.11 (Amplitude-constrained capacity of AWGN channel). For an AWGN channel Y_i = X_i + Z_i with amplitude constraint |X_i| ≤ A and energy constraint ∑_{i=1}^n X_i² ≤ nP, denote the capacity by

    C(A, P) = max_{P_X : |X| ≤ A, E|X|² ≤ P} I(X; X + Z).

The capacity-achieving input distribution P*_X is discrete, with finitely many atoms on [−A, A]. Moreover, the convergence lim_{A→∞} C(A, P) = (1/2) log(1 + P) takes place at speed of order e^{−O(A²)}.

For details, see [Smi71] and [PW14, Section III].


19.8* Gaussian channels with fading

Fading channels are often used to model urban signal propagation with multipath or shadowing. The received signal Y_i is affected by a multiplicative fading coefficient H_i and additive noise Z_i:

    Y_i = H_i X_i + Z_i,   Z_i ∼ N(0, 1).

In the coherent case (also known as CSIR, channel state information at the receiver), the receiver has access to the channel state information H_i, i.e. the channel output is effectively (Y_i, H_i). Whenever {H_j} is a stationary ergodic process, the channel capacity is given by

    C(P) = E[ (1/2) log(1 + P|H|²) ],

and the capacity-achieving input distribution is the usual P_X = N(0, P). Note that the capacity C(P) is of order log P, and we call the channel "energy efficient".
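A quick Monte Carlo sketch (ours; the SNR value is an assumption) of the coherent-fading capacity formula for Rayleigh-type fading H ∼ N(0, 1), compared against the non-fading AWGN value at the same SNR:

```python
import numpy as np

rng = np.random.default_rng(0)
P = 10.0                                        # illustrative SNR
H = rng.standard_normal(1_000_000)              # fading coefficients, H ~ N(0,1)
C_fading = np.mean(0.5 * np.log(1 + P * H**2))  # E[(1/2) log(1 + P|H|^2)], nats
C_awgn = 0.5 * np.log(1 + P)                    # non-fading AWGN with the same SNR
print(f"coherent fading: {C_fading:.3f} nats, non-fading AWGN: {C_awgn:.3f} nats")
```

By Jensen's inequality the fading value comes out below the non-fading one (since E[H²] = 1), although both scale as (1/2) log P at high SNR.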

In the non-coherent case, where the receiver does not know H_i, no simple expression for the channel capacity is known. It is known, however, that the capacity-achieving input distribution is discrete [AFTS01], and that the capacity scales as [TE97, LM03]

    C(P) = O(log log P),   P → ∞.   (19.8)

This channel is said to be "energy inefficient".

With the introduction of multiple-antenna channels, there are endless variations, theoretical open problems and practically unresolved issues in the topic of fading channels. We recommend consulting the textbook [TV05] for details.


§ 20. Lattice codes (by O. Ordentlich)

Consider the n-dimensional additive white Gaussian noise (AWGN) channel

    Y = X + Z,

where Z ∼ N(0, I_{n×n}) is statistically independent of the input X. Our goal is to communicate reliably over this channel, under the power constraint

    (1/n) ‖X‖² ≤ SNR,

where SNR is the signal-to-noise ratio. The capacity of the AWGN channel is

    C = (1/2) log(1 + SNR) bits/channel use,

and is achieved with high probability by a codebook drawn at random from the Gaussian i.i.d. ensemble. However, a typical codebook from this ensemble has very little structure, and is therefore not applicable to practical systems. A similar problem occurs in discrete additive memoryless stationary channels, e.g., the BSC, where most members of the capacity-achieving i.i.d. uniform codebook ensemble have no structure. In the discrete case, engineers resort to linear codes to circumvent the lack of structure. Lattice codes are the Euclidean-space counterpart of linear codes and, as we shall see, they enable achieving the capacity of the AWGN channel with much more structure than random codes. In fact, we will construct a lattice code with rate approaching (1/2) log(1 + SNR) that is guaranteed to achieve small error probability for essentially all additive noise channels with the same noise second moment. More precisely, our scheme will work if the noise vector Z is semi norm-ergodic.

Definition 20.1. We say that a sequence (in n) of random noise vectors Z⁽ⁿ⁾ of length n with (finite) effective variance σ²_Z ≜ (1/n) E‖Z⁽ⁿ⁾‖² is semi norm-ergodic if for any ε, δ > 0 and n large enough

    Pr(Z⁽ⁿ⁾ ∉ B(√((1 + δ) n σ²_Z))) ≤ ε,   (20.1)

where B(r) is an n-dimensional ball of radius r.
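For intuition, here is a small simulation of ours (parameters δ and the sample sizes are arbitrary) checking that i.i.d. Gaussian noise satisfies Definition 20.1: the fraction of noise vectors falling outside the (1 + δ)-inflated ball shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)
delta, trials = 0.1, 2000
for n in (10, 100, 1000):
    Z = rng.standard_normal((trials, n))          # sigma_Z^2 = 1
    outside = np.mean(np.sum(Z**2, axis=1) > (1 + delta) * n)
    print(f"n={n:5d}: empirical Pr(Z outside B(sqrt((1+delta)n))) = {outside:.3f}")
```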

20.1 Lattice Definitions

A lattice Λ is a discrete subgroup of Rⁿ which is closed under reflection and real addition. Any lattice Λ in Rⁿ is spanned by some n × n matrix G such that

    Λ = {t = Ga : a ∈ Zⁿ}.

We will assume G is full rank. Denote the nearest-neighbor quantizer associated with the lattice Λ by

    Q_Λ(x) ≜ argmin_{t∈Λ} ‖x − t‖,   (20.2)


where ties are broken in a systematic manner. We define the modulo operation w.r.t. a lattice Λ as

    [x] mod Λ ≜ x − Q_Λ(x),

and note that it satisfies the distributive law

    [[x] mod Λ + y] mod Λ = [x + y] mod Λ.

The basic Voronoi region of Λ, denoted by V, is the set of all points in Rⁿ which are quantized to the zero vector. The systematic tie-breaking in (20.2) ensures that

    ⋃̇_{t∈Λ} (V + t) = Rⁿ,

where ⋃̇ denotes disjoint union. Thus, V is a fundamental cell of Λ.

Definition 20.2. A measurable set S ⊂ Rⁿ is called a fundamental cell of Λ if

    ⋃̇_{t∈Λ} (S + t) = Rⁿ.

We denote the volume of a set S ⊂ Rⁿ by Vol(S).

Proposition 20.1. If S is a fundamental cell of Λ, then Vol(S) = Vol(V). Furthermore,

    S mod Λ = {[s] mod Λ : s ∈ S} = V.

Proof ([Zam14]). For any t ∈ Λ define

    A_t ≜ S ∩ (t + V),   D_t ≜ V ∩ (t + S).

Note that

    D_t = [(−t + V) ∩ S] + t = A_{−t} + t.

Thus

    Vol(S) = ∑_{t∈Λ} Vol(A_t) = ∑_{t∈Λ} Vol(A_{−t} + t) = ∑_{t∈Λ} Vol(D_t) = Vol(V).

Moreover

    S = ⋃̇_{t∈Λ} A_t = ⋃̇_{t∈Λ} A_{−t} = ⋃̇_{t∈Λ} (D_t − t),

and therefore

    [S] mod Λ = ⋃̇_{t∈Λ} D_t = V.


Corollary 20.1. If S is a fundamental cell of a lattice Λ with generating matrix G, then Vol(S) = |det(G)|. In particular, Vol(V) = |det(G)|.

Proof. Let P = G · [0, 1)ⁿ and note that it is a fundamental cell of Λ, since Rⁿ = Zⁿ + [0, 1)ⁿ. The claim now follows from Proposition 20.1, since Vol(P) = |det(G)| · Vol([0, 1)ⁿ) = |det(G)|.
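To make the definitions concrete, here is a small brute-force sketch of ours of the nearest-neighbor quantizer Q_Λ and the modulo operation [x] mod Λ. It assumes a full-rank generator G (the hexagonal example is ours) and searches a finite window of integer coefficients around Babai's rounding guess, which suffices for points that are not too far from the lattice; it is not meant as an efficient lattice decoder.

```python
import itertools
import numpy as np

G = np.array([[1.0, 0.5],
              [0.0, np.sqrt(3) / 2]])   # generator of the hexagonal lattice (example)

def quantize(x, G, window=3):
    """Brute-force nearest-neighbor quantizer Q_Lambda(x) over a finite search window."""
    a0 = np.round(np.linalg.solve(G, x)).astype(int)      # rough integer guess
    best, best_d = None, np.inf
    for da in itertools.product(range(-window, window + 1), repeat=G.shape[1]):
        t = G @ (a0 + np.array(da))
        d = np.linalg.norm(x - t)
        if d < best_d:
            best, best_d = t, d
    return best

def mod_lattice(x, G):
    """[x] mod Lambda = x - Q_Lambda(x)."""
    return x - quantize(x, G)

x = np.array([2.3, 1.7])
print("Q(x) =", quantize(x, G), " [x] mod Lambda =", mod_lattice(x, G))
```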

Definition 20.3 (Lattice decoder). A lattice decoder w.r.t. the lattice Λ returns for every y ∈ Rⁿ the point Q_Λ(y).

Remark 20.1. Recall that for linear codes, the ML decoder merely consisted of mapping syndromes to shifts. Similarly, it can be shown that a lattice decoder can be expressed as

    Q_Λ(y) = y − g_synd([G⁻¹y] mod 1),   (20.3)

for some g_synd : [0, 1)ⁿ → Rⁿ, where the mod 1 operation above is to be understood as componentwise modulo reduction. Thus, a lattice decoder is indeed much more "structured" than the ML decoder for a random code.

Note that for an additive channel Y = X + Z, if X ∈ Λ we have that

    P_e = Pr(Q_Λ(Y) ≠ X) = Pr(Z ∉ V).   (20.4)

We therefore see that the resilience of a lattice to additive noise is dictated by its Voronoi region. Since we know that Z will be inside a ball of radius √(n(1 + δ)) with high probability, we would like the Voronoi region to be as close as possible to a ball. We define the effective radius of a lattice, denoted r_eff(Λ), as the radius of a ball with the same volume as V, namely Vol(B(r_eff(Λ))) = Vol(V).

Definition 20.4 (Goodness for coding). A sequence of lattices Λ⁽ⁿ⁾ with growing dimension, satisfying

    lim_{n→∞} r²_eff(Λ⁽ⁿ⁾)/n = Φ

for some Φ > 0, is called good for channel coding if for any additive semi norm-ergodic noise sequence Z⁽ⁿ⁾ with effective variance σ²_Z = (1/n) E‖Z‖² < Φ,

    lim_{n→∞} Pr(Z⁽ⁿ⁾ ∉ V⁽ⁿ⁾) = 0.

An alternative interpretation of this property is that for a sequence Λ⁽ⁿ⁾ that is good for coding, for any 0 < δ < 1,

    lim_{n→∞} Vol(B((1 − δ) r_eff(Λ⁽ⁿ⁾)) ∩ V⁽ⁿ⁾) / Vol(B((1 − δ) r_eff(Λ⁽ⁿ⁾))) = 1.

Roughly speaking, the Voronoi region of a lattice that is good for coding is as resilient to semi norm-ergodic noise as a ball with the same volume.


Figure 20.1: (a) shows a lattice in R², and (b) shows its Voronoi region and the corresponding effective ball of radius r_eff.

20.2 First Attempt at AWGN Capacity

Assume we have a lattice Λ ⊂ Rⁿ with r_eff(Λ) = √(n(1 + δ)) that is good for coding, and we would like to use it for communicating over an additive noise channel. In order to meet the power constraint, we must first intersect Λ, or a shifted version of Λ, with some compact set S that enforces the power constraint. The most obvious choice is taking S to be a ball with radius √(nSNR), and taking some shift v ∈ Rⁿ, such that the codebook

    C = (v + Λ) ∩ B(√(nSNR))   (20.5)

satisfies the power constraint. Moreover [Loe97], there exists a shift v such that

    |C| ≥ Vol(S)/Vol(V) = (√(nSNR)/r_eff(Λ))ⁿ = 2^{(n/2)(log(SNR) − log(1+δ))}.

To see this, let V ∼ Uniform(V) and write the expected size of |C| as

    E|C| = E ∑_{t∈Λ} 1((t + V) ∈ S)
         = (1/Vol(V)) ∫_{v∈V} ∑_{t∈Λ} 1((t + v) ∈ S) dv
         = (1/Vol(V)) ∫_{x∈Rⁿ} 1(x ∈ S) dx
         = Vol(S)/Vol(V).   (20.6)


For decoding, we will simply apply the lattice decoder Q_Λ(Y − v) to the shifted output. Since Y − v = t + Z for some t ∈ Λ, the error probability is

    P_e = Pr(Q_Λ(Y − v) ≠ t) = Pr(Z ∉ V).

Since Λ is good for coding and r²_eff(Λ)/n = (1 + δ) > (1/n) E‖Z‖², the error probability of this scheme over an additive semi norm-ergodic noise channel will vanish with n. Taking δ → 0, we see that any rate R < (1/2) log(SNR) can be achieved reliably. Note that for this coding scheme (encoder + decoder) the average error probability and the maximal error probability are the same.

The construction above gets us close to the AWGN channel capacity. We note that a possible reason for the loss of the "+1" inside the logarithm in the achievable rate is the suboptimality of the lattice decoder for the codebook C. The lattice decoder assumes all points of Λ were equally likely to be transmitted. However, in C only lattice points inside the ball can be transmitted. Indeed, it was shown [UR98] that if one replaces the lattice decoder with a decoder that takes the shaping region into account, there exist lattices and shifts for which the codebook (v + Λ) ∩ B(√(nSNR)) is capacity achieving. The main drawback of this approach is that the decoder no longer exploits the full structure of the lattice, so the advantages of using a lattice code w.r.t. some typical member of the Gaussian i.i.d. ensemble are not so clear anymore.

20.3 Nested Lattice Codes/Voronoi Constellations

A lattice Λ_c is said to be nested in Λ_f if Λ_c ⊂ Λ_f. The lattice Λ_c is referred to as the coarse lattice and Λ_f as the fine lattice. The nesting ratio is defined as

    Γ(Λ_f, Λ_c) ≜ (Vol(V_c)/Vol(V_f))^{1/n}.   (20.7)

A nested lattice code (sometimes also called a "Voronoi constellation") based on the nested lattice pair Λ_c ⊂ Λ_f is defined as [CS83, FJ89, EZ04]

    L ≜ Λ_f ∩ V_c.   (20.8)

Proposition 20.2.

    |L| = Vol(V_c)/Vol(V_f).

Thus, the codebook L has rate R = (1/n) log |L| = log Γ(Λ_f, Λ_c).

Proof. First note that

    Λ_f = ⋃̇_{t∈L} (t + Λ_c).

Let

    S ≜ ⋃̇_{t∈L} (t + V_f),


and note that

    Rⁿ = ⋃̇_{b∈Λ_f} (b + V_f)
       = ⋃̇_{a∈Λ_c} ⋃̇_{t∈L} (a + t + V_f)
       = ⋃̇_{a∈Λ_c} (a + (⋃̇_{t∈L} (t + V_f)))
       = ⋃̇_{a∈Λ_c} (a + S).

Thus, S is a fundamental cell of Λ_c, and we have

    Vol(V_c) = Vol(S) = |L| · Vol(V_f).

We will use the codebook L with a standard lattice decoder, ignoring the fact that only points in V_c were transmitted. Therefore, the resilience to noise will be dictated mainly by Λ_f. The role of the coarse lattice Λ_c is to perform shaping. In order to maximize the rate of the codebook L without violating the power constraint, we would like V_c to have the maximal possible volume, under the constraint that the average power of a transmitted codeword is no more than nSNR.

The average transmission power of the codebook L is related to a quantity called the second moment of a lattice. Let U ∼ Uniform(V). The second moment of Λ is defined as σ²(Λ) ≜ (1/n) E‖U‖². Let W ∼ Uniform(B(r_eff(Λ))). By the isoperimetric inequality [Zam14]

    σ²(Λ) ≥ (1/n) E‖W‖² = r²_eff(Λ)/(n + 2).

A lattice Λ exhibits a good tradeoff between average power and volume if its second moment is close to that of B(r_eff(Λ)).

Definition 20.5 (Goodness for MSE quantization). A sequence of lattices Λ⁽ⁿ⁾ with growing dimension is called good for MSE quantization if

    lim_{n→∞} n σ²(Λ⁽ⁿ⁾) / r²_eff(Λ⁽ⁿ⁾) = 1.

Remark 20.2. Note that both "goodness for coding" and "goodness for MSE quantization" are scale-invariant properties: if Λ satisfies them, so does αΛ for any α ∈ R.

Theorem 20.1 ([OE15]). If Λ is good for MSE quantization and U ∼ Uniform(V), then U is semi norm-ergodic. Furthermore, if Z is semi norm-ergodic and statistically independent of U, then for any α, β ∈ R the random vector αU + βZ is semi norm-ergodic.

Theorem 20.2 ([ELZ05, OE15]). For any finite nesting ratio Γ(Λ_f, Λ_c), there exists a nested lattice pair Λ_c ⊂ Λ_f where the coarse lattice Λ_c is good for MSE quantization and the fine lattice Λ_f is good for coding.


Figure 20.2: An example of a nested lattice code. The points and Voronoi region of Λc are plottedin blue, and the points of the fine lattice in black.

Figure 20.3: Schematic illustration of the Mod-Λ scheme.

We now describe the Mod-Λ coding scheme introduced by Erez and Zamir [EZ04]. Let Λ_c ⊂ Λ_f be a nested lattice pair, where the coarse lattice is good for MSE quantization and has σ²(Λ_c) = SNR(1 − ε), whereas the fine lattice is good for coding and has r²_eff(Λ_f) = n · (SNR/(1 + SNR)) · (1 + ε). The rate is therefore

    R = (1/n) log(Vol(V_c)/Vol(V_f))
      = (1/2) log(r²_eff(Λ_c)/r²_eff(Λ_f))
      → (1/2) log( SNR(1 − ε) / ((SNR/(1 + SNR))(1 + ε)) )   (20.9)
      → (1/2) log(1 + SNR),

where in (20.9) we have used the goodness of Λ_c for MSE quantization, which implies r²_eff(Λ_c)/n → σ²(Λ_c).

The scheme also uses common randomness, namely a dither vector U ∼ Uniform(V_c), statistically independent of everything else and known to both the transmitter and the receiver. In order to transmit a message w ∈ [1, . . . , 2^{nR}], the encoder maps it to the corresponding point t = t(w) ∈ L and transmits

    X = [t + U] mod Λ_c.   (20.10)

Lemma 20.1 (Crypto Lemma). Let Λ be a lattice in Rⁿ, let U ∼ Uniform(V) and let V be a random vector in Rⁿ, statistically independent of U. The random vector X = [V + U] mod Λ is uniformly distributed over V and statistically independent of V.


Proof. For any v ∈ Rⁿ the set v + V is a fundamental cell of Λ. Thus, by Proposition 20.1 we have that [v + V] mod Λ = V and Vol(v + V) = Vol(V). Thus, for any v ∈ Rⁿ,

    X | V = v ∼ [v + U] mod Λ ∼ Uniform(V).

The Crypto Lemma ensures that (1/n) E‖X‖² = (1 − ε) SNR, but our power constraint was ‖X‖² ≤ nSNR. Since X is uniformly distributed over V_c and Λ_c is good for MSE quantization, Theorem 20.1 implies that ‖X‖² ≤ nSNR with high probability. Thus, whenever the power constraint is violated we can just transmit 0 instead of X, and this will have a negligible effect on the error probability of the scheme.

The receiver scales its observation by a factor α > 0 (to be specified later), subtracts the dither U, and reduces the result modulo the coarse lattice:

    Y_eff = [αY − U] mod Λ_c
          = [X − U + (α − 1)X + αZ] mod Λ_c
          = [t + (α − 1)X + αZ] mod Λ_c   (20.11)
          = [t + Z_eff] mod Λ_c,   (20.12)

where we have used the modulo distributive law in (20.11), and

    Z_eff = (α − 1)X + αZ   (20.13)

is effective noise, statistically independent of t, with effective variance

    σ²_eff(α) ≜ (1/n) E‖Z_eff‖² < α² + (1 − α)² SNR.   (20.14)

Since Z is semi norm-ergodic, and X is uniformly distributed over the Voronoi region of a lattice that is good for MSE quantization, Theorem 20.1 implies that Z_eff is semi norm-ergodic with effective variance σ²_eff(α). Setting α = SNR/(1 + SNR), so as to minimize the upper bound on σ²_eff(α), results in effective variance σ²_eff < SNR/(1 + SNR).
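A one-line numerical check of ours (the SNR value is arbitrary) that the MMSE coefficient α = SNR/(1 + SNR) indeed minimizes the bound α² + (1 − α)²·SNR in (20.14):

```python
import numpy as np

SNR = 4.0                                   # illustrative value
alpha = np.linspace(0.0, 1.0, 100_001)
bound = alpha**2 + (1 - alpha)**2 * SNR     # upper bound on sigma_eff^2(alpha)
i = np.argmin(bound)
print(f"argmin = {alpha[i]:.4f}, SNR/(1+SNR) = {SNR/(1+SNR):.4f}, min value = {bound[i]:.4f}")
```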

The receiver next computes

    t̂ = [Q_{Λ_f}(Y_eff)] mod Λ_c = [Q_{Λ_f}(t + Z_eff)] mod Λ_c,   (20.15)

and outputs the message corresponding to t̂ as its estimate. Since Λ_f is good for coding, Z_eff is semi norm-ergodic, and

    r²_eff(Λ_f)/n = (1 + ε) · SNR/(1 + SNR) > σ²_eff,

we have that Pr(t̂ ≠ t) → 0 as the lattice dimension tends to infinity. Thus, we have proved the following.

Theorem 20.3. There exists a coding scheme based on a nested lattice pair that reliably achieves any rate below (1/2) log(1 + SNR) with lattice decoding, for all additive semi norm-ergodic channels. In particular, if the additive noise is AWGN, this scheme is capacity achieving.

Remark 20.3. In the Mod-Λ scheme the error probability does not depend on the chosen message, so that P_{e,max} = P_{e,avg}. However, this required common randomness in the form of the dither U. By a standard averaging argument it follows that there exists some fixed shift u that achieves the same, or better, P_{e,avg}. However, for a fixed shift the error probability is no longer independent of the chosen message.


20.4 Dirty Paper Coding

Assume now that the channel is

    Y = X + S + Z,

where Z is unit-variance semi norm-ergodic noise, X is subject to the same power constraint ‖X‖² ≤ nSNR as before, and S is some arbitrary interference vector, known to the transmitter but not to the receiver.

Naively, one might think that the encoder can handle the interference S just by subtracting it from the transmitted codeword. However, if the codebook is designed to exactly meet the power constraint, after subtracting S the power constraint will be violated. Moreover, if ‖S‖² > nSNR, this approach is simply not feasible.

Using the Mod-Λ scheme, S can be cancelled out with no cost in performance. Specifically, instead of transmitting X = [t + U] mod Λ_c, the transmitted signal in the presence of known interference is

    X = [t + U − αS] mod Λ_c.

Clearly, the power constraint is not violated, as X ∼ Uniform(V_c) due to the Crypto Lemma (now U should also be independent of S). The decoder is exactly the same as in the Mod-Λ scheme with no interference. It is easy to verify that the interference is completely cancelled out, and any rate below (1/2) log(1 + SNR) can still be achieved.

Remark 20.4. When Z is Gaussian and S is Gaussian, there is a scheme based on random codes that reliably achieves (1/2) log(1 + SNR). For arbitrary S, to date, only lattice-based coding schemes are known to achieve the interference-free capacity. There are many more scenarios where lattice codes can reliably achieve better rates than the best known random coding schemes.

20.5 Construction of Good Nested Lattice Pairs

We now briefly describe a method for constructing nested lattice pairs. Our construction is based on starting with a linear code over a prime finite field and embedding it periodically in Rⁿ to form a lattice.

Definition 20.6 (p-ary Construction A). Let p be a prime number, and let F ∈ Z_p^{k×n} be a k × n matrix whose entries are all members of the finite field Z_p. The matrix F generates a linear p-ary code

    C(F) ≜ {x ∈ Z_p^n : x = [wᵀF] mod p, w ∈ Z_p^k}.

The p-ary Construction A lattice induced by the matrix F is defined as

    Λ(F) ≜ p⁻¹ C(F) + Zⁿ.

Note that any point in Λ(F) can be decomposed as x = p⁻¹c + a for some c ∈ C(F) (where we identify the elements of Z_p with the integers {0, 1, . . . , p − 1}) and a ∈ Zⁿ. Thus, for any x₁, x₂ ∈ Λ(F) we have

    x₁ + x₂ = p⁻¹(c₁ + c₂) + a₁ + a₂
            = p⁻¹([c₁ + c₂] mod p + pã) + a₁ + a₂
            = p⁻¹c + a
            ∈ Λ(F),

where c = [c₁ + c₂] mod p ∈ C(F) due to the linearity of C(F), and ã and a are some vectors in Zⁿ. It can be verified similarly that for any x ∈ Λ(F) it holds that −x ∈ Λ(F), and that if all codewords in C(F) are distinct, then Λ(F) has a positive minimum distance. Thus, Λ(F) is indeed a lattice. Moreover, if F is full rank over Z_p, then the number of distinct codewords in C(F) is p^k. Consequently, the number of lattice points in every integer shift of the unit cube is p^k, so the corresponding Voronoi region must satisfy Vol(V) = p^{−k}.
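A tiny numerical sketch of ours of p-ary Construction A (the values p, k, n and the matrix F are arbitrary choices for illustration): build C(F), embed it as p⁻¹C(F) + Zⁿ, and count the lattice points in the unit cube [0, 1)ⁿ, which should be p^k when F is full rank over Z_p.

```python
import itertools
import numpy as np

p, k, n = 3, 2, 3
F = np.array([[1, 0, 2],
              [0, 1, 1]])                      # k x n generator over Z_p (full rank)

# Linear p-ary code C(F) = { [w^T F] mod p : w in Z_p^k }
codewords = {tuple((np.array(w) @ F) % p) for w in itertools.product(range(p), repeat=k)}
print("number of codewords:", len(codewords))            # p^k = 9

# Construction A points in the unit cube [0,1)^n are exactly p^{-1} * C(F)
points_in_cube = {tuple(np.array(c) / p) for c in codewords}
print("lattice points per unit cube:", len(points_in_cube))  # also p^k
```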

Similarly, we can construct a nested lattice pair from a pair of nested linear codes. Let 0 ≤ k′ < k and let F′ be the sub-matrix obtained by taking only the first k′ rows of F. The matrix F′ generates a linear code C′(F′) that is nested in C(F), i.e., C′(F′) ⊂ C(F). Consequently we have that Λ(F′) ⊂ Λ(F), and the nesting ratio is

    Γ(Λ(F), Λ(F′)) = p^{(k−k′)/n}.

An advantage of this nested lattice construction for Voronoi constellations is that there is a very simple mapping between messages and codewords in L = Λ(F) ∩ V(Λ(F′)). Namely, we can index our set of 2^{nR} = p^{k−k′} messages by the vectors in Z_p^{k−k′}. Then, for each message vector w ∈ Z_p^{k−k′}, the corresponding codeword in L = Λ(F) ∩ V(Λ(F′)) is obtained by constructing the vector

    w̃ᵀ = [0 ··· 0  wᵀ] ∈ Z_p^k   (with k′ leading zeros),   (20.16)

and taking t = t(w) = [[w̃ᵀF] mod p] mod Λ(F′). Also, in order to specify the codebook L, only the (finite field) generating matrix F is needed.

If we take the elements of F to be i.i.d. and uniform over Z_p, we get a random ensemble of nested lattice codes. It can be shown that if p grows fast enough with the dimension n (taking p = O(n^{(1+ε)/2}) suffices), almost all pairs in the ensemble have the property that both the fine and the coarse lattice are good for both coding and MSE quantization [OE15].

Disclaimer: This text is a very brief and non-exhaustive survey of the applications of latticesin information theory. For a comprehensive treatment, see [Zam14].


§ 21. Channel coding: energy-per-bit, continuous-time channels

21.1 Energy per bit

Consider the additive Gaussian noise channel:

    Y_i = X_i + Z_i,   Z_i ∼ N(0, N₀/2).   (21.1)

In the last lecture, we analyzed the maximum number of information bits, M*(n, ε, P), that can be pumped through the channel in n uses under the energy constraint P. Today we study its counterpart: without any constraint on the number of channel uses, what is the minimum energy E*(k, ε) needed to send k information bits?

Definition 21.1 ((E, 2^k, ε) code). For a channel W → X^∞ → Y^∞ → Ŵ, where Y^∞ = X^∞ + Z^∞, an (E, 2^k, ε) code is a pair of encoder and decoder,

    f : [2^k] → R^∞,   g : R^∞ → [2^k],

such that 1) ∀m, ‖f(m)‖₂² ≤ E; and 2) P[g(f(W) + Z^∞) ≠ W] ≤ ε.

Definition 21.2 (Fundamental limit).

    E*(k, ε) = min{E : ∃ (E, 2^k, ε) code}.

Note: The operational meaning of lim_{ε→0} E*(k, ε) is the smallest battery one needs in order to send k bits without any time constraint; below that level reliable communication is impossible.

Theorem 21.1 ((E_b/N₀)_min = −1.6 dB).

    lim_{ε→0} limsup_{k→∞} E*(k, ε)/k = N₀/log₂ e,   where 1/log₂ e ≈ −1.6 dB.   (21.2)

Proof.


1. ("≥", converse part)

    (1 − ε)k − h(ε) ≤ d((1 − ε) ‖ 1/M)                               (Fano)
                    ≤ I(W; Ŵ)                                        (data processing for divergence)
                    ≤ I(X^∞; Y^∞)                                    (data processing for mutual information)
                    ≤ ∑_{i=1}^∞ I(X_i; Y_i)                          (lim_{n→∞} I(Xⁿ; U) = I(X^∞; U))
                    ≤ ∑_{i=1}^∞ (1/2) log(1 + E[X_i²]/(N₀/2))        (Gaussian)
                    ≤ (log e / 2) ∑_{i=1}^∞ E[X_i²]/(N₀/2)           (linearization)
                    ≤ (E/N₀) log e

    ⇒ E*(k, ε)/k ≥ (N₀/log e) · (1 − ε − h(ε)/k).

2. ("≤", achievability part)
Notice that an (n, 2^k, ε, P) code for the AWGN channel is also an (nP, 2^k, ε) code for the energy problem without time constraint. Therefore,

    log₂ M*_max(n, ε, P) ≥ k  ⇒  E*(k, ε) ≤ nP.

For any P, take k_n = ⌊log M*_max(n, ε, P)⌋; then E*(k_n, ε)/k_n ≤ nP/k_n for all n, and taking the limit:

    limsup_{n→∞} E*(k_n, ε)/k_n ≤ limsup_{n→∞} nP / log M*_max(n, ε, P)
                                = P / liminf_{n→∞} (1/n) log M*_max(n, ε, P)
                                = P / ((1/2) log(1 + P/(N₀/2))).

Choose P to get the lowest upper bound:

    limsup_{n→∞} E*(k_n, ε)/k_n ≤ inf_{P≥0} P / ((1/2) log(1 + P/(N₀/2)))
                                = lim_{P→0} P / ((1/2) log(1 + P/(N₀/2)))
                                = N₀/log₂ e.

Note: [Remark] In order to send information reliably at Eb/N0 = −1.6dB, infinitely many timeslots are needed, and the information rate (spectral efficiency) is thus 0. In order to have non-zerospectral efficiency, one necessarily has to step back from −1.6 dB.
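The tradeoff in the achievability proof is easy to see numerically. The sketch below is ours (N₀ = 1 is an arbitrary normalization): it evaluates the function P ↦ P / (½ log₂(1 + 2P/N₀)), which decreases toward N₀/log₂ e as P → 0, i.e. toward −1.59 dB.

```python
import numpy as np

N0 = 1.0
for P in (10.0, 1.0, 0.1, 0.01, 0.001):
    eb = P / (0.5 * np.log2(1 + 2 * P / N0))        # energy per bit at power P
    print(f"P={P:7.3f}: Eb = {eb:.4f},  Eb/N0 = {10*np.log10(eb/N0):6.2f} dB")
print(f"limit N0/log2(e) = {N0/np.log2(np.e):.4f}  ({10*np.log10(1/np.log2(np.e)):.2f} dB)")
```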


Note: [PPM code] The following code, pulse-position modulation (PPM), is very efficient in terms of E_b/N₀.

    PPM encoder: ∀m, f(m) = (0, 0, . . . , √E, . . . ),  with √E in the m-th location.   (21.3)

It is not hard to derive an upper bound on the probability of error that this code achieves [PPV11, Theorem 2]:

    ε ≤ E[ min{ M Q(√(2E/N₀) + Z), 1 } ],   Z ∼ N(0, 1).

In fact, the code can be further slightly optimized by subtracting the common center of gravity (2^{−k}√E, . . . , 2^{−k}√E, . . .) and rescaling each codeword to satisfy the power constraint. The resulting constellation (the simplex code) is conjectured to be the non-asymptotic optimum in terms of E_b/N₀ for small ε (the "simplex conjecture").
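The PPM union bound above is simple to evaluate by Monte Carlo. The sketch below is ours; the payload size k and the operating point E_b/N₀ are arbitrary assumptions chosen only to illustrate the computation of E[min{M·Q(√(2E/N₀) + Z), 1}].

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
k = 10                                   # payload bits, M = 2^k codewords
M = 2 ** k
EbN0_dB = 2.0                            # illustrative operating point
E_over_N0 = k * 10 ** (EbN0_dB / 10)     # E/N0 = k * (Eb/N0)

Z = rng.standard_normal(200_000)
eps_bound = np.mean(np.minimum(M * norm.sf(np.sqrt(2 * E_over_N0) + Z), 1.0))
print(f"PPM union bound on error probability: {eps_bound:.2e}")
```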

21.2 What is N₀?

In the above discussion we assumed Z_i ∼ N(0, N₀/2), but how do we determine N₀? In reality the signals are continuous-time (CT) processes; the continuous-time AWGN channel for the RF signals is modeled as

    Y(t) = X(t) + N(t),   (21.4)

where the noise N(t) (added at the receiver antenna) is a real stationary ergodic process assumed to be "white Gaussian noise" with single-sided PSD N₀. Figure 21.1 at the end illustrates the communication architecture. In the following discussion, we shall find the equivalent discrete-time (DT) AWGN model for the continuous-time (CT) AWGN model in (21.4), and identify the relationship between N₀ in the DT model and N(t) in the CT model.

• Goal: communication in the band f_c ± B/2 (the (possibly complex) baseband signal lies in [−W, +W], where W = B/2).

• Observations:

1. Any signal band-limited to f_c ± B/2 can be produced by this architecture.

2. At the C/D conversion step, the LPF followed by sampling at B samples/sec provides a sufficient statistic for estimating X(t), X_B(t), as well as X_i.

First of all, what is N(t) in (21.4)?

Engineers' definition of N(t)


Estimate the average power dissipated in the resistor:

    lim_{T→∞} (1/T) ∫₀^T F_t² dt  =(ergodicity)=  E[F²]  =(*)=  N₀B.

If, for some constant N₀, (*) holds for any narrow band with center frequency f_c and bandwidth B, then N(t) is called "white noise" with one-sided PSD N₀.

Typically, white noise comes from thermal noise at the receiver antenna. Thus:

    N₀ ≈ kT,   (21.5)

where k = 1.38 × 10⁻²³ J/K is the Boltzmann constant and T is the absolute temperature. The unit of N₀ is Watt/Hz = J.

An intuitive explanation of (21.5) is as follows: the thermal energy carried by each microscopic degree of freedom (dof) is approximately kT/2; for bandwidth B and duration T there are in total 2BT dof; by the "white noise" definition, the total energy of the noise is

    N₀BT = (kT/2) · 2BT  ⇒  N₀ = kT.

Mathematicians' definition of N(t)

Denote the set of all real finite-energy signals f(t) by L²(R); it is a vector space with the inner product of two signals f(t), g(t) defined by

    ⟨f, g⟩ = ∫_{−∞}^{∞} f(t) g(t) dt.

Definition 21.3 (White noise). N(t) is a white noise with two-sided PSD equal to the constant N₀/2 if, for all f, g ∈ L²(R) such that ∫_{−∞}^∞ f²(t) dt = ∫_{−∞}^∞ g²(t) dt = 1, we have:

1.  ⟨f, N⟩ ≜ ∫_{−∞}^∞ f(t) N(t) dt ∼ N(0, N₀/2).   (21.6)

2. The joint distribution of (⟨f, N⟩, ⟨g, N⟩) is jointly Gaussian with covariance equal to (N₀/2)⟨f, g⟩.

Note: By this definition, N(t) is not a stochastic process; rather, it is a collection of linear mappings that map any f ∈ L²(R) to a Gaussian random variable.


Note: Informally, we write:

    N(t) is white noise with one-sided PSD N₀ (or two-sided PSD N₀/2)  ⟺  E[N(t)N(s)] = (N₀/2) δ(t − s).   (21.7)

Note: The concept of one-sided PSD arises when N(t) is necessarily real, since in that case the power spectral density is symmetric around 0, and thus the noise power in a band [a, b] is

    noise power = ∫_a^b F_one-sided(f) df = (∫_a^b + ∫_{−b}^{−a}) F_two-sided(f) df,

where F_one-sided(f) = 2 F_two-sided(f). In the theory of stochastic processes it is uncommon to talk about one-sided PSD, but in engineering it is common.

Verify the equivalence between the CT and DT models

First, consider the relation between RF signals and baseband signals:

    X(t) = Re(X_B(t) √2 e^{jω_c t}),
    Y_B(t) = √2 · LPF₂(Y(t) e^{−jω_c t}),

where ω_c = 2πf_c. The LPF₂, with cutoff frequency ∼ (3/4) f_c, serves to kill the high-frequency component after demodulation, and the amplifier of magnitude √2 serves to preserve the total energy of the signal, so that in the absence of noise we have Y_B(t) = X_B(t). Therefore,

    Y_B(t) = X_B(t) + Ñ(t) ∈ C,

where Ñ(t) is a complex Gaussian white noise with

    E[Ñ(t) Ñ(s)*] = N₀ δ(t − s).

Notice that after demodulation the PSD of the noise is N₀/2, with N₀/4 in the real part and N₀/4 in the imaginary part, and after the √2 amplifier the PSD of the noise is restored to N₀/2 in both the real and the imaginary part.

Next, consider the equivalent discrete-time signals:

    X_B(t) = ∑_{i=−∞}^{∞} X_i sinc_B(t − i/B),
    Y_i = ∫_{−∞}^{∞} Y_B(t) sinc_B(t − i/B) dt,
    Y_i = X_i + Z_i,

where the additive noise Z_i is given by

    Z_i = ∫_{−∞}^{∞} Ñ(t) sinc_B(t − i/B) dt ∼ i.i.d. CN(0, N₀)   (by (21.6)).

If we focus on the real part of all signals, this is consistent with the real AWGN channel model in (21.1).

Finally, the energy of the signal is preserved:

    ∑_{i=−∞}^{∞} |X_i|² = ‖X_B(t)‖₂² = ‖X(t)‖₂².

Note: [Punchline]

    CT AWGN (band-limited)   ⟺  DT C-AWGN
    two-sided PSD N₀/2        ⟺  Z_i ∼ CN(0, N₀)
    energy = ∫ X(t)² dt       ⟺  energy = ∑ |X_i|²

21.3 Capacity of the continuous-time band-limited AWGN channel

Theorem 21.2. Let M*_CT(T, ε, P) be the maximum number of waveforms that can be sent through the channel

    Y(t) = X(t) + N(t),   E[N(t)N(s)] = (N₀/2) δ(t − s),

such that:

1. transmission takes place in the duration [0, T];

2. the waveforms are band-limited to [f_c − B/2, f_c + B/2] for some large carrier frequency;

3. the input energy is constrained to ∫₀^T x²(t) dt ≤ TP;

4. the error probability satisfies P[Ŵ ≠ W] ≤ ε.

Then

    lim_{ε→0} liminf_{T→∞} (1/T) log M*_CT(T, ε, P) = B log(1 + P/(N₀B)).   (21.8)

Proof. Consider the DT equivalent C-AWGN channel of this CT model; we have that

    (1/T) log M*_CT(T, ε, P) = (1/T) log M*_{C-AWGN}(BT, ε, P/B).

This is because:

• in time T we get to choose BT complex samples;


• the power constraint in the DT model changes because for blocklength BT we have

    ∑_{i=1}^{BT} |X_i|² = ‖X(t)‖₂² ≤ PT,

thus the per-letter power constraint is P/B.

Calculate the rate of the equivalent DT C-AWGN channel and we are done.

Note: the above "theorem" is not rigorous, since conditions 1 and 2 are mutually exclusive: a non-trivial time-limited signal cannot be band-limited. Rigorously, one should relax condition 2 by constraining the signal to have vanishing out-of-band energy as T → ∞. A rigorous approach to this question leads to the theory of prolate spheroidal functions.

21.4 Capacity of the continuous-time band-unlimited AWGN channel

In the limit of large bandwidth B, the capacity formula (21.8) yields

    C_{B=∞}(P) = lim_{B→∞} B log(1 + P/(N₀B)) = (P/N₀) log e.

It turns out that this result is easy to prove rigorously.
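Before the formal statement, a quick numerical sanity check of the wideband limit (our sketch; P = N₀ = 1 are arbitrary normalizations): B log(1 + P/(N₀B)) increases toward (P/N₀) log e as B grows.

```python
import numpy as np

P, N0 = 1.0, 1.0
for B in (1, 10, 100, 1000, 10000):
    print(f"B={B:6d}: B*log(1 + P/(N0*B)) = {B*np.log(1 + P/(N0*B)):.6f} nats/s")
print(f"limit (P/N0)*log(e) = {P/N0:.6f} nats/s")
```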

Theorem 21.3. Let M*(T, ε, P) be the maximum number of waveforms that can be sent through the channel

    Y(t) = X(t) + N(t),   E[N(t)N(s)] = (N₀/2) δ(t − s),

such that each waveform x(t)

1. is non-zero only on [0, T];

2. has input energy constrained to ∫₀^T x²(t) dt ≤ TP;

3. achieves error probability P[Ŵ ≠ W] ≤ ε.

Then

    lim_{ε→0} liminf_{T→∞} (1/T) log M*(T, ε, P) = (P/N₀) log e.   (21.9)

Proof. Note that the space of all square-integrable functions on [0, T], denoted L²[0, T], has a countable basis (e.g. sinusoids). Thus, by changing to that basis we may assume the equivalent channel model

    Y_j = X_j + Z_j,   Z_j ∼ N(0, N₀/2),

and energy constraint (dependent upon the duration T):

    ∑_{j=1}^∞ X_j² ≤ PT.

But then the problem is equivalent to the energy-per-bit one, and hence

    log₂ M*(T, ε, P) = k  ⟺  E*(k, ε) = PT.


Thus,

    lim_{ε→0} liminf_{T→∞} (1/T) log₂ M*(T, ε, P) = P / ( lim_{ε→0} limsup_{k→∞} E*(k, ε)/k ) = (P/N₀) log₂ e,

where the last step is by Theorem 21.1.


Figure 21.1: DT / CT AWGN model


21.5 Capacity per unit cost

Generalizing the energy-per-bit setting of Theorem 21.1, we get the problem of capacity per unit cost:

1. Given a random transformation P_{Y^∞|X^∞} and a cost function c : X → R₊, we let

    M*(E, ε) = max{M : ∃ (E, M, ε)-code},

where an (E, M, ε)-code is defined as a map [M] → X^∞ with every codeword x^∞ satisfying

    ∑_{t=1}^∞ c(x_t) ≤ E.   (21.10)

2. Capacity per unit cost is defined as

    C_puc ≜ lim_{ε→0} liminf_{E→∞} (1/E) log M*(E, ε).

3. Let C(P) be the capacity-cost function of the channel (in the usual sense of capacity, as defined in (19.1)). Assuming P₀ = 0 and C(0) = 0, it is not hard to show that

    C_puc = sup_P C(P)/P = lim_{P→0} C(P)/P = (d/dP) C(P) |_{P=0}.

4. The surprising discovery of Verdú is that one can compute C_puc directly, without computing C(P). This is a significant help, as for many practical channels C(P) is unknown. Additionally, this gives yet another fundamental meaning to KL divergence.

Theorem 21.4. For a stationary memoryless channel P_{Y^∞|X^∞} = ∏ P_{Y|X} with P₀ = c(x₀) = 0 (i.e. there is a symbol of zero cost), we have

    C_puc = sup_{x ≠ x₀} D(P_{Y|X=x} ‖ P_{Y|X=x₀}) / c(x).

In particular, C_puc = ∞ if there exists x₁ ≠ x₀ with c(x₁) = 0.

Proof. Let

    C_V = sup_{x ≠ x₀} D(P_{Y|X=x} ‖ P_{Y|X=x₀}) / c(x).

Converse: Consider an (E, M, ε) code W → X^∞ → Y^∞ → Ŵ. Introduce an auxiliary distribution Q_{W,X^∞,Y^∞,Ŵ}, where the channel is a useless one:

    Q_{Y^∞|X^∞} = Q_{Y^∞} ≜ P^∞_{Y|X=x₀}.

That is, the overall factorization is

    Q_{W,X^∞,Y^∞,Ŵ} = P_W P_{X^∞|W} Q_{Y^∞} P_{Ŵ|Y^∞}.


Then, as usual, we have from data processing for divergence

    (1 − ε) log M − h(ε) ≤ d(1 − ε ‖ 1/M)   (21.11)
                         ≤ D(P_{W,X^∞,Y^∞,Ŵ} ‖ Q_{W,X^∞,Y^∞,Ŵ})   (21.12)
                         = D(P_{Y^∞|X^∞} ‖ Q_{Y^∞} | P_{X^∞})   (21.13)
                         = E[ ∑_{t=1}^∞ d(X_t) ],   (21.14)

where we denoted for convenience

    d(x) ≜ D(P_{Y|X=x} ‖ P_{Y|X=x₀}).

By the definition of C_V we have d(x) ≤ c(x) C_V. Thus, continuing from (21.14) we obtain

    (1 − ε) log M − h(ε) ≤ C_V · E[ ∑_{t=1}^∞ c(X_t) ] ≤ C_V · E,

where the last step is by the cost constraint (21.10). Thus, dividing by E and taking limits we get

    C_puc ≤ C_V.

Achievability: We generalize the PPM code (21.3). For each x₁ ∈ X and n ∈ Z₊ we define the encoder f as follows:

    f(1) = (x₁, . . . , x₁ [n times], x₀, . . . , x₀ [n(M−1) times])   (21.15)
    f(2) = (x₀, . . . , x₀ [n times], x₁, . . . , x₁ [n times], x₀, . . . , x₀ [n(M−2) times])   (21.16)
      ⋮   (21.17)
    f(M) = (x₀, . . . , x₀ [n(M−1) times], x₁, . . . , x₁ [n times])   (21.18)

Now, by Stein's lemma there exists a subset S ⊂ Yⁿ with the property that

    P[Yⁿ ∈ S | Xⁿ = (x₁, . . . , x₁)] ≥ 1 − ε₁,   (21.19)
    P[Yⁿ ∈ S | Xⁿ = (x₀, . . . , x₀)] ≤ exp{−n D(P_{Y|X=x₁} ‖ P_{Y|X=x₀}) + o(n)}.   (21.20)

Therefore, we propose the following (suboptimal!) decoder:

    Yⁿ ∈ S          ⇒ Ŵ = 1   (21.21)
    Y^{2n}_{n+1} ∈ S  ⇒ Ŵ = 2   (21.22)
      ⋮   (21.23)

From the union bound we find that the overall probability of error is bounded by

    ε ≤ ε₁ + M exp{−n D(P_{Y|X=x₁} ‖ P_{Y|X=x₀}) + o(n)}.


At the same time, the total cost of each codeword is given by n·c(x₁). Thus, taking n → ∞ and after straightforward manipulations, we conclude that

    C_puc ≥ D(P_{Y|X=x₁} ‖ P_{Y|X=x₀}) / c(x₁).

This holds for any symbol x₁ ∈ X, and so we are free to take the supremum over x₁ to obtain C_puc ≥ C_V, as required.

21.5.1 Energy-per-bit for the AWGN channel subject to fading

Consider a stationary memoryless Gaussian channel with fading H_j (unknown at the receiver). Namely,

    Y_j = H_j X_j + Z_j,   H_j ∼ N(0, 1) ⊥⊥ Z_j ∼ N(0, N₀/2).

The cost function is the usual quadratic one, c(x) = x². As we discussed previously, cf. (19.8), the capacity-cost function C(P) is unknown in closed form, but is known to behave drastically differently from the case of non-fading AWGN (i.e. when H_j = 1). So here the previous theorem comes in handy, as we cannot just compute C′(0). Let us perform the simple computation required, cf. (1.17):

    C_puc = sup_{x≠0} D(N(0, x² + N₀/2) ‖ N(0, N₀/2)) / x²   (21.24)
          = (1/N₀) sup_{x≠0} ( log e − log(1 + 2x²/N₀) / (2x²/N₀) )   (21.25)
          = (log e)/N₀.   (21.26)

Comparing with Theorem 21.1, we discover that, surprisingly, the capacity per unit cost is unaffected by the presence of fading. In other words, the random multiplicative noise, which is so detrimental at high SNR, appears to be much more benign at low SNR (recall that C_puc = C′(0)). There is one important difference, however. It should be noted that the supremization over x in (21.25) is solved at x = ∞. Following the proof of the converse bound, we conclude that any code hoping to achieve the optimal C_puc must satisfy a strange constraint:

    ∑_t x_t² 1{|x_t| ≥ A} ≈ ∑_t x_t²   ∀A > 0,

i.e. the total energy expended by each codeword must be almost entirely concentrated in very large spikes. Such a coding method is called "flash signalling". Thus, we see that, unlike non-fading AWGN (for which, due to rotational symmetry, all codewords can be made "mellow"), the only hope of achieving the full C_puc in the presence of fading is to signal in huge bursts of energy.
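The sketch below (ours; N₀ = 1 is an arbitrary normalization) evaluates the ratio in (21.24) numerically and shows that it increases toward (log e)/N₀ only as x → ∞, which is the analytic footprint of flash signalling.

```python
import numpy as np

N0 = 1.0

def ratio(x):
    """D(N(0, x^2 + N0/2) || N(0, N0/2)) / x^2, in nats."""
    s = 2 * x**2 / N0
    return 0.5 * (s - np.log(1 + s)) / x**2

for x in (0.1, 1.0, 10.0, 100.0, 1000.0):
    print(f"x={x:7.1f}: ratio = {ratio(x):.6f}")
print(f"limit (log e)/N0 = {1/N0:.6f} nats per unit energy")
```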

This effect manifests itself in the speed of convergence to C_puc with increasing constellation sizes. Namely, the energy-per-bit E*(k, ε)/k behaves as

    E*(k, ε)/k = (−1.59 dB) + √(const/k) · Q⁻¹(ε)           (AWGN)   (21.27)
    E*(k, ε)/k = (−1.59 dB) + ∛(log k / k) · (Q⁻¹(ε))²      (fading)   (21.28)

Fig. 21.2 shows numerical details.


Figure 21.2: Comparing the energy-per-bit required to send a packet of k bits for different channel models (curves represent upper and lower bounds on the unknown optimal value E*(k, ε)/k). As a comparison: to get to −1.5 dB one has to code over 6·10⁴ data bits when the channel is non-fading AWGN or fading AWGN with H_j known perfectly at the receiver. For fading AWGN without knowledge of H_j (noCSI), one has to code over at least 7·10⁷ data bits to get to the same −1.5 dB. Plot generated via [Spe15].


§ 22. Advanced channel coding. Source-Channel separation.

Topics: Strong Converse, Channel Dispersion, Joint Source Channel Coding (JSCC)

22.1 Strong Converse

We begin by stating the main theorem.

Theorem 22.1. For any stationary memoryless channel with either |A| < ∞ or |B| < ∞ we have C_ε = C for 0 < ε < 1.

Remark: In Theorem 18.4, we showed that C ≤ C_ε ≤ C/(1 − ε). Now we are asserting that equality holds for every ε. Our previous converse arguments (Theorem 16.4, based on Fano's inequality) showed that communication with an arbitrarily small error probability is possible only when using rate R < C; the strong converse shows that when you try to communicate at any rate above capacity, R > C, the probability of error goes to 1 (typically with exponential speed in n). In other words,

    ε*(n, exp(nR)) →  0  if R < C,   1  if R > C,

where ε*(n, M) is the inverse of M*(n, ε) defined in (18.3).

In practice, engineers observe this effect in the form of waterfall plots, which depict the dependence of the error probability of a given communication system (code + modulation) on the SNR.

(Figure: waterfall plot of the error probability P_e, from 1 down to 10⁻⁵, versus SNR.)

Below a certain SNR, the probability of error shoots up to 1, so that the receiver will only see garbage.

Proof. We will give a sketch of the proof. Take an (n, M, ε)-code for the channel P_{Y|X}. The main trick is to consider an auxiliary channel Q_{Y|X} which is easier to analyze.

(Diagram: W → Xⁿ → Yⁿ → Ŵ, with the true channel P_{Yⁿ|Xⁿ} or the auxiliary channel Q_{Yⁿ|Xⁿ} in the middle.)


Sketch 1: Here, we take Q_{Yⁿ|Xⁿ} = (P*_Y)ⁿ, where P*_Y is the capacity-achieving output distribution (caod) of the channel P_{Y|X}.¹ Note that for communication purposes, Q_{Yⁿ|Xⁿ} is a useless channel; it ignores the input and randomly picks a member of the output space according to (P*_Y)ⁿ, so that Xⁿ and Yⁿ are decoupled (independent). Consider the probability of error under each channel:

    Q[W = Ŵ] = 1/M   (blindly guessing the sent codeword)
    P[W = Ŵ] = 1 − ε.

Since the random variable 1{W = Ŵ} has a huge mass under P and small mass under Q, this looks like a great binary hypothesis test for distinguishing the two distributions P_{W Xⁿ Yⁿ Ŵ} and Q_{W Xⁿ Yⁿ Ŵ}. Since no hypothesis test can beat the optimal Neyman-Pearson test, we get the upper bound

    β_{1−ε}(P_{W Xⁿ Yⁿ Ŵ}, Q_{W Xⁿ Yⁿ Ŵ}) ≤ 1/M.   (22.1)

(Recall that β_α(P, Q) = inf_{P[E]≥α} Q[E].) Since the likelihood ratio is a sufficient statistic for this hypothesis test, we can test only between

    P_{W Xⁿ Yⁿ Ŵ} / Q_{W Xⁿ Yⁿ Ŵ} = (P_W P_{Xⁿ|W} P_{Yⁿ|Xⁿ} P_{Ŵ|Yⁿ}) / (P_W P_{Xⁿ|W} (P*_Y)ⁿ P_{Ŵ|Yⁿ}) = P_{XⁿYⁿ} / (P_{Xⁿ} (P*_Y)ⁿ).

Therefore, the inequality above becomes

    β_{1−ε}(P_{XⁿYⁿ}, P_{Xⁿ}(P*_Y)ⁿ) ≤ 1/M.   (22.2)

Computing the LHS of this bound need not be easy, since generally we know P_{Y|X} and P*_Y but cannot assume anything about P_{Xⁿ}, which depends on the code. (Note that Xⁿ is the output of the encoder and is uniformly distributed on the codebook for deterministic encoders.) Certain tricks are needed to remove the dependency on the codebook. However, in case the channel is "symmetric" the dependence on the codebook disappears; this is shown in the following example for the BSC. To treat the general case one simply decomposes the channel into symmetric subchannels (for example, by considering constant-composition subcodes).

Example. For the BSC(δ)ⁿ, recall that

    P_{Yⁿ|Xⁿ}(yⁿ|xⁿ) = P_{Zⁿ}(yⁿ − xⁿ),   Zⁿ ∼ Bern(δ)ⁿ,
    (P*_Y)ⁿ(yⁿ) = 2⁻ⁿ.

From the Neyman-Pearson test, the optimal hypothesis test takes the form

    β_α(P_{XⁿYⁿ}, P_{Xⁿ}(P*_Y)ⁿ) = Q[ log(P_{XⁿYⁿ}/(P_{Xⁿ}(P*_Y)ⁿ)) ≥ γ ],  where α = P[ log(P_{XⁿYⁿ}/(P_{Xⁿ}(P*_Y)ⁿ)) ≥ γ ],

with P = P_{XⁿYⁿ} and Q = P_{Xⁿ}(P*_Y)ⁿ. For the BSC, this becomes

    log(P_{XⁿYⁿ}/(P_{Xⁿ}(P*_Y)ⁿ)) = log(P_{Zⁿ}(yⁿ − xⁿ)/2⁻ⁿ).

¹Recall from Theorem 4.5 that the caod of a random transformation always exists and is unique, whereas a caid may not exist.


So under each hypothesis, P and Q, the difference Yⁿ − Xⁿ is distributed as

    Q: Yⁿ − Xⁿ ∼ Bern(1/2)ⁿ,
    P: Yⁿ − Xⁿ ∼ Bern(δ)ⁿ.

Now all the relevant distributions are known, so we can compute β_α:

    β_α(P_{XⁿYⁿ}, P_{Xⁿ}(P*_Y)ⁿ) = β_α(Bern(δ)ⁿ, Bern(1/2)ⁿ)
                                = 2^{−n D(Bern(δ)‖Bern(1/2)) + o(n)}   (Stein's Lemma, Theorem 13.1)
                                = 2^{−n d(δ‖1/2) + o(n)}.

Putting this all together, we see that any (n, M, ε) code for the BSC satisfies

    2^{−n d(δ‖1/2) + o(n)} ≤ 1/M   ⇒   log M ≤ n d(δ‖1/2) + o(n).

Since this is satisfied by all codes, it is also satisfied by the optimal code, so we get the converse bound

    liminf_{n→∞} (1/n) log M*(n, ε) ≤ d(δ‖1/2) = log 2 − h(δ).

For a general channel, this computation can be much more difficult. The expression for β in this case is

    β_{1−ε}(P_{Xⁿ} P_{Yⁿ|Xⁿ}, P_{Xⁿ}(P*_Y)ⁿ) = 2^{−n D(P_{Y|X}‖P*_Y|P_X) + o(n)} ≤ 1/M,   (22.3)

where P_X is unknown (it depends on the code).

Explanation of (22.3): A statistician observes sequences (Xⁿ, Yⁿ):

    Xⁿ = [ 0 1 2 0 0 1 2 2 ]
    Yⁿ = [ a b b a c c a b ]

On the three marked blocks (the positions where X_i = 0), test between i.i.d. samples of P_{Y|X=0} and P*_Y, which has exponent D(P_{Y|X=0}‖P*_Y). Thus, intuitively, averaging over the composition of the codeword we get that the exponent of β is given by (22.3).

Recall that from the saddle point characterization of capacity (Theorem 4.4), for any distribution P_X we have

    D(P_{Y|X}‖P*_Y|P_X) ≤ C.   (22.4)

Thus from (22.3) and (22.1):

    log M ≤ n D(P_{Y|X}‖P*_Y|P_X) + o(n) ≤ nC + o(n).

Sketch 2: (More formal) Again, we will choose a dummy auxiliary channel Q_{Yⁿ|Xⁿ} = (Q_Y)ⁿ. However, the choice of Q_Y will depend on which of the two cases we are in:

1. If |B| < ∞ we take Q_Y = P*_Y (the caod) and note that from (18.12) we have

    ∑_y P_{Y|X}(y|x₀) log² P_{Y|X}(y|x₀) ≤ log² |B|   ∀x₀ ∈ A,

and since min_y P*_Y(y) > 0 (without loss of generality), we conclude that for any distribution of X on A we have

    Var[ log(P_{Y|X}(Y|X)/Q_Y(Y)) | X ] ≤ K < ∞   ∀P_X.   (22.5)

Furthermore, we also have from (22.4) that

    E[ log(P_{Y|X}(Y|X)/Q_Y(Y)) | X ] ≤ C   ∀P_X.   (22.6)

2. If |A| < ∞, then for each codeword c ∈ Aⁿ we define its composition as

    P̂_c(x) ≜ (1/n) ∑_{j=1}^n 1{c_j = x}.

By simple counting it is clear that from any (n, M, ε) code it is possible to select an (n, M′, ε) subcode such that a) all codewords have the same composition P₀; and b) M′ > M/n^{|A|}. Note that log M = log M′ + O(log n), and thus we may replace M with M′ and focus on the analysis of the chosen subcode. Then we set Q_Y = P_{Y|X} ◦ P₀. In this case, from (18.9) we have

    Var[ log(P_{Y|X}(Y|X)/Q_Y(Y)) | X ] ≤ K < ∞,   X ∼ P₀.   (22.7)

Furthermore, we also have

    E[ log(P_{Y|X}(Y|X)/Q_Y(Y)) | X ] = D(P_{Y|X}‖Q_Y|P₀) = I(X; Y) ≤ C,   X ∼ P₀.   (22.8)

Now, proceed as in (22.2) to get

    β_{1−ε}(P_{XⁿYⁿ}, P_{Xⁿ}(Q_Y)ⁿ) ≤ 1/M.   (22.9)

We next apply the lower bound on β from Theorem 12.5:

    γ β_{1−ε}(P_{XⁿYⁿ}, P_{Xⁿ}(Q_Y)ⁿ) ≥ P[ log( dP_{Yⁿ|Xⁿ}(Yⁿ|Xⁿ) / d∏ Q_Y(Y_i) ) ≤ log γ ] − ε.

Set log γ = nC + K′√n, with K′ to be chosen shortly, and denote for convenience

    S_n ≜ log( dP_{Yⁿ|Xⁿ}(Yⁿ|Xⁿ) / d∏ Q_Y(Y_i) ) = ∑_{j=1}^n log( dP_{Y|X}(Y_j|X_j) / dQ_Y(Y_j) ).

Conditioning on Xⁿ and using (22.6) and (22.8) we get

    P[S_n ≤ nC + K′√n | Xⁿ] ≥ P[S_n ≤ E[S_n|Xⁿ] + K′√n | Xⁿ].


From here, we apply Chebyshev's inequality and (22.5) or (22.7) to get

    P[S_n ≤ E[S_n|Xⁿ] + K′√n | Xⁿ] ≥ 1 − K/K′².

If we choose K′ so large that 1 − K/K′² > 2ε, then overall we get

    log β_{1−ε}(P_{XⁿYⁿ}, P_{Xⁿ}(Q_Y)ⁿ) ≥ −nC − K′√n + log ε.

Consequently, from (22.9) we conclude that

    log M*(n, ε) ≤ nC + O(√n),

implying the strong converse.

In summary, the take-away points for the strong converse are

1. The strong converse can be proved using binary hypothesis testing.

2. The capacity saddle point (22.4) is key.

In the homework, we will explore in detail proofs of the strong converse for the BSC and the AWGN channel.

22.2 Stationary memoryless channel without strong converse

It may seem that the strong converse should hold for an arbitrary stationary memoryless channel (it was only shown for discrete ones above). However, it turns out that there exist counterexamples. We construct one next.

Let the output alphabet be B = [0, 1]. The input alphabet A is going to be countably infinite. It will be convenient to define it as

    A = {(j, m) : j, m ∈ Z₊, 0 ≤ j ≤ m}.

The single-letter channel P_{Y|X} is defined in terms of a probability density function as

    p_{Y|X}(y|(j, m)) = a_m  if j/m ≤ y ≤ (j+1)/m,   and b_m otherwise,

where a_m, b_m are chosen to satisfy

    (1/m) a_m + (1 − 1/m) b_m = 1,   (22.10)
    (1/m) a_m log a_m + (1 − 1/m) b_m log b_m = C,   (22.11)

where C > 0 is an arbitrary fixed constant. Note that for large m we have

    a_m = (mC/log m)(1 + O(1/log m)),   (22.12)
    b_m = 1 − C/log m + O(1/log² m).   (22.13)


It is easy to see that P*_Y = Unif[0, 1] is the capacity-achieving output distribution and

    sup_{P_X} I(X; Y) = C.

Thus by Theorem 18.6 the capacity of the corresponding stationary memoryless channel is C. We next show that nevertheless the ε-capacity can be strictly greater than C.

Indeed, fix a blocklength n and consider a single-letter distribution P_X assigning equal weights to all atoms (j, m) with m = exp{2nC}. It can be shown that in this case, the distribution of the single-letter information density is given by

    i(X; Y) ≈  2nC  with probability 1/(2n),   0  with probability 1 − 1/(2n).

Thus, for the blocklength-n density we have

    (1/n) i(Xⁿ; Yⁿ) → 2C · Poisson(1/2).

Therefore, from Theorem 17.1 we get that for ε > 1 − e^{−1/2} there exist (n, M, ε)-codes with

    log M ≥ 2nC.

In particular,

    C_ε ≥ 2C   ∀ε > 1 − e^{−1/2}.

22.3 Channel Dispersion

The strong converse tells us that logM∗(n, ε) = nC + o(n) ∀ε ∈ (0, 1). An engineer sees this, andestimates logM∗ ≈ nC. However, this doesn’t give any information about the dependence of logM∗

on the error probability ε, which is hidden in the o(n) term. We unravel this in the followingtheorem.

Theorem 22.2. Consider one of the following channels:

1. DMC

2. DMC with cost constraint

3. AWGN or parallel AWGN

The following expansion holds for a fixed 0 < ε < 1/2 and n→∞

logM∗(n, ε) = nC −√nV Q−1(ε) +O(log n)

where Q−1 is the inverse of the complementary standard normal CDF, the channel capacity isC = I(X∗;Y ∗) = E[i(X∗;Y ∗)], and the channel dispersion2 is V = Var[i(X∗;Y ∗)|X∗].

2There could be multiple capacity-achieving input distributions, in which case PX∗ should be chosen as the onethat minimizes Var[i(X∗;Y ∗)|X∗]. See [PPV10a] for more details.

240

Page 241: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Proof. For achievability, we have shown (Theorem 18.7) that logM∗(n, ε) ≥ nC −√nV Q−1(ε) by

refining the proof of the noisy channel coding theorem using the CLT.The converse statement is logM∗ ≤ − log β1−ε(PXnY n , PXn(P ∗Y )n). For the BSC, we showed

that the RHS of the previous expression is

− log β1−ε(Bern(δ)n,Bern(1

2)n) = nd(δ‖1

2) +√nV Q−1(ε) + o(

√n)

(see homework) where the dispersion is

V = VarZ∼Bern(δ)

[log

Bern(δ)

Bern(12)

(Z)

].

The general proof is omitted.

Remark: This expansion only applies for certain channels (as described in the theorem). If,for example, Var[i(X;Y )] =∞, then the theorem need not hold and there are other stable (non-Gaussian) distributions that we might converge to instead. Also notice that for DMC without costconstraint

Var[i(X∗;Y ∗)|X∗] = Var[i(X∗;Y ∗)]

since (capacity saddle point!) E[i(X∗;Y ∗)|X∗ = x] = C for PX∗-almost all x.

22.3.1 Applications

As stated earlier, direct computation of M∗(n, ε) by exhaustive search doubly exponential incomplexity, and thus is infeasible in most cases. However, we can get an easily computableapproximation using the channel dispersion via

logM∗(n, ε) ≈ nC −√nV Q−1(ε)

Consider a BEC (n = 500, δ = 1/2) as an example of using this approximation. For this channel,the capacity and dispersion are

C = 1− δV = δδ

Where δ = 1− δ. Using these values, our approximation for this BEC becomes

logM∗(500, 10−3) ≈ nC −√nV Q−1(ε) = nδ −

√nδδQ−1(10−3) ≈ 215.5 bits

In the homework, for the BEC(500, 1/2) we obtained bounds 213 ≤ logM∗(500, 10−3) ≤ 217, sothis approximation falls in the middle of these bounds.

Examples of Channel Dispersion

241

Page 242: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

For a few common channels, the dispersions are

BEC: V (δ) = δδ log2 2

BSC: V (δ) = δδ log2 δ

δ

AWGN: V (P ) =P (P + 2)

2(P + 1)2log2 e (Real)

P (P + 2)

(P + 1)2log2 e (Complex)

Parallel AWGN: V (P, σ2) =L∑

j=1

VAWGN (Pjσ2j

) =log2 e

2

L∑

j=1

∣∣∣∣∣∣1−

(σ2j

T

)2∣∣∣∣∣∣

+

whereL∑

j=1

|T − σ2j |+ = P is the water-filling solution of the parallel AWGN

Punchline: Although the only machinery needed for this approximation is the CLT, the resultsproduced are incredibly useful. Even though logM∗ is nearly impossible to compute on its own, byonly finding C and V we are able to get a good approximation that is easily computable.

22.4 Normalized Rate

Suppose you’re given two codes k1 → n1 and k2 → n2, how do you fairly compare them? Perhapsthey have the following waterfall plots

k1 → n1 k2 → n2

SNR SNR

Pe Pe

10−4 10−4

P ∗ P ∗

After inspecting these plots, one may believe that the k1 → n1 code is better, since it requiresa smaller SNR to achieve the same error probability. However, there are many factors, such asblocklength, rate, etc. that don’t appear on these plots. To get a fair comparison, we can use thenotion of normalized rate. To each (n, 2k, ε)-code, define

Rnorm =k

log2M∗AWGN (n, ε, P )

≈ k

nC(P )−√nV (P )Q−1(ε)

Take ε = 10−4, and P (SNR) according to the water fall plot corresponding to Pe = 10−4, and wecan compare codes directly (see Fig. 22.1). This normalized rate gives another motivation for theexpansion given in Theorem 22.2.

22.5 Joint Source Channel Coding

Now we will examine a slightly different information transmission scenario called Joint SourceChannel Coding

242

Page 243: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

102

103

104

105

0.5

0.55

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1Normalized rates of code families over AWGN, Pe=0.0001

Blocklength, n

Nor

mal

ized

rat

e

Turbo R=1/3Turbo R=1/6Turbo R=1/4VoyagerGalileo HGATurbo R=1/2Cassini/PathfinderGalileo LGAHermitian curve [64,32] (SDD)Reed−Solomon (SDD)BCH (Koetter−Vardy)Polar+CRC R=1/2 (List dec.)ME LDPC R=1/2 (BP)

102

103

104

105

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1Normalized rates of code families over BIAWGN, Pe=0.0001

Blocklength, n

Nor

mal

ized

rat

e

Turbo R=1/3Turbo R=1/6Turbo R=1/4VoyagerGalileo HGATurbo R=1/2Cassini/PathfinderGalileo LGABCH (Koetter−Vardy)Polar+CRC R=1/2 (L=32)Polar+CRC R=1/2 (L=256)Huawei NB−LDPCHuawei hybrid−CyclicME LDPC R=1/2 (BP)

Figure 22.1: Normalized rates for various codes. Plots generated via [Spe15].

243

Page 244: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Source Encoder(JSCC)

Sk Xn

Channel Y n Decoder(JSCC)

Sk

Definition 22.1. For a Joint Source Channel Code

• Goal: P[Sk 6= Sk] ≤ ε

• Encoder: f : Ak → X n

• Decoder: g : Yn → Ak

• Fundamental Limit (Optimal probability of error): ε∗JSCC(k, n) = inff,g P[Sk 6= Sk]

where the rate is R = kn (symbol per channel use).

Note: In channel coding we are interested in transmitting M messages and all messages are bornequal. Here we want to convey the source realizations which might not be equiprobable (hasredundancy). Indeed, if Sk is uniformly distributed on, say, 0, 1k, then we are back to the channelcoding setup with M = 2k under average probability of error, and ε∗JSCC(k, n) coincides withε∗(n, 2k) defined in Section 22.1.Note: Here, we look for a clever scheme to directly encode k symbols from A into a length n channelinput such that we achieve a small probability of error over the channel. This feels like a mix of twoproblems we’ve seen: compressing a source and coding over a channel. The following theorem showsthat compressing and channel coding separately is optimal. This is a relief, since it implies that wedo not need to develop any new theory or architectures to solve the Joint Source Channel Codingproblem. As far as the leading term in the asymptotics is concerned, the following two-stage schemeis optimal: First use the optimal compressor to eliminate all the redundancy in the source, then usethe optimal channel code to add redundancy to combat the noise in the data transmission.

Theorem 22.3. Let the source Sk be stationary memoryless on a finite alphabet with entropy H.Let the channel be stationary memoryless with finite capacity C. Then

ε∗JSCC(nR, n)

→ 0 R < C/H

6→ 0 R > C/Hn→∞.

Note: Interpretation: Each source symbol has information content (entropy) H bits. Each channeluse can convey C bits. Therefore to reliably transmit k symbols over n channel uses, we needkH ≤ nC.

Proof. Achievability. The idea is to separately compress our source and code it for transmission.Since this is a feasible way to solve the JSCC problem, it gives an achievability bound. Thisseparated architecture is

Skf1−→W

f2−→ Xn PY n|Xn−→ Y n g2−→ Wg1−→ Sk

Where we use the optimal compressor (f1, g1) and optimal channel code (maximum probability oferror) (f2, g2). Let W denote the output of the compressor which takes at most Mk values. Then

(From optimal compressor)1

klogMk > H + δ =⇒ P[Sk 6= Sk(W )] ≤ ε ∀k ≥ k0

(From optimal channel code)1

nlogMk < C − δ =⇒ P[W 6= m|W = m] ≤ ε ∀m,∀k ≥ k0

244

Page 245: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Using both of these,

P[Sk 6= Sk(W )] ≤ P[Sk 6= Sk,W = W ] + P[W 6= W ]

≤ P[Sk 6= Sk(W )] + P[W 6= W ] ≤ ε+ ε

And therefore if R(H + δ) < C − δ, then ε∗ → 0. By the arbitrariness of δ > 0, we conclude theweak converse for any R > C/H.

Converse: channel-substitution proof. Let QSkSk = USkPSk where USk is the uniformdistribution. Using data processing

D(PSkSk‖QSkSk) = D(PSk‖USk) +D(PS|Sk‖PS |PSk) ≥ d(1− ε‖ 1

|A|k )

Rearranging this gives

I(Sk; Sk) ≥ d(1− ε‖ 1

|A|k )−D(PSk‖USk)

≥ − log 2 + kε log |A|+H(Sk)− k log |A|= H(Sk)− log 2− kε log |A|

Which follows from expanding out the terms. Now, normalizing and taking the sup of both sidesgives

1

nsupXn

I(Xn;Y n) ≥ 1

nH(Sk)− εk

nlog |A|+ o(1)

letting R = k/n, this shows

C ≥ RH − εR log |A| =⇒ ε ≥ RH − CR log |A| > 0

where the last expression is positive when R > C/H.Converse: usual proof. Any JSCC encoder/decoder induces a Markov chain

Sk → Xn → Y n → Sk.

Applying data processing for mutual information

I(Sk; Sk) ≤ I(Xn;Y n) ≤ supPXn

I(Xn;Y n) = nC.

On the other hand, since P[Sk 6= Sk] ≤ εn, Fano’s inequality (Theorem 5.3) yields

I(Sk; Sk) = H(Sk)−H(Sk|Sk) ≥ kH − εn log |A|k − log 2.

Combining the two givesnC ≥ kH − εn log |A|k − log 2.

Since R = kn , dividing both sides by n and sending n→∞ yields

lim infn→∞

εn ≥RH − CR log |A| .

Therefore εn does not vanish if R > C/H.

245

Page 246: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

§ 23. Channel coding with feedback

Criticism: Channels without feedback don’t exist (except for storage).Motivation: Consider the communication channel of the downlink transmission from a satellite

to earth. Downlink transmission is very expensive (power constraint at the satellite), but the uplinkfrom earth to the satellite is cheap which makes virtually noiseless feedback readily available atthe transmitter (satellite). In general, channel with noiseless feedback is interesting when suchasymmetry exists between uplink and downlink.

In the first half of our discussion, we shall follow Shannon to show that feedback gains “nothing”in the conventional setup, while in the second half, we look at situations where feedback gains a lot.

23.1 Feedback does not increase capacity for stationarymemoryless channels

Definition 23.1 (Code with feedback). An (n,M, ε)-code with feedback is specified by the encoder-decoder pair (f, g) as follows:

• Encoder: (time varying)

f1 : [M ]→ Af2 : [M ]× B → A

...

fn : [M ]× Bn−1 → A

• Decoder:

g : Bn → [M ]

such that P[W 6= W ] ≤ ε.

246

Page 247: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Here the symbol transmitted at time t depends on both the message and the history of receivedsymbols:

Xt = ft(W,Yt−1

1 )

Hence the probability space is as follows:

W ∼ uniform on [M ]

X1 = f1(W )PY |X−→ Y1

...

Xn = fn(W,Y n−11 )

PY |X−→ Yn

−→ W = g(Y n)

Definition 23.2 (Fundamental limits).

M∗fb(n, ε) = maxM : ∃(n,M, ε) code with feedback.

Cfb,ε = lim infn→∞

1

nlogM∗fb(n, ε)

Cfb = limε→0

Cfb,ε (Shannon capacity with feedback)

Theorem 23.1 (Shannon 1956). For a stationary memoryless channel,

Cfb = C = Ci = supPX

I(X;Y )

Proof. Achievability: Although it is obvious that Cfb ≥ C, we wanted to demonstrate that in factconstructing codes achieving capacity with full feedback can be done directly, without appealing to a(much harder) problem of non-feedback codes. Let πt(·) , PW |Y t(·|Y t) with the (random) posteriordistribution after t steps. It is clear that due to the knowledge of Y t on both ends, transmitter andreceiver have perfectly synchronized knowledge of πt. Now consider how the transmission progresses:

1. Initialize π0(·) = 1M

2. At (t+1)-th step, having knowledge of πt all messages are partitioned into classes Pa, accordingto the values ft+1(·, Y t):

Pa , j ∈ [M ] : ft+1(j, Y t) = a a ∈ A .

Then transmitter, possessing the knowledge of the true message W , selects a letter Xt+1 =ft+1(W,Y t).

3. Channel perturbs Xt+1 into Yt+1 and both parties compute the updated posterior:

πt+1(j) , πt(j)Bt+1(j) , Bt+1(j) ,PY |X(Yt+1|ft+1(j, Y t))∑

a∈A πt(Pa).

Notice that (this is the crucial part!) the random multiplier satisfies:

E[logBt+1(W )|Y t] =∑

a∈A

y∈Bπt(Pa) log

PY |X(y|a)∑a∈A πt(Pa)a

= I(πt, PY |X) (23.1)

where πt(a) , πt(Pa) is a (random) distribution on A.

247

Page 248: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

The goal of the code designer is to come up with such a partitioning Pa, a ∈ A that the speedof growth of πt(W ) is maximal. Now, analyzing the speed of growth of a random-multiplicativeprocess is best done by taking logs:

log πt(j) =t∑

s=1

logBs + log π0(j) .

Intutively, we expect that the process log πt(W ) resembles a random walk starting from − logM andhaving a positive drift. Thus to estimate the time it takes for this process to reach value 0 we needto estimate the upward drift. Appealing to intuition and the law of large numbers we approximate

log πt(W )− log π0(W ) ≈t∑

s=1

E[logBs] .

Finally, from (23.1) we conclude that the best idea is to select partitioning at each step in such away that πt ≈ P ∗X (caid) and this obtains

log πt(W ) ≈ tC − logM ,

implying that the transmission terminates in time ≈ logMC . The important lesson here is the

following: The optimal transmission scheme should map messages to channel inputs in such a waythat the induced input distribution PXt+1|Y t is approximately equal to the one maximizing I(X;Y ).This idea is called posterior matching and explored in detail in [SF11].1

Converse: we are left to show that Cfb ≤ Ci.Recall the key in proving weak converse for channel coding without feedback: Fano’s inequality

plus the graphical modelW → Xn → Y n → W . (23.2)

Then−h(ε) + ε logM ≤ I(W ; W ) ≤ I(Xn;Y n) ≤ nCi.

With feedback the probabilistic picture becomes more complicated as the following figure showsfor n = 3 (dependence introduced by the extra squiggly arrows):

W

X1

X2

X1

Y1

Y2

Y1

W

without feedback

W

X1

X2

X1

Y1

Y2

Y1

W

with feedback

1Note that the magic of Shannon’s theorem is that this optimal partitioning can also be done blindly, i.e. it ispossible to preselect partitions Pa in a way that is independent of πt (but dependent on t) and so that πt(Pa) ≈ P ∗X(a)with overwhelming probability and for all t ∈ [n].

248

Page 249: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

So, while the Markov chain realtion in (23.2) is still true, the input-output relation is no longermemoryless2

PY n|Xn(yn|xn) 6=n∏

j=1

PY |X(yj |xj) (!)

There is still a large degree of independence in the channel, though. Namely, we have

(Y t−1,W )→Xt → Yt, t = 1, . . . , n (23.3)

W → Y n → W (23.4)

Then

−h(ε) + ε logM ≤ I(W ; W ) (Fano)

≤ I(W ;Y n) (Data processing applied to (23.4))

=n∑

t=1

I(W ;Yt|Y t−1) (Chain rule)

≤n∑

t=1

I(W,Y t−1;Yt) (I(W ;Yt|Y t−1) = I(W,Y t−1;Yt)− I(Y t−1;Yt))

≤n∑

t=1

I(Xt;Yt) (Data processing applied to (23.3))

≤ nCt

The following result (without proof) suggests that feedback does not even improve the speed ofapproaching capacity either (under fixed-length block coding) and can at most improve smallishlog n terms:

Theorem 23.2 (Dispersion with feedback). For weakly input-symmetric DMC (e.g. additive noise,BSC, BEC) we have:

logM∗fb(n, ε) = nC −√nV Q−1(ε) +O(log n)

(The meaning of this is that for such channels feedback can at most improve smallish log nterms.)

23.2* Alternative proof of Theorem 23.1 and Massey’s directedinformation

The following alternative proof emphasizes on data processing inequality and the comparison idea(auxiliary channel) as in Theorem 19.1.

Proof. It is obvious that Cfb ≥ C, we are left to show that Cfb ≤ Ci.1. Recap of the steps of showing the strong converse of C ≤ Ci in the last lecture: take any

(n,M, ε) code, compare the two distributions:

P : W → Xn → Y n → W (23.5)

Q : W → Xn Y n → W (23.6)

two key observations:2This is easy to see from the example where X2 = Y1 and thus PY1|X2 has no randomness.

249

Page 250: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

a) Under Q, W ⊥⊥W , so that Q[W = W ] = 1M while P[W = W ] ≥ 1− ε.

b) The two graphical models give the factorization:

PW,Xn,Y n,W = PW,XnPY n|XnPW |Y n , QW,Xn,Y n,W = PW,XnPY nPW |Y n

thus D(P‖Q) = I(Xn;Y n) measures the information flow through the links Xn → Y n.

−h(ε) + ε logM = d(1− ε‖ 1

M)

dpi≤ D(P‖Q) = I(Xn;Y n)

mem−less,stat=

n∑

i=1

I(X;Y ) ≤ nCi

(23.7)

2. Notice that when feedback is present, Xn → Y n is not memoryless due to the transmissionprotocol, let’s unfold the probability space over time to see the dependence. As an example,the graphical model for n = 3 is given below:

If we define Q similarly as in the case without feedback, we will encounter a problem at thesecond last inequality in (23.7), as with feedback I(Xn;Y n) can be significantly larger than∑n

i=1 I(X;Y ). Consider the example where X2 = Y1, we have I(Xn;Y n) = +∞ independentof I(X;Y ).

We also make the observation that if Q is defined in (23.6), D(P‖Q) = I(Xn;Y n) measuresthe information flow through all the 6→ and links. This motivates us to find a proper Qsuch that D(P‖Q) only captures the information flow through all the 6→ links Xi → Yi : i =1, . . . , n, so that D(P‖Q) closely relates to nCi, while still guarantees that W ⊥⊥W , so thatQ[W 6= W ] = 1

M .

3. Formally, we shall restrict QW,Xn,Y n,W ∈ Q, where Q is the set of distributions that can befactorized as follows:

QW,Xn,Y n,W = QWQX1|WQY1QX2|W,Y1QY2|Y1

· · ·QXn|W,Y n−1QYn|Y n−1QW |Y n (23.8)

PW,Xn,Y n,W = PWPX1|WPY1|X1PX2|W,Y1

PY2|X2· · ·PXn|W,Y n−1PYn|XnPW |Y n (23.9)

250

Page 251: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Verify that W ⊥⊥W under Q: W and W are d-separated by Xn.

Notice that in the graphical models, when removing 9 we also added the directional linksbetween the Yi’s, these links serve to maximally preserve the dependence relationships betweenvariables when 9 are removed, so that Q is the “closest” to P while W ⊥⊥W is satisfied.

Now we have that for Q ∈ Q, d(1 − ε‖ 1M ) ≤ D(P‖Q), in order to obtain the least upper

bound, in Lemma 23.1 we shall show that:

infQ∈Q

D(PW,Xn,Y n,W ‖QW,Xn,Y n,W ) =n∑

t=1

I(Xt;Yt|Y t−1)

=n∑

t=1

EY t−1 [I(PXt|Y t−1 , PY |X)]

≤n∑

t=1

I(EY t−1 [PXt|Y t−1 ], PY |X) (concavity of I(·, PY |X))

=n∑

t=1

I(PXt , PY |X)

≤nCi.

Following the same procedure as in (a) we have

−h(ε) + ε logM ≤ nCi ⇒ logM ≤ nC + h(ε)

1− ε ⇒ Cfb,ε ≤C

1− ε ⇒ Cfb ≤ C.

4. Notice that the above proof is also valid even when cost constraint is present.

Lemma 23.1.

infQ∈Q

D(PW,Xn,Y n,W ‖QW,Xn,Y n,W ) =

n∑

t=1

I(Xt;Yt|Y t−1) (23.10)

(, ~I(Xn;Y n), directed information)

Proof. By chain rule, we can show that the minimizer Q ∈ Q must satisfy the following equalities:

QX,W = PX,W ,

QXt|W,Y t−1 = PXt|W,Y t−1 , (check!)

QW |Y n = PW |Y n .

and therefore

infQ∈Q

D(PW,Xn,Y n,W ‖QW,Xn,Y n,W )

= D(PY1|X1‖QY1 |X1) +D(PY2|X2,Y1

‖QY2|Y1|X2, Y1) + · · ·+D(PYn|Xn,Y n−1‖QYn|Y n−1 |Xn, Y

n−1)

= I(X1;Y1) + I(X2;Y2|Y1) + · · ·+ I(Xn;Yn|Y n−1)

251

Page 252: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

23.3 When is feedback really useful?

Theorems 23.1 and 23.2 state that feedback does not improve communication rate neither asymptot-ically nor for moderate blocklengths. In this section, we shall examine three cases where feedbackturns out to be very useful.

23.3.1 Code with very small (e.g. zero) error probability

Theorem 23.3 (Shannon ’56). For any DMC PY |X ,

Cfb,0 = maxPX

miny∈B

log1

PX(Sy)(23.11)

where

Sy = x ∈ A : PY |X(y|x) > 0

denotes the set of input symbols that can lead to the output symbol y.

Note: For stationary memoryless channel,

C0

def.≤ Cfb,0

def.≤ Cfb = lim

ε→0Cfb,ε

Thm 23.1= C = lim

ε→0Cε

Shannon= Ci = sup

PX

I(X;Y )

All capacity quantities above are defined with (fixed-length) block codes.

Note:

1. In DMC for both zero-error capacities (C0 and Cfb,0) only the support of the transitionmatrix PY |X , i.e., whether PY |X(b|a) > 0 or not, matters. The value of PY |X(b|a) > 0is irrelevant. That is, C0 and Cfb,0 are functions of a bipartite graph between input andoutput alphabets. Furthermore, the C0 (but not Cfb,0!) is a function of the confusabilitygraph – a simple undirected graph on A with a 6= a′ connected by an edge iff ∃b ∈ Bs.t. PY |X(b|a)PY |X(b|a′) > 0.

2. That Cfb,0 is not a function of the confusability graph alone is easily seen from comparing thepolygon channel (next remark) with L = 3 (for which Cfb,0 = log 3

2) and the useless channelwith A = 1, 2, 3 and B = 1 (for which Cfb,0 = 0). Clearly in both cases confusabilitygraph is the same – a triangle.

3. Usually C0 is very hard to compute, but Cfb,0 can be obtained in closed form as in (23.11).

Example: (Polygon channel)

2

34

51

Bipartite graph Confusability graph

252

Page 253: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

• Zero-error capacity C0:

– L = 3: C0 = 0

– L = 5: C0 = 12 log 5 (Shannon ’56-Lovasz ’79).

Achievability:

a) blocklength one: 1, 3, rate = 1 bit.

b) blocklength two: (1, 1), (2, 3), (3, 5), (4, 2), (5, 4), rate = 12 log 5 bit – optimal!

– L = 7: 3/5 log 7 ≤ C0 ≤ log 3.32 (Exact value unknown to this day)

– Even L = 2k: C0 = log L2 for all k (Why? Homework.).

– Odd L = 2k + 1: C0 = log L2 + o(1) as k →∞ (Bohman ’03)

• Zero-error capacity with feedback (proof: exercise!)

Cfb,0 = logL

2, ∀L,

which can be strictly bigger than C0.

4. Notice that Cfb,0 is not necessarily equal to Cfb = limε→0Cfb,ε = C. Here is an example when

C0 < Cfb,0 < Cfb = C

Example:

Then

C0 = log 2

Cfb,0 = maxδ− log max(

2

3δ, 1− δ) (P ∗X = (δ/3, δ/3, δ/3, δ))

= log5

2> C0 (δ∗ =

3

5)

On the other hand, Shannon capacity C = Cfb can be made arbitrarily close to log 4 bypicking the cross-over probability arbitrarily close to zero, while the confusability graph staysthe same.

Proof of Theorem 23.3. 1. Fix any (n,M, 0)-code. Denote the confusability set of all possiblemessages that could have produced the received signal yt = (y1, . . . , yt) for all t = 0, 1, . . . , nby:

Et(yt) , m ∈ [M ] : f1(m) ∈ Sy1 , f2(m, y1) ∈ Sy2 , . . . , fn(m, yt−1) ∈ Syt

Notice that zero-error means no ambiguity:

ε = 0⇔ ∀yn ∈ Bn, |En(yn)| = 1 or 0. (23.12)

253

Page 254: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

2. The key quantities in the proof are defined as follows:

θfb = minPX

maxy∈B

PX(Sy),

P ∗X = argminPX

maxy∈B

PX(Sy)

The goal is to show

Cfb,0 = log1

θfb.

By definition, we have

∀PX , ∃y ∈ B, such that PX(Sy) ≥ θfb (23.13)

Notice the minimizer distribution P ∗X is usually not the caid in the usual sense. This definitionalso sheds light on how the encoding and decoding should be proceeded and serves to lowerbound the uncertainty reduction at each stage of the decoding scheme.

3. “≤” (converse): Let PXn be he joint distribution of the codewords. Denote E0 = [M ] – originalmessage set.

t = 1: For PX1 , by (23.13), ∃y∗1 such that:

PX1(Sy∗1 ) =|m : f1(m) ∈ Sy∗1||m ∈ [M ]| =

|E1(y∗1)||E0|

≥ θfb.

t = 2: For PX2|X1∈Sy∗1, by (23.13), ∃y∗2 such that:

PX2(Sy∗2 |X1 ∈ Sy∗1 ) =|m : f1(m) ∈ Sy∗1 , f2(m, y∗1) ∈ Sy∗2|

|m : f1(m) ∈ Sy∗1|=|E2(y∗1, y

∗2)|

|E1(y∗1)| ≥ θfb,

t = n: Continue the selection process up to y∗n which satisfies that:

PXn(Sy∗n |Xt ∈ Sy∗t for t = 1, . . . , n− 1) =|En(y∗1, . . . , y

∗n)|

|En−1(y∗1, . . . , y∗n−1)| ≥ θfb.

Finally, by (23.12) and the above selection procedure, we have

1

M≥ |En(y∗1, . . . , y

∗n)|

|E0|≥ θnfb

⇒ M ≤ −n log θfb

⇒ Cfb,0 ≤ − log θfb

4. “≥” (achievability)

Let’s construct a code that achieves (M,n, 0).

254

Page 255: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

The above example with |A| = 3 illustrates that the encoder f1 partitions the space of allmessages to 3 groups. The encoder f1 at the first stage encodes the groups of messages intoa1, a2, a3 correspondingly. When channel outputs y1 and assume that Sy1 = a1, a2, then thedecoder can eliminate a total number of MP ∗X(a3) candidate messages in this round. The“confusability set” only contains the remaining MP ∗X(Sy1) messages. By definition of P ∗X weknow that MP ∗X(Sy1) ≤ Mθfb. In the second round, f2 partitions the remaining messagesinto three groups, send the group index and repeat.

By similar arguments, each interaction reduces the uncertainty by a factor of at least θfb.After n iterations, the size of “confusability set” is upper bounded by Mθnfb, if Mθnfb ≤ 1,3

then zero error probability is achieved. This is guaranteed by choosing logM = −n log θfb.Therefore we have shown that −n log θfb bits can be reliably delivered with n+O(1) channeluses with feedback, thus

Cfb,0 ≥ − log θfb

23.3.2 Code with variable length

Consider the example of BEC(δ) with feedback, send k bits in the following way: repeat sendingeach bit until it gets through the channel correctly. The expected number of channel uses for sendingk bits is given by

l = E[n] =k

1− δWe state the result for variable-length feedback (VLF) code without proof:

logM∗V LF (l, 0) ≥ lC

Notice that compared to the scheme without feedback, there is the improvement of√nV Q−1(ε) in

the order of O(√n), which is stronger than the result in Theorem 23.2.

This is also true in general [PPV10b]:

logM∗V LF (l, ε) =lC

1− ε +O(log l)

Example: For BSC(0.11), without feedback, n = 3000 is needed to achieve 90% of capacity C,while with VLF code l = En = 200 is enough to achieve that.

23.3.3 Code with variable power

Elias’ scheme To send a number A drawn from a Gaussian distribution N (0,VarA), Elias’scheme uses linear processing.

AWGN setup:

Yk = Xk + Zk, Zk ∼ N (0, σ2) i.i.d.

E[X2k ] ≤ P, power constraint in expectation

3Some rounding-off errors need to be corrected in a few final steps (because P ∗X may not be closely approximablewhen very few messages are remaining). This does not change the asymptotics though.

255

Page 256: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Note: If we insist the codeword satisfies power constraint almost surely instead on average, i.e.,∑nk=1X

2k ≤ nP a.s., then the scheme below does not work!

According to the orthogonality principle of the mininum mean-square estimation (MMSE) of Aat receiver side in every step:

A = An +Nn, Nn ⊥ Y n.

Morever, since all operations are lienar and everything is jointly Gaussian, Nn ⊥⊥ Y n. SinceXn ∝ Nn−1 ⊥⊥ Y n−1, the codeword we are sending at each time slot is independent of the history ofthe channel output (”innovation”), in order to maximize information transfer.

Note that Y n → An → A, and the optimal estimator An (a linear combination of Y n) is asufficient statistic of Y n for A under Gaussianity. Then

I(A;Y n) =I(A; An, Yn)

= I(A; An) + I(A;Y n|An)

= I(A; An)

=1

2log

Var(A)

Var(Nn).

where the last equality uses the fact that N follows a normal distribution. Var(Nn) can be computeddirectly using standard linear MMSE results. Instead, we determine it information theoretically:Notice that we also have

I(A;Y n) = I(A;Y1) + I(A;Y2|Y1) + · · ·+ I(A;Yn|Y n−1)

= I(X1;Y1) + I(X2;Y2|Y1) + · · ·+ I(Xn;Yn|Y n−1)

key= I(X1;Y1) + I(X2;Y2) + · · ·+ I(Xn;Yn)

= n1

2log(1 + P ) = nC

256

Page 257: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Therefore, with Elias’ scheme of sending A ∼ N (0,VarA), after the n-th use of the AWGN(P )channel with feedback,

VarNn = Var(An −A) = 2−2nC VarA =( P

P + σ2

)nVarA,

which says that the reduction of uncertainty in the estimation is exponential fast in n.

Schalkwijk-Kailath Elias’ scheme can also be used to send digital data. Let W ∼ uniform onM -PAM constellation in ∈ [−1, 1], i.e., −1,−1 + 2

M , · · · ,−1 + 2kM , · · · , 1. In the very first step W

is sent (after scaling to satisfy the power constraint):

X0 =√PW, Y0 = X0 + Z0

Since Y0 and X0 are both known at the encoder, it can compute Z0. Hence, to describe W it issufficient for the encoder to describe the noise realization Z0. This is done by employing the Elias’scheme (n− 1 times). After n− 1 channel uses, and the MSE estimation, the equivalent channeloutput:

Y0 = X0 + Z0, Var(Z0) = 2−2(n−1)C

Finally, the decoder quantizes Y0 to the nearest PAM point. Notice that

ε ≤ P[|Z0| >

1

2M

]= P

[2−(n−1)C |Z| >

√P

2M

]= 2Q

(2(n−1)C

√P

2M

)

⇒ logM ≥ (n− 1)C + log

√P

2− logQ−1(

ε

2)

= nC +O(1).

Hence if the rate is strictly less than capacity, the error probability decays doubly exponentiallyfast as n increases. More importantly, we gained an

√n term in terms of logM , since for the case

without feedback we have

logM∗(n, ε) = nC −√nV Q−1(ε) +O(log n) .

Example: P = 1 ⇒ channel capacity C = 0.5 bit per channel use. To achieve error probability

10−3, 2Q(

2(n−1)C

2M

)≈ 10−3, so e(n−1)C

2M ≈ 3, and logMn ≈ n−1

n C − log 8n . Notice that the capacity is

achieved to within 99% in as few as n = 50 channel uses, whereas the best possible block codeswithout feedback require n ≈ 2800 to achieve 90% of capacity.

Take-away message:Feedback is best harnessed with adaptive strategies. Although it does not increase capacity

under block coding, feedback greatly boosts reliability as well as reduces coding complexity.

257

Page 258: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

§ 24. Capacity-achieving codes via Forney concatenation

Shannon’s Noisy Channel Theorem assures us the existence of capacity-achieving codes. However,exhaustive search for the code has double-exponential complexity: Search over all codebook of size2nR over all possible |X |n codewords.

Plan for today: Constructive version of Shannon’s Noisy Channel Theorem. The goal is to showthat for BSC, it is possible to achieve capacity in polynomial time. Note that we need to considerthree aspects of complexity

• Encoding

• Decoding

• Construction of the codes

24.1 Error exponents

Recall we have defined the fundamental limit

M∗(n, ε) = maxM : ∃(n,M, ε)-code

For notational convenience, let us define its functional inverse

ε∗(n,M) = infε : ∃(n,M, ε)-code

Shannon’s theorem shows that for stationary memoryless channels, εn , ε∗(n, exp(nR))→ 0 for anyR < C = supX I(X;Y ). Now we want to know how fast it goes to zero as n → ∞. It turns outthe speed is exponential, i.e., εn ≈ exp(−nE(R)) for some error exponent E(R) as a function R,which is also known as the reliability function of the channel. Determining E(R) is one of the mostlong-standing open problems in information theory. What we know are

• Lower bound on E(R) (achievability): Gallager’s random coding bound (which analyzes theML decoder, instead of the suboptimal decoder as in Shannon’s random coding bound or DTbound).

• Upper bound on E(R) (converse): Sphere-packing bound (Shannon-Gallager-Berlekamp), etc.

It turns out there exists a number Rcrit ∈ (0, C), called the critical rate, such that the lower andupper bounds meet for all R ∈ (Rcrit, C), where we obtain the value of E(R). For R ∈ (0, Rcrit), wedo not even know the existence of the exponent!

Deriving these bounds is outside the scope of this lecture. Instead, we only need the positivityof error exponent, i.e., for any R < C, E(R) > 0. On the other hand, it is easy to see thatE(C−) = 0 as a consequence of weak converse. Since as the rate approaches capacity from below,the communication becomes less reliable. The next theorem is a simple application of large deviation.

258

Page 259: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Theorem 24.1. For any DMC, for any R < C = supX I(X;Y ),

ε∗(n, exp(nR)) ≤ exp(−nE(R)), for some E(R) > 0.

Proof. Fix R < C so that C − R > 0. Let P ∗X be the capacity-achieving input distribution, i.e.,C = I(X∗;Y ∗). Recall Shannon’s random coding bound (DT/Feinstein work as well):

ε ≤ P (i(X;Y ) ≤ logM + τ) + exp(−τ).

As usual, we apply this bound with iid PXn = (P ∗X)n, logM = nR and τ = n(C−R)2 , to conclude the

achievability of

εn ≤ P(

1

ni(Xn;Y n) ≤ C +R

2

)+ exp

(−n(C −R)

2

).

Since i(Xn;Y n) =∑i(Xk;Yk) is an iid sum, and Ei(X;Y ) = C > (C + R)/2, the first term is

upper bounded by exp(−nψ∗T (R+C2 )) where T = i(X;Y ). The proof is complete since εn is smaller

than the sum of two exponentially small terms.

Note: Better bound can be obtained using DT bound. But to get the best lower bound on E(R)we know (Gallager’s random coding bound), we have to analyze the ML decoder.

24.2 Achieving polynomially small error probability

In the sequel we focus on BSC channel with cross-over probability δ, which is an additive-noise DMC.Fix R < C = 1 − h(δ) bits. Let the block length be n. Our goal is to achieve error probabilityεn ≤ n−α for arbitrarily large α > 0 in polynomial time.

To this end, fix some b > 1 to be specified later and pick m = b log n and divide the blockinto n

m sub-blocks of m bits. Applying Theorem 24.1, we can find [later on how to find] an(m, exp(Rm), εm)-code such that

εm ≤ exp(−mE(R)) = n−bE(R)

where E(R) > 0. Apply this code to each m-bit sub-block and apply ML decoding to each block.The encoding/decoding complexity is at most n

m exp(O(m)) = nO(1). To analyze the probability oferror, use union bound:

Pe ≤n

mεm ≤ n−bE(R)+1 ≤ n−α,

if we choose b ≥ α+1E(R) .

Remark 24.1. The final question boils down to how to find the shorter code of blocklength m inpoly(n)-time. This will be done if we can show that we can find good code (satisfying the Shannonrandom coding bound) for BSC of blocklenth m in exponential time. To this end, let us go throughthe following strategies:

1. Exhaustive search: A codebook is a subset of cardinality 2Rm out of 2m possible codewords.Total number of codebooks:

(2m

2Rm

)= exp(Ω(m2Rm)) = exp(Ω(nc log n)). The search space is

too big.

2. Linear codes: In Lecture 18 we have shown that for additive-noise channels on finite fields wecan focus on linear codes. For BSC, each linear code is parameterized by a generator matrix,with Rm2 entries. Then there are a total of 2Rm

2= nΩ(logn) – still superpolynomial and we

cannot afford the search over all linear codes.

259

Page 260: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

3. Toeplitz generator matrices: In homework we see that it does not lose generality to focus onlinear codes with Toeplitz generator matrices, i.e., G such that Gij = Gi−1,j−1 for all i, j > 1.Toeplitz matrices are determined by diagonals. So there are at most 22m = nO(1) and we canfind the optimal one by exhaustive search in poly(n)-time.

Since the channel is additive-noise, linear codes + syndrome decoder leads to the same maximalprobability of error as average (Lecture 18).

24.3 Concatenated codes

Forney introduced the idea of concatenated codes in 1965 to build longer codes from shorter codeswith manageable complexity. It consists of an inner code and an outer code:

1. Cin : 0, 1k → 0, 1n, with rate kn

2. Cout : BK → BN for some alphabet B of cardinality 2k, with rate KN .

The concatenated code C : 0, 1kK → 0, 1nN works as follows (Fig. 24.1):

1. Collect the kK message bits into K symbols in the alphabet B, apply Cout componentwise toget a vector in BN

2. Map each symbol in B into k bits and apply Cin componentwise to get an nN -bit codeword.

The rate of the concatenated code is the product of the rates of the inner and outer codes: R = knKN .

Cout Cin

Cin

Cin

Cin

Cin

kK kDin

Din

Din

Din

Din

Doutn k kK

Figure 24.1: Concatenated code, where there are N inner encoder-decoder pairs.

24.4 Achieving exponentially small error probability

Forney proposed the following idea:

• Use an optimal code as the inner code

• Use a Reed-Solomon code as the outer code which can correct a constant fraction of errors.

260

Page 261: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Reed-Solomon (RS) codes are linear codes from FKq → FNq where the block length N = q− 1and the message length is K. Similar to the Reed-Muller code, the RS code treats the input(a0, a1, . . . , aK−1) as a polynomial p(x) =

∑K−1i=0 aiz

i over Fq of degree at most K−1, and encodes itby its values at all non-zero elements. Therefore the RS codeword is a vector (p(α) : α ∈ Fq\0) ∈FNq . Therefore the generator matrix of RS code is a Vandermonde matrix.

The RS code has the following advantages:

1. The minimum distance of RS code N −K + 1. So if we choose K = (1− ε)N , then RS codecan correct εN

2 errors.

2. The encoding and decoding (e.g., Berlekamp-Massey decoding algorithm) can be implementedin poly(N) time.

In fact, as we will see later, any efficient code which can correct a constant fraction of errors willsuffice as the outer code for our purpose.

Now we show that we can achieve any rate below capacity and exponentially small probabilityof error in polynomial time: Fix η, ε > 0 arbitrary.

• Inner code: Let k = (1− h(δ)− η)n. By Theorem 24.1, there exists a Cin : 0, 1k → 0, 1n,which is a linear (n, 2k, εn)-code and maximal error probability εn ≤ 2−nE(η). By Remark 24.1,Cin can be chosen to be a linear code with Toeplitz generator matrix, which can be found in2n time. The inner decoder is ML, which we can afford since n is small.

• Outer code: We pick the RS code with field size q = 2k with blocklength N = 2k − 1. Pickthe number of message bits to be K = (1− ε)N . Then we have Cout : FK

2k→ FN

2k.

Then we obtain a concatenated code C : 0, 1kK → 0, 1nN with blocklength L = nN = n2Cn forsome constant C and rate R = (1− ε)(1− h(δ)− η). It is clear that the code can be constructed in2n = poly(L) time and all encoding/decoding operations are poly(L) time.

Now we analyze the probability of error: Let us conditioned on the message bits (input to Cout).Since the outer code can correct εN

2 errors, an error happens only if the number of erroneous innerencoder-decoder pairs exceeds εN

2 . Since the channel is memoryless, each of the N pairs makes anerror independently1 with probability at most εn. Therefore the number of errors is stochasticallysmaller than Bin(N, εn), and we can upper bound the total probability of error using Chernoffbound:

Pe ≤ P[Bin(N, εn) ≥ εN

2

]≤ exp (−Nd(ε/2‖εn)) = exp (−Ω(N logN)) = exp(−Ω(L)).

where we have used ε = Ω(1) and hence εn ≤ exp(−Ω(n)) and d(ε/2‖εn) ≥ ε2 log ε

2εn= Ω(n) =

Ω(logN).

Note: For more details see the excellent exposition by Spielman [Spi97]. For modern constructionsusing sparse graph codes which achieve the same goal in linear time, see, e.g., [Spi96].

1Here controlling the maximal error probability of inner code is the key. If we only have average error probability,then given a uniform distributed input to the RS code, the output symbols (which are the inputs to the inner encoders)need not be independent, and Chernoff bound is not necessarily applicable.

261

Page 262: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Part V

Lossy data compression

262

Page 263: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

§ 25. Rate-distortion theory

Big picture so far:

1. Lossless data compression: Given a discrete ergodic source Sk, we know how to encode topure bits W ∈ [2k].

2. Binary HT: Given two distribution P and Q, we know how to distinguish them optimally.

3. Channel coding: How to send bits over a channel [2k] 3W → X → Y .

4. JSCC: how to send discrete data optimally over a noisy channel.

Next topic, lossy data compression: Given X, find a k-bit representation W , X → W → X,

such that X is a good reconstruction of X.Real-world examples: codecs consist of a compressor and a decompressor

• Image: JPEG...

• Audio: MP3, CD...

• Video: MPEG...

25.1 Scalar quantization

Problem: Data isn’t discrete! Often, a signal (function) comes from voltage levels or othercontinuous quantities. The question of how to map (naturally occurring) continuous time/analogsignals into (electronics friendly) discrete/digital signals is known as quantization, or in informationtheory, as rate distortion theory.

Domain

Continuoustime

Discretetime

Signal

Range

Analog

Digital

QuantizationSampling

We will look at several ways to do quantization in the next few sections.

263

Page 264: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

25.1.1 Scalar Uniform Quantization

The idea of qunatizing an inherently continuous-valued signal was most explicitly expounded in thepatenting of Pulse-Coded Modulation (PCM) by A. Reeves, cf. [Ree65] for some interesting historicalnotes. His argument was that unlike AM and FM modulation, quantized (digital) signals could besent over long routes without the detrimental accumulation of noise. Some initial theoretical analysisof the PCM was undertaken in 1947 by Oliver, Pierce, and Shannon (same Shannon), cf. [OPS48].

For a random variable X ∈ [−A/2, A/2] ⊂ R, the scalar uniform quantizer qU (X) with Nquantization points partitions the interval [−A/2, A/2] uniformly

N equally spaced points

−A2

A2

where the points are in −A2 + kAN , k = 0, . . . , N − 1.

What is the quality (or fidelity) of this quantization? Most of the time, mean squared error isused as the quality criterion:

D(N) = E|X − qU (X)|2

where D denotes the average distortion. Often R = log2N is used instead of N , so that we thinkabout the number of bits we can use for quantization instead of the number of points. To analyzethis scalar uniform quantizer, we’ll look at the high-rate regime (R 1). The key idea in the highrate regime is that (assuming a smooth density PX), each quantization interval ∆j looks nearly flat,so conditioned on ∆j , the distribution is accurately approximately by a uniform distribution.

Nearly flat forlarge partition

∆j

Let cj be the j-th quantization point, and ∆j be the j-th quantization interval. Here we have

DU (R) = E|X − qU (X)|2 =

N∑

j=1

E[|X − cj |2|X ∈ ∆j ]P[X ∈ ∆j ] (25.1)

(high rate approximation) ≈N∑

j=1

|∆j |212

P[X ∈ ∆j ] (25.2)

=(AN )2

12=A2

122−2R, (25.3)

where we used the fact that the variance of Uniform[−a, a] = a2/3.How much do we gain per bit?

10 log10 SNR = 10 log10

Var(X)

E|X − qU (X)|2

= 10 log10

12V ar(X)

A2+ (20 log10 2)R

= constant + (6.02dB)R

264

Page 265: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

For example, when X is uniform on [−A2 ,

A2 ], the constant is 0. Every engineer knows the rule

of thumb “6dB per bit”; adding one more quantization bit gets you 6 dB improvement in SNR.However, here we can see that this rule of thumb is valid only in the high rate regime. (Consequently,widely articulated claims such as “16-bit PCM (CD-quality) provides 96 dB of SNR” should betaken with a grain of salt.)Note: The above deals with X with a bounded support. When X is unbounded, a wise thing to dois to allocate the quantization points to the range of values that are more likely and saturate thelarge values at the dynamic range of the quantizer. Then there are two contributions, known asthe granular distortion and overload distortion. This leads us to the question: Perhaps uniformquantization is not optimal?

25.1.2 Scalar Non-uniform Quantization

Since our source has density pX , a good idea might be to use more quantization points where pX islarger, and less where pX is smaller.

Often the way such quantizers are implemented is to take a monotone transformation of the sourcef(X), perform uniform quantization, then take the inverse function:

X U

X qU (U)

f

q qU

f−1

(25.4)

i.e., q(X) = f−1(qU (f(X))). The function f is usually called the compander (compressor+expander).One of the choice of f is the CDF of X, which maps X into uniform on [0, 1]. In fact, this companderarchitecture is optimal in the high-rate regime (fine quantization) but the optimal f is not the CDF(!). We defer this discussion till Section 25.1.4.

In terms of practical considerations, for example, the human ear can detect sounds with volumeas small as 0 dB, and a painful, ear-damaging sound occurs around 140 dB. Achieveing this ispossible because the human ear inherently uses logarithmic companding function. Furthermore,many natural signals (such as differences of consecutive samples in speech or music (but not samplesthemselves!)) have an approximately Laplace distribution. Due to these two factors, a very popularand sensible choice for f is the µ-companding function

f(X) = sign(X) ln(1+µ|X|)ln(1+µ)

265

Page 266: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

which compresses the dynamic range, uses more bits for smaller |X|’s, e.g. |X|’s in the range ofhuman hearing, and less quantization bits outside this region. This results in the so-called µ-lawwhich is used in the digital telecommunication systems in the US, while in Europe a slightly differentcompander called the A-law is used.

25.1.3 Optimal Scalar Quantizers

Now we look for the optimal scalar quantizer given R bits for reconstruction. Formally, this is

Dscalar(R) = minq:|Im q|≤2R

E|X − q(X)|2 (25.5)

Intuitively, we would think that the optimal quantization regions should be contiguous; otherwise,given a point cj , our reconstruction error will be larger. Therefore in one dimension quantizers arepiecewise constant:

q(x) = cj1Tj≤x≤Tj+1

for some cj ∈ [Tj , Tj+1].Simple example: One-bit quantization of X ∼ N (0, σ2). Then optimal quantization points are

c1 = E[X|X ≥ 0] =√

2πσ, c2 = E[X|X ≤ 0] = −

√2πσ.

With ideas like this, in 1982 Stuart Lloyd developed an algorithm (called Lloyd’s algorithm)for iteratively finding optimal quantization regions and points. This works for both the scalar andvector cases, and goes as follows:

1. Pick any N = 2k points

2. Draw the Voronoi regions around the chosen quantization points (aka minimum distancetessellation, or set of points closest to cj), which forms a partition of the space.

3. Update the quantization points by the centroids (E[X|X ∈ D]) of each Voronoi region.

4. Repeat.

b

b

bb

bb

b

bb

b

Steps of Lloyd’s algorithm

Lloyd’s clever observation is that the centroid of each Voronoi region is (in general) different thanthe original quantization points. Therefore, iterating through this procedure gives the CentroidalVoronoi Tessellation (CVT - which are very beautiful objects in their own right), which can beviewed as the fixed point of this iterative mapping. The following theorem gives the results aboutLloyd’s algorithm

Theorem 25.1 (Lloyd).

1. Lloyd’s algorithm always converges to a Centroidal Voronoi Tessellation.

266

Page 267: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

2. The optimal quantization strategy is always a CVT.

3. CVT’s need not be unique, and the algorithm may converge to non-global optima.

Remark 25.1. The third point tells us that Lloyd’s algorithm isn’t always guaranteed to give theoptimal quantization strategy.1 One sufficient condition for uniqueness of a CVT is the log-concavityof the density of X [Fleischer ’64]. Thus, for Gaussian PX , Lloyd’s algorithm outputs the optimalquantizer, but even for Gaussian, if N > 3, optimal quantization points are not known in closedform! So it’s hard to say too much about optimal quantizers. Because of this, we next look for anapproximation in the regime of huge number of points.

Remark 25.2 (k-means). A popular clustering method called k-means is the following: Given ndata points x1, . . . , xn ∈ Rd, the goal is to find k centers µ1, . . . , µk ∈ Rd to minimize the objectivefunction

n∑

i=1

minj∈[k]‖xi − µj‖2.

This is equivalent to solving the optimal vector quantization problem analogous to (25.5):

minq:|Im(q)|≤k

E‖X − q(X)‖2

where X is distributed according to the empirical distribution over the dataset, namely, 1n

∑ni=1 δxi .

Solving the k-means problem is NP-hard in the worst case, and Lloyd’s algorithm is commonly usedheuristic.

25.1.4 Fine quantization

[Panter-Dite ’51] Now we look at the high SNR approximation. For this, introduce a probabilitydensity function λ(x), which represents the density of our quantization points and allows us toapproximate summations by integrals2. Then the number of quantization points in any interval[a, b] is ≈ N

∫ ba λ(x)dx. For any point x, denote its distance to the closest quantization point by

∆(x). Then

N

∫ x+∆(x)

xλ(t)dt ≈ Nλ(x)∆(x) ≈ 1 =⇒ ∆(x) ≈ 1

Nλ(x).

With this approximation, the quality of reconstruction is

E|X − q(X)|2 =N∑

j=1

E[|X − cj |2|X ∈ ∆j ]P[X ∈ ∆j ]

≈N∑

j=1

P[X ∈ ∆j ]|∆j |2

12≈∫p(x)

∆2(x)

12dx

=1

12N2

∫p(x)λ−2(x)dx

1As a simple example one may consider PX = 13φ(x− 1) + 1

3φ(x) + 1

3φ(x+ 1) where φ(·) is a very narrow pdf,

symmetric around 0. Here the CVT with centers ± 23

is not optimal among binary quantizers (just compare to anyquantizer that quantizes two adjacent spikes to same value).

2This argument is easy to make rigorous. We only need to define reconstruction points cj as solutions of∫ cj

−∞λ(x) dx =

j

N.

267

Page 268: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

To find the optimal density λ that gives the best reconstruction (minimum MSE) when X hasdensity p, we use Holder’s inequality:

∫p1/3 ≤ (

∫pλ−2)1/3(

∫λ)2/3. Therefore

∫pλ−2 ≥ (

∫p1/3)3,

with equality iff pλ−2 ∝ λ. Hence the optimizer is λ∗(x) = p1/3(x)∫p1/3dx

. Therefore when N = 2R,3

Dscalar(R) ≈ 1

122−2R

(∫p1/3(x)dx

)3

So our optimal quantizer density in the high rate regime is proportional to the cubic root of thedensity of our source. This approximation is called the Panter-Dite approximation. For example,

• When X ∈ [−A2 ,

A2 ], using Holder’s inequality again 〈1, p1/3〉 ≤ ‖1‖ 3

2‖p1/3‖3 = A2/3, we have

Dscalar(R) ≤ 1

122−2RA2 = DU (R)

where the RHS is the uniform quantization error given in (25.1). Therefore as long as thesource distribution is not uniform, there is strict improvement. For uniform distribution,uniform quantization is, unsurprisingly, optimal.

• When X ∼ N (0, σ2), this gives

Dscalar(R) ≈ σ22−2Rπ√

3

2(25.6)

Note: In fact, in scalar case the optimal non-uniform quantizer can be realized using the companderarchitecture (25.4) that we discussed in Section 25.1.2: As an exercise, use Taylor expansion toanalyze the quantization error of (25.4) when N →∞. The optimal compander f : R→ [0, 1] turns

out to be f(x) =∫ t−∞ p1/3(t)dt∫∞−∞ p1/3(t)dt

[Bennett ’48, Smith ’57].

25.1.5 Fine quantization and variable rate

So far we have been focusing on quantization with restriction on the cardinality of the image ofq(·). If one, however, intends to further compress the values q(X) via noiseless compressor, a morenatural constraint is to bound H(q(X)).

Koshelev [Kos63] discovered in 1963 that in the high rate regime uniform quantization isasymptotically optimal under the entropy constraint. Indeed, if q∆ is a uniform quantizer with cellsize ∆, then it is easy to see that

H(q∆(X)) = h(X)− log ∆ + o(1) , (25.7)

where h(X) = −∫pX(x) log pX(x) dx is the differential entropy of X. So a uniform quantizer with

H(q(X)) = R achieves

D =∆2

12≈ 2−2R 22h(X)

12.

On the other hand, any quantizer with unnormalized point density function Λ(x) (i.e. smoothfunction such that

∫ cj−∞ Λ(x)dx = j) can be shown to achieve (assuming Λ→∞ pointwise)

D ≈ 1

12

∫pX(x)

1

Λ2(x)dx (25.8)

H(q(X)) ≈∫pX(x) log

Λ(x)

pX(x)dx (25.9)

3In fact when R→∞, “≈” can be replaced by “= 1 + o(1)” [Zador ’56].

268

Page 269: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Now, from Jensen’s inequality we have

1

12

∫pX(x)

1

Λ2(x)dx ≥ 1

12exp−2

∫pX(x) log Λ(x) dx ≈ 2−2H(q(X)) 22h(X)

12,

concluding that uniform quantizer is asymptotically optimal.Furthermore, it turns out that for any source, even the optimal vector quantizers (to be considered

next) can not achieve distortion better that 2−2R 22h(X)

2πe – i.e. the maximal improvement they cangain (on any iid source!) is 1.53 dB (or 0.255 bit/sample). This is one reason why scalar uniformquantizers followed by lossless compression is an overwhelmingly popular solution in practice.

25.2 Information-theoretic vector quantization

Consider Gaussian distribution N (0, σ2). By doing optimal vector quantization in high dimensions(namely, compressing (X1, . . . , Xn) to 2nR points), rate-distortion theory will tell us that when n islarge, we can achieve the per-coordinate MSE:

Dvec(R) = σ22−2R

which, compared to (25.6), saves 4.35 dB (or 0.72 bit/sample). This should be rather surprising, solet’s reiterate: even when X1, . . . , Xn are iid, we can get better performance by quantizing the Xi’sjointly. One instance of this surprising effect is the following:

Hamming Game Given 100 unbiased bits, we want to look at them and scribble somethingdown on a piece of paper that can store 50 bits at most. Later we will be asked to guess theoriginal 100 bits, with the goal of maximizing the number of correctly guessed bits. What is the beststrategy? Intuitively, it seems the optimal strategy would be to store half of the bits then randomlyguess on the rest, which gives 25% BER. However, as we will show in the next few lectures, theoptimal strategy amazingly achieves a BER of 11%. How is this possible? After all we are guessingindependent bits and the utility function (BER) treats all bits equally. Some intuitive explanation:

1. Applying scalar quantization componentwise results in quantization region that are hypercubes,which might not be efficient for covering.

2. Concentration of measure removes many source realizations that are highly unlikely. For example, if we think about quantizing a single Gaussian X, then we need to cover a large portion of R in order to cover the cases of significant deviations of X from 0. However, when we are quantizing many (X_1, . . . , X_n) together, the law of large numbers makes sure that many X_j's cannot conspire together and all produce large values. Indeed, (X_1, . . . , X_n) concentrates near a sphere (see the quick numerical check below). Thus, we may exclude large portions of the space R^n from consideration.
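Quick numerical check of the concentration claim in item 2 (a sketch assuming numpy; the sample sizes are arbitrary): the normalized radius ‖X^n‖/√n of an iid N(0,1) vector has mean ≈ 1 and fluctuations shrinking like 1/√n.

import numpy as np

rng = np.random.default_rng(0)
for n in (10, 100, 1000, 4000):
    X = rng.normal(size=(500, n))                  # 500 iid standard Gaussian vectors in R^n
    r = np.linalg.norm(X, axis=1) / np.sqrt(n)     # normalized radius
    print(n, round(r.mean(), 4), round(r.std(), 4))  # mean -> 1, spread -> 0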

Math Formalism A lossy compressor is an encoder/decoder pair (f, g) where

X →(f) W →(g) X̂

• X ∈ 𝒳 : continuous source

• W : discrete data

• X̂ ∈ 𝒳̂ : reconstruction


A distortion metric is a function d : 𝒳 × 𝒳̂ → R ∪ {+∞} (loss function). There are various formulations of the lossy compression problem:

1. Fixed length (fixed rate), average distortion: W ∈ [M], minimize E[d(X, X̂)].

2. Fixed length, excess distortion: W ∈ [M], minimize P[d(X, X̂) > D].

3. Variable length, max distortion: W ∈ {0, 1}*, d(X, X̂) ≤ D a.s., minimize E[length(W)] or H(X̂) = H(W).

Note: In this course we focus on lossy compression with fixed length and average distortion. The difference between average distortion and excess distortion is analogous to the difference between an average risk bound and a high-probability bound in statistics/machine learning.

Definition 25.1. The rate-distortion problem is characterized by a pair of alphabets 𝒜, 𝒜̂, a single-letter distortion function d(·, ·) : 𝒜 × 𝒜̂ → R ∪ {+∞} and a source – a sequence of 𝒜-valued r.v.'s (S_1, S_2, . . .). A separable distortion metric is defined for n-letter vectors by averaging the single-letter distortions:

d(a^n, â^n) ≜ (1/n) ∑ d(a_i, â_i)

An (n, M, D)-code is

• Encoder f : 𝒜^n → [M]

• Decoder g : [M] → 𝒜̂^n

• Average distortion: E[d(S^n, g(f(S^n)))] ≤ D

Fundamental limit:

M*(n, D) = min{M : ∃ (n, M, D)-code}

R(D) = lim sup_{n→∞} (1/n) log M*(n, D)

Now that we have the definition, we give the (surprisingly simple) general converse

Theorem 25.2 (General Converse). For all lossy codes X → W → X̂ such that E[d(X, X̂)] ≤ D, we have

log M ≥ ϕ_X(D) ≜ inf_{P_{Y|X}: E[d(X,Y)]≤D} I(X; Y),

where W ∈ [M].

Proof.

log M ≥ H(W) ≥ I(X; W) ≥ I(X; X̂) ≥ ϕ_X(D),

where the last inequality follows from the fact that P_{X̂|X} is a feasible solution (by assumption).

Theorem 25.3 (Properties of ϕX).

1. ϕX is convex, non-increasing.


2. ϕ_X is continuous on (D_0, ∞), where D_0 = inf{D : ϕ_X(D) < ∞}.

3. If d(x, y) = D_0 for x = y and d(x, y) > D_0 for x ≠ y, then ϕ_X(D_0) = I(X; X).

4. Let D_max = inf_{x̂∈𝒳̂} E d(X, x̂). Then ϕ_X(D) = 0 for all D > D_max. If D_0 < D_max then also ϕ_X(D_max) = 0.

Note: If D_max = E d(X, x̂) for some x̂, then x̂ is the "default" reconstruction of X, i.e., the best estimate when we have no information about X. Therefore D ≥ D_max can be achieved for free. This is the reason for the notation D_max despite the fact that it is defined as an infimum.
Example: (Gaussian with MSE distortion) For X ∼ N(0, σ²) and d(x, y) = (x − y)², we have ϕ_X(D) = (1/2) log⁺(σ²/D). In this case D_0 = 0, which is not attained; D_max = σ², and if D ≥ σ² we can simply output X̂ = 0 as the reconstruction, which requires zero bits.

Proof.

1. Convexity follows from the convexity of P_{Y|X} ↦ I(P_X, P_{Y|X}).

2. Continuity on the interior of the domain follows from convexity.

3. The only way to satisfy the constraint is to take X = Y .

4. For any D > D_max we can set X̂ = x̂ deterministically. Thus I(X; X̂) = 0. The second claim follows from continuity.

In channel coding, we looked at the capacity and the information capacity. We define the information rate-distortion function in an analogous way here; by itself it is not an operational quantity.

Definition 25.2. The information rate-distortion function for a source is

R_i(D) = lim sup_{n→∞} (1/n) ϕ_{S^n}(D), where ϕ_{S^n}(D) = inf_{P_{Ŝ^n|S^n}: E[d(S^n,Ŝ^n)]≤D} I(S^n; Ŝ^n),

and D_0 = inf{D : R_i(D) < ∞}.
The reason for defining R_i(D) is that from Theorem 25.2 we immediately get:

Corollary 25.1. ∀D, R(D) ≥ Ri(D).

Naturally, the information rate-distortion function inherits the properties of ϕ:

Theorem 25.4 (Properties of Ri).

1. Ri(D) is convex, non-increasing

2. R_i(D) is continuous on (D_0, ∞), where D_0 ≜ inf{D : R_i(D) < ∞}.


3. If d(x, y) = D_0 for x = y and d(x, y) > D_0 for x ≠ y, then for a stationary ergodic S^n, R_i(D_0) = H (the entropy rate), or +∞ if S_k is not discrete.

4. R_i(D) = 0 for all D > D_max, where

D_max ≜ lim sup_{n→∞} inf_{ŝ^n} E d(S^n, ŝ^n).

If D_0 < D_max, then R_i(D_max) = 0 too.

5. (Single-letterization) If the source S_i is i.i.d., then

R_i(D) = ϕ_{S_1}(D) = inf_{P_{Ŝ|S}: E[d(S,Ŝ)]≤D} I(S; Ŝ).

Proof. Properties 1-4 follow directly from the corresponding properties of ϕ_{S^n} in Theorem 25.3, and property 5 will be established in the next section.

25.3* Converting excess distortion to average

Finally, we discuss how to build a compressor for average distortion if we have a compressor for excess distortion; we will not discuss this in detail in class.

Assumption D_p. Assume that for (S, d) there exists p > 1 such that D_p < ∞, where

D_p ≜ sup_n inf_{x̂} (E|d(S^n, x̂)|^p)^{1/p} < +∞,

i.e. that our separable distortion metric d doesn't grow too fast. Note that (by Minkowski's inequality) for stationary memoryless sources we have a single-letter bound:

D_p ≤ inf_{ŝ} (E|d(S, ŝ)|^p)^{1/p}   (25.10)

Theorem 25.5 (Excess-to-Average). Suppose there exists X → W → X̂ such that W ∈ [M] and P[d(X, X̂) > D] ≤ ε. Suppose for some p ≥ 1 and x̂_0 ∈ 𝒳̂, (E[d(X, x̂_0)^p])^{1/p} = D_p < ∞. Then there exists a code X → W' → X̂' such that W' ∈ [M + 1] and

E[d(X, X̂')] ≤ D(1 − ε) + D_p ε^{1−1/p}   (25.11)

Remark 25.3. The theorem is only useful for p > 1, since for p = 1 the right-hand side of (25.11) does not converge to 0 as ε → 0.

Proof. We transform the first code into the second by adding one codeword:

f'(x) = f(x) if d(x, g(f(x))) ≤ D, and f'(x) = M + 1 otherwise;
g'(j) = g(j) for j ≤ M, and g'(M + 1) = x̂_0.


Then

E[d(X, g'(f'(X)))] ≤ E[d(X, X̂) | W' ≠ M + 1](1 − ε) + E[d(X, x̂_0) 1{W' = M + 1}]
≤ D(1 − ε) + D_p ε^{1−1/p}   (Hölder's inequality)


§ 26. Rate distortion: achievability bounds

26.1 Recap

Recall from the last lecture:

R(D) = lim sup_{n→∞} (1/n) log M*(n, D)   (rate-distortion function)

R_i(D) = lim sup_{n→∞} (1/n) ϕ_{S^n}(D)   (information rate-distortion function)

and

ϕ_S(D) ≜ inf_{P_{Ŝ|S}: E[d(S,Ŝ)]≤D} I(S; Ŝ)

ϕ_{S^n}(D) = inf_{P_{Ŝ^n|S^n}: E[d(S^n,Ŝ^n)]≤D} I(S^n; Ŝ^n)

Also, we showed the general converse: for any (M, D)-code X → W → X̂ we have

log M ≥ ϕ_X(D)
⟹ log M*(n, D) ≥ ϕ_{S^n}(D)
⟹ R(D) ≥ R_i(D)

In this lecture, we will prove the achievability bound and establish the identity R(D) = R_i(D) for stationary memoryless sources.

First we show that R_i(D) can be easily calculated for memoryless sources without going through the multi-letter optimization problem. This is the counterpart of Corollary 19.1 for channel capacity (with separable cost function).

Theorem 26.1 (Single-letterization). For a stationary memoryless source S^n and separable distortion d,

R_i(D) = ϕ_S(D).

Proof. By definition we have ϕ_{S^n}(D) ≤ n ϕ_S(D) by choosing a product channel P_{Ŝ^n|S^n} = (P_{Ŝ|S})^n. Thus R_i(D) ≤ ϕ_S(D).


For the converse, take any P_{Ŝ^n|S^n} such that the constraint E[d(S^n, Ŝ^n)] ≤ D is satisfied. We have

I(S^n; Ŝ^n) ≥ ∑_{j=1}^n I(S_j; Ŝ_j)   (S^n independent)
≥ ∑_{j=1}^n ϕ_S(E[d(S_j, Ŝ_j)])
≥ n ϕ_S((1/n) ∑_{j=1}^n E[d(S_j, Ŝ_j)])   (convexity of ϕ_S)
≥ n ϕ_S(D)   (ϕ_S non-increasing)

26.2 Shannon’s rate-distortion theorem

Theorem 26.2. Let the source S^n be stationary and memoryless, S_i i.i.d. ∼ P_S, and suppose that the distortion metric d and the target distortion D satisfy:

1. d(s^n, ŝ^n) is non-negative and separable

2. D > D_0

3. D_max is finite, i.e. D_max ≜ inf_{ŝ} E[d(S, ŝ)] < ∞.

Then

R(D) = R_i(D) = inf_{P_{Ŝ|S}: E[d(S,Ŝ)]≤D} I(S; Ŝ).   (26.1)

Remark 26.1. • Note that D_max < ∞ does not imply that d(·, ·) only takes values in R, i.e. the theorem permits d(a, â) = ∞.

• It should be remarked that when D_max = ∞ (e.g. S ∼ Cauchy) typically R(D) = ∞. Indeed, suppose that d(·, ·) is a metric (i.e. finite valued and satisfying the triangle inequality). Then, for any x_0 ∈ 𝒜^n we have

d(X, X̂) ≥ d(X, x_0) − d(x_0, X̂).

Thus, for any finite codebook {c_1, . . . , c_M} we have max_j d(x_0, c_j) < ∞ and therefore

E[d(X, X̂)] ≥ E[d(X, x_0)] − max_j d(x_0, c_j) = ∞,

so that R(D) = ∞ for any finite D. This observation, however, should not be interpreted as an absolute impossibility of compression for such sources. It is just not possible with fixed-rate codes. As an example, for quadratic distortion and Cauchy-distributed S, D_max = ∞ since S has infinite second-order moments. But it is easy to see that R_i(D) < ∞ for any D ∈ (0, ∞). In fact, in this case R_i(D) is a hyperbola-like curve that never touches either axis. A non-trivial compression can be attained with compressors S^n → W of bounded entropy H(W) (but unbounded alphabet of W). Indeed, if we take W to be a Δ-quantized version of S and notice that the differential entropy of S is finite, we get from (25.7) that R_i(Δ) ≤ H(W) < ∞. Interesting question: Is H(W) = nR_i(D) + o(n) attainable?


• Techniques used in proving (26.1) for memoryless sources can be applied to prove it for stationary ergodic sources, with changes similar to those we have discussed in lossless compression (Lecture 10).

Before giving a formal proof, we illustrate the intuition non-rigorously.

26.2.1 Intuition

Try to throw in M points C = {c_1, . . . , c_M} ⊂ 𝒜̂^n which are drawn i.i.d. according to a product distribution Q_Ŝ^n, where Q_Ŝ is some distribution on 𝒜̂. Examine the simple encoder-decoder pair:

encoder: f(s^n) = argmin_{j∈[M]} d(s^n, c_j)   (26.2)
decoder: g(j) = c_j   (26.3)

The basic idea is the following: since the codewords are generated independently of the source, the probability that a given codeword offers a good reconstruction is (exponentially) small, say ε. However, since we have many codewords, the chance that there exists a good one can be high. More precisely, the probability that no good codeword exists is (1 − ε)^M, which can be very close to zero as long as M ≫ 1/ε.

To explain the intuition further, let us consider the excess distortion of this code: P[d(S^n, Ŝ^n) > D]. Define

P_success ≜ P[∃c ∈ C s.t. d(S^n, c) ≤ D]

Then

P_failure ≜ P[∀c_i ∈ C, d(S^n, c_i) > D]   (26.4)
≈ P[∀c_i ∈ C, d(S^n, c_i) > D | S^n ∈ T_n]   (26.5)
(T_n is the set of typical strings with empirical distribution ≈ P_S)
= P[d(S^n, Ŝ^n) > D | S^n ∈ T_n]^M   (P_{S^n,Ŝ^n} = P_S^n Q_Ŝ^n)   (26.6)
= (1 − P[d(S^n, Ŝ^n) ≤ D | S^n ∈ T_n])^M   (since Ŝ^n ⊥⊥ S^n, this probability is small)   (26.7)
≈ (1 − 2^{−nE(Q_Ŝ)})^M   (large deviations!)   (26.8)

where it can be shown (similarly to information projection in Lecture 14) that

E(Q_Ŝ) = min_{P_{Ŝ|S}: E[d(S,Ŝ)]≤D} D(P_{Ŝ|S} ‖ Q_Ŝ | P_S)   (26.9)

Thus we conclude that for every Q_Ŝ and every δ > 0 we can pick M = 2^{n(E(Q_Ŝ)+δ)} and the above code will have arbitrarily small excess distortion:

P_failure = P[∀c ∈ C, d(S^n, c) > D] → 0 as n → ∞.


We optimize Q_Ŝ to get the smallest possible M:

min_{Q_Ŝ} E(Q_Ŝ) = min_{Q_Ŝ} min_{P_{Ŝ|S}: E[d(S,Ŝ)]≤D} D(P_{Ŝ|S} ‖ Q_Ŝ | P_S)   (26.10)
= min_{P_{Ŝ|S}: E[d(S,Ŝ)]≤D} min_{Q_Ŝ} D(P_{Ŝ|S} ‖ Q_Ŝ | P_S)   (26.11)
= min_{P_{Ŝ|S}: E[d(S,Ŝ)]≤D} I(S; Ŝ)
= ϕ_S(D)
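Note: the random-coding construction above is easy to simulate for the Gaussian source with MSE distortion. The following sketch (assuming numpy; the blocklength, rate and trial count are arbitrary small values) draws the codewords i.i.d. from N(0, σ² − D) and uses the nearest-codeword encoder (26.2). At such a short blocklength the empirical distortion is still noticeably above the asymptotic value σ² 2^{−2R}, but the gap shrinks as n grows.

import numpy as np

rng = np.random.default_rng(0)
n, R, sigma2, trials = 12, 1.0, 1.0, 200
M = 2 ** int(n * R)                       # 2^{nR} codewords
D_target = sigma2 * 2 ** (-2 * R)         # asymptotic distortion sigma^2 2^{-2R}

# codewords drawn iid from the optimal output distribution N(0, sigma^2 - D)
C = rng.normal(0.0, np.sqrt(sigma2 - D_target), size=(M, n))
mse = 0.0
for _ in range(trials):
    s = rng.normal(0.0, np.sqrt(sigma2), size=n)
    j = np.argmin(((C - s) ** 2).sum(axis=1))    # nearest-codeword encoder (26.2)
    mse += ((C[j] - s) ** 2).mean()
print(mse / trials, "vs asymptotic", D_target)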

26.2.2 Proof of Theorem 26.2

Theorem 26.3 (Performance bound of average-distortion codes). Fix P_X and suppose d(x, x̂) ≥ 0 for all x, x̂. For every P_{Y|X}, every γ > 0 and every y_0 ∈ 𝒜̂, there exists a code X → W → X̂, where W ∈ [M + 1] and

E[d(X, X̂)] ≤ E[d(X, Y)] + E[d(X, y_0)] e^{−M/γ} + E[d(X, y_0) 1{i(X;Y) > log γ}]
d(X, X̂) ≤ d(X, y_0) a.s.

Note:

• This theorem says that from an arbitrary P_{Y|X} such that E d(X, Y) ≤ D, we can extract a good code with average distortion D plus some extra terms which will vanish in the asymptotic regime.

• The proof uses the random coding argument. The deterministic y_0 plays the role of a "fail-safe" codeword (think of y_0 as the default reconstruction with D_max = E[d(X, y_0)]). We add y_0 to the random codebook for "damage control", to hedge against the (highly unlikely and unlucky) event that we end up with a terrible codebook.

Proof. Similar to the previous intuitive argument, we apply random coding and generate the codewords randomly and independently of the source:

C = {c_1, . . . , c_M} i.i.d. ∼ P_Y, independent of X,

and add the "fail-safe" codeword c_{M+1} = y_0. We adopt the same encoder-decoder pair (26.2)–(26.3) and let X̂ = g(f(X)). Then by definition,

d(X, X̂) = min_{j∈[M+1]} d(X, c_j) ≤ d(X, y_0).

To simplify notation, let Ȳ be an independent copy of Y (similar to the idea of introducing the unsent codeword X̄ in channel coding – see Lecture 17):

P_{X,Y,Ȳ} = P_{X,Y} P_{Ȳ}


where P_Ȳ = P_Y. Recall the formula for computing the expectation of a random variable U ∈ [0, a]: E[U] = ∫_0^a P[U ≥ u] du. Then the average distortion is

E d(X, X̂) = E min_{j∈[M+1]} d(X, c_j)   (26.12)
= E_X E[min_{j∈[M+1]} d(X, c_j) | X]   (26.13)
= E_X ∫_0^{d(X,y_0)} P[min_{j∈[M+1]} d(X, c_j) > u | X] du   (26.14)
≤ E_X ∫_0^{d(X,y_0)} P[min_{j∈[M]} d(X, c_j) > u | X] du   (26.15)
= E_X ∫_0^{d(X,y_0)} P[d(X, Ȳ) > u | X]^M du   (26.16)
= E_X ∫_0^{d(X,y_0)} (1 − δ(X, u))^M du, where δ(X, u) ≜ P[d(X, Ȳ) ≤ u | X]   (26.17)

Next we upper bound (1 − δ(X, u))^M as follows:

(1 − δ(X, u))^M ≤ e^{−M/γ} + |1 − γ δ(X, u)|⁺   (26.18)
= e^{−M/γ} + |1 − γ E[exp{−i(X;Y)} 1{d(X,Y)≤u} | X]|⁺   (26.19)
≤ e^{−M/γ} + P[i(X;Y) > log γ | X] + P[d(X,Y) > u | X]   (26.20)

where

• (26.18) uses the following trick for dealing with (1 − δ)^M when δ ≪ 1 and M ≫ 1. First, recall the standard rule of thumb:

(1 − ε_n)^n ≈ 0 if ε_n n ≫ 1, and ≈ 1 if ε_n n ≪ 1.

In order to obtain firm bounds of a similar flavor, consider

1 − δM ≤ (1 − δ)^M ≤ e^{−δM}   (union bound; log(1 − δ) ≤ −δ)
≤ e^{−M/γ}(γδ ∧ 1) + |1 − γδ|⁺   (∀γ > 0)
≤ e^{−M/γ} + |1 − γδ|⁺

• (26.19) is simply a change of measure using i(x; y) = log (dP_{Y|X}(y|x)/dP_Y(y)) (i.e., the conditioning-unconditioning trick for information density, cf. Proposition 17.1).


• (26.20):

1 − γ E[exp{−i(X;Y)} 1{d(X,Y)≤u} | X] ≤ 1 − γ E[exp{−i(X;Y)} 1{d(X,Y)≤u, i(X;Y)≤log γ} | X]
≤ 1 − E[1{d(X,Y)≤u, i(X;Y)≤log γ} | X]
= P[d(X,Y) > u or i(X;Y) > log γ | X]
≤ P[d(X,Y) > u | X] + P[i(X;Y) > log γ | X]

Plugging (26.20) into (26.17), we have

E[d(X, X̂)] ≤ E_X ∫_0^{d(X,y_0)} (e^{−M/γ} + P[i(X;Y) > log γ | X] + P[d(X,Y) > u | X]) du
≤ E[d(X, y_0)] e^{−M/γ} + E[d(X, y_0) P[i(X;Y) > log γ | X]] + E_X ∫_0^∞ P[d(X,Y) > u | X] du
= E[d(X, y_0)] e^{−M/γ} + E[d(X, y_0) 1{i(X;Y) > log γ}] + E[d(X,Y)]

As a by-product, we have the following achievability bound for excess distortion.

Theorem 26.4 (Performance bound of excess-distortion codes). For every P_{Y|X} and every γ > 0, there exists a code X → W → X̂, where W ∈ [M] and

P[d(X, X̂) > D] ≤ e^{−M/γ} + P[{d(X,Y) > D} ∪ {i(X;Y) > log γ}]

Proof. Proceed exactly as in the proof of Theorem 26.3, replace (26.12) by P[d(X, X̂) > D] = P[∀j ∈ [M], d(X, c_j) > D] = E_X[(1 − P[d(X, Ȳ) ≤ D | X])^M], and continue similarly.

Finally, we are able to prove Theorem 26.2 rigorously by applying Theorem 26.3 to the iid source X = S^n and letting n → ∞:

Proof of Theorem 26.2. Our goal is the achievability: R(D) ≤ R_i(D) = ϕ_S(D).
WLOG we can assume that D_max = E[d(S, ŝ_0)] is achieved at some fixed ŝ_0 – this is our default reconstruction; otherwise just take any other fixed point so that the expectation is finite. The default reconstruction for S^n is ŝ_0^n = (ŝ_0, . . . , ŝ_0) and E[d(S^n, ŝ_0^n)] = D_max < ∞ since the distortion is separable.

Fix some small δ > 0. Take any P_{Ŝ|S} such that E[d(S, Ŝ)] ≤ D − δ. Apply Theorem 26.3 to (X, Y) = (S^n, Ŝ^n) with

P_X = P_{S^n}
P_{Y|X} = P_{Ŝ^n|S^n} = (P_{Ŝ|S})^n
log M = n(I(S; Ŝ) + 2δ)
log γ = n(I(S; Ŝ) + δ)
d(X, Y) = (1/n) ∑_{j=1}^n d(S_j, Ŝ_j)
y_0 = ŝ_0^n


we conclude that there exist a compressor f : 𝒜^n → [M + 1] and a decompressor g : [M + 1] → 𝒜̂^n such that

E[d(S^n, g(f(S^n)))] ≤ E[d(S^n, Ŝ^n)] + E[d(S^n, ŝ_0^n)] e^{−M/γ} + E[d(S^n, ŝ_0^n) 1{i(S^n; Ŝ^n) > log γ}]
≤ D − δ + D_max e^{−exp(nδ)} (→ 0) + E[d(S^n, ŝ_0^n) 1_{E_n}] (→ 0, shown later),   (26.21)

where

E_n = {i(S^n; Ŝ^n) > log γ} = {(1/n) ∑_{j=1}^n i(S_j; Ŝ_j) > I(S; Ŝ) + δ}  ⟹ (WLLN)  P[E_n] → 0.

If we can show that the expectation in (26.21) vanishes, then there exists an (n, M, D')-code with

M = 2^{n(I(S;Ŝ)+2δ)},  D' = D − δ + o(1) ≤ D.

To summarize, for every P_{Ŝ|S} such that E[d(S, Ŝ)] ≤ D − δ we have

R(D) ≤ I(S; Ŝ), and letting δ ↓ 0: R(D) ≤ ϕ_S(D−) = ϕ_S(D)   (continuity, since D > D_0).

It remains to show that the expectation in (26.21) vanishes. This is a simple consequence of the uniform integrability of the sequence {d(S^n, ŝ_0^n)}. (Indeed, any sequence V_n → V in L¹ is uniformly integrable.) If you do not know what uniform integrability is, here is a self-contained proof.

Lemma 26.1. For any positive random variable U, define g(δ) = sup_{H: P[H]≤δ} E[U 1_H]. Then¹

E U < ∞ ⇒ g(δ) → 0 as δ → 0.

Proof. For any b > 0, E[U 1_H] ≤ E[U 1{U > b}] + bδ, where E[U 1{U > b}] → 0 as b → ∞ by the dominated convergence theorem. Then the proof is completed by setting b = 1/√δ.

Now d(S^n, ŝ_0^n) = (1/n) ∑ U_j, where U_j = d(S_j, ŝ_0) are iid copies of a random variable U. Since E[U] = D_max < ∞ by assumption, applying Lemma 26.1 yields E[d(S^n, ŝ_0^n) 1_{E_n}] = (1/n) ∑ E[U_j 1_{E_n}] ≤ g(P[E_n]) → 0, since P[E_n] → 0. We are done proving the theorem.

Note: It seems that in Section 26.2.1 and in Theorem 26.2 we applied different relaxations in showing the bound; how come they turn out to yield the same tight asymptotic result? This is because the key to both proofs is to estimate the exponent (large deviations) of the probabilities singled out in (26.7) and (26.17), respectively. To get the right exponent, as we know, the key is to apply tilting (change of measure) to the distribution solving the information projection problem (26.9). In the case when P_Y = (Q_Ŝ)^n = (P_Ŝ)^n is chosen as the solution to the rate-distortion optimization inf I(S; Ŝ), the resulting tilting is precisely given by 2^{−i(X;Y)}.

1In fact, ⇒ is ⇔.


26.3* Covering lemma

Goal: [Figure omitted: an encoder observes A^n ∼ P_A^n and produces a message W with |W| ≤ 2^{nR}; a decoder generates B^n from W. The pair (A^n, B^n) should be indistinguishable from an iid sample of P_{A,B}.]

In other words: approximate P with Q such that for any function f and every x we have

P[f(A^n, B^n) ≤ x] ≈ Q[f(A^n, B^n) ≤ x],  |W| ≤ 2^{nR}.

What is the minimum rate R to achieve this? Some remarks:

1. The minimal rate will depend (although it is not obvious) on whether the encoder A^n → W knows about the test that the tester is running (or equivalently whether it knows the function f(·, ·)).

2. If the function is known to be of the form f(A^n, B^n) = ∑_{j=1}^n f_1(A_j, B_j), then evidently the job of the encoder is the following: for any realization of the sequence A^n, we need to generate a sequence B^n such that the joint composition (empirical distribution) is very close to P_{A,B}.

3. If R = H(A), we can compress A^n and send it to the "B side", who can reconstruct A^n perfectly and use that information to produce B^n through P_{B^n|A^n}.

4. If R = H(B), the "A side" can generate B^n according to P_{A,B}^n and send that B^n sequence to the "B side".

5. If A ⊥⊥ B, we know that R = 0, as “B side” can generate Bn independently.


Our previous argument turns out to give a sharp answer for the case when the encoder is aware of the tester's algorithm. Here is a precise result:

Theorem 26.5 (Covering Lemma). For every P_{A,B} and R > I(A;B), let C = {c_1, . . . , c_M} where each codeword c_j is drawn i.i.d. from the distribution P_B^n. For every ε > 0 and M ≥ 2^{n(I(A;B)+ε)} we have that

P[∃c ∈ C such that P̂_{A^n,c} ≈ P_{A,B}] → 1.

Stronger form: for every F,

P[∃c : (A^n, c) ∈ F] ≥ P[(A^n, B^n) ∈ F] + o(1),   with o(1) uniform in F.

Proof. Following arguments similar to the proof of Theorem 26.3, we have

P[∀c ∈ C : (A^n, c) ∉ F] ≤ e^{−M/γ} + P[{(A^n, B^n) ∉ F} ∪ {i(A^n; B^n) > log γ}] = P[(A^n, B^n) ∉ F] + o(1)
⟹ P[∃c ∈ C : (A^n, c) ∈ F] ≥ P[(A^n, B^n) ∈ F] + o(1)

Note: [Intuition] To generate B^n, there are around 2^{nH(B)} high-probability sequences; for each A^n sequence, there are around 2^{nH(B|A)} sequences B^n that have the same joint distribution. Therefore it is sufficient to describe the class of B^n for each A^n sequence, and there are around 2^{nH(B)}/2^{nH(B|A)} = 2^{nI(A;B)} classes.

Although the Covering Lemma is a powerful tool, it does not imply that the constructed joint distribution Q_{A^n,B^n} can fool any permutation invariant tester. In other words, it is not guaranteed that

sup_{F ⊂ 𝒜^n × ℬ^n, permut. invar.} |Q_{A^n,B^n}(F) − P_{A,B}^n(F)| → 0.

Indeed, a sufficient statistic for a permutation invariant tester is the joint type P̂_{A^n,c}. Our code satisfies P̂_{A^n,c} ≈ P_{A,B}, but it might happen that P̂_{A^n,c}, although close to P_{A,B}, still takes highly unlikely values (for example, if we restrict all c to have the same composition P_0, the tester can easily detect the problem, since the P_B^n-measure of all strings of composition P_0 cannot exceed O(1/√n)).

Formally, to fool a permutation invariant tester we need small total variation between the distributions of the joint types under P and Q. (It is natural to conjecture that rate R = I(A;B) should be sufficient to achieve this requirement, though.)

A related question is about the minimal possible rate (i.e. cardinality of W ∈ [2^{nR}]) required to have small total variation:

TV(Q_{A^n,B^n}, P_{AB}^n) ≤ ε   (26.22)

Note that condition (26.22) guarantees that any tester (permutation invariant or not) is fooled into believing he sees the truly iid (A^n, B^n). The minimal required rate turns out to be (Cuff '2012)

R = min_{A→U→B} I(A,B; U),

a quantity known as Wyner's common information C(A;B). Showing that Wyner's common information is a lower bound is not hard. Indeed, since Q_{A^n,B^n} ≈ P_{AB}^n (in TV) we have

I(Q_{A^{t−1},B^{t−1}}, Q_{A_t B_t | A^{t−1},B^{t−1}}) ≈ I(P_{A^{t−1},B^{t−1}}, P_{A_t B_t | A^{t−1},B^{t−1}}) = 0


(Here one needs to use the finiteness of the alphabets of A and B and the bounds relating H(P) − H(Q) to TV(P, Q).) We have (under Q!)

nR = H(W) ≥ I(A^n, B^n; W)   (26.23)
≥ ∑_{t=1}^n [I(A_t, B_t; W) − I(A_t, B_t; A^{t−1}, B^{t−1})]   (26.24)
≈ ∑_{t=1}^n I(A_t, B_t; W)   (26.25)
≳ n C(A;B)   (26.26)

where in the last step we used the crucial observation that under Q there is a Markov chain

A_t → W → B_t

and that Wyner's common information P_{A,B} ↦ C(A;B) should be continuous in the total variation distance on P_{A,B}. Showing achievability is a little more involved.


§ 27. Evaluating R(D). Lossy Source-Channel separation.

Last time: for stationary memoryless (iid) sources and separable distortion, under the assumption that D_max < ∞,

R(D) = R_i(D) = inf_{P_{Ŝ|S}: E d(S,Ŝ)≤D} I(S; Ŝ).

27.1 Evaluation of R(D)

So far we've proved some properties of the rate-distortion function; now we'll compute its value for a few simple statistical sources. We'll do this in a somewhat unsatisfying way: guess the answer, then verify its correctness. At the end, we'll show that there is a pattern behind this method.

27.1.1 Bernoulli Source

Let S ∼ Ber(p), p ≤ 1/2, with Hamming distortion d(S, Ŝ) = 1{S ≠ Ŝ} and alphabets 𝒜 = 𝒜̂ = {0, 1}. Then d(s^n, ŝ^n) = (1/n)‖s^n − ŝ^n‖_Hamming is the bit-error rate.
Claim:

R(D) = |h(p) − h(D)|⁺   (27.1)

Proof. Since D_max = p, in the sequel we can assume D < p, for otherwise there is nothing to show.
(Achievability) We're free to choose any P_{Ŝ|S}, so choose S = Ŝ + Z, where Ŝ ∼ Ber(p') ⊥⊥ Z ∼ Ber(D), and p' is such that

p' ∗ D = p'(1 − D) + (1 − p')D = p,

i.e., p' = (p − D)/(1 − 2D). In other words, the backward channel P_{S|Ŝ} is BSC(D). This induces some forward channel P_{Ŝ|S}. Then

I(S; Ŝ) = H(S) − H(S|Ŝ) = h(p) − h(D).

Since one such P_{Ŝ|S} exists, we have the upper bound R(D) ≤ h(p) − h(D).

(Converse) First proof: for any P_{Ŝ|S} such that P[S ≠ Ŝ] ≤ D ≤ p ≤ 1/2,

I(S; Ŝ) = H(S) − H(S|Ŝ)
= H(S) − H(S + Ŝ | Ŝ)
≥ H(S) − H(S + Ŝ)
= h(p) − h(P[S ≠ Ŝ])
≥ h(p) − h(D)


Second proof: Here is a more general strategy. Denote by P*_{S|Ŝ} = BSC(D) the backward channel from the achievability proof. We need to show that no other Q_{Ŝ|S} with E_Q[d(S, Ŝ)] ≤ D attains a smaller mutual information. Consider the chain (the joint distribution is Q_{S,Ŝ} = P_S Q_{Ŝ|S}, so the S-marginal is P_S):

I(P_S, Q_{Ŝ|S}) = D(Q_{S|Ŝ} ‖ P_S | Q_Ŝ)
= D(Q_{S|Ŝ} ‖ P*_{S|Ŝ} | Q_Ŝ) + E_Q[ log (P*_{S|Ŝ}/P_S) ]
= D(Q_{S|Ŝ} ‖ P*_{S|Ŝ} | Q_Ŝ) + H(S) + E_Q[ log D · 1{S ≠ Ŝ} + log(1 − D) · 1{S = Ŝ} ]

And we can minimize this expression by taking Q_{S|Ŝ} = P*_{S|Ŝ}, giving

≥ 0 + H(S) + P[S = Ŝ] log(1 − D) + P[S ≠ Ŝ] log D ≥ h(p) − h(D)   (D ≤ 1/2)   (27.2)

Since the upper and lower bounds agree, we have R(D) = |h(p) − h(D)|⁺.

For example, when p = 1/2 and D = 0.11, then R(D) = 1/2 bit. In the Hamming game where we compressed 100 bits down to 50, we indeed can do this while achieving 11% average distortion, compared to the naive scheme of storing half the string and guessing on the other half, which achieves 25% average distortion.
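The 11% figure is simply the solution of h(D) = h(1/2) − 1/2. A quick numerical check (a sketch assuming numpy; the function names are ours):

import numpy as np

def h(x):  # binary entropy in bits
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

def distortion(p, R):
    """Solve h(D) = h(p) - R for D in [0, 1/2] by bisection (h is increasing there)."""
    target = h(p) - R
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = (lo + hi) / 2
        if h(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(distortion(0.5, 0.5))   # ~0.110, the 11% of the Hamming game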

Interpretation: By the WLLN, the distribution P_S^n = Ber(p)^n concentrates near the Hamming sphere of radius np as n grows large. The above result about Hamming sources tells us that the optimal reconstruction points are from P_Ŝ^n = Ber(p')^n, where p' < p if p < 1/2 and p' = 1/2 if p = 1/2, which concentrates on a smaller sphere of radius np' (note the reconstruction points are some exponentially small subset of this sphere).

[Figure: two nested Hamming spheres, S(0, np) for the source and S(0, np') for the reconstruction points.]

It is interesting to note that none of the reconstruction points are the same as any of the typical source realizations.

27.1.2 Gaussian Source

The Gaussian source is defined as 𝒜 = 𝒜̂ = R, S ∼ N(0, σ²), d(a, â) = |a − â|² (MSE distortion).
Claim:

R(D) = (1/2) log⁺(σ²/D)   (27.3)

Proof. Since Dmax = σ2, in the sequel we can assume D < σ2 for otherwise there is nothing to show.


(Achievability) Choose S = Ŝ + Z, where Ŝ ∼ N(0, σ² − D) ⊥⊥ Z ∼ N(0, D). In other words, the backward channel P_{S|Ŝ} is AWGN with noise power D. Since everything is jointly Gaussian, the forward channel can be easily found to be P_{Ŝ|S} = N((σ²−D)/σ² · S, (σ²−D)/σ² · D). Then

I(S; Ŝ) = (1/2) log(σ²/D)  ⟹  R(D) ≤ (1/2) log(σ²/D)

(Converse) Let P_{Ŝ|S} be any conditional distribution such that E_P|S − Ŝ|² ≤ D. Denote by P*_{S|Ŝ} the backward channel from the achievability proof (so P*_{S|Ŝ=ŝ} = N(ŝ, D)). We use the same trick as before:

I(P_S, P_{Ŝ|S}) = D(P_{S|Ŝ} ‖ P*_{S|Ŝ} | P_Ŝ) + E_P[ log (dP*_{S|Ŝ}/dP_S) ]
≥ E_P[ log (dP*_{S|Ŝ}/dP_S) ]
= E_P[ log ( (1/√(2πD)) e^{−(S−Ŝ)²/(2D)} / ((1/√(2πσ²)) e^{−S²/(2σ²)}) ) ]
= (1/2) log(σ²/D) + (log e/2) E_P[ S²/σ² − |S − Ŝ|²/D ]
≥ (1/2) log(σ²/D).

Again, the upper and lower bounds agree.

The interpretation in the Gaussian case is very similar to the case of the Hamming source. As n grows large, our source distribution concentrates on S(0, √(nσ²)) (an n-sphere in Euclidean space rather than Hamming), and our reconstruction points on S(0, √(n(σ² − D))). So again the picture is two nested spheres.
How sensitive is the rate-distortion formula to the Gaussianity assumption on the source? The following is a counterpart of Theorem 19.6 for channel capacity:

Theorem 27.1. Assume that ES = 0 and Var S = σ². Let the distortion metric be quadratic: d(s, ŝ) = (s − ŝ)². Then

(1/2) log⁺(σ²/D) − D(P_S ‖ N(0, σ²)) ≤ R(D) = inf_{P_{Ŝ|S}: E(S−Ŝ)²≤D} I(S; Ŝ) ≤ (1/2) log⁺(σ²/D).

Note: This result is in exact parallel to what we proved in Theorem 19.6 for additive-noise channel capacity:

(1/2) log(1 + P/σ²) ≤ sup_{P_X: EX²≤P} I(X; X + Z) ≤ (1/2) log(1 + P/σ²) + D(P_Z ‖ N(0, σ²)),

where EZ = 0 and Var Z = σ².
Note: A simple consequence of Theorem 27.1 is that for source distributions with a density, the rate-distortion function grows like (1/2) log(1/D) in the low-distortion regime as long as D(P_S ‖ N(0, σ²)) is finite. In fact, the first inequality, known as the Shannon lower bound, is asymptotically tight, i.e., R(D) = (1/2) log(σ²/D) − D(P_S ‖ N(0, σ²)) + o(1) as D → 0. Therefore in this regime performing uniform scalar quantization with accuracy 1/√D is in fact asymptotically optimal within an o(1) term.


Proof. Again, assume D < Dmax = σ2. Let SG ∼ N (0, σ2).

"≤": Use the same P*_{Ŝ|S} = N((σ²−D)/σ² · S, (σ²−D)/σ² · D) as in the achievability proof of the Gaussian rate-distortion function:

R(D) ≤ I(P_S, P*_{Ŝ|S})
= I(S; (σ²−D)/σ² · S + W),  W ∼ N(0, (σ²−D)/σ² · D)
≤ I(S_G; (σ²−D)/σ² · S_G + W)   (by the Gaussian saddle point, Theorem 4.6)
= (1/2) log(σ²/D).

"≥": For any P_{Ŝ|S} such that E(S − Ŝ)² ≤ D, let P*_{S|Ŝ} = N(Ŝ, D) denote AWGN with noise power D. Then

I(S; Ŝ) = D(P_{S|Ŝ} ‖ P_S | P_Ŝ)
= D(P_{S|Ŝ} ‖ P*_{S|Ŝ} | P_Ŝ) + E_P[ log (dP*_{S|Ŝ}/dP_{S_G}) ] − D(P_S ‖ P_{S_G})
≥ E_P[ log ( (1/√(2πD)) e^{−(S−Ŝ)²/(2D)} / ((1/√(2πσ²)) e^{−S²/(2σ²)}) ) ] − D(P_S ‖ P_{S_G})
≥ (1/2) log(σ²/D) − D(P_S ‖ P_{S_G}).

Remark: The theory of quantization, and rate-distortion theory at large, have played a significant role in pure mathematics. For instance, Hilbert's thirteenth problem was partially solved by Arnold and Kolmogorov after they realized that they could classify spaces of functions by looking at the optimal quantizer for such functions.

27.2* Analog of saddle-point property in rate-distortion

In the computation of R(D) for the Hamming and Gaussian sources, we guessed the correct form of the rate-distortion function. In both converse arguments, we used the same trick to establish that any other P_{Ŝ|S} gives a larger value for R(D). In this section, we formalize this trick, in an analogous manner to the saddle-point property of the channel capacity. Note that typically we don't need any tricks to compute R(D), since we can obtain a solution in parametric form to the unconstrained convex optimization

min_{P_{Ŝ|S}} I(S; Ŝ) + λ E[d(S, Ŝ)]

In fact there are also iterative algorithms (Blahut-Arimoto) that compute R(D); a sketch is given below. However, for peace of mind it is good to know that there are some general reasons why tricks like the ones we used for the Hamming/Gaussian sources are actually guaranteed to work.
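For concreteness, here is a minimal sketch of the Blahut-Arimoto iteration for R(D) (assuming numpy; the function name and the slope values s are our choices), checked against the Bernoulli(1/2)/Hamming formula R(D) = 1 − h(D) derived above. Each value of the Lagrange multiplier s produces one point (R, D) on the curve.

import numpy as np

def blahut_arimoto_RD(p, d, s, iters=500):
    """Sketch: one point of the R(D) curve. p: source pmf, d: distortion matrix d[x,y],
    s: Lagrange multiplier controlling the slope of the curve."""
    nx, ny = d.shape
    q = np.full(ny, 1.0 / ny)             # output marginal, initialized uniform
    for _ in range(iters):
        W = q[None, :] * np.exp(-s * d)   # unnormalized test channel Q(y|x)
        W /= W.sum(axis=1, keepdims=True)
        q = p @ W                          # update the output marginal
    D = np.sum(p[:, None] * W * d)
    R = np.sum(p[:, None] * W * np.log2(W / q[None, :]))   # in bits
    return R, D

p = np.array([0.5, 0.5])                  # Bernoulli(1/2) source
d = np.array([[0.0, 1.0], [1.0, 0.0]])    # Hamming distortion
h = lambda x: -x * np.log2(x) - (1 - x) * np.log2(1 - x)
for s in (2.0, 4.0, 8.0):
    R, D = blahut_arimoto_RD(p, d, s)
    print(f"s={s}: R={R:.4f}, D={D:.4f}, 1-h(D)={1 - h(D):.4f}")   # R matches 1-h(D)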


Theorem 27.2. 1. Suppose P_{Y*} and P_{X|Y*} ≪ P_X are found with the property that E[d(X, Y*)] ≤ D and for any P_{XY} with E[d(X, Y)] ≤ D we have

E[ log (dP_{X|Y*}/dP_X)(X|Y) ] ≥ I(X; Y*).   (27.4)

Then R(D) = I(X; Y*).
2. Suppose that I(X; Y*) = R(D). Then for any regular branch of the conditional probability P_{X|Y*} and for any P_{XY} satisfying

• E[d(X, Y)] ≤ D, and

• P_Y ≪ P_{Y*}, and

• I(X; Y) < ∞,

the inequality (27.4) holds.

Remarks:

1. The first part is a sufficient condition for optimality of a given P_{XY*}. The second part gives a necessary condition that is convenient for narrowing down the search. Indeed, typically the set of P_{XY} satisfying those conditions is rich enough to infer from (27.4):

log (dP_{X|Y*}/dP_X)(x|y) = R(D) − θ[d(x, y) − D],

for a positive θ > 0.

2. Note that the second part is not valid without assuming P_Y ≪ P_{Y*}. A counterexample to this and various other erroneous (but frequently encountered) generalizations is the following: 𝒜 = {0, 1}, P_X = Bern(1/2), 𝒜̂ = {0, 1, 0', 1'} and

d(0, 0) = d(0, 0') = 1 − d(0, 1) = 1 − d(0, 1') = 0.

Then R(D) = |1 − h(D)|⁺, but there are a bunch of non-equivalent optimal P_{Y|X}, P_{X|Y} and P_Y's.

Proof. The first part is just a repetition of the proofs above, so we focus on part 2. Suppose there exists a counterexample P_{XY} achieving

I_1 = E[ log (dP_{X|Y*}/dP_X)(X|Y) ] < I* = R(D).

Notice that whenever I(X;Y) < ∞ we have

I_1 = I(X;Y) − D(P_{X|Y} ‖ P_{X|Y*} | P_Y),

and thus

D(P_{X|Y} ‖ P_{X|Y*} | P_Y) < ∞.   (27.5)

Before going to the actual proof, we describe the principal idea. For every λ we can define a joint distribution

P_{X,Y_λ} = λ P_{X,Y} + (1 − λ) P_{X,Y*}.


Then, we can compute

I(X; Y_λ) = E[ log (P_{X|Y_λ}/P_X)(X|Y_λ) ] = E[ log (P_{X|Y_λ}/P_{X|Y*}) · (P_{X|Y*}/P_X) ]   (27.6)
= D(P_{X|Y_λ} ‖ P_{X|Y*} | P_{Y_λ}) + E[ log (P_{X|Y*}/P_X)(X|Y_λ) ]   (27.7)
= D(P_{X|Y_λ} ‖ P_{X|Y*} | P_{Y_λ}) + λ I_1 + (1 − λ) I*.   (27.8)

From here we will conclude, similarly to Prop. 4.1, that the first term is o(λ) and thus for sufficiently small λ we should have I(X; Y_λ) < R(D), contradicting optimality of the coupling P_{X,Y*}.

We proceed to the details. For every λ ∈ [0, 1] define

ρ_1(y) ≜ (dP_Y/dP_{Y*})(y)   (27.9)
λ(y) ≜ λρ_1(y) / (λρ_1(y) + 1 − λ)   (27.10)
P^{(λ)}_{X|Y=y} = λ(y) P_{X|Y=y} + (1 − λ(y)) P_{X|Y*=y}   (27.11)
dP_{Y_λ} = λ dP_Y + (1 − λ) dP_{Y*} = (λρ_1(y) + 1 − λ) dP_{Y*}   (27.12)
D(y) = D(P_{X|Y=y} ‖ P_{X|Y*=y})   (27.13)
D_λ(y) = D(P^{(λ)}_{X|Y=y} ‖ P_{X|Y*=y}).   (27.14)

Notice:

On {ρ_1 = 0}: λ(y) = D(y) = D_λ(y) = 0,

and otherwise λ(y) > 0. By convexity of divergence

D_λ(y) ≤ λ(y) D(y)

and therefore

(1/λ(y)) D_λ(y) 1{ρ_1(y) > 0} ≤ D(y) 1{ρ_1(y) > 0}.

Notice that by (27.5) the function ρ_1(y) D(y) is non-negative and P_{Y*}-integrable. Then, applying the dominated convergence theorem we get

lim_{λ→0} ∫_{ρ_1>0} dP_{Y*} (1/λ(y)) D_λ(y) ρ_1(y) = ∫_{ρ_1>0} dP_{Y*} ρ_1(y) lim_{λ→0} (1/λ(y)) D_λ(y) = 0,   (27.15)

where in the last step we applied the result from Lecture 4:

D(P ‖ Q) < ∞ ⟹ D(λP + (1 − λ)Q ‖ Q) = o(λ),

since for each y on the set {ρ_1 > 0} we have λ(y) → 0 as λ → 0.
On the other hand, notice that

∫_{ρ_1>0} dP_{Y*} (1/λ(y)) D_λ(y) ρ_1(y) = (1/λ) ∫_{ρ_1>0} dP_{Y*} (λρ_1(y) + 1 − λ) D_λ(y)   (27.16)
= (1/λ) ∫_{ρ_1>0} dP_{Y_λ} D_λ(y)   (27.17)
= (1/λ) ∫_Y dP_{Y_λ} D_λ(y) = (1/λ) D(P^{(λ)}_{X|Y} ‖ P_{X|Y*} | P_{Y_λ}),   (27.18)


where in the penultimate step we used D_λ(y) = 0 on {ρ_1 = 0}. Hence, (27.15) shows

D(P^{(λ)}_{X|Y} ‖ P_{X|Y*} | P_{Y_λ}) = o(λ),  λ → 0.

Finally, since

P^{(λ)}_{X|Y} ∘ P_{Y_λ} = P_X,

we have

I(X; Y_λ) = D(P^{(λ)}_{X|Y} ‖ P_{X|Y*} | P_{Y_λ}) + λ E[ log (dP_{X|Y*}/dP_X)(X|Y) ] + (1 − λ) E[ log (dP_{X|Y*}/dP_X)(X|Y*) ]   (27.19)
= I* + λ(I_1 − I*) + o(λ),   (27.20)

contradicting the assumption I(X; Y_λ) ≥ I* = R(D).

27.3 Lossy joint source-channel coding

The lossy joint source-channel coding problem refers to the fundamental limits of lossy compression followed by transmission over a channel.

Problem Setup: For an 𝒜-valued source (S_1, S_2, . . .) and a distortion metric d : 𝒜^k × 𝒜̂^k → R, a lossy JSCC is a pair (f, g) such that

S^k →(f) X^n →(channel) Y^n →(g) Ŝ^k

Definition 27.1. (f, g) is a (k, n, D)-JSCC if E[d(S^k, Ŝ^k)] ≤ D.

[Diagram: Source → JSCC encoder → X^n → Channel → Y^n → JSCC decoder → Ŝ^k, with rate R = k/n.]

where ρ is the bandwidth expansion factor:

ρ = n/k channel uses per source symbol.

Our goal is to minimize ρ subject to a fidelity guarantee by designing the encoder/decoder pairs smartly. The asymptotic fundamental limit for lossy JSCC is

ρ*(D) = lim sup_{n→∞} min{n/k : ∃ (k, n, D)-code}

For simplicity, in this lecture we will focus on JSCC for stationary memoryless sources with separable distortion and stationary memoryless channels.


27.3.1 Converse

The converse for the JSCC is quite simple. Note that since there is no ε under consideration, the strong converse is the same as the weak converse. The proof architecture is identical to the weak converse of lossless JSCC, which uses Fano's inequality.

Theorem 27.3 (Converse). For any source for which the limit

R_i(D) = lim_{k→∞} (1/k) inf_{P_{Ŝ^k|S^k}: E[d(S^k,Ŝ^k)]≤D} I(S^k; Ŝ^k)

exists, we have

ρ*(D) ≥ R_i(D)/C_i.

Remark: The requirement of this theorem on the source isn't too stringent; the limit expression for R_i(D) typically exists for stationary sources (as for the entropy rate).

Proof. Take a (k, n, D)-code S^k → X^n → Y^n → Ŝ^k. By data processing and taking the inf/sup,

inf_{P_{Ŝ^k|S^k}: E[d(S^k,Ŝ^k)]≤D} I(S^k; Ŝ^k) ≤ I(S^k; Ŝ^k) ≤ I(X^n; Y^n) ≤ sup_{P_{X^n}} I(X^n; Y^n).

Normalizing and taking lim inf as n → ∞ along a sequence of (k_n, n, D)-codes, for the right-hand side

lim inf_{n→∞} (1/n) sup_{P_{X^n}} I(X^n; Y^n) = C_i,

and for the left-hand side

lim inf_{n→∞} (1/k_n) inf_{P_{Ŝ^{k_n}|S^{k_n}}} I(S^{k_n}; Ŝ^{k_n}) = R_i(D).

Therefore, any sequence of (k_n, n, D)-codes satisfies

lim sup_{n→∞} n/k_n ≥ R_i(D)/C_i.

Note: Clearly the assumptions in Theorem 27.3 are satisfied for memoryless sources. If the source S is iid Bern(1/2) with Hamming distortion, then Theorem 27.3 coincides with the weak converse for channel coding under bit-error rate in Theorem 16.4:

k ≤ nC / (1 − h(p_b)),

which we proved using ad hoc techniques. In the case of channels with cost constraints, e.g., the AWGN channel with C(SNR) = (1/2) log(1 + SNR), we have

p_b ≥ h^{−1}(1 − C(SNR)/R).

This is often referred to as the Shannon limit in plots comparing the bit-error rate of practical codes. See, e.g., Fig. 2 from [RSU01] for the BIAWGN (binary-input AWGN) channel. This usage is erroneous, since the p_b above refers to the bit-error rate of the data bits (or systematic bits), not of all the codeword bits. The latter quantity is what is typically called BER in the coding-theoretic literature.
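A small numerical sketch of the Shannon-limit computation p_b ≥ h^{−1}(1 − C(SNR)/R) quoted above (assuming numpy; the rate R and the SNR grid are illustrative, and C here is the Gaussian-input AWGN capacity from the displayed formula, not the binary-input capacity):

import numpy as np

def h(x):  # binary entropy in bits
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

def h_inv(y):  # inverse of h on [0, 1/2], by bisection
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if h(mid) < y else (lo, mid)
    return (lo + hi) / 2

R = 0.5                                        # data bits per channel use (example value)
for snr_db in (-2.0, 0.0, 2.0):
    snr = 10 ** (snr_db / 10)
    C = 0.5 * np.log2(1 + snr)                 # AWGN capacity, as in the text
    pb = h_inv(1 - C / R) if C < R else 0.0    # Shannon-limit lower bound on p_b
    print(f"SNR = {snr_db:+.1f} dB: C = {C:.3f} bit/ch.use, p_b >= {pb:.4f}")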


27.3.2 Achievability via separation

The proof strategy is similar to the lossless JSCC: we construct a separated lossy compression and channel coding scheme using our tools from those areas, i.e., we let the JSCC encoder be the concatenation of a lossy compressor and a channel encoder, and the JSCC decoder be the concatenation of a channel decoder and a lossy decompressor, and then show that this separated construction is optimal.

Theorem 27.4. For any stationary memoryless source (P_S, 𝒜, 𝒜̂, d) satisfying Assumption A1 (below), and for any stationary memoryless channel P_{Y|X},

ρ*(D) = R(D)/C.

Note: The assumption on the source is there to control the distortion incurred when the channel decoder makes an error. Although we know that this is a low-probability event, without any assumption on the distortion metric we cannot say much about its contribution to the end-to-end average distortion. This is not a problem if the distortion metric is bounded (for which Assumption A1 is satisfied, of course). Note that we do not have this nuisance in the lossless JSCC because we at most suffer the channel error probability (union bound).

The assumption is rather technical and can be skipped on a first reading. Note that it is trivially satisfied by bounded distortion (e.g., Hamming), and can be shown to hold for the Gaussian source with MSE distortion.

Proof. The converse direction follows from the previous theorem. For the other direction, we construct a separated compression / channel coding scheme. Take

S^k → W → Ŝ^k : a compressor with W ∈ [2^{kR(D)+o(k)}] and E[d(S^k, Ŝ^k)] ≤ D;
W → X^n → Y^n → Ŵ : a maximal-probability-of-error channel code (assuming kR(D) ≤ nC + o(n)) with P[Ŵ ≠ W] ≤ ε for all P_W.

So the overall system is

S^k → W → X^n → Y^n → Ŵ → Ŝ^k

Note that here we need a maximal-probability-of-error code, since when we concatenate these two schemes, W at the input of the channel is the output of the source compressor, which is not guaranteed to be uniform. Now that we have a scheme, we must analyze the average distortion to show that it meets the end-to-end distortion constraint. We start by splitting the expression into two cases:

E[d(S^k, Ŝ^k(Ŵ))] = E[d(S^k, Ŝ^k(W)) 1{W = Ŵ}] + E[d(S^k, Ŝ^k(Ŵ)) 1{W ≠ Ŵ}]

By assumption on our lossy code, we know that the first term is ≤ D. In the second term, we know that the probability of the event {W ≠ Ŵ} is small by assumption on our channel code, but we cannot say anything about E[d(S^k, Ŝ^k(Ŵ))] unless, for example, d is bounded. But by Lemma 27.1 (below), there exists a code S^k → W → Ŝ^k such that

(1) E[d(S^k, Ŝ^k)] ≤ D holds;

(2) d(a_0^k, Ŝ^k) ≤ L for all quantization outputs Ŝ^k, where a_0^k = (a_0, . . . , a_0) is some fixed string of length k from Assumption A1 below.


The second bullet says that all points in the reconstruction space are "close" to some fixed string. Now we can deal with the troublesome term:

E[d(S^k, Ŝ^k(Ŵ)) 1{W ≠ Ŵ}] ≤ E[1{W ≠ Ŵ} λ(d(S^k, â_0^k) + d(a_0^k, Ŝ^k))]
≤ λ E[1{W ≠ Ŵ} d(S^k, â_0^k)] + λ E[1{W ≠ Ŵ} L]   (by point (2) above)
≤ λ · o(1) + λLε → 0 as ε → 0,

where in the last step we applied the same uniform integrability argument that showed the vanishing of the expectation in (26.21).

In all, our scheme meets the average distortion constraint. Hence we conclude that for all ρ > R(D)/C there exists a sequence of (k, n, D + o(1))-codes.

The following assumption is critical for the previous theorem.
Assumption A1: For a source (P_S, 𝒜, 𝒜̂, d), there exist λ ≥ 0, a_0 ∈ 𝒜, â_0 ∈ 𝒜̂ such that

1. d(a, â) ≤ λ(d(a, â_0) + d(a_0, â)) for all a, â (generalized triangle inequality);

2. E[d(S, â_0)] < ∞ (so that D_max < ∞ too);

3. E[d(a_0, Ŝ)] < ∞ for any output distribution P_Ŝ achieving the rate-distortion function R(D) at some D;

4. d(a_0, â_0) < ∞.

This assumption says that the spaces 𝒜 and 𝒜̂ have "nice centers", in the sense that the distance between any two points is upper bounded by a constant times the distance from the centers to each point (see the figure below).

[Figure: the two spaces 𝒜 and 𝒜̂ with centers a_0 and â_0; the distance d(a, â) is controlled via d(a, â_0) and d(a_0, â).]

But the assumption isn't easy to verify, nor is it clear which sources satisfy it. Because of this, we now give a few sufficient conditions for Assumption A1 to hold.

Trivial Condition: If the distortion function is bounded, then Assumption A1 holds automatically. In other words, if we have a discrete source with finite alphabets |𝒜|, |𝒜̂| < ∞ and a finite distortion function d(a, â) < ∞, then A1 holds. More generally, we have the following criterion.

Theorem 27.5 (Criterion for satisfying A1). If 𝒜 = 𝒜̂ and d(a, â) = ρ^q(a, â) for some metric ρ with q ≥ 1, and D_max ≜ inf_{â_0} E[d(S, â_0)] < ∞, then A1 holds.

Proof. Take a_0 = â_0 to be any point with E[d(S, a_0)] < ∞ (in fact, any point can serve as the center in a metric space). Then

((1/2)ρ(a, â))^q ≤ ((1/2)ρ(a, a_0) + (1/2)ρ(a_0, â))^q
≤ (1/2)ρ^q(a, a_0) + (1/2)ρ^q(a_0, â)   (Jensen's inequality)


And thus d(a, â) ≤ 2^{q−1}(d(a, a_0) + d(a_0, â)). Taking λ = 2^{q−1} verifies (1) and (2) in A1. To verify (3), we can use this generalized triangle inequality for our source:

d(a_0, Ŝ) ≤ 2^{q−1}(d(a_0, S) + d(S, Ŝ)).

Then taking the expectation of both sides gives

E[d(a_0, Ŝ)] ≤ 2^{q−1}(E[d(a_0, S)] + E[d(S, Ŝ)]) ≤ 2^{q−1}(D_max + D) < ∞,

so that condition (3) in A1 holds.

So we see that metrics raised to powers (e.g. the squared Euclidean norm) satisfy condition A1. The lemma used in Theorem 27.4 is now given.

Lemma 27.1. Fix a source satisfying A1 and an arbitrary P_{Ŝ|S}. Let R > I(S; Ŝ), L > max{E[d(a_0, Ŝ)], d(a_0, â_0)} and D > E[d(S, Ŝ)]. Then there exists a (k, 2^{kR}, D)-code such that for every reconstruction point x ∈ 𝒜̂^k we have d(a_0^k, x) ≤ L.

Proof. Let 𝒳 = 𝒜^k, 𝒳̂ = 𝒜̂^k and P_X = P_S^k, P_{Y|X} = P_{Ŝ|S}^k. Then apply the achievability bound for excess distortion from Theorem 26.4 with

d_1(x, x̂) = d(x, x̂) if d(a_0^k, x̂) ≤ L, and d_1(x, x̂) = +∞ otherwise.

Note that this is NOT a separable distortion metric. Also note that without any change in d_1-distortion we can remove all (if any) reconstruction points x̂ with d(a_0^k, x̂) > L. Furthermore, from the WLLN we have, for any D > D' > E[d(S, Ŝ)],

P[d_1(X, Y) > D'] ≤ P[d(S^k, Ŝ^k) > D'] + P[d(a_0^k, Ŝ^k) > L] → 0

as k → ∞ (since E[d(S, Ŝ)] < D' and E[d(a_0, Ŝ)] < L). Thus, overall we get M = 2^{kR} reconstruction points (c_1, . . . , c_M) such that

P[min_{j∈[M]} d(S^k, c_j) > D'] → 0

and d(a_0^k, c_j) ≤ L. By adding c_{M+1} = (â_0, . . . , â_0) we get

E[min_{j∈[M+1]} d(S^k, c_j)] ≤ D' + E[d(S^k, c_{M+1}) 1{min_{j∈[M]} d(S^k, c_j) > D'}] = D' + o(1),

where the last estimate follows from uniform integrability, as in the vanishing of the expectation in (26.21). Thus, for sufficiently large k the expected distortion is ≤ D, as required.

To summarize the results in this section: under stationarity and memorylessness assumptions on the source and the channel, we have shown that the following separately-designed scheme achieves the optimal rate for lossy JSCC: first compress the data, then encode it with your favorite channel code, then channel-decode and decompress at the receiver.

[Diagram: Source → JSCC encoder → Channel, with R(D) bits/sec produced by the compressor and ρC bits/sec available on the channel.]


27.4 What is lacking in classical lossy compression?

Examples of some issues with the classical compression theory:

• compression: we can apply the standard results to compression of a text file, but it is extremely difficult for image files due to the strong spatial correlation. For example, the first sentence and the last one in Tolstoy's novel are pretty uncorrelated, but the regions in the upper-left and bottom-right corners of one image can be strongly correlated. Thus, for practical lossy compression of videos and images the key problem is that of coming up with a good "whitening" basis.

• JSCC: Asymptotically the separation principle sounds good, but separated systems can be very unstable - no graceful degradation. Consider the following example of JSCC.

Example: Source = Bern(1/2), channel = BSC(δ).

1. a separate compressor and channel encoder designed so that R(D)/C(δ) = 1;

2. a simple JSCC: ρ = 1, X_i = S_i (compared numerically below).
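With ρ = 1 the separated design solves R(D) = C(δ), i.e. 1 − h(D) = 1 − h(δ), so its target distortion is D = δ, which is exactly the bit-error rate of the uncoded map X_i = S_i. A quick numerical comparison (a sketch assuming numpy; the δ grid is illustrative):

import numpy as np

def h(x):  # binary entropy in bits
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

# Separated design at rho = 1: invert R(D) = 1 - h(D) = C(delta) by bisection on [0, 1/2].
for delta in (0.01, 0.05, 0.11):
    C = 1 - h(delta)                    # BSC(delta) capacity
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if 1 - h(mid) > C else (lo, mid)
    print(f"delta={delta}: separated target D={(lo + hi) / 2:.4f}, uncoded D={delta:.4f}")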


Part VI

Advanced topics


§ 28. Applications to statistical decision theory

In this lecture we discuss applications of information theory to statistical decision theory. Although this lecture only focuses on statistical lower bounds (converse results), let us remark in passing that the impact of information theory on statistics is far from being only about proving impossibility results. Many procedures are based on or inspired by information-theoretic ideas, e.g., those based on metric entropy, pairwise comparison, maximum likelihood estimation and its analysis, minimum distance estimators (Wolfowitz), maximum entropy estimators, the EM algorithm, the minimum description length (MDL) principle, etc.

We discuss two methods: the LeCam-Fano (hypothesis testing) method and the rate-distortion (mutual information) method.

We begin with the decision-theoretic setup of statistical estimation. The general paradigm is the following:

θ (parameter) → X (data) → θ̂ (estimator)

The main ingredients are

• Parameter space: Θ ∋ θ

• Statistical model: {P_{X|θ} : θ ∈ Θ}, a collection of distributions indexed by the parameter

• Estimator: θ̂ = θ̂(X)

• Loss function: ℓ(θ, θ̂), which measures the inaccuracy.

The goal is to make the random variable ℓ(θ, θ̂) small, either in probability or in expectation, uniformly over the unknown parameter θ. To this end, we define the minimax risk

R* = inf_{θ̂} sup_{θ∈Θ} E_θ[ℓ(θ, θ̂)].

Here E_θ denotes averaging with respect to the randomness of X ∼ P_θ.
Ideally we want to compute R* and find the minimax optimal estimator that achieves the minimax risk. This task can be very difficult, especially in high dimensions, in which case we will be content with characterizing the minimax rate, which approximates R* within multiplicative universal constant factors; an estimator that achieves R* up to a constant factor is called rate-optimal.

As opposed to the worst-case analysis of the minimax risk, the Bayes approach is an average-case analysis, considering the average risk of an estimator over all θ ∈ Θ. Let the prior π be a probability distribution on Θ, from which the parameter θ is drawn. Then the average risk (w.r.t. π) is defined as

R_π(θ̂) = E_{θ∼π} R_θ(θ̂) = E_{θ,X} ℓ(θ, θ̂).


The Bayes risk for a prior π is the minimum that the average risk can achieve, i.e.

R*_π = inf_{θ̂} R_π(θ̂).

By the simple logic of "maximum ≥ average", we have

R* ≥ R*_π   (28.1)

and in fact R* = sup_{π∈M(Θ)} R*_π whenever the minimax theorem holds, where M(Θ) denotes the collection of all probability distributions on Θ. In other words, solving the minimax problem can be done by finding the least-favorable (Bayesian) prior. Almost all minimax lower bounds boil down to bounding from below the Bayes risk for some prior. When this prior is uniform on just two points, the method is known under the special name of the (two-point) LeCam or LeCam-Fano method.
Note also that when ℓ(θ, θ̂) = ‖θ − θ̂‖²₂ is the quadratic ℓ₂ risk, the optimal estimator achieving R*_π is easy to describe: θ̂* = E[θ|X]. This fact, however, is of limited value, since typically the conditional expectation is very hard to analyze.

28.1 Fano, LeCam and minimax risks

We demonstrate the LeCam-Fano method on the following example:

• Parameter space θ ∈ [0, 1]

• Observation model Xi – i.i.d. Bern(θ)

• Quadratic loss function: ℓ(θ̂, θ) = (θ̂ − θ)²

• Fundamental limit: R*(n) ≜ inf_{θ̂} sup_{θ_0∈[0,1]} E[(θ̂(X^n) − θ)² | θ = θ_0]

A natural estimator to consider is the empirical mean:

θ̂_emp(X^n) = (1/n) ∑_i X_i

It achieves the loss

sup_{θ_0} E[(θ̂_emp − θ)² | θ = θ_0] = sup_{θ_0} θ_0(1 − θ_0)/n = 1/(4n).   (28.2)

The question is how close this is to the optimum.
First, recall the Cramér-Rao lower bound: consider an arbitrary statistical estimation problem θ → X → θ̂ with θ ∈ R and P_{X|θ}(dx|θ_0) = f(x|θ_0)µ(dx), where f(x|θ) is differentiable in θ. Then for any θ̂(x) with E[θ̂(X)|θ] = θ + b(θ) and smooth b(θ) we have

E[(θ̂ − θ)² | θ = θ_0] ≥ b(θ_0)² + (1 + b'(θ_0))² / J_F(θ_0),   (28.3)

where J_F(θ_0) = Var[∂ ln f(X|θ)/∂θ | θ = θ_0] is the Fisher information (4.7). In our case, for any unbiased estimator (i.e. b(θ) ≡ 0) we have

E[(θ̂ − θ)² | θ = θ_0] ≥ θ_0(1 − θ_0)/n,


and we can see from (28.2) that θ̂_emp is optimal in the class of unbiased estimators.
Can biased estimators do better? The answer is yes. Consider

θ̂_bias = ((1 − ε_n)/n) ∑_i (X_i − 1/2) + 1/2,

where the choice of ε_n > 0 "shrinks" the estimator towards 1/2 and regulates the bias-variance tradeoff. In particular, setting ε_n = 1/(√n + 1) achieves the risk

sup_{θ_0} E[(θ̂_bias − θ)² | θ = θ_0] = 1/(4(√n + 1)²),   (28.4)

which is better than the empirical mean (28.2), but only slightly.
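A quick Monte Carlo comparison of the worst-case risks of θ̂_emp and θ̂_bias (a sketch assuming numpy; the grid over θ_0, the sample size n and the number of trials are arbitrary choices):

import numpy as np

rng = np.random.default_rng(1)
n, trials = 100, 20000
thetas = np.linspace(0.0, 1.0, 21)            # grid standing in for the sup over theta_0
eps = 1.0 / (np.sqrt(n) + 1)

risk_emp, risk_bias = [], []
for th in thetas:
    X = rng.binomial(1, th, size=(trials, n))
    emp = X.mean(axis=1)
    bias = (1 - eps) * (emp - 0.5) + 0.5      # shrinkage towards 1/2
    risk_emp.append(np.mean((emp - th) ** 2))
    risk_bias.append(np.mean((bias - th) ** 2))

print(max(risk_emp), 1 / (4 * n))                          # ~ 1/(4n)
print(max(risk_bias), 1 / (4 * (np.sqrt(n) + 1) ** 2))     # ~ 1/(4(sqrt(n)+1)^2)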

How do we show that arbitrary biased estimators cannot do significantly better? This is where the LeCam-Fano method comes in handy. Suppose some estimator θ̂ achieves

E[(θ̂ − θ)² | θ = θ_0] ≤ Δ_n²   (28.5)

for all θ_0. Then set up the following probability space:

W → θ → X^n → θ̂ → Ŵ

• W ∼ Bern(1/2)

• θ = 1/2 + κ(−1)^W Δ_n, where κ > 0 is to be specified later

• X^n is i.i.d. Bern(θ)

• θ̂ is the given estimator

• Ŵ = 0 if θ̂ > 1/2 and Ŵ = 1 otherwise

The idea here is that we use our high-quality estimator to distinguish between the two hypotheses θ = 1/2 ± κΔ_n. Notice that for the probability of error we have

P[Ŵ ≠ W] = P[θ̂ > 1/2 | θ = 1/2 − κΔ_n] ≤ E[(θ̂ − θ)²] / (κ²Δ_n²) ≤ 1/κ²,

where the last two steps are by Chebyshev's inequality and (28.5), respectively. Thus, from Fano's inequality (Theorem 5.3) we have

I(W; Ŵ) ≥ (1 − 1/κ²) log 2 − h(κ^{−2}).

On the other hand, from data processing and the golden formula we have

I(W; Ŵ) ≤ I(θ; X^n) ≤ D(P_{X^n|θ} ‖ Bern(1/2)^n | P_θ)

Computing the last divergence we get

D(P_{X^n|θ} ‖ Bern(1/2)^n | P_θ) = n · d(1/2 − κΔ_n ‖ 1/2) = n(log 2 − h(1/2 − κΔ_n))

As Δ_n → 0 we have

h(1/2 − κΔ_n) = log 2 − 2 log e · (κΔ_n)² + o(Δ_n²).


So altogether, we get that for every fixed κ we have

(1 − 1/κ²) log 2 − h(κ^{−2}) ≤ 2n log e · (κΔ_n)² + o(nΔ_n²).

In particular, by optimizing over κ we get that for some constant c ≈ 0.015 > 0 we have

Δ_n² ≥ c/n + o(1/n).

Together with (28.2), we have

0.015/n + o(1/n) ≤ R*(n) ≤ 1/(4n),

and thus the empirical-mean estimator is rate-optimal.
We mention that for this particular problem (estimating the mean of Bernoulli samples) the minimax risk is known exactly:

R*(n) = 1 / (4(1 + √n)²),   (28.6)

but obtaining this requires different methods.¹ In fact, even showing R*(n) = 1/(4n) + o(1/n) requires careful priors on θ (unlike the simple two-point prior we used above).²

We demonstrated here the essence of the Fano method of proving lower (impossibility) bounds in statistical decision theory. Namely, given an estimation task we select a prior, uniform on finitely many θ's, which on one hand yields a rather small information I(θ; X) and on the other hand has sufficiently separated points which should thus be distinguishable by a good estimator. For more see [Yu97].

A natural (and very useful) generalization is to consider a non-discrete prior P_θ and use the following natural chain of inequalities

f(P_θ, R) ≤ I(θ; θ̂) ≤ I(θ; X^n) ≤ sup_{P_θ} I(θ; X^n),

where

f(P_θ, R) ≜ inf{I(θ; θ̂) : P_{θ̂|θ} s.t. E[ℓ(θ, θ̂)] ≤ R}

is the rate-distortion function. We discuss this method next.

1The easiest way to get this is to apply (28.1). Fortunately, in this case if π is a β-distribution, the computation of the conditional expectation can be performed in closed form, and optimizing the parameters of the β-distribution one recovers a lower bound that, together with (28.4), establishes (28.6). Note that the resulting worst-case π is not uniform, and in fact β → ∞ (i.e. π concentrates in a small region around θ = 1/2).

2It follows from the following Bayesian Cramér-Rao lower bound [GL95]: for any estimator θ̂ and any prior π(θ)dθ with smooth density π we have

E_{θ∼π}[(θ̂(X) − θ)²] ≥ (log e)² / (E[J_F(θ)] + J_F(π)),

where J_F(θ) is as in (28.3) and J_F(π) ≜ (log e)² ∫ (π'(θ))²/π(θ) dθ. Then taking π supported on an n^{−1/4}-neighborhood surrounding a given point θ_0 we get E[J_F(θ)] = n/(θ_0(1 − θ_0)) + o(n) and J_F(π) = o(n), yielding

R*(n) ≥ θ_0(1 − θ_0)/n + o(1/n).

This is a rather general phenomenon: under regularity assumptions, in any iid estimation problem θ → X^n → θ̂ with quadratic loss we have

R*(n) = 1/(inf_θ J_F(θ)) + o(1/n).


28.2 Mutual information method

The main workhorse will be

1. Data processing inequality

2. Rate-distortion theory

3. Capacity and mutual information bound

To illustrate the mutual information method and its execution in various problems, we will discuss three vignettes:

• Denoise a vector;

• Denoise a sparse vector;

• Community detection.

Here’s the main idea of the mutual information method. Fix some prior π and we turn to lowerbound R∗π. The unknown θ is distributed according to π. Let θ be a Bayes optimal estimator thatachieves the Bayes risk R∗π.

The mutual information method consists of applying the data processing inequality to theMarkov chain θ → X → θ:

infPθ|θ:E`(θ,θ)≤R∗π

I(θ; θ) ≤ I(θ, θ)dpi≤ I(θ;X). (28.7)

Note that

• The leftmost quantity can be interpreted as the minimum amount of information required foran estimation task, which is reminiscent of rate-distortion function.

• The rightmost quantity can be interpreted as the amount of information provided by thedata about the parameter. Sometimes it suffices to further upper-bound it by capacity of thechannel θ 7→ X:

I(θ;X) ≤ supπ∈M(Θ)

I(θ;X). (28.8)

• This chain of inequalities is reminiscent of how we prove the converse in joint-source channelcoding (Section 27.3), with the capacity-like upper bound and rate-distortion-like lower bound.

• Only the lower bound is related to the loss function.

• Sometimes we need a smart choice of the prior.

301

Page 302: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

28.2.1 Denoising (Gaussian location model)

The setting is the following: given n noisy observations of a high-dimensional vector θ ∈ Rp,

Xii.i.d.∼ N(θ, Ip), i = 1, . . . , n (28.9)

The loss is simply the quadratic error: `(θ, θ) = ‖θ − θ‖22. Next we show that

R∗ =p

n, ∀p, n. (28.10)

Upper bound. Consider the estimator X = 1n

∑ni=1Xi. Then X ∼ N(θ, 1

nIp) and clearlyE‖X − θ‖22 = p/n.

Lower bound. Consider a Gaussian prior θ ∼ N (0, σ2Ip). Instead of evaluating the exact Bayesrisk (MMSE) which is a simple exercise, let’s implement the mutual information method (28.7).Given any estimator θ. Let D = E‖θ − θ‖22. Then

p

2log

σ2

D/p= inf

Pθ|θ:E‖θ−θ‖22≤DI(θ; θ) ≤ I(θ, θ) ≤ I(θ;X)

suff stat= I(θ; X) =

p

2log

(1 +

σ2

1/n

).

where the left inequality follows from the Gaussian rate-distortion function (27.3) and the single-letterization result (Theorem 26.1) that reduces p dimensions to one dimension. Putting everythingtogether we have

R∗ ≥ R∗π ≥pσ2

1 + nσ2.

Optimizing over σ2 (by sending it to ∞), we have R∗ ≥ p/n.

28.2.2 Denoising sparse vectors

Here the setting is identical to (28.9), expect that we have the prior knowledge that θ is sparse, i.e.,

θ ∈ Θ , all p-dim k-sparse vectors = θ ∈ Rp : ‖θ‖0 ≤ k

where ‖θ‖0 =∑

i∈[p] 1θi 6=0 is the sparisty (number of nonzeros) of θ.The minimax rate of denoising k-sparse vectors is given by the following

R∗ k

nlog

ep

k, ∀k, p, n. (28.11)

Before proceeding to the proof, a quick observation is that we have the oracle lower bound R∗ ≥ kn

follows from (28.10), since if the support of the θ is known which reduces the problem to k dimensions.Thus, the meaning of statement (28.11) is that the lack of knowledge of the support contributes(merely) a log factor.

To show this, again, by passing to sufficient statistics, it suffices to consider the observationX ∼ N(θ, 1

nIp). For simplicity we only consider n = 1 below.Upper bound. (Sketch) The rate is achieved by thresholding the observation X that only keep

the large entries. The intuition is that since the ground truth θ has many zeros, we should kill thesmall entries in X. Since ‖Z‖∞ ≤ (2 + ε)

√log p with high probability, hard thresholding estimator

that sets all entries of X with magnitude ≤ (2 + ε)√

log p achieves a mean-square error of O(k log p),which is rate optimal unless k = Ω(p), in which case we can simply apply the original X as theestimator.

302

Page 303: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Lower bound. In view of the oracle lower bound, it suffices to consider k = O(p). Next weassume k ≤ p/16. Consider a p-dimensional Hamming sphere of radius k, i.e.

B = b ∈ 0, 1p : wH(b) = k,

where wH(b) is the Hamming weights of b. Let b be drawn uniformly from the set B and θ = τb,

where τ =√

k100 log p

k . Thus, we have the following Markov chain which represents our problem

model,b→ θ → X → θ → b.

Note that the channel θ → X is just p uses of the AWGN channel, with power τ2kp , and thus

by Theorem 4.6 and single-letterization (Theorem 5.1) we have

I(θ; θ) ≤ I(θ;X) ≤ p

2log

(1 +

τ2k

p

)≤ sup

θ∈G

log e

2‖θ‖22 = ckτ2 ,

for some c > 0. We note that related techniques have been used in proving lower bound for stablerecovery in noiseless compressed sensing [PW12].

To give a lower bound for I(θ; θ), consider

b = argminb∈B

‖θ − τb‖22.

Since b is the minimizer of ‖θ − τb‖22, we have,

‖τ b− θ‖2 ≤ ‖τ b− θ‖2 + ‖θ − θ‖2 ≤ 2‖θ − θ‖2.

Thus,τ2dH(b, b) = ‖τ b− θ‖22 ≤ 4‖θ − θ‖22,

where dH denotes the Hamming distance between b and b. Suppose that E‖θ − θ‖22 = ετ2k. Thenwe have EdH(b, b) ≤ 4εk. Our goal is to show that ε is at least a small constant by the mutualinformation method. First,

I(b; b) ≥ minEdH(b,b)≤4εk

I(b; b).

Before we bound the RHS, let’s first guess its behavior. Note that it is the rate-distortion functionof the random vector b, which is uniform over B, the Hamming sphere of radius k, and each entry isBern(k/p). Had the entries been iid, then rate-distortion theory ((27.1) and Theorem 26.1) wouldyield that the RHS is simply p(h(k/p) − h(4εk/p)). Next, following the proof of (27.1), we showthat this behavior is indeed correct:

minEdH(b,b)≤4εk

I(b; b) = H(b)− maxEdH(b,b)≤4εk

H(b|b)

= log

(p

k

)− max

EdH(b,b)≤4εkH(b⊕ b|b)

≥ log

(p

k

)− max

EwH(W )≤4εkH(W ).

The maximum-entropy problem is easy to solve:

maxEwH(W )=m,W∈0,1p

H(W ) = ph

(m

p

). (28.12)

303

Page 304: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

The solution isW = Bern(m/p)⊗p. One way to get this is to writeH(W ) = p log 2−D(PW ‖Bern(1/2)⊗p)and apply Theorem 14.3 with X = wH(W ), to get that optimal PW (w) ∼ expcwH(w). In theend we get Combine this with the previous bound, we get

I(b; b) ≥ log

(p

k

)− ph(

4εk

p).

On the other hand, we have

I(b; b) ≤ I(θ;Y ) ≤ cτ2 = c′k logp

k.

Note that h(α) −α logα for α < 14 . WLOG, since k ≤ p

16 , we have ε ≥ c0 for some universalconstant c0. Therefore

R∗ ≥ ετ2k & k logp

k.

Combining with the result in the oracle lower bound, we have the desired.

R∗ & k + k logp

k

or for general n ≥ 1

R∗ &k

nlog

ep

k.

Remark 28.1. Let R∗k,p = R∗. For the case k = o(p), the sharp asymptotics is

R∗k,p ≥ (2 + op(1))k logp

k.

To prove this result, we need to first show that for the case k = 1,

R∗1,p ≥ (2 + op(1)) log p.

Next, show that for any k, the minimax risk is lower bounded by the Bayesian risk with the blockprior. The block prior is that we divide the p-coordinate into k blocks, and pick one coordinatefrom each p/k-coordinate uniformly. With this prior, one can show

R∗k,p ≥ kR∗1,p/k = (2 + op(1))k logp

k.

28.2.3 Community detection

We only consider the problem of a single hidden community. Given a graph of n vertices, a communityis a subset of vertices where the edges tend to be denser than everywhere else. Specifically, weconsider the planted dense subgraph model (i.e., the stochastic block model with a single community).Let the community C be uniformly drawn from all subsets of [n] of cardinality k. The graph isgenerated by independently connecting each pair of vertices, with probability p if both belong to thecommunity C∗, and with probability q otherwise. Equivalently, in terms of the adjacency matrix A,Aij ∼ Bern(p) if i, j ∈ C and Bern(q) otherwise. Assume p > q. Thus the subgraph induced by C∗

is likely to be denser than the rest of the graph. We are interested in the large-graph asymptotics,where both the network size n and the community size k grow to infinity.

Given the adjacency matrix A, the goal is to recover the hidden community C almost perfectly,i.e., achieving

E[|C4C|] = o(k) (28.13)

304

Page 305: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Given the network size n and the community size k, there exists a sharp condition on the edgedensity (p, q) that says the community needs to be sufficient denser than the outside. It turnsout this is precisely described by the binary divergence d(p‖q). Under the assumption that p/q isbounded, e.g., p = 2q, the information-theoretic necessary condition is

k · d(p‖q)→∞ and lim infn→∞

kd(p‖q)log n

k

≥ 2. (28.14)

This condition is tight in the sense that if in the above “≥” is replaced by “>”, then there exists anestimator (e.g., the maximal likelihood estimator) that achieves (28.13).

Next we only prove the necessity of the second condition in (28.14), again using the mutualinformation method. Let ξ and ξ be the indicator vector of the community C and the estimatorC, respectively. Thus ξ = (ξ1, . . . , ξn) is uniformly drawn from the set x ∈ 0, 1n : wH(x) = k.Therefore ξi’s are individually Bern(k/n). Let E[dH(ξ, ξ)] = εnk, where εn → 0 by assumption.Consider the following chain of inequalities, which lower bounds the amount of information requiredfor a distortion level εn:

I(A; ξ)dpi≥ I(ξ; ξ) ≥ min

E[d(ξ,ξ)]≤εnkI(ξ; ξ) ≥ H(ξ)− max

E[d(ξ,ξ)]≤εnkH(ξ ⊕ ξ)

(28.12)= log

(n

k

)− nh

(εnk

n

)≥ k log

n

k(1 + o(1)),

where the last step follows from the bound(nk

)≥(nk

)k, the assumption k/n is bounded away from

one, and the bound h(p) ≤ −p log p+ p for p ∈ [0, 1].On the other hand, to bound the mutual information, we use the golden formula Corollary 3.1

and choose a simple reference Q:

I(A; ξ) = minQ

D(PA|ξ‖Q|Pξ)

≤ D(PA|ξ‖Bern(q)⊗(n2)|Pξ)

=

(k

2

)d(p‖q).

Combining the last two displays yields lim infn→∞(k−1)D(P‖Q)

log(n/k) ≥ 2.

305

Page 306: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

§ 29. Multiple-access channel

29.1 Problem motivation and main results

Note: In network community, people are mostly interested in channel access control mechanismsthat help to detect or avoid data packet collisions so that the channel is shared among multipleusers.

The famous ALOHA protocal achieves∑

i

Ri ≈ 0.37C

where C is the (single-user) capacity of the channel.1

In information theory community, the goal is to achieve∑

i

Ri > C

The key to achieve this is to use coding so that collisions are resolvable.In the following discussion we shall focus on the case with two users. This is without loss of

much generality, as all the results can easily be extended to N users.

Definition 29.1.

• Multiple-access channel: PY n|An,Bn : An × Bn → Yn, n = 1, 2, . . . .

• a (n,M1,M2, ε) code is specified by

f1 : [M1]→ An, f2 : [M2]→ Bng : Yn → [M1]× [M2]

1Note that there is a lot of research on how to achieve just these 37%. Indeed, ALOHA in a nutshell simplypostulates that every time a user has a packet to transmit, he should attempt transmission in each time slot withprobability p, independently. The optimal setting of p is the inverse of the number of actively trying users. Thus, it isnon-trivial how to learn the dynamically changing number of active users without requiring a central authority. Thisis how ideas such as exponential backoff etc arise.

306

Page 307: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

W1,W2 ∼ uniform, and the codes achieves

P[W1 6= W1⋃W2 6= W2] ≤ ε

• Fundamental limit of capacity region

R∗(n, ε) = (R1, R2) : ∃ a (n, 2nR1 , 2nR2 , ε)-code

• Asymptotics:

Cε = cl(

lim infn→∞

R∗(n, ε))

where cl denotes the closure of a set.

Note: lim inf and lim sup of a sequence of sets An:

lim infn

An = ω : ω ∈ An, ∀n ≥ n0lim sup

nAn = ω : ω infinitely occur

•C = lim Cε =

ε>0

Theorem 29.1 (Capacity region).

Cε =co⋃

PA,PB

Penta(PA, PB) (29.1)

=[ ⋃

PU,A,B=PUPA|UPB|U

Penta(PA|U , PB|U |PU )]

(29.2)

where co is the set operator of constructing the convex hull followed by taking the closure, andPenta(·, ·) is defined as follows:

Penta(PA, PB) =

(R1, R2) :

0 ≤ R1 ≤ I(A;Y |B)0 ≤ R2 ≤ I(B;Y |A)R1 +R2 ≤ I(A,B;Y )

Penta(PA|U , PB|U |PU ) =

(R1, R2) :

0 ≤ R1 ≤ I(A;Y |B,U)0 ≤ R2 ≤ I(B;Y |A,U)R1 +R2 ≤ I(A,B;Y |U)

307

Page 308: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Note: the two forms in (29.1) and (29.2) are equivalent without cost constraints. In the case whenconstraints such as Ec1(A) ≤ P1 and Ec2(B) ≤ P2 are present, only the second expression yieldsthe true capacity region.

29.2 MAC achievability bound

First, we introduce a lemma which will be used in the proof of Theorem 29.1.

Lemma 29.1. ∀PA, PB, PY |A,B such that PA,B,Y = PAPBPY |A,B, and ∀γ1, γ2, γ12 > 0, ∀M1,M2,there exists a (M1,M2, ε) MAC code such that:

ε ≤P[i12(A,B;Y ) ≤ log γ12

⋃i1(A;Y |B) ≤ log γ1

⋃i2(B;Y |A) ≤ log γ2

]

+ (M1 − 1)(M2 − 1)e−γ12 + (M1 − 1)e−γ1 + (M2 − 1)e−γ2 (29.3)

Proof. We again use the idea of random coding.Generate the codebooks

c1, . . . , cM1 ∈ A, d1, . . . , dM2 ∈ B

where the codes are drawn i.i.d from distributions: c1, . . . , cM1 ∼ i.i.d. PA, d1, . . . , dM2 ∼ i.i.d. PB .The decoder operates in the following way: report (m,m′) if it is the unique pair that satisfies:

(P12) i12(cm, dm′ ; y) > log γ12

(P1) i1(cm; y|dm′) > log γ1

(P2) i2(dm′ ; y|cm) > log γ2

Evaluate the expected error probability:

EPe(cM11 , dM2

1 ) = P[(W1,W2) violate (P12) or (P1) or (P2)⋃∃ impostor (W ′1,W

′2) that satisfy (P12) and (P1) and (P2)

]

by symmetry of random codes, we have

Pe = E[Pe|W1 = m,W2 = m′] = P[(m,m′) violate (P12) or (P1) or (P2)⋃∃ impostor (i 6= m, j 6= m′) that satisfy (P12) and (P1) and (P2)

]

308

Page 309: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

⇒ Pe ≤ P[i12(A,B;Y ) ≤ log γ12

⋃i1(A;Y |B) ≤ log γ1

⋃i2(B;Y |A) ≤ log γ2

]+ P[E12] + P[E1] + P[E2]

where

P[E12] = P[∃(i 6= m, j 6= m′) s.t. i12(cm, dm′ ; y) > log γ12]≤ (M1 − 1)(M2 − 1)P[i12(A,B;Y ) > log γ12]

= E[e−i12(A,B;Y )1i12(A,B;Y ) > log γ12]≤ e−γ12

P[E2] = P[∃(j 6= m′) s.t. i2(dj ; y|ci) > log γ2]≤ (M2 − 1)P[i2(B;Y |A) > log γ2]

= EA[e−i2(B;Y |A)1i2(B;Y |A) > log γ2|A]

≤ EA[e−γ2 |A] = e−γ2

similarly P[E1] ≤ e−γ1

Note: [Intuition] Consider the decoding step when a random codebook is used. We observe Yand need to solve an M -ary hypothesis testing problem: Which of PY |A=cm,B=dm′m,m′∈[M1]×[M2]

produced the sample Y ?Recall that in P2P channel coding, we had a similar problem and the M-ary hypothesis testing

problem was converted to M binary testing problems:

PY |X=cj vs PY−j ,∑

i 6=j

1

M − 1PY |X=ci ≈ PY

i.e. distinguish cj (hypothesis H0) against the average distribution induced by all other codewords(hypothesis H1), which for a random coding ensemble cj ∼ PX is very well approximated byPY = PY |X PX . The optimal test for this problem is roughly

PY |X=c

PY& log(M − 1) =⇒ decide PY |X=cj (29.4)

since the prior for H0 is 1M , while the prior for H1 is M−1

M .The proof above followed the same idea except that this time because of the two-dimensional

grid structure:

309

Page 310: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

there are in fact binary HT of three kinds

(P12) ∼ test PY |A=cm,B=dm′ vs1

(M1 − 1)(M2 − 1)

i 6=m

j 6=m′PY |A=ci,B=dj ≈ PY

(P1) ∼ test PY |A=cm,B=dm′ vs1

M1 − 1

i 6=mPY |A=ci,B=dm′ ≈ PY |B=dm′

(P2) ∼ test PY |A=cm,B=dm′ vs1

M2 − 1

j 6=m′PY |A=cm,B=dj ≈ PY |A=cm

And analogously to (29.4) the optimal tests are given by comparing the respective informationdensities with logM1M2, logM1 and logM2.

Another observation following from the proof is that the following decoder would also achieveexactly the same performance:

• Step 1: rule out all cells (i, j) with i12(ci, dj ;Y ) . logM1M2.

• Step 2: If the messages remaining are NOT all in one row or one column, then FAIL.

• Step 3a: If the messages remaining are all in one column m′ then declare W2 = m′. Rule outall entries in that column with i1(ci;Y |dm′) . logM1. If more than one entry remains, FAIL.Otherwise declare the unique remaining entry m as W1 = m.

• Step 3b: Similarly with column replaced by row, i1 with i2 and logM1 with logM2.

The importance of this observation is that in the regime when RHS of (29.3) is small, thedecoder always finds it possible to basically decode one message, “subtract” its influence and thendecode the other message. Which of the possibilities 3a/3b appears more often depends on theoperating point (R1, R2) inside C.

29.3 MAC capacity region proof

Proof. 1. Show C is convex.

Take (R1, R2) ∈ Cε/2, and take (R′1, R′2) ∈ Cε/2.

We merge the (n, 2nR1 , 2nR2 , ε/2) code and the (n, 2nR1 , 2nR2 , ε/2) code in the following time-sharing way: in the first n channels, use the first set of codes, and in the last n channels, usethe second set of codes.Thus we formed a new (2n, 2R1+R′1 , 2R2+R′2 , ε) code, we know that

1

2Cε/2 +

1

2Cε/2 ⊂ Cε

take limit at both sides1

2C +

1

2C ⊂ C

also we know that C ⊂ 12C + 1

2C, therefore C = 12C + 1

2C is convex.

Note: the set addition is defined in the following way:

A+ B , (a+ b) : a ∈ A, b ∈ B

310

Page 311: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

2. Achievability

STP: ∀PA, PB, ∀(R1, R2) ∈ Penta(PA, PB), ∃(n, 2nR1 , 2nR2 , ε)code.

Apply Lemma 29.1 with:

PA → PnA, PB → PnB, PY |A,B → PnY |A,B

M1 = 2nR1 , M2 = 2nR2 ,

log γ12 = n(I(A,B;Y )− δ), log γ1 = n(I(A;Y |B)− δ), log γ2 = n(I(B;Y |A)− δ).

we have that there exists a (M1,M2, ε) code with

ε ≤P[ 1

n

n∑

k=1

i12(Ak, Bk;Yk) ≤ log γ12 − δ⋃ 1

n

n∑

k=1

i1(Ak;Yk|Bk) ≤ log γ1 − δ

⋃ 1

n

n∑

k=1

i2(Bk;Yk|Ak) ≤ log γ2 − δ]

︸ ︷︷ ︸1©

+ (2nR1 − 1)(2nR2 − 1)e−γ12 + (2nR1 − 1)e−γ1 + (2nR2 − 1)e−γ2

︸ ︷︷ ︸2©

by WLLN, the first part goes to zero, and for any (R1, R2) such that R1 < I(A;Y |B) − δand R2 < I(B;Y |A)− δ and R1 +R2 < I(A,B;Y )− δ, the second part goes to zero as well.Therefore, if (R1, R2) ∈ interior of the pentagon, there exists a (M1,M2, ε = o(1)) code.

3. Weak converse

Q[W1 = W1,W2 = W2] =1

M1M2, P[W1 = W1,W2 = W2] ≥ 1− ε

d-proc:

d(1− ε‖ 1

M1M2) ≤ inf

Q∈(∗)D(P‖Q) = I(An, Bn;Y n)

⇒R1 +R2 ≤1

nI(An, Bn;Y n) + o(1)

311

Page 312: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

To get separate bounds, we apply the same trick to evaluate the information flow from thelink between A→ Y and B → Y separately:

Q1[W2 = W2] =1

M2, P[W2 = W2] ≥ 1− ε

d-proc:

d(1− ε‖ 1

M2) ≤ inf

Q1∈(∗1)D(P‖Q1) = I(Bn;Y n|An)

⇒R2 ≤1

nI(Bn;Y n|An) + o(1)

similarly we can show that

R2 ≤1

nI(An;Y n|Bn) + o(1)

For memoryless channels, we know that 1nI(An, Bn;Y n) ≤ 1

n

∑k I(Ak, Bk;Yk). Similarly,

since given Bn the channel An → Y n is still memoryless we have

I(An;Y n|Bn) ≤n∑

k=1

I(Ak;Yk|Bn) =n∑

k=1

I(Ak;Yk|Bk)

Notice that each (Ai, Bi) pair corresponds to (PAk , PBk), ∀k define

Pentak(PAk , PBk) =

(R1,k, R2,k) :

0 ≤ R1,k ≤ I(Ak;Yk|Bk)0 ≤ R2,k ≤ I(Bk;Yk|Ak)

R1,k +R2,k ≤ I(Ak, Bk;Yk)

therefore

(R1, R2) ∈[ 1

n

k

Pentak

]

⇒C ∈ co⋃

PA,PB

Penta

312

Page 313: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

§ 30. Examples of MACs. Maximal Pe and zero-error capacity.

30.1 Recap

Last time we defined the multiple access channel as the sequence of random transformations

PY n|AnBn : An × Bn → Yn, n = 1, 2, . . .

Furthermore, we showed that its capacity region is

C = (R1, R2) : ∃(n, 2nR1 , 2nR2 , ε)−MAC code = co⋃

PAPB

Penta(PA, PB)

were co denotes the convex hull of the sets Penta, and Penta is

Penta(PA, PB) =

R1 ≤ I(A;Y |B)

R2 ≤ I(B;Y |A)

R1 +R2 ≤ I(A,B;Y )

So a general MAC and one Penta region looks like

An

Bn

PY n|AnBn Y n

R1

R2

I(A;Y |B)

I(B;Y |A)I(A,B;Y )

Note that the union of Pentas need not look like a Penta region itself, as we will see in a laterexample.

30.2 Orthogonal MAC

The trivial MAC is when each input sees its own independent channel: PY |AB = PY |APY |B wherethe receiver sees (YA, YB). In this situation, we expect that each transmitter can achieve it’s owncapacity, and no more than that. Indeed, our theorem above shows exactly this:

Penta(PA, PB) =

R1 ≤ I(A;Y |B) = I(A;Y )

R2 ≤ I(B;Y |A) = I(B;Y )

R1 +R2 ≤ I(A,B;Y )

Where in this case the last constraint is not applicable; it does not restrict the capacity region.

313

Page 314: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

A

B

YA

R1

R2

C2PY |A

PY |BC1

YB

Hence our capacity region is a rectangle bounded by the individual capacities of each channel.

30.3 BSC MAC

Before introducing this channel, we need a definition and a theorem:

Definition 30.1 (Sum Capacity). Csum , maxR1 +R2 : (R1, R2) ∈ CTheorem 30.1. Csum = maxA⊥⊥B I(A,B;Y )

Proof. Since the max above is achieved by an extreme point on one of the Penta regions, we candrop the convex closure operation to get

maxR1 +R2 : (R1, R2) ∈ co⋃

Penta(PA, PB) = maxR1 +R2 : (R1, R2) ∈⋃

Penta(PA, PB)maxPA,PB

R1 +R2 : (R1, R2) ∈ Penta(PA, PB) ≤ maxPA,PB

I(A,B;Y )

Where the last step follows from the definition of Penta. Now we need to show that the constrainton R1 + R2 in Penta is active at at least one point, so we need to show that I(A,B;Y ) ≤I(A;Y |B) + I(B;Y |A) when A ⊥⊥ B, which follows from applying Kolmogorov identities

I(A;Y,B) = 0 + I(A;Y |B) = I(A;Y ) + I(A;B|Y ) =⇒ I(A;Y ) ≤ I(A;Y |B)

=⇒ I(A,B;Y ) = I(A;Y ) + I(B;Y |A) ≤ I(A;Y |B) + I(B;Y |A)

Hence maxPA,PBR1 +R2 : (R1, R2) ∈ Penta(PA, PB) = maxPAPB I(A,B;Y )

We now look at the BSC MAC, defined by

Y = A+B + Z mod 2

Z ∼ Ber(δ)

A,B ∈ 0, 1

Since the output Y can only be 0 or 1, the capacity of this channel can be no larger than 1 bit. IfB doesn’t transmit at all, then A can achieve capacity 1− h(δ) (and B can achieve capacity whenA doesn’t transmit), so that R1, R2 ≤ 1− h(δ). By time sharing we can obtain any point betweenthese two. This gives an inner bound on the capacity region. For an outer bound, we use Theorem30.1, which gives

Csum = maxPAPB

I(A,B;Y ) = maxPAPB

I(A,B;A+B + Z)

= maxPAPB

H(A+B + Z)−H(Z) = 1− h(δ)

Hence R1 +R2 ≤ 1− h(δ), so by this outer bound, we can do no better than time sharing betweenthe two individual channel capacity points.

314

Page 315: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

A

BR1

R2

1− h(δ)

1− h(δ)

Y

Remark: Even though this channel seems so simple, there are still hidden things about it, whichwe’ll see later.

30.4 Adder MAC

Now we analyze the Adder MAC, which is a noiseless channel defined by:

Y = A+B (over Z)

A,B ∈ 0, 1Intuitively, the game here is that when both A and B send either 0 or 1, we receiver 0 or 2 and candecode perfectly. However, when A sends 0 and B send 1, the situation is ambiguous. To analyzethis channel, we start with an interesting fact

Interesting Fact 1: Any deterministic MAC (Y = f(A,B)) has Csum = maxH(Y ). To seethis, just expand I(A,B;Y ).

Therefore, the sum capacity of this MAC is

Csum = maxA⊥⊥B

H(A+B) = H

(1

4,1

2,1

4

)=

3

2bits

Which is achieved when both A and B are Ber(1/2). With this, our capacity region is

Penta(Ber(1/2),Ber(1/2)) =

R1 ≤ I(A;Y |B) = H(A) = 1

R2 ≤ I(B;Y |A) = H(B) = 1

R1 +R2 ≤ I(A,B;Y ) = 3/2

So the channel can be described by

A

BR1

R2

1

1

YR1 +R2 ≤ 3/2

Now we can ask: how do we achieve the corner points of the region, e.g. R1 = 1/2 and R2 = 1?The answer gives insights into how to code for this channel. Take the greedy codebook B = 0, 1n(the entire space), then the channel A→ Y is a DMC:

0

1

0

1

2

12

12

12

12

315

Page 316: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Which we recognize as a BEC(1/2) (no preference to either −1 or 1), which has capacity 1/2. Howdo we decode? The idea is successive cancellation, where first we decode A, then remove A from Y ,then decode B.

An

Bn

Y n DecBEC(1/2)

An

Bn

Using this strategy, we can use a single user code for the BEC (an object we understand well) toattain capacity.

30.5 Multiplier MAC

The Multiplier MAC is defined as

Y = AB

A ∈ 0, 1, B ∈ −1, 1

Note that A = |Y | can always be resolved, and B can be resolved whenever A = 1. To find thecapacity region of this channel, we’ll use another interesting fact:

Interesting Fact 2: If A = g(Y ), then each Penta(PA, PB) is a rectangle with

Penta(PA, PB) =

R1 ≤ H(A)

R2 ≤ I(A,B;Y )−H(A)

Proof. Using the assumption that A = g(Y ) and expanding the mutual information

I(A;Y |B) + I(B;Y |A) = H(A)−H(Y |A)−H(Y |A,B) = H(A, Y )−H(Y |A,B)

= H(Y )−H(Y |A,B) = I(A,B;Y )

Therefore the R1 +R2 constraint is not active, so our region is a rectangle.

By symmetry, we take PB = Ber(1/2). When PA = Ber(p), the output has H(Y ) = p+ h(p).Using the above fact, the capacity region for the Multiplier MAC is

C = co⋃

R1 ≤ H(A) = h(p)

R2 ≤ H(Y )−H(A) = p

We can view this as the graph of the binary entropy function on its side, parametrized by p:

1/2

1

1

R1

R2

To achieve the extreme point (1, 1/2) of this region, we can use the same scheme as for the AdderMAC: take the codebook of A to be 0, 1n, then B sees a BEC(1/2). Again, successive cancellationdecoding can be used.

For future reference we note:

316

Page 317: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Lemma 30.1. The full capacity region of multiplier MAC is achieved with zero error.

Proof. For a given codebook D of user B the number of messages that user A can send equals thetotal number of erasure patters that codebook D can tolerate with vanishing probability of error.Fix rate R2 < 1 and let D be a row-span of a random linear nR2×n binary matrix. Then randomlyerase each column with probability 1−R2 − ε. Since on average there will be n(R2 + ε) columnsleft, the resulting matrix is still full-rank and the decoding is possible. In other words,

P[D is decodable,# of erasures ≈ n(1−R2 − ε)]→ 1 .

Hence, by counting the total number of erasures, for a random linear code we have

E[# of decodable erasure patterns for D] ≈ 2nh(1−R2−ε)+o(n) .

And result follows by selecting a random element of the D-ensemble and then taking the codebookof user A to be the set of decodable erasure patterns for a selected D.

30.6 Contraction MAC

The Contraction MAC is defined as

0, 1, 2, 3 ∋ A

−,+ ∋ B

Erasure Yb

b

b

b b

b

bb

b

b

bb

b

b0123

0123

e1

23

1

2

e2

B = −B = +

Here, B is received perfectly, We can use the fact above to see that the capacity region is

C =

R1 ≤ 3

2

R2 ≤ 1

For future reference we note the following:

Lemma 30.2. The zero-error capacity of the contraction MAC satisfies

R1 ≤ h(1/3) + (2/3− p) log 2 , (30.1)

R2 ≤ h(p) (30.2)

for some p ∈ [0, 1/2]. In particular, the point R1 = 32 , R2 = 1 is not achievable with zero error.

Proof. Let C and D denote the zero-error codebooks of two users. Then for each string bn ∈ +,−ndenote

Ubn = an : aj ∈ 0, 1 if bj = +, aj ∈ 2, 3 if bj = − .Then clearly for each bn we have

|Ubn | ≤ 2d(bn,D) ,

where d(bn, D) denotes the minimum Hamming distance from string bn to the set D. Then,

|C| ≤∑

bn

2d(bn,D) (30.3)

=n∑

j=0

2j |bn : d(bn, D) = j| (30.4)

317

Page 318: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

For a given cardinality |D| the set that maximizes the above sum is the Hamming ball. Hence,R2 = h(p) +O(1) implies

R2 ≤ maxq∈[p,1]

h(q) + (q − p) log 2 = h(1/3) + (2/3− p) log 2 .

30.7 Gaussian MAC

Perhaps the most important MAC is the Gaussian MAC. This is defined as

Y = A+B + Z

Z ∼ N (0, 1)

E[A2] ≤ P1, E[B2] ≤ P2

Evaluating the mutual information, we see that the capacity region is

I(A;Y |B) = I(A;A+ Z) ≤ 1

2log(1 + P1)

I(B;Y |A) = I(B;B + Z) ≤ 1

2≤ (1 + P2)

I(A,B;Y ) = h(Y )− h(Z) ≤ 1

2log(1 + P1 + P2)

A

B

Y

Z

R1

R2

12 log(1 + P1)

12 log(1 + P2)

12 log(1 + P1 + P2)

Where the region is Penta(N (0, P1),N (0, P2)). How do we achieve the rates in this region? We’lllook at a few schemes.

1. TDMA: A and B switch off between transmitting at full rate and not transmitting at all. Thisachieves any rate pair in the form

R1 = λ1

2log(1 + P1), R2 = λ

1

2log(1 + P2)

Which is the dotted line on the plot above. Clearly, there are much better rates to be gainedby smarter schemes.

2. FDMA (OFDM): Dividing users into different frequency bands rather than time windowsgives an enormous advantage. Using frequency division, we can attain rates

R1 = λ1

2log

(1 +

P1

λ

), R2 = λ

1

2log

(1 +

P2

λ

)

318

Page 319: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

In fact, these rates touch the boundary of the capacity region at its intersection with theR1 = R2 line. The optimal rate occurs when the power at each transmitter makes the noiselook white:

P1

λ=P2

λ=⇒ λ∗ =

P1

P1 + P2

While this touches the capacity region at one point, it doesn’t quite reach the corner points.Note, however, that practical systems (e.g. cellular networks) typically employ power controlthat ensures received powers Pi of all users are roughly equal. In this case (i.e. when P1 = P2)the point where FDMA touches the capacity boundary is at a very desirable location ofsymmetric rate R1 = R2. This is one of the reasons why modern standards (e.g. LTE 4G) donot employ any specialized MAC-codes and use OFDM together with good single-user codes.

3. Rate Splitting/Successive Cancellation: To reach the corner points, we can use successivecancellation, similar to the decoding schemes in the Adder and Multiplier MACs. We can userates:

R2 =1

2log(1 + P2)

R1 =1

2(log(1 + P1 + P2)− log(1 + P2)) =

1

2log

(1 +

P1

1 + P2

)

The second expression suggests that A transmits at a rate for an AWGN channel that haspower constraint P1 and noise 1 + P2, i.e. the power used by B looks like noise to A.

A

B

Y

Z

Dec A A

BDec B

Theorem 30.2. There exists a successive-cancellation code (i.e. (E1, E2, D1, D2)) that achievesthe corner points of the Gaussian MAC capacity region.

Proof. Random coding: Bn ∼ N (0, P2)n. Since An now sees noise 1 + P2, there exists a codefor A with rate R1 = 1

2 log(1 + P1/(1 + P2)).

This scheme (unlike the above two) can tolerate frame un-synchronization between the twotransmitters. This is because any chunk of length n has distribution N (0, P2)n. It hasgeneralizations to non-corner points and to arbitrary number of users. See [RU96] for details.

30.8 MAC Peculiarities

Now that we’ve seen some nice properties and examples of MACs, we’ll look at cases where MACsdiffer from the point to point channels we’ve seen so far.

1. Max probability of error 6= average probability of error.

Theorem 30.3. C(max) 6= C

319

Page 320: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Proof. The key observation for deterministic MAC is that C(max) = C0 (zero error capacity)when ε ≤ 1/2. This is because when any two strings can be confused, the maximum probabilityof error

maxm,m′

P[W1 6= m ∪ W2 6= m′|W1 = m,W2 = m′]

Must be larger than 1/2.

For some of the channels we’ve seen

• Contraction MAC: C0 6= C

• Multiplier MAC: C0 = C

• Adder MAC: C0 6= C. For this channel, no one yet can show that C0,sum < 3/2. Theidea is combinatorial in nature: produce two sets (Sidon sets) such that all pairwise sumsbetween the two do not overlap.

2. Separation does not hold: In the point to point channel, through joint source channel codingwe saw that an optimal architecture is to do source coding then channel coding separately.This doesn’t hold for the MAC. Take as a simple example the Adder MAC with a correlatedsource and bandwidth expansion factor ρ = 1. Let the source (S, T ) have joint distribution

PST =

[1/3 1/30 1/3

]

We encode Sn to channel input An and Tn to channel input Bn. The simplest possible schemeis to not encoder at all; simply take Sj = Aj and Tj = Bj . Take the decoder

S T

Yj = 0 =⇒ 0 0

Yj = 1 =⇒ 0 1

Yj = 2 =⇒ 1 1

Which gives P[Sn = Sn, Tn = Tn] = 1, since we are able to take advantage of the zero entryin joint distribution of our correlated source.

Can we achieve this with a separated source? Amazingly, even though the above scheme is sosimple, we can’t! The compressors in the separated architecture operate in the Slepian-Wolfregion (see Theorem 9.7)

R1 ≥ H(S|T )

R2 ≥ H(T |S)

R1 +R2 ≥ H(S, T ) = log 3

Hence the sum rate for compression must be ≥ log 3, while the sum rate for the Adder MACmust be ≤ 3/2, so these two regions do not overlap, hence we can not operate at a bandwidthexpansion factor of 1 for this source and channel.

320

Page 321: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Slepian Wolf

1

1

1/2

1/2

3/2

log 3

R1

R2

Adder MAC

3. Linear codes beat generic ones: Consider a BSC-MAC and suppose that two users A andB have independet k-bit messages W1,W2 ∈ Fk2. Suppose the receiver is only interested inestimating W1 +W2. What is the largest ratio k/n? Clearly, separation can achieve

k/n ≈ 1

2(log 2− h(δ))

by simply creating a scheme in which both W1 and W2 are estimated and then their sum iscomputed.

A more clever solution is however to encode

An = G ·W1 ,

Bn = G ·W2 ,

Y n = An +Bn + Zn = G(W1 +W2) + Zn .

where G is a generating matrix of a good k-to-n linear code. Then, provided that

k < n(log 2− h(δ)) + o(n)

the sum W1 +W2 is decodable (see Theorem 18.2). Hence even for a simple BSC-MAC thereexist clever ways to exceed MAC capacity for certain scenarios. Note that this “distributedcomputation” can also be viewed as lossy source coding with a distortion metric that is onlysensitive to discrepancy between W1 +W2 and W1 + W2.

4. Dispersion is unknown: We have seen that for the point-to-point channel, not only we knowthe capacity, but the next-order terms (see Theorem 22.2). For the MAC-channel only thecapacity is known. In fact, let us define

R∗sum(n, ε) , supR1 +R2 : (R1, R2) ∈ R∗(n, ε) .

Now, take Adder-MAC as an example. A simple exercise in random-coding with PA = PB =Ber(1/2) shows

R∗sum(n, ε) ≥ 3

2log 2−

√1

4nQ−1(ε) log 2 +O(

log n

n) .

In the converse direction the situation is rather sad. In fact the best bound we have is onlyslightly better than the Fano’s inequality [Ahl82]. Namely for each ε > 0 there is a constantKε > 0 such that

R∗sum(n, ε) ≤ 3

2log 2 +Kε

log n√n.

321

Page 322: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

So it is not even known if sum-rate approaches sum-capacity from above or from below asn→∞! What is even more surprising, is that the dependence of the residual term on ε is notclear at all. In fact, despite the decades of attempts, even for ε = 0 the best known bound todate is just the Fano’s inequality(!)

R∗sum(n, 0) ≤ 3

2.

322

Page 323: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

§ 31. Random number generators

Let’s play the following game: Given a stream of Bern(p) bits, with unknown p, we want to turnthem into pure random bits, i.e., independent fair coin flips Bern(1/2). Our goal is to find a universalway to extract the most number of bits.

In 1951 von Neumann [vN51] proposed the following scheme: Divide the stream into pairs ofbits, output 0 if 10, output 1 if 01, otherwise do nothing and move to the next pair. Since both 01and 10 occur with probability pq (where q = 1− p throughout this lecture), regardless of the valueof p, we obtain fair coin flips at the output. To measure the efficiency of von Neumann’s scheme,note that, on average, we have 2n bits in and 2pqn bits out. So the efficiency (rate) is pq. Thequestion is: Can we do better?

Several variations:

1. Universal v.s. non-universal: know the source distribution or not.

2. Exact v.s. approximately fair coin flips: in terms of total variation or Kullback-Leiblerdivergence

We only focus on the universal generation of exactly fair coins.

31.1 Setup

Recall from Lecture 8 that 0, 1∗ = ∪k≥00, 1k = ∅, 0, 1, 00, 01, . . . denotes the set of all finite-length binary strings, where ∅ denotes the empty string. For any x ∈ 0, 1∗, let l(x) denote thelength of x.

Let’s first introduce the definition of random number generator formally. If the input vector isX, denote the output (variable-length) vector by Y ∈ 0, 1∗. Then the desired property of Y isthe following: Conditioned on the length of Y being k, Y is uniformly distributed on 0, 1k.Definition 31.1 (Extractor). We say Ψ : 0, 1∗ → 0, 1∗ is an extractor if

1. Ψ(x) is a prefix of Ψ(y) if x is a prefix of y.

2. For any n and any p ∈ (0, 1), if Xni.i.d.∼ Bern(p), then Ψ(Xn) ∼ Bern(1/2)k conditioned onl(Ψ(Xn)) = k.

The rate of Ψ is

rΨ(p) = lim supn→∞

E[l(Ψ(Xn))]

n, Xni.i.d.∼ Bern(p).

Note that the von Neumann scheme above defines a valid extractor ΨvN (with ΨvN(x2n+1) =ΨvN(x2n)), whose rate is rvN(p) = pq. Clearly this is wasteful, because even if the input bits arealready fair, we only get 25% in return.

323

Page 324: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

31.2 Converse

No extractor has a rate higher than the binary entropy function. The proof is simply data processinginequality for entropy and the converse holds even if the extractor is allowed to be non-universal(depending on p).

Theorem 31.1. For any extractor Ψ and any p ∈ (0, 1),

rΨ(p) ≥ h(p) = p log2

1

p+ q log2

1

q.

Proof. Let L = Ψ(Xn). Then

nh(p) = H(Xn) ≥ H(Ψ(Xn)) = H(Ψ(Xn)|L) +H(L) ≥ H(Ψ(Xn)|L) = E [L] bits,

where the step follows from the assumption on Ψ that Ψ(Xn) is uniform over 0, 1k conditionedon L = k.

The rate of von Neumann extractor and the entropy bound are plotted below. Next we presenttwo extractors, due to Elias [Eli72] and Peres [Per92] respectively, that attain the binary entropyfunction. (More precisely, both construct a sequence of extractors whose rate approaches the entropybound).

0 12

1p

1 bit

rate

rvN

31.3 Elias’ construction of RNG from lossless compressors

Main idea The intuition behind Elias’ scheme is the following:

1. For iid Xn, the probability of each string only depends on its type, i.e., the number of 1’s.Therefore conditioned on the number of 1’s, Xn is uniformly distributed (over the type class).This observation holds universally for any p.

2. Given a uniformly distributed random variable on some finite set, we can easily turn it intovariable-length fair coin flips. For example:

• If U is uniform over 1, 2, 3, we can map 1 7→ ∅, 2 7→ 0 and 3 7→ 1.

• If U is uniform over 1, 2, . . . , 11, we can map 1 7→ ∅, 2 7→ 0, 3 7→ 1, and 4, . . . , 11 7→3-bit strings.

Lemma 31.1. Given U uniformly distributed on [M ], there exists f : [M ] → 0, 1∗ such thatconditioned on l(f(U)) = k, f(U) is uniformly over 0, 1k. Moreover,

log2M − 4 ≤ E[l(f(U))] ≤ log2M bits.

324

Page 325: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Proof. We defined f by partitioning [M ] into subsets whose cardinalities are powers of two, andassign elements in each subset to binary strings of that length. Formally, denote the binary expansionof M by M =

∑ni=0mi2

i, where the most significant bit mn = 1 and n = blog2Mc + 1. Thosenon-zero mi’s defines a partition [M ] = ∪tj=0Mj , where |Mi| = 2ij . Map the elements of Mj to

0, 1ij . Finally, notice that uniform distribution conditioned on any subset is still uniform.To prove the bound on the expected length, the upper bound follows from the same entropy

argument log2M = H(U) ≥ H(f(U)) ≥ H(f(U)|l(f(U))) = E[l(f(U))], and the lower boundfollows from

E[l(f(U))] =1

M

n∑

i=0

mi2i · i = n− 1

M

n∑

i=0

mi2i(n− i) ≥ n− 2n

M

n∑

i=0

2i−n(n− i) ≥ n− 2n+1

M≥ n− 4,

where the last step follows from n ≤ log2M + 1.

Elias’ extractor Let w(xn) define the Hamming weight (number of ones) of a binary string. LetTk = xn ∈ 0, 1n : w(xn) = k define the Hamming sphere of radius k. For each 0 ≤ k ≤ n, weapply the function f from Lemma 31.1 to each Tk. This defines a mapping ΨE : 0, 1∗ → 0, 1∗and then we extend it to ΨE : 0, 1n → 0, 1∗ by applying the mapping per n-bit block anddiscard the last incomplete block. Then it is clear that the rate is given by 1

nE[l(ΨE(Xn))]. ByLemma 31.1, we have

E log

(n

w(Xn)

)− 4 ≤ E[l(ΨE(Xn))] ≤ E log

(n

w(Xn)

)

Using Stirling’s approximation (see, e.g., [Ash65, Lemma 4.7.1]), we have

2nh(p)

√8npq

≤(n

k

)≤ 2nh(p)

√2πnpq

(31.1)

where p = 1− q = k/n ∈ (0, 1). Since w(Xn) ∼ Bin(n, p), we have

E[l(ΨE(Xn))] = nh(p) +O(log n).

Therefore the extraction rate approaches the optimum h(p) as n→∞.

31.4 Peres’ iterated von Neumann’s scheme

Main idea Recycle the bits thrown away in von Neumann’s scheme and iterate. What did vonNeumann’s extractor discard: (a) bits from equal pairs; (b) location of the distinct pairs. To achievethe entropy bound, we need to extract the randomness out of these two parts as well.

First some notations: Given x2n, let k = l(ΨvN(x2n)) denote the number of consecutive distinctbit-pairs.

• Let 1 ≤ m1 < . . . < mk ≤ n denote the locations such that x2mj 6= x2mj−1.

• Let 1 ≤ i1 < . . . < in−k ≤ n denote the locations such that x2ij = x2ij−1.

• yj = x2mj , vj = x2ij , uj = x2j ⊕ x2j+1.

Here yk are the bits that von Neumann’s scheme outputs and both vn−k and un are discarded. Notethat un is important because it encodes the location of the yk and contains a lot of information.Therefore von Neumann’s scheme can be improved if we can extract the randomness out of bothvn−k and un.

325

Page 326: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Peres’ extractor For each t ∈ N, recursively define an extractor Ψt as follows:

• Set Ψ1 to be von Neumann’s extractor ΨvN, i.e., Ψ1(x2n+1) = Ψ1(x2n) = yk.

• Define Ψt by Ψt(x2n) = Ψt(x

2n+1) = (Ψ1(x2n),Ψt−1(un),Ψt−1(vn−k)).

Example: Input x = 100111010011 of length 2n = 12. Output recursively:

y︷ ︸︸ ︷(011)

u︷ ︸︸ ︷(110100)

v︷ ︸︸ ︷(101)

(1)(010)(10)(0)

(1)(0)

Next we (a) verify Ψt is a valid extractor; (b) evaluate its efficiency (rate). Note that the bitsthat enter into the iteration are no longer i.i.d. To compute the rate of Ψt, it is convenient tointroduce the notion of exchangeability. We say Xn are exchangeable if the joint distribution isinvariant under permutation, that is, PX1,...,Xn = PXπ(1),...,Xπ(n)

for any permutation π on [n]. Inparticular, if Xi’s are binary, then Xn are exchangeable if and only if the joint distribution onlydepends on the Hamming weight, i.e., PXn=xn = p(w(xn)). Examples: Xn is iid Bern(p); Xn isuniform over the Hamming sphere Tk.

As an example, ifX2n are i.i.d. Bern(p), then conditioned on L = k, V n−k is iid Bern(p2/(p2+q2)),since L ∼ Binom(n, 2pq) and

P[Y k = y, Un = u, V n−k = v|L = k] =pk+2mqn−k−2m

(nk

)(p2 + q2)n−k(2pq)k

= 2−k ·(n

k

)−1

·( p2

p2 + q2

)m( q2

p2 + q2

)n−k−m

= P[Y k = y|L = k]P[Un = u|L = k]P[V n−k = v|L = k],

where m = w(v). In general, when X2n are only exchangeable, we have the following:

Lemma 31.2 (Ψt preserves exchangebility). Let X2n be exchangeable and L = Ψ1(X2n). Then con-

ditioned on L = k, Y k, Un and V n−k are independent and exchangeable. Furthermore, Y ki.i.d.∼ Bern(12)

and Un is uniform over Tk.

Proof. If suffices to show that ∀y, y′ ∈ 0, 1k, u, u′ ∈ Tk and v, v′ ∈ 0, 1n−k such that w(v) = w(v′),we have

P[Y k = y, Un = u, V n−k = v|L = k] = P[Y k = y′, Un = u′, V n−k = v′|L = k].

Note that the string X2n and the triple (Y k, Un, V n−k) are in one-to-one correspondence of eachother. Indeed, to reconstruct X2n, simply read the k distinct pairs from Y and fill them accordingto the locations of ones in U and fill the remaining equal pairs from V . Finally, note that u, y, v andu′, y′, v′ correspond to two input strings x and x′ of identical Hamming weight (w(x) = k + 2w(v))and hence of identical probability due to the exchangeability of X2n. [Examples: (y, u, v) =(01, 1100, 01)⇒ x = (10010011), (y, u, v) = (11, 1010, 10)⇒ x′ = (01110100).]

Computing the marginals, we conclude that both Y k and Un are uniform over their respectivesupport set.

Lemma 31.3 (Ψt is an extractor). Let X2n be exchangeable. Then Ψt(X2n)

i.i.d.∼ Bern(1/2) condi-tioned on l(Ψt(X

2n)) = m.

326

Page 327: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Proof. Note that Ψt(X2n) ∈ 0, 1∗. It is equivalent to show that for all sm ∈ 0, 1m,

P[Ψt(X2n) = sm] = 2−mP[l(Ψt(X

2n)) = m].

Proceed by induction on t. The base case of t = 1 follows from Lemma 31.2 (the distribution of theY part). Assume Ψt−1 is an extractor. Recall that Ψt(X

2n) = (Ψ1(X2n),Ψt−1(Un),Ψt−1(V n−k))and write the length as L = L1 + L2 + L3, where L2 ⊥⊥ L3|L1 by Lemma 31.2. Then

P[Ψt(X2n) = sm]

=m∑

k=0

P[Ψt(X2n) = sm|L1 = k]P[L1 = k]

Lemma 31.2=

m∑

k=0

m−k∑

r=0

P[L1 = k]P[Y k = sk|L1 = k]P[Ψt−1(Un) = sk+rk+1|L1 = k]P[Ψt−1(V n−k) = smk+r+1|L1 = k]

induction=

m∑

k=0

m−k∑

r=0

P[L1 = k]2−k2−rP[L2 = r|L1 = k]2−(m−k−r)P[L3 = m− k − r|L1 = k]

= 2−mP[L = m].

Next we compute the rate of Ψt. Let X2ni.i.d.∼ Bern(p). Then by SLLN, 12n l(Ψ1(X2n)) , Ln

2n

converges a.s. to pq. Assume, again by induction, that 12n l(Ψt−1(X2n))

a.s.−−→rt−1(p), with r1(p) = pq.Then

1

2nl(Ψt(X

2n)) =Ln2n

+1

2nl(Ψt−1(Un)) +

1

2nl(Ψt−1(V n−Ln)).

Note that Uni.i.d.∼ Bern(2pq), V n−Ln |Lni.i.d.∼ Bern(p2/(p2 + q2)) and Ln

a.s.−−→∞. Then the inductionhypothesis implies that 1

n l(Ψt−1(Un))a.s.−−→rt−1(2pq) and 1

2(n−Ln) l(Ψt−1(V n−Ln))a.s.−−→rt−1(p2/(p2 +

q2)). We obtain the recursion:

rt(p) = pq +1

2rt−1(2pq) +

p2 + q2

2rt−1

(p2

p2 + q2

), (Trt−1)(p), (31.2)

where the operator T maps a continuous function on [0, 1] to another. Furthermore, T is monotonein the senes that f ≤ g pointwise then Tf ≤ Tg. Then it can be shown that rt convergesmonotonically from below to the fixed-point of T , which turns out to be exactly the binaryentropy function h. Instead of directly verifying Th = h, next we give a simple proof: Consider

X1, X2i.i.d.∼ Bern(p). Then 2h(p) = H(X1, X2) = H(X1⊕X2, X1) = H(X1⊕X2)+H(X1|X1⊕X2) =

h(p2 + q2) + 2pqh(12) + (p2 + q2)h( p2

p2+q2 ).The convergence of rt to h are shown in Fig. 31.1.

31.5 Bernoulli factory

Given a stream of Bern(p) bits with unknown p, for what kind of function f : [0, 1]→ [0, 1] can wesimulate iid bits from Bern(f(p)). Our discussion above deals with f(p) ≡ 1

2 . The most famousexample is whether we can simulate Bern(2p) from Bern(p), i.e., f(p) = 2p ∧ 1. Keane and O’Brien[KO94] showed that all f that can be simulated are either constants or “polynomially bounded awayfrom 0 or 1”: for all 0 < p < 1, minf(p), 1− f(p) ≥ minp, 1− pn for some n ∈ N. In particular,doubling the bias is impossible.

327

Page 328: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

0.2 0.4 0.6 0.8 1.0

0.2

0.4

0.6

0.8

1.0

Figure 31.1: Rate function rt for t = 1, 4, 10 versus the binary entropy function.

The above result deals with what f(p) can be simulated in principle. What type of computationaldevices are needed for such as task? Note that since r1(p) is quadratic in p, all rate functions rtthat arise from the iteration (31.2) are rational functions (ratios of polynomials), converging tothe binary entropy function as Fig. 31.1 shows. It turns out that for any rational function f thatsatisfies 0 < f < 1 on (0, 1), we can generate independent Bern(f(p)) from Bern(p) using either ofthe following schemes with finite memory [MP05]:

1. Finite-state machine (FSM): initial state (red), intermediate states (white) and final states(blue, output 0 or 1 then reset to initial state).

2. Block simulation: let A0, A1 be disjoint subsets of 0, 1k. For each k-bit segment, output 0 iffalling in A0 or 1 if falling in A1. If neither, discard and move to the next segment. The blocksize is at most the degree of the denominator polynomial of f .

The next table gives examples of these two realizations:

328

Page 329: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Goal Block simulation FSM

f(p) = 1/2 A0 = 10;A1 = 01

1

0

0

1

1

0

01

f(p) = 2pq A0 = 00, 11;A1 = 01, 10 0 1

0

1

01

01

f(p) = p3

p3+q3 A0 = 000;A1 = 111

0

1

0

1

0

1

1

0

0

1

1

0

Exercise: How to generate f(p) = 1/3?It turns out that the only type of f that can be simulated using either FSM or block simulation

is rational function. For f(p) =√p, which satisfies Keane-O’Brien’s characterization, it cannot be

simulated by FSM or block simulation, but it can be simulated by pushdown automata (PDA),which are FSM operating with a stack (infinite memory) [MP05].

What is the optimal Bernoulli factory with the best rate is unclear. Clearly, a converse is theentropy bound h(p)

h(f(p)) , which can be trivial (bigger than one).

31.6 Related problems

31.6.1 Generate samples from a given distribution

The problem of how to turn pure bits into samples of a given distribution P is in a way the oppositedirection of what we have been considering so far. This can be done via Knuth-Yao’s tree algorithm:Starting at the root, flip a fair coin for each edge and move down the tree until reaching a leaf nodeand outputting the symbol. Let L denote the number of flips, which is a random variable. ThenH(P ) ≤ E[L] ≤ H(P ) + 2 bits.

Examples:

• To generate P = [1/2, 1/4, 1/4] on a, b, c, use the finite tree: E[L] = 1.5.

329

Page 330: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

a

0

b

1

c

1

1

• To generate P = [1/3, 2/3] on a, b (note that 2/3 = 0.1010 . . . , 1/3 = 0.0101 . . .), use theinfinite tree: E[L] = 2 (geometric distribution)

a

0

b

0

a

0

...

1

1

1

31.6.2 Approximate random number generator

The goal is to design f : X n → 0, 1k s.t. f(Xn) is close to fair coin flips in distribution in certainperformance metric (TV or KL or min-entropy). One formulation is that D(Pf(Xn)‖Uniform) = o(k).

Intuitions: The connection to lossless data compression is as follows: A good compressorsqueezes out all the redundancy of the source. Therefore its output should be close to pure bits,otherwise we can compress it furthermore. So good lossless compressors should act like goodapproximate random number generators.

330

Page 331: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

§ 32. Entropy method in combinatorics and geometry

A commonly used method in combinatorics for bounding the number of certain objects from aboveinvolves a smart application of Shannon entropy. Usually the method proceeds as follows: in orderto count the cardinality of a given set C, we draw an element uniformly at random from C, whoseentropy is given by log |C|. To bound |C| from above, we describe this random object by a randomvector X = (X1, . . . , Xn), e.g., an indicator vector, then proceed to compute or upper-bound thejoint entropy H(X1, . . . , Xn) using methods we learned in Part I.

Notably, three methods of increasing precision are as follows:

• Marginal bound:

H(X1, . . . , Xn) ≤n∑

i=1

H(Xi)

• Pairwise bound (Shearer’s lemma and generalization Theorem 1.4):

H(X1, . . . , Xn) ≤ 2

n− 2

i<j

H(Xi, Xj)

• Chain rule (exact calculation):

H(X1, . . . , Xn) =n∑

i=1

H(Xi|X1, . . . , Xi−1)

Next, we give three applications using the above three methods, respectively, in increasing difficulties:

• Enumerating binary vectors of a given average weights

• Counting triangles and other subgraphs

• Bregman’s theorem

Finally, to demonstrate how entropy method can also be used for questions in Euclidean spaces,we prove the Loomis-Whitney and Bollobas-Thomason theorems based on analogous properties ofdifferential entropy.

32.1 Binary vectors of average weights

Lemma 32.1 (Massey [Mas74]). Let C ⊂ 0, 1n and let p be the average fraction of 1’s in C, i.e.

p =1

|C|∑

x∈C

w(x)

n,

where w(x) is the Hamming weight (number of 1’s) of x ∈ 0, 1n. Then |C| ≤ 2nh(p).

331

Page 332: LECTURE NOTES ON INFORMATION THEORY Prefaceyw562/teaching/itlectures.pdf · LECTURE NOTES ON INFORMATION THEORY Preface \There is a whole book of readymade, long and convincing, lav-ishly

Remark 32.1. This result holds even if p > 1/2.

Proof. Let X = (X1, . . . , Xn) be drawn uniformly at random from C. Then

log |C| = H(X) = H(X1, . . . , Xn) ≤n∑

i

H(Xi) =n∑

i=1

h(pi),

where pi = P [Xi = 1] is the fraction of vertices whose i-th bit is 1. Note that

p =1

n

n∑

i=1

pi,

since we can either first average over vectors in C or first average across different bits. By Jensen’sinequality and the fact that x 7→ h(x) is concave,

n∑

i=1

h(pi) ≤ nh(

1

n

n∑

i=1

pi

)= nh(p).

Hence we have shown that log |C| ≤ nh(p).

Theorem 32.1.k∑

j=0

(n

j

)≤ 2nh(k/n), k ≤ n/2.

Proof. We take C = x ∈ 0, 1n : w(x) ≤ k and invoke the previous lemma, which says that

k∑

j=0

(n

j

)= |C| ≤ 2nh(p) ≤ 2nh(k/n),

where the last inequality follows from the fact that x 7→ h(x) is increasing for x ≤ 1/2.

Remark 32.2. Alternatively, we can prove the theorem using the large-deviation bound in Part III.By the Chernoff bound on the binomial tail (see Example after Theorem 14.1),

LHS

2n= P(Bin(n, 1/2) ≤ k) ≤ 2−nd( k

n‖ 1

2) = 2−n(1−h(k/n)) =

RHS

2n.

32.2 Shearer’s lemma & counting subgraphs

Recall that a special case of Shearer’s lemma Theorem 1.4 (or Han’s inequality Theorem 1.3) says:

H(X1, X2, X3) ≤ 1

2[H(X1, X2) +H(X2, X3) +H(X1, X3)].

A classical application of this result (see Remark 1.1) is to bound cardinality of a set in R3 givencardinalities of its projections.

For graphs H and G, define N(H,G) to be the number of copies of H in G.1 For example,

N( , ) = 4, N( , ) = 4.

1To be precise, here N(H,G) is the injective homomorphism number, that is, the number of injective mappingsϕ : V (H)→ V (G) which preserve edges, i.e., (ϕ(v)ϕ(w)) ∈ E(G) whenever (vw) ∈ E(H). Note that if H is a clique,every homomorphism is automatically injective.


If we know that G has m edges, what is the maximal number of copies of H that can be contained in G? To study this quantity, let us define
$$N(H,m) = \max_{G : |E(G)| \le m} N(H,G).$$
We will show that the maximal number of triangles satisfies
$$N(K_3,m) \asymp m^{3/2}. \tag{32.1}$$

To show that $N(K_3,m) \gtrsim m^{3/2}$, consider $G = K_n$, which has $m = |E(G)| = \binom{n}{2} \asymp n^2$ and $N(K_3,K_n) = \binom{n}{3} \asymp n^3 \asymp m^{3/2}$.

To show the upper bound, fix a graph $G = (V,E)$ with $m$ edges. Draw a labeled triangle uniformly at random and denote its vertices by $(X_1,X_2,X_3)$. Then by Shearer's lemma,
$$\log\big(3!\,N(K_3,G)\big) = H(X_1,X_2,X_3) \le \frac{1}{2}\big[H(X_1,X_2) + H(X_2,X_3) + H(X_1,X_3)\big] \le \frac{3}{2}\log(2m),$$
where the last step holds because each pair $(X_i,X_j)$ is an ordered edge of $G$ and hence takes at most $2m$ values. Hence
$$N(K_3,G) \le \frac{\sqrt{2}}{3}\, m^{3/2}. \tag{32.2}$$

Remark 32.3. Interestingly, a linear-algebra argument yields exactly the same upper bound as (32.2). Let $A$ be the adjacency matrix of $G$ with eigenvalues $\lambda_i$. Then
$$2|E(G)| = \operatorname{tr}(A^2) = \sum_i \lambda_i^2, \qquad 6N(K_3,G) = \operatorname{tr}(A^3) = \sum_i \lambda_i^3.$$
Since $\|\lambda\|_3 \le \|\lambda\|_2$, we get $(6N(K_3,G))^{1/3} \le (2|E(G)|)^{1/2}$, which yields $N(K_3,G) \le \frac{\sqrt{2}}{3}\, m^{3/2}$.
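As an illustration (not from the notes), the following Python sketch counts triangles in small random graphs and checks them against (32.2); the graph model and all function names are our choices.

# Check (not from the notes) of N(K3, G) <= (sqrt(2)/3) m^{3/2} on random graphs.
import itertools
import math
import random

def random_graph(n: int, p: float):
    """Erdos-Renyi G(n, p) as a 0/1 adjacency matrix (list of lists)."""
    A = [[0] * n for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        if random.random() < p:
            A[i][j] = A[j][i] = 1
    return A

def triangles(A) -> int:
    """Number of (unlabeled) triangles in the graph with adjacency matrix A."""
    n = len(A)
    return sum(A[i][j] and A[j][k] and A[i][k]
               for i, j, k in itertools.combinations(range(n), 3))

random.seed(0)
for _ in range(5):
    A = random_graph(12, 0.4)
    m = sum(map(sum, A)) // 2
    t = triangles(A)
    bound = math.sqrt(2) / 3 * m ** 1.5
    assert t <= bound + 1e-9
    print(f"m={m:3d}  triangles={t:3d}  bound={bound:7.1f}")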

Using Shearer's Theorem 1.4, Friedgut and Kahn [FK98] obtained the counterpart of (32.1) for arbitrary H, which was first proved by Alon [Alo81]. We first introduce the fractional covering number of a graph. For a graph $H = (V,E)$, define the fractional covering number as the value of the following linear program:²
$$\rho^*(H) = \min_w\Big\{\sum_{e\in E} w(e) \;:\; \sum_{e\in E,\, v\in e} w(e) \ge 1\ \ \forall v\in V,\ \ w(e)\in[0,1]\Big\} \tag{32.3}$$

Theorem 32.2. There exist constants $c_0(H), c_1(H) > 0$ such that
$$c_0(H)\, m^{\rho^*(H)} \le N(H,m) \le c_1(H)\, m^{\rho^*(H)}. \tag{32.4}$$

For example, for triangles we have ρ∗(K3) = 3/2 and Theorem 32.2 is consistent with (32.1).
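Since $\rho^*(H)$ is defined by the linear program (32.3), it can be computed with any generic LP solver. The following Python sketch is ours (not part of the notes), assumes numpy and scipy are available, and the helper name `fractional_cover` is hypothetical.

# Compute the fractional covering number rho*(H) of (32.3) via an LP solver.
import numpy as np
from scipy.optimize import linprog

def fractional_cover(vertices, edges):
    """min sum_e w(e) s.t. sum_{e : v in e} w(e) >= 1 for all v, 0 <= w(e) <= 1."""
    c = np.ones(len(edges))                    # objective: total edge weight
    A = np.zeros((len(vertices), len(edges)))  # vertex-edge incidence
    for j, e in enumerate(edges):
        for v in e:
            A[vertices.index(v), j] = 1.0
    res = linprog(c, A_ub=-A, b_ub=-np.ones(len(vertices)),
                  bounds=[(0, 1)] * len(edges), method="highs")
    return res.fun

# rho*(K3) = 3/2, matching the triangle example above.
print(fractional_cover([1, 2, 3], [(1, 2), (2, 3), (1, 3)]))
# rho*(K4) = 2 (two disjoint edges already cover all four vertices).
print(fractional_cover([1, 2, 3, 4],
                       [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]))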

Proof. Upper bound: Let $|V(H)| = n$ and let $w^*$ be an optimal solution of (32.3). For any $G$ with $m$ edges, draw a subgraph of $G$ uniformly at random from all those that are isomorphic to $H$, and label its vertices by $X = (X_1,\dots,X_n)$. Now define a random 2-subset $S$ of $[n]$ by sampling an edge $e$ from $E(H)$ with probability $\frac{w^*(e)}{\rho^*(H)}$. By the definition of $\rho^*(H)$, we have for any $i \in [n]$ that $P[i \in S] \ge \frac{1}{\rho^*(H)}$. We are now ready to apply Shearer's Theorem 1.4:

$$\frac{1}{\rho^*(H)}\log\big(N(H,G)\,n!\big) = \frac{1}{\rho^*(H)}H(X) \le H(X_S \mid S) \le \log(2m),$$
where the last bound is as before: if $S = \{v,w\}$, then $X_S = (X_v,X_w)$ takes one of at most $2m$ values. Overall, we get $N(H,G) \le \frac{1}{n!}(2m)^{\rho^*(H)}$.

²If the "$\in [0,1]$" constraints in (32.3) and (32.5) are replaced by "$\in \{0,1\}$", we obtain the covering number $\rho(H)$ and the independence number $\alpha(H)$ of $H$, respectively.

Lower bound: It amounts to constructing a graph $G$ with $m$ edges for which $N(H,G) \ge c(H)|E(G)|^{\rho^*(H)}$. Consider the dual LP of (32.3),
$$\alpha^*(H) = \max_\psi\Big\{\sum_{v\in V(H)} \psi(v) \;:\; \psi(v) + \psi(w) \le 1\ \ \forall (vw)\in E,\ \ \psi(v)\in[0,1]\Big\}, \tag{32.5}$$

i.e., the fractional packing number. By the duality theorem of LP, we have $\alpha^*(H) = \rho^*(H)$. The graph $G$ is constructed as follows: for each vertex $v$ of $H$, replicate it $m(v)$ times; for each edge $e = (vw)$ of $H$, replace it by a complete bipartite graph $K_{m(v),m(w)}$. Then the total number of edges of $G$ is
$$|E(G)| = \sum_{(vw)\in E(H)} m(v)\, m(w).$$

Furthermore, $N(H,G) \ge \prod_{v\in V(H)} m(v)$. To maximize the exponent $\frac{\log N(H,G)}{\log |E(G)|}$, fix a large number $M$ and let $m(v) = \big\lceil M^{\psi(v)} \big\rceil$, where $\psi$ is the maximizer in (32.5). Then
$$|E(G)| \le \sum_{(vw)\in E(H)} 4M^{\psi(v)+\psi(w)} \le 4M|E(H)|, \qquad N(H,G) \ge \prod_{v\in V(H)} M^{\psi(v)} = M^{\alpha^*(H)},$$
and we are done.

32.3 Bregman’s Theorem

Next, we present Bregman’s Theorem [Bre73] and an elegant proof given by Radhakrishnan [Rad97].

Definition 32.1. A perfect matching is a 1-regular spanning subgraph.

The permanent of an $n \times n$ matrix $A$ is defined as
$$\mathrm{perm}(A) = \sum_{\pi\in S_n}\prod_{i=1}^n a_{i\pi(i)},$$
where $S_n$ denotes the group of all permutations of $n$ letters. For a bipartite graph $G$ with $n$ vertices on the left and right respectively, the number of perfect matchings in $G$ is given by $\mathrm{perm}(A)$, where $A$ is the adjacency matrix.

Example: (The original notes display two small 0-1 matrices here, with permanents 1 and 2 respectively; the matrices do not survive in this text-only version.)

Theorem 32.3 (Bregman's Theorem). For any $n \times n$ bipartite graph with adjacency matrix $A$,
$$\mathrm{perm}(A) \le \prod_{i=1}^n (d_i!)^{1/d_i},$$
where $d_i$ is the degree of left vertex $i$ (i.e., the sum of the $i$-th row of $A$).


Example: Consider $G = K_{n,n}$. Then $\mathrm{perm}(A) = n!$, and the RHS is $[(n!)^{1/n}]^n = n!$ as well. More generally, if $G$ consists of $n/d$ copies of $K_{d,d}$, then Bregman's bound is tight and $\mathrm{perm}(A) = (d!)^{n/d}$.
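For small matrices, Bregman's bound is easy to compare with the permanent by brute force. The Python sketch below is ours (not from the notes); it checks $K_{3,3}$ and two disjoint copies of $K_{2,2}$, where the bound is tight.

# Brute-force comparison (not from the notes) of perm(A) and Bregman's bound.
from itertools import permutations
from math import factorial, prod

def perm(A):
    """Permanent of a square 0/1 matrix, by summing over all permutations."""
    n = len(A)
    return sum(prod(A[i][p[i]] for i in range(n)) for p in permutations(range(n)))

def bregman_bound(A):
    """prod_i (d_i!)^{1/d_i}, where d_i is the i-th row sum (rows with d_i = 0 skipped)."""
    bound = 1.0
    for row in A:
        d = sum(row)
        if d > 0:
            bound *= factorial(d) ** (1.0 / d)
    return bound

K33 = [[1, 1, 1]] * 3                       # K_{3,3}: perm = 3! = 6, bound = 6
two_K22 = [[1, 1, 0, 0], [1, 1, 0, 0],      # two disjoint K_{2,2}'s:
           [0, 0, 1, 1], [0, 0, 1, 1]]      # perm = (2!)^2 = 4, bound = 4
for A in (K33, two_K22):
    print(perm(A), bregman_bound(A))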

First attempt: We select a perfect matching uniformly at random, which matches the $i$-th left vertex to the $X_i$-th right one. Let $X = (X_1,\dots,X_n)$. Then
$$\log\mathrm{perm}(A) = H(X) = H(X_1,\dots,X_n) \le \sum_{i=1}^n H(X_i) \le \sum_{i=1}^n \log d_i.$$

Hence $\mathrm{perm}(A) \le \prod_i d_i$. This is much worse than Bregman's bound, since by Stirling's formula
$$\prod_{i=1}^n (d_i!)^{1/d_i} \sim \left(\prod_{i=1}^n d_i\right) e^{-n},$$
i.e., Bregman's bound is smaller than the product of the degrees by a factor roughly $e^{n}$.

Second attempt: The hope is to use the chain rule to expand the joint entropy and bound the conditional entropies more carefully. Let us write
$$H(X_1,\dots,X_n) = \sum_{i=1}^n H(X_i \mid X_1,\dots,X_{i-1}) \le \sum_{i=1}^n E[\log N_i],$$
where $N_i$, as a random variable, denotes the number of possible values $X_i$ can take conditioned on $X_1,\dots,X_{i-1}$, i.e., how many possible matches remain for left vertex $i$ given where $1,\dots,i-1$ are matched. However, it is hard to proceed from this point, as we only know the degree information, not the graph itself. In fact, since we do not know the relative positions of the vertices, why should we order them like this? The key idea is to label the vertices randomly, apply the chain rule in this random order, and average.

To this end, pick $\pi$ uniformly at random from $S_n$ and independent of $X$. Then
$$\log\mathrm{perm}(A) = H(X) = H(X\mid\pi) = H(X_{\pi(1)},\dots,X_{\pi(n)}\mid\pi) = \sum_{k=1}^n H(X_{\pi(k)} \mid X_{\pi(1)},\dots,X_{\pi(k-1)},\pi)$$
$$= \sum_{k=1}^n H(X_k \mid \{X_j : \pi^{-1}(j) < \pi^{-1}(k)\},\, \pi) \le \sum_{k=1}^n E\log N_k,$$
where $N_k$ denotes the number of possible matches for vertex $k$ given the outcome of $\{X_j : \pi^{-1}(j) < \pi^{-1}(k)\}$, and the expectation is with respect to $(X,\pi)$. The key observation is:

Lemma 32.2. $N_k$ is uniformly distributed on $[d_k]$.

Example: Consider a small graph $G$ in which left vertex $k = 1$ has degree $d_k = 2$ (the accompanying figure is omitted in this text-only version). Depending on the random ordering: if $\pi = 1{*}{*}$, then $N_k = 2$ w.p. $1/3$; if $\pi = {*}{*}1$, then $N_k = 1$ w.p. $1/3$; if $\pi = 213$, then $N_k = 2$ w.p. $1/6$; if $\pi = 312$, then $N_k = 1$ w.p. $1/6$. Combining everything, indeed $N_k$ is equally likely to be 1 or 2.


Thus,
$$E_{(X,\pi)}\log N_k = \frac{1}{d_k}\sum_{i=1}^{d_k}\log i = \log (d_k!)^{\frac{1}{d_k}},$$
and hence
$$\log\mathrm{perm}(A) \le \sum_{k=1}^n \log (d_k!)^{\frac{1}{d_k}} = \log\prod_{i=1}^n (d_i!)^{\frac{1}{d_i}}.$$

Finally, we prove Lemma 32.2:

Proof. Note that $X_i = \sigma(i)$ for some random permutation $\sigma$. Let $T = \partial(k)$ be the set of neighbors of $k$. Then
$$N_k = \big|T \setminus \{\sigma(j) : \pi^{-1}(j) < \pi^{-1}(k)\}\big|,$$
which is a function of $(\sigma,\pi)$. In fact, conditioned on any realization of $\sigma$, $N_k$ is uniform over $[d_k]$. To see this, note that $\sigma^{-1}(T)$ is a fixed subset of $[n]$ of cardinality $d_k$ containing $k$, while $S \triangleq \{j : \pi^{-1}(j) < \pi^{-1}(k)\}$ is the set of vertices preceding $k$ in the random order $\pi$. Then
$$N_k = |\sigma^{-1}(T)\setminus S| = 1 + \underbrace{\big|\big(\sigma^{-1}(T)\setminus\{k\}\big)\setminus S\big|}_{\text{uniform on }\{0,\dots,d_k-1\}},$$
where the underbraced quantity is uniform because under $\pi$ the vertex $k$ is equally likely to occupy each of the $d_k$ relative positions among the elements of $\sigma^{-1}(T)$. Hence $N_k$ is uniform over $[d_k]$.
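Lemma 32.2 is also easy to check by simulation. The following Python sketch is ours (the small bipartite graph `nbrs` is an arbitrary choice, not the graph from the omitted figure): it samples a uniform perfect matching and an independent uniform order, and tabulates $N_k$ for $k = 0$, a vertex of degree 2.

# Simulation (not from the notes) illustrating Lemma 32.2: N_k is uniform on [d_k].
import itertools
import random
from collections import Counter

# Neighbor sets of the left vertices of a small bipartite graph (arbitrary choice).
nbrs = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2}}
n = len(nbrs)

# Enumerate all perfect matchings (assignments i -> X[i] with X[i] in nbrs[i]).
matchings = [p for p in itertools.permutations(range(n))
             if all(p[i] in nbrs[i] for i in range(n))]

k, counts = 0, Counter()
random.seed(1)
for _ in range(60000):
    X = random.choice(matchings)             # uniform perfect matching
    pi = list(range(n)); random.shuffle(pi)  # uniform processing order
    before = {j for j in range(n) if pi.index(j) < pi.index(k)}
    N_k = len(nbrs[k] - {X[j] for j in before})
    counts[N_k] += 1
print(counts)   # N_k = 1 and N_k = 2 each occur roughly half the time (d_0 = 2)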

32.4 Euclidean geometry: Bollobas-Thomason and Loomis-Whitney

The following famous result shows that n-dimensional rectangles simultaneously minimize the volumes of all coordinate projections:³

Theorem 32.4 (Bollobas-Thomason Box Theorem). Let $K \subset \mathbb{R}^n$ be a compact set. For $S \subset [n]$, denote by $K_S \subset \mathbb{R}^S$ the projection of $K$ onto those coordinates indexed by $S$. Then there exists a rectangle $A$ s.t. $\mathrm{Leb}\,A = \mathrm{Leb}\,K$ and for all $S \subset [n]$:
$$\mathrm{Leb}\,A_S \le \mathrm{Leb}\,K_S.$$

Proof. Let $X^n$ be uniformly distributed on $K$. Then $h(X^n) = \log \mathrm{Leb}\,K$. Let $A$ be a rectangle of size $a_1 \times \cdots \times a_n$, where
$$\log a_i = h(X_i \mid X^{i-1}).$$

Then, by part 1 of Theorem 1.6,
$$h(X_S) \le \log \mathrm{Leb}\,K_S.$$
On the other hand, by the chain rule,
$$h(X_S) = \sum_{i=1}^n 1\{i\in S\}\, h(X_i \mid X_{[i-1]\cap S}) \ge \sum_{i\in S} h(X_i \mid X^{i-1}) = \log\prod_{i\in S} a_i = \log \mathrm{Leb}\,A_S.$$

³Note that since $K$ is compact, its projections and slices are all compact and hence measurable.


Corollary 32.1 (Loomis-Whitney). Let $K$ be a compact subset of $\mathbb{R}^n$ and let $K_{j^c}$ denote the projection of $K$ onto the coordinates in $[n]\setminus\{j\}$. Then
$$\mathrm{Leb}\,K \le \prod_{j=1}^n \big(\mathrm{Leb}\,K_{j^c}\big)^{\frac{1}{n-1}}. \tag{32.6}$$

Proof. Apply the previous theorem to construct the rectangle $A$ and note that
$$\mathrm{Leb}\,K = \mathrm{Leb}\,A = \prod_{j=1}^n \big(\mathrm{Leb}\,A_{j^c}\big)^{\frac{1}{n-1}}.$$
By the previous theorem, $\mathrm{Leb}\,A_{j^c} \le \mathrm{Leb}\,K_{j^c}$.

The meaning of the Loomis-Whitney inequality is best understood by introducing the average width of $K$ in direction $j$: $w_j \triangleq \frac{\mathrm{Leb}\,K}{\mathrm{Leb}\,K_{j^c}}$. Then (32.6) is equivalent to
$$\mathrm{Leb}\,K \ge \prod_{j=1}^n w_j,$$
i.e., the volume of $K$ is at least the volume of the rectangle of average widths.
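A discrete analogue of (32.6) holds for finite sets $K \subset \mathbb{Z}^n$ (it follows from the same Shearer/Han-type inequalities): $|K| \le \prod_j |K_{j^c}|^{1/(n-1)}$. The following Python sketch is ours (not from the notes) and checks this on random three-dimensional point sets.

# Check (not from the notes) of the discrete Loomis-Whitney inequality in Z^3.
import random

def projections(points, n):
    """For each j, the projection of `points` onto the coordinates other than j."""
    return [{tuple(p[i] for i in range(n) if i != j) for p in points}
            for j in range(n)]

random.seed(2)
n = 3
for _ in range(5):
    K = {tuple(random.randrange(4) for _ in range(n)) for _ in range(30)}
    bound = 1.0
    for P in projections(K, n):
        bound *= len(P) ** (1 / (n - 1))
    assert len(K) <= bound + 1e-9
    print(len(K), "<=", round(bound, 2))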


Bibliography

[AFTS01] I. C. Abou-Faycal, M. D. Trott, and S. Shamai. The capacity of discrete-time memoryless Rayleigh-fading channels. IEEE Transactions on Information Theory, 47(4):1290–1301, 2001.

[Ahl82] Rudolf Ahlswede. An elementary proof of the strong converse theorem for the multiple-access channel. J. Combinatorics, Information and System Sciences, 7(3), 1982.

[Alo81] Noga Alon. On the number of subgraphs of prescribed type of graphs with a given number of edges. Israel J. Math., 38(1-2):116–130, 1981.

[AN07] Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191. American Mathematical Soc., 2007.

[AS08] Noga Alon and Joel H. Spencer. The Probabilistic Method. John Wiley & Sons, 3rd edition, 2008.

[Ash65] Robert B. Ash. Information Theory. Dover Publications Inc., New York, NY, 1965.

[BF14] Ahmad Beirami and Faramarz Fekri. Fundamental limits of universal lossless one-to-one compression of parametric sources. In Information Theory Workshop (ITW), 2014 IEEE, pages 212–216. IEEE, 2014.

[Bla74] R. E. Blahut. Hypothesis testing and information theory. IEEE Trans. Inf. Theory, 20(4):405–417, 1974.

[BNO03] Dimitri P. Bertsekas, Angelia Nedic, and Asuman E. Ozdaglar. Convex analysis and optimization. Athena Scientific, Belmont, MA, USA, 2003.

[Boh38] H. F. Bohnenblust. Convex regions and projections in Minkowski spaces. Ann. Math., 39(2):301–308, 1938.

[Bre73] Lev M. Bregman. Some properties of nonnegative matrices and their permanents. Soviet Math. Dokl., 14(4):945–949, 1973.

[Bro86] L. D. Brown. Fundamentals of statistical exponential families with applications in statistical decision theory. In S. S. Gupta, editor, Lecture Notes-Monograph Series, volume 9. Institute of Mathematical Statistics, Hayward, CA, 1986.

[CB90] B. S. Clarke and A. R. Barron. Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Theory, 36(3):453–471, 1990.

[CB94] Bertrand S. Clarke and Andrew R. Barron. Jeffreys' prior is asymptotically least favorable under entropy risk. Journal of Statistical Planning and Inference, 41(1):37–60, 1994.

[C11] Erhan Cinlar. Probability and Stochastics. Springer, New York, 2011.


[Cho56] Noam Chomsky. Three models for the description of language. IRE Trans. Inform. Th., 2(3):113–124, 1956.

[CK81a] I. Csiszar and J. Korner. Graph decomposition: a new key to coding theorems. IEEE Trans. Inf. Theory, 27(1):5–12, 1981.

[CK81b] I. Csiszar and J. Korner. Information Theory: Coding Theorems for Discrete Memoryless Systems. Academic, New York, 1981.

[CS83] J. Conway and N. Sloane. A fast encoding method for lattice codes and quantizers. IEEE Transactions on Information Theory, 29(6):820–824, Nov 1983.

[Csi67] I. Csiszar. Information-type measures of difference of probability distributions and indirect observation. Studia Sci. Math. Hungar., 2:229–318, 1967.

[CT06] Thomas M. Cover and Joy A. Thomas. Elements of information theory, 2nd Ed. Wiley-Interscience, New York, NY, USA, 2006.

[Doo53] Joseph L. Doob. Stochastic Processes. Wiley, New York, 1953.

[Eli55] Peter Elias. Coding for noisy channels. IRE Convention Record, 3:37–46, 1955.

[Eli72] P. Elias. The efficient construction of an unbiased random sequence. Annals of Mathematical Statistics, 43(3):865–870, 1972.

[ELZ05] Uri Erez, Simon Litsyn, and Ram Zamir. Lattices which are good for (almost) everything. IEEE Transactions on Information Theory, 51(10):3401–3416, Oct. 2005.

[EZ04] U. Erez and R. Zamir. Achieving 1/2 log(1 + SNR) on the AWGN channel with lattice encoding and decoding. IEEE Trans. Inf. Theory, IT-50:2293–2314, Oct. 2004.

[FHT03] A. A. Fedotov, P. Harremoes, and F. Topsøe. Refinements of Pinsker's inequality. IEEE Transactions on Information Theory, 49(6):1491–1498, Jun. 2003.

[FJ89] G. D. Forney Jr. Multidimensional constellations. II. Voronoi constellations. IEEE Journal on Selected Areas in Communications, 7(6):941–958, Aug 1989.

[FK98] Ehud Friedgut and Jeff Kahn. On the number of copies of one hypergraph in another. Israel J. Math., 105:251–256, 1998.

[FMG92] Meir Feder, Neri Merhav, and Michael Gutman. Universal prediction of individual sequences. IEEE Trans. Inf. Theory, 38(4):1258–1270, 1992.

[Gil10] Gustavo L. Gilardoni. On Pinsker's and Vajda's type inequalities for Csiszar's f-divergences. IEEE Transactions on Information Theory, 56(11):5377–5386, 2010.

[GKY56] I. M. Gel'fand, A. N. Kolmogorov, and A. M. Yaglom. On the general definition of the amount of information. Dokl. Akad. Nauk. SSSR, 11:745–748, 1956.

[GL95] Richard D. Gill and Boris Y. Levit. Applications of the van Trees inequality: a Bayesian Cramer-Rao bound. Bernoulli, pages 59–79, 1995.

[Har] Sergiu Hart. Overweight puzzle. http://www.ma.huji.ac.il/˜hart/puzzle/overweight.html.


[Hoe65] Wassily Hoeffding. Asymptotically optimal tests for multinomial distributions. The Annals of Mathematical Statistics, pages 369–401, 1965.

[HV11] P. Harremoes and I. Vajda. On pairs of f-divergences and their joint range. IEEE Trans. Inf. Theory, 57(6):3230–3235, Jun. 2011.

[KO94] M. S. Keane and G. L. O'Brien. A Bernoulli factory. ACM Transactions on Modeling and Computer Simulation, 4(2):213–219, 1994.

[Kos63] V. N. Koshelev. Quantization with minimal entropy. Probl. Pered. Inform., 14:151–156, 1963.

[KS14] Oliver Kosut and Lalitha Sankar. Asymptotics and non-asymptotics for universal fixed-to-variable source coding. arXiv preprint arXiv:1412.4444, 2014.

[KV14] Ioannis Kontoyiannis and Sergio Verdu. Optimal lossless data compression: Non-asymptotics and asymptotics. IEEE Trans. Inf. Theory, 60(2):777–795, 2014.

[LC86] Lucien Le Cam. Asymptotic methods in statistical decision theory. Springer-Verlag, New York, NY, 1986.

[LM03] Amos Lapidoth and Stefan M. Moser. Capacity bounds via duality with applications to multiple-antenna systems on flat-fading channels. IEEE Transactions on Information Theory, 49(10):2426–2467, 2003.

[Loe97] Hans-Andrea Loeliger. Averaging bounds for lattices and linear codes. IEEE Transactions on Information Theory, 43(6):1767–1773, Nov. 1997.

[Mas74] James Massey. On the fractional weight of distinct binary n-tuples (corresp.). IEEE Transactions on Information Theory, 20(1):131–131, 1974.

[MF98] Neri Merhav and Meir Feder. Universal prediction. IEEE Trans. Inf. Theory, 44(6):2124–2147, 1998.

[MP05] Elchanan Mossel and Yuval Peres. New coins from old: computing with unknown bias. Combinatorica, 25(6):707–724, 2005.

[MT10] Mokshay Madiman and Prasad Tetali. Information inequalities for joint distributions, with interpretations and applications. IEEE Trans. Inf. Theory, 56(6):2699–2713, 2010.

[OE15] O. Ordentlich and U. Erez. A simple proof for the existence of "good" pairs of nested lattices. IEEE Transactions on Information Theory, submitted Aug. 2015.

[OPS48] B. M. Oliver, J. R. Pierce, and C. E. Shannon. The philosophy of PCM. Proceedings of the IRE, 36(11):1324–1331, 1948.

[Per92] Yuval Peres. Iterating von Neumann's procedure for extracting random bits. Annals of Statistics, 20(1):590–597, 1992.

[PPV10a] Y. Polyanskiy, H. V. Poor, and S. Verdu. Channel coding rate in the finite blocklength regime. IEEE Trans. Inf. Theory, 56(5):2307–2359, May 2010.

[PPV10b] Y. Polyanskiy, H. V. Poor, and S. Verdu. Feedback in the non-asymptotic regime. IEEE Trans. Inf. Theory, April 2010. Submitted for publication.


[PPV11] Y. Polyanskiy, H. V. Poor, and S. Verdu. Minimum energy to send k bits with and without feedback. IEEE Trans. Inf. Theory, 57(8):4880–4902, August 2011.

[PW12] E. Price and D. P. Woodruff. Applications of the Shannon-Hartley theorem to data streams and sparse recovery. In Proceedings of the 2012 IEEE International Symposium on Information Theory, pages 1821–1825, Boston, MA, Jul. 2012.

[PW14] Y. Polyanskiy and Y. Wu. Peak-to-average power ratio of good codes for Gaussian channel. IEEE Trans. Inf. Theory, 60(12):7655–7660, December 2014.

[PW17] Yury Polyanskiy and Yihong Wu. Strong data-processing inequalities for channels and Bayesian networks. In Eric Carlen, Mokshay Madiman, and Elisabeth M. Werner, editors, Convexity and Concentration. The IMA Volumes in Mathematics and its Applications, vol 161, pages 211–249. Springer, New York, NY, 2017.

[Rad97] Jaikumar Radhakrishnan. An entropy proof of Bregman's theorem. J. Combin. Theory Ser. A, 77(1):161–164, 1997.

[Ree65] Alec H. Reeves. The past, present and future of PCM. IEEE Spectrum, 2(5):58–62, 1965.

[RSU01] Thomas J. Richardson, Mohammad Amin Shokrollahi, and Rudiger L. Urbanke. Design of capacity-approaching irregular low-density parity-check codes. IEEE Transactions on Information Theory, 47(2):619–637, 2001.

[RU96] Bixio Rimoldi and Rudiger Urbanke. A rate-splitting approach to the Gaussian multiple-access channel. IEEE Transactions on Information Theory, 42(2):364–375, 1996.

[RZ86] B. Ryabko and Zh. Reznikova. Analysis of the language of ants by information-theoretical methods. Problemi Peredachi Informatsii, 22(3):103–108, 1986. English translation: http://reznikova.net/R-R-entropy-09.pdf.

[SF11] Ofer Shayevitz and Meir Feder. Optimal feedback communication via posterior matching. IEEE Trans. Inf. Theory, 57(3):1186–1222, 2011.

[Sha48] C. E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27:379–423 and 623–656, July/October 1948.

[Sio58] Maurice Sion. On general minimax theorems. Pacific J. Math., 8(1):171–176, 1958.

[Smi71] J. G. Smith. The information capacity of amplitude and variance-constrained scalar Gaussian channels. Information and Control, 18:203–219, 1971.

[Spe15] Spectre. SPECTRE: Short packet communication toolbox. https://github.com/yp-mit/spectre, 2015. GitHub repository.

[Spi96] Daniel A. Spielman. Linear-time encodable and decodable error-correcting codes. IEEE Transactions on Information Theory, 42(6):1723–1731, 1996.

[Spi97] Daniel A. Spielman. The complexity of error-correcting codes. In Fundamentals of Computation Theory, pages 67–84. Springer, 1997.

[SV11] Wojciech Szpankowski and Sergio Verdu. Minimum expected length of fixed-to-variable lossless compression without prefix constraints. IEEE Trans. Inf. Theory, 57(7):4017–4025, 2011.


[TE97] Giorgio Taricco and Michele Elia. Capacity of fading channel with no side information. Electronics Letters, 33(16):1368–1370, 1997.

[Top00] F. Topsøe. Some inequalities for information divergence and related measures of discrimination. IEEE Transactions on Information Theory, 46(4):1602–1609, 2000.

[Tsy09] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Verlag, New York, NY, 2009.

[TV05] David Tse and Pramod Viswanath. Fundamentals of wireless communication. Cambridge University Press, 2005.

[UR98] R. Urbanke and B. Rimoldi. Lattice codes can achieve capacity on the AWGN channel. IEEE Transactions on Information Theory, 44(1):273–278, 1998.

[Ver07] S. Verdu. EE528–Information Theory, Lecture Notes. Princeton Univ., Princeton, NJ, 2007.

[vN51] J. von Neumann. Various techniques used in connection with random digits. Monte Carlo Method, National Bureau of Standards, Applied Math Series, (12):36–38, 1951.

[Yek04] Sergey Yekhanin. Improved upper bound for the redundancy of fix-free codes. IEEE Trans. Inf. Theory, 50(11):2815–2818, 2004.

[Yos03] Nobuyuki Yoshigahara. Puzzles 101: A Puzzlemaster's Challenge. A K Peters, Natick, MA, USA, 2003.

[Yu97] Bin Yu. Assouad, Fano, and Le Cam. Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics, pages 423–435, 1997.

[Zam14] Ram Zamir. Lattice Coding for Signals and Networks. Cambridge University Press, Cambridge, 2014.

[ZY97] Zhen Zhang and Raymond W. Yeung. A non-Shannon-type conditional inequality of information quantities. IEEE Trans. Inf. Theory, 43(6):1982–1986, 1997.

[ZY98] Zhen Zhang and Raymond W. Yeung. On characterization of entropy function via information inequalities. IEEE Trans. Inf. Theory, 44(4):1440–1452, 1998.
