Information Theory from A
Functional Viewpoint
Jingbo Liu
A Dissertation
Presented to the Faculty
of Princeton University
in Candidacy for the Degree
of Doctor of Philosophy
Recommended for Acceptance
by the Department of
Electrical Engineering
Advisers: Paul Cuff and Sergio Verdu
January 2018
© Copyright by Jingbo Liu, 2017.
All rights reserved.
Abstract
A perennial theme of information theory is to find new methods to determine the
fundamental limits of various communication systems, which potentially helps
engineers find better designs by eliminating the deficient ones. Traditional meth-
ods have focused on the notion of “sets”: the method of types concerns the cardinality
of subsets of the typical sets; the blowing-up lemma bounds the probability of the
neighborhood of decoding sets; the single-shot (information-spectrum) approach uses
the likelihood threshold to define sets. This thesis promotes the idea of deriving the
fundamental limits using functional inequalities, where the central notion is “func-
tions” instead of “sets”. A functional inequality follows from the entropic definition
of an information measure by convex duality. For example, the Gibbs variational
formula follows from the Legendre transform of the relative entropy.
As a first example, we propose a new methodology of deriving converse (i.e. im-
possibility) bounds based on convex duality and the reverse hypercontractivity of
Markov semigroups. This methodology is broadly applicable to network information
theory, and in particular resolves the optimal scaling of the second-order rate for
the previously open “side-information problems”. As a second example, we use the
functional inequality for the so-called Eγ metric to prove non-asymptotic achievability
(i.e. existence) bounds for several problems including source coding, wiretap channels
and mutual covering.
Along the way, we derive general convex duality results leading to a unified treatment
of many inequalities and information measures, such as the Brascamp-Lieb inequality
and its reverse, the strong data processing inequality, hypercontractivity and its
reverse, transportation-cost inequalities, and Renyi divergences. Capitalizing on such
dualities, we demonstrate information-theoretic approaches to certain properties of
functional inequalities, such as the Gaussian optimality. This is the antithesis of the
main thesis (functional approaches to information theory).
Acknowledgements
First, I thank my advisors Professor Paul Cuff and Professor Sergio Verdu for their
unceasing guidance and support over the past five years. I was new to information
theory at the start of my PhD, and Professor Cuff was generous with his time in
helping me get started in this field. The puzzle meetings with Professor Cuff
and other students were a great source of diversion and inspiration during our graduate
life. Professor Verdu demonstrated to us the beauty of research and the charisma of
teaching. He encouraged us to delve deeply into serious research topics, revise papers
carefully, and work hard even on holidays, by doing so himself.
Thanks to Professors Abbe, Braverman, Cuff, and Verdu for serving on my committee;
Professors Chen, Poor, and Verdu for being the readers; and the administrative staff
of ELE for their assistance over the years. Thanks to all faculty members of Princeton
who have taught me classes; what I learnt from them is reflected in every aspect of this
thesis. I am indebted to Princeton University for providing a wonderful environment
for studying and research.
I owe sincere gratitude to my mentors and collaborators: Sanket Satpathy, Sudeep
Kamath, and Wei Yang for numerous illuminating discussions; Professor Emmanuel Abbe
for letting me catch a glimpse of working on coding theory; Professor Thomas Cour-
tade for showing me how to prove Gaussian optimality; Professor Ramon van Handel
for advising me to look into semigroup operators, as opposed to the maximal operators
which I was previously stuck with.
I am thankful to all my friends at Princeton, my colleagues from the information
sciences and systems group, especially my roommates, Sen Tao, Lanqing Yu and Amir
Reza Asadi, for helping me out with many things in life.
Last but not least, I thank my parents for everything, and in particular for
providing me with a place where a large part of this thesis was written.
To my parents.
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
1 Introduction 1
1.1 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Information Measures and Operational Problems . . . . . . . 1
1.1.2 Channel Coding Revisited . . . . . . . . . . . . . . . . . . . . 3
1.1.3 More on Converses in Information Theory . . . . . . . . . . . 7
1.1.4 The Side Information Problems . . . . . . . . . . . . . . . . . 10
1.1.5 Common Randomness Generation . . . . . . . . . . . . . . . . 11
1.2 Concentration of Measure . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.1 The Gromov-Milman Formulation . . . . . . . . . . . . . . . . 13
1.2.2 Is It Natural for Information Theorists? . . . . . . . . . . . . . 15
1.3 Functional Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3.1 In Large Deviation Theory . . . . . . . . . . . . . . . . . . . . 16
1.3.2 Hypercontractivity and Its Reverse . . . . . . . . . . . . . . . 17
1.3.3 Brascamp-Lieb Inequality . . . . . . . . . . . . . . . . . . . . 19
1.4 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 21
2 A Tale of Two Optimizations 23
2.1 The Duality between Functions and Measures . . . . . . . . . . . . . 24
2.2 Dual Formulations of the Forward Brascamp-Lieb Inequality . . . . . 26
2.3 Application to Common Randomness Generation . . . . . . . . . . . 33
2.3.1 The One Communicator Problem: the Setup . . . . . . . . . . 33
2.3.2 A Zero-Rate One-Shot Converse for the Omniscient Helper
Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4 Dual Formulation of a Forward-Reverse Brascamp-Lieb Inequality . . 42
2.4.1 Some Mathematical Preliminaries . . . . . . . . . . . . . . . . 43
2.4.2 Compact X . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.4.3 Noncompact X . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.4.4 Extension to General Convex Functionals . . . . . . . . . . . . 59
2.5 Some Special Cases of the Forward-Reverse Brascamp-Lieb Inequality 61
2.5.1 Variational Formula of Renyi Divergence . . . . . . . . . . . . 62
2.5.2 Strong Data Processing Constant . . . . . . . . . . . . . . . . 64
2.5.3 Loomis-Whitney Inequality and Shearer’s Lemma . . . . . . . 66
2.5.4 Hypercontractivity . . . . . . . . . . . . . . . . . . . . . . . . 68
2.5.5 Reverse Hypercontractivity (Positive Parameters) . . . . . . . 69
2.5.6 Reverse Hypercontractivity (One Negative Parameter) . . . . 70
2.5.7 Transportation-Cost Inequalities . . . . . . . . . . . . . . . . . 71
3 Smoothing 74
3.1 The BL Divergence and Its Smooth Version . . . . . . . . . . . . . . 74
3.2 Upper Bounds on dδ . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.2.1 Discrete Case . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.2.2 Gaussian Case . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.2.3 Continuous Density Case . . . . . . . . . . . . . . . . . . . . . 90
3.3 Application to Common Randomness Generation . . . . . . . . . . . 91
3.3.1 Second-order Converse for the Omniscient Helper Problem . . 91
4 The Pump 96
4.1 Prelude: Binary Hypothesis Testing . . . . . . . . . . . . . . . . . . . 97
4.1.1 Review of the Blowing-Up Method, and Its Second-order Sub-
optimality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.1.2 The Fundamental Sub-optimality of the Set-Inequality Ap-
proaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.1.3 The New Pumping-up Approach . . . . . . . . . . . . . . . . . 102
4.2 Optimal Second-order Fano’s Inequality . . . . . . . . . . . . . . . . . 107
4.2.1 Bounded Probability Density Case . . . . . . . . . . . . . . . 107
4.2.2 Gaussian Case . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.3 Optimal Second-order Change-of-Measure . . . . . . . . . . . . . . . 114
4.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.4.1 Output Distribution of Good Channel Codes . . . . . . . . . . 124
4.4.2 Broadcast Channels . . . . . . . . . . . . . . . . . . . . . . . . 125
4.4.3 Source Coding with Compressed Side Information . . . . . . . 126
4.4.4 One-communicator CR Generation . . . . . . . . . . . . . . . 131
4.5 About Extensions to Processes with Weak Correlation . . . . . . . . 136
5 More Dualities 137
5.1 Air on g* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.1.1 Review of the Image-size Problem . . . . . . . . . . . . . . . . 138
5.1.2 g* and the Optimal Second-order Image Size Lemmas . . . . . 141
5.1.3 Relationship between g and g* . . . . . . . . . . . . . . . . . . 151
5.2 Variations on Eγ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
5.2.1 Eγ: Definition and Dual Formulation . . . . . . . . . . . . . . 155
5.2.2 The Relationship of Eγ with Excess Information and Smooth
Renyi Divergence . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.2.3 Eγ-Resolvability . . . . . . . . . . . . . . . . . . . . . . . . . . 170
5.2.4 Application to Lossy Source Coding . . . . . . . . . . . . . . . 198
5.2.5 Application to Mutual Covering Lemma . . . . . . . . . . . . 202
5.2.6 Application to Wiretap Channels . . . . . . . . . . . . . . . . 207
6 The Counterpoint 225
6.1 Data Processing, Tensorization and Convexity . . . . . . . . . . . . . 226
6.1.1 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 226
6.1.2 Tensorization . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.1.3 Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
6.2 Gaussian Optimality in the Forward Brascamp-Lieb Inequality . . . . 229
6.2.1 Optimization of Conditional Differential Entropies . . . . . . . 232
6.2.2 Optimization of Mutual Informations . . . . . . . . . . . . . . 244
6.2.3 Optimization of Differential Entropies . . . . . . . . . . . . . . 246
6.3 Gaussian Optimality in the Forward-Reverse Brascamp-Lieb Inequality 247
6.4 Consequences of the Gaussian Optimality . . . . . . . . . . . . . . . . 258
6.4.1 Brascamp-Lieb Inequality with Gaussian Random transforma-
tions: an Information-Theoretic Proof . . . . . . . . . . . . . . 258
6.4.2 Multi-variate Gaussian Hypercontractivity . . . . . . . . . . . 260
6.4.3 T2 Inequality for Gaussian Measures . . . . . . . . . . . . . . 263
6.4.4 Wyner’s Common Information for m Dependent Random Vari-
ables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
6.4.5 Key Generation with an Omniscient Helper . . . . . . . . . . 267
6.5 Appendix: Existence of Maximizer in Theorem 6.2.1 . . . . . . . . . . 270
7 Conclusion and Future Work 276
Bibliography 278
List of Tables
1.1 Comparison of previous converse proof techniques . . . . . . . . . . . 10
4.1 Comparison of the basic ideas in BUL and the new methodology . . . 107
List of Figures
1.1 CR generation with one-communicator . . . . . . . . . . . . . . . . . 12
2.1 Diagrams for Theorem 2.4.5. . . . . . . . . . . . . . . . . . . . . . . . 51
2.2 The “partial relations” between the inequalities discussed in this chap-
ter with respect to implication. . . . . . . . . . . . . . . . . . . . . . 61
2.3 Diagram for hypercontractivity (HC) . . . . . . . . . . . . . . . . . . 68
2.4 Diagram for reverse HC . . . . . . . . . . . . . . . . . . . . . . . . . 69
2.5 Diagram for reverse HC with one negative parameter . . . . . . . . . 70
4.1 Schematic comparison of 1_A, 1_{A^{nt}}, and T_t^{⊗n} 1_A, where 1_A is the
indicator function of a Hamming ball. . . . . . . . . . . . . . . . . . . 106
4.2 Illustration of the action of T_{x^n, t}. The original function (an indicator
function) is convolved with a Gaussian measure and then dilated (with
center x^n). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.3 A set A ⊆ Y and its “preimage” under Q_{Y|X} in X . . . . . . . . . . 114
4.4 Source coding with compressed side information . . . . . . . . . . . . 126
5.1 The tradeoff between the sizes of the two images. . . . . . . . . . . . 139
5.2 E_γ(P‖Q) as a function of γ, where P = Ber(0.1) and Q = Ber(0.5). . . 157
5.3 Setup for channel resolvability. . . . . . . . . . . . . . . . . . . . . . . 173
5.4 The wiretap channel . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
6.1 Key generation with an omniscient helper . . . . . . . . . . . . . . . 267
Chapter 1
Introduction
1.1 Information Theory
1.1.1 Information Measures and Operational Problems
The scope of a research field is not only defined by its very name, but also shaped
by the people who left their mark on it. Shannon’s ground-breaking 1948 paper [1]
characterized the capacity of transmitting messages through a channel in terms of
the maximum mutual information. Since then, information theory has been closely
associated with communication engineering, even though the ideas and the analytical
tools from information theory have found applications in many other research fields,
such as physics, statistics, computer science, networking, control theory and financial
engineering [2][3][4].
Roughly speaking, the research by Shannon and his successors may be split into
the following two categories:
• Information measures. Just as physicists are intrigued by the notion of a space,
information theorists are perpetually obsessed with finding a good notion of an
information measure, which is a quantity defined for a probability distribution
that characterizes, say, the complexity of a random object, or the correlation
between random objects. Examples of information measures include the relative
entropy, mutual information, common information [5][6], the strong data pro-
cessing coefficient [7][8][9], and the maximal correlation coefficient [10][11][12].
A good definition of an information measure entails certain mathematical con-
veniences while also answering interesting operational questions.
• Operational problems. Once a practical system, say from communication engi-
neering, is formulated as an abstract model, it then becomes a well-posed math-
ematical question to determine whether an operational task (e.g. data transmis-
sion) can be performed for a given parameter range of the model. The critical
value of the parameter is called the fundamental limit, which the information
theorists seek to express or bound in terms of the information measures. Ex-
amples of operational problems in communication engineering include channel
coding [1], source coding [13], randomness generation [14][15][16], their gener-
alizations to network settings [17][18], or their variants [19][20][21][22][23][24].
Research in the above two categories is not independent. Firstly, knowing the
properties of information measures certainly helps in computing the fundamental limits
of operational problems. Secondly, the study of operational problems often serves
as a source of motivation and ideas for proposing new information measures, or for the
proofs of information-theoretic inequalities.
This thesis investigates the functional representations¹ (sometimes called the
variational formulae) of various information measures, ubiquitous in information theory,
and their implications on the fundamental limits of operational problems. The most
basic example of such a functional representation is the Gibbs variational formula
¹In the mathematical literature a “functional” usually refers to a mapping from a vector space to the reals. In the scope of this thesis, this vector space is the set of all test functions (e.g. bounded continuous real-valued functions) on an alphabet.
(also known as the Donsker-Varadhan formula) of the relative entropy,

    D(P‖Q) = sup_f { ∫ f dP − log ∫ exp(f) dQ },    (1.1)
where P and Q are probability distributions and f is a real-valued measurable func-
tion, all on the same measurable space. As simple and well-known as (1.1) is, it
may be surprising that a new, yet powerful and canonical methodology of attack-
ing a wide range of operational problems in communication engineering lies beneath
these functional representations. A main objective of this thesis is to introduce this
new methodology, and apply it to certain problems regarded as challenging or open
by some researchers in the community. As another example, the Eγ metric to be
introduced in Section 5.2 has the following functional representation:
    E_γ(π‖Q) = sup_{f : 0 ≤ f ≤ 1} { ∫ f dπ − γ ∫ f dQ }.    (1.2)
While there are other ways of deriving (1.2), a very illuminating viewpoint, which
also matches the philosophy of this thesis, is to view E_γ(·‖Q) as the Legendre-Fenchel
dual of the convex functional

    f ↦ { γ ∫ f dQ,  if 0 ≤ f ≤ 1 a.e.;  +∞,  otherwise. }    (1.3)
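These two variational formulas are easy to check numerically on a finite alphabet. The sketch below (with made-up distributions, not from the thesis) verifies that the supremum in (1.1) is attained at f = log(dP/dQ), and that the optimizer in (1.2) is the indicator function of the set {P > γQ}:

```python
import math
import random

random.seed(0)
P = [0.5, 0.3, 0.2]  # illustrative distributions on a 3-letter alphabet
Q = [0.2, 0.3, 0.5]

# Relative entropy D(P||Q) from its entropic definition
D = sum(p * math.log(p / q) for p, q in zip(P, Q))

def gibbs(f):
    # Gibbs/Donsker-Varadhan functional of (1.1): int f dP - log int exp(f) dQ
    return sum(p * fi for p, fi in zip(P, f)) - math.log(
        sum(q * math.exp(fi) for q, fi in zip(Q, f)))

f_star = [math.log(p / q) for p, q in zip(P, Q)]  # optimizer of (1.1)
assert abs(gibbs(f_star) - D) < 1e-12
for _ in range(100):  # any other test function gives a smaller value
    assert gibbs([random.gauss(0, 1) for _ in range(3)]) <= D + 1e-12

# E_gamma of (1.2): the optimal 0 <= f <= 1 is the indicator of {P > gamma*Q},
# so E_gamma(P||Q) = sum_x max(P(x) - gamma*Q(x), 0)
gamma = 1.5
E_gamma = sum(max(p - gamma * q, 0.0) for p, q in zip(P, Q))
f_opt = [1.0 if p > gamma * q else 0.0 for p, q in zip(P, Q)]
val = sum(fi * (p - gamma * q) for fi, p, q in zip(f_opt, P, Q))
assert abs(val - E_gamma) < 1e-12
```

The convex-duality viewpoint is visible directly here: perturbing f away from the optimizers can only decrease the objectives.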
1.1.2 Channel Coding Revisited
To get a glimpse of the above ideas through a concrete example, let us recall the data
transmission (or channel coding) problem considered by Shannon [1], which consists
of the following (in the single-shot formulation):
• The input and output alphabets X and Y, which represent the sets of possible
values of the input and output signals for a given communication channel. The
communication channel is modeled as a random transformation (or transition
probability) P_{Y|X}.
• A set of messages M to be transmitted through the channel. The semantic meaning
of the messages is irrelevant to the solution of the operational problem, so it is
without loss of generality to consider M = {1, ..., M} for some integer M.
• The encoder is a mapping from M to X, which can be represented by a codebook
consisting of codewords c_1, ..., c_M ∈ X.
• The decoder is a rule for reconstructing a message m ∈ M based on the
observation y ∈ Y. A deterministic decoder can be specified by disjoint sets
D_1, ..., D_M ⊆ Y. A stochastic decoder can be represented by nonnegative
functions f_1, ..., f_M, where f_m : Y → [0, 1] gives the probability of declaring
message m upon observing y; hence ∑_{m=1}^{M} f_m = 1 (i.e. (f_m)_{m=1}^{M}
is a partition of unity).
The goal is to find the tradeoff between M and the maximum error probability²

    ε := 1 − min_{1 ≤ m ≤ M} P_{Y|X=c_m}[D_m].    (1.4)
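As a toy numerical illustration (not from the thesis), the sketch below evaluates (1.4) for a 3-fold repetition code over the product of three binary symmetric channels, with the deterministic decoding sets D_1, D_2 given by majority vote:

```python
from itertools import product

p = 0.1  # BSC crossover probability (illustrative value)
codebook = {1: (0, 0, 0), 2: (1, 1, 1)}  # M = 2 messages, X = {0,1}^3

def decode(y):
    # Majority vote; the decoding sets D_1, D_2 partition Y = {0,1}^3
    return 1 if sum(y) <= 1 else 2

def channel_prob(y, x):
    # P_{Y|X=x}(y) for the 3-fold product of BSC(p)
    prob = 1.0
    for yi, xi in zip(y, x):
        prob *= (1 - p) if yi == xi else p
    return prob

# Maximum error probability of (1.4): eps = 1 - min_m P_{Y|X=c_m}[D_m]
eps = 1 - min(
    sum(channel_prob(y, c) for y in product((0, 1), repeat=3) if decode(y) == m)
    for m, c in codebook.items()
)
# For the repetition code this equals 3 p^2 (1-p) + p^3
assert abs(eps - (3 * p**2 * (1 - p) + p**3)) < 1e-12
```

By symmetry both codewords have the same conditional error probability here, so the maximum and average error criteria coincide for this toy code.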
Of course, in general such a problem does not have a simple explicit answer. Shannon
considered a special case called the discrete stationary memoryless channel, where
X and Y are finite sets and the channel can be used repeatedly n times; in other
²To streamline the presentation we consider the maximum error probability here rather than the more popular average error probability. By a simple expurgation argument (see e.g. [25]) it is known that the choice between the two error probability criteria is irrelevant for the fundamental limit in an asymptotic sense.
words, one considers³

    X ← X^n,    (1.5)
    Y ← Y^n,    (1.6)
    P_{Y|X} ← P_{Y|X}^{⊗n}.    (1.7)
Then it is known [1][26] that, for any fixed ε ∈ (0, 1), one can send at most⁴

    M = exp(n(C + o_ε(1)))    (1.8)
messages as n → ∞, where

    C := max_{P_X} I(X; Y)    (1.9)
is the maximum mutual information over all input distributions. To establish such
a result, one needs to prove lower and upper bounds on M , which are called the
achievability and the converse, respectively. For the converse part, the classical Fano’s
inequality claims that

    I(X^n; Y^n) ≥ (1 − ε) log M − h(ε),    (1.10)

where X^n equiprobably takes the values c_1, ..., c_M, and Y^n is the corresponding
output. Information theorists call (1.10) a weak converse because it only shows
that lim_{ε↓0} lim_{n→∞} (1/n) log M ≤ C for the optimal codebook. In contrast, the classical
blowing-up lemma argument [27] (see Section 4.1.1 for an introduction) strengthens
³Conventionally, X^n := X × ⋯ × X denotes the Cartesian product. The product channel (tensor product) P_{Y|X}^{⊗n} is defined by P_{Y|X}^{⊗n}(y^n | x^n) := ∏_{i=1}^{n} P_{Y|X}(y_i | x_i) for any x^n ∈ X^n and y^n ∈ Y^n.
⁴The notation φ(ε, n) = o_ε(f(n)) means that lim_{n→∞} φ(ε, n)/f(n) = 0 for any ε > 0.
(1.10) to
    I(X^n; Y^n) ≥ log M − o_ε(√(n log n)).    (1.11)
Note that (1.11) implies

    M ≤ exp(n(C + o_ε(1))),    (1.12)

which is called a strong converse. However, the o_ε(1) term in (1.12) does not yet
capture the correct order of the second-order term. In this thesis we propose a new
converse method, further strengthening (1.12) to

    I(X^n; Y^n) ≥ log M − O(√(n log(1/(1−ε)))),    (1.13)

which, as we will see, is sharp both as ε ↑ 1 and as n → ∞! Although there are other
methods for obtaining sharp converse bounds for channel coding, to our knowledge
there was no previous method for proving the sharp Fano’s inequality in (1.13), which
is a versatile tool immediately applicable to a number of other operational problems
in network information theory.
The fundamental limits for most operational problems in network information
theory can be expressed in terms of the linear combinations of information measures,
which admit functional representations (analogous to the Gibbs variational formula)
via convex duality, so we can play a similar game. For certain network information
theory problems, we thus obtain by far the only existing method for a converse bound
which is sharp as n → ∞.
1.1.3 More on Converses in Information Theory
While the early work in information theory centered around first-order asymptotic
results, such as (1.8), information theorists have always been striving for improve-
ments:
• Non-asymptotic bounds where the blocklength n is assumed to be fixed. For
example, Fano’s inequality [28] provides a simple, though not very strong, non-
asymptotic bound. While traditionally information theory has focused on the
stationary memoryless settings, it is certainly of practical importance to con-
sider processes with memory, or sources and channels with only one realization
(as opposed to repeated independent realizations). In fact the latter setting
is often adopted in the computer science literature, for problems in the inter-
section of theoretical computer science and information theory (see e.g. [29]).
Moreover, even in the stationary memoryless setting, it is sometimes more de-
sirable to consider the scenario where the alphabet size or the number of users
is large compared to the number of channel/source realizations [30][31][32], per-
haps even more so in the so-called “big data” and “IoT” era. In those cases,
the traditional n → ∞ assumption appears inadequate, while a versatile non-
asymptotic bound may be applicable to a more relevant regime of asymptotics.
• More precise asymptotics. For example, in channel coding, it is possible to
improve (1.8) to the precise second-order asymptotic result:
    M = exp(nC − √(nV) Q⁻¹(ε) + o(√n)),    (1.14)

where Q(·) denotes the Gaussian tail probability and V is the channel dispersion.
The study of the second-order asymptotics in information theory was initiated by
Strassen [33], and recently
flourished thanks to the work of Polyanskiy et al. [34], Hayashi [35], Kostina et
al. [36], and others. When n is not too small, a good approximation of M can
be obtained from (1.14) (sometimes significantly better than the error exponent
approximation [34]). This is relevant to the practical setting where a short delay,
hence a short blocklength of the code, is demanded for data transmission.
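To see what (1.14) gives numerically, the sketch below evaluates the normal approximation (dropping the o(√n) term) for a binary symmetric channel, using the standard facts that C = log 2 − h(p) and that the dispersion of the BSC is V = p(1−p) log²((1−p)/p); the parameter values are illustrative:

```python
import math

def h(p):
    # Binary entropy in nats
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def Q_tail(x):
    # Gaussian tail probability Q(x) = P[N(0,1) > x]
    return 0.5 * math.erfc(x / math.sqrt(2))

def Q_inv(eps, lo=-10.0, hi=10.0):
    # Inverse of the (decreasing) Gaussian tail, by bisection
    for _ in range(200):
        mid = (lo + hi) / 2
        if Q_tail(mid) > eps:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# BSC(p): capacity C = log 2 - h(p), dispersion V = p(1-p) log^2((1-p)/p)
p, eps, n = 0.11, 1e-3, 1000
C = math.log(2) - h(p)
V = p * (1 - p) * math.log((1 - p) / p) ** 2

# Normal approximation to log M, cf. (1.14), without the o(sqrt(n)) term
logM = n * C - math.sqrt(n * V) * Q_inv(eps)
rate_bits = logM / (n * math.log(2))  # rate in bits per channel use
```

At blocklength n = 1000 the approximate rate is noticeably below the capacity of 0.5 bits per channel use, quantifying the backoff that first-order asymptotics miss.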
The key to achieving the above goals is to find new and better ways of proving
achievability and converse bounds. For example, suppose that the goal is to determine
the second-order asymptotics of a certain operational problem in network information
theory. The achievability (i.e. existence) part is rather well understood: for almost
all⁵ problems from network information theory with known first-order asymptotics,
an achievability bound with an O(√n) second-order term can be derived [37][38].
Indeed, since the achievability part is constructive, one can usually use random coding
to produce a joint distribution which is close to the ideal distribution appearing in the
formula of the first-order fundamental limit, so the second-order analysis is generally
reduced to Berry-Esseen central limit arguments. The problem of producing a desired
joint distribution is sometimes called coordination [39].
On the other hand, the converse (i.e. impossibility) part appears subtler since
the possibility of success by any code must be excluded. It is recognized within
the community that the converse part is usually the bottleneck for nailing down the
exact second-order asymptotics (see e.g. [40, Section V]). Below, we review the three
most influential ideas for proving converse results in information theory:
1. Information spectrum method: given probability measures P and Q on the same
alphabet, the information spectrum refers to the cumulative distribution func-
tion of the relative information log(dP/dQ)(X), where X ~ P. It is possible to bound
the error probability (maximum or average) of channel coding in terms of the
information spectrum [26][41]. The meta-converse approach (based on binary
hypothesis testing) and the Renyi divergence approach [42] are both closely
related to the information spectrum approach: it is easy to see that the knowl-
⁵Here we do not discuss variants such as universal coding or zero-error coding.
edge of the information spectrum is equivalent to the knowledge of the binary
hypothesis testing tradeoff region for P and Q (see e.g. [43]), and the Renyi
divergence between P and Q is recoverable from the information spectrum. Be-
sides being very general and not limited to the stationary memoryless systems,
the information spectrum method shows that the o(n) term in (1.12) can be
improved to O(√n), which is in fact the optimal central limit theorem rate (see
e.g. [34]). Other information-theoretic problems⁶ with known information spec-
trum converses include lossy source coding [36], Slepian-Wolf coding, multiple
access channels [44], and broadcast channels [45].
2. Type class analysis [8]: this method is restricted to, but very powerful for,
the discrete memoryless systems. The idea is to first look at, instead of a
stationary memoryless distribution, the equiprobable distribution on sequences
with the same empirical distribution (a.k.a. type). In other words, imagine that
a genie reveals to the encoder and the decoder the information of the type,
and a different code can be constructed for each type. The evaluation of the
probability then becomes a combinatorial (counting) problem, which has been
referred to as the “skeleton” or the “combinatorial kernel” of the information
theoretic problem [46]. It appears that the O(√n) second-order converses for
those problems proved by the information spectrum methods can also be proved
by the method of types (see [47] for the example of lossy source coding); this is
not surprising, since the type determines the value of the relative information,
and hence may be viewed as a finer resolution than the information spectrum.
For those problems, it is generally easy to see (by a combinatorial covering
or packing argument) that the error probability given the type of the source or
channel realization is either very close to 0 or 1, hence the task of estimating the
⁶By an “information-theoretic problem” we mean an operational task such as channel coding or source coding.
error probability in the O(√n) second-order rate regime is reduced to calculating
the probability of those “bad types”, which is an exercise in the central limit
theorem and the Taylor expansion of the information quantities. The type class
analysis has also been used extensively in [8] for deriving error exponents.
3. Blowing-up lemma (BUL): this method was introduced in [27] and exploited
extensively in the classical book [8]; see also the recent survey [48]. We review
the BUL method in Section 4.1.1. BUL draws on the nonlinear measure concen-
tration mechanism, and has been successful in establishing the strong converse
property in a wide range of multiuser information-theoretic problems (including
all settings in [8] with known single-letter⁷ rate regions; see [8, Ch. 16,
Discussion]).
                          Information spectrum   Type class   Blowing-up (image-size
                          (meta-converse)        analysis     characterization)
  Beyond |Y| < ∞?         yes                    no           no
  Second-order term       optimal                optimal      O_ε(√n log^{3/2} n)
  Extension to multiuser  selected               selected     all source/channel networks
                                                              with known first-order region

Table 1.1: Comparison of previous converse proof techniques
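The measure-concentration mechanism behind the blowing-up method can be illustrated on the Hamming cube {0,1}^n under the uniform measure (the parameters below are illustrative): a Hamming ball A of small probability, after enlarging by t = O(√n) in Hamming distance, has probability close to one.

```python
import math

def binom_cdf(n, p, k):
    # P[Bin(n, p) <= k], by accumulating the pmf iteratively
    total, term = 0.0, (1 - p) ** n  # term = P[Bin(n, p) = 0]
    for i in range(k + 1):
        total += term
        # pmf ratio: C(n, i+1)/C(n, i) * p/(1-p)
        term *= (n - i) / (i + 1) * p / (1 - p)
    return total

n = 1000
# A = {x : sum(x) <= n/2 - 2*sqrt(n)} is a Hamming ball of small probability
thr = n // 2 - int(2 * math.sqrt(n))
pA = binom_cdf(n, 0.5, thr)

# Its t-blowup {x : d_H(x, A) <= t} is the same ball enlarged by t coordinates
t = int(3 * math.sqrt(n))
pBlowup = binom_cdf(n, 0.5, thr + t)

assert pA < 0.01 and pBlowup > 0.9
```

A small set thus cannot stay small after an O(√n) blow-up, which is exactly the leverage that BUL converse proofs exploit; the price is the suboptimal logarithmic factor recorded in Table 1.1.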
1.1.4 The Side Information Problems
A class of problems involving side information (also called “helper” in [8]) are known
to be challenging in the converse part. An archetypical example is source coding
⁷A single-letter expression (as opposed to a multi-letter expression) for a stationary memoryless system is a formula involving only the per-letter distribution.
with compressed side information. Originally studied by Wyner [49] and by Ahlswede,
Gacs and Korner [27] in the 1970s, this problem is the first instance in network
information theory where the single-letter expression of the rate region requires an auxiliary
random variable. Other examples of the side information problems include common
randomness generation under the one-way communication protocol [15] and some
source and channel networks considered in [25]. For the side information problems,
BUL remained hitherto the only method for establishing the strong converse, hence
the precise second-order asymptotics has been recognized as a tough problem [50,
Section 9.2].
One reason such problems require BUL might be that their “combinatorial kernels”
(see the paragraph explaining the type class analysis above) are not easily solved,
hence it is not clear that the error probability can be estimated by simply computing
the error probability of certain “bad types”. We remark that the Gray-Wyner network
[51], although also requiring an auxiliary random variable in the expression of its
single-letter region, does not pose the same degree of difficulty as the side information
problems. In fact the
exact second-order asymptotics of the Gray-Wyner network can be derived using the
type class analysis described above [52].
1.1.5 Common Randomness Generation
Common randomness (CR) generation refers to the operational problem of generating
shared random bits among terminals. Although the converse methodology proposed
in this thesis is broadly applicable, we will use examples from the common random-
ness generation problem, simply as a consequence of the author’s personal research
background.
The scenario depicted in Figure 1.1, which we call CR generation with one com-
municator, is a typical example of the side information problem. The terminals T0,
. . . , Tm observe correlated sources. Terminal T0 is allowed to communicate to the
[Figure: terminal T_0 observes X and sends messages W_1, ..., W_m to terminals T_1, ..., T_m, which observe Y_1, ..., Y_m; the terminals then produce keys K, K_1, ..., K_m.]

Figure 1.1: CR generation with one communicator
other terminals, after which all terminals aim to produce a common random number
K. The goal is to characterize the tradeoff between the communication rates and
the rate at which the random numbers are produced, under a lower bound on the
probability that the random numbers produced by the terminals agree. The precise
formulation of the problem is given in Section 2.3.1.
The next three chapters of the thesis (Chapters 2-4) form a trilogy that offers the
three key ingredients for proving an O(√n) second-order converse for this problem
in the stationary memoryless setting, which, to the best of the author's knowledge,
was beyond the reach of previous techniques. More precisely, the three ingredients
allow one to accomplish the following, respectively:
• Chapter 2 presents equivalences between entropic optimizations and functional
optimizations, which allow one to derive a sharp converse bound for the case
where Y1, . . . , Ym are functions of X and when the communication rates are close
to zero.
• Chapter 3 shows that the behavior of an entropic optimization under a per-
turbation on the underlying measure is approximated by a mutual information
optimization. This forms the basis for deriving an O(√n) second-order converse bound when the vanishing communication rates assumption is dropped.
• To show an O(√n) second-order converse bound when further dropping the assumption that Y1, . . . , Ym are functions of X, we need the pumping-up technology in Chapter 4.
On the other hand, previous works have obtained one-shot converses using smooth Rényi entropy [53] or the meta-converse idea [54][55], whose asymptotic tightness is achieved in the other extreme of limited correlated sources but unlimited communication. Moreover, Mossel [56] used reverse hypercontractivity to bound the probability of agreement when there is no communication at all for binary symmetric sources, and derived the tight asymptotics when the number of terminals grows and the probability of agreement vanishes.
1.2 Concentration of Measure
1.2.1 The Gromov-Milman Formulation
Concentration of measure is a collection of tools and results from analysis and prob-
ability theory that have proved successful in many areas of pure and applied math-
ematics [57][48]. Concentration of a real valued random variable simply means that
the random variable is close to its “center” (e.g. its mean or its median) with high
probability.
The relevance of concentration of measure to non-asymptotic information theory
is easy to see: consider the example of channel coding. By the information spectrum
bounds [41], the error probability is essentially

\[
\varepsilon \approx \mathbb{P}[\imath_{X;Y}(X;Y) < \log M] \tag{1.15}
\]

where M denotes the number of messages, ı_{X;Y}(·;·) is a quantity called the information density, and X and Y are the input and the output, respectively. Note that the expectation of ı_{X;Y}(X;Y) equals the mutual information I(X;Y), hence bounding ε can be reduced to the concentration of ı_{X;Y}(X;Y).
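As an illustrative sketch (not from the thesis: the binary symmetric channel, its crossover probability, and the sample size below are arbitrary choices of ours), the information density of a BSC with uniform input has a simple closed form, and a Monte Carlo sample shows its empirical mean concentrating around I(X;Y):

```python
import math
import random

# Information density of a binary symmetric channel (BSC) with uniform input.
# Its mean is the mutual information I(X;Y) = 1 - h(delta); a Monte Carlo
# sample illustrates the concentration of the empirical mean around I(X;Y).
delta = 0.11
random.seed(0)

def i_density(x, y):
    # i_{X;Y}(x;y) = log2[ P(y|x) / P(y) ], and P(y) = 1/2 under uniform input.
    p_y_given_x = 1 - delta if x == y else delta
    return math.log2(p_y_given_x / 0.5)

h = -delta * math.log2(delta) - (1 - delta) * math.log2(1 - delta)
I = 1 - h  # mutual information of the BSC

n = 200_000
total = 0.0
for _ in range(n):
    x = random.randint(0, 1)
    y = x if random.random() > delta else 1 - x
    total += i_density(x, y)
mean = total / n
print(round(I, 4), round(mean, 4))
```

The empirical mean lands within a few thousandths of I(X;Y), which is the concentration that (1.15) exploits.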
Many standard tail bounds in statistics, such as Hoeffding’s inequality, Chernoff’s inequality, and Bernstein’s inequality, capture the concentration of sums of i.i.d. real-valued random variables. They are sometimes called “linear concentration inequalities” since the summation is linear.
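As a hedged numerical illustration (our own toy example; the parameters n, t, and the Bernoulli distribution are arbitrary), one can compare an empirical tail probability against the Hoeffding bound P(S_n/n − µ ≥ t) ≤ exp(−2nt²):

```python
import math
import random

# Hoeffding's inequality for i.i.d. X_i in [0, 1]:
#   P( S_n/n - mu >= t ) <= exp(-2 n t^2).
# Compare an empirical Bernoulli(1/2) tail with the bound.
random.seed(1)
n, t, p = 100, 0.1, 0.5
trials = 50_000
exceed = 0
for _ in range(trials):
    s = sum(1 for _ in range(n) if random.random() < p)
    if s / n - p >= t:
        exceed += 1
empirical = exceed / trials
bound = math.exp(-2 * n * t * t)
print(empirical, round(bound, 4))
```

The empirical tail (about 0.03 here) sits comfortably below the bound exp(−2) ≈ 0.135; the bound is loose but dimension-free.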
In the more general case, in which these random variables are not real valued,
there are no notions of “linear functions”, “median” or “mean” of these random
variables. However, intuitively we should still be able to describe the concentration
phenomenon of, say, a random variable taking values in a metric space. Indeed, there
are two widely accepted formulations of the concentration of measure phenomenon,
equivalent in a precise sense, for general metric measure spaces (see for example [57][58]
for a proof of the equivalence):
1. Any 1-Lipschitz function of the random variable deviates from its median with
small probability.
2. The blow-up of any set with probability at least 1/2 has a high probability.
Here, the blow-up of a set means the collection of all points whose distance to the set does not exceed a given threshold. Setting the threshold close to zero, we see
that the second formulation above implies an isoperimetric property (i.e. relationship
between the surface area and the volume of the set). Conversely, isoperimetry implies
concentration of measure by integration. While isoperimetry had long been considered merely a curiosity in geometry, Milman showed in the early 1970s [59] that isoperimetry on the sphere can be used to give a simpler proof of the Dvoretzky theorem, a fundamental result in convex geometry originally conjectured by Grothendieck. Since then, Milman vigorously promoted concentration of measure as a useful tool for attacking serious mathematical problems (see the discussions in [58][60]). Moreover,
Gromov studied concentration of measure on manifolds. Thus the above formulation of concentration of measure has been called the “Gromov-Milman” formulation (see e.g. [58]). Margulis investigated isoperimetry on the hypercube with application to a phase transition problem on random graphs in the 1970s [61]. In information theory, concentration of measure has also been called the blowing-up lemma [27][8].
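To make the blow-up formulation concrete, here is a small illustrative computation (our own toy example, not from the text) on the Hamming cube with the uniform measure: for A = {x : weight(x) ≤ n/2 − 1}, a set of probability just below 1/2, the r-blow-up under Hamming distance is {x : weight(x) ≤ n/2 − 1 + r}, so its probability is a binomial tail that rises rapidly toward 1 as r grows:

```python
import math

# Uniform measure on the Hamming cube {0,1}^n.  The r-blow-up of
# A = {x : weight(x) <= n/2 - 1} is {x : weight(x) <= n/2 - 1 + r},
# so all probabilities reduce to binomial tail sums.
n = 16

def binom_cdf(k, n):
    # P[ weight(X) <= k ] for X uniform on {0,1}^n
    return sum(math.comb(n, j) for j in range(k + 1)) / 2 ** n

p_A = binom_cdf(n // 2 - 1, n)
blowups = [binom_cdf(n // 2 - 1 + r, n) for r in range(4)]
print(round(p_A, 4), [round(p, 4) for p in blowups])
```

Already at r = 3 the blow-up covers roughly 89% of the cube, even though A itself has probability about 0.40: a small-dimensional glimpse of the blowing-up phenomenon.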
Peter Gács recalled the following (see [62]) about how the ideas of concentration of measure in the early 1970s, and in particular the result of Margulis, influenced his work (with Ahlswede and Körner) on the strong converse of the side information problem [27]:
Rudi Ahlswede, at a visit in Budapest, posed a question that I found
solvable. I remembered a Dobrushin seminar presentation of Margulis
about a certain phase transition in random graphs, in particular a certain
isoperimetric theorem used in the proof. This is how the Blowup Lemma
was born. We had a pleasant collaboration with János Körner in the
course of working it into a paper, along with a couple of other results.
In the 1980s, Marton [63] invented a transportation method for proving concentra-
tion of measure, yielding simpler proofs and stronger claims for some previous results
[64]. The heart of this methodology is the equivalence between the transportation-cost
inequality and the concentration inequality, which can also be viewed as an instance
of convex duality (see Section 2.5.7).
1.2.2 Is It Natural for Information Theorists?
Although the blowing-up lemma (the Gromov-Milman formulation of concentration
for metric-measure spaces) has been successfully applied to the strong converses of
many network information theory problems [8], the drawbacks are also evident: as
we alluded to before, the BUL approach does not yield the optimal second-order term, and is limited to the finite alphabet case.8 Perhaps more importantly, it is not logically satisfying for the converse argument to rely on a metric structure imposed on the alphabet, with the metric playing an active role in the proof.
Whether it be channel coding or the side information problem, it is clear that the
fundamental limit does not change if a measure-preserving map (which may very well
distort the metric structure) is applied to the alphabet.
The new converse approach we introduce in this thesis addresses these issues.
In particular, the Gromov-Milman formulation of concentration of measure is
sidestepped, and a metric structure need not be imposed on the alphabet. We now
proceed to review some elements in functional analysis that form the basis of this
new methodology.
1.3 Functional Analysis
1.3.1 In Large Deviation Theory
While most textbooks on information theory first define the relative entropy by

\[
D(P \| Q) := \mathbb{E}\left[ \log \frac{dP}{dQ}(X) \right], \tag{1.16}
\]

where X ~ P, and then prove the variational formula (1.1), Varadhan, the founder of the general theory of large deviations, chose the other way around: he defined the relative entropy via (1.1) as the convex dual (Legendre-Fenchel dual) of the cumulant generating function and then proved its equivalence to (1.16) [65, Section 10].9

8 Although BUL can be extended beyond finite alphabets, the BUL approach to strong converses involves a union bound step which requires finite alphabets.

The
“functional approach” of the present thesis follows the same spirit.
In fact, the idea of defining an object “dually” is very common in mathematics. For
example, in probability theory, some authors prefer to define a probability measure
as a functional on the linear space of continuous functions (i.e. using the claim of
the Riesz-Kakutani theorem as the definition), and then recover the usual definition
(assigning real numbers to measurable sets) and other properties [67][68]. Other
examples include cohomology in topology (dual of homology) and vector fields in
differential geometry (dual of “local functions” on a manifold). Although not the
most intuitive way of defining those objects, such “dual” definitions often turn out
to be more technically convenient in deeper studies of the subject.
1.3.2 Hypercontractivity and Its Reverse
Let L^p(Q_Y) denote the linear space of all measurable functions f equipped with the L^p norm:

\[
\|f\|_{L^p(Q_Y)} := \mathbb{E}^{\frac{1}{p}}\left[ f^p(Y) \right] \tag{1.17}
\]

where Y ~ Q_Y. Let T be a linear operator from L^p(Q_Y) to L^q(Q_X); if10

\[
\|T\|_{L^p(Q_Y) \to L^q(Q_X)} \le 1 \tag{1.18}
\]

for some q ≥ p ≥ 0, we say T is a hypercontraction (the case p = q is called a contraction).

9 While the equivalence of (1.16) and (1.1) follows immediately from the nonnegativity of the relative entropy when the f in (1.1) is assumed to be measurable, in large deviation theory the underlying space is usually assumed to be a metric-measure space (e.g. a Polish space), and the function f is assumed to be bounded and continuous. In that case, the proof of the equivalence of (1.16) and (1.1) (due to Donsker and Varadhan) is more involved [66].

10 Recall that the norm of an operator T is defined as the supremum of the norm of Tf over f whose norm is bounded by 1.
For information theorists, such an operator T arises naturally when a random transformation Q_{Y|X} is considered (in which case T is called a conditional expectation operator): given Q_{Y|X}, the corresponding T maps a function f on 𝒴 to

\[
x \mapsto \mathbb{E}[f(Y) \mid X = x], \tag{1.19}
\]

which is a function on 𝒳, where Y ~ Q_{Y|X=x} conditioned on X = x. For some delicate problems (e.g. the proof of the dual formulation of certain functional inequalities in Chapter 2), it is more technically convenient (and perhaps also more elegant) to define these objects “dually”: as alluded to before, a measure can be thought of as a linear functional on a suitable space of test functions. Thus, a random transformation Q_{Y|X} (yielding a linear mapping taking a measure on 𝒳 to a measure on 𝒴) can be thought of as the dual (or conjugate) of a conditional expectation operator T mapping functions on 𝒴 to functions on 𝒳.
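On a finite alphabet, T is simply the action of the stochastic matrix Q_{Y|X} on the vector of values of f. As a hedged sanity sketch (alphabet sizes, the exponent p, and the random seed are arbitrary choices of ours), one can verify numerically the contraction property ‖Tf‖_{L^p(Q_X)} ≤ ‖f‖_{L^p(Q_Y)} (the case p = q of (1.18)), which follows from Jensen's inequality when Q_Y is the output distribution induced by Q_X and Q_{Y|X}:

```python
import random

# Finite-alphabet conditional expectation operator: (Tf)(x) = E[f(Y) | X = x].
# With Q_Y the output distribution induced by Q_X and Q_{Y|X}, Jensen's
# inequality gives the contraction ||Tf||_{L^p(Q_X)} <= ||f||_{L^p(Q_Y)}, p >= 1.
random.seed(2)
nx, ny, p = 3, 4, 2.5

QX = [random.random() + 0.1 for _ in range(nx)]
QX = [q / sum(QX) for q in QX]
W = [[random.random() + 0.1 for _ in range(ny)] for _ in range(nx)]  # Q_{Y|X}
W = [[w / sum(row) for w in row] for row in W]
QY = [sum(QX[x] * W[x][y] for x in range(nx)) for y in range(ny)]

def norm(f, Q, p):
    return sum(q * abs(v) ** p for q, v in zip(Q, f)) ** (1 / p)

def T(f):
    return [sum(W[x][y] * f[y] for y in range(ny)) for x in range(nx)]

for _ in range(100):
    f = [random.random() for _ in range(ny)]
    assert norm(T(f), QX, p) <= norm(f, QY, p) + 1e-12
print("contraction verified for 100 random functions")
```

Hypercontractivity (q > p) and its reverse are strictly stronger statements and depend on the channel; this sketch only checks the baseline contraction.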
Hypercontractivity was initially considered in theoretical physics [69] and later
in functional analysis [70], theoretical computer science [71], and statistics [72][73].
In information theory, it was introduced by Ahlswede-Gács [7], and more recently received revived interest through the work of Anantharam et al. [9]. It is known [74][75][76] that the functional definition (1.18) is equivalent to an information-theoretic characterization, which we discuss in more detail in Section 2.5.4. The strong data processing constant [25] can be recovered from the region of hypercontractive parameters p and q via a limiting argument [7].
The operator T is said to satisfy reverse hypercontractivity if

\[
\|Tf\|_{L^q(Q_X)} \ge \|f\|_{L^p(Q_Y)}, \quad \forall f \ge 0, \tag{1.20}
\]

for some 0 ≤ q ≤ p ≤ 1. Initially studied by Borell in the Gaussian and Bernoulli cases in the 1970s, reverse hypercontractivity did not find many applications until the recent work of Mossel et al. [77]. Information-theoretic formulations of reverse hypercontractivity were studied in [78][79][80].
In this thesis, we show that the reverse hypercontractivity mechanism gives rise to a general approach to proving strong converses which uniformly improves on the traditional blowing-up methodology. In hindsight, this makes good sense, since literally, the reverse of “contraction” is nothing but “expansion” (i.e. blowing-up). However, this cute observation was not exploited before, even though Ahlswede, Gács and Körner had worked on both hypercontractivity [7] and the strong converse problem [27] around 1976.
We remark that various connections between hypercontractivity (and its reverse) and measure concentration have been discussed in certain special cases such as the Gaussian [81, p. 116][82] or Bernoulli [56] cases. However, our present application of reverse hypercontractivity relies on different arguments, and is also more canonical and convincing than those discussions, in terms of both the generality and the sharpness of our results.
1.3.3 Brascamp-Lieb Inequality
The Brascamp-Lieb inequality is a functional inequality that subsumes as special cases, or closely relates to, many other familiar inequalities, including Hölder’s inequality, the sharp Young inequality, the Loomis-Whitney inequality, hypercontractivity, the entropy power inequality, and the logarithmic Sobolev inequality. Originally studied by Brascamp and Lieb [83] in the 1970s, motivated by problems in particle physics, the Brascamp-Lieb inequality has now found applications in numerous other fields such as convex geometry [84], statistics [85], and theoretical computer science [86].
The connection to information theory was already observed in the paper of Cover
and Dembo [87] in the context of the entropy power inequality. However, the present thesis will investigate a more general connection to information theory via convex duality and its implications, as we explain next.
By convex duality, the (functional form of the) Brascamp-Lieb inequality has an equivalent formulation in terms of a bound on a linear combination of relative entropies. Such an equivalent formulation was observed in the work of Carlen and Cordero-Erausquin [76]. Alternative proofs (without using the Gibbs variational formula) and extensions were recently proposed by Nair [74] and Liu et al. [80]. Similarly, the reverse Brascamp-Lieb inequality [88] also has equivalent information-theoretic formulations; see [79][80].
In high dimensions, we show that a small perturbation of the underlying measure in the Brascamp-Lieb inequality changes the first-order asymptotics to a certain linear combination of mutual informations (instead of a linear combination of relative entropies). Such a perturbation idea generalizes that of the smooth entropy arising in quantum information theory [53]. Thus, we obtain a correspondence
between these mutual informations (which appear in the expression of single-letter
regions) and a functional inequality, which can be used in the strong converse proofs
if we take the functions to be the indicator functions of certain encoding or decoding
sets. In the case of finite alphabets, similar results were derived under the moniker
“image-size characterization” [8], using the monotonicity of relative entropy instead
of convex duality.
The Brascamp-Lieb inequality can be restated as an optimization problem over
some measurable functions. The original work of Brascamp and Lieb [83] proved that
Gaussian functions achieve the optimal value. Since then, various other proofs of the
Gaussian optimality have been found [89][90][75][76][91]. Among them, Lieb’s proof
[89] used the fact that two independent random variables whose sum and difference
are also independent must be Gaussian. Using the same property of the Gaussian dis-
tribution but working with information theoretic inequalities, Geng-Nair [92] proved
the Gaussian optimality in the single-letter region of several network information the-
ory problems. Liu et al. [93] gave information-theoretic proofs of the Brascamp-Lieb
inequality and its extensions with several improvements over the methodology of Lieb
[89] and Geng-Nair [92].
1.4 Organization of the Thesis
The trilogy proposing the functional approach to information-theoretic converses comprises the first three chapters of the thesis. Each of these chapters describes one of the three main ingredients: the convex duality of the functional and entropic optimizations; the behavior of the optimal value under a perturbation of the underlying measure; and the reverse hypercontractivity of Markov semigroups. In each of these chapters we will return to the one-communicator common randomness generation problem introduced in Section 1.1.5, and show how the introduction of the new ingredients gets us closer to the goal of obtaining an optimal O(√n) second-order converse. Towards the end of Chapter 4, we also discuss how to apply the proposed methodology to other problems in network information theory, such as source networks, broadcast channels, and the empirical distribution of good channel codes.
Chapter 2 contains additional results on convex duality which are of independent
interest, but not needed for the applications to common randomness generation.
Chapter 5 presents additional examples of information measures, their dual formulations, and their applications. Eγ is a generalization of the total variation distance. We get a glimpse of the power of Eγ through channel resolvability in the large deviation regime, and applications of the result to source coding, broadcast channels, and wiretap channels. g⋆ is a functional variant of the quantity g in the chapter on “image-size characterizations” in the book of Csiszár-Körner [8]. The proposed g⋆ quantity can substitute for g in the converse proofs of network information theory problems such as Gelfand-Pinsker coding, but the analysis of the former is cleaner and also yields the optimal scaling of the second-order term.
While Chapters 2-5 are concerned with the applications of functional inequalities
to problems in information theory, Chapter 6 offers the counterpoint that entropic
inequalities and standard techniques in information theory provide simpler alterna-
tive approaches to certain functional inequalities, such as establishing the Gaussian
optimality in the Brascamp-Lieb inequality.
Given the successes of functional inequalities in common randomness generation,
source coding and channel coding, it is tempting to extend this paradigm to the secret
key generation problem, which imposes security constraints on common randomness
generation, as well as the interactive common randomness generation problem where
terminals may communicate back-and-forth under a rate constraint. We discuss these
operational problems and the challenges of extending the functional approach to these
settings in Chapter 7.
With the understanding that readers from various backgrounds may be interested
in different aspects of the thesis, we provide the following recommended itineraries
for the first reading:
• Readers interested in strong converse and non-asymptotic information theory
may prefer to look closely at the first four chapters, excluding Sections 2.4-2.5.
• Readers interested in information security, especially common randomness generation or key agreement, may prefer to read Sections 2.3, 3.3, and 4.4.4 for the converse proof for the one-communicator common randomness generation model, and then visit Section 5.2.6 for the application of Eγ to wiretap channels. Moreover, a summary of some other results on secret key generation by the author can be found in Chapter 7.
• Readers interested in information measures or the proof of Gaussian optimal-
ity in the rate regions in network information theory may prefer to focus on
Chapters 2 and 6.
Chapter 2
A Tale of Two Optimizations
It is a functional optimization; it is an entropic optimization. We investigate which viewpoint is better for the various problems under consideration.
This chapter investigates the fundamental convex duality machinery that explains
the connection between the functional optimization and the entropic optimization.
To be concrete, we look at the example of the Brascamp-Lieb inequality from func-
tional analysis. More precisely, we consider a slight extension of the Brascamp-Lieb
inequality tailored to our information-theoretic applications. The entropic formula-
tion of such a functional inequality is proved in Sections 2.1-2.2. The result is then
applied to the common randomness generation problem in Section 2.3 where a zero-
rate single-shot converse bound is derived. The reverse Brascamp-Lieb inequality,
although formally symmetrical to the Brascamp-Lieb inequality, requires more so-
phistication in the proof of the equivalence to the entropic formulation (Section 2.4),
and may be skipped by readers who are only interested in the applications to network
information theory. Section 2.5 reviews several special cases of the Brascamp-Lieb
inequality and its reverse.
2.1 The Duality between Functions and Measures
We first introduce some general notation and definitions used in this thesis. An
alphabet (denoted by X , Y etc.) is a measurable space, that is, a set equipped with a
σ-algebra. The set of all real-valued measurable functions (denoted by f , g etc.) on
a given alphabet forms a linear space. We use Greek letters (e.g. µ) for unnormalized measures, and capital Roman letters (e.g. P and Q) for probability measures. A measure µ defines a linear functional on the space of functions via integration:

\[
f \mapsto \int f \, d\mu, \tag{2.1}
\]
hence the space of measures may be viewed as the dual space (see e.g. [67]) of the space
of “test functions”. Note that this is a general and informal statement, since we have not yet specified what the “test functions” are. When we deal with the reverse Brascamp-Lieb inequality in later sections of this chapter, it is technically necessary to choose a
“nice” space of functions (e.g. bounded continuous functions) so that a certain space of
measures indeed forms the set of all bounded linear functionals, meaning that the space
of measures is the dual of the space of functions in the precise mathematical sense [67].
However, in other places in this thesis, it is generally more convenient to consider the
ensembles of non-negative functions and non-negative measures instead. In this case,
the duality between the two linear spaces no longer holds in the conventional precise
sense of functional analysis (since µ may no longer be bounded as a linear functional),
but the integration (2.1) still always makes sense (possibly equal to +∞) thanks
to the nonnegativity assumption. Hence, except in sections related to the reverse-
type inequalities, no additional assumptions (e.g. a topological structure) need to be
imposed on the alphabet other than being a measurable space.
A fundamental quantity in information theory is the relative entropy. First, given two nonnegative σ-finite measures θ ≪ µ on 𝒳, let us define the relative information as the logarithm of the Radon-Nikodym derivative:

\[
\imath_{\theta \| \mu}(x) := \log \frac{d\theta}{d\mu}(x) \tag{2.2}
\]

where x ∈ 𝒳. Note that there is no assumption about |𝒳|, and recall that the right side of (2.2) is unique up to values on a set of µ-measure zero. Then, the relative entropy between a probability measure P and a σ-finite measure µ on the same measurable space is defined as

\[
D(P \| \mu) := \mathbb{E}\left[ \imath_{P \| \mu}(X) \right] \tag{2.3}
\]

where X ~ P, if P ≪ µ, and infinity otherwise.
A fundamental quantity in functional analysis is the norm: for p ∈ (0, ∞) and a real-valued measurable function f on 𝒳, define the p-th norm

\[
\|f\|_{L^p(\mu)} := \left( \int |f|^p \, d\mu \right)^{\frac{1}{p}}. \tag{2.4}
\]

Sometimes we abbreviate ‖f‖_{L^p(µ)} as ‖f‖_p, or as ‖f(X)‖_p if µ is a probability measure and X ~ µ. Note that if µ is a probability measure, and g is a real-valued function on 𝒳, then

\[
\log \|\exp(g)\|_p = \frac{1}{p} \log \mathbb{E}[\exp(p\, g(X))] \tag{2.5}
\]
\[
= \frac{1}{p} \Lambda_{g(X)}(p) \tag{2.6}
\]

where Λ_{g(X)}(p) denotes the cumulant generating function of the real-valued random variable g(X).
Using a bit of convex analysis (see e.g. [94]) one can verify that the convex functional

\[
P \mapsto \frac{1}{p} D(P \| \mu) \tag{2.7}
\]

is the Legendre dual of the convex functional

\[
g \mapsto \log \|\exp(g)\|_p. \tag{2.8}
\]

This basic fact hints that functional inequalities in functional analysis (typically involving integrals and norms) have dual correspondences with inequalities in information theory (typically involving the relative entropy). In the next section, we explore a very canonical instance of such a correspondence, in the context of the Brascamp-Lieb inequality.
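As a hedged finite-alphabet sanity check of this duality (the alphabet size, the exponent p, and the seed below are arbitrary choices of ours), one can verify numerically that the map g ↦ E_P[g] − log‖exp(g)‖_{L^p(µ)} is maximized at g⋆ = (1/p) log(dP/dµ), where its value equals (1/p)D(P‖µ):

```python
import math
import random

# Scaled Gibbs variational formula on a finite alphabet: for probability
# measures P << mu and p > 0,
#   (1/p) D(P||mu) = sup_g { E_P[g] - log ||exp(g)||_{L^p(mu)} },
# attained at g* = (1/p) log(dP/dmu).
random.seed(4)
n, p = 6, 1.3
mu = [random.random() + 0.05 for _ in range(n)]
mu = [m / sum(mu) for m in mu]
P = [random.random() + 0.05 for _ in range(n)]
P = [q / sum(P) for q in P]

def objective(g):
    lin = sum(q * gi for q, gi in zip(P, g))
    log_norm = math.log(sum(m * math.exp(p * gi) for m, gi in zip(mu, g))) / p
    return lin - log_norm

D = sum(q * math.log(q / m) for q, m in zip(P, mu))
g_star = [math.log(q / m) / p for q, m in zip(P, mu)]

assert abs(objective(g_star) - D / p) < 1e-9     # equality at the optimizer
for _ in range(200):
    g = [random.uniform(-2, 2) for _ in range(n)]
    assert objective(g) <= D / p + 1e-12          # never exceeded elsewhere
print("duality verified")
```

The substitution h = pg reduces this to the usual Gibbs variational formula, which is the p = 1 case.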
2.2 Dual Formulations of the Forward Brascamp-Lieb Inequality
The Brascamp-Lieb inequality1, originally motivated by problems from particle physics [83], can be viewed as a property of the “correlation” among random variables, and turns out to be naturally suited for the analysis of the common randomness generation problem. In the literature, the Brascamp-Lieb inequality usually refers to the fact that Gaussian functions saturate a certain functional inequality. To be concrete, let us look at a modern formulation of the result from Barthe’s paper [88]:2

1 Not to be confused with the “variance Brascamp-Lieb inequality” (cf. [95][96][97]), a different type of inequality that generalizes the Poincaré inequality.

2 [88, Theorem 1] actually contains additional assumptions, which make the best constants D and F positive and finite, but which are not really necessary for the conclusion to hold ([88, Remark 1]).
Theorem 2.2.1 ([88, Theorem 1]) Let E, E_1, . . . , E_m be Euclidean spaces, equipped with the Lebesgue measure, and B_i : E → E_i be linear maps. Let (c_i)_{i=1}^m and D be positive real numbers. Then the Brascamp-Lieb inequality

\[
\int \prod_{i=1}^{m} f_i(B_i x) \, dx \le D \prod_{i=1}^{m} \|f_i\|_{\frac{1}{c_i}}, \tag{2.9}
\]

where

\[
\|f_j\|_{\frac{1}{c_j}} := \mathbb{E}^{c_j}\!\left[ f_j^{\frac{1}{c_j}}(Y_j) \right], \tag{2.10}
\]

for all nonnegative measurable functions f_i on E_i, i = 1, . . . , m, holds if and only if it holds whenever f_i, i = 1, . . . , m, are centered Gaussian functions3.
Various proofs of the Gaussian optimality in the Brascamp-Lieb inequality have been
proposed since the original paper of Brascamp and Lieb [83]; we further discuss these
matters in Section 6.2. In this chapter, however, we are only interested in the form
of the inequality (2.9) rather than the Gaussian optimality result. In particular, we
do not restrict our attention to the setting of Euclidean spaces and linear maps: the
underlying spaces and the mappings between them can be arbitrary.
The form of the inequality (2.9) can be seen as a generalization of several other familiar inequalities, including Hölder’s inequality, the sharp Young inequality, the Loomis-Whitney inequality, the entropy power inequality, hypercontractivity and the logarithmic Sobolev inequality; we further discuss these special cases in Section 2.5 (see also [98][99][70]).
The Brascamp-Lieb inequality is known to imply several information-theoretic
inequalities, such as the entropy power inequality [87, Theorem 12] and the entropic
uncertainty principle [100]; see also the summary in [93]. In this thesis, however, we
3 A centered Gaussian function is of the form x ↦ exp(−xᵀAx), where A is a positive semidefinite matrix.
are mainly interested in the dual formulations of the inequality (2.9). Such duality
was noted by Carlen and Cordero-Erausquin [76, Theorem 2.1] (see also [75] for a
special case). We reproduce their result here:
Theorem 2.2.2 ([76, Theorem 2.1]) Let 𝒳, 𝒴_1, . . . , 𝒴_m be measurable spaces. Consider a probability measure Q_X, measurable maps φ_j : 𝒳 → 𝒴_j, j = 1, . . . , m, probability measures Q_{Y_1}, . . . , Q_{Y_m} (not necessarily induced by Q_X and (φ_j)_{j=1}^m), and c_1, . . . , c_m ∈ (0, ∞). For any d ∈ ℝ, the following statements are equivalent:

1. For any non-negative measurable functions f_j : 𝒴_j → ℝ, j = 1, . . . , m, it holds that

\[
\mathbb{E}\left[ \prod_{j=1}^{m} f_j(\phi_j(X)) \right] \le \exp(d) \prod_{j=1}^{m} \|f_j\|_{\frac{1}{c_j}} \tag{2.11}
\]

where X ~ Q_X and Y_j ~ Q_{Y_j}.

2. For any probability measure P_X ≪ Q_X,

\[
D(P_X \| Q_X) + d \ge \sum_{j=1}^{m} c_j D(P_{Y_j} \| Q_{Y_j}) \tag{2.12}
\]

where P_{Y_j} is induced by P_X and φ_j, j = 1, . . . , m.
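As a hedged numerical illustration of one direction of this equivalence (the alphabet sizes, maps φ_j, and exponents below are our own toy choices), consider the Hölder-type special case d = 0, c_1 + c_2 = 1, with Q_{Y_j} induced by Q_X and deterministic φ_j: then statement 1 is Hölder's inequality, so statement 2 must hold for every P_X, which the following sketch checks:

```python
import math
import random

# Sanity check of statement 2 in a Hoelder-type special case: d = 0,
# c1 + c2 = 1, Q_{Y_j} induced by Q_X and deterministic maps phi_j.
# Hoelder's inequality gives statement 1, so statement 2 must hold:
#   D(P_X||Q_X) >= c1 D(P_{Y_1}||Q_{Y_1}) + c2 D(P_{Y_2}||Q_{Y_2}).
random.seed(5)
nx = 6
phi1 = [x % 2 for x in range(nx)]   # Y_1 alphabet {0, 1}
phi2 = [x % 3 for x in range(nx)]   # Y_2 alphabet {0, 1, 2}
c1, c2 = 0.4, 0.6

def push(P, phi, ny):
    out = [0.0] * ny
    for x, q in enumerate(P):
        out[phi[x]] += q
    return out

def D(P, Q):
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

QX = [random.random() + 0.1 for _ in range(nx)]
QX = [q / sum(QX) for q in QX]
QY1, QY2 = push(QX, phi1, 2), push(QX, phi2, 3)

for _ in range(500):
    PX = [random.random() for _ in range(nx)]
    PX = [q / sum(PX) for q in PX]
    PY1, PY2 = push(PX, phi1, 2), push(PX, phi2, 3)
    assert D(PX, QX) + 1e-12 >= c1 * D(PY1, QY1) + c2 * D(PY2, QY2)
print("entropic dual inequality verified")
```

In this special case the entropic side also follows directly from the data processing inequality, since each D(P_{Y_j}‖Q_{Y_j}) ≤ D(P_X‖Q_X) and the c_j sum to one.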
In this section we prove an extension of the duality of Carlen and Cordero-Erausquin, with a functional inequality that generalizes (2.11) by allowing a cost function and random transformations in lieu of {φ_j}. Both generalizations are essential for certain information-theoretic applications. The “forward-reverse Brascamp-Lieb inequality” alluded to before will be introduced in later sections since, although the forward-reverse inequality essentially generalizes the forward inequality (2.11), the proof of its equivalent formulation is more involved and applies only to certain “regular” (though fairly general) spaces4.
Theorem 2.2.3 Fix a probability measure Q_X, an integer m ∈ {1, 2, . . . }, c_j ∈ (0, ∞), and random transformations Q_{Y_j|X}5 for each j ∈ {1, . . . , m}. Let (X, Y_j) ~ Q_X Q_{Y_j|X}. Assume that d : 𝒳 → (−∞, ∞] is a measurable function satisfying

\[
0 < \mathbb{E}[\exp(-d(X))] < \infty. \tag{2.13}
\]

The following statements are equivalent:

1. For any non-negative measurable functions f_j : 𝒴_j → ℝ, j ∈ {1, . . . , m}, it holds that

\[
\mathbb{E}\left[ \exp\left( \sum_{j=1}^{m} \mathbb{E}[\log f_j(Y_j) \mid X] - d(X) \right) \right] \le \prod_{j=1}^{m} \|f_j\|_{\frac{1}{c_j}} \tag{2.14}
\]

where the norm ‖f_j‖_{1/c_j} is with respect to Q_{Y_j}.

2. For any distribution P_X ≪ Q_X, it holds that

\[
D(P_X \| Q_X) + \mathbb{E}[d(X)] \ge \sum_{j=1}^{m} c_j D(P_{Y_j} \| Q_{Y_j}) \tag{2.15}
\]

where X ~ P_X, and P_X → Q_{Y_j|X} → P_{Y_j} for j ∈ {1, . . . , m}.
Proof
• 1) ⇒ 2): Define

\[
d_0 := \log \mathbb{E}\left[ \exp(-d(X)) \prod_{j=1}^{m} \exp\left( c_j \, \mathbb{E}[\imath_{P_{Y_j} \| Q_{Y_j}}(Y_j) \mid X] \right) \right]. \tag{2.16}
\]

4 More precisely, the “entropic ⇒ functional” direction is not more difficult than for the forward inequality, but the “functional ⇒ entropic” direction requires sophisticated min-max theorems and is only proved for Polish spaces. In the finite alphabet case, the latter difficulty can be circumvented by using the KKT conditions [80].

5 Throughout the thesis, with the exception of the discussions on the reverse Brascamp-Lieb inequality (mostly, Section 2.4), the definition of a random transformation is the same as the conventional definition of a regular conditional probability (see e.g. [43]).
Invoking statement 1) with

\[
f_j \leftarrow \left( \frac{dP_{Y_j}}{dQ_{Y_j}} \right)^{c_j} \tag{2.17}
\]

we obtain

\[
\exp(d_0) \le \prod_{j=1}^{m} \left( \mathbb{E}\left[ \frac{dP_{Y_j}}{dQ_{Y_j}}(Y_j) \right] \right)^{c_j} = 1. \tag{2.18}
\]

Now if 0 < exp(d_0) ≤ 1, then

\[
dS_X(x) := \exp(-d(x) - d_0) \prod_{j=1}^{m} \exp\left( c_j \, \mathbb{E}[\imath_{P_{Y_j} \| Q_{Y_j}}(Y_j) \mid X = x] \right) dQ_X(x) \tag{2.19}
\]

is a probability measure. Then (2.18) combined with the nonnegativity of relative entropy shows that

\[
\sum_{j=1}^{m} c_j D(P_{Y_j} \| Q_{Y_j}) \le D(P_X \| S_X) - d_0 + \sum_{j=1}^{m} c_j D(P_{Y_j} \| Q_{Y_j}) \tag{2.20}
\]
\[
= D(P_X \| Q_X) + \mathbb{E}[d(X)] \tag{2.21}
\]

and statement 2) holds. On the other hand, if exp(d_0) = 0, then for Q_X-almost all x,

\[
\exp(-d(x)) \prod_{j=1}^{m} \exp\left( c_j \, \mathbb{E}[\imath_{P_{Y_j} \| Q_{Y_j}}(Y_j) \mid X = x] \right) = 0. \tag{2.22}
\]

Taking logarithms on both sides and taking the expectation with respect to P_X, we have

\[
-\mathbb{E}[d(X)] + \sum_{j=1}^{m} c_j D(P_{Y_j} \| Q_{Y_j}) = -\infty \tag{2.23}
\]
and statement 2) also follows.
• 2) ⇒ 1): It suffices to prove the claim for f_j’s such that 0 < a < f_j < b < ∞ for some a and b, since the general case will then follow by taking limits (e.g. using the monotone convergence theorem). By this assumption and (2.13), we can always define P_X through

\[
\imath_{P_X \| Q_X}(x) = -d(x) - d_0 + \mathbb{E}\left[ \sum_{j=1}^{m} \log f_j(Y_j) \,\middle|\, X = x \right] \tag{2.24}
\]

and S_{Y_j} through

\[
\imath_{S_{Y_j} \| Q_{Y_j}}(y_j) := \frac{1}{c_j} \log f_j(y_j) - d_j, \tag{2.25}
\]

for each j ∈ {1, . . . , m}, where d_j ∈ ℝ, j ∈ {0, . . . , m}, are normalization constants; therefore

\[
\exp(d_0) = \mathbb{E}\left[ \exp\left( -d(X) + \mathbb{E}\left[ \sum_{j=1}^{m} \log f_j(Y_j) \,\middle|\, X \right] \right) \right]; \tag{2.26}
\]

\[
\exp(d_j) = \mathbb{E}\left[ \exp\left( \frac{1}{c_j} \log f_j(Y_j) \right) \right], \quad j \in \{1, . . . , m\}. \tag{2.27}
\]
But direct computation gives

\[
D(P_X \| Q_X) = -\mathbb{E}[d(X)] - d_0 + \mathbb{E}\left[ \sum_{j=1}^{m} \log f_j(Y_j) \right] \tag{2.28}
\]

\[
D(P_{Y_j} \| Q_{Y_j}) = D(P_{Y_j} \| S_{Y_j}) + \mathbb{E}\left[ \imath_{S_{Y_j} \| Q_{Y_j}}(Y_j) \right] \tag{2.29}
\]
\[
= D(P_{Y_j} \| S_{Y_j}) - d_j + \mathbb{E}\left[ \frac{1}{c_j} \log f_j(Y_j) \right] \tag{2.30}
\]
where Y_j ~ P_{Y_j}. Therefore statement 2) yields

\[
-d_0 + \mathbb{E}\left[ \sum_{j=1}^{m} \log f_j(Y_j) \right] \ge \sum_{j=1}^{m} c_j D(P_{Y_j} \| S_{Y_j}) - \sum_{j=1}^{m} c_j d_j + \mathbb{E}\left[ \sum_{j=1}^{m} \log f_j(Y_j) \right]. \tag{2.31}
\]
Since the f_j’s are assumed to be bounded, −∞ < 𝔼[ Σ_{j=1}^m log f_j(Y_j) ] < ∞, so we can cancel this term from the two sides of the inequality. It then follows from the non-negativity of relative entropy that

\[
d_0 \le \sum_{j=1}^{m} c_j d_j, \tag{2.32}
\]

which is equivalent to statement 1) in view of (2.26) and (2.27).
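As a hedged numerical illustration of the functional statement (2.14) (the alphabet sizes, channel, and test functions below are our own toy choices), consider the special case m = 1, c_1 = 1, d = 0, with Q_Y induced by Q_X and Q_{Y|X}: then (2.14) follows from Jensen's inequality applied to the inner conditional expectation, and can be checked directly on a finite alphabet:

```python
import math
import random

# Functional statement (2.14) in the special case m = 1, c_1 = 1, d = 0, with
# Q_Y induced by Q_X and Q_{Y|X}.  Jensen's inequality on the inner conditional
# expectation gives  E[ exp( E[log f(Y) | X] ) ] <= E_{Q_Y}[ f(Y) ] = ||f||_1.
random.seed(6)
nx, ny = 4, 5
QX = [random.random() + 0.1 for _ in range(nx)]
QX = [q / sum(QX) for q in QX]
W = [[random.random() + 0.1 for _ in range(ny)] for _ in range(nx)]  # Q_{Y|X}
W = [[w / sum(row) for w in row] for row in W]
QY = [sum(QX[x] * W[x][y] for x in range(nx)) for y in range(ny)]

def lhs(f):
    return sum(QX[x] * math.exp(sum(W[x][y] * math.log(f[y]) for y in range(ny)))
               for x in range(nx))

def rhs(f):
    return sum(QY[y] * f[y] for y in range(ny))

for _ in range(300):
    f = [random.uniform(0.1, 3.0) for _ in range(ny)]
    assert lhs(f) <= rhs(f) + 1e-12
print("functional inequality verified")
```

The general statement, with multiple channels, exponents c_j, and a cost function d, requires the full duality argument given in the proof above; this sketch only exercises the simplest instance.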
Remark 2.2.1 We prove Theorem 2.2.3 by defining certain auxiliary measures and
then reducing (2.14) or (2.15) to the nonnegativity of relative entropy. Alternatively,
Theorem 2.2.3 may be proved with a variational formula for the relative entropy, in a manner similar to the proof of [76, Theorem 2.1].
Remark 2.2.2 As a convention, if the left side of (2.15) is ∞ − ∞, it should be understood as 𝔼[ı_{P_X‖Q_X}(X) + d(X)] with X ~ P_X. Note that the assumption (2.13) ensures that

\[
dT_X(x) := \frac{\exp(-d(x)) \, dQ_X(x)}{\mathbb{E}[\exp(-d(X))]} \tag{2.33}
\]

defines a probability measure T_X, so that

\[
\mathbb{E}\left[ \imath_{P_X \| Q_X}(X) + d(X) \right] = D(P_X \| T_X) - \log \mathbb{E}[\exp(-d(X))] \tag{2.34}
\]

is always well-defined (finite or +∞).
Remark 2.2.3 In the statements of Theorem 2.2.3, Q_X is a probability measure, and Q_X and Q_{Y_j} are connected through the random transformation Q_{Y_j|X}. These choices help to keep our notation simple and suffice for most of our applications. However, from the proof it is clear that these restrictions are not really necessary. In other words, we have the extension of Theorem 2.2.3 that the following two statements are equivalent:

\[
\int \exp\left( \sum_{j=1}^{m} \mathbb{E}[\log f_j(Y_j) \mid X = x] - d(x) \right) d\nu(x) \le \prod_{j=1}^{m} \|f_j\|_{\frac{1}{c_j}}, \quad \forall f_1, . . . , f_m; \tag{2.35}
\]

\[
D(P_X \| \nu) + \int d(x) \, dP_X(x) \ge \sum_{j=1}^{m} c_j D(P_{Y_j} \| \mu_j), \quad \forall P_X, \tag{2.36}
\]

where again f_j : 𝒴_j → ℝ, j ∈ {1, . . . , m}, are nonnegative measurable functions, P_X ≪ ν, and P_X → Q_{Y_j|X} → P_{Y_j} for j ∈ {1, . . . , m}. Here the measures ν and µ_j need not be normalized and need not be connected by Q_{Y_j|X}, and ‖ · ‖_{1/c_j} is with respect to µ_j.
2.3 Application to Common Randomness Generation
We now use Theorem 2.2.3 to prove a preliminary converse bound for common randomness generation.

2.3.1 The One Communicator Problem: the Setup
We first formulate the problem in the one-shot setting. Let $Q_{XY^m}$ be the joint distribution of the sources $X, Y_1, \dots, Y_m$, observed by the terminals $T_0, \dots, T_m$ as shown in Figure 1.1. The communicator $T_0$ computes/encodes (possibly stochastically) the integers $W_1(X), \dots, W_m(X)$ and sends them to $T_1, \dots, T_m$, respectively. Then, the terminals $T_0, \dots, T_m$ compute/decode (either deterministically or stochastically, which will be specified in the statements of our results) integers $K(X)$, $K_1(Y_1, W_1)$, \dots, $K_m(Y_m, W_m)$. The goal is to produce $K = K_1 = \dots = K_m$ with high probability, with $K$ almost equiprobable.

In the stationary memoryless case where the sources have the per-letter distribution $Q_{XY^m}$, take $X \leftarrow X^n$, $Y_j \leftarrow Y_j^n$ in the above formulation.
Definition 2.3.1 The $(m+1)$-tuple $(R, R_1, \dots, R_m)$ is said to be achievable if a sequence of key generation schemes can be designed to fulfill the following conditions:

$$\liminf_{n\to\infty} \frac{1}{n} H(K) \ge R; \tag{2.37}$$

$$\limsup_{n\to\infty} \frac{1}{n} \log|\mathcal{W}_l| \le R_l, \quad l = 1,\dots,m; \tag{2.38}$$

$$\lim_{n\to\infty} \max_{1\le l\le m} \mathbb{P}[K \neq K_l] = 0. \tag{2.39}$$

Irrespective of the stochasticity of the decoders (cf. [16][101]), the achievable region is the set of $(R, R_1, \dots, R_m) \in [0,\infty)^{m+1}$ such that

$$\sup_{P_{U|X}} \left\{\sum_{j=1}^{m} c_j I(U;Y_j) - I(U;X)\right\} + \sum_j c_j R_j \ge \left(\sum_j c_j - 1\right) R \tag{2.40}$$

for all $c^m \in (0,\infty)^m$ (equivalently, for all $c^m \in (0,\infty)^m$ such that $\sum c_j > 1$), where $(U, X, Y_j) \sim P_{U|X} Q_{XY_j}$.
Remark 2.3.1 It is widely known in the information-theoretic secrecy literature (see e.g. [102] for a discussion) that the rate region (2.40) is rather "robust" under various definitions. For example, (2.40) does not change if one replaces (2.37) and (2.39) in the conventional definition with

$$\liminf_{n\to\infty} \frac{1}{n}\log|\mathcal{K}| \ge R; \tag{2.41}$$

$$\lim_{n\to\infty} |P_{KK^m} - T_{KK^m}| = 0, \tag{2.42}$$

where $T_{KK^m}$ denotes the ideal distribution:

$$T_{KK^m}(k, k^m) = \frac{1}{|\mathcal{K}|}\,1\{k = k_1 = \dots = k_m\}, \quad \forall k, k_1, \dots, k_m. \tag{2.43}$$

In fact, such a definition appears to be more natural for the second-order analysis in the nonvanishing error regime.
2.3.2 A Zero-Rate One-Shot Converse for the Omniscient Helper Problem
Although the single-letter rate region (2.40) is known for the one-communicator problem, the previous proof based on Fano's inequality only establishes a weak converse result; that is, the error probability is assumed to be vanishing. As mentioned before, a main goal of the first three chapters is to supply general converse tools to solve, in particular, the optimal scaling of the second-order rate for the one-communicator problem with nonvanishing error probability. In this section, however, we pursue an intermediate goal, which is more modest than the final goal in the following senses:

• The first-order asymptotics of the bound is tight only when the supremum in (2.40) is zero. In other words, we bound the maximum ratio of the log alphabet sizes of the key and the messages such that the key can be successfully generated in the omniscient helper problem. Since this ratio is supremized as the key rate and the communication rates tend to zero, such a converse bound may also be called a zero-rate converse. The bound is only asymptotically tight in the case of abundant correlated sources but limited communication rates.

• We consider only the omniscient helper special case of the one-communicator problem, where $X = Y^m$. Actually, the analysis should work as long as each $Y_j$ is a (deterministic) function of $X$, but let us consider $X = Y^m$ for simplicity.

We remark that the above two limitations will be remedied by techniques introduced in Chapters 3 and 4, respectively.
Now, the rate region (2.40) suggests that we should find a correspondence between linear combinations of mutual informations and a certain functional inequality. Suppose that the random variables $Y_1, \dots, Y_m$ and $c_1, \dots, c_m \in (0,\infty)$ satisfy

$$\mathbb{E}\left[\prod_{l=1}^{m} f_l(Y_l)\right] \le \prod_{l=1}^{m} \|f_l(Y_l)\|_{\frac{1}{c_l}} \tag{2.44}$$

for all bounded real-valued measurable functions $f_l$ defined on $\mathcal{Y}_l$, $l = 1,\dots,m$. Then Theorem 2.2.3 shows that (2.44) is equivalent to

$$D(P_{Y^m}\|Q_{Y^m}) \ge \sum_{l=1}^{m} c_l D(P_{Y_l}\|Q_{Y_l}) \tag{2.45}$$

for all $P_{Y^m} \ll Q_{Y^m}$. To further relate (2.45) to the rate region formula (2.40), we need the following result:

Proposition 2.3.1 For given $Q_{Y^m}$ and $c_1, \dots, c_m \in [1,+\infty)$, (2.45) holds for all $P_{Y^m} \ll Q_{Y^m}$ if and only if

$$I(U;Y^m) \ge \sum_{l=1}^{m} c_l I(U;Y_l) \tag{2.46}$$

holds for all $P_{U|Y^m}$.
Proof The (2.45)$\Rightarrow$(2.46) part follows immediately from the definition of mutual information. The less trivial direction (2.46)$\Rightarrow$(2.45) can be shown by constructing a random variable $U$ that takes two values; see for example [74].

In view of the rate region formula (2.40), a rate tuple $(R, R_1, \dots, R_m)$ is not achievable if

$$R < \sum_{l=1}^{m} c_l (R - R_l) \tag{2.47}$$

holds for some $c_1, \dots, c_m$ satisfying the assumption in Proposition 2.3.1. Conversely, if $r_1, \dots, r_m \in (0,\infty)$ satisfy the property that $1 \ge \sum_{l=1}^{m} c_l(1 - r_l)$ for all $(c_1, \dots, c_m)$ such that the assumption in Proposition 2.3.1 is satisfied, then for any $\delta > 0$ there exists an achievable $(R, R_1, R_2, \dots, R_m)$ such that $\frac{R_l}{R} < r_l + \delta$ for each $l = 1,\dots,m$. In other words, the set of admissible $(c_1, \dots, c_m)$ completely characterizes the set of achievable ratios $\frac{R_1}{R}, \dots, \frac{R_m}{R}$.
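For intuition, (2.45) can be spot-checked numerically in a concrete case. For the doubly symmetric binary source with correlation $\rho$, the hypercontractivity literature indicates that $c_1 = c_2 = \frac{1}{1+\rho}$ is an admissible choice; the sketch below tests the entropic inequality on random joint distributions under that assumption (a numerical illustration, not a proof).

```python
import numpy as np

# Spot-check of (2.45) for a doubly symmetric binary source Q_{Y1 Y2}:
#   D(P_{Y1Y2}||Q_{Y1Y2}) >= c * ( D(P_{Y1}||Q_{Y1}) + D(P_{Y2}||Q_{Y2}) )
# with c = 1/(1+rho), an admissible constant suggested by the
# hypercontractivity literature (assumed here, not derived).
rng = np.random.default_rng(0)
rho = 0.5
Q = np.array([[1 + rho, 1 - rho],
              [1 - rho, 1 + rho]]) / 4.0     # DSBS(rho), uniform marginals
c = 1.0 / (1.0 + rho)

def D(p, q):
    """Relative entropy with the convention 0 log 0 = 0."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

worst = np.inf
for _ in range(2000):
    P = rng.random((2, 2)); P /= P.sum()     # a random joint distribution
    gap = D(P, Q) - c * (D(P.sum(1), Q.sum(1)) + D(P.sum(0), Q.sum(0)))
    worst = min(worst, gap)
print(worst >= -1e-9)
```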
We are now ready to prove a zero-rate one-shot converse for the omniscient helper problem. Suppose the (possibly stochastic) encoder for the public messages is specified by $P_{W^m|Y^m}$ and the (possibly stochastic) decoder for the key is given by $\prod_{l=1}^{m} P_{K_l|Y_l W_l}$. Let $T_{K^m}$ be the correct distribution, under which $K_1 = K_2 = \dots = K_m$ is equiprobably distributed on $\mathcal{K}$. Clearly, a small total variation $|P_{K^m} - T_{K^m}|$ implies both uniformity of the key distribution and a small probability of key disagreement; in fact, in the stationary memoryless case it can be shown that this is asymptotically equivalent to imposing both the entropy constraint (2.37) and the error constraint (2.39).
Theorem 2.3.2 In the omniscient helper problem, if $Y^m$ and $c_1, \dots, c_m$ satisfy the condition in Proposition 2.3.1, then

$$\frac{1}{2}|P_{K^m} - T_{K^m}| \ge 1 - \frac{1}{|\mathcal{K}|} - \left[|\mathcal{K}|\prod_{l=1}^{m}\left(\frac{|\mathcal{W}_l|}{|\mathcal{K}|}\right)^{c_l}\right]^{\frac{1}{\sum c_l}}. \tag{2.48}$$
Proof For any $k \in \mathcal{K}$,

$$\mathbb{P}\left[\bigcap_{l=1}^{m}\{K_l = k\}\right] \tag{2.49}$$

$$= \int_{\mathcal{Y}^m} \sum_{w^m} \prod_{l=1}^{m} P_{K_l=k|Y_l W_l=w_l}\, P_{W^m|Y^m}\,\mathrm{d}P_{Y^m} \tag{2.50}$$

$$\le \int_{\mathcal{Y}^m} \max_{w^m} \prod_{l=1}^{m} P_{K_l=k|Y_l W_l=w_l}\,\mathrm{d}P_{Y^m} \tag{2.51}$$

$$\le \int_{\mathcal{Y}^m} \prod_{l=1}^{m} \max_{w_l} P_{K_l=k|Y_l W_l=w_l}\,\mathrm{d}P_{Y^m} \tag{2.52}$$

$$\le \prod_{l=1}^{m} \left[\int_{\mathcal{Y}_l} \left(\max_{w_l} P_{K_l=k|Y_l W_l=w_l}\right)^{\frac{1}{c_l}}\mathrm{d}P_{Y_l}\right]^{c_l} \tag{2.53}$$

$$\le \prod_{l=1}^{m} \left[\int_{\mathcal{Y}_l} \max_{w_l} P_{K_l=k|Y_l W_l=w_l}\,\mathrm{d}P_{Y_l}\right]^{c_l} \tag{2.54}$$

$$\le \prod_{l=1}^{m} \left[\sum_{w_l} \int_{\mathcal{Y}_l} P_{K_l=k|Y_l W_l=w_l}\,\mathrm{d}P_{Y_l}\right]^{c_l} \tag{2.55}$$

where

• (2.53) uses the definition of hypercontractivity;

• (2.54) uses $0 < c_l < 1$ and $\max_{w_l} P_{K_l=k|Y_l W_l=w_l} \le 1$.
Raising both sides of (2.55) to the power of $\frac{1}{\sum_{i=1}^{m} c_i}$, we obtain

$$\mathbb{P}\left[\bigcap_{l=1}^{m}\{K_l = k\}\right]^{\frac{1}{\sum_i c_i}} \le \prod_{l=1}^{m}\left[\sum_{w_l}\int_{\mathcal{Y}_l} P_{K_l=k|Y_l W_l=w_l}\,\mathrm{d}P_{Y_l}\right]^{\frac{c_l}{\sum_i c_i}}. \tag{2.56}$$
But the function $t^m \mapsto \prod_{l=1}^{m} t_l^{\frac{c_l}{\sum_i c_i}}$ is a concave function on $[0,\infty)^m$, so by Jensen's inequality,

$$\frac{1}{|\mathcal{K}|}\sum_{k=1}^{|\mathcal{K}|}\prod_{l=1}^{m}\left[\sum_{w_l}\int_{\mathcal{Y}_l} P_{K_l=k|Y_l W_l=w_l}\,\mathrm{d}P_{Y_l}\right]^{\frac{c_l}{\sum_i c_i}} \tag{2.57}$$

$$\le \prod_{l=1}^{m}\left[\sum_{w_l}\int_{\mathcal{Y}_l}\frac{1}{|\mathcal{K}|}\sum_{k=1}^{|\mathcal{K}|} P_{K_l=k|Y_l W_l=w_l}\,\mathrm{d}P_{Y_l}\right]^{\frac{c_l}{\sum_i c_i}} \tag{2.58}$$

$$= \prod_{l=1}^{m}\left[\sum_{w_l}\int_{\mathcal{Y}_l}\frac{1}{|\mathcal{K}|}\,\mathrm{d}P_{Y_l}\right]^{\frac{c_l}{\sum_i c_i}} \tag{2.59}$$

$$= \prod_{l=1}^{m}\left(\frac{|\mathcal{W}_l|}{|\mathcal{K}|}\right)^{\frac{c_l}{\sum_i c_i}}. \tag{2.60}$$
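The concavity claim behind the Jensen step can itself be spot-checked numerically: the exponents $\frac{c_l}{\sum_i c_i}$ sum to one, which makes the weighted geometric mean concave. A minimal sketch (the constants $c_l$ below are arbitrary):

```python
import numpy as np

# Midpoint-concavity check for t -> prod_l t_l^{c_l / sum_i c_i} on [0, inf)^m.
# The exponents theta_l = c_l / sum_i c_i sum to 1, which is what makes the
# weighted geometric mean concave. The c_l values are arbitrary.
rng = np.random.default_rng(0)
c = np.array([0.6, 0.7, 0.4])
theta = c / c.sum()
F = lambda t: np.prod(t ** theta)

for _ in range(1000):
    t1, t2 = rng.random(3), rng.random(3)
    assert F(0.5 * (t1 + t2)) >= 0.5 * F(t1) + 0.5 * F(t2) - 1e-12
print("midpoint concavity holds on all sampled pairs")
```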
Combining (2.56) and (2.60), we obtain

$$\frac{1}{|\mathcal{K}|}\sum_{k=1}^{|\mathcal{K}|}\mathbb{P}\left[\bigcap_{l=1}^{m}\{K_l = k\}\right]^{\frac{1}{\sum_i c_i}} \le \prod_{l=1}^{m}\left(\frac{|\mathcal{W}_l|}{|\mathcal{K}|}\right)^{\frac{c_l}{\sum_i c_i}}. \tag{2.61}$$
Now, if we can show the following bound

$$\frac{1}{2}|P_{K^m} - T_{K^m}| = \frac{1}{2}\left[\sum_{k=1}^{|\mathcal{K}|}\left|\mathbb{P}\left[\bigcap_{l=1}^{m}\{K_l = k\}\right] - \frac{1}{|\mathcal{K}|}\right| + 1 - \sum_{k=1}^{|\mathcal{K}|}\mathbb{P}\left[\bigcap_{l=1}^{m}\{K_l = k\}\right]\right] \tag{2.62}$$

$$\ge 1 - \frac{1}{|\mathcal{K}|} - |\mathcal{K}|^{\frac{1}{\sum_l c_l}-1}\sum_{k=1}^{|\mathcal{K}|}\mathbb{P}\left[\bigcap_{l=1}^{m}\{K_l = k\}\right]^{\frac{1}{\sum_l c_l}} \tag{2.63}$$

then the proof is finished by combining (2.61) and (2.63).
To finish the proof we need to find a lower bound on (2.62) in terms of the left side of (2.61). Consider the following optimization problem over nonnegative vectors $a^{|\mathcal{K}|}$:

$$\text{minimize } f(a^{|\mathcal{K}|}) := \sum_{k=1}^{|\mathcal{K}|}\left|\frac{1}{|\mathcal{K}|} - a_k\right|^{+} \tag{2.64}$$

$$\text{subject to } g(a^{|\mathcal{K}|}) := \frac{1}{|\mathcal{K}|}\sum_{k=1}^{|\mathcal{K}|} a_k^{\frac{1}{\sum_l c_l}} \le \lambda \tag{2.65}$$

where $\lambda > 0$ is some constant. Notice that if $a^{|\mathcal{K}|}$ and $b^{|\mathcal{K}|}$ satisfy

$$\frac{1}{|\mathcal{K}|}\sum_{k=1}^{|\mathcal{K}|} a_k = \frac{1}{|\mathcal{K}|}\sum_{k=1}^{|\mathcal{K}|} b_k \tag{2.66}$$

and, for each $k$, $a_k - \frac{1}{|\mathcal{K}|}$ and $b_k - \frac{1}{|\mathcal{K}|}$ have the same sign, then (because $\frac{1}{\sum c_l} \le 1$)

$$f\!\left(\tfrac{1}{2}(a^{|\mathcal{K}|} + b^{|\mathcal{K}|})\right) = \tfrac{1}{2}\left(f(a^{|\mathcal{K}|}) + f(b^{|\mathcal{K}|})\right) \tag{2.67}$$

$$g\!\left(\tfrac{1}{2}(a^{|\mathcal{K}|} + b^{|\mathcal{K}|})\right) \ge \tfrac{1}{2}\left(g(a^{|\mathcal{K}|}) + g(b^{|\mathcal{K}|})\right). \tag{2.68}$$

This implies that an optimizer $a^{|\mathcal{K}|}$ of (2.64) satisfies

$$\left|\left\{k : a_k \in \left(0, \tfrac{1}{|\mathcal{K}|}\right)\right\}\right| \le 1. \tag{2.69}$$

Moreover, for any $a^{|\mathcal{K}|}$, define $b_k := a_k 1\{a_k < \tfrac{1}{|\mathcal{K}|}\} + \tfrac{1}{|\mathcal{K}|} 1\{a_k \ge \tfrac{1}{|\mathcal{K}|}\}$; then

$$f(b^{|\mathcal{K}|}) = f(a^{|\mathcal{K}|}) \tag{2.70}$$

$$g(b^{|\mathcal{K}|}) \le g(a^{|\mathcal{K}|}). \tag{2.71}$$

Thus $a^{|\mathcal{K}|}$ can further be assumed to be of the following form:

$$a^{|\mathcal{K}|} = \left[\frac{1}{|\mathcal{K}|}, \frac{1}{|\mathcal{K}|}, \dots, \frac{1}{|\mathcal{K}|}, \eta, 0, 0, \dots, 0\right] \tag{2.72}$$

for some $0 < \eta \le \frac{1}{|\mathcal{K}|}$, and then it is elementary to show that the optimal value of (2.64) satisfies

$$f \ge 1 - \lambda|\mathcal{K}|^{\frac{1}{\sum_l c_l}} - \frac{1}{|\mathcal{K}|}. \tag{2.73}$$

Thus we have shown that

$$\frac{1}{2}|P_{K^m} - T_{K^m}| \ge 1 - \frac{1}{|\mathcal{K}|} - |\mathcal{K}|^{\frac{1}{\sum_l c_l}-1}\sum_{k=1}^{|\mathcal{K}|}\mathbb{P}[K_1 = \dots = K_m = k]^{\frac{1}{\sum_l c_l}} \tag{2.74}$$

as desired.
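The elementary last step (2.64)-(2.73) can be verified numerically: for any nonnegative vector $a^{|\mathcal{K}|}$, the bound $f(a^{|\mathcal{K}|}) \ge 1 - \lambda|\mathcal{K}|^{1/\sum_l c_l} - 1/|\mathcal{K}|$ with $\lambda = g(a^{|\mathcal{K}|})$ should hold. The sketch below tests random instances; the alphabet size and the value of $\sum_l c_l$ are arbitrary choices.

```python
import numpy as np

# Numerical check of the optimization bound (2.64)-(2.73): for nonnegative a,
#   f(a) = sum_k (1/K - a_k)_+   and   lam = g(a) = (1/K) sum_k a_k**(1/s),
# one should have  f(a) >= 1 - lam * K**(1/s) - 1/K  whenever s = sum_l c_l >= 1.
rng = np.random.default_rng(0)
K, s = 16, 1.7                         # arbitrary |K| and sum of the c_l
for _ in range(2000):
    scale = rng.choice([0.01, 1.0 / K, 1.0])
    a = scale * rng.random(K)          # a random nonnegative vector
    lam = np.mean(a ** (1.0 / s))      # g(a) as in (2.65)
    f = np.sum(np.clip(1.0 / K - a, 0.0, None))   # objective (2.64)
    assert f >= 1.0 - lam * K ** (1.0 / s) - 1.0 / K - 1e-12
print("bound (2.73) holds on all sampled instances")
```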
Remark 2.3.2 Theorem 2.3.2 is stronger than a conventional converse by Fano's inequality (which proves exactly the converse part of (2.40)) in the following senses:

• The conventional converse fails to bound the ratio of the number of common randomness bits to the communication bits if the communication rate is zero. In contrast, Theorem 2.3.2 still gives a tight bound on this ratio in the zero-rate case, where the log size of the key alphabet grows sub-linearly in the blocklength. In fact, as long as

$$\log|\mathcal{K}| - \sum_{l=1}^{m} c_l\left(\log|\mathcal{K}| - \log|\mathcal{W}_l|\right) \to -\infty \tag{2.75}$$

which is weaker than (2.47), Theorem 2.3.2 implies that $\frac{1}{2}|P_{K^m} - T_{K^m}|$ converges to 1.

• Even if (2.47) holds, the conventional Fano-based converse does not guarantee that the error probability tends to 1.
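A quick numerical illustration of the first bullet: under a toy scaling satisfying (2.75) (the exponents below are our own choice, not from the thesis), the right side of (2.48) indeed approaches 1 as the blocklength grows.

```python
import numpy as np

# Illustration of Remark 2.3.2: evaluate the right side of (2.48) under a toy
# scaling that satisfies (2.75). Here m = 2, c_1 = c_2 = 0.6 (so sum c_l = 1.2),
# log|K| = sqrt(n) log 2 and log|W_l| = 0.1 sqrt(n) log 2 -- arbitrary choices.
c = np.array([0.6, 0.6])
bounds = []
for n in [1e2, 1e4, 1e6]:
    logK = np.sqrt(n) * np.log(2.0)
    logW = 0.1 * np.sqrt(n) * np.log(2.0) * np.ones_like(c)
    logbr = logK - np.sum(c * (logK - logW))   # log of the bracket in (2.48)
    bounds.append(1.0 - np.exp(-logK) - np.exp(logbr / c.sum()))
print(bounds)                                  # increases toward 1
assert bounds[0] < bounds[1] < bounds[2] and bounds[2] > 0.999
```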
2.4 Dual Formulation of a Forward-Reverse Brascamp-Lieb Inequality

In this section we introduce a new type of functional inequality, which may be called the "forward-reverse Brascamp-Lieb inequality", and prove its equivalent entropic formulation. The forward-reverse Brascamp-Lieb inequality is a generalization of both the forward inequality (which we discussed in Section 2.2) and the reverse Brascamp-Lieb inequality. The proof of the equivalent formulation of the forward-reverse inequality is more technically demanding because of the "reverse" component. Readers who are only interested in applications to network information theory may skip this section. However, as shown in Section 2.5, the forward-reverse inequality is useful in providing a unified treatment of almost all duality results in the literature, previously proved by various other methods.
To begin with, let us review a common form of the reverse Brascamp-Lieb inequality, again from Barthe's paper [88]:

Theorem 2.4.1 Let $E, E_1, \dots, E_m$ be Euclidean spaces, and let $B_i\colon E \to E_i$ be linear maps. Let $(c_i)_{i=1}^{m}$ and $F$ be positive real numbers. The reverse Brascamp-Lieb inequality⁶

$$\int \sup_{(y_i)\colon \sum_{i=1}^{m} c_i B_i^{*} y_i = x}\ \prod_{i=1}^{m} f_i(y_i)\,\mathrm{d}x \ge F\prod_{i=1}^{m}\|f_i\|_{\frac{1}{c_i}}, \tag{2.76}$$

for all nonnegative measurable functions $f_i$ on $E_i$, $i = 1,\dots,m$, holds if and only if it holds for all centered Gaussian functions.
As in the forward case, in this chapter we are only interested in the form of the reverse Brascamp-Lieb inequality rather than the Gaussian optimality result. In particular, we consider general (possibly nonlinear) mappings between general (possibly non-Euclidean) spaces.

⁶$B_i^{*}$ denotes the adjoint of $B_i$. In other words, the matrix of $B_i^{*}$ is the transpose of the matrix of $B_i$.
Lehec [103, Theorem 18] essentially proved that (2.76) is implied by an entropic
inequality, but did not prove the converse implication (which, as we shall see, is the
more challenging direction).
2.4.1 Some Mathematical Preliminaries
Before we introduce the new forward-reverse inequality, some mathematical notations
and preliminaries are necessary: this is because we will prove the “functional implies
entropic” part of the equivalent formulation result, which is much more sophisticated
than the case of the forward inequality. Some regularity assumptions on the alphabets
and the measures appear to be necessary, and some definitions (e.g. that of a random
transformation) are slightly different than the other sections of the thesis in order
to make the proof work (although of course these definitions are equivalent for finite
alphabets).
Throughout this thesis, when the forward-reverse inequality is considered, we always assume that the alphabets are Polish spaces and that the measures are Borel measures⁷. Of course, this covers the cases where the alphabet is Euclidean or discrete (endowed with the Hamming metric, which induces the discrete topology, making every function on the discrete set continuous), among others. Readers interested in finite alphabets only may refer to the (much simpler) argument in [80] based on the KKT condition. However, the proof here for the general Polish space case better manifests the "function-measure duality" viewpoint emphasized in this thesis.

Notation 1 Let $\mathcal{X}$ be a topological space.

⁷A Polish space is a complete separable metric space. It enjoys several nice properties that we use heavily in this section, including the Prokhorov theorem and the Riesz-Kakutani theorem (the latter is related to the fact that every Borel probability measure on a Polish space is inner regular, hence a Radon measure). Short introductions to Polish spaces can be found in e.g. [104][105].
• $C_c(\mathcal{X})$ denotes the space of continuous functions on $\mathcal{X}$ with compact support;

• $C_0(\mathcal{X})$ denotes the space of all continuous functions $f$ on $\mathcal{X}$ that vanish at infinity (i.e. for any $\varepsilon > 0$ there exists a compact set $K \subseteq \mathcal{X}$ such that $|f(x)| < \varepsilon$ for $x \in \mathcal{X}\setminus K$);

• $C_b(\mathcal{X})$ denotes the space of bounded continuous functions on $\mathcal{X}$;

• $M(\mathcal{X})$ denotes the space of finite signed Borel measures on $\mathcal{X}$;

• $P(\mathcal{X})$ denotes the space of probability measures on $\mathcal{X}$.
We consider $C_c$, $C_0$ and $C_b$ as topological vector spaces, with the topology induced by the sup norm. The following theorem, usually attributed to Riesz, Markov and Kakutani, is well known in functional analysis and can be found in, e.g., [67][106].

Theorem 2.4.2 (Riesz-Markov-Kakutani) If $\mathcal{X}$ is a locally compact, $\sigma$-compact Polish space, the dual⁸ of both $C_c(\mathcal{X})$ and $C_0(\mathcal{X})$ is $M(\mathcal{X})$.
Remark 2.4.1 The dual space of $C_b(\mathcal{X})$ can be strictly larger than $M(\mathcal{X})$, since it also contains those linear functionals that depend on the "limit at infinity" of a function $f \in C_b(\mathcal{X})$ (originally defined for those $f$ that do have a limit at infinity, and then extended to the whole of $C_b(\mathcal{X})$ by the Hahn-Banach theorem; see e.g. [67]).
Of course, any $\mu \in M(\mathcal{X})$ is a continuous linear functional on $C_0(\mathcal{X})$ or $C_c(\mathcal{X})$, given by

$$f \mapsto \int f\,\mathrm{d}\mu \tag{2.77}$$

where $f$ is a function in $C_0(\mathcal{X})$ or $C_c(\mathcal{X})$. Remarkably, Theorem 2.4.2 states that the converse is also true under mild regularity assumptions on the space.

⁸The dual of a topological vector space consists of all continuous linear functionals on that space, which is, naturally, also a topological vector space (with the weak$^*$ topology).

Thus, we
can view measures as continuous linear functionals on a certain function space;⁹ this justifies the shorthand notation

$$\mu(f) := \int f\,\mathrm{d}\mu \tag{2.78}$$

which we use frequently in the rest of the thesis. This viewpoint is the most natural one for our setting, since in the proof of the equivalent formulation of the forward-reverse Brascamp-Lieb inequality we shall use the Hahn-Banach theorem to show the existence of certain linear functionals.
Definition 2.4.1 Let $\Lambda\colon C_b(\mathcal{X}) \to (-\infty,+\infty]$ be a lower semicontinuous, proper convex function. Its Legendre-Fenchel transform $\Lambda^{*}\colon C_b^{*}(\mathcal{X}) \to (-\infty,+\infty]$ is given by

$$\Lambda^{*}(\ell) := \sup_{u \in C_b(\mathcal{X})}\left[\ell(u) - \Lambda(u)\right]. \tag{2.79}$$

Let $\nu$ be a nonnegative finite Borel measure on a Polish space $\mathcal{X}$. The relative entropy has the following definition: for any $\mu \in M(\mathcal{X})$,

$$D(\mu\|\nu) := \sup_{f \in C_b(\mathcal{X})}\left[\mu(f) - \Lambda(f)\right] \tag{2.80}$$

where we have defined the convex functional on $C_b(\mathcal{X})$:

$$\Lambda(f) := \log\nu(\exp(f)) \tag{2.81}$$

$$= \log\int\exp(f)\,\mathrm{d}\nu. \tag{2.82}$$

The definition (2.80) agrees with the definition (2.3) when $\nu$ is a probability measure, by the Donsker-Varadhan formula (cf. [105, Lemma 6.2.13]). If $\mu$ is not a probability measure, then $D(\mu\|\nu)$ as defined in (2.80) is $+\infty$.

⁹In fact, some authors prefer to construct measure theory by defining a measure as a linear functional on a suitable function space; see Lax [67] or Bourbaki [68].
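On a finite alphabet, definition (2.80) can be checked against the direct formula for relative entropy: the supremum is attained at $f^{*} = \log\frac{\mathrm{d}\mu}{\mathrm{d}\nu}$, and no other $f$ exceeds it. A small numerical sketch, with an unnormalized $\nu$ as the definition allows:

```python
import numpy as np

# Check of (2.80): D(mu||nu) = sup_f [ mu(f) - log nu(exp f) ], with the
# supremum attained at f* = log(dmu/dnu). nu is an unnormalized finite measure.
rng = np.random.default_rng(0)
mu = rng.random(6); mu /= mu.sum()       # a probability measure
nu = 2.5 * rng.random(6)                 # a finite nonnegative measure
D = np.sum(mu * np.log(mu / nu))         # direct formula for D(mu||nu)
obj = lambda f: np.sum(mu * f) - np.log(np.sum(nu * np.exp(f)))

assert abs(obj(np.log(mu / nu)) - D) < 1e-10   # the optimizer achieves D
for _ in range(500):                           # random f never exceed D
    assert obj(rng.standard_normal(6)) <= D + 1e-10
print("Donsker-Varadhan variational formula verified on this example")
```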
Given a bounded linear operator $T\colon C_b(\mathcal{Y}) \to C_b(\mathcal{X})$, the dual operator $T^{*}\colon C_b(\mathcal{X})^{*} \to C_b(\mathcal{Y})^{*}$ is defined by

$$T^{*}\mu_X\colon f \in C_b(\mathcal{Y}) \mapsto \mu_X(Tf), \tag{2.83}$$

for any $\mu_X \in C_b(\mathcal{X})^{*}$. Since $P(\mathcal{X}) \subseteq M(\mathcal{X}) \subseteq C_b(\mathcal{X})^{*}$, we can define a conditional expectation operator as any $T$ such that $T^{*}P \in P(\mathcal{Y})$ for any $P \in P(\mathcal{X})$. A random transformation $T^{*}$ is defined as the dual of some conditional expectation operator.
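On finite alphabets these definitions are concrete: $T$ averages a function over a channel $Q_{Y|X}$, and its dual $T^{*}$ pushes a distribution on $\mathcal{X}$ forward through the channel. A minimal sketch checking the duality relation (2.83); the dimensions and the channel are arbitrary choices.

```python
import numpy as np

# Finite-alphabet instance of (2.83): T is the conditional expectation operator
#   (Tf)(x) = E[f(Y) | X = x] = sum_y Q(y|x) f(y),
# and its dual T* sends P_X to the output distribution P_Y = P_X Q.
rng = np.random.default_rng(0)
nx, ny = 4, 3
Q = rng.random((nx, ny)); Q /= Q.sum(axis=1, keepdims=True)  # channel Q_{Y|X}
T = lambda f: Q @ f                      # conditional expectation operator

P = rng.random(nx); P /= P.sum()         # P in P(X)
Pout = P @ Q                             # T* P, a distribution on Y
f = rng.standard_normal(ny)

assert abs(P @ T(f) - Pout @ f) < 1e-12  # duality: (T*P)(f) = P(Tf)
assert abs(Pout.sum() - 1.0) < 1e-12     # T* maps P(X) into P(Y)
print("T* acts as the pushforward by the channel, as (2.83) requires")
```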
Remark 2.4.2 From the viewpoint of category theory (see for example [107][108]), $C_b$ is a functor from the category of topological spaces to the category of topological vector spaces. It is contravariant because for any continuous $\phi\colon \mathcal{X} \to \mathcal{Y}$ (a morphism between topological spaces), we have $C_b(\phi)\colon C_b(\mathcal{Y}) \to C_b(\mathcal{X})$, $u \mapsto u \circ \phi$, where $u \circ \phi$ denotes the composition of two continuous functions; that is, the functor reverses the arrows (the morphisms). On the other hand, $M$ is a covariant functor, and $M(\phi)\colon M(\mathcal{X}) \to M(\mathcal{Y})$, $\mu \mapsto \mu \circ \phi^{-1}$, where $\mu \circ \phi^{-1}(B) := \mu(\phi^{-1}(B))$ for any Borel measurable $B \subseteq \mathcal{Y}$. "Duality" itself is a contravariant functor between the categories of topological vector spaces (note the reversal of arrows in Fig. 2.1). Moreover, $C_b(\mathcal{X})^{*} = M(\mathcal{X})$ and $C_b(\phi)^{*} = M(\phi)$ if $\mathcal{X}$ and $\mathcal{Y}$ are compact metric spaces and $\phi\colon \mathcal{X} \to \mathcal{Y}$ is continuous. Definition 2.4.2 below can therefore be viewed as the special case where $\phi$ is the projection map:
Definition 2.4.2 Suppose $\phi\colon \mathcal{Z}_1 \times \mathcal{Z}_2 \to \mathcal{Z}_1$, $(z_1, z_2) \mapsto z_1$ is the projection onto the first coordinate.

• $C_b(\phi)\colon C_b(\mathcal{Z}_1) \to C_b(\mathcal{Z}_1 \times \mathcal{Z}_2)$ is called a canonical map, whose action is almost trivial: it sends a function of $z_1$ to itself, viewed as a function of $(z_1, z_2)$.

• $M(\phi)\colon M(\mathcal{Z}_1 \times \mathcal{Z}_2) \to M(\mathcal{Z}_1)$ is called marginalization, which simply takes a joint distribution to a marginal distribution.
Next, recall the Fenchel-Rockafellar duality from convex analysis (see [104, Theorem 1.9], or [94] in the case of finite-dimensional vector spaces), which usually refers to the $k = 1$ special case of the following result:
Theorem 2.4.3 Assume that $A$ is a topological vector space whose dual is $A^{*}$. Let $\Theta_j\colon A \to \mathbb{R}\cup\{+\infty\}$, $j = 0, 1, \dots, k$, for some positive integer $k$. Suppose there exist some $(u_j)_{j=1}^{k}$ and $u_0 := -(u_1 + \dots + u_k)$ such that

$$\Theta_j(u_j) < \infty, \quad j = 0,\dots,k \tag{2.84}$$

and $\Theta_0$ is upper semicontinuous at $u_0$. Then

$$-\inf_{\ell \in A^{*}}\left[\sum_{j=0}^{k}\Theta_j^{*}(\ell)\right] = \inf_{u_1,\dots,u_k \in A}\left[\Theta_0\left(-\sum_{j=1}^{k} u_j\right) + \sum_{j=1}^{k}\Theta_j(u_j)\right]. \tag{2.85}$$
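In one dimension with $k = 1$, the two sides of (2.85) can be computed numerically on a grid; the quadratics below are arbitrary choices satisfying the hypotheses.

```python
import numpy as np

# Grid check of (2.85) with A = R, k = 1, Theta0(u) = u^2/2, Theta1(u) = (u-1)^2/2.
# Analytically both sides equal 1/4 (attained at u = 1/2 and l = -1/2).
u = np.linspace(-5.0, 5.0, 20001)
theta0 = 0.5 * u ** 2
theta1 = 0.5 * (u - 1.0) ** 2
conj = lambda theta, l: np.max(l * u - theta)  # Legendre-Fenchel transform on the grid

# Left side: -inf_l [ Theta0*(l) + Theta1*(l) ]
lhs = -min(conj(theta0, l) + conj(theta1, l) for l in np.linspace(-3.0, 3.0, 601))
# Right side: inf_u [ Theta0(-u) + Theta1(u) ]
rhs = np.min(0.5 * (-u) ** 2 + 0.5 * (u - 1.0) ** 2)

assert abs(lhs - 0.25) < 1e-4 and abs(rhs - 0.25) < 1e-9
print("strong duality (2.85) verified numerically:", lhs, rhs)
```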
For completeness, we provide a proof of this result, which is based on the Hahn-Banach theorem (Theorem 2.4.4) and is similar to the proof of [104, Theorem 1.9].

Proof Let $m_0$ be the right side of (2.85). The $\le$ part of (2.85) follows trivially from the (weak) min-max inequality, since

$$m_0 = \inf_{u_0,\dots,u_k \in A}\ \sup_{\ell \in A^{*}}\left\{\sum_{j=0}^{k}\Theta_j(u_j) - \ell\left(\sum_{j=0}^{k} u_j\right)\right\} \tag{2.86}$$

$$\ge \sup_{\ell \in A^{*}}\ \inf_{u_0,\dots,u_k \in A}\left\{\sum_{j=0}^{k}\Theta_j(u_j) - \ell\left(\sum_{j=0}^{k} u_j\right)\right\} \tag{2.87}$$

$$= -\inf_{\ell \in A^{*}}\left[\sum_{j=0}^{k}\Theta_j^{*}(\ell)\right]. \tag{2.88}$$
It remains to prove the $\ge$ part, and it suffices to assume without loss of generality that $m_0 > -\infty$. Note that (2.84) also implies that $m_0 < +\infty$. Define the convex sets

$$C_j := \{(u, r) \in A\times\mathbb{R} : r > \Theta_j(u)\}, \quad j = 0,\dots,k; \tag{2.89}$$

$$D := \{(0, m) \in A\times\mathbb{R} : m \le m_0\}. \tag{2.90}$$

Observe that these are nonempty sets by the assumption (2.84). Also, $C_0$ has nonempty interior by the assumption that $\Theta_0$ is upper semicontinuous at $u_0$. Thus, the Minkowski sum

$$C := C_0 + \dots + C_k \tag{2.91}$$

is a convex set with a nonempty interior. Moreover, $C \cap D = \emptyset$. By the Hahn-Banach theorem (Theorem 2.4.4), there exists $(\ell, s) \in A^{*}\times\mathbb{R}$ such that

$$sm \le \ell\left(\sum_{j=0}^{k} u_j\right) + s\sum_{j=0}^{k} r_j \tag{2.92}$$

for any $m \le m_0$ and $(u_j, r_j) \in C_j$, $j = 0,\dots,k$. From (2.90) we see that (2.92) can only hold when $s \ge 0$. Moreover, from (2.84) and the upper semicontinuity of $\Theta_0$ at $u_0$, we see that the $\sum_{j=0}^{k} u_j$ in (2.92) can take values in a neighbourhood of $0 \in A$; hence $s \neq 0$. Thus, dividing both sides of (2.92) by $s$ and setting $\ell \leftarrow -\frac{\ell}{s}$, we see that

$$m_0 \le \inf_{u_0,\dots,u_k \in A}\left[-\ell\left(\sum_{j=0}^{k} u_j\right) + \sum_{j=0}^{k}\Theta_j(u_j)\right] \tag{2.93}$$

$$= -\left[\sum_{j=0}^{k}\Theta_j^{*}(\ell)\right] \tag{2.94}$$

which establishes the $\ge$ part of (2.85).
Theorem 2.4.4 (Hahn-Banach) Let $C$ and $D$ be convex, nonempty, disjoint subsets of a topological vector space $A$. If the interior of $C$ is nonempty, then there exists $\ell \in A^{*}$, $\ell \neq 0$, such that

$$\sup_{u \in D}\ell(u) \le \inf_{u \in C}\ell(u). \tag{2.95}$$

Remark 2.4.3 The assumption in Theorem 2.4.4 that $C$ has nonempty interior is only necessary in the infinite-dimensional case. However, even if $A$ in Theorem 2.4.3 is finite-dimensional, the assumption in Theorem 2.4.3 that $\Theta_0$ is upper semicontinuous at $u_0$ is still necessary, because this assumption was used not only in applying the Hahn-Banach theorem, but also in concluding that $s \neq 0$ in (2.92).
2.4.2 Compact $\mathcal{X}$

We first state a duality theorem for the case of compact alphabets to streamline the proof. Later we show that the argument can be extended to a particular non-compact case.¹⁰ Our proof, based on the Legendre-Fenchel duality (Theorem 2.4.3), was inspired by the proof of the Kantorovich duality in the theory of optimal transportation (see [104, Chapter 1], where the idea is credited to Brenier).

¹⁰Theorem 2.4.5 is not included in the conference paper [80], but was announced in the conference presentation.

Theorem 2.4.5 (Dual formulation of the forward-reverse Brascamp-Lieb inequality) Assume that

• $m$ and $l$ are positive integers, $d \in \mathbb{R}$, and $\mathcal{X}$ is a compact metric space (hence also a Polish space);

• for each $i = 1,\dots,l$, $b_i \in (0,\infty)$, $\nu_i$ is a finite Borel measure on a Polish space $\mathcal{Z}_i$, and $Q_{Z_i|X} = S_i^{*}$ is a random transformation;

• for each $j = 1,\dots,m$, $c_j \in (0,\infty)$, $\mu_j$ is a finite Borel measure on a Polish space $\mathcal{Y}_j$, and $Q_{Y_j|X} = T_j^{*}$ is a random transformation;

• for any $(P_{Z_i})$ such that $D(P_{Z_i}\|\nu_i) < \infty$, $i = 1,\dots,l$, there exists $P_X \in \bigcap_i (S_i^{*})^{-1} P_{Z_i}$ such that $\sum_{j=1}^{m} c_j D(P_{Y_j}\|\mu_j) < \infty$, where $P_{Y_j} := T_j^{*} P_X$.

Then the following two statements are equivalent:

1. If nonnegative continuous functions $(g_i)$, $(f_j)$ are bounded away from 0 and such that

$$\sum_{i=1}^{l} b_i S_i \log g_i \le \sum_{j=1}^{m} c_j T_j \log f_j \tag{2.96}$$

then (see (2.78) for the notation of the integral)

$$\prod_{i=1}^{l}\left[\nu_i(g_i)\right]^{b_i} \le \exp(d)\prod_{j=1}^{m}\left[\mu_j(f_j)\right]^{c_j}. \tag{2.97}$$

2. For any $(P_{Z_i})$ such that $D(P_{Z_i}\|\nu_i) < \infty$¹¹, $i = 1,\dots,l$,

$$\sum_{i=1}^{l} b_i D(P_{Z_i}\|\nu_i) + d \ge \inf_{P_X}\ \sum_{j=1}^{m} c_j D(P_{Y_j}\|\mu_j) \tag{2.98}$$

where the infimum is over $P_X$ such that $S_i^{*} P_X = P_{Z_i}$, $i = 1,\dots,l$, and $P_{Y_j} := T_j^{*} P_X$, $j = 1,\dots,m$.

¹¹Of course, this assumption is not essential (once we adopt the convention that the infimum in (2.98) is $+\infty$ when it runs over an empty set).

Proof We can safely assume $d = 0$ below without loss of generality (since otherwise we can substitute $\mu_1 \leftarrow \exp\left(\frac{d}{c_1}\right)\mu_1$).
Figure 2.1: Diagrams for Theorem 2.4.5 (shown for $l = m = 2$): the maps $S_i\colon C_b(\mathcal{Z}_i) \to C_b(\mathcal{X})$ and $T_j\colon C_b(\mathcal{Y}_j) \to C_b(\mathcal{X})$ act on the function spaces (containing $g_i$ and $f_j$), while the dual maps $S_i^{*}\colon P(\mathcal{X}) \to P(\mathcal{Z}_i)$ and $T_j^{*}\colon P(\mathcal{X}) \to P(\mathcal{Y}_j)$ act on the spaces of probability measures (sending $P_X$ to $P_{Z_i}$ and $P_{Y_j}$).
1)⇒2) This is the nontrivial direction, which relies on certain (strong) min-max type results. In Theorem 2.4.3, put

$$\Theta_0\colon u \in C_b(\mathcal{X}) \mapsto \begin{cases} 0 & u \le 0; \\ +\infty & \text{otherwise.}\end{cases} \tag{2.99}$$

Then

$$\Theta_0^{*}\colon \pi \in M(\mathcal{X}) \mapsto \begin{cases} 0 & \pi \ge 0; \\ +\infty & \text{otherwise.}\end{cases} \tag{2.100}$$

For each $j = 1,\dots,m$, set

$$\Theta_j(u) = c_j\inf\log\mu_j\left(\exp\left(\frac{1}{c_j}v\right)\right) \tag{2.101}$$

where the infimum is over $v \in C_b(\mathcal{Y}_j)$ such that $u = T_j v$; if there is no such $v$, then $\Theta_j(u) := +\infty$ as a convention. Observe that

• $\Theta_j$ is convex;

• $\Theta_j(u) > -\infty$ for any $u \in C_b(\mathcal{X})$. If otherwise, for any $P_X$ and $P_{Y_j} := T_j^{*} P_X$ we would have

$$D(P_{Y_j}\|\mu_j) = \sup_{v}\{P_{Y_j}(v) - \log\mu_j(\exp(v))\} \tag{2.102}$$

$$= \sup_{v}\{P_X(T_j v) - \log\mu_j(\exp(v))\} \tag{2.103}$$

$$= \sup_{u \in C_b(\mathcal{X})}\left\{P_X(u) - \frac{1}{c_j}\Theta_j(c_j u)\right\} \tag{2.104}$$

$$= +\infty \tag{2.105}$$

which contradicts the assumption in the theorem;

• From the steps (2.102)-(2.104), we see that $\Theta_j^{*}(\pi) = c_j D(T_j^{*}\pi\|\mu_j)$ for any $\pi \in M(\mathcal{X})$, where the definition of $D(\cdot\|\mu_j)$ is extended using the Donsker-Varadhan formula (that is, it is infinite when the argument is not a probability measure).

Finally, for the given $(P_{Z_i})_{i=1}^{l}$, choose

$$\Theta_{m+1}\colon u \in C_b(\mathcal{X}) \mapsto \begin{cases}\displaystyle\sum_{i=1}^{l} P_{Z_i}(w_i) & \text{if } u = \sum_{i=1}^{l} S_i w_i \text{ for some } w_i \in C_b(\mathcal{Z}_i); \\ +\infty & \text{otherwise.}\end{cases} \tag{2.106}$$

Notice that

• $\Theta_{m+1}$ is convex;
• $\Theta_{m+1}$ is well-defined (that is, the choice of $(w_i)$ in (2.106) is inconsequential). Indeed, if $(w_i)_{i=1}^{l}$ is such that $\sum_{i=1}^{l} S_i w_i = 0$, then

$$\sum_{i=1}^{l} P_{Z_i}(w_i) = \sum_{i=1}^{l} S_i^{*} P_X(w_i) \tag{2.107}$$

$$= \sum_{i=1}^{l} P_X(S_i w_i) \tag{2.108}$$

$$= 0, \tag{2.109}$$

where $P_X$ is such that $S_i^{*} P_X = P_{Z_i}$, $i = 1,\dots,l$, whose existence is guaranteed by the assumption of the theorem. This also shows that $\Theta_{m+1} > -\infty$.

• We have

$$\Theta_{m+1}^{*}(\pi) := \sup_{u}\{\pi(u) - \Theta_{m+1}(u)\} \tag{2.110}$$

$$= \sup_{w_1,\dots,w_l}\left\{\pi\left(\sum_{i=1}^{l} S_i w_i\right) - \sum_{i=1}^{l} P_{Z_i}(w_i)\right\} \tag{2.111}$$

$$= \sup_{w_1,\dots,w_l}\left\{\sum_{i=1}^{l} S_i^{*}\pi(w_i) - \sum_{i=1}^{l} P_{Z_i}(w_i)\right\} \tag{2.112}$$

$$= \begin{cases} 0 & \text{if } S_i^{*}\pi = P_{Z_i},\ i = 1,\dots,l; \\ +\infty & \text{otherwise.}\end{cases} \tag{2.113}$$
Invoking Theorem 2.4.3 (where the $u_j$ in Theorem 2.4.3 can be chosen as the constant function $u_j \equiv 1$, $j = 1,\dots,m+1$):

$$\inf_{\pi:\ \pi\ge 0,\ S_i^{*}\pi = P_{Z_i}}\ \sum_{j=1}^{m} c_j D(T_j^{*}\pi\|\mu_j) \tag{2.114}$$

$$= -\inf_{v^m, w^l:\ \sum_{j=1}^{m} T_j v_j + \sum_{i=1}^{l} S_i w_i \ge 0}\left[\sum_{j=1}^{m} c_j\log\mu_j\left(\exp\left(\frac{1}{c_j}v_j\right)\right) + \sum_{i=1}^{l} P_{Z_i}(w_i)\right]. \tag{2.115}$$

Note that the left side of (2.115) is exactly the right side of (2.98). For any $\varepsilon > 0$, choose $v_j \in C_b(\mathcal{Y}_j)$, $j = 1,\dots,m$ and $w_i \in C_b(\mathcal{Z}_i)$, $i = 1,\dots,l$, such that $\sum_{j=1}^{m} T_j v_j + \sum_{i=1}^{l} S_i w_i \ge 0$ and

$$\varepsilon - \sum_{j=1}^{m} c_j\log\mu_j\left(\exp\left(\frac{1}{c_j}v_j\right)\right) - \sum_{i=1}^{l} P_{Z_i}(w_i) > \inf_{\pi:\ \pi\ge 0,\ S_i^{*}\pi = P_{Z_i}}\ \sum_{j=1}^{m} c_j D(T_j^{*}\pi\|\mu_j). \tag{2.116}$$

Now, invoking (2.97) with $f_j := \exp\left(\frac{1}{c_j}v_j\right)$, $j = 1,\dots,m$, and $g_i := \exp\left(-\frac{1}{b_i}w_i\right)$, $i = 1,\dots,l$, we upper bound the left side of (2.116) by

$$\varepsilon - \sum_{i=1}^{l} b_i\log\nu_i(g_i) + \sum_{i=1}^{l} b_i P_{Z_i}(\log g_i) \le \varepsilon + \sum_{i=1}^{l} b_i D(P_{Z_i}\|\nu_i) \tag{2.117}$$

where the last step follows by the Donsker-Varadhan formula. Therefore (2.98) is established, since $\varepsilon > 0$ is arbitrary.
2)⇒1) Since $\nu_i$ is finite and $g_i$ is bounded by assumption, we have $\nu_i(g_i) < \infty$, $i = 1,\dots,l$. Moreover, (2.97) is trivially true when $\nu_i(g_i) = 0$ for some $i$, so we will assume below that $\nu_i(g_i) \in (0,\infty)$ for each $i$. Define $P_{Z_i}$ by

$$\frac{\mathrm{d}P_{Z_i}}{\mathrm{d}\nu_i} = \frac{g_i}{\nu_i(g_i)}, \quad i = 1,\dots,l. \tag{2.118}$$

Then for any $\varepsilon > 0$,

$$\sum_{i=1}^{l} b_i\log\nu_i(g_i) = \sum_{i=1}^{l} b_i\left[P_{Z_i}(\log g_i) - D(P_{Z_i}\|\nu_i)\right] \tag{2.119}$$

$$< \sum_{j=1}^{m} c_j P_{Y_j}(\log f_j) + \varepsilon - \sum_{j=1}^{m} c_j D(P_{Y_j}\|\mu_j) \tag{2.120}$$

$$\le \varepsilon + \sum_{j=1}^{m} c_j\log\mu_j(f_j) \tag{2.121}$$

where

• (2.120) uses the Donsker-Varadhan formula, and we have chosen $P_X$, $P_{Y_j} := T_j^{*} P_X$, $j = 1,\dots,m$, such that

$$\sum_{i=1}^{l} b_i D(P_{Z_i}\|\nu_i) > \sum_{j=1}^{m} c_j D(P_{Y_j}\|\mu_j) - \varepsilon; \tag{2.122}$$

• (2.121) also follows from the Donsker-Varadhan formula.

The result follows since $\varepsilon > 0$ can be arbitrary.
Remark 2.4.4 The infimum in (2.98) is in fact achieved: for any $(P_{Z_i})$, there exists a $P_X$ that minimizes $\sum_{j=1}^{m} c_j D(P_{Y_j}\|\mu_j)$ subject to the constraints $S_i^{*} P_X = P_{Z_i}$, $i = 1,\dots,l$, where $P_{Y_j} := T_j^{*} P_X$, $j = 1,\dots,m$. Indeed, since the singleton $\{P_{Z_i}\}$ is weak$^*$-closed and $S_i^{*}$ is weak$^*$-continuous¹², the set $\bigcap_{i=1}^{l}(S_i^{*})^{-1} P_{Z_i}$ is weak$^*$-closed in $M(\mathcal{X})$; hence its intersection with $P(\mathcal{X})$ is weak$^*$-compact in $P(\mathcal{X})$, because $P(\mathcal{X})$ is weak$^*$-compact by (a simple version, for the setting of a compact underlying space $\mathcal{X}$, of) the Prokhorov theorem [109]. Moreover, by the weak$^*$-lower semicontinuity of $D(\cdot\|\mu_j)$ (easily seen from the variational formula/Donsker-Varadhan formula for the relative entropy, cf. [43]) and the weak$^*$-continuity of $T_j^{*}$, $j = 1,\dots,m$, we see that $\sum_{j=1}^{m} c_j D(T_j^{*} P_X\|\mu_j)$ is weak$^*$-lower semicontinuous in $P_X$, and hence the existence of a minimizing $P_X$ is established.

Remark 2.4.5 Abusing the terminology from min-max theory, Theorem 2.4.5 may be interpreted as a "strong duality" result, which establishes the equivalence of two optimization problems. The 1)⇒2) part is the nontrivial direction, which requires regularity of the spaces. In contrast, the 2)⇒1) direction can be thought of as a "weak duality", which establishes only a partial relation but holds for more general spaces.

¹²Generally, if $T\colon A \to B$ is a continuous map between two topological vector spaces, then $T^{*}\colon B^{*} \to A^{*}$ is a weak$^*$-continuous map between the dual spaces. Indeed, if $y_n \to y$ is a weak$^*$-convergent sequence in $B^{*}$, meaning $y_n(b) \to y(b)$ for any $b \in B$, then we must have $T^{*}y_n(a) = y_n(Ta) \to y(Ta) = T^{*}y(a)$ for any $a \in A$, meaning that $T^{*}y_n$ converges to $T^{*}y$ in the weak$^*$ topology.
Remark 2.4.6 The equivalent formulations of the forward Brascamp-Lieb inequality (Theorem 2.2.3) can be recovered from Theorem 2.4.5 by taking $l = 1$, $\mathcal{Z}_1 = \mathcal{X}$, and letting $S_1$ be the identity map/isomorphism, except that Theorem 2.2.3 is established for completely general alphabets. In other words, the forward Brascamp-Lieb inequality is the special case of the forward-reverse Brascamp-Lieb inequality in which there is only one reverse channel, and that channel is the identity.
2.4.3 Noncompact $\mathcal{X}$

Our proof of 1)⇒2) in Theorem 2.4.5 makes use of the Hahn-Banach theorem, and hence relies crucially on the fact that the measure space is the dual of the function space. Naively, one might want to extend the proof to the case of locally compact $\mathcal{X}$ by considering $C_0(\mathcal{X})$ instead of $C_b(\mathcal{X})$, so that the dual space is still $M(\mathcal{X})$. However, this would not work: consider the case when $\mathcal{X} = \mathcal{Z}_1\times\dots\times\mathcal{Z}_l$ and each $S_i$ is the canonical map. Then $\Theta_{m+1}(u)$ as defined in (2.106) is $+\infty$ unless $u \equiv 0$, and thus $\Theta_{m+1}^{*} \equiv 0$. Luckily, we can still work with $C_b(\mathcal{X})$; in this case $\ell \in C_b(\mathcal{X})^{*}$ may not be a measure, but we can decompose it as $\ell = \pi + R$, where $\pi \in M(\mathcal{X})$ and $R$ is a linear functional "supported at infinity". Below we use the techniques in [104, Chapter 1.3] to prove a particular extension of Theorem 2.4.5 to a non-compact case.
Theorem 2.4.6 Theorem 2.4.5 still holds if

• the assumption that $\mathcal{X}$ is a compact metric space is relaxed to the assumption that it is a locally compact and $\sigma$-compact Polish space;

• $\mathcal{X} = \prod_{i=1}^{l}\mathcal{Z}_i$ and $S_i\colon C_b(\mathcal{Z}_i) \to C_b(\mathcal{X})$, $i = 1,\dots,l$, are canonical maps (see Definition 2.4.2).
Proof The proof of the "weak duality" part 2)⇒1) still works in the noncompact case, so we only need to explain what changes need to be made in the proof of the 1)⇒2) part. Let $\Theta_0$ be defined as before, in (2.99). Then for any $\ell \in C_b(\mathcal{X})^{*}$,

$$\Theta_0^{*}(\ell) = \sup_{u \le 0}\ell(u) \tag{2.123}$$

which is 0 if $\ell$ is nonnegative (in the sense that $\ell(u) \ge 0$ for every $u \ge 0$), and $+\infty$ otherwise. This means that when computing the infimum on the left side of (2.85), we only need to take into account the nonnegative $\ell$.

Next, let $\Theta_{m+1}$ also be defined as before. Then, directly from the definition, we have, for any $\ell \in C_b^{*}(\mathcal{X})$,

$$\Theta_{m+1}^{*}(\ell) = \begin{cases} 0 & \text{if } \ell\left(\sum_i S_i w_i\right) = \sum_i P_{Z_i}(w_i),\ \forall w_i \in C_b(\mathcal{Z}_i),\ i = 1,\dots,l; \\ +\infty & \text{otherwise.}\end{cases} \tag{2.124}$$

In general, the condition in the first line of (2.124) does not imply that $\ell$ is a measure. However, if $\ell$ is also nonnegative, then using a technical result in [104, Lemma 1.25] we can simplify further:

$$\Theta_{m+1}^{*}(\ell) = \begin{cases} 0 & \text{if } \ell \in M(\mathcal{X}) \text{ and } S_i^{*}\ell = P_{Z_i},\ i = 1,\dots,l; \\ +\infty & \text{otherwise.}\end{cases} \tag{2.125}$$

This further shows that when we compute the left side of (2.85), the infimum can be taken over $\ell$ which is a coupling of $(P_{Z_i})$. In particular, if $\ell$ is a probability measure, then $\Theta_j^{*}(\ell) = c_j D(T_j^{*}\ell\|\mu_j)$ still holds with the $\Theta_j$ defined in (2.101), $j = 1,\dots,m$. Thus the rest of the proof can proceed as before.
Remark 2.4.7 The second assumption is made in order to achieve (2.125) in the proof.

Remark 2.4.8 In [80] we studied a version of the "reverse Brascamp-Lieb inequality" which is the special case of Theorem 2.4.6 with only one forward channel: in the setting of Theorem 2.4.5, consider $m = 1$, $c_1 = 1$, $\mathcal{X} = \mathcal{Z}_1\times\dots\times\mathcal{Z}_l$. Let $S_i$ be the canonical map and $g_i \leftarrow g_i^{\frac{1}{b_i}}$, $i = 1,\dots,l$. Then (2.97) becomes

$$\prod_{i=1}^{l}\|g_i\|_{\frac{1}{b_i}} \le \exp(d)\,\mu_1(f_1) \tag{2.126}$$

for any nonnegative continuous $(g_i)_{i=1}^{l}$ and $f_1$ bounded away from 0 and $+\infty$ such that

$$\sum_{i=1}^{l}\log g_i(z_i) \le \mathbb{E}[\log f_1(Y_1)|Z^l = z^l], \quad \forall z^l. \tag{2.127}$$

Note that (2.127) can be simplified in the deterministic special case: let $\phi\colon \mathcal{X} \to \mathcal{Y}_1$ be any continuous function, and let $T_1 \leftarrow C_b(\phi)$ (that is, $T_1$ sends a function $f$ on $\mathcal{Y}_1$ to the function $f \circ \phi$ on $\mathcal{X}$). Then (2.127) becomes

$$\prod_{i=1}^{l} g_i(z_i) \le f_1(\phi(z_1,\dots,z_l)), \quad \forall z_1,\dots,z_l. \tag{2.128}$$

Then the optimal choice of $f_1$ admits an explicit formula, since for any given $(g_i)$, to verify (2.126) we only need to consider

$$f_1(y) := \sup_{\phi(z_1,\dots,z_l) = y}\left\{\prod_i g_i(z_i)\right\}, \quad \forall y. \tag{2.129}$$

Thus, when $\phi$ is a linear function, (2.126) is essentially Barthe's formulation of the reverse BL inequality (2.76) (the exception being that Theorem 2.4.6, in contrast to (2.76), restricts attention to finite $\nu_i$, $i = 1,\dots,l$, and $\mu_1$). The more straightforward part of the duality (entropic inequality ⇒ functional inequality) has essentially been proved by Lehec [103, Theorem 18] in a special setting.
2.4.4 Extension to General Convex Functionals
For certain applications (e.g. the transportation-cost inequalities, see Section 2.5.7
ahead), we may be interested in convex functionals beyond the relative entropy. Recall
the definition of the Legendre-Fenchel transform in Definition 2.4.1. Then we have,
from the theory of convex analysis (see for example [105, Lemma 4.5.8]),
$$
\Lambda(u) = \sup_{\ell \in C_b(\mathcal{X})^*} [\ell(u) - \Lambda^*(\ell)], \tag{2.130}
$$
for any $u \in C_b(\mathcal{X})$. Moreover, if $\Lambda^*(\ell) = +\infty$ for any $\ell \notin \mathcal{P}(\mathcal{X})$, then from (2.130) we must also have
$$
\Lambda(u) = \sup_{\ell \in \mathcal{P}(\mathcal{X})} [\ell(u) - \Lambda^*(\ell)]. \tag{2.131}
$$
For example, the function Λ defined in (2.82) satisfies the property in (2.131). We
need this property in the proof of Theorem 2.4.5 because of step (2.119). From the
proof of Theorem 2.4.5 we see that we can obtain the following generalization to
convex functionals with no additional cost. An application of this generalization to
transportation-cost inequalities is given in Section 2.5.7.
Theorem 2.4.7 Assume that

• $m$ and $l$ are positive integers, $d \in \mathbb{R}$, $\mathcal{X}$ is a compact metric space (hence also a Polish space);

• For each $i = 1,\dots,l$, $\mathcal{Z}_i$ is a Polish space, $\Lambda_i \colon C_b(\mathcal{Z}_i) \to \mathbb{R} \cup \{+\infty\}$ is proper convex such that $\Lambda_i^*(\ell) = +\infty$ for $\ell \notin \mathcal{P}(\mathcal{Z}_i)$, and $S_i \colon C_b(\mathcal{Z}_i) \to C_b(\mathcal{X})$ is a conditional expectation operator;

• For each $j = 1,\dots,m$, $\mathcal{Y}_j$ is a Polish space, $\Theta_j \colon C_b(\mathcal{Y}_j) \to \mathbb{R} \cup \{+\infty\}$ is proper convex such that $\Theta_j(u) < \infty$ for some $u \in C_b(\mathcal{Y}_j)$ which is bounded below, and $T_j \colon C_b(\mathcal{Y}_j) \to C_b(\mathcal{X})$ is a conditional expectation operator;

• For any $\ell_{Z_i} \in \mathcal{M}(\mathcal{Z}_i)$ such that $\Lambda_i^*(\ell_{Z_i}) < \infty$, $i = 1,\dots,l$, there exists $\ell_X \in \bigcap_i (S_i^*)^{-1} \ell_{Z_i}$ such that $\sum_{j=1}^m \Theta_j^*(\ell_{Y_j}) < \infty$, where $\ell_{Y_j} := T_j^* \ell_X$.

Then the following two statements are equivalent:

1. If $g_i \in C_b(\mathcal{Z}_i)$, $f_j \in C_b(\mathcal{Y}_j)$, $i = 1,\dots,l$, $j = 1,\dots,m$ satisfy
$$
\sum_{i=1}^l S_i g_i \le \sum_{j=1}^m T_j f_j \tag{2.132}
$$
then
$$
\sum_{i=1}^l \Lambda_i(g_i) \le \sum_{j=1}^m \Theta_j(f_j). \tag{2.133}
$$

2. For any^{13} $\ell_{Z_i} \in \mathcal{M}(\mathcal{Z}_i)$, $i = 1,\dots,l$,
$$
\sum_{i=1}^l \Lambda_i^*(\ell_{Z_i}) \ge \inf_{\ell_X} \sum_{j=1}^m \Theta_j^*(\ell_{Y_j}) \tag{2.134}
$$
where the infimum is over $\ell_X$ such that $S_i^* \ell_X = \ell_{Z_i}$, $i = 1,\dots,l$, and $\ell_{Y_j} := T_j^* \ell_X$, $j = 1,\dots,m$.

Remark 2.4.9 Just like Theorem 2.4.6, it is possible to extend Theorem 2.4.7 to the case of noncompact $\mathcal{X}$, provided that $\mathcal{X} = \mathcal{Z}_1 \times \cdots \times \mathcal{Z}_l$ and $S_i$, $i = 1,\dots,l$ are canonical maps.

^{13} Since by assumption $\Lambda_i^*(\ell_{Z_i}) = +\infty$ when $\ell_{Z_i} \notin \mathcal{P}(\mathcal{Z}_i)$, in which case (2.134) is trivially true, it is equivalent to assume here that $\ell_{Z_i} \in \mathcal{P}(\mathcal{Z}_i)$.
2.5 Some Special Cases of the Forward-Reverse
Brascamp-Lieb Inequality
In this section we discuss some notable special cases of the duality results for the
forward-reverse Brascamp-Lieb Inequality (Theorems 2.2.3-2.4.7). The relationship
among the involved inequalities is illustrated in Figure 2.2. The equivalent formulations of the Rényi divergence, the strong data processing inequality, hypercontractivity and its reverse (with positive or negative parameters), the Loomis-Whitney inequality/Shearer's lemma, and the transportation-cost inequalities have been proved by different methods (see for example [110][7][111][112][74][79][113]). In some of these examples (e.g. strong data processing [7]) the previous approaches rely heavily on the finiteness of the alphabet, whereas the present approach (essentially based on the nonnegativity of the relative entropy) is simpler and holds for general alphabets.
Figure 2.2: The "partial relations" between the inequalities discussed in this chapter with respect to implication. [Diagram relating: forward-reverse Brascamp-Lieb (2.97); forward Brascamp-Lieb (2.14); reverse Brascamp-Lieb (2.126); strong data processing inequality (2.144); hypercontractivity (2.161); reverse hypercontractivity with positive parameters (2.163); reverse hypercontractivity with one negative parameter (2.165).]
Due in part to their utility in establishing impossibility bounds, these functional
inequalities have attracted a lot of attention in information theory [114][115][116]
[117][118][101][73][119], theoretical computer science [71][120][121][122][86], and
statistics [123][124][125][72][126][127], to name only a small subset of the literature.
61
2.5.1 Variational Formula of Rényi Divergence

As the first example, we show how (2.14) recovers the variational formula of Rényi divergence [128][129] in a special case. A prototype of the variational formula of Rényi divergence appeared in the context of control theory [128] as a technical lemma. Its utility in information theory was then noticed by [129][110], which further developed the result and elaborated on its applications in other areas of probability theory.
Suppose $R$ and $Q$ are nonnegative measures on $(\mathcal{X}, \mathcal{F})$, $\alpha \in (0,1) \cup (1,\infty)$, and $g \colon \mathcal{X} \to \mathbb{R}$ is a bounded measurable function. Also, let $T$ be a probability measure such that $R, Q \ll T$. Define the Rényi divergence
$$
D_\alpha(Q\|R) := \frac{1}{\alpha-1} \log\left(E\left[\exp\left(\alpha\, \imath_{Q\|T}(X) + (1-\alpha)\, \imath_{R\|T}(X)\right)\right]\right) \tag{2.135}
$$
where $X \sim T$; this definition is independent of the particular choice of the reference measure $T$ [130]. Then the variational formula of Rényi divergence [129, Remark 2.2] can be equivalently stated as the functional inequality^{14}
$$
\frac{1}{\alpha-1} \log E[\exp((\alpha-1) g(\bar{X}))] - \frac{1}{\alpha} \log E[\exp(\alpha g(X))] \le \frac{1}{\alpha} D_\alpha(Q\|R) \tag{2.136}
$$
where $X \sim R$ and $\bar{X} \sim Q$, with equality achieved when
$$
\frac{dQ}{dR}(x) = \frac{\exp(g(x))}{E[\exp(g(X))]}. \tag{2.137}
$$
The well-known variational formula of the relative entropy (see e.g. [43]) can be recovered by taking $\alpha \to 1$. In the $\alpha \in (1,\infty)$ case, we can choose $\exp(g)$ to be the indicator function of an arbitrary measurable set $A \subseteq \mathcal{X}$, to obtain the logarithmic probability comparison bound (LPCB) [129]^{15}
$$
\frac{1}{\alpha-1} \log Q(A) - \frac{1}{\alpha} \log R(A) \le \frac{1}{\alpha} D_\alpha(Q\|R). \tag{2.138}
$$

^{14} Note that our definition of Rényi divergence differs from that of [129] by a factor of $\alpha$.
Now we give a new proof of the functional inequality (2.136) using a well-known entropic inequality in information theory. First consider $\alpha \in (1,\infty)$. In Theorem 2.2.3, set $m \leftarrow 1$, $\mathcal{Y}_1 = \mathcal{X}$, $c \leftarrow \frac{\alpha-1}{\alpha}$, $P_{Y_1|X} = \mathrm{id}$ (the identity mapping). We may assume without loss of generality that $Q \ll R$, since otherwise $D_\alpha(Q\|R) = \infty$ and (2.136) always holds. Thus, setting the cost function as
$$
d(x) \leftarrow -\imath_{Q\|R}(x) + \frac{\alpha-1}{\alpha} D_\alpha(Q\|R) \tag{2.139}
$$
we see that (2.15) reduces to
$$
D(P\|R) + E\left[-\imath_{Q\|R}(X)\right] + \frac{\alpha-1}{\alpha} D_\alpha(Q\|R) \ge \frac{\alpha-1}{\alpha} D(P\|R) \tag{2.140}
$$
which, by our convention in Remark 2.2.2, can be simplified to
$$
D(P\|Q) + \frac{\alpha-1}{\alpha} D_\alpha(Q\|R) \ge \frac{\alpha-1}{\alpha} D(P\|R). \tag{2.141}
$$
It is a well-known result that (2.141) holds for all $P$ absolutely continuous with respect to $Q$ and $R$ (see for example [130, Theorem 30], [131, Theorem 1], [132, Corollary 2]), due to its relation to the fundamental problem of characterizing the error exponents in binary hypothesis testing. By Theorem 2.2.3 with $m = 1$ we translate (2.140) into the functional inequality
$$
E\left[\exp\left(\log f(X) + \imath_{Q\|R}(X) - \frac{\alpha-1}{\alpha} D_\alpha(Q\|R)\right)\right] \le \left(E\left[f^{\frac{\alpha}{\alpha-1}}(X)\right]\right)^{\frac{\alpha-1}{\alpha}} \tag{2.142}
$$
for all nonnegative measurable $f$, where $X \sim R$. Finally, by taking logarithms, dividing both sides by $\alpha - 1$, and setting $f \leftarrow \exp((\alpha-1)g)$, the functional inequality (2.136) is recovered.

^{15} [129] focuses on the case of probability measures, but (2.138) continues to hold if $Q$ and $R$ are replaced by any unnormalized nonnegative measures.
Note that the choice of $d(\cdot)$ in (2.139) is simply for the purpose of change of measure, since the two relative entropy terms in (2.141) have different reference measures $Q$ and $R$. Thus, an alternative proof is to take $d(\cdot) = 0$ but invoke the extension of Theorem 2.2.3 in Remark 2.2.3 with $\mu \leftarrow Q$ and $\nu \leftarrow R$.

The case of $\alpha \in (0,1)$ can be proved in a similar fashion using Theorem 2.4.5 with $m \leftarrow 2$, $l \leftarrow 1$, $\mathcal{X} = \mathcal{Y}_1 = \mathcal{Y}_2 \leftarrow \mathcal{X}$, $Q_{Y_1|X} = Q_{Y_2|X} = \mathrm{id}$, $Q_{Y_1} \leftarrow Q$, $Q_{Y_2} \leftarrow R$, and $|\mathcal{Z}| = 1$; we omit the details here.

Note that the original proofs of the functional inequality (2.136) in [128][129] were based on Hölder's inequality, whereas the present proof relies on the duality between functional inequalities and entropic inequalities, and the property (2.141) (which amounts to the nonnegativity of relative entropy). Moreover, the weaker probability version (2.138) can be easily proved by a data processing argument; see for example [42, Section II.B][133].
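For finite alphabets, the variational formula (2.136) and the equality condition (2.137) are easy to check numerically. The following sketch (assuming NumPy; the random instances and variable names are ours, not from the text) verifies that random bounded $g$ never exceed the bound and that $g = \log\frac{dQ}{dR}$ attains it:

```python
import numpy as np

def renyi_div(q, r, alpha):
    # D_alpha(Q || R) in nats for strictly positive discrete distributions.
    return float(np.log(np.sum(q**alpha * r**(1 - alpha))) / (alpha - 1))

rng = np.random.default_rng(0)
alpha = 2.0
r = rng.random(5); r /= r.sum()   # X ~ R
q = rng.random(5); q /= q.sum()   # Xbar ~ Q

rhs = renyi_div(q, r, alpha) / alpha   # right side of (2.136)

def lhs(g):
    # left side of (2.136)
    t1 = np.log(np.sum(q * np.exp((alpha - 1) * g))) / (alpha - 1)
    t2 = np.log(np.sum(r * np.exp(alpha * g))) / alpha
    return float(t1 - t2)

for _ in range(100):               # (2.136) holds for random bounded g
    assert lhs(rng.normal(size=5)) <= rhs + 1e-12

g_star = np.log(q / r)             # the choice (2.137) attains equality
assert abs(lhs(g_star) - rhs) < 1e-9
```

The equality check reflects that with $g = \log\frac{dQ}{dR}$ we have $E_R[e^g] = 1$, so (2.137) holds exactly.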
2.5.2 Strong Data Processing Constant
The strong data processing inequality (SDPI) [7][8][9] has received considerable interest recently. It has proved fruitful in providing impossibility bounds in various problems; see [117] for a recent list of its applications. It generally refers to an inequality of the form
$$
D(P_X\|Q_X) \ge c\, D(P_Y\|Q_Y), \quad \text{for all } P_X \ll Q_X \tag{2.143}
$$
where $P_X \to Q_{Y|X} \to P_Y$, and we have fixed $Q_{XY} = Q_X Q_{Y|X}$. The conventional data processing inequality corresponds to the case of $c = 1$. The study of the best (largest) constant $c$ for (2.143) to hold can be traced to Ahlswede and Gács [7], who showed, among other things, its equivalence to the functional inequality
$$
E[\exp(E[\log f(Y)\,|\,X])] \le \|f\|_{1/c}, \quad \text{for all nonnegative } f. \tag{2.144}
$$
The strong data processing inequality can be viewed as a special case of the forward-reverse Brascamp-Lieb inequality where there is only one forward channel and one identity reverse channel. In other words, the equivalence between (2.143) and (2.144) can be readily seen from either Theorem 2.2.3 or Remark 2.4.8. The original proof of this equivalence [7, Theorem 5], on the other hand, relies on a limiting property of hypercontractivity, which, in turn, relies heavily on the finiteness of the alphabet, and the proof is quite technical even in that case.

As we saw in Section 2.5.1, a functional inequality often implies an inequality between probabilities of sets when specialized to indicator functions. In the case of (2.144), however, a more judicious choice is
$$
f(y) = \left(1_A(y) + Q_Y[A]\, 1_{\bar{A}}(y)\right)^c \tag{2.145}
$$
where $A$ is an arbitrary measurable subset of $\mathcal{Y}$ and $\bar{A} := \mathcal{Y} \setminus A$. Then using (2.144),
$$
\begin{aligned}
Q_Y^{c\varepsilon}[A]\, Q_X\left[x : Q_{Y|X=x}[A] \ge 1-\varepsilon\right]
&= Q_Y^{c\varepsilon}[A]\, Q_X\left[x : Q_{Y|X=x}[\bar{A}] \le \varepsilon\right] &&(2.146)\\
&\le \int \exp\left(\log Q_Y[A]^{\,c\, Q_{Y|X=x}[\bar{A}]}\right) dQ_X(x) &&(2.147)\\
&= E[\exp(E[\log f(Y)\,|\,X])] &&(2.148)\\
&\le E^c[f^{1/c}(Y)] &&(2.149)\\
&\le 2^c\, Q_Y^c[A]. &&(2.150)
\end{aligned}
$$
Rearranging, we obtain the following bound on conditional probabilities:
$$
Q_X\left[x : Q_{Y|X=x}[A] \ge 1-\varepsilon\right] \le 2^c\, Q_Y^{c(1-\varepsilon)}[A] \tag{2.151}
$$
which, by a blowing-up lemma argument (cf. [27]), would imply the asymptotic result of [27, Theorem 1], a useful tool in establishing strong converses in source coding problems. Note that [27, Section 2] proved a result essentially the same as (2.151) by working with (2.143) rather than (2.144).^{16}
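As a sanity check, (2.143) can be verified numerically for a binary symmetric channel with crossover $\delta$ and uniform input, for which the contraction coefficient of relative entropy is classically known to be $(1-2\delta)^2$, i.e., the best constant in (2.143) is $c = (1-2\delta)^{-2}$. A minimal sketch (assuming NumPy; the specific channel and grid are our choice, not from the text):

```python
import numpy as np

def kl(p, q):
    # Relative entropy D(p || q) in nats for strictly positive vectors.
    return float(np.sum(p * np.log(p / q)))

delta = 0.2                                   # BSC crossover probability
W = np.array([[1 - delta, delta],
              [delta, 1 - delta]])            # rows: Q_{Y|X=x}
qx = np.array([0.5, 0.5])                     # uniform Q_X
qy = qx @ W
c = (1 - 2 * delta) ** -2                     # best constant in (2.143), classical fact

for p in np.linspace(0.01, 0.99, 197):        # sweep input distributions P_X
    px = np.array([p, 1 - p])
    py = px @ W                               # P_X -> Q_{Y|X} -> P_Y
    assert kl(px, qx) >= c * kl(py, qy) - 1e-12
```

The infimum of the ratio $D(P_X\|Q_X)/D(P_Y\|Q_Y)$ is approached only as $P_X \to Q_X$, which is why the inequality is tight but never violated on the grid.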
2.5.3 Loomis-Whitney Inequality and Shearer’s Lemma
The duality between Loomis-Whitney Inequality and Shearer’s Lemma is yet another
special case of Theorem 2.2.3. This is already contained in the duality theorem of
Carlen and Cordero-Erausquin [76], but we briefly discuss it here.
The combinatorial Loomis-Whitney inequality [134, Theorem 2] says that if $A$ is a subset of $\mathcal{A}^m$, where $\mathcal{A}$ is a finite or countably infinite set, then
$$
|A| \le \prod_{j=1}^m |\pi_j(A)|^{\frac{1}{m-1}} \tag{2.152}
$$
where we defined the projection
$$
\pi_j \colon \mathcal{A}^m \to \mathcal{A}^{m-1}, \tag{2.153}
$$
$$
(x_1,\dots,x_m) \mapsto (x_1,\dots,x_{j-1},x_{j+1},\dots,x_m), \tag{2.154}
$$
for each $j \in \{1,\dots,m\}$. The combinatorial inequality (2.152) can be recovered from the following integral inequality: let $\mu$ be the counting measure on $\mathcal{A}^m$; then
$$
\int_{\mathcal{A}^m} \prod_{j=1}^m f_j(\pi_j(x))\, d\mu(x) \le \prod_{j=1}^m \|f_j\|_{m-1} \tag{2.155}
$$
for all nonnegative $f_j$'s, where the norm on the right side is with respect to the counting measure on $\mathcal{A}^{m-1}$. We claim that (2.155) is an inequality of the form (2.14). To see how (2.155) recovers (2.152), let $f_j$ be the indicator function of $\pi_j(A)$ for each $j$. Then the left side of (2.155) upper-bounds the left side of (2.152), while the right side of (2.155) is equal to the right side of (2.152). Now we invoke Remark 2.2.3 with $\mathcal{X} \leftarrow \mathcal{A}^m$, $\nu$ and $\mu_j$ being the counting measures on $\mathcal{A}^m$ and $\mathcal{A}^{m-1}$, respectively, $Q_{Y_j|X}$ being the projection mappings in (2.153)-(2.154), and let $(X_1,\dots,X_m)$ be distributed according to a given $P_X$, to obtain
$$
\begin{aligned}
-H(X_1,\dots,X_m) &= D(P_X\|\nu) &&(2.156)\\
&\ge \sum_{j=1}^m \frac{1}{m-1} D(P_{Y_j}\|\mu_j) &&(2.157)\\
&\ge -\sum_{j=1}^m \frac{1}{m-1} H(X_1,\dots,X_{j-1},X_{j+1},\dots,X_m) &&(2.158)
\end{aligned}
$$
where $H(\cdot)$ is the Shannon entropy. This is a special case of Shearer's Lemma [135][112].

^{16} Another difference is that [27, Theorem 1] involves an auxiliary r.v. $U$ with $|\mathcal{U}| \le 3$, where the cardinality bound comes from convexifying a subset in $\mathbb{R}^2$. Here (2.151) holds whenever (2.143) does, and (2.143) is slightly simpler in that it involves only relative entropy terms, because we are essentially working with the supporting lines of the convex hull, and the supporting line of a set is the same as the supporting line of its convex hull.
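The combinatorial bound (2.152) can be tested directly. A minimal sketch (plain Python standard library; the random subset and the box example are our choice), which also illustrates that (2.152) holds with equality for product sets:

```python
import itertools, math, random

def loomis_whitney_holds(points, m):
    # Check |A| <= prod_j |pi_j(A)|^(1/(m-1)) as in (2.152).
    rhs = 1.0
    for j in range(m):
        proj = {p[:j] + p[j + 1:] for p in points}   # pi_j(A)
        rhs *= len(proj) ** (1.0 / (m - 1))
    return len(points) <= rhs + 1e-9

random.seed(1)
m, n = 3, 6
universe = list(itertools.product(range(n), repeat=m))
A = set(random.sample(universe, 40))
assert loomis_whitney_holds(A, m)

# Equality for a box B1 x B2 x B3: |A| = 24 and the bound is exactly 24.
box = set(itertools.product(range(2), range(3), range(4)))
assert loomis_whitney_holds(box, m)
```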
Similarly, the continuous Loomis-Whitney inequality for the Lebesgue measure, that is,
$$
\int_{\mathbb{R}^m} \prod_{j=1}^m f_j(\pi_j(x))\, dx \le \prod_{j=1}^m \|f_j\|_{m-1} \tag{2.159}
$$
is the dual of a continuous version of Shearer's lemma involving differential entropies:
$$
h(X^m) \le \sum_{j=1}^m \frac{1}{m-1} h(X_1,\dots,X_{j-1},X_{j+1},\dots,X_m). \tag{2.160}
$$
2.5.4 Hypercontractivity

[Figure 2.3: Diagram for hypercontractivity (HC): $\mathcal{P}(\mathcal{Z}_1) \cong \mathcal{P}(\mathcal{Y}_1 \times \mathcal{Y}_2)$, with $T_1^*$ mapping to $\mathcal{P}(\mathcal{Y}_1)$ and $T_2^*$ mapping to $\mathcal{P}(\mathcal{Y}_2)$.]

In Theorem 2.4.5, take $l \leftarrow 1$, $m \leftarrow 2$, $b_1 \leftarrow 1$, $d \leftarrow 0$, $f_1 \leftarrow f_1^{1/c_1}$, $f_2 \leftarrow f_2^{1/c_2}$, $\nu_1 \leftarrow Q_{Y_1 Y_2}$, $\mu_1 \leftarrow Q_{Y_1}$, $\mu_2 \leftarrow Q_{Y_2}$. Also, put $Z_1 = X = (Y_1, Y_2)$, and let $T_1$ and $T_2$ be the canonical maps (Definition 2.4.2). We obtain the equivalence between^{17}
$$
\|f_1\|_{1/c_1} \|f_2\|_{1/c_2} \ge E[f_1(Y_1) f_2(Y_2)], \quad \forall f_1 \in L^{1/c_1}(Q_{Y_1}),\ f_2 \in L^{1/c_2}(Q_{Y_2}) \tag{2.161}
$$
and
$$
\forall P_{Y_1 Y_2}, \quad D(P_{Y_1 Y_2}\|Q_{Y_1 Y_2}) \ge c_1 D(P_{Y_1}\|Q_{Y_1}) + c_2 D(P_{Y_2}\|Q_{Y_2}). \tag{2.162}
$$
This equivalence can also be obtained from Theorem 2.2.3. By Hölder's inequality, (2.161) is equivalent to saying that the norm of the linear operator sending $f_1 \in L^{1/c_1}(Q_{Y_1})$ to $E[f_1(Y_1) \mid Y_2 = \cdot\,] \in L^{1/(1-c_2)}(Q_{Y_2})$ does not exceed 1. The interesting case is $\frac{1}{1-c_2} > \frac{1}{c_1}$, hence the name hypercontractivity. The equivalent formulation of hypercontractivity was shown in [74] using a different proof via the method of types/typicality, which relies on the finite nature of the alphabet. In contrast, the proof based on the nonnegativity of relative entropy removes this constraint, allowing one to prove Nelson's Gaussian hypercontractivity from the information-theoretic formulation (see Section 6.4.2).

^{17} By a standard dense-subspace argument, we see that it is inconsequential that $f_1$ and $f_2$ in (2.161) are not assumed to be continuous.
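The functional side (2.161) can be probed numerically on a doubly symmetric binary source with correlation $\rho$. By Bonami's classical two-point inequality, (2.161) holds in that setting whenever $\rho^2 \le \frac{(1-c_1)(1-c_2)}{c_1 c_2}$; this sufficient condition and the parameter values below are assumptions we bring in, not derived in the text above. A sketch assuming NumPy:

```python
import numpy as np

rho, c1, c2 = 0.5, 0.6, 0.6
assert rho**2 <= (1 - c1) * (1 - c2) / (c1 * c2)   # Bonami's condition

# Doubly symmetric binary source Q_{Y1 Y2}; marginals are uniform.
Q = np.array([[(1 + rho) / 4, (1 - rho) / 4],
              [(1 - rho) / 4, (1 + rho) / 4]])
q1, q2 = Q.sum(axis=1), Q.sum(axis=0)

def lp_norm(f, q, p):
    # ||f||_p with respect to the probability vector q, for f >= 0.
    return float(np.sum(q * f**p)) ** (1 / p)

rng = np.random.default_rng(0)
for _ in range(500):
    f1, f2 = rng.random(2), rng.random(2)          # nonnegative functions
    lhs = float(f1 @ Q @ f2)                       # E[f1(Y1) f2(Y2)]
    rhs = lp_norm(f1, q1, 1 / c1) * lp_norm(f2, q2, 1 / c2)
    assert lhs <= rhs + 1e-12                      # (2.161)
```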
2.5.5 Reverse Hypercontractivity (Positive Parameters)

[Figure 2.4: Diagram for reverse HC: $\mathcal{P}(\mathcal{Y}_1) \cong \mathcal{P}(\mathcal{Z}_1 \times \mathcal{Z}_2)$, with $S_1^*$ mapping to $\mathcal{P}(\mathcal{Z}_1)$ and $S_2^*$ mapping to $\mathcal{P}(\mathcal{Z}_2)$.]

In Theorem 2.4.5, take $l \leftarrow 2$, $m \leftarrow 1$, $c_1 \leftarrow 1$, $d \leftarrow 0$, $g_1 \leftarrow g_1^{1/b_1}$, $g_2 \leftarrow g_2^{1/b_2}$, $\mu_1 \leftarrow Q_{Z_1 Z_2}$, $\nu_1 \leftarrow Q_{Z_1}$, $\nu_2 \leftarrow Q_{Z_2}$. Also, put $Y_1 = X = (Z_1, Z_2)$, and let $S_1$ and $S_2$ be the canonical maps (Definition 2.4.2). Note that in the title, by "positive parameters" we mean that $b_1$ and $b_2$ in (2.164) are positive. We obtain the equivalence between
$$
\|g_1\|_{1/b_1} \|g_2\|_{1/b_2} \le E[g_1(Z_1) g_2(Z_2)], \quad \forall g_1 \in L^{1/b_1}(Q_{Z_1}),\ g_2 \in L^{1/b_2}(Q_{Z_2}) \tag{2.163}
$$
and
$$
\forall P_{Z_1}, P_{Z_2},\ \exists P_{Z_1 Z_2}, \quad D(P_{Z_1 Z_2}\|Q_{Z_1 Z_2}) \le b_1 D(P_{Z_1}\|Q_{Z_1}) + b_2 D(P_{Z_2}\|Q_{Z_2}). \tag{2.164}
$$
Note that the function $f_1$ in (2.97) disappears since its "optimal value" is easily computed in terms of $g_1$ and $g_2$ from (2.96) in this special setting. Moreover, in this set-up, if $\mathcal{Z}_1$ and $\mathcal{Z}_2$ are finite, then the condition in the last bullet in Theorem 2.4.5 is equivalent to $Q_{Z_1 Z_2} \ll Q_{Z_1} \times Q_{Z_2}$. The equivalent formulations of reverse hypercontractivity were observed in [78], where the proof is based on a method of types argument.
2.5.6 Reverse Hypercontractivity (One Negative Parameter)

[Figure 2.5: Diagram for reverse HC with one negative parameter: $\mathcal{P}(\mathcal{Y}_1) \cong \mathcal{P}(\mathcal{Z}_1 \times \mathcal{Y}_2)$, with $S_1^*$ mapping to $\mathcal{P}(\mathcal{Z}_1)$ and $T_2^*$ mapping to $\mathcal{P}(\mathcal{Y}_2)$.]

In Theorem 2.4.5, take $l \leftarrow 1$, $m \leftarrow 2$, $c_1 \leftarrow 1$, $d \leftarrow 0$, $g_1 \leftarrow g^{1/b_1}$, $f_2 \leftarrow f^{-1/c_2}$, $\mu_1 \leftarrow Q_{Z_1 Y_2}$, $\nu_1 \leftarrow Q_{Z_1}$, $\mu_2 \leftarrow Q_{Y_2}$. Also, put $Y_1 = X = (Z_1, Y_2)$, and let $S_1$ and $T_2$ be the canonical maps (Definition 2.4.2). Note that in the title, by "one negative parameter" we mean that $b_1$ is positive and $-c_2$ is negative in (2.166). We obtain the equivalence between
$$
\|f\|_{-1/c_2} \|g\|_{1/b_1} \le E[f(Y_2) g(Z_1)], \quad \forall f \in L^{-1/c_2}(Q_{Y_2}),\ g \in L^{1/b_1}(Q_{Z_1}) \tag{2.165}
$$
and
$$
\forall P_{Z_1},\ \exists P_{Z_1 Y_2}, \quad D(P_{Z_1 Y_2}\|Q_{Z_1 Y_2}) \le b_1 D(P_{Z_1}\|Q_{Z_1}) + (-c_2) D(P_{Y_2}\|Q_{Y_2}). \tag{2.166}
$$
Again, the function $f_1$ in (2.97) does not appear in (2.165) since its "optimal choice"
$$
f_1(z_1, y_2) = \frac{g_1^{b_1}(z_1)}{f_2^{c_2}(y_2)} = f(y_2)\, g(z_1), \quad \forall y_2, z_1 \tag{2.167}
$$
is easily computed in terms of $g_1$ and $f_2$ from (2.96) in this special setting. Inequality (2.165) is called reverse hypercontractivity with a negative parameter in [79], where an equivalent formulation is established for finite alphabets using the method of types. Multiterminal extensions of (2.165) and (2.166) (called reverse Brascamp-Lieb type inequalities with negative parameters in [79]) can also be recovered from Theorem 2.4.5 in the same fashion, i.e., we move all negative parameters to the other side of the inequality so that all parameters become positive.

In summary, from the viewpoint of Theorem 2.4.5, the results in Sections 2.5.4, 2.5.5 and 2.5.6 are degenerate special cases, in the sense that in any of the three cases the optimal choice of one of the functions in (2.97) can be explicitly expressed in terms of the other functions, hence this "hidden function" disappears in (2.161), (2.163) or (2.165).
2.5.7 Transportation-Cost Inequalities
Definition 2.5.1 (see for example [57]) We say that a probability measure $Q$ on a metric space $(\mathcal{Z}, d)$ satisfies the $T_p(\lambda)$ inequality, $p \in [1,\infty)$, $\lambda \in (0,\infty)$, if
$$
\inf_\pi E^{1/p}[d^p(X, Y)] \le \sqrt{2\lambda D(P\|Q)} \tag{2.168}
$$
for every $P \ll Q$, where the infimum is over all couplings $\pi$ of $P$ and $Q$, and $(X, Y) \sim \pi$.

It suffices to focus on the case of $\lambda = 1$, since results for general $\lambda \in (0,\infty)$ can usually be obtained by a scaling argument.
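Definition 2.5.1 can be made concrete in the simplest possible case: on the two-point Hamming space with $Q = \mathrm{Bern}(1/2)$, the $W_1$ distance to $\mathrm{Bern}(p)$ is $|p - 1/2|$, so $T_1(1/4)$ (and hence $T_1(1)$) reduces to Pinsker's inequality. A quick numerical check (assuming NumPy; this example is ours, not from the text):

```python
import numpy as np

def bin_kl(p, q):
    # D(Bern(p) || Bern(q)) in nats, with the convention 0 log 0 = 0.
    out = 0.0
    for a, b in ((p, q), (1 - p, 1 - q)):
        if a > 0:
            out += a * np.log(a / b)
    return out

q = 0.5
for p in np.linspace(0.0, 1.0, 101):
    w1 = abs(p - q)   # optimal-coupling cost under the Hamming metric
    # T_1(1/4) form of (2.168): W1 <= sqrt(2 * (1/4) * D(P||Q))
    assert w1 <= np.sqrt(2 * 0.25 * bin_kl(p, q)) + 1e-12
```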
As a consequence of Theorem 2.4.7 and Remark 2.4.9, we have

Corollary 2.5.1 Let $(\mathcal{Z}, d)$ be a locally compact, $\sigma$-compact Polish space.

(a) A probability measure $Q$ on $\mathcal{Z}$ satisfies the $T_2(1)$ inequality if and only if for any $f \in C_b(\mathcal{Z})$,
$$
\log Q\left(\exp\left(\inf_{z \in \mathcal{Z}}\left[f(z) + \frac{d^2(\cdot, z)}{2}\right]\right)\right) \le Q(f). \tag{2.169}
$$

(b) A probability measure $Q$ on $\mathcal{Z}$ satisfies the $T_p(1)$ inequality, $p \in [1,2)$, if and only if
$$
\log Q\left(\exp\left(t \inf_{z \in \mathcal{Z}}\left[f(z) + \frac{d^p(\cdot, z)}{p}\right]\right)\right) \le \left(\frac{1}{p} - \frac{1}{2}\right) t^{\frac{2}{2-p}} + t\, Q(f), \quad \forall t \in [0,\infty),\ f \in C_b(\mathcal{Z}). \tag{2.170}
$$
Proof (a) In Theorem 2.4.7, put $l = 2$, $m = 1$, $\mathcal{Z}_1 = \mathcal{Z}_2 \leftarrow \mathcal{Z}$, $\mathcal{X} = \mathcal{Y}_1 \leftarrow \mathcal{Z} \times \mathcal{Z}$, and
$$
\Lambda_1(u) := 2 \log Q\left(\exp\left(\frac{u}{2}\right)\right); \tag{2.171}
$$
$$
\Lambda_2(u) := Q(u); \tag{2.172}
$$
$$
\Theta_1(u) :=
\begin{cases}
0 & u \le d^2;\\
+\infty & \text{otherwise.}
\end{cases} \tag{2.173}
$$
For $\ell \in \mathcal{M}(\mathcal{X})$, we can compute
$$
\Lambda_1^*(\ell) = 2 D(\ell\|Q); \tag{2.174}
$$
$$
\Lambda_2^*(\ell) =
\begin{cases}
0 & \ell = Q;\\
+\infty & \text{otherwise;}
\end{cases} \tag{2.175}
$$
$$
\Theta_1^*(\ell) =
\begin{cases}
\ell(d^2) & \ell \ge 0;\\
+\infty & \text{otherwise.}
\end{cases} \tag{2.176}
$$
We also have $\Lambda_i^*(\ell) = +\infty$ for any $\ell \notin \mathcal{P}(\mathcal{X})$, $i = 1, 2$. Thus by Theorem 2.4.7, (2.169) is equivalent to the following: for any $f_1 \in C_b(\mathcal{Z}_1 \times \mathcal{Z}_2)$, $g_1 \in C_b(\mathcal{Z}_1)$, $g_2 \in C_b(\mathcal{Z}_2)$ such that $g_1 + g_2 \le f_1$, it holds that
$$
\Lambda_1(g_1) + \Lambda_2(g_2) \le \Theta_1(f_1). \tag{2.177}
$$
By the monotonicity of $\Lambda_1$, this is equivalent to
$$
\Lambda_1\left(\inf_z [d^2(\cdot, z) - g_2(z)]\right) + \Lambda_2(g_2) \le 0 \tag{2.178}
$$
for any $g_2 \in C_b(\mathcal{Z})$, which is the same as (2.169).

(b) The proof is similar to Part (a), except that we now pick
$$
\Theta_1(f) :=
\begin{cases}
2^{-\frac{2}{2-p}}\, p^{\frac{p}{2-p}}\, (2-p)\, \sup^{\frac{2}{2-p}}\!\left(\frac{f}{d^p}\right) & \text{if } \sup f \ge 0;\\
0 & \text{otherwise,}
\end{cases} \tag{2.179}
$$
so that for any $\ell \ge 0$,
$$
\Theta_1^*(\ell) = \left[\ell(d^p)\right]^{2/p}. \tag{2.180}
$$
Remark 2.5.1 Actually, the proof of Corollary 2.5.1 does not use the assumption that $d$ is a metric (other than that it is a continuous function which is bounded below). The equivalent formulation of the $T_1$ inequality (a special case of (2.170)) was known to Rachev [136] and Bobkov and Götze [113] (who actually slightly simplified the formula using the fact that $d$ is a metric). The equivalent formulation of the $T_2$ inequality in (2.169) also appeared in [113], and was employed in [96][82] to show a connection to the logarithmic Sobolev inequality. The equivalent formulation of the $T_p$ inequality, $p \in [1,2)$, in (2.170) appeared in [57, Proposition 22.3].

Transportation-cost inequalities have been fruitful in obtaining measure concentration results (since [137][138]). Section 6.4.3 contains further discussions on $T_2$ inequalities in the Gaussian case.
Chapter 3

Smoothing

In Chapter 2 we studied the fundamentals of the functional-entropic duality and used the result to derive a strong converse bound for the omniscient helper common randomness generation problem. The bound is only (first-order) asymptotically tight for vanishing communication rates, because the mutual information optimization arising from the single-letter rate region coincides with the corresponding relative entropy optimization only in that case. In this chapter we introduce a machinery called smoothing to overcome this issue. By perturbing a product distribution by a tiny bit in the total variation distance, we regain the (first-order asymptotic) equivalence between the relative entropy optimization and the mutual information optimization in rather general settings. Analysis of the second-order asymptotics is carried out in the discrete and the Gaussian cases, implying the optimal $O(\sqrt{n})$ second-order converse for the whole rate region of the omniscient-helper problem in those cases.
3.1 The BL Divergence and Its Smooth Version
This section introduces some necessary definitions and notations, and summarizes the
main goal of this chapter.
Definition 3.1.1 Given a nonnegative finite measure $\mu$ on $\mathcal{X}$, nonnegative $\sigma$-finite measures $\nu_1,\dots,\nu_m$ on $\mathcal{Y}_1,\dots,\mathcal{Y}_m$, random transformations $Q_{Y_1|X},\dots,Q_{Y_m|X}$, and $c_1,\dots,c_m \in [0,\infty)$, define the Brascamp-Lieb divergence
$$
d(\mu, (Q_{Y_j|X}), (\nu_j), c^m) := \sup_{P_X} \left\{\sum_{j=1}^m c_j D(P_{Y_j}\|\nu_j) - D(P_X\|\mu)\right\} \tag{3.1}
$$
where the supremum is over $P_X \ll \mu$ such that $D(P_X\|\mu)$ and $D(P_{Y_j}\|\nu_j)$, $j = 1,\dots,m$ are finite, and $P_X \to Q_{Y_j|X} \to P_{Y_j}$.

We remark that the optimization in (3.1) is generally not convex. Moreover, as a convention of this thesis, if the feasible set for a supremum is empty, then we set the supremum equal to $-\infty$.
By Theorem 2.2.3, we have the following alternative expression for the Brascamp-Lieb (BL) divergence:

Proposition 3.1.1 Under the assumptions in Definition 3.1.1,
$$
d(\mu, (Q_{Y_j|X}), (\nu_j), c^m) = \sup_{f_1,\dots,f_m} \left\{\log \int \exp\left(\sum_{j=1}^m E[\log f_j(Y_j) \mid X = x]\right) d\mu(x) - \sum_{j=1}^m \log \|f_j\|_{1/c_j}\right\} \tag{3.2}
$$
where the supremum is over nonnegative functions on $\mathcal{Y}_1,\dots,\mathcal{Y}_m$, and $Y_j \sim Q_{Y_j|X=x}$ conditioned on $X = x$.

The entropic definition (3.1) is closer to the answer (the single-letter formula of a network information theory problem, usually expressed using a linear combination of mutual informations), whereas the functional formula (3.2) is more "operational" in the derivation of bounds.
When the other parameters are fixed, the BL divergence $d(\mu, (Q_{Y_j|X}), (Q_{Y_j}), c^m)$ can be viewed as a function of $\mu$. Loosely speaking, the "smoothing" operation in the title of the chapter can be thought of as the "semicontinuous regularization" of a function on the measure space in the total variation distance. That is, we infimize (or supremize) a function within a neighbourhood of its argument. The rationale for considering such a regularization is that a small perturbation of the distribution in the total variation distance implies only a small change of the observed error probability, but the asymptotic behavior of the value of the function can be vastly different. For example, in cryptography, it has long been known that the collision entropy (Rényi entropy of order 2) bounds the performance of privacy amplification (see e.g. [139][140][141][142]). However, Renner and Wolf [53] showed that the smooth Rényi entropy has the asymptotic behavior of the Shannon entropy (Rényi entropy of order 1). Hence, the correct rate formula for privacy amplification is still expressed in terms of Shannon information quantities.
Now, let us try to apply the same smoothing operation to the BL divergence.
Instead of using the total variation distance, however, we shall consider a very similar,
but mathematically more convenient, quantity:
Definition 3.1.2 For nonnegative measures $\nu$ and $\mu$ on the same measurable space $(\mathcal{X}, \mathcal{F})$ where $\nu(\mathcal{X}) < \infty$, and $\gamma \in [1,\infty)$, the $E_\gamma$ divergence is defined as
$$
E_\gamma(\nu\|\mu) := \sup_{A \in \mathcal{F}} \{\nu(A) - \gamma\,\mu(A)\}. \tag{3.3}
$$
Note that under this definition $E_1(P\|\mu) = \frac{1}{2}|P - \mu|$ if and only if $\mu$ is a probability measure. We will return to some other applications of $E_\gamma$ in Chapter 5.
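For discrete measures the supremum in (3.3) is attained at $A = \{i : \nu_i > \gamma \mu_i\}$, so $E_\gamma$ is a sum of positive parts. A minimal sketch (assuming NumPy; the random instances are ours) also checks the stated relation to total variation at $\gamma = 1$ and the monotonicity in $\gamma$:

```python
import numpy as np

def e_gamma(nu, mu, gamma):
    # (3.3) for discrete measures: optimal A = {i : nu_i > gamma * mu_i}.
    return float(np.maximum(nu - gamma * mu, 0.0).sum())

rng = np.random.default_rng(3)
p = rng.random(6); p /= p.sum()
q = rng.random(6); q /= q.sum()

# gamma = 1 recovers half the total variation when mu is a probability measure.
assert abs(e_gamma(p, q, 1.0) - 0.5 * np.abs(p - q).sum()) < 1e-12

# E_gamma is nonincreasing in gamma.
vals = [e_gamma(p, q, g) for g in (1.0, 1.5, 2.0, 3.0)]
assert all(a >= b for a, b in zip(vals, vals[1:]))
```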
Now let us proceed to define the smooth BL divergence. Since in most of our usage the random transformations $Q_{Y_1|X},\dots,Q_{Y_m|X}$ are fixed, we sometimes omit these random transformations from the argument of the BL divergences, to keep the notation simple.

Definition 3.1.3 For $\delta \in [0,1)$, $Q_X$, $(Q_{Y_j|X})$, $(\nu_j)$ and $c^m \in [0,\infty)^m$, define
$$
d^\delta(Q_X, (\nu_j), c^m) := \inf_{\mu \,:\, E_1(Q_X\|\mu) \le \delta} d(\mu, (\nu_j), c^m) \tag{3.4}
$$
where $d(\cdot)$ is as defined in Definition 3.1.1.
Remark 3.1.1 According to the discussion in the previous chapter, the BL divergence is a generalization of several information measures, including the strong data processing constant, hypercontractivity, and the Rényi divergence. For example, for $\alpha \in (1,\infty)$, the Rényi divergence between two probability measures $P$ and $Q$ on the same alphabet can be expressed in terms of the BL divergence:
$$
D_\alpha(P\|Q) = \frac{\alpha}{\alpha-1}\, d\left(P, Q, \frac{\alpha-1}{\alpha}\right), \tag{3.5}
$$
where the random transformation in the definition of the BL divergence is the identity. Consequently, the smooth Rényi divergence [53] can be expressed in terms of a smooth BL divergence:
$$
D_\alpha^\delta(P\|Q) = \frac{\alpha}{\alpha-1}\, d^\delta\left(P, Q, \frac{\alpha-1}{\alpha}\right) \tag{3.6}
$$
for $\delta \in (0,1)$.
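Relation (3.5) can be verified numerically for finite alphabets: with the identity channel, the supremum in (3.1) is attained by the tilted distribution $r^* \propto P^\alpha Q^{1-\alpha}$ (a standard Lagrangian computation, not spelled out in the text). A sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, n = 2.0, 5
p = rng.random(n); p /= p.sum()
q = rng.random(n); q /= q.sum()
c = (alpha - 1) / alpha

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

def inner(r):
    # the objective c D(r||Q) - D(r||P) inside (3.1), identity channel
    return c * kl(r, q) - kl(r, p)

d_alpha = float(np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1))

r_star = p**alpha * q**(1 - alpha); r_star /= r_star.sum()   # tilted maximizer
assert abs(inner(r_star) - c * d_alpha) < 1e-9               # matches (3.5)

for _ in range(200):                 # no random r beats the tilted one
    r = rng.random(n); r /= r.sum()
    assert inner(r) <= inner(r_star) + 1e-9
```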
Remark 3.1.2 If we restrict $\mu$ in (3.4) to be a probability measure, then all asymptotic results remain the same. However, allowing unnormalized measures avoids the unnecessary step of normalization in the proof, and is in accordance with the literature on smooth Rényi entropy, where such a relaxation generally gives rise to nicer properties and tighter non-asymptotic bounds, cf. [53][143].
We now introduce a quantity which resembles the BL divergence and plays a central role in the characterization of the asymptotic performance of the smooth BL divergence.

Definition 3.1.4 Given $Q_X$, $(Q_{Y_j|X})$, $(\nu_j)$ and $c^m \in (0,\infty)^m$ such that $|D(Q_{Y_j}\|\nu_j)| < \infty$, $j = 1,\dots,m$, define
$$
d^\star(Q_X, (\nu_j), c^m) := \sup_{P_{UX} \,:\, P_X = Q_X} \left[\sum_{j=1}^m c_j D(P_{Y_j|U}\|\nu_j|P_U) - I(U; X)\right], \tag{3.7}
$$
where $(U, X) \sim P_{UX}$, $P_{Y_j|U}$ is the concatenation of $P_{X|U}$ and $Q_{Y_j|X}$, and in the supremum we also assume that $P_{UX}$ is such that each term in (3.7) is well-defined and finite.

Remark 3.1.3 In Definition 3.1.4, it is without loss of generality to impose that $|\mathcal{U}| < \infty$, which can be shown using the fact that the mutual information equals its supremum over finite partitions (see [144] for more details).
In later applications to the common randomness generation problem, we take the reference measures $\nu_j = Q_{Y_j}$, $j = 1,\dots,m$, whereas in other applications we may take $\nu_j$ equal to the equiprobable distribution. Note that in the case of $\nu_j = Q_{Y_j}$, we have
$$
d^\star(Q_X, (Q_{Y_j}), c^m) = \sup_{P_{U|X}} \left\{\sum_{l=1}^m c_l I(U; Y_l) - I(U; X)\right\} \tag{3.8}
$$
which is the quantity appearing in the single-letter region of the common randomness generation problem (Section 2.3.1)! Moreover, as Proposition 2.3.1 shows, we must have $d^\star = d$ if one of them is known to be 0. However, in general such an equivalence may not hold, which is why our converse bound in the previous chapter is not (first-order asymptotically) tight; this is the issue that we try to fix in this chapter.
Notice that from these definitions and the tensorization property of $d(\cdot)$ (Section 6.1), we have
$$
d(Q_X, (Q_{Y_j}), c^m) = \frac{1}{n}\, d(Q_X^{\otimes n}, (Q_{Y_j}^{\otimes n}), c^m) \tag{3.9}
$$
$$
\ge \frac{1}{n}\, d^\delta(Q_X^{\otimes n}, (Q_{Y_j}^{\otimes n}), c^m). \tag{3.10}
$$
The goal of this chapter is to study the asymptotics of $d^\delta(Q_X^{\otimes n}, (Q_{Y_j}^{\otimes n}), c^m)$ and discuss its implications for some coding theorems.
The converse for the omniscient-helper common randomness (CR) generation problem that we will derive using $d^\delta$ implies the general lower bound
$$
d^\delta(Q_X^{\otimes n}, (Q_{Y_j}^{\otimes n}), c^m) \ge n\, d^\star(Q_X, (Q_{Y_j}), c^m) + O(\sqrt{n}) \tag{3.11}
$$
for any $\delta \in (0,1)$, since otherwise the converse bound would contradict the known achievability bound for the CR generation problem.
Upper bounding $d^\delta(Q_X^{\otimes n}, (Q_{Y_j}^{\otimes n}), c^m)$ is the more nontrivial part. We would still expect $d^\star(Q_X, (Q_{Y_j}), c^m)$ to emerge as the asymptotic limit in rather general settings. The results are summarized as follows:

• In the cases of 1) finite $\mathcal{X}$, 2) jointly Gaussian $Q_{XY^m}$, the second-order term is of order $O(\sqrt{n})$:
$$
d^\delta(Q_X^{\otimes n}, (Q_{Y_j}^{\otimes n}), c^m) = n\, d^\star(Q_X, (Q_{Y_j}), c^m) + O(\sqrt{n}). \tag{3.12}
$$

• We have conclusive results on the first-order asymptotics for a rather large class of continuous distributions. More precisely, if $\mathcal{X} = \mathcal{Y}^m$, each $\nu_j$ is an atomless^1 nonnegative $\sigma$-finite measure on a Polish space^2, and a certain relative information quantity is continuous and absolutely integrable, then
$$
\lim_{n\to\infty} \frac{1}{n}\, d^\delta(Q_X^{\otimes n}, (Q_{Y_j}^{\otimes n}), c^m) = d^\star(Q_X, (Q_{Y_j}), c^m). \tag{3.13}
$$

^{1} That is, for any measurable $A \subseteq \mathcal{Y}_j$ with $\nu_{Y_j}(A) > 0$, there exists a measurable $B \subseteq A$ such that $0 < \nu_{Y_j}(B) < \nu_{Y_j}(A)$.
The second-order asymptotic result (3.12) will lead to second-order converses for coding theorems, whereas the first-order result (3.13) will lead to a strong converse result. The proposed approach to strong converses has several advantages compared with existing approaches, such as the method of types approach in [8], which are nicely illustrated by our example of CR generation:

• The approach is applicable to continuous sources satisfying certain regularity conditions, for which the method of types is futile.

• In the omniscient helper CR generation problem, our approach covers possibly stochastic encoders and decoders.^3 Stochastic encoders and decoders are tricky to handle with the method of types or change-of-measure techniques, since they cannot be described by encoding and decoding sets.

Strong converses for a continuous source when the rate region involves auxiliaries are rare [50]. The smooth BL divergence offers a canonical method for this situation.

^{2} A Polish space is a complete separable metric space. It enjoys nice properties that allow us to perform some technical steps smoothly; see e.g. [105][104] for brief introductions.

^{3} The (asymptotic) rate region with stochastic encoders can be strictly larger than with deterministic encoders, since in the former case the CR rate is unbounded whereas in the latter case it is bounded by the entropy of the sources. Regarding the decoders, we argue in Remark 3.3.2 that allowing stochasticity can strictly decrease the (one-shot) error, but within a constant factor.

3.2 Upper Bounds on $d^\delta$

3.2.1 Discrete Case

The main result for the discrete case can be summarized as follows:
Theorem 3.2.1 Fix $Q_X$, $(Q_{Y_j|X})$, $(\nu_j)$ and $c^m \in (0,\infty)^m$. Assume that $\mathcal{X}$ is finite, and
$$
D(P_{Y_j}\|\nu_j) < +\infty, \quad \forall P_X \ll Q_X,\ j = 1,\dots,m; \tag{3.14}
$$
$$
d^\star(Q_X, (\nu_j), c^m) > -\infty. \tag{3.15}
$$
Then
$$
d^\delta(Q_X^{\otimes n}, (\nu_j^{\otimes n}), c^m) \le n\, d^\star(Q_X, (\nu_j), c^m) + O(\sqrt{n}). \tag{3.16}
$$
We prove a non-asymptotic version of Theorem 3.2.1 here. For simplicity we only consider the $m = 1$ case, but the analysis goes through in the general case as well.
Lemma 3.2.2 Consider $Q_X$, $Q_{Y|X}$, $\nu$, $\delta \in (0,1)$, $c \in (0,\infty)$. Define $\beta_X$ and $\alpha_Y$ as in Lemma 3.2.3. Then there exists some $\mathcal{C}_n$ with $Q_X^{\otimes n}[\mathcal{C}_n] \ge 1-\delta$, such that $\mu_n := Q_X^{\otimes n}|_{\mathcal{C}_n}$ satisfies
$$
d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) \le n\, d^\star(Q_X, Q_{Y|X}, \nu, c) + \ln(\alpha_Y^c \beta_X^{c+1}) \sqrt{3 n \beta_X \ln\frac{|\mathcal{X}|}{\delta}} \tag{3.17}
$$
for all $n > 3\beta_X \ln\frac{|\mathcal{X}|}{\delta}$.
The intuition behind the second-order estimate in Lemma 3.2.2 is not hard to comprehend: we can choose
$$
\mathcal{C}_n = \{x^n : |\hat{P}_{x^n} - Q_X| = O(n^{-1/2})\} \tag{3.18}
$$
where $\hat{P}_{x^n}$ denotes the empirical measure of $x^n$, to guarantee that $Q_X^{\otimes n}(\mathcal{C}_n) \ge 1-\delta$ (in Lemma 3.2.2 a slightly more sophisticated $\mathcal{C}_n$ is chosen). A simple single-letterization argument shows that $\frac{1}{n} d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c)$ can be bounded using only the empirical distributions of the sequences on the support of $\mu_n$, so that the $O(n^{-1/2})$ term in (3.18) contributes a second-order correction term in (3.61).
Proof Let $\hat{P}_{X^n}$ denote the empirical distribution of $X^n \sim Q_X^{\otimes n}$. For $n > 3\beta_X \ln\frac{|\mathcal{X}|}{\delta}$, define
$$
\varepsilon_n := \sqrt{\frac{3\beta_X}{n} \ln\frac{|\mathcal{X}|}{\delta}} \in (0,1). \tag{3.19}
$$
For each $x$, we have, from the Chernoff bound for Bernoulli random variables (see below),
$$
\Pr[\hat{P}_{X^n}(x) > (1+\varepsilon_n) Q_X(x)] \le e^{-\frac{n}{3} Q_X(x) \varepsilon_n^2}, \tag{3.20}
$$
therefore by the union bound,
$$
\Pr[\hat{P}_{X^n} \le (1+\varepsilon_n) Q_X] \ge 1-\delta. \tag{3.21}
$$
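The Bernoulli Chernoff bound invoked in (3.20), $\Pr[\mathrm{Bin}(n,q) > (1+\varepsilon)nq] \le e^{-nq\varepsilon^2/3}$ for $\varepsilon \in (0,1)$, can be checked against the exact binomial tail. A minimal sketch (Python standard library only; the parameter values are our choice, not from the text):

```python
import math

def binom_tail(n, q, k):
    # Exact P[Bin(n, q) > k].
    return sum(math.comb(n, i) * q**i * (1 - q)**(n - i)
               for i in range(k + 1, n + 1))

n, q, eps = 200, 0.3, 0.2
chernoff = math.exp(-n * q * eps**2 / 3)                  # bound as in (3.20)
exact = binom_tail(n, q, math.floor((1 + eps) * n * q))   # P[hat P > (1+eps) q]
assert exact <= chernoff
```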
Finally, we note that for any $P_{X^n}$ supported on
$$
\mathcal{C}_n := \{x^n : \hat{P}_{x^n} \le (1+\varepsilon_n) Q_X\}, \tag{3.22}
$$
we have
$$
\begin{aligned}
c D(P_{Y^n}\|\nu^{\otimes n}) - D(P_{X^n}\|\mu_n)
&= c D(P_{Y^n}\|\nu^{\otimes n}) - D(P_{X^n}\|Q_X^{\otimes n}) &&(3.23)\\
&= c \sum_{i=1}^n D(P_{Y_i|Y^{i-1}}\|\nu|P_{Y^{i-1}}) - \sum_{i=1}^n D(P_{X_i|X^{i-1}}\|Q_X|P_{X^{i-1}}) &&(3.24)\\
&\le c \sum_{i=1}^n D(P_{Y_i|X^{i-1}}\|\nu|P_{X^{i-1}}) - \sum_{i=1}^n D(P_{X_i|X^{i-1}}\|Q_X|P_{X^{i-1}}) &&(3.25)\\
&= n\left[c D(P_{Y_I|I X^{I-1}}\|\nu|P_{I X^{I-1}}) - D(P_{X_I|I X^{I-1}}\|Q_X|P_{I X^{I-1}})\right] &&(3.26)\\
&\le n \sup_{P_X : P_X \le (1+\varepsilon_n) Q_X} \varphi(P_X) &&(3.27)
\end{aligned}
$$
where (3.25) follows since $Y_i - X^{i-1} - Y^{i-1}$ under $P$; in (3.26) we defined $I$ to be a random variable equiprobably distributed on $\{1,\dots,n\}$ and independent of everything else; and in (3.27) $\varphi$ is defined in Lemma 3.2.3. Note that (3.23)-(3.27) is essentially the same as the single-letterization property [145, Lemma 9].

Since the probability of the set in (3.22) is lower bounded by (3.21), the result follows by taking the supremum over $P_{X^n} \ll \mu_n$ in (3.23)-(3.27) and using Lemma 3.2.3 below.
The following technical result was used in the proof of the main result. For notational simplicity, logarithms in this lemma are taken to the natural base.

Lemma 3.2.3 Fix $(Q_X, Q_{Y|X}, \nu)$ and $c \in (0,\infty)$. Define $\beta_X := (\min_x Q_X(x))^{-1}$ and
$$\alpha_Y = |\nu|\,\left\|\frac{dQ_Y}{d\nu}\right\|_\infty. \tag{3.28}$$
Let
$$\phi(P_X) := \sup_{P_{U|X}}\big\{cD(P_{Y|U}\|\nu|P_U) - D(P_{X|U}\|Q_X|P_U)\big\}, \tag{3.29}$$
for any $P_X \ll Q_X$. Then
$$\phi(P_X) \le \phi(Q_X) + \ln\big(\beta_X^{c+1}\alpha_Y^c\big)\,\varepsilon \tag{3.30}$$
for any $P_X \le (1+\varepsilon)Q_X$, $\varepsilon \in [0,1)$.
Proof Given an arbitrary $P_X \le (1+\varepsilon)Q_X$, suppose that $P_{U|X}$ is a maximizer for (3.29) (if a maximizer does not exist, the argument still carries through by approximation). Then consider $\bar{P}_{\bar{U}X}$ on $\bar{\mathcal{U}} \times \mathcal{X}$, where $\bar{\mathcal{U}} = \mathcal{U} \cup \{\star\}$, defined as follows:
$$\bar{P}_{\bar{U}} := \frac{1}{1+\varepsilon}P_U + \frac{\varepsilon}{1+\varepsilon}\delta_\star; \tag{3.31}$$
$$\bar{P}_{X|\bar{U}=u} := P_{X|U=u}, \qquad \forall u \in \mathcal{U}; \tag{3.32}$$
$$\bar{P}_{X|\bar{U}=\star} := \frac{1+\varepsilon}{\varepsilon}\left(Q_X - \frac{1}{1+\varepsilon}P_X\right) \tag{3.33}$$
where $\delta_\star$ is the one-point distribution on $\star$ (note that (3.33) is a valid probability distribution precisely because $P_X \le (1+\varepsilon)Q_X$). Then observe that $\bar{P}_X = Q_X$, and
$$D(\bar{P}_{X|\bar{U}}\|Q_X|\bar{P}_{\bar{U}}) = \frac{1}{1+\varepsilon}D(P_{X|U}\|Q_X|P_U) + \frac{\varepsilon}{1+\varepsilon}D\!\left(\frac{1+\varepsilon}{\varepsilon}Q_X - \frac{1}{\varepsilon}P_X \,\middle\|\, Q_X\right) \tag{3.34}$$
and
$$D(\bar{P}_{Y|\bar{U}}\|\nu|\bar{P}_{\bar{U}}) = \frac{1}{1+\varepsilon}D(P_{Y|U}\|\nu|P_U) + \frac{\varepsilon}{1+\varepsilon}D\!\left(\frac{1+\varepsilon}{\varepsilon}Q_Y - \frac{1}{\varepsilon}P_Y \,\middle\|\, \nu\right), \tag{3.35}$$
which imply that
$$D(P_{X|U}\|Q_X|P_U) \ge D(\bar{P}_{X|\bar{U}}\|Q_X|\bar{P}_{\bar{U}}) - \varepsilon D\!\left(\frac{1+\varepsilon}{\varepsilon}Q_X - \frac{1}{\varepsilon}P_X\,\middle\|\,Q_X\right) \tag{3.36}$$
$$\ge D(\bar{P}_{X|\bar{U}}\|Q_X|\bar{P}_{\bar{U}}) - \varepsilon\ln\beta_X; \tag{3.37}$$
$$D(P_{Y|U}\|\nu|P_U) = (1+\varepsilon)D(\bar{P}_{Y|\bar{U}}\|\nu|\bar{P}_{\bar{U}}) - \varepsilon D\!\left(\frac{1+\varepsilon}{\varepsilon}Q_Y - \frac{1}{\varepsilon}P_Y\,\middle\|\,\nu\right) \tag{3.38}$$
$$\le D(\bar{P}_{Y|\bar{U}}\|\nu|\bar{P}_{\bar{U}}) + \varepsilon D(\bar{P}_{Y|\bar{U}}\|\nu|\bar{P}_{\bar{U}}) + \varepsilon\ln|\nu| \tag{3.39}$$
$$= D(\bar{P}_{Y|\bar{U}}\|\nu|\bar{P}_{\bar{U}}) + \varepsilon D(\bar{P}_{Y|\bar{U}}\|Q_Y|\bar{P}_{\bar{U}}) + \varepsilon\mathbb{E}[\imath_{Q_Y\|\nu}(\bar{Y})] + \varepsilon\ln|\nu| \tag{3.40}$$
$$\le D(\bar{P}_{Y|\bar{U}}\|\nu|\bar{P}_{\bar{U}}) + \varepsilon\ln(\beta_X\alpha_Y) \tag{3.41}$$
where $\imath_{Q_Y\|\nu} := \ln\frac{dQ_Y}{d\nu}$, $\bar{Y} \sim \bar{P}_Y = Q_Y$, and the last step used the data processing inequality
$$D(\bar{P}_{Y|\bar{U}}\|Q_Y|\bar{P}_{\bar{U}}) \le D(\bar{P}_{X|\bar{U}}\|Q_X|\bar{P}_{\bar{U}}) \le \ln\beta_X. \tag{3.42}$$
Combining (3.37) and (3.41), and recalling $\bar{P}_X = Q_X$, we obtain
$$\phi(P_X) \le cD(\bar{P}_{Y|\bar{U}}\|\nu|\bar{P}_{\bar{U}}) - D(\bar{P}_{X|\bar{U}}\|Q_X|\bar{P}_{\bar{U}}) + \varepsilon\ln\big(\beta_X^{c+1}\alpha_Y^c\big) \le \phi(Q_X) + \varepsilon\ln\big(\beta_X^{c+1}\alpha_Y^c\big),$$
which is (3.30).
The proof of the main result used the following standard large deviation bound [64]:

Theorem 3.2.4 (Chernoff bound for Bernoulli random variables) Assume that $X_1,\dots,X_n$ are i.i.d. $\mathrm{Bern}(p)$. Then for any $\varepsilon \in (0,\infty)$,
$$P\left[\sum_{i=1}^n X_i \ge (1+\varepsilon)np\right] \le e^{-nD((1+\varepsilon)p\|p)} \tag{3.43}$$
$$\le e^{-\frac{\min\{\varepsilon^2,\varepsilon\}np}{3}}. \tag{3.44}$$
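As a quick numerical sanity check (not part of the original development), one can compare the relaxed bound (3.44) against Monte Carlo estimates of the binomial tail; the function names and parameters below are illustrative choices:

```python
import numpy as np

# Check of the Chernoff bound (3.44):
#   P[ sum X_i >= (1+eps) n p ] <= exp(-min(eps^2, eps) * n * p / 3)
# for i.i.d. Bern(p), via Monte Carlo. Parameters are arbitrary.
def chernoff_bound(n, p, eps):
    return np.exp(-min(eps**2, eps) * n * p / 3.0)

def empirical_tail(n, p, eps, trials=200_000, seed=0):
    rng = np.random.default_rng(seed)
    sums = rng.binomial(n, p, size=trials)   # sum of n Bernoulli(p) draws
    return np.mean(sums >= (1 + eps) * n * p)

n, p = 200, 0.3
for eps in (0.2, 0.5, 1.0):
    assert empirical_tail(n, p, eps) <= chernoff_bound(n, p, eps)
```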
3.2.2 Gaussian Case

The main result for the Gaussian case can be summarized as follows:

Theorem 3.2.5 Fix a Gaussian measure $Q_{\mathbf{X}}$, Gaussian random transformations $(Q_{\mathbf{Y}_j|\mathbf{X}})$⁴, Lebesgue or Gaussian measures $(\nu_j)$, and $c^m \in (0,\infty)^m$. Then
$$d_\delta\big(Q_{\mathbf{X}}^{\otimes n}, (\nu_j^{\otimes n}), c^m\big) \le n\, d^\star(Q_{\mathbf{X}}, (\nu_j), c^m) + O(\sqrt{n}). \tag{3.45}$$

In the following we prove a non-asymptotic version of the theorem. For simplicity we consider the setting where $m = 1$ and $\nu_j$ is the Lebesgue measure, but the analysis carries over to the general case. Such a setting is useful in proving the second-order converse for certain distributed lossy source compression problems with quadratic cost functions. Logarithms in this lemma are taken to the natural base.

Lemma 3.2.6 Let $Q_X = \mathcal{N}(0,\sigma^2)$, $Q_{Y|X=x} = \mathcal{N}(x,1)$, and suppose that $\nu = \lambda$ is the Lebesgue measure. Then for any $c \in (0,\infty)$, $\delta \in (0,1)$ and $n \ge 24\ln\frac{2}{\delta}$,
$$d_\delta\big(Q_X^{\otimes n}, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c\big) \le n\, d^\star(Q_X, Q_{Y|X}, \nu, c) + \sqrt{6n\ln\frac{2}{\delta}}. \tag{3.46}$$
Proof Define two kinds of typical sets:
$$\mathcal{S}^n_{\varepsilon_1} = \left\{x^n : \frac{1}{n}\|x^n\|^2 \le (1+\varepsilon_1)\sigma^2\right\}, \tag{3.47}$$
and
$$\mathcal{T}^n_{\varepsilon_2} = \left\{x^n : \frac{1}{n}\sum_{i=1}^n \imath_{Q_X\|\lambda}(x_i) \le -h(Q_X) + \frac{\varepsilon_2}{2}\right\} \tag{3.48}$$
$$= \left\{x^n : \frac{1}{n}\|x^n\|^2 \ge (1-\varepsilon_2)\sigma^2\right\}. \tag{3.49}$$

⁴By a "Gaussian random transformation" we mean that the output is a linear transformation of the input plus independent Gaussian noise.
If we choose $\varepsilon_i = \frac{A_i}{\sqrt{n}}$, $i = 1,2$, then since
$$\ln\mathbb{E}\left[e^{\lambda\|X^n\|^2/\sigma^2}\right] = \frac{n}{2}\ln\frac{1}{1-2\lambda}, \qquad \forall\lambda \in (-\infty, \tfrac12), \tag{3.50}$$
we have
$$P\left[\sum_{i=1}^n X_i^2 > n(1+\varepsilon_1)\sigma^2\right] \le e^{-n\sup_{\lambda\in(0,\frac12)}\{\frac12\ln(1-2\lambda)+\lambda(1+\varepsilon_1)\}} \tag{3.51}$$
$$= e^{-n(-\frac12\ln(1+\varepsilon_1)+\frac{\varepsilon_1}{2})} \tag{3.52}$$
$$\le e^{-n(\frac{\varepsilon_1^2}{4}-\frac{\varepsilon_1^3}{6})}; \tag{3.53}$$
$$P\left[\sum_{i=1}^n X_i^2 < n(1-\varepsilon_2)\sigma^2\right] \le e^{-n\sup_{\lambda\in(-\infty,0)}\{\frac12\ln(1-2\lambda)+\lambda(1-\varepsilon_2)\}} \tag{3.54}$$
$$= e^{-n(-\frac12\ln(1-\varepsilon_2)-\frac{\varepsilon_2}{2})} \tag{3.55}$$
$$\le e^{-\frac{n\varepsilon_2^2}{4}}. \tag{3.56}$$
Therefore by the union bound,
$$1 - Q_X^{\otimes n}\big[\mathcal{S}^n_{\varepsilon_1}\cap\mathcal{T}^n_{\varepsilon_2}\big] \le e^{-\frac{A_1^2}{4}+\frac{A_1^3}{6\sqrt{n}}} + e^{-\frac{A_2^2}{4}}. \tag{3.57}$$
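For completeness, here is the elementary optimization behind the step from (3.51) to (3.52), which the text leaves implicit:

```latex
\[
g(\lambda) := \tfrac12\ln(1-2\lambda) + \lambda(1+\varepsilon_1),\qquad
g'(\lambda) = -\frac{1}{1-2\lambda} + 1 + \varepsilon_1 = 0
\ \Longrightarrow\
\lambda^\star = \frac{\varepsilon_1}{2(1+\varepsilon_1)} \in \big(0,\tfrac12\big),
\]
\[
\sup_{\lambda\in(0,1/2)} g(\lambda) = g(\lambda^\star)
= \tfrac12\ln\frac{1}{1+\varepsilon_1} + \frac{\varepsilon_1}{2}
= -\tfrac12\ln(1+\varepsilon_1) + \frac{\varepsilon_1}{2}.
\]
```

The analogous computation over $\lambda \in (-\infty,0)$, with $1-\varepsilon_2$ in place of $1+\varepsilon_1$, yields (3.55).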
Now, suppose that we can show the following result: if $\mu_n$ is the restriction of $Q_X^{\otimes n}$ to $\mathcal{S}^n_{\varepsilon_1}\cap\mathcal{T}^n_{\varepsilon_2}$, then
$$d\big(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c\big) \le n\, d^\star(Q_X, Q_{Y|X}, \nu, c) + \sqrt{n}\left(\frac{1-c}{2}A_1 + \frac{A_2}{2}\right). \tag{3.58}$$
Then, taking
$$A_1 = A_2 = \sqrt{6\ln\frac{2}{\delta}} \tag{3.59}$$
we see that for $n \ge 24\ln\frac{2}{\delta}$, the right side of (3.57) is bounded above by $\delta$ and (3.46) holds.
It remains to prove (3.58). Without complicating the notation too much, let us switch to the case of general $m$ for the convenience of future use. Given $(Q_{\mathbf{Y}_j|\mathbf{X}})_{j=1}^m$ and $c^m$, let $\lambda$ and $(\lambda_j)_{j=1}^m$ denote the Lebesgue measures on the Euclidean spaces $\mathcal{X}$ and $(\mathcal{Y}_j)_{j=1}^m$, respectively. Define
$$F(\mathbf{M}) := \sup\Big\{-\sum c_j h(\mathbf{Y}_j|U) + h(\mathbf{X}|U)\Big\} \tag{3.60}$$
$$= \sup\Big\{\sum c_j D(P_{\mathbf{Y}_j|U}\|\lambda_j|P_U) - D(P_{\mathbf{X}|U}\|\lambda|P_U)\Big\} \tag{3.61}$$
where the suprema are over $P_{U\mathbf{X}}$ such that $\Sigma_{\mathbf{X}} \preceq \mathbf{M}$ for the given positive semidefinite matrix $\mathbf{M}$. In Chapter 6 we will discuss results about Gaussian optimality in this type of optimization. In particular, we will be able to show that $F(\mathbf{M})$ equals the supremum in (3.61) restricted to constant $U$ and Gaussian $P_{\mathbf{X}}$, which implies that
$$F(\Sigma) + C = d^\star(Q_{\mathbf{X}}, (\nu_j), c^m) \tag{3.62}$$
for
$$C := -\sum_j c_j\mathbb{E}\big[\imath_{\nu_j\|\lambda_j}(\mathbf{Y}_j)\big] - h(\mathbf{X}) \tag{3.63}$$
where $(\mathbf{Y}_j,\mathbf{X}) \sim Q_{\mathbf{Y}_j|\mathbf{X}}Q_{\mathbf{X}}$. For $\mu_n$ defined as the restriction of $Q_X^{\otimes n}$ to $\mathcal{S}^n_{\varepsilon_1}\cap\mathcal{T}^n_{\varepsilon_2}$ and $P_{X^n} \ll \mu_n$, we have⁵
$$\frac{1}{n}\left[\sum_j c_j D(P_{Y_j{}^n}\|\lambda_j^{\otimes n}) - D(P_{X^n}\|\lambda^{\otimes n})\right] = \frac{1}{n}\sum_{i=1}^n\left[\sum_j c_j D(P_{Y_{ji}|Y_j{}^{i-1}}\|\lambda_j|P_{Y_j{}^{i-1}}) - D(P_{X_i|X^{i-1}}\|\lambda|P_{X^{i-1}})\right] \tag{3.64}$$
$$\le \frac{1}{n}\sum_{i=1}^n\left[\sum_j c_j D(P_{Y_{ji}|X^{i-1}}\|\lambda_j|P_{X^{i-1}}) - D(P_{X_i|X^{i-1}}\|\lambda|P_{X^{i-1}})\right] \tag{3.65}$$
$$\le F\big((1+\varepsilon_1)\Sigma\big). \tag{3.66}$$
Since $P_{X^n}$ is supported on $\mathcal{T}^n_{\varepsilon_2}$, we also have
$$\frac{1}{n}\left[\sum_j c_j D(P_{Y_j{}^n}\|\lambda_j^{\otimes n}) - D(P_{X^n}\|\lambda^{\otimes n})\right] + C \ge \frac{1}{n}\left[\sum_j c_j D(P_{Y_j{}^n}\|\nu_j^{\otimes n}) - D(P_{X^n}\|Q_X^{\otimes n})\right] - \frac{\varepsilon_2}{2}. \tag{3.67}$$
Hence from (3.66)-(3.67) we conclude
$$\frac{1}{n}\left[\sum_j c_j D(P_{Y_j{}^n}\|\nu_j^{\otimes n}) - D(P_{X^n}\|\mu_n)\right] \le F\big((1+\varepsilon_1)\Sigma\big) + C + \frac{\varepsilon_2}{2} \tag{3.68}$$
where we used $D(P_{X^n}\|Q_X^{\otimes n}) = D(P_{X^n}\|\mu_n)$. Also, a simple scaling property of the differential entropy shows that
$$F\big((1+\varepsilon_1)\Sigma\big) = F(\Sigma) + \frac{\ln(1+\varepsilon_1)}{2}\Big(m - \sum c_j\Big) \tag{3.69}$$
$$\le F(\Sigma) + \frac{1}{2\sqrt{n}}\Big(m - \sum c_j\Big)A_1. \tag{3.70}$$

⁵As is clear from the context, in $Y_j{}^n$ the $n$ indexes the blocklength and the $j$ indexes the terminal, so $Y_j{}^n$ is not an abbreviation for $(Y_j,\dots,Y_n)$.
Continuing (3.68), we see
$$\frac{1}{n}\left[\sum_j c_j D(P_{Y_j{}^n}\|\nu_j^{\otimes n}) - D(P_{X^n}\|\mu_n)\right] \le F(\Sigma) + \frac{1}{2\sqrt{n}}\Big(m - \sum c_j\Big)A_1 + C + \frac{A_2}{2\sqrt{n}} \tag{3.71}$$
$$\le d^\star(Q_{\mathbf{X}}, (\nu_j), c^m) + \frac{1}{2\sqrt{n}}\Big(m - \sum c_j\Big)A_1 + \frac{A_2}{2\sqrt{n}} \tag{3.72}$$
where the last step used (3.62). Thus (3.58) is established, as desired.
3.2.3 Continuous Density Case

As alluded to before, the first-order asymptotics of the BL divergence is known when $\mathcal{X} = \mathcal{Y}^m$ and $Y^m$ has a continuous distribution. This result is sufficient for establishing the strong converse of omniscient-helper CR generation with continuous distributions, but the second-order analysis remains elusive at this point.

The main result of this subsection is the following:

Theorem 3.2.7 Assume that $\mathcal{Y}_1,\dots,\mathcal{Y}_m$ are Polish spaces, $\nu_{Y_1},\dots,\nu_{Y_m}$ and $Q_{Y^m}$ are Borel measures on $\mathcal{Y}_1,\dots,\mathcal{Y}_m$ and $\mathcal{Y}^m$, respectively, and $\imath_{Q_{Y^m}\|\nu_{Y^m}}$ is a (finite-valued) continuous function on $\mathcal{Y}^m$, where $\nu_{Y^m} := \nu_{Y_1}\times\cdots\times\nu_{Y_m}$, such that $\mathbb{E}\big[|\imath_{Q_{Y^m}\|\nu_{Y^m}}(Y^m)|\big] < \infty$, where $Y^m \sim Q_{Y^m}$. Then for any $\delta \in (0,1)$ and $c_1,\dots,c_m \in (0,\infty)$,
$$\liminf_{n\to\infty}\frac{1}{n}\, d_\delta\big(Q_{Y^m}^{\otimes n}, (\nu_{Y_j}^{\otimes n}), c^m\big) \le d^\star\big(Q_{Y^m}, (\nu_{Y_j}), c^m\big). \tag{3.73}$$

The proof of this result can be found in [144]. It is technically involved, so we omit it here. The proof idea is outlined as follows: first, we show that if $c_j > 1$ for any $j \in \{1,\dots,m\}$, then the result is trivial since both sides of the desired inequality equal $+\infty$. Next we prove the theorem in the special case of compact $\mathcal{Y}_1,\dots,\mathcal{Y}_m$, using quantization and the result for discrete distributions (Lemma 3.2.3). Finally, we extend the result to general Polish alphabets using the fact that a probability measure on a Polish space can be approximated by its restriction to a compact set.

The regularity condition $\mathbb{E}\big[|\imath_{Q_{Y^m}\|\nu_{Y^m}}(Y^m)|\big] < \infty$ is rather mild. For example, if $\nu_{Y_j}$ is the Lebesgue measure, then the condition is equivalent to assuming that the positive and negative parts of the integral defining the differential entropy of $Y^m$ are both finite. When $\nu_{Y_j} = Q_{Y_j}$ and $m = 2$, the condition translates into the assumption that $I(Y_1;Y_2) < \infty$.
3.3 Application to Common Randomness Generation

The smooth BL divergence can be used to give single-shot converse bounds for problems in multiuser information theory. In this section we focus on the example of omniscient-helper common randomness generation. The application to the discrete and Gaussian Gray-Wyner source coding problems, which we do not discuss here, is similar and can be found in our journal version [144].
3.3.1 Second-order Converse for the Omniscient Helper Problem

The following result can be proved based on Theorem 2.3.2 and the smoothing idea.

Theorem 3.3.1 (one-shot converse for omniscient helper CR generation) Fix $Q_{Y^m}$, $\delta \in [0,1)$, and $c^m \in (0,\infty)^m$ such that $\sum_{j=1}^m c_j > 1$. Let $Q_{K^m}$ be the actual CR distribution in a coding scheme for omniscient helper CR generation, using stochastic encoders and deterministic decoders (or stochastic decoders, if $c_j \le 1$, $j = 1,\dots,m$). Then
$$\frac12|Q_{K^m} - T_{K^m}| \ge 1 - \frac{1}{|\mathcal{K}|} - \frac{\prod_{j=1}^m |\mathcal{W}_j|^{\frac{c_j}{\sum c_i}}}{|\mathcal{K}|^{1-\frac{1}{\sum c_i}}}\exp\left(\frac{d_\delta(Q_{Y^m}, (Q_{Y_j}), c^m)}{\sum c_i}\right) - \delta, \tag{3.74}$$
where
$$T_{K^m}(k^m) := \frac{1}{|\mathcal{K}|}\mathbb{1}\{k_1 = \cdots = k_m\} \tag{3.75}$$
is the target CR distribution and $\mathcal{K}$ and $(\mathcal{W}_j)_{j=1}^m$ denote the CR and message alphabets.
Proof The $\delta = 0$ special case of the result was established in Theorem 2.3.2. Note that the argument can be extended to the case where the source probability distribution is replaced by an unnormalized measure (see Proposition 4.4.5, which holds for unnormalized measures). In other words, suppose $\mu_{Y^m}$ is an unnormalized measure for the source variables, and let $\mu_{K^m}$ denote the corresponding measure for the CR when the given encoders and decoders are applied. Then we have
$$E_1(T_{K^m}\|\mu_{K^m}) \ge 1 - \frac{1}{|\mathcal{K}|} - \frac{\prod_{l=1}^m |\mathcal{W}_l|^{\frac{c_l}{\sum c_i}}}{|\mathcal{K}|^{1-\frac{1}{\sum c_i}}}\exp\left(\frac{d(\mu_{Y^m}, (Q_{Y_j}), c^m)}{\sum c_i}\right). \tag{3.76}$$
Now Theorem 3.3.1 follows using
$$\frac12|Q_{K^m} - T_{K^m}| = E_1(T_{K^m}\|Q_{K^m}) \tag{3.77}$$
$$\ge E_1(T_{K^m}\|\mu_{K^m}) - E_1(Q_{K^m}\|\mu_{K^m}) \tag{3.78}$$
$$\ge E_1(T_{K^m}\|\mu_{K^m}) - E_1(Q_{Y^m}\|\mu_{Y^m}). \tag{3.79}$$
Remark 3.3.1 Let $Q_{KK^m}$ and
$$T_{KK^m}(k, k^m) := \frac{1}{|\mathcal{K}|}\mathbb{1}\{k = k_1 = \cdots = k_m\} \tag{3.80}$$
denote the actual and the target distributions of the CR generated by $\mathsf{T}_0, \mathsf{T}_1,\dots,\mathsf{T}_m$, respectively. Since
$$|Q_{KK^m} - T_{KK^m}| \ge |Q_{K^m} - T_{K^m}|, \tag{3.81}$$
Theorem 3.3.1 also provides a lower bound on $|Q_{KK^m} - T_{KK^m}|$. Actually, if the decoders are deterministic, $\mathsf{T}_0$ can always produce $K$ such that the two total variations are equal, because $\mathsf{T}_0$ is aware of the CR produced by the other terminals.

Remark 3.3.2 Allowing stochastic decoders can strictly lower $\frac12|Q_{K^m} - T_{K^m}|$: consider the special case where $m = 2$, $X$ and $Y$ are constant, and no messages are sent. Then the minimum $\frac12|Q_{K^m} - T_{K^m}|$ achieved by deterministic decoders is $1 - \frac{1}{|\mathcal{K}|}$. On the other hand, $\mathsf{T}_1$ and $\mathsf{T}_2$ can each independently output an equiprobable integer in $\{1,\dots,\sqrt{|\mathcal{K}|}\}$, achieving $\frac12|Q_{K^m} - T_{K^m}| = 1 - \frac{1}{\sqrt{|\mathcal{K}|}}$. Nevertheless, we can argue that allowing stochastic decoders can reduce the error by at most a factor of 4, as follows: suppose $\frac12|Q_{K^m} - T_{K^m}| \le \delta$ for some stochastic decoders; then $\frac12|Q_{KK_1} - T_{KK_1}| \le \delta$ and $Q(K_1 = K_2 = \cdots = K_m) \ge 1-\delta$. We can then remove the stochasticity of the decoders at $\mathsf{T}_2,\dots,\mathsf{T}_m$ while retaining the last two properties. This is possible since each $K_j$, $j = 2,\dots,m$, is independent of all other CR conditioned on the observation of $\mathsf{T}_j$, and we used the fact that the average is no greater than the maximum. Thus $\frac12|Q_{KK^m} - T_{KK^m}| \le 2\delta$ is achievable with deterministic decoders at $\mathsf{T}_2,\dots,\mathsf{T}_m$. Applying a similar argument again we can further remove the stochasticity of the decoder at $\mathsf{T}_1$, at the cost of another factor of 2.
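The two total-variation values in the $m = 2$ example of Remark 3.3.2 can be verified by exact computation (an illustrative check, not from the thesis; the helper function name is ours):

```python
from fractions import Fraction
from itertools import product

# Exact computation for m = 2, constant observations, no messages.
def tv_to_target(q, K):
    """(1/2) * sum |q(k1,k2) - T(k1,k2)| with T uniform on the diagonal."""
    total = Fraction(0)
    for k1, k2 in product(range(K), repeat=2):
        t = Fraction(1, K) if k1 == k2 else Fraction(0)
        total += abs(q.get((k1, k2), Fraction(0)) - t)
    return total / 2

K, r = 16, 4  # K = |K|, r = sqrt(|K|)

# Deterministic decoders: both terminals output the same fixed value.
q_det = {(0, 0): Fraction(1)}
assert tv_to_target(q_det, K) == 1 - Fraction(1, K)

# Stochastic decoders: independent uniform outputs on {0, ..., r-1}.
q_sto = {(k1, k2): Fraction(1, r * r) for k1 in range(r) for k2 in range(r)}
assert tv_to_target(q_sto, K) == 1 - Fraction(1, r)
```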
Using Theorem 3.3.1 and the definition of the BL divergence, we obtain the following:

Corollary 3.3.2 (Strong converse for omniscient helper CR generation) Suppose that $Q_{Y^m}$, $(Q_{Y_j|Y^m})$ and $c^m \in (0,\infty)^m$ satisfy
$$\liminf_{n\to\infty}\frac{1}{n}\, d_\delta\big(Q_{Y^m}^{\otimes n}, (Q_{Y_j}^{\otimes n}), c^m\big) \le d^\star\big(Q_{Y^m}, (Q_{Y_j}), c^m\big) \tag{3.82}$$
and that $(R, R_1,\dots,R_m) \in (0,\infty)^{m+1}$ satisfies
$$\left(\sum_j c_j - 1\right)R > d^\star\big(Q_{Y^m}, (Q_{Y_j}), c^m\big) + \sum_j c_j R_j. \tag{3.83}$$
Then, any omniscient helper CR generation scheme for the stationary memoryless source with per-letter distribution $Q_{Y^m}$, allowing stochastic encoders and decoders, at the rates $(R, R_1,\dots,R_m)$ must satisfy
$$\lim_{n\to\infty}\frac12|Q_{K^m} - T_{K^m}| = 1, \tag{3.84}$$
where $Q_{K^m}$ denotes the actual CR distribution and $T_{K^m}$ denotes the ideal distribution under which $K_1 = \cdots = K_m$ is equiprobable.
Note that (3.82) is satisfied in the settings discussed in the previous section. Moreover, in the case of finite alphabets or Gaussian distributions, the second-order upper bounds on the smooth BL divergence in the previous section imply second-order converses for the omniscient helper CR generation problem:

Corollary 3.3.3 (Second-order converse) Assume that the per-letter source distribution $Q_{Y^m}$ is either

1. $m$-dimensional Gaussian with a non-degenerate covariance matrix, or

2. a probability distribution on a finite alphabet.

Suppose that there is a sequence of CR generation schemes such that
$$\liminf_{n\to\infty}\frac12|Q_{K^m_n} - T_{K^m_n}| < 1. \tag{3.85}$$
Then,
$$\left(\sum_j c_j - 1\right)R_n \le \sum_j c_j R_{jn} + d^\star\big(Q_{Y^m}, (Q_{Y_j}), c^m\big) + O\!\left(\frac{1}{\sqrt{n}}\right) \tag{3.86}$$
where
$$R_n := \frac{1}{n}\log|\mathcal{K}|; \tag{3.87}$$
$$R_{jn} := \frac{1}{n}\log|\mathcal{W}_j|, \qquad j = 1,\dots,m \tag{3.88}$$
are the rates at blocklength $n$.
Chapter 4

The Pump

This chapter is the third part of the trilogy on the functional approach to information-theoretic converses (Chapters 2-4), and also one of the highlights of the thesis. We introduce the "pumping-up" argument, where we design an operator $T$ with the property that for any function $f$ taking values in $[0,1]$ whose 1-norm is not too small, the 0-norm of $Tf$ is not too small either. Markov semigroups provide an excellent source of such operators. This method is reminiscent of the philosophy of the classical blowing-up method (concentration of measure in the Hamming space), and can indeed substitute for the parts of strong converse proofs that used the blowing-up argument. However, in contrast to the blowing-up lemma, which yields a sub-optimal $O_\varepsilon(\sqrt{n}\log^{3/2}n)$ second-order term for discrete memoryless systems in the regime of non-vanishing error probability, the proposed method yields an $O\big(\sqrt{n\log\frac{1}{1-\varepsilon}}\big)$ bound on the second-order term which is sharp in both the blocklength $n$ and the error probability $\varepsilon \uparrow 1$. Moreover, the new approach extends beyond stationary memoryless settings: we discuss how it works under the assumptions of bounded probability density, Gaussian distribution, and processes with memory and weak correlation. The optimality and the wide applicability of the new method suggest that it may be the "right" approach to multiuser second-order converses, and perhaps, that the classical blowing-up method works for strong converse purposes simply because the blowing-up operation approximates the optimal semigroup operation.
While Chapters 2 and 3 proved the $O(\sqrt{n})$ second-order converse for the omniscient helper CR generation problem, the "pumping-up" machinery allows us to extend the result to the general one-communicator problem, which is a canonical example of the side-information problems.
4.1 Prelude: Binary Hypothesis Testing

In this section we review the classical blowing-up approach and then introduce the proposed pumping-up approach through the basic example of binary hypothesis testing. Although this is a rather simple setting where many alternative approaches are applicable, it is rich enough to demonstrate many features of the new approach.

A few words about the notation of this chapter: $\mathcal{H}_+(\mathcal{Y})$ denotes the set of nonnegative measurable functions on $\mathcal{Y}$, and $\mathcal{H}_{[0,1]}(\mathcal{Y})$ is the subset of $\mathcal{H}_+(\mathcal{Y})$ with range in $[0,1]$. For a measure $\nu$ and $f \in \mathcal{H}_+(\mathcal{Y})$, we write $\nu(f) := \int f\,d\nu$ and $\|f\|_p^p = \|f\|_{L^p(\nu)}^p = \int|f|^p\,d\nu$, while the measure of a set is denoted as $\nu[\mathcal{A}]$. A random transformation $Q_{Y|X}$, mapping measures on $\mathcal{X}$ to measures on $\mathcal{Y}$, is also viewed as an operator mapping $\mathcal{H}_+(\mathcal{Y})$ to $\mathcal{H}_+(\mathcal{X})$ according to $Q_{Y|X}(f) := \mathbb{E}[f(Y)\,|\,X = \cdot]$, where $(X,Y) \sim Q_XQ_{Y|X}$.
4.1.1 Review of the Blowing-Up Method, and Its Second-order Sub-optimality

Many converses in information theory (most notably, the meta-converse [34]) rely on the analysis of certain binary hypothesis tests (BHT). Consider probability distributions $P$ and $Q$ on $\mathcal{Y}$. Let $f \in \mathcal{H}_{[0,1]}(\mathcal{Y})$ be the probability of deciding the hypothesis $P$ upon observing $y$. Denote by $\pi_{P|Q} = Q(f)$ the probability that $P$ is decided when $Q$ is true, and vice versa for $\pi_{Q|P}$. By the data processing property of the relative entropy,
$$D(P\|Q) \ge d\big(\pi_{Q|P}\,\big\|\,1-\pi_{P|Q}\big) \tag{4.1}$$
$$\ge (1-\pi_{Q|P})\log\frac{1}{\pi_{P|Q}} - h(\pi_{Q|P}), \tag{4.2}$$
where $d(\cdot\|\cdot)$ is the binary relative entropy function on $[0,1]^2$. In the special case of product measures $P \leftarrow P^{\otimes n}$, $Q \leftarrow Q^{\otimes n}$, and $\pi_{Q^{\otimes n}|P^{\otimes n}} \le \varepsilon \in (0,1)$, (4.2) yields
$$\pi_{P^{\otimes n}|Q^{\otimes n}} \ge \exp\left(-\frac{nD(P\|Q)}{1-\varepsilon} - O(1)\right), \tag{4.3}$$
which is not sufficiently powerful to yield the converse part of the Chernoff-Stein lemma [3] (because of the factor $\frac{1}{1-\varepsilon}$).
In the case of deterministic tests ($f = 1_{\mathcal{A}}$ for some $\mathcal{A} \subseteq \mathcal{Y}^n$) we show how to improve (4.3) by means of a remarkable property enjoyed by product measures: a small blowup of a set of nonvanishing probability suffices to increase its probability to nearly 1 [27]. The following "modern version" is due to Marton [63]; see also [48, Lemma 3.6.2].

Lemma 4.1.1 (Blowing-up) Denote the $r$-blowup of $\mathcal{A} \subseteq \mathcal{Y}^n$ by
$$\mathcal{A}^r := \{v^n \in \mathcal{Y}^n : d_n(v^n, \mathcal{A}) \le r\}, \tag{4.4}$$
where $d_n$ is the Hamming distance on $\mathcal{Y}^n$. Then, for any $c > 0$,
$$P^{\otimes n}[\mathcal{A}^r] \ge 1 - e^{-c^2} \tag{4.5}$$
where
$$r = \sqrt{\frac{n}{2}}\left(\sqrt{\ln\frac{1}{P^{\otimes n}[\mathcal{A}]}} + c\right). \tag{4.6}$$
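As an illustrative numerical check (not from the thesis), Lemma 4.1.1 can be tested exactly for a Hamming ball, whose $r$-blowup is again a Hamming ball; all parameters below are arbitrary:

```python
import math

# Exact check of the blowing-up lemma (4.5)-(4.6) for a Hamming-ball set
# under an i.i.d. Bernoulli(theta) measure. For A = {x^n : weight <= w},
# the r-blowup is {x^n : weight <= w + r}.
def binom_cdf(n, theta, w):
    return sum(math.comb(n, k) * theta**k * (1 - theta)**(n - k)
               for k in range(0, min(w, n) + 1))

n, theta, w = 30, 0.5, 12
PA = binom_cdf(n, theta, w)            # P^n[A]
for c in (0.5, 1.0, 2.0):
    r = math.ceil(math.sqrt(n / 2) * (math.sqrt(math.log(1 / PA)) + c))
    assert binom_cdf(n, theta, w + r) >= 1 - math.exp(-c**2)
```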
Moreover, as every element of $\mathcal{A}^r$ is obtained from an element of $\mathcal{A}$ by changing at most $r$ coordinates, a simple counting argument [27, Lemma 5] shows that
$$Q^{\otimes n}[\mathcal{A}^r] \le C^r(r+1)\binom{n}{r}\,Q^{\otimes n}[\mathcal{A}] \tag{4.7}$$
$$= \exp\Big(nh\Big(\frac{r}{n}\Big) + O(r)\Big)\,Q^{\otimes n}[\mathcal{A}] \tag{4.8}$$
for $r \in \Omega(\sqrt{n})\cap o(n)$, where $C = \frac{|\mathcal{Y}|}{\min_y Q(y)}$.
Proposition 4.1.2 Assume $|\mathcal{Y}| < \infty$ and $\pi_{Q^{\otimes n}|P^{\otimes n}} \le \varepsilon \in (0,1)$. Then any deterministic test between $P^{\otimes n}$ and $Q^{\otimes n}$ on $\mathcal{Y}^n$ satisfies¹
$$\pi_{P^{\otimes n}|Q^{\otimes n}} \ge \exp\big(-nD(P\|Q) - O_\varepsilon(\sqrt{n}\log^{3/2}n)\big). \tag{4.9}$$

Proof Fix $\mathcal{A} \subseteq \mathcal{Y}^n$. If, in lieu of (4.1), we use
$$nD(P\|Q) \ge d\big(P^{\otimes n}[\mathcal{A}^r]\,\big\|\,Q^{\otimes n}[\mathcal{A}^r]\big), \tag{4.10}$$
then similarly to (4.3) we obtain that
$$Q^{\otimes n}[\mathcal{A}^r] \ge \exp\left(-\frac{nD(P\|Q)}{P^{\otimes n}[\mathcal{A}^r]} - O(1)\right). \tag{4.11}$$
Now, applying (4.5) and (4.8), and taking the (optimal) $r = \sqrt{\alpha n\log n}$ for any fixed $\alpha \in (\frac14,\infty)$, we have
$$P^{\otimes n}[\mathcal{A}^r] \ge 1 - o_\varepsilon(n^{-\frac12}); \tag{4.12}$$
$$Q^{\otimes n}[\mathcal{A}^r] \le \exp\big(O(\sqrt{n}\log^{3/2}n)\big)\,Q^{\otimes n}[\mathcal{A}] \tag{4.13}$$
$$= \exp\big(O(\sqrt{n}\log^{3/2}n)\big)\,\pi_{P^{\otimes n}|Q^{\otimes n}}. \tag{4.14}$$

¹The subscript in $O_\varepsilon$ indicates that the implicit constant in the big-$O$ notation depends on $\varepsilon$. Later we may omit this subscript when the discussion focuses on the asymptotics in $n$.
99
Then (4.9) follows from (4.11), (4.12) and (4.13).

Although Proposition 4.1.2 is sufficient to obtain the converse part of the Chernoff-Stein lemma, there are far easier and more general methods to accomplish that goal. Moreover, the sublinear term in the exponent of (4.9) is not optimal. Indeed, our aim is to lower bound $Q^{\otimes n}[\mathcal{A}]$ given $P^{\otimes n}[\mathcal{A}] \ge 1-\varepsilon$. By the Neyman-Pearson lemma, the extremal $\mathcal{A}$ is a sublevel set of $\imath_{P^{\otimes n}\|Q^{\otimes n}} := \log\frac{dP^{\otimes n}}{dQ^{\otimes n}}$, which is a sum of i.i.d. random variables under $P^{\otimes n}$. The optimal rate of the second-order term in (4.9) is thus the central limit theorem rate $O(\sqrt{n})$ (see for example the analysis in [146, Lemma 2.15] via the Chebyshev inequality). Such an optimal second-order rate is fundamentally beyond the blowing-up method: simple examples show that neither (4.5) nor (4.8) can be essentially improved, and the choice $r = \Theta(\sqrt{n\log n})$ is optimal. In other classical applications of BUL, such as the settings in Sections 4.2 and 4.3, the suboptimal $O(\sqrt{n}\log^{3/2}n)$ second-order term emerges for a similar reason.

It can be seen that the present setting of BHT is a special case of the change-of-measure problem in Section 4.3 by taking $\mathcal{X}$ to be a singleton. However, in contrast to BHT, the extremal set for the problem in Section 4.3 cannot be so simply characterized, and BUL appears to be the only previous method to achieve the second-order strengthening (to a suboptimal $O(\sqrt{n}\log^{3/2}n)$ order).
4.1.2 The Fundamental Sub-optimality of the Set-Inequality Approaches

Previously we argued that the blowing-up lemma only achieves the $\sqrt{n}\log^{3/2}n$ second-order rate. Here we remark that the optimal $\sqrt{n}$ rate is, in fact, fundamentally beyond the reach of the idea of transforming sets and applying the data processing argument (e.g. (4.1) and (4.10)): suppose, generalizing the blowing-up operation, we apply a certain transformation of the set $\mathcal{A} \mapsto \tilde{\mathcal{A}}$, and then proceed with the same argument as in Proposition 4.1.2. It is not hard to see that, in order to achieve the $\sqrt{n}$ rate, such a set transformation must have the property that for any $\mathcal{A}$ with $P^{\otimes n}[\mathcal{A}]$ bounded away from 0 and 1, say $P^{\otimes n}[\mathcal{A}] \in [\frac13,\frac23]$ for all $n$, we have

• Lower bound:
$$P^{\otimes n}[\tilde{\mathcal{A}}] \ge 1 - O(n^{-\frac12}). \tag{4.15}$$

• Upper bound:
$$Q^{\otimes n}[\tilde{\mathcal{A}}] \le \exp(O(\sqrt{n}))\,Q^{\otimes n}[\mathcal{A}]. \tag{4.16}$$

We remark that for proving a strong converse (i.e. an $o(n)$ second-order rate), it suffices to relax (4.15) and (4.16) to
$$P^{\otimes n}[\tilde{\mathcal{A}}] \ge 1 - o(1); \tag{4.17}$$
$$Q^{\otimes n}[\tilde{\mathcal{A}}] \le \exp(o(n))\,Q^{\otimes n}[\mathcal{A}], \tag{4.18}$$
which are of course satisfied by the blowing-up operation; see (4.12) and (4.13).

Now we will assume that (4.15) and (4.16) indeed hold, and we will reach a contradiction. Upon choosing $\mathcal{A}$ such that $P^{\otimes n}[\mathcal{A}] \in [\frac13,\frac23]$ and $Q^{\otimes n}[\mathcal{A}] = \exp(-nD(P\|Q) + O(\sqrt{n}))$ we get
$$P^{\otimes n}[\tilde{\mathcal{A}}] \ge 1 - O(n^{-\frac12}), \tag{4.19}$$
$$Q^{\otimes n}[\tilde{\mathcal{A}}] \le \exp\big(-nD(P\|Q) + O(\sqrt{n})\big). \tag{4.20}$$
However, under the constraint (4.19), $Q^{\otimes n}[\tilde{\mathcal{A}}]$ is minimized by $\tilde{\mathcal{A}}$ of the form
$$\tilde{\mathcal{A}} := \big\{x^n : \imath_{P^{\otimes n}\|Q^{\otimes n}}(x^n) > nD(P\|Q) - C\sqrt{n\ln n}\big\} \tag{4.21}$$
for some $C > 0$, which can be seen from the Neyman-Pearson lemma and the Gaussian approximation of a normalized i.i.d. sum. However, for any $C' \in (0,C)$,
$$Q^{\otimes n}[\tilde{\mathcal{A}}] \ge P\Big[C'\sqrt{n\ln n} < nD(P\|Q) - \imath_{P^{\otimes n}\|Q^{\otimes n}}(Y^n) < C\sqrt{n\ln n}\Big] \tag{4.22}$$
$$\ge \exp\big(-nD(P\|Q) + C'\sqrt{n\ln n}\big)\cdot P\Big[C'\sqrt{n\ln n} < nD(P\|Q) - \imath_{P^{\otimes n}\|Q^{\otimes n}}(X^n) < C\sqrt{n\ln n}\Big] \tag{4.23}$$
where $X^n \sim P^{\otimes n}$ and $Y^n \sim Q^{\otimes n}$. We can show [147, Lemma 8.1] that the second factor in (4.23) is of the order of $\exp(-O(\log n))$. Then (4.23) contradicts (4.20). In fact, by refining the above analysis we see that the best possible second-order rate that can be obtained from the data processing argument is $\sqrt{n\log n}$.
4.1.3 The New Pumping-up Approach

In this subsection we propose the pumping-up approach, which is a gentler alternative to BUL that achieves the optimal $O(\sqrt{n})$ second-order rate, while retaining the applicability of BUL to multiuser information theory problems. Instead of applying a set transformation $\mathcal{A} \mapsto \tilde{\mathcal{A}}$, we will apply a linear transformation $T$ to a function $f \in \mathcal{H}_{[0,1]}$ (which can be thought of as the indicator function of $\mathcal{A}$, but not necessarily). In contrast to (4.15) and (4.16), which cannot be achieved simultaneously as we have shown, we will find $T$ with the following properties:

• Lower bound: for any $f \in \mathcal{H}_{[0,1]}(\mathcal{Y})$ with, say², $P^{\otimes n}(f) \ge \frac12$,
$$\|Tf\|_{L^0(P^{\otimes n})} \ge \exp(-O(\sqrt{n})). \tag{4.24}$$

²We define $\|f\|_{L^0(P)} := \lim_{q\downarrow 0}\|f\|_{L^q(P)} = \exp(P(\ln f))$.

• Upper bound: for any $f \in \mathcal{H}_{[0,1]}(\mathcal{Y})$,
$$Q^{\otimes n}(Tf) \le \exp(O(\sqrt{n}))\,Q^{\otimes n}(f). \tag{4.25}$$

We now explain how (4.24) and (4.25) can lead to an $O(\sqrt{n})$ second-order converse. Instead of applying the data processing argument as in (4.1) and (4.10), we note that, by the variational formula for the relative entropy (e.g. [48, (2.4.67)])³,
$$D(P\|Q) \ge P(\ln g) - \ln Q(g), \qquad \forall g \in \mathcal{H}_+(\mathcal{Y}). \tag{4.26}$$
Taking $\mathcal{Y} \leftarrow \mathcal{Y}^n$, $P \leftarrow P^{\otimes n}$ and $Q \leftarrow Q^{\otimes n}$ in (4.26), we see that (4.24) and (4.25) lead to
$$Q^{\otimes n}(f) \ge e^{-nD(P\|Q)-O(\sqrt{n})}. \tag{4.27}$$

³While the bases of the information-theoretic quantities, $\log$ and $\exp$, were arbitrary (but consistent) up to here (see the convention in [43]), henceforth they are natural.

Next, we need to find the operator $T$ satisfying (4.24) and (4.25). Markov semigroups appear to be tailor-made for this purpose. We say $(T_t)_{t\ge0}$ is a simple semigroup⁴ with stationary measure $P$ if
$$T_t : \mathcal{H}_+(\mathcal{Y}) \to \mathcal{H}_+(\mathcal{Y}), \qquad f \mapsto e^{-t}f + (1-e^{-t})P(f). \tag{4.28}$$
In the i.i.d. case $P \leftarrow P^{\otimes n}$ we consider their tensor product
$$T_t := \big[e^{-t} + (1-e^{-t})P\big]^{\otimes n}. \tag{4.29}$$

⁴Markov semigroups arise from the study of Markov processes. For example, the semigroup in (4.29) has the following Markov process interpretation: a particle sits at $x^n$ at time $t = 0$. Whenever a Poisson clock of rate $n$ ticks, we equiprobably select one coordinate among $\{1,\dots,n\}$ and replace its value by a random sample according to $P$. Let $P_{x^n,t}$ be the distribution at time $t$. Then $(T_tf)(x^n) := P_{x^n,t}(f)$ for any $f$.
Note that (4.24) implies that $T$ is a positivity-improving operator; that is, it maps a nonnegative, not identically zero function to a strictly positive function. The positivity-improving property of $T_t$ is precisely quantified by the reverse hypercontractivity phenomenon discovered by Borell [148] and recently generalized by Mossel et al. [77].

Theorem 4.1.3 [77] Let $(T_t)_{t\ge0}$ be a simple semigroup (defined in (4.28)) or an arbitrary tensor product of simple semigroups. Then for all $0 < q < p < 1$, $f \in \mathcal{H}_+$ and $t \ge \ln\frac{1-q}{1-p}$,
$$\|T_tf\|_q \ge \|f\|_p. \tag{4.30}$$
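Theorem 4.1.3 can be probed numerically for a single simple semigroup on a small finite space; this is an illustrative sketch (the stationary measure and exponents are arbitrary choices, not from the thesis):

```python
import numpy as np

# Numerical check of the reverse hypercontractive inequality (4.30),
#   ||T_t f||_q >= ||f||_p for 0 < q < p < 1 and t >= ln((1-q)/(1-p)),
# for a simple semigroup on a 3-point space with stationary measure P.
rng = np.random.default_rng(1)
P = np.array([0.2, 0.5, 0.3])

def lp_norm(f, p, P):
    return (P @ f**p) ** (1.0 / p)

def T(f, t, P):
    # simple semigroup (4.28): T_t f = e^{-t} f + (1 - e^{-t}) P(f)
    return np.exp(-t) * f + (1 - np.exp(-t)) * (P @ f)

q, p = 0.3, 0.7
t = np.log((1 - q) / (1 - p))          # the critical time in Theorem 4.1.3
for _ in range(100):
    f = rng.uniform(0.01, 1.0, size=3) # strictly positive test functions
    assert lp_norm(T(f, t, P), q, P) >= lp_norm(f, p, P) - 1e-12
```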
When $T_t$ is defined by (4.29), for any $f \in \mathcal{H}_{[0,1]}(\mathcal{Y})$,
$$P^{\otimes n}(\ln T_tf) = \ln\|T_tf\|_{L^0(P^{\otimes n})} \tag{4.31}$$
$$\ge \ln\|f\|_{L^{1-e^{-t}}(P^{\otimes n})} \tag{4.32}$$
$$\ge \frac{1}{1-e^{-t}}\ln P^{\otimes n}(f) \tag{4.33}$$
$$\ge \left(1+\frac{1}{t}\right)\ln P^{\otimes n}(f), \tag{4.34}$$
where (4.34) follows from $e^t \ge 1+t$. On the other hand, instead of the counting argument (4.8), in the present setting we use
$$Q^{\otimes n}(T_tf) = Q^{\otimes n}\big((e^{-t}+(1-e^{-t})P)^{\otimes n}f\big) \tag{4.35}$$
$$\le Q^{\otimes n}\big((e^{-t}+\alpha(1-e^{-t})Q)^{\otimes n}f\big) \tag{4.36}$$
$$= \big(e^{-t}+\alpha(1-e^{-t})\big)^n\,Q^{\otimes n}(f) \tag{4.37}$$
$$\le e^{(\alpha-1)nt}\,Q^{\otimes n}(f), \tag{4.38}$$
for all $f \in \mathcal{H}_+(\mathcal{Y})$, where $\alpha := \big\|\frac{dP}{dQ}\big\|_\infty \ge 1$.
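The operator bound (4.38) can be verified exactly on a toy product space by tensor contraction (an illustrative sketch with arbitrary parameters):

```python
import numpy as np

# Exact verification of (4.38):
#   Q^{tensor n}(T_t f) <= exp((alpha - 1) n t) Q^{tensor n}(f)
# on {0,1}^n, for the tensorized simple semigroup (4.29).
P = np.array([0.7, 0.3])
Q = np.array([0.4, 0.6])
alpha = np.max(P / Q)                  # || dP/dQ ||_inf
n, t = 3, 0.5
rng = np.random.default_rng(0)

def apply_Tt(f, t):
    """Apply e^{-t} I + (1 - e^{-t}) P coordinate by coordinate."""
    g = f.copy()
    for axis in range(g.ndim):
        mean = np.tensordot(P, g, axes=([0], [axis]))  # P-average of axis
        g = np.exp(-t) * g + (1 - np.exp(-t)) * np.expand_dims(mean, axis)
    return g

def expect_Qn(g):
    """Integrate g against the product measure Q^{tensor n}."""
    for _ in range(g.ndim):
        g = np.tensordot(Q, g, axes=([0], [0]))
    return float(g)

for _ in range(50):
    f = rng.uniform(0, 1, size=(2,) * n)
    assert expect_Qn(apply_Tt(f, t)) <= np.exp((alpha - 1) * n * t) * expect_Qn(f) + 1e-12
```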
Theorem 4.1.4 If $\pi_{Q^{\otimes n}|P^{\otimes n}} \le \varepsilon \in (0,1)$, any (possibly stochastic) test between $P^{\otimes n}$ and $Q^{\otimes n}$ satisfies
$$\pi_{P^{\otimes n}|Q^{\otimes n}} \ge (1-\varepsilon)\exp\left(-nD(P\|Q) - 2\sqrt{n\left\|\frac{dP}{dQ}\right\|_\infty\ln\frac{1}{1-\varepsilon}}\right). \tag{4.39}$$

Proof Take $g \leftarrow T_tf$ in (4.26), with $f(y)$ the probability that the test decides $P$ upon observing $y$. Apply (4.34) and (4.38). The result follows by optimizing over $t > 0$.

Remark 4.1.1 In the setting of Theorem 4.1.4, the likelihood ratio test is optimal by the Neyman-Pearson lemma, and using the central limit theorem it is easy to show (see e.g. [50, Proposition 2.3]) that
$$\pi_{P^{\otimes n}|Q^{\otimes n}} \ge \exp\Big(-nD(P\|Q) + \sqrt{n\operatorname{Var}(\imath_{P\|Q}(X))}\,Q^{-1}(\varepsilon) - o(\sqrt{n})\Big) \tag{4.40}$$
where $X \sim P$ and $Q^{-1}$ denotes the inverse of the Gaussian complementary distribution function; this is tight up to the $\sqrt{n}$ second-order term. Since $Q^{-1}(\varepsilon) \sim -\sqrt{2\ln\frac{1}{1-\varepsilon}}$ as $\varepsilon \uparrow 1$, we see that the second-order term in (4.39) is tight up to a constant factor depending only on $P$ and $Q$ for $\varepsilon$ close to 1.
Remark 4.1.2 It is instructive to compare the blowing-up and semigroup operations. Note that
$$1_{\mathcal{A}^r}(x) = \sup_{|S|\le r}\ \sup_{(z_i)_{i\in S}} 1_{\mathcal{A}}\big((z_i)_{i\in S}, (x_i)_{i\in S^c}\big), \tag{4.41}$$
while expanding (4.29) gives (cf. (4.51))
$$T_t1_{\mathcal{A}}(x) = \mathbb{E}\big[1_{\mathcal{A}}\big((X_i)_{i\in S}, (x_i)_{i\in S^c}\big)\big] \tag{4.42}$$
where the set $S$ is uniformly distributed over sets of size $|S| \sim \mathrm{Binom}(n, 1-e^{-t})$ and $X_i \sim P$ are i.i.d. Thus the semigroup operation can be viewed as an "average" counterpart of blowing-up (with $r \approx n(1-e^{-t})$). In contrast to the maximum, averaging increases the small values of $f$ (positivity-improving) while preserving the total mass $P^{\otimes n}(T_tf) = P^{\otimes n}(f)$, so that $Q^{\otimes n}(T_tf)$ does not increase too much.

Figure 4.1: Schematic comparison of $1_{\mathcal{A}}$, $1_{\mathcal{A}^{nt}}$ and $T_t^{\otimes n}1_{\mathcal{A}}$, plotted against the Hamming weight of $x^n$, where $\mathcal{A}$ is a Hamming ball.
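The subset-resampling representation (4.42) can be checked exactly on a small example by expanding the tensor product (an illustrative sketch with arbitrary parameters):

```python
import numpy as np
from itertools import product, combinations

# Exact check of (4.42): for the tensorized simple semigroup (4.29),
# T_t f(x) = sum over subsets S of e^{-t(n-|S|)}(1-e^{-t})^{|S|} times
# the P-average of f with the coordinates in S resampled.
P = np.array([0.6, 0.4])
n, t = 3, 0.8
a, b = np.exp(-t), 1 - np.exp(-t)

def f(x):                       # arbitrary test function on {0,1}^3
    return float((x[0] ^ x[1]) | x[2])

x = (1, 0, 1)

# direct product-operator evaluation, coordinate by coordinate
g = np.zeros((2, 2, 2))
for idx in product(range(2), repeat=n):
    g[idx] = f(idx)
for axis in range(n):
    mean = np.tensordot(P, g, axes=([0], [axis]))
    g = a * g + b * np.expand_dims(mean, axis)
direct = g[x]

# subset-expansion evaluation
expanded = 0.0
for k in range(n + 1):
    for S in combinations(range(n), k):
        w = a ** (n - k) * b ** k
        e = 0.0
        for zs in product(range(2), repeat=k):
            y, pz = list(x), 1.0
            for i, z in zip(S, zs):
                y[i] = z
                pz *= P[z]
            e += pz * f(tuple(y))
        expanded += w * e

assert abs(direct - expanded) < 1e-12
```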
We note that the new method draws on a philosophy similar to BUL, but enjoys the following advantages:

• It achieves the optimal $O\big(\sqrt{n\ln\frac{1}{1-\varepsilon}}\big)$ second-order rate, which is sharp in both $n$ (as $n\to\infty$) and $\varepsilon$ (as $\varepsilon\to1$);

• It is purely measure-theoretic in nature: finiteness of the alphabet is sufficient, but not necessary, for Theorem 4.1.4. In fact, the boundedness of the Radon-Nikodym derivative is sometimes not necessary if we apply semigroups other than (4.29): see Section 4.2.2 for a Gaussian example. In contrast, while analogues of the blowing-up property (4.5) exist for many other measures, no analogue of the counting estimate (4.8) can hold in continuous settings, as the blow-up of a set of measure zero may have positive measure.

A comparison between the BUL method and the pumping-up method is given in Table 4.1.
                                    BUL approach          Pumping-up approach
 Connecting information measures    Data processing       Convex duality
 and observables                    property
 Lower bound with respect to        Blowing-up lemma      Reverse hypercontractivity
 a given measure
 Upper bound with respect to        A counting            Change of measure (4.36)
 the reference measure              argument (4.8)        (bounded density case);
                                                          scaling argument (4.61)
                                                          (Gaussian case)

Table 4.1: Comparison of the basic ideas in BUL and the new methodology
4.2 Optimal Second-order Fano's Inequality

4.2.1 Bounded Probability Density Case

Consider a random transformation $P_{Y|X}$. For any $x \in \mathcal{X}$, denote by $(T_{x,t})_{t\ge0}$ the simple Markov semigroup
$$T_{x,t} := e^{-t} + (1-e^{-t})P_{Y|X=x}. \tag{4.43}$$
Motivated by the steps (4.35)-(4.38), for $\alpha \in [1,\infty)$, $t \in [0,\infty)$ and a probability measure $\nu$ on $\mathcal{Y}$, define a linear operator $\Lambda^\nu_{\alpha,t} : \mathcal{H}_+(\mathcal{Y}^n) \to \mathcal{H}_+(\mathcal{Y}^n)$ by
$$\Lambda^\nu_{\alpha,t} := \prod_{i=1}^n\big(e^{-t} + \alpha(1-e^{-t})\nu^{(i)}\big) \tag{4.44}$$
where $\nu^{(i)} : \mathcal{H}(\mathcal{Y}^n) \to \mathcal{H}(\mathcal{Y}^n)$ is the linear operator that integrates the $i$-th coordinate with respect to $\nu$. Since $\nu^{(i)}1 = 1$, we see from the binomial formula that
$$\Lambda^\nu_{\alpha,t}1 = \sum_{k=0}^n e^{-(n-k)t}\alpha^k(1-e^{-t})^k\binom{n}{k} \tag{4.45}$$
$$= \big(e^{-t}+\alpha(1-e^{-t})\big)^n \tag{4.46}$$
$$\le e^{(\alpha-1)nt}. \tag{4.47}$$
Lemma 4.2.1 Fix $(P_{Y|X}, \nu, t)$. Suppose that
$$\alpha := \sup_x\left\|\frac{dP_{Y|X=x}}{d\nu}\right\|_\infty \in [1,\infty); \tag{4.48}$$
$$T_{x^n,t} := \otimes_{i=1}^n T_{x_i,t}. \tag{4.49}$$
Then for $n \ge 1$ and $f \in \mathcal{H}_+(\mathcal{Y}^n)$,
$$\sup_{x^n} T_{x^n,t}f \le \Lambda^\nu_{\alpha,t}f. \tag{4.50}$$

Proof For any $x^n \in \mathcal{X}^n$, observe that
$$T_{x^n,t}f = \prod_{i=1}^n\big[e^{-t}+(1-e^{-t})P^{(i)}_{Y|X=x_i}\big]f \tag{4.51}$$
$$= \sum_{S\subseteq\{1,\dots,n\}} e^{-|S^c|t}(1-e^{-t})^{|S|}\left(\prod_{i\in S}P^{(i)}_{Y|X=x_i}\right)(f).$$
The result then follows from (4.44) and (4.48).
Theorem 4.2.2 Fix $P_{Y|X}$ and positive integers $n$ and $M$. Suppose (4.48) holds for some probability measure $\nu$ on $\mathcal{Y}$. If there exist $c_1,\dots,c_M \in \mathcal{X}^n$ and disjoint $\mathcal{D}_1,\dots,\mathcal{D}_M \subseteq \mathcal{Y}^n$ such that
$$\prod_{m=1}^M P_{Y^n|X^n=c_m}^{\frac{1}{M}}[\mathcal{D}_m] \ge 1-\varepsilon, \tag{4.52}$$
then
$$I(X^n;Y^n) \ge \ln M - 2\sqrt{(\alpha-1)n\ln\frac{1}{1-\varepsilon}} - \ln\frac{1}{1-\varepsilon}, \tag{4.53}$$
where $X^n$ is equiprobable on $\{c_1,\dots,c_M\}$, $Y^n$ is its output through $P_{Y^n|X^n} := P_{Y|X}^{\otimes n}$, and $\alpha$ is defined in (4.48).
Proof Let $f_m := 1_{\mathcal{D}_m}$, $m = 1,\dots,M$. Fix some $t > 0$ to be optimized later. Observe that
$$I(X^n;Y^n) = \frac{1}{M}\sum_{m=1}^M D(P_{Y^n|X^n=c_m}\|P_{Y^n}) \tag{4.54}$$
$$\ge \frac{1}{M}\sum_{m=1}^M P_{Y^n|X^n=c_m}\big(\ln\Lambda^\nu_{\alpha,t}f_m\big) - \frac{1}{M}\sum_{m=1}^M \ln P_{Y^n}\big(\Lambda^\nu_{\alpha,t}f_m\big) \tag{4.55}$$
where (4.55) is from the variational formula (4.26). We can lower bound the first term of (4.55) by
$$\frac{1}{M}\sum_{m=1}^M \ln\big\|\Lambda^\nu_{\alpha,t}f_m\big\|_{L^0(P_{Y^n|X^n=c_m})} \ge \frac{1}{M}\sum_{m=1}^M \ln\big\|T_{c_m,t}f_m\big\|_{L^0(P_{Y^n|X^n=c_m})} \tag{4.56}$$
$$\ge -\left(\ln\frac{1}{1-\varepsilon}\right)\left(1+\frac{1}{t}\right), \tag{4.57}$$
where (4.56)-(4.57) follow similarly to (4.31)-(4.34). For the second term on the right side of (4.55), using Jensen's inequality and (4.47),
$$-\frac{1}{M}\sum_{m=1}^M \ln P_{Y^n}\big(\Lambda^\nu_{\alpha,t}f_m\big) \ge -\ln P_{Y^n}\left(\frac{1}{M}\sum_{m=1}^M \Lambda^\nu_{\alpha,t}f_m\right) \tag{4.58}$$
$$\ge \ln M - (\alpha-1)nt. \tag{4.59}$$
The result then follows by optimizing $t$.
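The final optimization over $t$, left implicit above, can be carried out as follows:

```latex
\[
I(X^n;Y^n) \ \ge\ \ln M - (\alpha-1)nt - \Big(1+\frac{1}{t}\Big)\ln\frac{1}{1-\varepsilon};
\]
\[
\text{choosing } t = \sqrt{\frac{\ln\frac{1}{1-\varepsilon}}{(\alpha-1)n}}
\ \text{ equalizes the two } t\text{-dependent penalties, giving }
(\alpha-1)nt + \frac{1}{t}\ln\frac{1}{1-\varepsilon}
= 2\sqrt{(\alpha-1)n\ln\frac{1}{1-\varepsilon}},
\]
```

which is exactly the second-order term in (4.53).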
Remark 4.2.1 If $|\mathcal{Y}| < \infty$ then (4.48) holds with $\nu$ equiprobable and $\alpha = |\mathcal{Y}|$. The "geometric average criterion" in (4.52) is weaker than the maximal error criterion but stronger than the average error criterion. Under the average error criterion, one cannot expect a bound like $I(X^n;Y^n) \ge \ln M - o(n)$, since it would contradict [149, Remark 5]. In [150] we introduce the notion of "$\lambda$-decodability", which subsumes the geometric average criterion, the average error criterion, and the maximum error criterion as the $\lambda = 0, 1, -\infty$ special cases, and show that $\lambda = 0$ is the critical value for such a strengthening of the Fano inequality to hold.

Remark 4.2.2 Using the blowing-up lemma, one can get a weaker version of Theorem 4.2.2 in the case of finite alphabets under the maximal error criterion, with the $O(\sqrt{n})$ second-order term replaced by $O(\sqrt{n}\log^{3/2}n)$. This is essentially contained in the analysis in [27], [63]; see also [149, Theorem 6].
4.2.2 Gaussian Case

We now find an analogue of Theorem 4.2.2 for Gaussian channels, i.e., $P_{Y|X=x} = \mathcal{N}(x,1)$. The previous argument does not immediately carry through, due to the bounded density assumption (4.48). To surmount this problem, it will be convenient to replace (4.49) in the Gaussian setting by the Ornstein-Uhlenbeck semigroup with stationary measure $\mathcal{N}(x^n, \mathbf{I}_n)$:
$$T_{x^n,t}f(y^n) := \mathbb{E}\big[f\big(e^{-t}y^n + (1-e^{-t})x^n + \sqrt{1-e^{-2t}}\,V^n\big)\big] \tag{4.60}$$
for $f \in \mathcal{H}_+(\mathbb{R}^n)$, where $V^n \sim \mathcal{N}(0^n, \mathbf{I}_n)$. In this setting, (4.30) holds under the even weaker assumption $t \ge \frac12\ln\frac{1-q}{1-p}$ [148].

Figure 4.2: Illustration of the action of $T_{x^n,t}$. The original function (an indicator function) is convolved with a Gaussian measure and then dilated (with center $x^n$).
The proof proceeds a little differently from the discrete case. Here, the analogue of $\Lambda^\nu_{\alpha,t}$ in (4.44) is simply $T_{0^n,t}$. Instead of Lemma 4.2.1, we will exploit a simple change-of-variable formula: for any $f \ge 0$, $t > 0$ and $x^n \in \mathbb{R}^n$, we have
$$P_{Y^n|X^n=x^n}(\ln T_{0^n,t}f) = P_{Y^n|X^n=e^{-t}x^n}(\ln T_{e^{-t}x^n,t}f), \qquad (4.61)$$
which can be verified from the definition in (4.60). For later applications in broadcast channels, we consider a slight extension of the setting of Theorem 4.2.2 to allow stochastic encoders.
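Identity (4.61) can be checked in closed form in dimension $n=1$ on the exponential test function $f(y)=e^{ay}$, for which the Gaussian average in (4.60) is explicit (a sanity-check sketch, not part of the thesis; the numeric values of $x$, $t$, $a$ are arbitrary):

```python
import math

def lnT(x_center, t, a, y):
    """ln of the Ornstein-Uhlenbeck action (4.60) in dimension n=1 on the
    test function f(y) = exp(a*y): the Gaussian average is computed exactly
    via the moment generating function of V ~ N(0,1), so ln T is affine in y."""
    return (a * (math.exp(-t) * y + (1 - math.exp(-t)) * x_center)
            + 0.5 * a**2 * (1 - math.exp(-2 * t)))

def E_lnT_under_channel(x_center, channel_mean, t, a):
    """E_{Y ~ N(channel_mean, 1)}[ln T_{x_center,t} f(Y)] for f = exp(a*.).
    Since ln T is affine in y, the expectation is ln T evaluated at the mean."""
    return lnT(x_center, t, a, channel_mean)

# Verify the change-of-variable identity (4.61):
#   P_{Y|X=x}(ln T_{0,t} f) = P_{Y|X=e^{-t}x}(ln T_{e^{-t}x, t} f)
x, t, a = 1.7, 0.3, 0.8
lhs = E_lnT_under_channel(0.0, x, t, a)
rhs = E_lnT_under_channel(math.exp(-t) * x, math.exp(-t) * x, t, a)
print(abs(lhs - rhs))  # difference is at floating-point rounding level
```

Both sides reduce to $a e^{-t}x + \frac{a^2}{2}(1-e^{-2t})$, confirming the identity for this family of test functions.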
Theorem 4.2.3 Let $P_{Y|X=x} = \mathcal{N}(x,\sigma^2)$. Assume that there exist
$$\varphi\colon \{1,\dots,M\}\times\{1,\dots,L\} \to \mathbb{R}^n, \qquad (4.62)$$
and disjoint sets $(D_w)_{w=1}^{M}$ such that
$$\prod_{w,v} P^{\frac{1}{ML}}_{Y^n|X^n=\varphi(w,v)}[D_w] \ge 1-\varepsilon \qquad (4.63)$$
where $\varepsilon\in(0,1)$. Then
$$I(W;Y^n) \ge \ln M - \sqrt{2n\ln\frac{1}{1-\varepsilon}} - \ln\frac{1}{1-\varepsilon} \qquad (4.64)$$
where $(W,V)$ is equiprobable on $\{1,\dots,M\}\times\{1,\dots,L\}$, $X^n := \varphi(W,V)$, and $Y^n$ is the output from $P_{Y^n|X^n} := P_{Y|X}^{\otimes n}$.
Remark 4.2.3 Under the maximal error probability criterion, one can also derive a second-order strengthening of the Gaussian Fano inequality (4.64) by applying Poincaré's inequality to an information spectrum bound; see [149, Theorem 8].
Proof By the scaling invariance of the bound it suffices to consider the case of $\sigma^2 = 1$. Let $f_w = 1_{D_w}$ for $w\in\{1,\dots,M\}$. Put $\bar X^n = e^t X^n$ and $\bar Y^n$ the corresponding output from the same channel. Note that for each $w$,
$$D(P_{\bar Y^n|W=w}\|P_{\bar Y^n}) \ge \frac{1}{L}\sum_{v=1}^{L} P_{Y^n|X^n=e^t\varphi(w,v)}(\ln T_{0^n,t}f_w) - \ln P_{\bar Y^n}(T_{0^n,t}f_w) \qquad (4.65)$$
$$= \frac{1}{L}\sum_{v=1}^{L} P_{Y^n|X^n=\varphi(w,v)}(\ln T_{\varphi(w,v),t}f_w) - \ln P_{\bar Y^n}(T_{0^n,t}f_w) \qquad (4.66)$$
where the key step (4.66) used (4.61). The summand in the first term of (4.66) can be bounded using reverse hypercontractivity for the Ornstein–Uhlenbeck semigroup (4.60) as
$$P_{Y^n|X^n=\varphi(w,v)}(\ln T_{\varphi(w,v),t}f_w) \ge \frac{1}{1-e^{-2t}}\ln P_{Y^n|X^n=\varphi(w,v)}(f_w);$$
thus from the assumption (4.63),
$$\frac{1}{ML}\sum_{w,v} P_{Y^n|X^n=e^t\varphi(w,v)}(\ln T_{0^n,t}f_w) \ge -\frac{1}{1-e^{-2t}}\ln\frac{1}{1-\varepsilon}.$$
On the other hand, by Jensen's inequality,
$$\frac{1}{M}\sum_{w=1}^{M}\ln P_{\bar Y^n}(T_{0^n,t}f_w) \le \ln P_{\bar Y^n}\left(T_{0^n,t}\frac{1}{M}\sum_{w=1}^{M}f_w\right) \le \ln\frac{1}{M}$$
where the last step uses $\sum_{w=1}^{M} f_w \le 1$. Thus, taking $\frac{1}{M}\sum_{w=1}^{M}$ on both sides of (4.66), we find
$$I(W;\bar Y^n) \ge \ln M - \frac{1}{1-e^{-2t}}\ln\frac{1}{1-\varepsilon}. \qquad (4.67)$$
Moreover, letting $G^n \sim \mathcal{N}(0^n, I_n)$ be independent of $X^n$,
$$h(\bar Y^n) = h(e^t X^n + G^n) \qquad (4.68)$$
$$= h(X^n + e^{-t}G^n) + nt \qquad (4.69)$$
$$\le h(Y^n) + nt, \qquad (4.70)$$
where (4.70) can be seen from the entropy power inequality. On the other hand, for each $w$ we have
$$h(\bar Y^n|W=w) - h(Y^n|W=w) = I(\bar Y^n;X^n|W=w) - I(Y^n;X^n|W=w) \ge 0 \qquad (4.71)$$
where (4.71) can be seen from [151, Theorem 1]. The result follows from (4.67), (4.70), (4.71) and by optimizing $t$.
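The entropy computations (4.68)-(4.70) can be sanity-checked in closed form when the input itself is Gaussian (a toy special case, not the setting of the theorem), since then every differential entropy is of the form $\frac12\ln(2\pi e v)$:

```python
import math

# Closed-form check (Gaussian input, n=1; illustration only, values arbitrary)
# of the identities behind (4.68)-(4.70): with X ~ N(0, s), G ~ N(0, 1)
# independent,
#   h(e^t X + G) = h(X + e^{-t} G) + t,  and  h(X + e^{-t} G) <= h(X + G),
# using h(N(0, v)) = 0.5 * ln(2*pi*e*v).
def h_gauss(v):
    return 0.5 * math.log(2 * math.pi * math.e * v)

s, t = 2.0, 0.4
lhs = h_gauss(math.exp(2 * t) * s + 1)           # h(e^t X + G)
mid = h_gauss(s + math.exp(-2 * t)) + t          # h(X + e^{-t} G) + t
assert abs(lhs - mid) < 1e-12                    # scaling identity (4.68)-(4.69)
assert h_gauss(s + math.exp(-2 * t)) <= h_gauss(s + 1)  # EPI-type step (4.70)
print("entropy identities verified")
```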
4.3 Optimal Second-order Change-of-Measure
We now revisit a change-of-measure problem considered in [27] in proving the strong converse of source coding with side information (cf. Section 4.4.3). For a "combinatorial" formulation, which amounts to assigning an equiprobable distribution on the strongly typical set, see [8, Theorem 15.10]. Variations on such an idea have been applied to the "source and channel networks" in [8, Chapter 16] under the name "image-size characterizations". Given a random transformation $Q_{Y|X}$, measures $\nu$ on $\mathcal{Y}$ and $\mu_n$ on $\mathcal{X}^n$, we want to lower bound the $\nu^{\otimes n}$-measure of $A \subseteq \mathcal{Y}^n$ in terms of the $\mu_n$-measure of its "$\varepsilon$-preimage" under $Q_{Y^n|X^n} := Q_{Y|X}^{\otimes n}$. More precisely, for given $c > 0$, $\varepsilon\in(0,1)$, find an upper bound on
$$\sup_{A}\left\{\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}[A] > 1-\varepsilon] - c\ln\nu^{\otimes n}[A]\right\}.$$
Figure 4.3: A set $A \subseteq \mathcal{Y}$ and its "preimage" $\{x : Q_{Y|X=x}[A] \ge 1-\varepsilon\}$ under $Q_{Y|X}$ in $\mathcal{X}$.
Definition 4.3.1 Fix $\mu$ on $\mathcal{X}$, $\nu$ on $\mathcal{Y}$, and $Q_{Y|X}$. For $c\in(0,\infty)$ and $P_X \to Q_{Y|X} \to P_Y$,
$$d(\mu, Q_{Y|X}, \nu, c) := \sup_{P_X : P_X \ll \mu}\{cD(P_Y\|\nu) - D(P_X\|\mu)\} \qquad (4.72)$$
$$= \sup_{f\in\mathcal{H}_+(\mathcal{Y})}\left\{\ln\mu\left(e^{cQ_{Y|X}(\ln f)}\right) - c\ln\nu(f)\right\}, \qquad (4.73)$$
where (4.73) follows from, e.g., Theorem 2.2.3. In Definitions 4.3.1 and 4.3.2 we adopt the convention $\infty-\infty = -\infty$.
Observe that given $Q_{XY}$, the largest $c > 0$ for which $d(Q_X, Q_{Y|X}, Q_Y, c) = 0$ is the reciprocal of the strong data processing constant; see the references in [80]. If in definition (4.72) for $d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c)$ we choose $P_{X^n}$ to be $\mu_n$ conditioned on $B := \{x^n : Q_{Y^n|X^n=x^n}[A] > 1-\varepsilon\}$, i.e., $P_{X^n}[C] := \frac{\mu_n[B\cap C]}{\mu_n[B]}$ for all $C$, then [27, (19)-(21)] showed that when $|\nu| = 1$,
$$\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}[A] > 1-\varepsilon] - c(1-\varepsilon)\ln\nu^{\otimes n}[A] \le d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) + c\ln 2. \qquad (4.74)$$
We quickly recall the derivation of (4.74) here: define $P_{X^n}$ by
$$P_{X^n}[C] := \frac{\mu_n[B\cap C]}{\mu_n[B]}, \quad \forall C. \qquad (4.75)$$
Then
$$D(P_{X^n}\|\mu_n) = \ln\frac{1}{\mu_n[B]}, \qquad (4.76)$$
$$D(P_{Y^n}\|\nu^{\otimes n}) \ge P_{Y^n}[A]\ln\frac{P_{Y^n}[A]}{\nu^{\otimes n}[A]} + P_{Y^n}[A^c]\ln\frac{P_{Y^n}[A^c]}{\nu^{\otimes n}[A^c]} \qquad (4.77)$$
$$\ge -h(P_{Y^n}[A]) + P_{Y^n}[A]\ln\frac{1}{\nu^{\otimes n}[A]} \qquad (4.78)$$
$$\ge -\ln 2 + (1-\varepsilon)\ln\frac{1}{\nu^{\otimes n}[A]}. \qquad (4.79)$$
Thus
$$d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) \ge cD(P_{Y^n}\|\nu^{\otimes n}) - D(P_{X^n}\|\mu_n) \qquad (4.80)$$
$$\ge -c\ln 2 + c(1-\varepsilon)\ln\frac{1}{\nu^{\otimes n}[A]} - \ln\frac{1}{\mu_n[B]}, \qquad (4.81)$$
which is (4.74).
Note the undesired second $\varepsilon$ in (4.74), which would result in a weak converse. For finite $\mathcal{Y}$, [27] used the blowing-up lemma to strengthen (4.74). For the same reason discussed in Section 4.1, even using modern results on concentration of measure, one can only obtain an $O(\sqrt{n}\ln^{3/2} n)$ second-order term:
$$\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}[A] > 1-\varepsilon] - c\ln\nu^{\otimes n}[A] \le d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) + O(\sqrt{n}\ln^{3/2} n). \qquad (4.82)$$
If $\mu_n = Q_X^{\otimes n}$, then by the tensorization of the BL divergence (Chapter 6),
$$d(Q_X^{\otimes n}, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) = n\,d(Q_X, Q_{Y|X}, \nu, c), \qquad (4.83)$$
we see that the right side of (4.74) grows linearly with slope $d(Q_X, Q_{Y|X}, \nu, c)$, which is larger than desired (i.e., when applied to the source coding problem in Section 4.4.3 it would only result in an outer bound). Luckily, when $|\mathcal{X}| < \infty$, it was noted in [27] that if $\mu_n$ is the restriction of $Q_X^{\otimes n}$ to the $Q_X$-strongly typical set⁵, then the linear growth rate of $d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c)$ is the following (desired) quantity:
Definition 4.3.2 Given $Q_X$, $Q_{Y|X}$, a measure $\nu$ on $\mathcal{Y}$ and $c\in(0,\infty)$, define
$$d_\star(Q_X, Q_{Y|X}, \nu, c) := \sup_{P_{UX} : P_X = Q_X}\left\{cD(P_{Y|U}\|\nu|P_U) - D(P_{X|U}\|Q_X|P_U)\right\}$$
where $P_{UXY} := P_{UX}Q_{Y|X}$, and $D(P_{X|U}\|Q_X|P_U) := \int D(P_{X|U=u}\|Q_X)\,dP_U(u)$ denotes the conditional relative entropy. It follows from Definitions 4.3.1 and 4.3.2 that $d_\star(Q_X, Q_{Y|X}, \nu, c) \le d(Q_X, Q_{Y|X}, \nu, c)$.
For general alphabets, we can extend the idea and let $\mu_n = Q_X^{\otimes n}\big|_{\mathcal{C}_n}$ for some $\mathcal{C}_n$ with $1 - Q_X^{\otimes n}[\mathcal{C}_n] \le \delta$ for some $\delta\in(0,1)$ independent of $n$; this will not affect its information-theoretic applications in the non-vanishing error probability regime. In Chapter 3, we have shown, in the discrete and the Gaussian cases, that we can choose $\mathcal{C}_n$ so that
$$d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) \le n\,d_\star(Q_X, Q_{Y|X}, \nu, c) + O(\sqrt{n}). \qquad (4.84)$$
Using the semigroup method, we can also improve the $O(\sqrt{n}\ln^{3/2} n)$ term in (4.82) to $O(\sqrt{n})$ in the discrete and the Gaussian cases. This combined with (4.84) implies
$$\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}[A] > 1-\varepsilon] - c\ln\nu^{\otimes n}[A] \le n\,d_\star(Q_X, Q_{Y|X}, \nu, c) + O(\sqrt{n}). \qquad (4.85)$$

⁵The restriction $\mu|_{\mathcal{C}}$ of a measure $\mu$ to a set $\mathcal{C}$ is $\mu|_{\mathcal{C}}[D] := \mu[\mathcal{C}\cap D]$.
While the original proof in [27] used a data processing argument, as in (4.2), to get (4.74), the present approach uses the functional inequality (4.73), as in (4.26). In the discrete case we have the following result:
Theorem 4.3.1 Consider $Q_X$ a probability measure on a finite set $\mathcal{X}$, $\nu$ a probability measure on $\mathcal{Y}$, and $Q_{Y|X}$. Let
$$\beta_X := \frac{1}{\min_x Q_X(x)} \in [1,\infty), \qquad (4.86)$$
$$\alpha := \sup_x\left\|\frac{dQ_{Y|X=x}}{d\nu}\right\|_\infty \in [1,\infty). \qquad (4.87)$$
Let $c\in(0,\infty)$, $\eta,\delta\in(0,1)$ and $n > 3\beta_X\ln\frac{|\mathcal{X}|}{\delta}$. We can choose some set $\mathcal{C}_n$ with $Q_X^{\otimes n}[\mathcal{C}_n] \ge 1-\delta$, such that for $\mu_n := Q_X^{\otimes n}\big|_{\mathcal{C}_n}$ we have
$$\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}(f) \ge \eta] - c\ln\nu^{\otimes n}(f) \le n\,d_\star(Q_X, Q_{Y|X}, \nu, c) + A\sqrt{n} + c\ln\frac{1}{\eta} \qquad (4.88)$$
for any $f\in\mathcal{H}_{[0,1]}(\mathcal{Y}^n)$, where
$$A := \ln\left(\alpha^c\beta_X^{c+1}\right)\sqrt{3\beta_X\ln\frac{|\mathcal{X}|}{\delta}} + 2c\sqrt{(\alpha-1)\ln\frac{1}{\eta}}. \qquad (4.89)$$
The proof of Theorem 4.3.1 relies on some ideas in the proof of Theorem 4.2.2; in particular, the properties (4.47) and (4.50) of the operator $\Lambda^\nu_{\alpha,t}$ play a critical role.
Proof We first establish the following claim: for any $n \ge 1$, $f\in\mathcal{H}_{[0,1]}(\mathcal{Y}^n)$ and $\mu_n$ a measure on $\mathcal{X}^n$,
$$\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}(f) \ge \eta] - c\ln\nu^{\otimes n}(f) \le d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) + 2c\sqrt{(\alpha-1)n\ln\frac{1}{\eta}} + c\ln\frac{1}{\eta}, \qquad (4.90)$$
where $Q_{Y^n|X^n} := Q_{Y|X}^{\otimes n}$. Notice that for any $g\in\mathcal{H}_+(\mathcal{Y}^n)$, from the variational formula of the relative entropy (4.26) we have
$$\int \|g\|_{L_0(Q_{Y^n|X^n=x^n})}\,d\mu_n(x^n) = \int e^{Q_{Y|X}^{\otimes n}(\ln g)}\,d\mu_n \qquad (4.91)$$
$$= e^{-D(P_{X^n}\|\mu_n) + \int Q_{Y|X}^{\otimes n}(\ln g)\,dP_{X^n}} \qquad (4.92)$$
$$\le e^{d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) - cD(P_{Y^n}\|\nu^{\otimes n}) + cP_{Y^n}(\ln g^{1/c})} \qquad (4.93)$$
$$\le e^{d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c)}\,\|g\|_{L_{1/c}(\nu^{\otimes n})}, \qquad (4.94)$$
where we defined $P_{X^n}$ via $\frac{dP_{X^n}}{d\mu_n} = e^{Q_{Y|X}^{\otimes n}(\ln g)}\left(\int e^{Q_{Y|X}^{\otimes n}(\ln g)}\,d\mu_n\right)^{-1}$. Note that (4.94) can also be recovered as a particular instance of [80, Theorem 1]. Now take $g = (\Lambda^\nu_{\alpha,t}f)^c$ and observe that by (4.47),
$$\|g\|_{L_{1/c}(\nu^{\otimes n})} \le e^{c(\alpha-1)nt}\left[\nu^{\otimes n}(f)\right]^c. \qquad (4.95)$$
Moreover,
$$\int \|g\|_{L_0(Q_{Y^n|X^n=x^n})}\,d\mu_n(x^n) = \int \left\|\Lambda^\nu_{\alpha,t}f\right\|^c_{L_0(Q_{Y^n|X^n=x^n})}\,d\mu_n(x^n) \qquad (4.96)$$
$$\ge \int \|f\|^c_{L_{1-e^{-t}}(Q_{Y^n|X^n=x^n})}\,d\mu_n(x^n) \qquad (4.97)$$
$$\ge \int Q_{Y^n|X^n=x^n}(f)^{\frac{c}{1-e^{-t}}}\,d\mu_n(x^n) \qquad (4.98)$$
$$\ge \eta^{\frac{c}{1-e^{-t}}}\,\mu_n[Q_{Y^n|X^n}(f) \ge \eta], \qquad (4.99)$$
where (4.97) used Lemma 4.2.1 and Theorem 4.1.3, and (4.98) used the assumption $f \le 1$. Then (4.90) follows by using
$$\frac{1}{1-e^{-t}} \le \frac{1}{t} + 1, \quad \forall t\in(0,\infty), \qquad (4.100)$$
and optimizing over $t > 0$.
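The final optimization over $t$ is elementary AM-GM; the sketch below (with arbitrary, hypothetical parameter values) confirms numerically that after applying (4.100), the bound $c(\alpha-1)nt + c(\frac{1}{t}+1)\ln\frac{1}{\eta}$ is minimized at $t = \sqrt{\ln(1/\eta)/((\alpha-1)n)}$ and equals the $2c\sqrt{(\alpha-1)n\ln\frac{1}{\eta}} + c\ln\frac{1}{\eta}$ term appearing in (4.90):

```python
import math

# Hypothetical parameters (not from the thesis) illustrating the last step of
# the proof of (4.90).
c, alpha, n, eta = 2.0, 1.5, 10_000, 0.1
L = math.log(1 / eta)

def bound(t):
    # the bound after applying (4.100): c(alpha-1)nt + c(1/t + 1) ln(1/eta)
    return c * (alpha - 1) * n * t + c * (1 / t + 1) * L

t_star = math.sqrt(L / ((alpha - 1) * n))                  # AM-GM optimizer
closed_form = 2 * c * math.sqrt((alpha - 1) * n * L) + c * L

grid_min = min(bound(t_star * s) for s in [0.25, 0.5, 0.9, 1.0, 1.1, 2.0, 4.0])
print(grid_min, closed_form)  # the grid minimum matches the closed form
```

The $O(\sqrt{n})$ scaling of this term is exactly what the second-order converses of this chapter exploit.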
Having established (4.90), the rest of the proof follows from the estimate of $d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c)$ in Lemma 3.2.2, noting that the $\alpha_Y$ defined therein satisfies $\alpha_Y \le \alpha$.
We now proceed to find an analogue of Theorem 4.3.1 in the case of Gaussian $Q_X$, $Q_{Y|X}$ and Lebesgue $\nu$. Again we consider the Ornstein–Uhlenbeck semigroup. The proof of the following result uses some ideas already touched upon in Section 4.2.2, in particular the change-of-variable trick in (4.61).
Theorem 4.3.2 Suppose $Q_X = \mathcal{N}(0,\sigma^2)$, $Q_{Y|X=x} = \mathcal{N}(x,1)$, $\nu$ is the Lebesgue measure on $\mathbb{R}$, and $\eta,\delta\in(0,1)$. For $n \ge 24\ln\frac{2}{\delta}$, there exists $\mathcal{C}_n \subseteq \mathbb{R}^n$ with $Q_X^{\otimes n}[\mathcal{C}_n] \ge 1-\delta$ such that for $\mu_n = Q_X^{\otimes n}\big|_{\mathcal{C}_n}$,
$$\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}(f) > \eta] - c\ln\nu^{\otimes n}(f) \le n\,d_\star(Q_X, Q_{Y|X}, \nu, c) + c\sqrt{2n\ln\frac{1}{\eta}} + \sqrt{6n\ln\frac{2}{\delta}} + c\ln\frac{1}{\eta} \qquad (4.101)$$
for any $f\in\mathcal{H}_{[0,1]}(\mathbb{R}^n)$. Note that the value of $\sigma$ does not affect the bound.
Proof We first prove the following claim: for any nonnegative finite measure $\mu_n$ on $\mathbb{R}^n$,
$$\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}(f) > \eta] - c\ln\nu^{\otimes n}(f) \le d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) + c\sqrt{2n\ln\frac{1}{\eta}} + c\ln\frac{1}{\eta} \qquad (4.102)$$
for any $f\in\mathcal{H}_{[0,1]}(\mathbb{R}^n)$.
Let $\bar\mu_n$ be the dilation of $\mu_n$ by the factor $e^t$, that is,
$$\bar\mu_n[e^t C] = \mu_n[C] \qquad (4.103)$$
for any measurable $C$. As before, the functional inequality gives
$$\int e^{Q_{Y|X}^{\otimes n}(\ln g)}\,d\bar\mu_n \le \exp\left(d(\bar\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c)\right)\|g\|_{L_{1/c}(\mathbb{R}^n)}, \quad \forall g \ge 0. \qquad (4.104)$$
Now substitute
$$g = (T_{0^n,t}f)^c. \qquad (4.105)$$
Observe that
$$\|g\|_{L_{1/c}(\mathbb{R}^n)} = \|T_{0^n,t}f\|^c_{L_1(\mathbb{R}^n)} \qquad (4.106)$$
$$= e^{cnt}\|f\|^c_{L_1(\mathbb{R}^n)} \qquad (4.107)$$
$$= e^{cnt}\left[\nu^{\otimes n}(f)\right]^c, \qquad (4.108)$$
where (4.107) can be verified using Fubini's theorem and (4.60).
$$\int e^{Q_{Y|X}^{\otimes n}(\ln g)}\,d\bar\mu_n = \int e^{cQ_{Y^n|X^n=x^n}(\ln T_{0^n,t}f)}\,d\bar\mu_n(x^n) \qquad (4.109)$$
$$= \int e^{cQ_{Y^n|X^n=e^{-t}x^n}(\ln T_{e^{-t}x^n,t}f)}\,d\bar\mu_n(x^n) \qquad (4.110)$$
$$= \int \left\|T_{e^{-t}x^n,t}f\right\|^c_{L_0(Q_{Y^n|X^n=e^{-t}x^n})}\,d\bar\mu_n(x^n) \qquad (4.111)$$
$$\ge \int \|f\|^c_{L_{1-e^{-2t}}(Q_{Y^n|X^n=e^{-t}x^n})}\,d\bar\mu_n(x^n) \qquad (4.112)$$
$$\ge \int Q_{Y^n|X^n=e^{-t}x^n}(f)^{\frac{c}{1-e^{-2t}}}\,d\bar\mu_n(x^n) \qquad (4.113)$$
$$\ge \eta^{\frac{c}{1-e^{-2t}}}\,\bar\mu_n[x^n : Q_{Y^n|X^n=e^{-t}x^n}(f) > \eta] \qquad (4.114)$$
$$= \eta^{\frac{c}{1-e^{-2t}}}\,\mu_n[x^n : Q_{Y^n|X^n=x^n}(f) > \eta], \qquad (4.115)$$
where (4.110) used (4.61); (4.112) is from (4.30); (4.113) used the fact that $f \le 1$; and (4.115) is from (4.103). Combining (4.104), (4.108), and (4.115), we obtain
$$\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}(f) > \eta] - c\ln\nu^{\otimes n}(f) \le d(\bar\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) + \inf_{t>0}\left\{\left(\ln\frac{1}{\eta}\right)\frac{c}{1-e^{-2t}} + cnt\right\} \qquad (4.116)$$
$$\le d(\bar\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) + c\sqrt{2\left(\ln\frac{1}{\eta}\right)n} + c\ln\frac{1}{\eta}, \qquad (4.117)$$
where the last step used (4.100). Using an argument similar to (4.71), we find
$$d(\bar\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) \le d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c). \qquad (4.118)$$
Indeed, by definition
$$d(\bar\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) = \sup_{P_{X^n}}\{cD(P_{Y^n}\|\nu^{\otimes n}) - D(P_{X^n}\|\bar\mu_n)\}. \qquad (4.119)$$
When $P_{X^n}$ and $\mu_n$ are both scaled by $e^t$, $D(P_{X^n}\|\mu_n)$ remains unchanged. On the other hand, $D(P_{Y^n}\|\nu^{\otimes n}) = -h(P_{Y^n})$, so by the same argument as (4.71) we have $D(P_{\bar Y^n}\|\nu^{\otimes n}) \le D(P_{Y^n}\|\nu^{\otimes n})$. This establishes (4.118). Combining (4.117) and (4.118), we have established the claim (4.102). The rest of the proof of Theorem 4.3.2 follows from the bound on $d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c)$ in Lemma 3.2.6.
Remark 4.3.1 If $\nu = \mathcal{N}(0,\sigma^2)$ ($\sigma \ge 1$) instead, then an $O(\sqrt{n})$ second-order bound still holds. Indeed, in this case notice that
$$\|T_{0^n,t}f\|_{L_1(\nu^{\otimes n})} = \mathbb{E}\left[f\left(e^{-t}W^n + \sqrt{1-e^{-2t}}\,V^n\right)\right] \qquad (4.120)$$
$$= \mathbb{E}[f(U^n)] \qquad (4.121)$$
$$= \mathbb{E}\left[f(W^n)\frac{dP_{U^n}}{dP_{W^n}}(W^n)\right] \qquad (4.122)$$
$$\le \mathbb{E}\left[f(W^n)\left(\frac{\sigma^2}{e^{-2t}\sigma^2+1-e^{-2t}}\right)^{\frac{n}{2}}\right] \qquad (4.123)$$
$$\le \mathbb{E}\left[f(W^n)e^{nt}\right] \qquad (4.124)$$
$$= e^{nt}\|f\|_{L_1(\nu^{\otimes n})}, \qquad (4.125)$$
where $V^n \sim \mathcal{N}(0^n, I_n)$ and $W^n \sim \mathcal{N}(0^n, \sigma^2 I_n)$ are independent, and $U^n \sim \mathcal{N}(0^n, (e^{-2t}\sigma^2+1-e^{-2t})I_n)$. Therefore the equality in (4.107) will be replaced by $\le$, so the analysis can still go through. On the other hand, the third term on the right side of (4.101), which is the approximation error for $d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c)$, will be slightly changed, but is still of the order $n^{1/2}$. As for (4.118), notice that
$$D(P_{\bar Y^n}\|\nu^{\otimes n}) - D(P_{Y^n}\|\nu^{\otimes n}) \le \frac{1}{2\sigma^2}\mathbb{E}[\|\bar Y^n\|^2] - \frac{1}{2\sigma^2}\mathbb{E}[\|Y^n\|^2] \qquad (4.126)$$
$$= \frac{1}{2\sigma^2}\mathbb{E}[\|\bar X^n\|^2] - \frac{1}{2\sigma^2}\mathbb{E}[\|X^n\|^2] \qquad (4.127)$$
$$= (e^{2t}-1)\frac{1}{2\sigma^2}\mathbb{E}[\|X^n\|^2]. \qquad (4.128)$$
Therefore if $\mu_n$ (hence $P_{X^n}$) is supported on $\{x^n : \|x^n\| \le O(\sqrt{n})\}$, then
$$d(\bar\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) \le d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) + O(\sqrt{n}) \qquad (4.129)$$
as long as $t = \Theta(n^{-1/2})$.
4.4 Applications
4.4.1 Output Distribution of Good Channel Codes
Consider a stationary memoryless channel $P_{Y|X}$ with capacity $C$. If $|\mathcal{Y}| < \infty$, then by the steps of [149, (64)-(66)] and our Theorem 4.2.2, we conclude that an $(n,M,\varepsilon)$ code under the maximal error criterion with deterministic encoders satisfies the converse bound
$$D(P_{Y^n}\|P^\star_{Y^n}) \le nC - \ln M + 2\sqrt{|\mathcal{Y}|\,n\ln\frac{1}{1-\varepsilon}} + \ln\frac{1}{1-\varepsilon}. \qquad (4.130)$$
This implies that the Burnashev condition
$$\sup_{x,x'}\left\|\frac{dP_{Y|X=x}}{dP_{Y|X=x'}}\right\|_\infty < \infty \qquad (4.131)$$
in [149, Theorem 6] (cf. [48, Theorem 3.6.6]) is not necessary. Using the blowing-up lemma, [149, Theorem 7] bounded $D(P_{Y^n}\|P^\star_{Y^n})$ without requiring (4.131), but with a suboptimal $\sqrt{n}(\ln n)^{3/2}$ second-order term. Also note that in our approach $|\mathcal{Y}| < \infty$ can be weakened to a bounded density assumption (4.48), and the maximal error criterion assumption can be weakened to the geometric average criterion (4.52).
4.4.2 Broadcast Channels
Consider a Gaussian broadcast channel where the SNRs of the two component channels are $S_1, S_2\in(0,\infty)$. Suppose there exists an $(n,M_1,M_2,\varepsilon)$-maximal error code. Using Theorem 4.2.3 and the same steps as in the proof of the weak converse (see e.g. [18, Theorem 5.3]), we immediately obtain
$$\ln M_1 \le n\mathsf{C}(\alpha S_1) + \sqrt{2n\ln\frac{1}{1-\varepsilon}} + \ln\frac{1}{1-\varepsilon}; \qquad (4.132)$$
$$\ln M_2 \le n\mathsf{C}\left(\frac{(1-\alpha)S_2}{\alpha S_2+1}\right) + \sqrt{2n\ln\frac{1}{1-\varepsilon}} + \ln\frac{1}{1-\varepsilon}, \qquad (4.133)$$
for some $\alpha\in[0,1]$, where $\mathsf{C}(t) := \frac12\ln(1+t)$. An alternative proof of the strong converse via the information spectrum method, which does not give simple explicit bounds on the sublinear terms, is given in [152]. Converse bounds under the average error criterion can be obtained by codebook expurgation (e.g. [18, Problem 8.11]).
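For concreteness, the bounds (4.132)-(4.133) are easy to evaluate numerically; the sketch below (blocklength, error probability, and SNRs are hypothetical) traces the per-letter rate caps as the power-split parameter $\alpha$ varies:

```python
import math

# Numeric sketch of the second-order converse (4.132)-(4.133) for the
# Gaussian broadcast channel: for each power split alpha, the per-user log
# code sizes are capped by n*C(.) plus the slack sqrt(2n ln 1/(1-eps)) +
# ln 1/(1-eps), where C(t) = (1/2) ln(1+t).
def C(t):
    return 0.5 * math.log1p(t)

def rate_caps(n, eps, S1, S2, alpha):
    slack = math.sqrt(2 * n * math.log(1 / (1 - eps))) + math.log(1 / (1 - eps))
    lnM1 = n * C(alpha * S1) + slack
    lnM2 = n * C((1 - alpha) * S2 / (alpha * S2 + 1)) + slack
    return lnM1, lnM2

n, eps, S1, S2 = 1000, 0.1, 4.0, 2.0
for alpha in (0.0, 0.5, 1.0):
    lnM1, lnM2 = rate_caps(n, eps, S1, S2, alpha)
    print(alpha, lnM1 / n, lnM2 / n)  # per-letter caps; slack is O(1/sqrt(n))
```

Dividing by $n$ shows the slack contributes $O(1/\sqrt{n})$ to the per-letter rates, which is the content of the second-order converse.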
Next, let us prove an $O(\sqrt{n})$ second-order counterpart for the discrete broadcast channel. Consider a degraded broadcast channel $(P_{Y|X}, P_{Z|X})$ with $|\mathcal{Y}|, |\mathcal{Z}| < \infty$. Suppose that there exists an $(n,M_1,M_2,\varepsilon)$ code under the maximal error criterion. We can generalize Theorem 4.2.2 to allow stochastic encoders as in Theorem 4.2.3, which, combined with the steps in the proof of the weak converse for degraded broadcast channels (e.g. [18, Theorem 5.2]), shows that there exists $P_{UX}$ such that
$$\ln M_1 \le nI(X;Y|U) + 2\sqrt{|\mathcal{Y}|\,n\ln\frac{1}{1-\varepsilon}} + \ln\frac{1}{1-\varepsilon}; \qquad (4.134)$$
$$\ln M_2 \le nI(U;Z) + 2\sqrt{|\mathcal{Z}|\,n\ln\frac{1}{1-\varepsilon}} + \ln\frac{1}{1-\varepsilon}. \qquad (4.135)$$
This improves the best $O(\sqrt{n}\ln^{3/2} n)$ second-order term obtained by the blowing-up argument ([27], [48, Theorem 3.6.4]). Again, with our approach the finite alphabet assumption may be weakened to (4.48). Converse bounds under the average error criterion can be obtained by an expurgation argument (e.g. [18, Problem 8.11]).
4.4.3 Source Coding with Compressed Side Information
Figure 4.4: Source coding with compressed side information (Encoder 1 compresses $X^n$ into $W_1$, Encoder 2 compresses $Y^n$ into $W_2$, and the decoder outputs $\hat Y^n$).
As alluded to in Chapter 1, source coding with compressed side information [49], [27] is a canonical example of the side-information problem, whose second-order converse was open [50]. See Figure 4.4 for the setup. Using Theorem 4.3.1 plus essentially the same arguments as in [27, Theorem 3], we can show the following second-order converse:
Theorem 4.4.1 Consider a stationary memoryless discrete source with per-letter distribution $Q_{XY}$. Let $\varepsilon\in(0,1)$ and $n \ge 3\beta_X\ln\frac{2(1+\varepsilon)|\mathcal{X}|}{1-\varepsilon}$, where $\beta_X$ is defined in (4.86). Suppose that there exist encoders $f\colon\mathcal{X}^n\to\mathcal{W}_1$ and $g\colon\mathcal{Y}^n\to\mathcal{W}_2$ and a decoder $V\colon\mathcal{W}_1\times\mathcal{W}_2\to\mathcal{Y}^n$ such that
$$\mathbb{P}[\hat Y^n \ne Y^n] \le \varepsilon. \qquad (4.136)$$
Then for any $c\in(0,\infty)$,
$$\ln|\mathcal{W}_1| + c\ln|\mathcal{W}_2| \ge n\inf_{U : U-X-Y}\{cH(Y|U) + I(U;X)\} - \sqrt{n}\left(\ln\left(|\mathcal{Y}|^c\beta_X^{c+1}\right)\sqrt{3\beta_X\ln\frac{4|\mathcal{X}|}{1-\varepsilon}} + 2c\sqrt{|\mathcal{Y}|\ln\frac{2}{1-\varepsilon}}\right) - (1+c)\ln\frac{2}{1-\varepsilon}. \qquad (4.137)$$
Proof The first part of the proof parallels the proof of [27, Theorem 3]. For any $w\in\mathcal{W}_1$, define the "correctly decodable set"
$$B_w := \{y^n : y^n = V(w, g(y^n))\}. \qquad (4.138)$$
Then
$$\mathbb{E}\left[Q_{Y^n|X^n}[B_{f(X^n)}|X^n]\right] \ge 1-\varepsilon, \qquad (4.139)$$
where $X^n \sim Q_X^{\otimes n}$. Choose any $\varepsilon'\in(\varepsilon,1)$. By the Markov inequality,
$$Q_X^{\otimes n}[x^n : Q_{Y^n|X^n=x^n}[B_{f(x^n)}] \ge 1-\varepsilon'] \ge 1-\frac{\varepsilon}{\varepsilon'}. \qquad (4.140)$$
Take $\nu$ to be the equiprobable measure on $\mathcal{Y}$. Applying Theorem 4.3.1 with $\delta = \frac{\varepsilon'-\varepsilon}{2\varepsilon'}$ and $\eta = 1-\varepsilon'$, for $n > 3\beta_X\ln\frac{2\varepsilon'|\mathcal{X}|}{\varepsilon'-\varepsilon}$, we find $\mu_n$ such that
$$\mu_n[x^n : Q_{Y^n|X^n=x^n}[B_{f(x^n)}] \ge 1-\varepsilon'] \ge 1-\frac{\varepsilon}{\varepsilon'}-\delta \qquad (4.141)$$
$$= \frac{\varepsilon'-\varepsilon}{2\varepsilon'} \qquad (4.142)$$
and
$$\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}[B_{f(x^n)}] \ge 1-\varepsilon'] - c\ln\nu^{\otimes n}[B_{f(x^n)}] \le n\,d_\star(Q_X, Q_{Y|X}, \nu, c) + A\sqrt{n} + c\ln\frac{1}{\eta}. \qquad (4.143)$$
Since $f$ takes at most $|\mathcal{W}_1|$ possible values, from (4.142), there exists a $w^\star$ such that
$$\mu_n[x^n : Q_{Y^n|X^n=x^n}[B_{w^\star}] \ge 1-\varepsilon'] \ge \frac{\varepsilon'-\varepsilon}{2\varepsilon'}|\mathcal{W}_1|^{-1}. \qquad (4.144)$$
Moreover, since $B_{w^\star} \subseteq \bigcup_{w\in\mathcal{W}_2}\{y^n : y^n = V(w^\star, w)\}$, by the union bound,
$$\nu^{\otimes n}[B_{w^\star}] = |\mathcal{Y}|^{-n}|B_{w^\star}| \le |\mathcal{W}_2|\,|\mathcal{Y}|^{-n}. \qquad (4.145)$$
Comparing (4.143), (4.144) and (4.145), we obtain
$$\ln|\mathcal{W}_1| + c\ln|\mathcal{W}_2| \ge -n\sup_{P_{UX} : P_X=Q_X}\{cD(P_{Y|U}\|\nu|P_U) - I(U;X)\} + nc\ln|\mathcal{Y}| - A\sqrt{n} - c\ln\frac{1}{\eta} + \ln\frac{\varepsilon'-\varepsilon}{2\varepsilon'} \qquad (4.146)$$
$$\ge n\inf_{U : U-X-Y}\{cH(Y|U) + I(U;X)\} - A\sqrt{n} - c\ln\frac{1}{1-\varepsilon'} - \ln\frac{2\varepsilon'}{\varepsilon'-\varepsilon}; \qquad (4.147)$$
(4.137) then follows by taking $\varepsilon' = \frac{1+\varepsilon}{2}$.
Note that the first term on the right side of (4.137) corresponds to the rate region (see e.g. [18, Theorem 10.2]). Using the BUL method [27], on the other hand, one can only bound the second term as $O(\sqrt{n}\ln^{3/2} n)$, which is suboptimal.
A counterpart for Gaussian sources under the quadratic distortion is the following (which is a special case of the Berger–Tung problem), for which we also derive the $O(\sqrt{n})$ second-order converse. The setup is still as in Figure 4.4, except that we seek to approximate $Y^n$ by $\hat Y^n$ in quadratic error, instead of seeking exact reconstruction.
Theorem 4.4.2 Consider a stationary memoryless Gaussian source where the per-letter distribution $Q_{XY}$ is jointly Gaussian with covariance matrix $\begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix}$. Suppose $\varepsilon\in(0,1)$ and $n \ge 24\ln\frac{4(1+\varepsilon)}{1-\varepsilon}$, and that there exist encoders $f\colon\mathcal{X}^n\to\mathcal{W}_1$ and $g\colon\mathcal{Y}^n\to\mathcal{W}_2$ and a decoder $V\colon\mathcal{W}_1\times\mathcal{W}_2\to\mathcal{Y}^n$ ($\mathcal{X}=\mathcal{Y}=\hat{\mathcal{Y}}=\mathbb{R}$) such that for some $D\in(0,\infty)$,
$$\mathbb{P}\left[\|\hat Y^n - Y^n\|^2 > nD\right] \le \varepsilon. \qquad (4.148)$$
Then for any $c\in(0,\infty)$,
$$\ln|\mathcal{W}_1| + c\ln|\mathcal{W}_2| \ge n\inf_{\bar\rho\in[0,1]}\left\{\frac{c}{2}\ln\frac{1-\rho^2\bar\rho^2}{D} + \frac12\ln\frac{1}{1-\bar\rho^2}\right\} - c\sqrt{2n\ln\frac{2}{1-\varepsilon}} - \sqrt{6n\ln\frac{4(1+\varepsilon)}{1-\varepsilon}} - c\ln\frac{2}{1-\varepsilon} - \ln\frac{2(1+\varepsilon)}{1-\varepsilon}. \qquad (4.149)$$
Proof The first part of the proof parallels the proof of [27, Theorem 3], with adaptations to include the distortion function. For any $w\in\mathcal{W}_1$, define the "correctly decodable set"
$$B_w := \left\{y^n : \|y^n - V(w, g(y^n))\|^2 \le nD\right\}. \qquad (4.150)$$
Then
$$\mathbb{E}\left[Q_{Y^n|X^n}[B_{f(X^n)}|X^n]\right] \ge 1-\varepsilon, \qquad (4.151)$$
where $X^n \sim Q_X^{\otimes n}$. Choose any $\varepsilon'\in(\varepsilon,1)$. By the Markov inequality,
$$Q_X^{\otimes n}[x^n : Q_{Y^n|X^n=x^n}[B_{f(x^n)}] \ge 1-\varepsilon'] \ge 1-\frac{\varepsilon}{\varepsilon'}. \qquad (4.152)$$
Applying Theorem 4.3.2 with $\delta = \frac{\varepsilon'-\varepsilon}{2\varepsilon'}$ and $\varepsilon' = 1-\eta$, for $n \ge 24\ln\frac{4\varepsilon'}{\varepsilon'-\varepsilon}$, we find $\mu_n$ such that
$$\mu_n[x^n : Q_{Y^n|X^n=x^n}[B_{f(x^n)}] \ge 1-\varepsilon'] \ge 1-\frac{\varepsilon}{\varepsilon'}-\delta = \frac{\varepsilon'-\varepsilon}{2\varepsilon'} \qquad (4.153)$$
and
$$\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}[B_{f(x^n)}] > 1-\varepsilon'] - c\ln\nu^{\otimes n}[B_{f(x^n)}] \le n\,d_\star(Q_X, Q_{Y|X}, \nu, c) + c\sqrt{2n\ln\frac{1}{1-\varepsilon'}} + \sqrt{6n\ln\frac{4\varepsilon'}{\varepsilon'-\varepsilon}} + c\ln\frac{1}{1-\varepsilon'}. \qquad (4.154)$$
Since $f$ takes at most $|\mathcal{W}_1|$ possible values, from (4.153), there exists a $w^\star$ such that
$$\mu_n[x^n : Q_{Y^n|X^n=x^n}[B_{w^\star}] \ge 1-\varepsilon'] \ge \frac{\varepsilon'-\varepsilon}{2\varepsilon'}|\mathcal{W}_1|^{-1}. \qquad (4.155)$$
Moreover, since $B_{w^\star} \subseteq \bigcup_{w\in\mathcal{W}_2}\{y^n : \|y^n - V(w^\star, w)\|^2 \le nD\}$, by the union bound and the volume estimate of the $n$-dimensional ball,
$$\nu^{\otimes n}[B_{w^\star}] \le |\mathcal{W}_2|(2\pi eD)^{\frac{n}{2}}. \qquad (4.156)$$
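The volume estimate used in (4.156) — a Euclidean ball of radius $\sqrt{nD}$ in $\mathbb{R}^n$ has volume at most $(2\pi eD)^{n/2}$ — follows from $\Gamma(n/2+1) \ge (n/(2e))^{n/2}$ and can be checked numerically:

```python
import math

# A Euclidean ball of radius sqrt(n*D) in R^n has Lebesgue volume
# pi^{n/2} (nD)^{n/2} / Gamma(n/2 + 1), which is at most (2*pi*e*D)^{n/2}.
# We compare log-volumes to avoid overflow; D = 0.25 is an arbitrary value.
def log_ball_volume(n, D):
    r2 = n * D  # squared radius
    return 0.5 * n * math.log(math.pi * r2) - math.lgamma(0.5 * n + 1)

def log_bound(n, D):
    return 0.5 * n * math.log(2 * math.pi * math.e * D)

for n in (1, 2, 10, 100, 1000):
    assert log_ball_volume(n, 0.25) <= log_bound(n, 0.25) + 1e-9
print("volume bound verified")
```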
Comparing (4.154), (4.155) and (4.156), we obtain
$$\ln|\mathcal{W}_1| + c\ln|\mathcal{W}_2| \ge n\inf_{\bar\rho\in[0,1]}\left\{\frac{c}{2}\ln\frac{1-\rho^2\bar\rho^2}{D} + \frac12\ln\frac{1}{1-\bar\rho^2}\right\} - c\sqrt{2n\ln\frac{1}{1-\varepsilon'}} - \sqrt{6n\ln\frac{4\varepsilon'}{\varepsilon'-\varepsilon}} - c\ln\frac{1}{1-\varepsilon'} - \ln\frac{2\varepsilon'}{\varepsilon'-\varepsilon}. \qquad (4.157)$$
(4.149) then follows by taking $\varepsilon' = \frac{1+\varepsilon}{2}$.
Remark: The weak converse version of (4.149) is the following result, which is a special case of [18, Theorem 12.3]. The first term on the right side of (4.149) corresponds to the known (first-order) single-letter region, which can be obtained by taking $D_2\to\infty$ in [18, Theorem 12.3].
Theorem 4.4.3 (see [18, Theorem 12.3] and the references therein) Consider lossy source coding with compressed side information where the per-letter distribution $Q_{XY}$ is jointly Gaussian with covariance matrix $\begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix}$ and the distortion function is quadratic. Then $(R_1, R_2, D)\in[0,\infty)^3$ is achievable if and only if
$$R_2 \ge \frac12\ln\frac{1-\rho^2+\rho^2\exp(-2R_1)}{D}. \qquad (4.158)$$
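As a consistency check (with hypothetical parameter values), the first term of (4.149) is exactly the weighted-sum form of the region (4.158) under the substitution $e^{-2R_1} = 1-\bar\rho^2$:

```python
import math

# Substituting exp(-2*R1) = 1 - b^2 turns the weighted sum
#   R1 + (c/2) ln((1 - rho^2 + rho^2 e^{-2 R1}) / D)
# from the region (4.158) into
#   (1/2) ln(1/(1-b^2)) + (c/2) ln((1 - rho^2 b^2) / D),
# the objective inside the infimum of (4.149). Parameters are illustrative.
rho, D, c = 0.8, 0.1, 1.5

def via_R1(R1):
    return R1 + 0.5 * c * math.log((1 - rho**2 + rho**2 * math.exp(-2 * R1)) / D)

def via_b(b):
    return 0.5 * math.log(1 / (1 - b**2)) + 0.5 * c * math.log((1 - rho**2 * b**2) / D)

for R1 in (0.0, 0.2, 1.0, 3.0):
    b = math.sqrt(1 - math.exp(-2 * R1))
    assert abs(via_R1(R1) - via_b(b)) < 1e-12
print("parametrizations agree")
```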
4.4.4 One-communicator CR Generation
We are now ready (finally!) to give an $O(\sqrt{n})$ second-order converse for the one-communicator CR generation problem in Section 2.3.1. Armed with the optimal second-order change-of-measure lemma of the present chapter, we now only need the following result relating the change-of-measure lemma to the performance of CR generation. For simplicity $m=1$ is assumed, while the method should carry over to the case of arbitrarily many receivers.
Lemma 4.4.4 Consider the single-shot setup as depicted in Figure 1.1 with $m=1$. Suppose that there exist $\delta_1,\delta_2\in(0,1)$, a stochastic encoder $Q_{W|X}$, and deterministic decoders $Q_{K|X}$ and $Q_{K_1|WY}$, such that
$$\mathbb{P}[K = K_1] \ge 1-\delta_1; \qquad (4.159)$$
$$\tfrac12|Q_K - T_K| \le \delta_2, \qquad (4.160)$$
where $T_K$ denotes the equiprobable distribution. Also, suppose that there exist $\mu_X$, $\delta,\varepsilon'\in(0,1)$ and $c\in(1,\infty)$, $d\in(0,\infty)$ such that
$$E_1(Q_X\|\mu_X) \le \delta; \qquad (4.161)$$
$$\mu_X\left(x : Q_{Y|X=x}(A) \ge 1-\varepsilon'\right) \le 2^c\exp(d)\,Q_Y^c(A) \qquad (4.162)$$
for any $A\subseteq\mathcal{Y}$. Then, for any $\delta_3,\delta_4\in(0,1)$ such that $\delta_3\delta_4 = \delta_1+\delta$, we have
$$\delta_2 \ge 1-\delta-\delta_3-\frac{1}{|\mathcal{K}|} - \frac{2\exp\left(\frac{d}{c}\right)|\mathcal{W}|}{(\varepsilon'-\delta_4)^{\frac1c}\,|\mathcal{K}|^{1-\frac1c}}. \qquad (4.163)$$
Proof As we assumed $m=1$, let us write $\hat K := K_1$. Define the joint measure
$$\mu_{XYWK\hat K} := \mu_X Q_{Y|X} Q_{W|X} Q_{K|X} Q_{\hat K|YW}, \qquad (4.164)$$
which we sometimes abbreviate as $\mu$. Since $E_1(Q\|\mu) = E_1(Q_X\|\mu_X) \le \delta$, (4.159) implies
$$\mu(K \ne \hat K) \le \delta_1+\delta. \qquad (4.165)$$
Put
$$\mathcal{J} := \{k : \mu_{\hat K|K}(k|k) \ge 1-\delta_4\}. \qquad (4.166)$$
The Markov inequality implies that $\mu_K(\mathcal{J}) \ge 1-\delta_3$. Now for each $k\in\mathcal{J}$, we have
$$(1-\delta_4)\mu_K(k) \le \mu_{K\hat K}(k,k) \qquad (4.167)$$
$$\le \int_{F_k} Q_{Y|X=x}\Big(\bigcup_{w} A_{kw}\Big)\,d\mu_X(x) \qquad (4.168)$$
$$\le (1-\varepsilon')\mu_K(k) + \mu_X\Big(x : Q_{Y|X=x}\Big(\bigcup_{w} A_{kw}\Big) \ge 1-\varepsilon'\Big) \qquad (4.169)$$
$$\le (1-\varepsilon')\mu_K(k) + 2^c\exp(d)\,Q_Y^c\Big(\bigcup_{w} A_{kw}\Big), \qquad (4.170)$$
where $F_k\subseteq\mathcal{X}$ is the decoding set for $K$, and $A_{kw}$ is the decoding set for $\hat K$ upon receiving $w$. Rearranging,
$$(\varepsilon'-\delta_4)^{\frac1c}\,\mu_K^{\frac1c}(k) \le 2\exp\Big(\frac{d}{c}\Big)\,Q_Y\Big(\bigcup_{w} A_{kw}\Big). \qquad (4.171)$$
Now let $\bar\mu$ be the restriction of $\mu_K$ to $\mathcal{J}$. Then, summing both sides of (4.171) over $k\in\mathcal{J}$, applying the union bound, and noting that $\{A_{kw}\}_k$ is a partition of $\mathcal{Y}$ for each $w$, we obtain
$$D_{\frac1c}(\bar\mu\|T) \ge \ln|\mathcal{K}| - \frac{1}{1-\frac1c}\ln\frac{2|\mathcal{W}|}{(\varepsilon'-\delta_4)^{\frac1c}} - \frac{d}{c-1}. \qquad (4.172)$$
The proof is completed by invoking Proposition 4.4.5 below and noting that
$$E_1(Q_K\|\bar\mu) \le E_1(Q_K\|\mu_K) + E_1(\mu_K\|\bar\mu) \le \delta + \delta_3. \qquad (4.173)$$
Proposition 4.4.5 Suppose $T$ is equiprobable on $\{1,\dots,M\}$ and $\mu$ is a nonnegative measure on the same alphabet. For any $\alpha\in(0,1)$,
$$E_1(T\|\mu) \ge 1-\frac{1}{M}-\exp(-(1-\alpha)D_\alpha(\mu\|T)). \qquad (4.174)$$
The special case of Proposition 4.4.5 where $\mu$ is a probability measure was proved at the end of the proof of Theorem 2.3.2. The extension to unnormalized $\mu$ can easily be proved in a similar way.
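As a quick numeric sanity check (not from the thesis), (4.174) can be verified directly on random unnormalized measures, using the identities $\exp(-(1-\alpha)D_\alpha(\mu\|T)) = \sum_k \mu_k^\alpha (1/M)^{1-\alpha}$ and $E_1(T\|\mu) = \sum_k (1/M - \mu_k)_+$:

```python
import random

# Verify the bound (4.174) on 1000 random unnormalized measures mu on a
# 50-letter alphabet, with T equiprobable and alpha = 1/2.
random.seed(0)
M, alpha = 50, 0.5
for _ in range(1000):
    mu = [random.expovariate(1.0) / M for _ in range(M)]  # unnormalized
    E1 = sum(max(1 / M - m, 0.0) for m in mu)
    renyi_term = sum(m**alpha * (1 / M) ** (1 - alpha) for m in mu)
    assert E1 >= 1 - 1 / M - renyi_term - 1e-12
print("bound (4.174) holds on all random trials")
```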
Now, specializing the single-shot result of Lemma 4.4.4 to the discrete memoryless case and using Theorem 4.3.1, we obtain the following non-asymptotic second-order converse for the CR generation problem. Note that the right side of (4.181) is essentially
$$\frac{\exp\left(\frac{n d_\star}{c} + \frac{A}{c}\sqrt{n}\right)|\mathcal{W}|}{|\mathcal{K}|^{1-\frac1c}}, \qquad (4.175)$$
where the $\sqrt{n}$ term characterizes the correct second-order coding rate.
Theorem 4.4.6 Consider a stationary memoryless source with per-letter distribution $Q_{XY}$. Let
$$\beta_X := \frac{1}{\min_x Q_X(x)} \in [1,\infty); \qquad (4.176)$$
$$\alpha := \sup_x\left\|\frac{dQ_{Y|X=x}}{dQ_Y}\right\|_\infty \in [1,\infty); \qquad (4.177)$$
$$c \in (1,\infty). \qquad (4.178)$$
Suppose that there exists a CR generation scheme (see the setup in Figure 1.1) for which
$$\mathbb{P}[K = K_1] \ge 1-\delta_1 \in (0,1); \qquad (4.179)$$
$$\tfrac12|Q_K - T_K| \le \delta_2 \in (0,1). \qquad (4.180)$$
Then for any $\delta_1'\in(\delta_1,1)$ and $n \ge 3\beta_X\ln\frac{3|\mathcal{X}|}{\delta_1'-\delta_1}$, we have
$$1-\delta_2-\delta_1' \le \frac{1}{|\mathcal{K}|} + \frac{\exp\left(\frac{n d_\star}{c} + \frac{A}{c}\sqrt{n} + \ln\frac{4\delta_1'+2\delta_1}{\delta_1'-\delta_1}\right)|\mathcal{W}|}{\left(\frac{\delta_1'-\delta_1}{4\delta_1'+2\delta_1}\right)^{\frac1c}|\mathcal{K}|^{1-\frac1c}}, \qquad (4.181)$$
where we have defined
$$A := \ln\left(\alpha^c\beta_X^{c+1}\right)\sqrt{3\beta_X\ln\frac{3|\mathcal{X}|}{\delta_1'-\delta_1}} + 2c\sqrt{(\alpha-1)\ln\frac{4\delta_1'+2\delta_1}{\delta_1'-\delta_1}}, \qquad (4.182)$$
$$d_\star := d_\star(Q_X, Q_{Y|X}, Q_Y, c). \qquad (4.183)$$
Proof Specialize Lemma 4.4.4 to the stationary memoryless setting:
$$Q_{XY} \leftarrow Q_{XY}^{\otimes n}. \qquad (4.184)$$
Suppose we first pick the parameter $\delta_3\in(\delta_1,1)$ in Lemma 4.4.4 arbitrarily, which in turn yields the following choices of the other parameters therein:
$$\delta = \frac{\delta_3-\delta_1}{2}; \qquad (4.185)$$
$$\delta_4 = \frac{\delta_1+\delta_3}{2\delta_3}. \qquad (4.186)$$
Moreover, let $\varepsilon' = \frac{1+\delta_4}{2}$ in Lemma 4.4.4. Then we invoke Theorem 4.3.1, with $\nu = Q_Y$ and
$$\eta = 1-\varepsilon'. \qquad (4.187)$$
Then (4.163) implies that
$$1-\delta_2-\frac{3\delta_3-\delta_1}{2} \le \frac{1}{|\mathcal{K}|} + \frac{\exp\left(\frac{n d_\star}{c} + \frac{A}{c}\sqrt{n} + \ln\frac{4\delta_3}{\delta_3-\delta_1}\right)|\mathcal{W}|}{\left(\frac{\delta_3-\delta_1}{4\delta_3}\right)^{\frac1c}|\mathcal{K}|^{1-\frac1c}}, \qquad (4.188)$$
which is equivalent to (4.181) upon setting $\delta_1' = \frac{3\delta_3-\delta_1}{2}$.
4.5 About Extensions to Processes with Weak Correlation
The majority of this chapter focuses on stationary memoryless settings, and in particular, we succeeded in using the pumping-up approach to determine that the second-order rate is the square root of the blocklength in such settings for many problems in network information theory. However, the potential of the pumping-up approach is not limited to the stationary memoryless setting.

Recall that the pumping-up approach consists of two ingredients: one is reverse hypercontractivity (4.24), and the other is to show that the integral of the function does not increase too much after applying the operator (4.25). Regarding the first ingredient, we note that reverse hypercontractivity follows from a modified log-Sobolev inequality (also known as the 1-log-Sobolev inequality) [77]. There have already been results on modified log-Sobolev inequalities for "weakly correlated processes" (e.g. [153, Theorem 2.1]). The second ingredient can also be extended to certain weakly correlated processes. For example, consider the steps (4.35)-(4.38) where $P^{\otimes n}$ is replaced with some $P_{Y^n}$ which is not necessarily stationary memoryless. Those steps continue to hold under the assumption that
$$\alpha := \left\|\frac{dP_{Y_i|Y^{\setminus i}=y^{\setminus i}}}{dQ}\right\|_\infty < \infty, \quad \forall i = 1,\dots,n,\ y^{\setminus i}\in\mathcal{Y}^{n-1}. \qquad (4.189)$$
Chapter 5
More Dualities
In Chapters 2-4 we have introduced the three ingredients of the functional approach to information-theoretic converses: duality, smoothing, and reverse hypercontractivity, and we have focused on the example of the Brascamp-Lieb divergence to illustrate these ideas. The present chapter is a case study of two other information measures, their dual representations, and their applications to the achievability and converse parts of coding theorems.
5.1 Air on $g^\star$
The section title only bears a superficial connection to Bach’s “Air on the G String”.
We will revisit the image-size characterization, a technique in the classical book [8,
Chap 16] which is built on the blowing-up lemma and succeeded in showing the strong
converse property of all source-channel networks with known first-order rate region in
[8]. The problem considered in Section 4.3 is essentially a basic example of the image-size problem, consisting only of a "forward channel". The purpose of this section is to further demonstrate that the pumping-up argument is capable of substituting for the blowing-up argument whenever the latter is used for image-size analysis, yielding the optimal second-order bounds for the related coding theorems. To achieve this goal, we must show how the pumping-up approach applies to a more involved image-size problem containing a "reverse channel". In particular, we introduce an information-theoretic quantity, denoted by $g^\star$, to replace the $g$-function for the image size of a set used in [8] in some intermediate steps. Informally speaking, [8] upper bounds the image size $g(\mathcal{C})$ in terms of $g(\mathcal{B})$, where $\mathcal{B}$ and $\mathcal{C}$ are the images of a set $\mathcal{A}$ under two channels, as shown in Figure 5.1. In contrast, we upper bound $g^\star(\mathcal{C})$ in terms of $g(\mathcal{B})$. However, the end result needed for the converse proofs of the operational problems in [8] is an upper bound on $|\mathcal{A}|$ in terms of $g(\mathcal{B})$ when $\mathcal{A}$ is a good codebook for the $Z$-channel. We prove Theorem 5.1.9, a second-order sharpening of such an end result, which can be readily plugged into the converse proofs of the various operational problems in [8] to obtain $O(\sqrt{n})$ second-order rates.
In addition to obtaining the optimal second-order bounds, we find that the new approach also has the advantage of simplifying certain technical aspects in dealing with the reverse channel; for example, we introduce a simpler definition of the image of a set, so that the "maximal code lemma" used in [8, Chap 16] is no longer needed in the proofs of certain results. Once again, this is made possible by the fact that we are looking at functions (not necessarily indicator functions of sets) instead of restricting our attention to sets.
We start by reviewing some relevant concepts from [8] in Section 5.1.1. Then we introduce $g^\star$ in Section 5.1.2. In Section 5.1.3 we show that in the discrete memoryless case, $g$ and $g^\star$ differ by a sub-exponential factor.
5.1.1 Review of the Image-size Problem
Definition 5.1.1 ([8, Definition 6.2]) Consider $Q_{Z|X}$, a nonnegative measure $\nu_Z$, and $\eta\in(0,1)$. An $\eta$-image of $\mathcal{A}\subseteq\mathcal{X}$ is any set $\mathcal{B}\subseteq\mathcal{Z}$ for which $Q_{Z|X}[\mathcal{B}]\big|_{\mathcal{A}} \ge \eta$.¹

¹$Q_{Z|X}[\mathcal{B}]\big|_{\mathcal{A}}$ denotes the restriction of the function $Q_{Z|X}(1_{\mathcal{B}})$ to $\mathcal{A}$.
Figure 5.1: The tradeoff between the sizes of the two images $\mathcal{B}\subseteq\mathcal{Y}$ and $\mathcal{C}\subseteq\mathcal{Z}$ of a set $\mathcal{A}\subseteq\mathcal{X}$.
Define the $\eta$-image size of $\mathcal{A}$:
$$g_{Z|X}(\mathcal{A}, \nu_Z, \eta) := \inf_{\mathcal{B}\,:\,\eta\text{-image of }\mathcal{A}} \nu_Z[\mathcal{B}]. \qquad (5.1)$$
Fundamental properties of $g$ are discussed in [8], which we informally summarize below (for precise statements see [8, Lemma 15.2]).

Proposition 5.1.1 ([8, Lemma 15.2]) Consider $Q_{Z|X}^{\otimes n}$, the counting measure $\nu_Z^{\otimes n}$, and any $\eta\in(0,1)$. For any $\mathcal{A}\subseteq\mathcal{X}^n$,
$$\frac1n\log g_{Z^n|X^n}(\mathcal{A}, \nu_Z^{\otimes n}, \eta) \ge \frac1n H(Z^n) - o(1), \qquad (5.2)$$
where $Z^n$ is the channel output of $X^n$, which is equiprobably distributed on $\mathcal{A}$.
Proposition 5.1.2 ([8, Lemma 15.2]) In the setting of Proposition 5.1.1, if, additionally, $\mathcal{A}$ is the codebook of a certain $(n,\varepsilon)$ fixed-composition code under the average error criterion, then
$$\frac1n\log g_{Z^n|X^n}(\mathcal{A}, \nu_Z^{\otimes n}, \eta) \le \frac1n H(Z^n) + \varepsilon\log|\mathcal{X}| + o(1). \qquad (5.3)$$
Proposition 5.1.3 ([8, Theorem 15.10]) Consider $Q_X$, $Q_{Y|X}$ and $Q_{Z|X}$ where $|\mathcal{X}| < \infty$. Let $\eta\in(0,1)$ and let $\nu_Y$ and $\nu_Z$ be the counting measures. Then for any $\mathcal{A}\subseteq\mathcal{X}^n$,
$$\frac1n\log g_{Z^n|X^n}(\mathcal{A}\cap\mathcal{C}_n, \nu_Z^{\otimes n}, \eta) - \frac1n\log g_{Y^n|X^n}(\mathcal{A}\cap\mathcal{C}_n, \nu_Y^{\otimes n}, \eta) \le \sup_{P_{UX} : P_X=Q_X}\{H(Z|U) - H(Y|U)\} + o(1) \qquad (5.4)$$
where $\mathcal{C}_n := \{x^n : |\hat P_{x^n} - Q_X| \in \omega(n^{-\frac12})\cap o(1)\}$ denotes the typical set.
Remark 5.1.1 In the proof of Proposition 5.1.3, [8] used a technical result called the maximal code lemma [8, Lemma 6.3]. Roughly speaking, the proof therein proceeded by connecting $g$ with the entropy of the output for the equiprobable distribution on the input set (Propositions 5.1.1-5.1.2). Since the bound in Proposition 5.1.1 is generally not tight when $\mathcal{A}$ is not a good codebook, [8] used the maximal code lemma to find a good codebook $\bar{\mathcal{A}}\subseteq\mathcal{A}$ such that $g_{Z^n|X^n}(\bar{\mathcal{A}}, \nu_Z^{\otimes n}, \eta) \approx g_{Z^n|X^n}(\mathcal{A}, \nu_Z^{\otimes n}, \eta)$, and then finished the proof by focusing on $\bar{\mathcal{A}}$ instead. On the other hand, our approach below using convex duality sidesteps the maximal code lemma.
Since $g$ is defined as the minimum size of an image, (5.4) of course continues to hold when $g_{Y^n|X^n}(\mathcal{A}\cap\mathcal{C}_n, \nu_Y^{\otimes n}, \eta)$ is replaced by the size of any image of $\mathcal{A}\cap\mathcal{C}_n$. (For example, in the proof of the strong converse for the asymmetric broadcast channel with a common message [8], $\mathcal{A}$ is taken as the set of inputs for a fixed common message; its decoding set is then an image of $\mathcal{A}\cap\mathcal{C}_n$.) But (5.4) no longer holds when $g_{Z^n|X^n}(\mathcal{A}\cap\mathcal{C}_n, \nu_Z^{\otimes n}, \eta)$ is replaced by the size of an arbitrary image; therefore, when (5.4) is used in the proof of coding theorems, we never substitute a decoding set for $g_{Z^n|X^n}(\mathcal{A}\cap\mathcal{C}_n, \nu_Z^{\otimes n}, \eta)$. Instead, one uses the following property of the image size when $\mathcal{A}$ is a good codebook of $P$-typical sequences:
$$g_{Z^n|X^n}(\mathcal{A}, \nu_Z^{\otimes n}, \eta) \ge |\mathcal{A}|\,e^{-nD(Q_{Z|X}\|\nu_Z|P)+o(n)}; \qquad (5.5)$$
see [8, Lemma 6.4] for the precise statements, and see Proposition 5.1.6 below for a refinement. Using (5.5) and Proposition 5.1.3 we have the following.
Proposition 5.1.4 In the setting of Proposition 5.1.3, let Cn be the QX-typical set
and let A be an pn, εq-maximal error codebook for QbnZ|X . Then
1
nlog |AX Cn| ´
1
nlog gY n|XnpAX Cn, νbnY , ηq
ď supPUX : PX“QX
tIpZ;X|Uq ´HpY |Uqu ` op1q (5.6)
Proposition 5.1.4 is the key argument used in [8] for the strong converse proof of
coding theorems including2
• Asymmetric broadcast channel with a common message [8].
• Zigzag network [8].
• Gelfand-Pinsker channel coding [154].
In the remainder of this chapter, we use the pumping-up approach to sharpen Proposition 5.1.4 (see Theorem 5.1.9 below), which implies converses for the above operational problems with an $O(\sqrt{n})$ second-order term.
5.1.2 g‹ and the Optimal Second-order Image Size Lemmas
²In the Gelfand-Pinsker case, [154] defined a deterministic channel $Q_{Z^n|X^n}$, for which any input set is obviously a good codebook.
We now introduce a variant of the image size, which is technically more convenient (e.g., it avoids the use of the maximal code lemma [8, Lemma 6.3] when we characterize the exponents of the image sizes), but can be applied to the converse proof of operational problems in the same way. At the end of this subsection, we present the main result, Theorem 5.1.9, which is a second-order refinement of Proposition 5.1.4, allowing one to show an $O(\sqrt{n})$ second-order converse for the network information theory problems listed at the end of the previous subsection.
Definition 5.1.2 Consider $Q_{Z|X}$ and a nonnegative measure $\nu_Z$. Define
$$g^\star_{Z|X}(A,\nu_Z) := \inf_{g\in H_{[0,\infty)}(\mathcal{Z})\colon Q_{Z|X}(\ln g)|_A\ge 0} \nu_Z(g). \quad (5.7)$$
In our approach to converse proofs, we will use $g^\star$ instead of $g$ since the former is mathematically more convenient (e.g., it satisfies an exact tensorization property). However, in Section 5.1.3, we show that $g$ and $g^\star$ differ by a factor of $e^{o(n)}$ for discrete memoryless channels.
The following exact formula for $g^\star$ in terms of the relative entropy can be viewed as an improvement of Propositions 5.1.1-5.1.2.
Proposition 5.1.5 Fix $Q_{Z|X}$, $\nu_Z$, and let $C$ be an arbitrary subset of $\mathcal{X}$. Then
$$g^\star_{Z|X}(C,\nu_Z) = \exp\Big({-}\inf_{P_X\colon \operatorname{supp}(P_X)\subseteq C} D(P_Z\|\nu_Z)\Big) \quad (5.8)$$
where $P_X \to Q_{Z|X} \to P_Z$.
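Proposition 5.1.5 can be sanity-checked numerically on a toy instance. The sketch below (our own example; the BSC parameters, grids and variable names are not from the thesis) compares the entropic side of (5.8), $\exp(-\inf_{P_X} D(P_Z\|\nu_Z))$, with a direct grid-search minimization of $\nu_Z(g)$ subject to $Q_{Z|X}(\ln g)\ge 0$ on $C$, as in (5.7), for a BSC with crossover $0.1$, the counting measure on $\mathcal{Z}=\{0,1\}$, and $C=\mathcal{X}$.

```python
import math

# Toy instance (ours, not from the thesis): Q_{Z|X} is a BSC(0.1),
# nu_Z is the counting measure on Z = {0,1}, and C = X = {0,1}.
Q = {0: [0.9, 0.1], 1: [0.1, 0.9]}   # Q_{Z|X=x}(z)
nu = [1.0, 1.0]                      # counting measure on Z

def D(p, nu):
    # relative entropy D(p || nu) in nats (nu need not be normalized)
    return sum(pi * math.log(pi / ni) for pi, ni in zip(p, nu) if pi > 0)

# Entropic side of (5.8): exp(-min over P_X of D(P_Z || nu_Z)),
# by grid search over binary input distributions P_X = (p, 1-p).
best = min(
    D([p * Q[0][z] + (1 - p) * Q[1][z] for z in range(2)], nu)
    for p in (k / 10000 for k in range(10001))
)
entropic = math.exp(-best)

# Functional side (5.7): minimize nu_Z(g) over g >= 0 such that
# Q_{Z|X=x}(ln g) >= 0 for both x; grid search over (ln g(0), ln g(1)).
grid = [i / 100 for i in range(-200, 201)]
primal = min(
    math.exp(a) + math.exp(b)
    for a in grid for b in grid
    if 0.9 * a + 0.1 * b >= 0 and 0.1 * a + 0.9 * b >= 0
)

print(entropic, primal)  # both close to 2: the uniform output maximizes H(P_Z)
```

For the counting measure, $D(P_Z\|\nu_Z)=-H(P_Z)$, so both sides equal $\exp(\max H(P_Z))=2$ here; the agreement is a finite-alphabet instance of the convex duality behind Theorem 5.1.7.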
While it is possible to provide a direct proof of Proposition 5.1.5, we shall derive it as a special case of Theorem 5.1.7 below. We now lower bound $g^\star$ in terms of $|A|$ in the case of a good codebook:
Proposition 5.1.6 Consider $|\mathcal{Z}|<\infty$, $Q_{Z|X}$, $\nu_Z$. Let $A\subseteq\mathcal{X}^n$ be an $(n,\epsilon)$-codebook for $Q_{Z|X}^{\otimes n}$ under the geometric criterion (i.e., the codebook satisfies the assumption in Theorem 4.2.2). Then
$$\ln g^\star_{Z^n|X^n}(A,\nu_Z^{\otimes n}) \ge \ln|A| - D(Q_{Z|X}^{\otimes n}\|\nu_Z^{\otimes n}|P_{X^n}) - 2\sqrt{(\alpha-1)n\ln\tfrac{1}{1-\epsilon}} - \ln\tfrac{1}{1-\epsilon}, \quad (5.9)$$
where $P_{X^n}$ is the equiprobable distribution on $A$.
Proof
$$\ln g^\star_{Z^n|X^n}(A,\nu_Z^{\otimes n}) \ge -D(P_{Z^n}\|\nu_Z^{\otimes n}) \quad (5.10)$$
$$= -D(Q_{Z|X}^{\otimes n}\|\nu_Z^{\otimes n}|P_{X^n}) + I(X^n;Z^n) \quad (5.11)$$
$$\ge \ln|A| - D(Q_{Z|X}^{\otimes n}\|\nu_Z^{\otimes n}|P_{X^n}) - 2\sqrt{(\alpha-1)n\ln\tfrac{1}{1-\epsilon}} - \ln\tfrac{1}{1-\epsilon}, \quad (5.12)$$
where (5.10) used Proposition 5.1.5, and (5.12) applied the sharp Fano inequality (Theorem 4.2.2).
The following key observation, which follows from convex duality, establishes the equivalence of a functional inequality and an entropic inequality. Note that the $Z=X$ (i.e., $Q_{Z|X}$ is the identity) special case has been considered in the previous section.
Theorem 5.1.7 Consider finite alphabets $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{Z}$, random transformations $Q_{Y|X}$, $Q_{Z|X}$, and nonnegative measures $\nu_Z$ fully supported on $\mathcal{Z}$ and $\nu_Y$ fully supported on $\mathcal{Y}$. For $d\in\mathbb{R}$, the following statements are equivalent:
$$\inf_{g\colon Q_{Z|X}(\ln g)\ge Q_{Y|X}(\ln f)} \nu_Z(g) \le e^d\,\nu_Y(f), \quad \forall f\ge 0; \quad (5.13)$$
$$D(P_Z\|\nu_Z) + d \ge D(P_Y\|\nu_Y), \quad \forall P_X, \quad (5.14)$$
where $P_X\to Q_{Y|X}\to P_Y$ and $P_X\to Q_{Z|X}\to P_Z$.
Proof (5.13)$\Rightarrow$(5.14). For any $P_X$ and $\epsilon\in(0,\infty)$,
$$D(P_Y\|\nu_Y) = P_X P_{Y|X}(\ln f) - \ln\nu_Y(f) \quad (5.15)$$
$$\le P_X P_{Z|X}(\ln g) - \ln\nu_Z(g) + \epsilon + d \quad (5.16)$$
$$= P_Z(\ln g) - \ln\nu_Z(g) + \epsilon + d \quad (5.17)$$
$$\le D(P_Z\|\nu_Z) + \epsilon + d, \quad (5.18)$$
where
• In (5.15) we have chosen $f := \frac{dP_Y}{d\nu_Y}$.
• By (5.13) we can choose $g$ such that (5.16) holds.
• (5.18) uses the variational formula of the relative entropy.
(5.14)$\Rightarrow$(5.13). For simplicity of presentation we shall assume, without loss of generality, that $\nu_Z$ is a probability measure. We first note that the infimum in (5.13) is achieved. Let $a := \max_x (Q_{Y|X}\ln f)(x) \in [-\infty,\infty)$. Then by considering $g := e^a$ we see that
$$\inf_{g\colon Q_{Z|X}(\ln g)\ge Q_{Y|X}(\ln f)} \nu_Z(g) \le e^a. \quad (5.19)$$
But if $g(z_1) > e^a \max_z \nu_Z^{-1}(z)$ for some $z_1\in\mathcal{Z}$, then $\nu_Z(g) > e^a$. This means that
$$\inf_{g\colon Q_{Z|X}(\ln g)\ge Q_{Y|X}(\ln f)} \nu_Z(g) = \inf_{g\in\mathcal{C}} \nu_Z(g), \quad (5.20)$$
where $\mathcal{C} = \mathcal{C}_1\cap\mathcal{C}_2$ and
$$\mathcal{C}_1 := \{g\in H_+(\mathcal{Z})\colon Q_{Z|X}(\ln g) \ge Q_{Y|X}(\ln f)\}; \quad (5.21)$$
$$\mathcal{C}_2 := H_{[0,\, e^a \max_z \nu_Z^{-1}(z)]}(\mathcal{Z}). \quad (5.22)$$
Note that the closed set $\mathcal{C}_1$ is the domain of the infimum on the left side of (5.20), and $\mathcal{C}_2$ is compact. Therefore $\mathcal{C}$ is compact, and the infimum on the right side of (5.20) is achieved at some $g^\star\in\mathcal{C}$. The infimum on the left side of (5.20) is thus also achieved at $g^\star$.
Next, define a probability measure $P_Z$ by $\frac{dP_Z}{d\nu_Z} = \frac{1}{\nu_Z(g^\star)}g^\star$, and we will argue that there exists some $P_X$ such that $P_X\to Q_{Z|X}\to P_Z$. Indeed, apply the Lagrange multiplier method to the minimization in (5.13):
$$\min_g\{\nu_Z(g) - \lambda_X[Q_{Z|X}(\ln g) - Q_{Y|X}(\ln f)]\}, \quad (5.23)$$
where $\lambda_X$ is a nonnegative measure on $\mathcal{X}$. Then there must be some $\lambda_X$ for which $g^\star$ is a stationary point of the functional in (5.23). The claim then follows since the desired $P_X$ is $\lambda_X$ normalized to a probability measure. From the complementary slackness condition, we also see that
$$Q_{Z|X}\ln g^\star = Q_{Y|X}\ln f, \quad P_X\text{-a.s.} \quad (5.24)$$
With these preparations the rest of the proof follows immediately:
$$\ln\nu_Z(g^\star) = P_X Q_{Z|X}(\ln g^\star) - D(P_Z\|\nu_Z) \quad (5.25)$$
$$\le P_X Q_{Y|X}(\ln f) - D(P_Y\|\nu_Y) + d \quad (5.26)$$
$$\le \ln\nu_Y(f) + d, \quad (5.27)$$
where (5.26) uses (5.24) and (5.14), and (5.27) uses the variational formula of the relative entropy.
Proof of Proposition 5.1.5 We first note that it is without loss of generality to consider the case $\mathcal{X} = C$. The claim then follows from Theorem 5.1.7 by taking $\mathcal{Y}$ to be a singleton.
Applying the now familiar reverse hypercontractivity argument to Theorem 5.1.7,
we obtain the following result, which can be viewed as a second order sharpening of
Proposition 5.1.3.
Theorem 5.1.8 For $(Q_{Z|X}, Q_{Y|X}, \nu_Z, \nu_Y, C)$, where $C\subseteq\mathcal{X}$, define
$$d(Q_{Z|X},Q_{Y|X},\nu_Z,\nu_Y,C) := \sup_{P_X\colon \operatorname{supp}(P_X)\subseteq C}\{D(P_Y\|\nu_Y) - D(P_Z\|\nu_Z)\}, \quad (5.28)$$
where $P_X\to Q_{Y|X}\to P_Y$ and $P_X\to Q_{Z|X}\to P_Z$. Let $\alpha_Y := \sup_x \big\|\frac{dQ_{Y|X=x}}{d\nu_Y}\big\|_\infty$. In the stationary memoryless case, for any $\eta\in(0,1)$ and $\mathcal{C}_n\subseteq\mathcal{X}^n$, we have
$$\ln g^\star_{Z^n|X^n}\big(\mathcal{C}_n\cap\{Q_{Y|X}^{\otimes n}(f)\ge\eta\},\nu_Z^{\otimes n}\big) \le \ln\nu_Y^{\otimes n}(f) + d(Q_{Z|X}^{\otimes n},Q_{Y|X}^{\otimes n},\nu_Z^{\otimes n},\nu_Y^{\otimes n},\mathcal{C}_n) + 2\sqrt{(\alpha_Y-1)n\ln\tfrac{1}{\eta}} + \ln\tfrac{1}{\eta}. \quad (5.29)$$
In particular, given $Q_X$ and $\delta\in(0,1)$, for $n > 3\beta_X\ln\frac{|\mathcal{X}|}{\delta}$ we can find $\mathcal{C}_n$ with $Q_X^{\otimes n}[\mathcal{C}_n]\ge 1-\delta$ so that for any $A\subseteq\mathcal{C}_n$,
$$\ln g^\star_{Z^n|X^n}(A,\nu_Z^{\otimes n}) \le \ln g_{Y^n|X^n}(A,\nu_Y^{\otimes n},\eta) + n\sup_{P_{XU}\colon P_X=Q_X}\{D(P_{Y|U}\|\nu_Y|P_U) - D(P_{Z|U}\|\nu_Z|P_U)\} + \sqrt{3n\beta_X\ln\tfrac{|\mathcal{X}|}{\delta}}\,\ln(\alpha_Y\alpha_Z\beta_X) + 2\sqrt{(\alpha_Y-1)n\ln\tfrac{1}{\eta}} + \ln\tfrac{1}{\eta}, \quad (5.30)$$
where $P_{X|U=u}\to Q_{Z|X}\to P_{Z|U=u}$ and $P_{X|U=u}\to Q_{Y|X}\to P_{Y|U=u}$ for each $u\in\mathcal{U}$, and
$$\beta_X := \big(\min_x Q_X(x)\big)^{-1}; \quad (5.31)$$
$$\alpha_Z := |\nu_Z|\,\sup_x\Big\|\frac{dQ_{Z|X=x}}{d\nu_Z}\Big\|_\infty. \quad (5.32)$$
Proof For $t\in[0,\infty)$ and $\alpha\in[1,\infty)$, put
$$\Lambda^{\nu_Y}_{\alpha,t} = \bigotimes_{i=1}^n\big(e^{-t} + \alpha(1-e^{-t})\nu_Y\big). \quad (5.33)$$
Consider any $f\in H_{[0,1]}(\mathcal{Y}^n)$, and denote $d := d(Q_{Z|X}^{\otimes n},Q_{Y|X}^{\otimes n},\nu_Z^{\otimes n},\nu_Y^{\otimes n},\mathcal{C}_n)$ for simplicity. Then
$$e^{(\alpha_Y-1)nt+d}\,\nu_Y^{\otimes n}(f) \ge e^d\,\nu_Y^{\otimes n}(\Lambda^{\nu_Y}_{\alpha,t}f) \quad (5.34)$$
$$\ge \inf_{g\colon Q_{Z|X}^{\otimes n}(\ln g)|_{\mathcal{C}_n}\,\ge\, Q_{Y|X}^{\otimes n}(\ln\Lambda^{\nu_Y}_{\alpha,t}f)|_{\mathcal{C}_n}} \nu_Z^{\otimes n}(g) \quad (5.35)$$
$$\ge \inf_{g\colon Q_{Z|X}^{\otimes n}(\ln g)|_{\mathcal{C}_n\cap A}\,\ge\,\frac{1}{1-e^{-t}}\ln\eta} \nu_Z^{\otimes n}(g) \quad (5.36)$$
$$= \inf_{g\colon Q_{Z|X}^{\otimes n}(\ln g)|_{\mathcal{C}_n\cap A}\,\ge\,0} \nu_Z^{\otimes n}(g)\cdot\eta^{\frac{1}{1-e^{-t}}}, \quad (5.37)$$
where we defined
$$A := \{x^n\colon Q_{Y^n|X^n=x^n}(f)\ge\eta\}. \quad (5.38)$$
This proves (5.29).
The proof of (5.30) follows by applying the single-letterization argument to (5.29).
More specifically, for any $P_{X^n}$ supported on
$$\mathcal{C}_n := \{x^n\colon \hat{P}_{x^n}\le(1+\epsilon_n)Q_X\}, \quad (5.39)$$
we have
$$D(P_{Y^n}\|\nu_Y^{\otimes n}) - D(P_{Z^n}\|\nu_Z^{\otimes n})$$
$$= \sum_{i=1}^n\big[D(P_{Y^iZ_{i+1}^n}\|\nu_Y^{\otimes i}\otimes\nu_Z^{\otimes(n-i)}) - D(P_{Y^{i-1}Z_i^n}\|\nu_Y^{\otimes(i-1)}\otimes\nu_Z^{\otimes(n-i+1)})\big] \quad (5.40)$$
$$= \sum_{i=1}^n\big[D(P_{Y^iZ_{i+1}^n}\|\nu_Y\otimes P_{Y^{i-1}Z_{i+1}^n}) - D(P_{Y^{i-1}Z_i^n}\|P_{Y^{i-1}Z_{i+1}^n}\otimes\nu_Z)\big] \quad (5.41)$$
$$= n\big[D(P_{Y_I|Y^{I-1}Z_{I+1}^n}\|\nu_Y|P_{IY^{I-1}Z_{I+1}^n}) - D(P_{Z_I|Y^{I-1}Z_{I+1}^n}\|\nu_Z|P_{IY^{I-1}Z_{I+1}^n})\big] \quad (5.42)$$
$$\le n\sup_{P_{UX}\colon P_X\le(1+\epsilon_n)Q_X}\{D(P_{Y|U}\|\nu_Y|P_U) - D(P_{Z|U}\|\nu_Z|P_U)\} \quad (5.43)$$
$$\le n\sup_{P_{UX}\colon P_X=Q_X}\{D(P_{Y|U}\|\nu_Y|P_U) - D(P_{Z|U}\|\nu_Z|P_U)\} + n\epsilon_n\ln(\beta_X\alpha_Y\alpha_Z), \quad (5.44)$$
where $I$ is an equiprobable random variable on $\{1,\dots,n\}$ independent of $X^nY^nZ^n$, and
• (5.42) follows from the definition of the conditional relative entropy.
• To see (5.43), note that $U - X_I - Y_IZ_I$ forms a Markov chain if $U \leftarrow (I,Y^{I-1},Z_{I+1}^n)$, and that $P_{X_I}\le(1+\epsilon_n)Q_X$.
• The proof of (5.44) follows from some simple estimates of the relative entropy. Abbreviate $\epsilon_n$ as $\epsilon$. Consider $\bar{P}_{UX}$ where $\bar{\mathcal{U}} = \mathcal{U}\cup\{\star\}$, $\bar{\mathcal{X}} = \mathcal{X}$, defined as follows:
$$\bar{P}_U := \frac{1}{1+\epsilon}P_U + \frac{\epsilon}{1+\epsilon}\delta_\star; \quad (5.45)$$
$$\bar{P}_{X|U=u} := P_{X|U=u}, \quad \forall u\in\mathcal{U}; \quad (5.46)$$
$$\bar{P}_{X|U=\star} := \frac{1+\epsilon}{\epsilon}\Big(Q_X - \frac{1}{1+\epsilon}P_X\Big), \quad (5.47)$$
where $\delta_\star$ is the one-point distribution on $\star$. Then observe that $\bar{P}_X = Q_X$, and
$$D(\bar{P}_{Z|U}\|\nu_Z|\bar{P}_U) - D(P_{Z|U}\|\nu_Z|P_U) = D\Big(\bar{P}_{Z|U}\Big\|\frac{1}{|\nu_Z|}\nu_Z\Big|\bar{P}_U\Big) - D\Big(P_{Z|U}\Big\|\frac{1}{|\nu_Z|}\nu_Z\Big|P_U\Big) \quad (5.48)$$
$$\ge -\epsilon D\Big(\frac{1+\epsilon}{\epsilon}Q_Z - \frac{1}{\epsilon}P_Z\Big\|\frac{1}{|\nu_Z|}\nu_Z\Big) \quad (5.49)$$
$$\ge -\epsilon\ln\alpha_Z. \quad (5.50)$$
An upper bound on $D(\bar{P}_{Y|U}\|\nu_Y|\bar{P}_U)$ can also be obtained; we omit the steps here.
The proof of (5.30) is completed by choosing $\epsilon_n = \sqrt{\frac{3\beta_X}{n}\ln\frac{|\mathcal{X}|}{\delta}}$ and using
$$\Pr[\hat{P}_{X^n}\le(1+\epsilon_n)Q_X] \ge 1-\delta. \quad (5.51)$$
Finally, applying Proposition 5.1.6 to (5.30), we obtain a second-order sharpening of Proposition 5.1.4:
Theorem 5.1.9 Fix $Q_X$, $Q_{Y|X}$, $Q_{Z|X}$, and $\eta,\delta,\epsilon\in(0,1)$. Then for $n > 3\beta_X\ln\frac{2|\mathcal{X}|}{\delta}$, there exists $\mathcal{C}_n\subseteq\mathcal{X}^n$ with $Q_X^{\otimes n}[\mathcal{C}_n]\ge 1-\delta$ such that for any $A\subseteq\mathcal{C}_n$ which is an $(n,\epsilon)$-codebook under the geometric error criterion,
$$\log|A| - \log g_{Y^n|X^n}(A,\nu_Y^{\otimes n},\eta) \le n\sup_{P_{UX}\colon P_X=Q_X}\{I(Z;X|U) + D(P_{Y|U}\|\nu_Y|P_U)\} + D_{\frac12}\sqrt{n} + D_0, \quad (5.52)$$
where
$$D_{\frac12} := \sqrt{3\beta_X\ln\tfrac{2|\mathcal{X}|}{\delta}}\,\ln(\alpha_Y\alpha_Z\beta_X) + 2\sqrt{(\alpha_Y-1)\ln\tfrac{1}{\eta}} + \sqrt{\tfrac12\ln\tfrac{2}{\delta}}\,\ln\alpha_Z + 2\sqrt{(\alpha_Z-1)\ln\tfrac{1}{1-\epsilon}}, \quad (5.53)$$
$$D_0 := \ln\tfrac{1}{1-\epsilon} + \ln\tfrac{1}{\eta}, \quad (5.54)$$
and αZ , αY and βX are as defined in Theorem 5.1.8.
Proof Consider
$$\mathcal{D}_n := \Big\{x^n\colon \sum_{i=1}^n D(Q_{Z|X=x_i}\|\nu_Z) \le nD(Q_{Z|X}\|\nu_Z|Q_X) + \sqrt{\tfrac{n}{2}\ln\tfrac{1}{\delta}}\,\ln\alpha_Z\Big\}. \quad (5.55)$$
In Proposition 5.1.6, for $A\subseteq\mathcal{D}_n$, we have
$$\ln g^\star_{Z^n|X^n}(A,\nu_Z^{\otimes n}) \ge \ln|A| - nD(Q_{Z|X}\|\nu_Z|Q_X) - \sqrt{\tfrac{n}{2}\ln\tfrac{1}{\delta}}\,\ln\alpha_Z - 2\sqrt{(\alpha_Z-1)n\ln\tfrac{1}{1-\epsilon}} - \ln\tfrac{1}{1-\epsilon}. \quad (5.56)$$
Moreover, since
$$\sup_x D(Q_{Z|X}(\cdot|x)\|\nu_Z) - \inf_x D(Q_{Z|X}(\cdot|x)\|\nu_Z) \le \ln\alpha_Z, \quad (5.57)$$
by Hoeffding's inequality we have
$$Q_X^{\otimes n}[\mathcal{D}_n] \ge 1-\delta. \quad (5.58)$$
If $\mathcal{C}_n$ is as in Theorem 5.1.8 for which (5.30) holds, then for any $A\subseteq\mathcal{C}_n\cap\mathcal{D}_n$, by (5.56) and (5.30) we have
$$\ln|A| \le \ln g_{Y^n|X^n}(A,\nu_Y^{\otimes n},\eta) + nD(Q_{Z|X}\|\nu_Z|Q_X) + n\sup_{P_{XU}\colon P_X=Q_X}\{D(P_{Y|U}\|\nu_Y|P_U) - D(P_{Z|U}\|\nu_Z|P_U)\} + \sqrt{3n\beta_X\ln\tfrac{|\mathcal{X}|}{\delta}}\,\ln(\alpha_Y\alpha_Z\beta_X) + 2\sqrt{(\alpha_Y-1)n\ln\tfrac{1}{\eta}} + \sqrt{\tfrac{n}{2}\ln\tfrac{1}{\delta}}\,\ln\alpha_Z + 2\sqrt{(\alpha_Z-1)n\ln\tfrac{1}{1-\epsilon}} + \ln\tfrac{1}{1-\epsilon} + \ln\tfrac{1}{\eta}. \quad (5.59)$$
5.1.3 Relationship between g and g‹
Proposition 5.1.10 Consider $Q_{Z|X}$, the counting measure $\nu_Z$ on the finite alphabet $\mathcal{Z}$, $\eta\in(0,1)$, $\lambda\in(0,1-\eta)$, and $A\subseteq\mathcal{X}$. We claim that in the definition of $g_{Z|X}$, the assumption $g\in H_{\{0,1\}}$ can be changed to $g\in H_{[0,1]}$ without essential difference. More precisely,
$$\inf_{g\in H_{[0,1]}(\mathcal{Z})\colon Q_{Z|X}(g)|_A\ge\eta} \nu_Z(g) \le g_{Z|X}(A,\nu_Z,\eta), \quad (5.60)$$
$$g_{Z|X}(A,\nu_Z,\eta) \le \lambda^{-1}\inf_{g\in H_{[0,1]}(\mathcal{Z})\colon Q_{Z|X}(g)|_A\ge\eta+\lambda} \nu_Z(g). \quad (5.61)$$
Moreover, $g_{Z|X}$ would be upper-bounded by $g^\star_{Z|X}$ if the assumption $g\in H_{[0,\infty)}$ in the definition of $g^\star_{Z|X}$ were replaced by $H_{[0,1]}$:
$$\inf_{g\in H_{[0,1]}(\mathcal{Z})\colon Q_{Z|X}(g)|_A\ge\eta} \nu_Z(g) \le \eta\inf_{g\in H_{[0,1]}(\mathcal{Z})\colon Q_{Z|X}(\ln g)|_A\ge 0} \nu_Z(g). \quad (5.62)$$
Further, in the stationary memoryless setting with $|\mathcal{X}|<\infty$ and $A\subseteq\mathcal{X}^n$, $g_{Z^n|X^n}$ and $g^\star_{Z^n|X^n}$ are equivalent up to a sub-exponential factor. More precisely, we also have
$$g^\star_{Z^n|X^n}(A,\nu_Z^{\otimes n}) \le \frac{1}{\eta}\,e^{2\sqrt{(\alpha-1)n\ln\frac{1}{\eta}}}\inf_{g\in H_{[0,1]}(\mathcal{Z}^n)\colon Q_{Z|X}^{\otimes n}(g)|_A\ge\eta} \nu_Z^{\otimes n}(g), \quad (5.63)$$
where $\alpha := \sup_x\big\|\frac{dQ_{Z|X=x}}{d\nu_Z}\big\|_\infty$, and
$$g_{Z^n|X^n}(A,\nu_Z^{\otimes n},\eta) \le e^{o(n)}\cdot g^\star_{Z^n|X^n}(A,\nu_Z^{\otimes n}). \quad (5.64)$$
Proof By definition,
$$g_{Z|X}(A,\nu_Z,\eta) = \inf_{g\in H_{\{0,1\}}(\mathcal{Z})\colon Q_{Z|X}(g)|_A\ge\eta} \nu_Z(g), \quad (5.65)$$
so (5.60) follows because $H_{\{0,1\}}(\mathcal{Z})\subseteq H_{[0,1]}(\mathcal{Z})$.
To see (5.61), consider an arbitrary $g\in H_{[0,1]}(\mathcal{Z})$, and put $\bar{g} := 1\{z\colon g(z)\ge\lambda\}$. Note that for any $x$ such that $Q_{Z|X=x}(g)\ge\eta+\lambda$, we have
$$Q_{Z|X=x}(\bar{g}) \ge Q_{Z|X=x}(g) - Q_{Z|X=x}(g-\bar{g}) \quad (5.66)$$
$$\ge \eta+\lambda - \sup_z\big(g(z)-\bar{g}(z)\big) \quad (5.67)$$
$$\ge \eta. \quad (5.68)$$
But $\lambda\bar{g}\le g$, which establishes (5.61).
The proof of (5.62) follows from the fact that for any $g\in H_{[0,1]}(\mathcal{Z})$ and $x$,
$$Q_{Z|X=x}(\eta g) \ge \eta e^{Q_{Z|X=x}(\ln g)}, \quad (5.69)$$
and obviously $\eta g\in H_{[0,1]}(\mathcal{Z})$.
To see (5.63), given $g\in H_{[0,1]}(\mathcal{Z}^n)$ such that $Q_{Z|X}^{\otimes n}(g)|_A\ge\eta$, let $\bar{g} := \big(\frac{1}{\eta}\big)^{1+\frac{1}{t}}\Lambda^{\nu_Z}_{\alpha,t}g$ (see (4.44)), with $t := \sqrt{\frac{1}{(\alpha-1)n}\ln\frac{1}{\eta}}$. Then by the same argument as in the proof of Theorem 4.2.2,
$$Q_{Z|X}^{\otimes n}(\ln\bar{g})\big|_A \ge 0, \quad (5.70)$$
$$\nu_Z^{\otimes n}(\bar{g}) \le \frac{1}{\eta}\,e^{2\sqrt{(\alpha-1)n\ln\frac{1}{\eta}}}\,\nu_Z^{\otimes n}(g). \quad (5.71)$$
Finally, to see (5.64), consider
$$g_{Z^n|X^n}(A,\nu_Z^{\otimes n},\eta) \le \sum_{P\colon n\text{-type}} g_{Z^n|X^n}(A\cap T_P^n,\nu_Z^{\otimes n},\eta) \quad (5.72)$$
$$\le \operatorname{poly}(n)\cdot\max_P g_{Z^n|X^n}(A\cap T_P^n,\nu_Z^{\otimes n},\eta) \quad (5.73)$$
$$\le \operatorname{poly}(n)\cdot M\cdot e^{nH(Q_{Z|X}|P)+n(\epsilon-\eta)} \quad (5.74)$$
$$\le \operatorname{poly}(n)\cdot e^{H(Z^nX^n)+n(\epsilon-\eta)} \quad (5.75)$$
$$\le \operatorname{poly}(n)\cdot e^{H(Z^n)+O(\sqrt{n})+n(\epsilon-\eta)}, \quad (5.76)$$
where
• In (5.74) we used [8, Lemma 6.3] to find an $\epsilon$-maximal error code whose codebook $\bar{A}\subseteq A\cap T_P^n$ is of size $M$, where $\epsilon\in(\eta,1)$ is arbitrary; we have chosen $P$ to be one that achieves the maximum in (5.73).
• In (5.75) we set $X^n$ to be equiprobable on $\bar{A}$ and let $Z^n$ be the corresponding output.
• (5.76) used the sharp Fano inequality (Theorem 4.2.2).
The proof is completed by noting that $\epsilon$ can be arbitrarily close to $\eta$ and using
$$e^{H(Z^n)} \le \sup_{P_{X^n}\colon\operatorname{supp}(P_{X^n})\subseteq\bar{A}} e^{H(P_{Z^n})} \quad (5.77)$$
$$= g^\star_{Z^n|X^n}(\bar{A},\nu_Z^{\otimes n}) \quad (5.78)$$
$$\le g^\star_{Z^n|X^n}(A,\nu_Z^{\otimes n}). \quad (5.79)$$
5.2 Variations on Eγ
Fundamental to the information spectrum approach [44][43] to information theory is the quantity $\Pr[\imath_{P\|Q}(X)>\lambda]$, where $X\sim P$ and $\lambda\in\mathbb{R}$. The knowledge of $\Pr[\imath_{P\|Q}(X)>\lambda]$ for all parameters $\lambda$ is equivalent to the knowledge of the binary hypothesis testing region for the probability measures $P$ and $Q$ (see e.g. [43]). In this section, we study another information theoretic quantity, the knowledge of which (for all parameters) is also equivalent to the knowledge of the binary hypothesis testing region. We show how its variational formula (i.e. the functional representation of this quantity) is used in the achievability and converse proofs of some operational problems.
We define $E_\gamma$ and discuss its mathematical properties in Section 5.2.1. Section 5.2.2 discusses the relationship between $E_\gamma$ and the smooth Renyi divergence, which suggests the potential of $E_\gamma$ in non-asymptotic information theory and large deviation analysis. To illustrate such potential, in Section 5.2.3 we generalize the channel resolvability problem [155] by replacing the total variation metric with the $E_\gamma$ metric, and show how results on $E_\gamma$-resolvability can be applied to source coding, the mutual covering lemma [156] and wiretap channels in Sections 5.2.4-5.2.6.
5.2.1 Eγ: Definition and Dual Formulation
Definition 5.2.1 Given probability distributions $P\ll Q$ on the same alphabet and $\gamma\ge1$, define³
$$E_\gamma(P\|Q) := \Pr[\imath_{P\|Q}(X)>\log\gamma] - \gamma\Pr[\imath_{P\|Q}(Y)>\log\gamma], \quad (5.80)$$
where $X\sim P$ and $Y\sim Q$. Then we see that $E_\gamma$ is an $f$-divergence [158] with
$$f(x) = (x-\gamma)^+. \quad (5.81)$$
We now find a variational formula for $E_\gamma$ via convex duality. Recall that the Legendre-Fenchel dual of a convex functional $\phi(\cdot)$ on the set of measurable functions is defined by
$$\phi^\star(\pi) := \sup_f\Big\{\int f\,d\pi - \phi(f)\Big\} \quad (5.82)$$
for any finite measure $\pi$, and under mild regularity conditions (see e.g. [105, Lemma 4.5.8]), we have
$$\phi(f) = \sup_\pi\Big\{\int f\,d\pi - \phi^\star(\pi)\Big\}. \quad (5.83)$$
³We found the $E_\gamma$ notation in an email communication with Yury Polyanskiy [157]. However, we are uncertain about the genesis of such a notation.
Now, definition (5.80) can be extended to the set of all finite (not necessarily nonnegative and normalized) measures by
$$E_\gamma(\pi\|Q) = \int\Big(\frac{d\pi}{dQ}-\gamma\Big)^+ dQ, \quad (5.84)$$
which is convex in $\pi$. Then for any bounded function $f$, it is easy to verify that
$$\sup_\pi\Big\{\int f\,d\pi - E_\gamma(\pi\|Q)\Big\} = \sup_\pi\Big\{\int\Big[\frac{d\pi}{dQ}f - \Big(\frac{d\pi}{dQ}-\gamma\Big)^+\Big]dQ\Big\} \quad (5.85)$$
$$= \begin{cases}\gamma\int f\,dQ, & 0\le f\le 1 \text{ a.e.};\\ +\infty, & \text{otherwise.}\end{cases} \quad (5.86)$$
We thus obtain:
Theorem 5.2.1 For a given $\gamma\in[1,\infty)$ and a probability measure $Q$, $E_\gamma(\cdot\|Q)$ is the Legendre-Fenchel dual of the functional
$$f\mapsto\begin{cases}\gamma\int f\,dQ, & 0\le f\le 1 \text{ a.e.};\\ +\infty, & \text{otherwise.}\end{cases} \quad (5.87)$$
Thus
$$E_\gamma(\pi\|Q) = \sup_{f\colon 0\le f\le 1}\Big\{\int f\,d\pi - \gamma\int f\,dQ\Big\}. \quad (5.88)$$
The variational formula (5.88) can also be derived from (5.80) by other means; the present derivation emphasizes the dual optimization viewpoint of this thesis.
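For finite alphabets, both sides of (5.88) are easy to evaluate: the integral form (5.84) directly, and the supremum by enumerating indicator functions, which attain it. The sketch below (our own check; the function names are ours) verifies the agreement for the pair $P=\mathrm{Ber}(0.1)$, $Q=\mathrm{Ber}(0.5)$ plotted in Figure 5.2.

```python
import itertools

# P = Ber(0.1), Q = Ber(0.5): the pair plotted in Figure 5.2.
P = [0.9, 0.1]
Q = [0.5, 0.5]

def E_gamma_density(P, Q, g):
    # (5.84): integral of (dP/dQ - gamma)^+ dQ
    return sum(max(p - g * q, 0.0) for p, q in zip(P, Q))

def E_gamma_variational(P, Q, g):
    # (5.88): on a finite alphabet the sup over 0 <= f <= 1 is attained at an
    # indicator, so enumerate all events A and take max P(A) - gamma * Q(A).
    n = len(P)
    return max(
        sum(P[i] for i in A) - g * sum(Q[i] for i in A)
        for r in range(n + 1)
        for A in itertools.combinations(range(n), r)
    )

for g in [1.0, 1.3, 1.8, 4.0]:
    assert abs(E_gamma_density(P, Q, g) - E_gamma_variational(P, Q, g)) < 1e-12

print(E_gamma_density(P, Q, 1.0))  # 0.4, matching Figure 5.2 at gamma = 1
print(E_gamma_density(P, Q, 1.8))  # 0.0: E_gamma vanishes once gamma >= max dP/dQ
```

The second printed value illustrates why the curve in Figure 5.2 hits zero at $\gamma = 1.8 = \max dP/dQ$.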
Below, we prove some basic properties of Eγ useful for later sections. Additional
properties of Eγ can be found in [132][146, Theorem 21][43].
Proposition 5.2.2 Assume that $P\ll S\ll Q$ are probability distributions on the same alphabet, and $\gamma,\gamma_1,\gamma_2\ge1$.
Figure 5.2: $E_\gamma(P\|Q)$ as a function of $\gamma$, where $P = \mathrm{Ber}(0.1)$ and $Q = \mathrm{Ber}(0.5)$.
1. $[1,\infty)\to[0,\infty)\colon \gamma\mapsto E_\gamma(P\|Q)$ is convex, non-increasing, and continuous.
2. For any event $A$,
$$Q(A) \ge \frac{1}{\gamma}\big(P(A) - E_\gamma(P\|Q)\big). \quad (5.89)$$
3. Triangle inequalities:
$$E_{\gamma_1\gamma_2}(P\|Q) \le E_{\gamma_1}(P\|S) + \gamma_1 E_{\gamma_2}(S\|Q), \quad (5.90)$$
$$E_\gamma(P\|Q) + E_\gamma(P\|S) \ge \frac{\gamma}{2}|S-Q| + 1 - \gamma. \quad (5.91)$$
4. Monotonicity: if $P_{XY} = P_XP_{Y|X}$ and $Q_{XY} = Q_XQ_{Y|X}$ are joint distributions on $\mathcal{X}\times\mathcal{Y}$, then
$$E_\gamma(P_X\|Q_X) \le E_\gamma(P_{XY}\|Q_{XY}), \quad (5.92)$$
where equality holds for all $\gamma\ge1$ if and only if $P_{Y|X} = Q_{Y|X}$.
5. Given $P_X$, $P_{Y|X}$ and $Q_{Y|X}$, define
$$E_\gamma(P_{Y|X}\|Q_{Y|X}|P_X) := \mathbb{E}\big[E_\gamma(P_{Y|X}(\cdot|X)\|Q_{Y|X}(\cdot|X))\big], \quad (5.93)$$
where the expectation is w.r.t. $X\sim P_X$. Then we have
$$E_\gamma(P_XP_{Y|X}\|P_XQ_{Y|X}) = E_\gamma(P_{Y|X}\|Q_{Y|X}|P_X). \quad (5.94)$$
6.
$$1 - \gamma\Big(1 - \frac12|P-Q|\Big) \le E_\gamma(P\|Q) \le \frac12|P-Q|. \quad (5.95)$$
Proof The proofs of 1), 2), 4), 5) and the second inequality in (5.95) are omitted
since they follow either directly from the definition of Eγ or are similar to the proofs
of the corresponding properties for total variation distance.
For 3), observe that
$$E_{\gamma_1\gamma_2}(P\|Q) = \max_A\big(P(A) - \gamma_1S(A) + \gamma_1S(A) - \gamma_1\gamma_2Q(A)\big) \quad (5.96)$$
$$\le \max_A\big(P(A)-\gamma_1S(A)\big) + \max_A\big(\gamma_1S(A)-\gamma_1\gamma_2Q(A)\big) \quad (5.97)$$
$$= E_{\gamma_1}(P\|S) + \gamma_1E_{\gamma_2}(S\|Q), \quad (5.98)$$
and that
$$E_\gamma(P\|Q) + E_\gamma(P\|S) = \max_A\big(P(A)-\gamma Q(A)\big) + \max_A\big(1-P(A)-\gamma+\gamma S(A)\big) \quad (5.99)$$
$$\ge \max_A\big(P(A)-\gamma Q(A) + 1-P(A)-\gamma+\gamma S(A)\big) \quad (5.100)$$
$$= \gamma\max_A\big(S(A)-Q(A)\big) + 1-\gamma \quad (5.101)$$
$$= \frac{\gamma}{2}|S-Q| + 1-\gamma. \quad (5.102)$$
As for 6), note that the left inequality in (5.95) follows by setting $S = P$ in (5.91).
The $E_\gamma$ metric provides a very convenient tool for change-of-measure purposes: if $E_\gamma(P\|Q)$ is small, and the probability of some event is large under $P$, then by (5.89) the probability of this event under $Q$ is essentially lower-bounded by $1/\gamma$ times its probability under $P$. In the special case of $\gamma=1$ (total variation distance), this change-of-measure trick has been widely used, see [159][160], but the general $\gamma\ge1$ case is more versatile, since $P$ and $Q$ need not be essentially the same.
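The change-of-measure bound (5.89) can be checked exhaustively on a small alphabet, since it must hold simultaneously for every event. Below is such a check (the four-point distributions are toy values of our own):

```python
import itertools

# Toy four-point distributions (ours, not from the thesis).
P = [0.5, 0.3, 0.15, 0.05]
Q = [0.1, 0.2, 0.3, 0.4]

def E_gamma(P, Q, g):
    # (5.84) specialized to a finite alphabet
    return sum(max(p - g * q, 0.0) for p, q in zip(P, Q))

gamma = 2.0
eg = E_gamma(P, Q, gamma)

# (5.89): Q(A) >= (P(A) - E_gamma(P||Q)) / gamma for every event A.
for r in range(len(P) + 1):
    for A in itertools.combinations(range(len(P)), r):
        pA = sum(P[i] for i in A)
        qA = sum(Q[i] for i in A)
        assert qA >= (pA - eg) / gamma - 1e-12

print(eg)
```

The bound is in fact equivalent to the variational formula: since $E_\gamma = \max_A(P(A)-\gamma Q(A))$, rearranging gives (5.89) for every $A$ at once.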
5.2.2 The Relationship of Eγ with Excess Information and
Smooth Renyi Divergence
This subsection shows that $E_\gamma$ (as a function of $\gamma$) is the inverse function of the smooth Renyi divergence [53] of order $+\infty$. Various bounds among $E_\gamma$ and other popular information measures used in non-asymptotic information theory are presented. These facts buttress the potential of $E_\gamma$ in non-asymptotic analysis and large deviation theory.
Definition 5.2.2 For $\gamma>0$, a probability measure $P$, and a nonnegative $\sigma$-finite measure $\mu$ with $P\ll\mu$, we define the complementary relative information spectrum metric with threshold $\gamma$:
$$\mathcal{F}_\gamma(P\|\mu) := \Pr[\imath_{P\|\mu}(X)>\log\gamma], \quad (5.103)$$
where $X\sim P$. As such, the complementary relative information spectrum can be expressed in terms of the relative information spectrum (see [43]) as
$$\mathcal{F}_\gamma(P\|\mu) = 1 - \mathbb{F}_{P\|\mu}(\log\gamma). \quad (5.104)$$
Also, it is clear from (5.103) that $\mathcal{F}_\gamma$ upper-bounds $E_\gamma$. Note that (5.103) is nonnegative and vanishes when $P=\mu$ provided that $\gamma>1$, and can therefore be considered as a measure of the discrepancy between $P$ and $\mu$. In addition to playing an important role in one-shot analysis (see [161]), (5.103) provides richer information than the relative entropy measure, since
$$D(P\|\mu) := \mathbb{E}[\imath_{P\|\mu}(X)] \quad (5.105)$$
$$= \int_{[0,+\infty)}\Pr[\imath_{P\|\mu}(X)>\tau]\,d\tau - \int_{(-\infty,0]}\big(1-\Pr[\imath_{P\|\mu}(X)>\tau]\big)\,d\tau. \quad (5.106)$$
Moreover, the complementary relative information spectrum is also related to the total variation distance, since for probability measures $P$ and $Q$,
$$\frac12|P-Q| = \Pr[\imath_{P\|Q}(X)>0] - \Pr[\imath_{P\|Q}(Y)>0], \quad (5.107)$$
where $X\sim P$ and $Y\sim Q$ [162]. However, perhaps surprisingly, the complementary relative information spectrum metric does not satisfy the data processing inequality, in contrast to the relative entropy and total variation distance; the proof of such a result, along with other properties of the complementary relative information spectrum, can be found in [143].
Another information measure closely related to $E_\gamma$ is the smooth Renyi divergence. Let us first generalize some of our definitions to allow nonnegative finite measures that are not necessarily probability measures:
Definition 5.2.3 For $\gamma\ge1$ and nonnegative finite measures $\mu\ll\nu$ on $\mathcal{X}$,
$$|\mu-\nu| := \int|d\mu-d\nu|, \quad (5.108)$$
and⁴
$$E_\gamma(\mu\|\nu) := \sup_A\{\mu(A)-\gamma\nu(A)\} \quad (5.109)$$
$$= (\mu-\gamma\nu)(\imath_{\mu\|\nu}>\log\gamma) \quad (5.110)$$
$$= \frac12(\mu-\gamma\nu)(\mathcal{X}) - \frac12(\mu-\gamma\nu)(\imath_{\mu\|\nu}\le\log\gamma) + \frac12(\mu-\gamma\nu)(\imath_{\mu\|\nu}>\log\gamma) \quad (5.111)$$
$$= \frac12\mu(\mathcal{X}) - \frac{\gamma}{2}\nu(\mathcal{X}) + \frac12|\mu-\gamma\nu|. \quad (5.112)$$
Note that $E_1(P\|\mu)\ne\frac12|P-\mu|$ when $\mu$ is not a probability measure.
The following result is a generalization of the $\gamma_1=1$ case of (5.90) to unnormalized measures; the proof is immediate from the definition (5.109) and the subadditivity of the sup operator.
⁴Following established usage in measure theory, we use $\lambda(\imath_{\mu\|\nu}>\log\gamma)$ as an abbreviation of $\lambda(\{x\colon\imath_{\mu\|\nu}(x)>\log\gamma\})$ for an arbitrary signed measure $\lambda$.
Proposition 5.2.3 (Triangle inequality) If $\mu$, $\nu$ and $\theta$ are nonnegative finite measures on the same alphabet with $\mu\ll\theta\ll\nu$, then
$$E_\gamma(\mu\|\nu) \le E_1(\mu\|\theta) + E_\gamma(\theta\|\nu). \quad (5.113)$$
The Renyi divergence is not an $f$-divergence, but is a monotonic function of the Hellinger distance [42]. We have seen the definition of the Renyi divergence between two probability measures in Section 2.5.1. Below we extend the definition to the case of unnormalized measures:
Definition 5.2.4 (Renyi $\alpha$-divergence) Let $\mu$ be a nonnegative finite measure and $Q$ a probability measure on $\mathcal{X}$, $\mu\ll Q$, $X\sim Q$. For $\alpha\in(0,1)\cup(1,+\infty)$,
$$D_\alpha(\mu\|Q) := \frac{1}{\alpha-1}\log\mathbb{E}\Big[\Big(\frac{d\mu}{dQ}(X)\Big)^\alpha\Big], \quad (5.114)$$
and
$$D_0(\mu\|Q) := \log\frac{1}{Q(\imath_{\mu\|Q}>-\infty)}, \quad (5.115)$$
$$D_\infty(\mu\|Q) := \mu\text{-}\operatorname{ess\,sup}\,\imath_{\mu\|Q}, \quad (5.116)$$
which agree with the limits as $\alpha\downarrow0$ and $\alpha\uparrow\infty$. $D_\alpha(P\|Q)$ is non-negative and monotonically increasing in $\alpha$. More properties of the Renyi divergence can be found, e.g., in [163][130][43].
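For finite alphabets, Definition 5.2.4 and the monotonicity in $\alpha$ just noted can be verified directly. The sketch below (toy values ours) also includes the $\alpha\to\infty$ limit (5.116), which for a discrete probability measure is $\log\max dP/dQ$:

```python
import math

# Renyi alpha-divergence (5.114) for finite probability vectors, with the
# alpha -> infinity limit (5.116).  Toy distributions are our own choice.
def D_alpha(P, Q, alpha):
    if alpha == float("inf"):
        # mu-ess sup of the relative information = log max dP/dQ on supp(P)
        return max(math.log(p / q) for p, q in zip(P, Q) if p > 0)
    s = sum(q * (p / q) ** alpha for p, q in zip(P, Q) if p > 0)
    return math.log(s) / (alpha - 1)

P = [0.7, 0.2, 0.1]
Q = [0.3, 0.3, 0.4]

# Monotonicity in alpha, as stated after Definition 5.2.4.
alphas = [0.25, 0.5, 0.9, 1.5, 2.0, 10.0, float("inf")]
vals = [D_alpha(P, Q, a) for a in alphas]
assert all(x <= y + 1e-12 for x, y in zip(vals, vals[1:]))

print(vals[-1])  # D_infinity = log(0.7/0.3)
```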
Definition 5.2.5 (smooth Renyi $\alpha$-divergence) For $\alpha\in[0,1)$ and $\epsilon\in(0,1)$,
$$D^{+\epsilon}_\alpha(P\|Q) := \sup_{\mu\in B^\epsilon(P)} D_\alpha(\mu\|Q); \quad (5.117)$$
for $\alpha\in(1,\infty]$ and $\epsilon\in(0,1)$,
$$D^{-\epsilon}_\alpha(P\|Q) := \inf_{\mu\in B^\epsilon(P)} D_\alpha(\mu\|Q), \quad (5.118)$$
where $B^\epsilon(P) := \{\mu\text{ nonnegative}\colon E_1(P\|\mu)\le\epsilon\}$ is the $\epsilon$-neighborhood of $P$ in $E_1$.
The definition of the smooth $\infty$-divergence given here agrees with the smooth max Renyi divergence in [164], although the definitions look different. However, our smooth $0$-divergence is different from the smooth min Renyi divergences in [165] and [164, Definition 1], except for non-atomic measures.⁵
Remark 5.2.1 The smooth Renyi divergence is a natural extension of the smooth Renyi entropy $H^\epsilon_\alpha$ defined in [53] (which can be viewed as the special case where the reference measure is the counting measure). In [164] the smooth min and max Renyi divergences are introduced, which correspond to the $\alpha=0$ and $\alpha=+\infty$ cases of Definition 5.2.5. Moreover, we have introduced the superscripts $+$ and $-$ in the notation to emphasize the difference between the two possible ways of smoothing in (5.117) and (5.118).
The following quantity, which characterizes the binary hypothesis testing error, is
a g-divergence but not a monotonic function of any f -divergence [42]. We will see in
Proposition 5.2.4 that it is a monotonic function of the 0-smooth Renyi divergence.
Definition 5.2.6 For nonnegative finite measures $\mu$ and $\nu$ on $\mathcal{X}$, define
$$\beta_\alpha(\mu,\nu) := \min_{A\colon\mu(A)\ge\alpha}\nu(A). \quad (5.119)$$
⁵With the definition of the smooth min Renyi divergence in [165] and [164, Definition 1], Proposition 5.2.4-7) would hold with $\beta_\alpha$ as defined in (5.120) rather than (5.119).
Remark 5.2.2 In the literature, the definition of $\beta_\alpha(P,Q)$ is usually restricted to probability measures and allows randomized tests:
$$\beta_\alpha(P_W,Q_W) := \min\int P_{Z|W}(1|w)\,dQ_W(w), \quad (5.120)$$
where the minimization is over all random transformations $P_{Z|W}$ with $Z\in\{0,1\}$ such that
$$\int P_{Z|W}(1|w)\,dP_W(w) \ge \alpha. \quad (5.121)$$
In contrast to $E_\gamma$, allowing randomization in the definition can change the value of $\beta_\alpha$, except for non-atomic measures. Nevertheless, many important properties of $\beta_\alpha$ are not affected by this difference.
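On a small alphabet, the set-based $\beta_\alpha$ of (5.119) can be evaluated by brute force over all events, which makes its basic behavior easy to inspect. A sketch (toy values ours):

```python
import itertools

# Set-based beta_alpha of (5.119): minimize Q(A) over events with P(A) >= alpha.
def beta(alpha, P, Q):
    n = len(P)
    best = float("inf")
    for r in range(n + 1):
        for A in itertools.combinations(range(n), r):
            if sum(P[i] for i in A) >= alpha:
                best = min(best, sum(Q[i] for i in A))
    return best

# Toy distributions (ours, not from the thesis).
P = [0.5, 0.3, 0.15, 0.05]
Q = [0.1, 0.2, 0.3, 0.4]

# beta_alpha is nondecreasing in alpha, and beta_1 forces A to cover supp(P).
grid = [0.1 * k for k in range(11)]
vals = [beta(a, P, Q) for a in grid]
assert all(x <= y + 1e-12 for x, y in zip(vals, vals[1:]))
assert abs(vals[-1] - sum(Q)) < 1e-12   # supp(P) is the whole alphabet here

print(vals)
```

For larger alphabets one would instead sort by the likelihood ratio (the Neyman-Pearson structure of the optimal test), but brute force suffices to illustrate the definition.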
We conclude this section with some inequalities relating the distance measures we have discussed. These results show the asymptotic equivalence among these metrics: $E_\gamma$ is essentially equivalent to $\mathcal{F}_\gamma$, and the value of $\gamma$ at which they vanish is essentially the exponential of the relative entropy; the smooth Renyi divergence is also asymptotically equivalent to the relative entropy.
Proposition 5.2.4 Suppose $\mu$ is a finite nonnegative measure and $P$ and $Q$ are probability measures, all on $\mathcal{X}$, $X\sim P$, $\epsilon\in(0,1)$, and $\gamma\ge1$.
1. For $a>1$,
$$E_\gamma(P\|Q) \le \mathcal{F}_\gamma(P\|Q) \le \frac{a}{a-1}E_{\frac{\gamma}{a}}(P\|Q). \quad (5.122)$$
2. $D^{+\epsilon}_\alpha(P\|Q)$ is increasing in $\alpha\in[0,1]$ and $D^{-\epsilon}_\alpha(P\|Q)$ is increasing in $\alpha\in[1,\infty]$.
3. If $P$ is a probability measure, then
$$D(P\|Q) \le \int_0^\infty\Pr[\imath_{P\|Q}(X)>\tau]\,d\tau; \quad (5.123)$$
$$D(P\|Q) \ge E_\gamma(P\|Q)\log\gamma - 2e^{-1}\log e. \quad (5.124)$$
4. If $\alpha\in[0,1)$ then⁶
$$D_\alpha(\mu\|Q) \le \log\gamma - \frac{1}{1-\alpha}\log\big(\mu(\mathcal{X})-E_\gamma(\mu\|Q)\big). \quad (5.125)$$
If $\alpha\in(1,\infty)$ then
$$D_\alpha(\mu\|Q) \ge \log\gamma + \frac{1}{\alpha-1}\log E_\gamma(\mu\|Q). \quad (5.126)$$
5. Suppose $\alpha\in[0,1)$ and $E_\gamma(P\|Q)<1-\epsilon$; then
$$D^{+\epsilon}_\alpha(P\|Q) \le \log\gamma - \frac{1}{1-\alpha}\log\big(1-\epsilon-E_\gamma(P\|Q)\big). \quad (5.127)$$
Suppose $\alpha\in(1,\infty]$ and $E_\gamma(P\|Q)>\epsilon$; then
$$D^{-\epsilon}_\alpha(P\|Q) \ge \log\gamma + \frac{1}{\alpha-1}\log\big(E_\gamma(P\|Q)-\epsilon\big). \quad (5.128)$$
6. $D^{-\epsilon}_\infty(P\|Q)\le\log\gamma \iff E_\gamma(P\|Q)\le\epsilon$. That is, $D^{-\epsilon}_\infty(P\|Q) = \log\inf\{\gamma\colon E_\gamma(P\|Q)\le\epsilon\}$.
7. $D^{+\epsilon}_0(P\|Q) = -\log\beta_{1-\epsilon}(P,Q)$.
8. Fix $\tau\in\mathbb{R}$ and $\epsilon\ge\Pr[\imath_{P\|Q}(X)\le\tau]$, where $X\sim P$ and $P$ is non-atomic. Then
$$D^{+\epsilon}_0(P\|Q) \ge \tau + \log\frac{1}{1-\epsilon}. \quad (5.129)$$
Moreover, $D^{+\epsilon}_0(P\|Q)\ge\tau$ holds when $P$ is not necessarily non-atomic.
⁶The special case of (5.125) for $\alpha=\frac12$, $\mu(\mathcal{X})=1$ and $\gamma=1$ is equivalent to a well-known bound on a quantity called fidelity in terms of the total variation distance; see [166].
Proof 1. The first inequality in (5.122) is evident from the definition. For the second inequality in (5.122), consider the event
$$A := \{x\in\mathcal{X}\colon\imath_{P\|Q}(x)>\log\gamma\}. \quad (5.130)$$
Then
$$E_{\frac{\gamma}{a}}(P\|Q) \ge P(A) - \frac{\gamma}{a}Q(A)$$
$$\ge \Pr[\imath_{P\|Q}(X)>\log\gamma] - \frac{\gamma}{a}\cdot\frac{1}{\gamma}\Pr[\imath_{P\|Q}(X)>\log\gamma] \quad (5.131)$$
$$= \frac{a-1}{a}\Pr[\imath_{P\|Q}(X)>\log\gamma], \quad (5.132)$$
and the result follows by rearrangement.
2. Direct from the monotonicity of the Renyi divergences in $\alpha$ and the definitions of their smooth versions.
3. The bound (5.123) follows from $D(P\|Q) = \mathbb{E}[\imath_{P\|Q}(X)]$, and (5.124) can be seen from
$$D(P\|Q) \ge \mathbb{E}[|\imath_{P\|Q}(X)|] - 2e^{-1}\log e \quad (5.133)$$
$$\ge \log\gamma\,\Pr[\imath_{P\|Q}(X)>\log\gamma] - 2e^{-1}\log e \quad (5.134)$$
$$\ge \log\gamma\,E_\gamma(P\|Q) - 2e^{-1}\log e, \quad (5.135)$$
where (5.133) is due to Pinsker [167, (2.3.2)], (5.134) uses Markov's inequality, and (5.135) follows from the definition (5.80).
4. By considering the $\frac{d\mu}{dQ}>\gamma$ and $\frac{d\mu}{dQ}\le\gamma$ cases separately, we can check the following (homogeneous) inequalities: for each $\alpha<1$,
$$\Big|\frac{d\mu}{dQ}-\gamma\Big| \ge \frac{d\mu}{dQ} - 2\gamma^{1-\alpha}\Big(\frac{d\mu}{dQ}\Big)^\alpha + \gamma, \quad (5.136)$$
and for each $\alpha>1$,
$$\Big|\frac{d\mu}{dQ}-\gamma\Big| \le -\frac{d\mu}{dQ} + 2\gamma^{1-\alpha}\Big(\frac{d\mu}{dQ}\Big)^\alpha + \gamma. \quad (5.137)$$
Integrating both sides of (5.136) with respect to $dQ$, we obtain
$$|\mu-\gamma Q| \ge \mu(\mathcal{X}) - 2\gamma^{1-\alpha}\int\Big(\frac{d\mu}{dQ}\Big)^\alpha dQ + \gamma, \quad (5.138)$$
and (5.125) follows by rearrangement. Integrating both sides of (5.137) with respect to $dQ$, we obtain
$$|\mu-\gamma Q| \le -\mu(\mathcal{X}) + 2\gamma^{1-\alpha}\int\Big(\frac{d\mu}{dQ}\Big)^\alpha dQ + \gamma, \quad (5.139)$$
and (5.126) follows by rearrangement.
5. Immediate from the previous result, the definition of the smooth Renyi divergence, and the triangle inequality (5.90).
6. $\Rightarrow$: By assumption there exists a nonnegative finite measure $\mu$ such that $E_1(P\|\mu)\le\epsilon$ and $\mu\le\gamma Q$. Then from Proposition 5.2.3,
$$E_\gamma(P\|Q) \le E_1(P\|\mu) + E_\gamma(\mu\|Q) \le \epsilon + 0.$$
$\Leftarrow$: Define $\mu$ by $\frac{d\mu}{dQ} := \min\{\frac{dP}{dQ},\gamma\}$. Since $\{\frac{dP}{dQ}>\gamma\} = \{\frac{dP}{d\mu}>1\}$,
$$E_1(P\|\mu) = P\Big(\frac{dP}{dQ}>\gamma\Big) - \mu\Big(\frac{dP}{dQ}>\gamma\Big) = P\Big(\frac{dP}{dQ}>\gamma\Big) - \gamma Q\Big(\frac{dP}{dQ}>\gamma\Big) \le E_\gamma(P\|Q).$$
Then $D_\infty(\mu\|Q)\le\log\gamma$ implies that $D^{-\epsilon}_\infty(P\|Q)\le\log\gamma$.
7. "$\ge$": Let $A$ be a set achieving the minimum in (5.119), and let $\mu$ be the restriction of $P$ to $A$. Then by definition (5.119) we have
$$E_1(P\|\mu) = P(A^c) \le \epsilon, \quad (5.140)$$
and therefore
$$D^{+\epsilon}_0(P\|Q) \ge -\log\beta_{1-\epsilon}(P,Q). \quad (5.141)$$
"$\le$": Fix an arbitrary $\delta>0$ and let $\mu$ be a nonnegative measure satisfying
$$D^{+\epsilon}_0(P\|Q) < D_0(\mu\|Q) + \delta. \quad (5.142)$$
Define $A := \operatorname{supp}(\mu)$. Then
$$P(A^c) = P(A^c) - \mu(A^c) \quad (5.143)$$
$$\le E_1(P\|\mu) \quad (5.144)$$
$$\le \epsilon, \quad (5.145)$$
hence
$$D^{+\epsilon}_0(P\|Q) - \delta \le D_0(\mu\|Q) \quad (5.146)$$
$$= -\log Q(A) \quad (5.147)$$
$$\le -\log\beta_{1-\epsilon}(P,Q), \quad (5.148)$$
and the result follows by letting $\delta\downarrow0$.
8. In the case of non-atomic $P$, there exists a set $A\subseteq\{x\colon\imath_{P\|Q}(x)>\tau\}$ such that $P(A) = 1-\epsilon$. Then
$$\beta_{1-\epsilon}(P,Q) \le Q(A) \quad (5.149)$$
$$\le \exp(-\tau)P(A). \quad (5.150)$$
We obtain (5.129) by taking logarithms on both sides of the above and invoking Part 7. When $P$ is not necessarily non-atomic, since $P(\imath_{P\|Q}>\tau)\ge1-\epsilon$,
$$\beta_{1-\epsilon}(P,Q) \le Q(\imath_{P\|Q}>\tau) \quad (5.151)$$
$$\le \exp(-\tau)P(\imath_{P\|Q}>\tau) \quad (5.152)$$
$$\le \exp(-\tau), \quad (5.153)$$
and again the result follows by taking logarithms on both sides of the above.
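The sandwich (5.122) between $E_\gamma$ and $\mathcal{F}_\gamma$ is easy to verify numerically for finite distributions; the sketch below (toy values and function names ours) checks it over a small range of $\gamma$ and $a$:

```python
# Numeric check of the sandwich (5.122):
#   E_gamma <= F_gamma <= a/(a-1) * E_{gamma/a},  a > 1.
P = [0.6, 0.25, 0.1, 0.05]
Q = [0.2, 0.2, 0.3, 0.3]

def E_gamma(P, Q, g):
    # (5.84): integral of (dP/dQ - gamma)^+ dQ
    return sum(max(p - g * q, 0.0) for p, q in zip(P, Q))

def F_gamma(P, Q, g):
    # (5.103) with mu = Q: P-mass of the set where dP/dQ > gamma
    return sum(p for p, q in zip(P, Q) if p > g * q)

for g in [1.0, 1.5, 2.0, 3.0]:
    for a in [1.5, 2.0, 4.0]:
        lhs = E_gamma(P, Q, g)
        mid = F_gamma(P, Q, g)
        rhs = a / (a - 1) * E_gamma(P, Q, g / a)
        assert lhs <= mid + 1e-12
        assert mid <= rhs + 1e-9

print(F_gamma(P, Q, 1.0))
```

The check mirrors the proof of Part 1: on the event $\{\imath_{P\|Q}>\log\gamma\}$ one has $Q\le P/\gamma$, which yields the factor $a/(a-1)$.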
5.2.3 Eγ-Resolvability
Given the close connection between $E_\gamma$ and the smooth Renyi divergence, and the variational formula (5.88), we expect that $E_\gamma$ has great potential in various venues of non-asymptotic information theory. However, in this thesis we only discuss the usage of $E_\gamma$ in the setting of channel resolvability.
After an introduction to the resolvability problem and the formulation of its extension to the $E_\gamma$ setting, we present a general one-shot achievability bound on $E_\gamma$-resolvability, which is the basis of all information-theoretic applications of $E_\gamma$ that we will discuss in this thesis. Then, a refinement of such a result using concentration inequalities leads to a one-shot tail bound on the approximation error for random codebooks, which is useful for some problems in the information-theoretic literature [168]. Finally, we argue the asymptotic tightness of the one-shot bounds in the discrete memoryless setting using the method of types.
Later, we will discuss the applications of the results on $E_\gamma$-resolvability in non-asymptotic information theory (Sections 5.2.4-5.2.6).
The Resolvability Problem: An Introduction
Channel resolvability, introduced by Han and Verdu [155], is defined as the minimum
randomness rate required to synthesize an input so that its corresponding output dis-
tribution approximates a target output distribution. While the resolvability problem
itself differs from classical topics in information theory such as data compression and
transmission, [155] unveils its potential utility in operational problems through the
solution of the strong converse problem of identification coding [22]. Other applica-
tions of distribution approximation in information theory include common randomness of two random variables [6], strong converse in identification through channels
[155], random process simulation [169], secrecy [170][171][172][173], channel synthesis
[174][159], lossless and lossy source coding [169][155][160], and the empirical distri-
bution of a capacity-achieving code [175][149]. The achievability part of resolvability
(also known as the soft-covering lemma in [159]) is particularly useful, and coding the-
orems via resolvability have certain advantages over what is obtained from traditional
typicality-based approaches (see e.g. [172]).
If the channel is stationary memoryless and the target output distribution is in-
duced by a stationary memoryless input, then the resolvability is the minimum mutual
information over all input distributions inducing the (per-letter) target output dis-
tribution, no matter whether the approximation error is measured in total variation
distance [155][176, Theorem 1], normalized relative entropy [6, Theorem 6.3][155],
or unnormalized relative entropy [173]. In contrast, relatively few measures for the
quality of the approximation of output statistics have been proposed for which the
resolvability can be strictly smaller than mutual information. As shown by Steinberg
and Verdu [169], one exception is the Wasserstein distance measure, in which case
the finite precision resolvability for the identity channel can be related to the rate-
distortion function of the source where the distortion is the metric in the definition
of the Wasserstein distance [169].
We remark that in view of convex duality, channel resolvability in the relative
entropy is equivalent to (our new definition of) the image-size problem in a precise
sense (Proposition 5.1.5): there exists an input distribution supported on a set of a
given size whose output approximates a given output distribution if and only if the
“image” of that set takes up almost all of that output distribution.
In this section we generalize the theory of resolvability by considering a distance
measure, Eγ, defined in Section 5.2.1, of which the total variation distance is a special
case where $\gamma = 1$. The $E_\gamma$ metric$^7$ is more forgiving than total variation distance
when $\gamma > 1$, and the larger $\gamma$ is, the less randomness is needed at the input for
approximation in $E_\gamma$. Various achievability and converse bounds for resolvability in
the $E_\gamma$ metric will be derived.
• In the case of a discrete memoryless channel with a given stationary memoryless target output, we provide a single-letter characterization of the minimum exponential growth rate of $\gamma$ to achieve approximation in $E_\gamma$.
• In addition to achievability results in terms of the expectation of the approximation error over a random codebook, we also prove achievability under the high-probability criterion$^8$ for a random codebook. The implications of the latter problem for secrecy have been noted by several authors [168][177][178]. Here we adopt a simple non-asymptotic approach based on concentration inequalities, dispensing with the finiteness or stationarity assumptions on the alphabet required by the previous proof method [168] based on Chernoff bounds.
• An asymptotically matching converse bound is derived for the stationary memoryless case using the method of types. The analysis is different from the previous proof for the $\gamma = 1$ (total variation) special case and, in particular, we need to introduce a slightly new definition of conditional typicality [18] for this proof.
We remark that while the conventional resolvability theory ($\gamma = 1$) also considers
the amount of randomness needed to approximate the worst case output distribution,
we do not discuss the worst case counterpart for general γ in this thesis. However,
upper and lower bounds for the worst case Eγ-resolvability can be found in our journal
version [143].
7"Metric" or "distance" are used informally since, other than nonnegativity, $E_\gamma$, in general, does not satisfy any of the other three requirements for a metric.
8More precisely, by "the high-probability criterion" we mean that the approximation error satisfies a Gaussian concentration result (which ensures a doubly exponential decay in blocklength when the single-shot result is applied to the multi-letter setting).
Problem Formulation
The setup in Figure 5.3 is the same as in the original paper on channel resolvability
under total variation distance [155]. Given a random transformation $Q_{X|U}$ and a
target distribution $\pi_X$, we wish to minimize the size $M$ of a codebook $c_M = (c_m)_{m=1}^{M}$
such that when the codewords are equiprobably selected, the output distribution
\[
Q_X[c_M] := \frac{1}{M}\sum_{m=1}^{M} Q_{X|U=c_m} \tag{5.154}
\]
approximates $\pi_X$. The difference from [155] is that we use $E_\gamma$ (and other metrics) to
measure the level of the approximation.
Figure 5.3: Setup for channel resolvability: codewords $(c_m)_{m=1}^{M}$, channel $Q_{X|U}$, output $Q_X[c_M] \approx \pi_X$.
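As a toy illustration of (5.154), the following sketch computes the codebook-induced output distribution and its total variation gap to a target. The binary channel, target, and codebook here are hypothetical numbers chosen for illustration, not taken from the text.

```python
import numpy as np

def mixture_output(Q_X_given_U, codebook):
    """Output distribution Q_X[c_M] = (1/M) sum_m Q_{X|U=c_m}, cf. (5.154)."""
    return np.mean([Q_X_given_U[c] for c in codebook], axis=0)

# Hypothetical binary-input binary-output channel: row u is Q_{X|U=u}.
Q = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi_X = np.array([0.5, 0.5])            # target output distribution
codebook = [0, 1, 1, 0]                # M = 4 equiprobable codewords
Q_out = mixture_output(Q, codebook)
tv = 0.5 * np.abs(Q_out - pi_X).sum()  # total variation gap to the target
```

Here `Q_out` is exactly the equal-weight mixture of the rows selected by the codebook, so it always sums to one; the codebook design problem is to make `tv` (or the weaker $E_\gamma$ gap) small with as few codewords as possible.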
The fundamental one-shot tradeoff is the minimum M required for a prespecified
degree of approximation. One-shot bounds are general in that no structural assump-
tions on either the random transformation or the target distribution are imposed.
The corresponding asymptotic results can usually be recovered quite simply from the
one-shot bounds using, say, the law of large numbers in the memoryless case. Asymp-
totic limits are of interest because of the compactness of the expressions, and because
good one-shot bounds are not always known, especially in the converse parts.
Unless otherwise stated, we do not restrict the alphabets to be finite or countable.
All one-shot achievability bounds for $E_\gamma$-resolvability apply to general alphabets; some asymptotic converse bounds assume finite input alphabets or discrete memoryless channels (DMC)$^9$.
9A DMC is a stationary memoryless channel whose input and output alphabets are finite, which is denoted by the corresponding per-letter random transformation (such as $Q_{X|U}$) in this chapter.
Next, we define the achievable regions in the general asymptotic setting (when
sources and channels are arbitrary, see [155][171]) as well as the case of stationary
memoryless channels and memoryless outputs. Boldface letters such as $\mathbf{X}$ denote a general sequence of random variables $(X^n)_{n=1}^{\infty}$, and sans-serif letters such as $\mathsf{X}$ denote the generic distributions in i.i.d.\ settings.
Definition 5.2.7 Given a channel$^{10}$ $(Q_{X^n|U^n})_{n=1}^{\infty}$ and a sequence of target distributions $(\pi_{X^n})_{n=1}^{\infty}$, the triple $(G,R,\mathbf{X})$ is $\epsilon$-achievable ($0<\epsilon<1$) if there exist $(c_{M_n})_{n=1}^{\infty}$, where $c_{M_n} := (c_1,\dots,c_{M_n})$ and $c_m \in \mathcal{U}^n$ for each $m = 1,\dots,M_n$, and $(\gamma_n)_{n=1}^{\infty}$, so that
\begin{align}
\limsup_{n\to\infty} \frac{1}{n}\log M_n &\le R; \tag{5.155}\\
\limsup_{n\to\infty} \frac{1}{n}\log \gamma_n &\le G; \tag{5.156}\\
\sup_n E_{\gamma_n}(Q_{X^n}[c_{M_n}] \,\|\, \pi_{X^n}) &\le \epsilon. \tag{5.157}
\end{align}
Moreover, $(G,R,\mathbf{X})$ is said to be achievable if it is $\epsilon$-achievable for all $0<\epsilon<1$.
Define the asymptotic fundamental limits$^{11}$
\begin{align}
G_\epsilon(R,\mathbf{X}) &:= \min\{g \colon (g,R,\mathbf{X}) \text{ is } \epsilon\text{-achievable}\}; \tag{5.158}\\
G(R,\mathbf{X}) &:= \sup_{\epsilon>0} G_\epsilon(R,\mathbf{X}), \tag{5.159}
\end{align}
and
\begin{align}
S_\epsilon(G,\mathbf{X}) &:= \min\{r \colon (G,r,\mathbf{X}) \text{ is } \epsilon\text{-achievable}\}; \tag{5.160}\\
S(G,\mathbf{X}) &:= \sup_{\epsilon>0} S_\epsilon(G,\mathbf{X}), \tag{5.161}
\end{align}
$^{10}$In this setting a ``channel'' refers to a sequence of random transformations.
$^{11}$We can write min instead of inf in (5.158) and (5.160) since for fixed $\mathbf{X}$ the set of $(G,R)$ such that $(G,R,\mathbf{X})$ is $\epsilon$-achievable is necessarily closed.
which, in keeping with [155], we refer to as the resolvability function. In the special case of $Q_{X^n|U^n} = Q_{\mathsf{X|U}}^{\otimes n}$ and target $\pi_{X^n} = \pi_{\mathsf{X}}^{\otimes n}$, we may write the quantities in (5.159) and (5.161) as $G(R, \pi_{\mathsf{X}})$ and $S(G, \pi_{\mathsf{X}})$.
Note that by [155][176], $(0, R, \pi_{\mathsf{X}})$ is achievable if and only if $R \ge I(P_{\mathsf{U}}, Q_{\mathsf{X|U}})$ for some $P_{\mathsf{U}}$ satisfying $P_{\mathsf{U}} \to Q_{\mathsf{X|U}} \to \pi_{\mathsf{X}}$.
A useful property is that the approximation error (5.157) actually converges uniformly. A similar observation was made in the proof of [155, Lemma 6] in the context of resolvability in total variation distance.
Proposition 5.2.5 If $(G,R)$ is $\epsilon$-achievable for $(Q_{X^n|U^n})_{n=1}^{\infty}$, then there exist $(\gamma_n)_{n=1}^{\infty}$ and $(M_n)_{n=1}^{\infty}$ such that
\begin{align}
\limsup_{n\to\infty} \frac{1}{n}\log \gamma_n &\le G; \tag{5.162}\\
\limsup_{n\to\infty} \frac{1}{n}\log M_n &\le R; \tag{5.163}\\
\sup_n \sup_{Q_{U^n}} \inf_{c_{M_n}} E_{\gamma_n}(Q_{X^n}[c_{M_n}] \,\|\, Q_{X^n}) &\le \epsilon, \tag{5.164}
\end{align}
where $Q_{U^n} \to Q_{X^n|U^n} \to Q_{X^n}$.

Proof Fix arbitrary $G' > G$ and $R' > R$. Observe that the sequence
\[
\sup_{Q_{U^n}} \inf_{c_{M_n}} E_{\exp(nG')}(Q_{X^n}[c_{M_n}] \,\|\, Q_{X^n}), \tag{5.165}
\]
where $M_n := \exp(nR')$, must be upper-bounded by $\epsilon$ for large enough $n$. For if otherwise, there would be a sequence $(Q_{U^n})_{n=1}^{\infty}$ such that the infimum in (5.165) is not upper-bounded by $\epsilon$ for large enough $n$, which is a contradiction since we can find $(c_{M_n})_{n=1}^{\infty}$ and $(\gamma_n)_{n=1}^{\infty}$ in Definition 5.2.7 such that $M_n < \exp(nR')$ and $\gamma_n \le \exp(nG')$ for $n$ large enough, and apply the monotonicity of $E_\gamma$ in $\gamma$. Finally, since $(G', R')$ can be arbitrarily close to $(G,R)$ and the $\epsilon$-achievable region is a closed set, we conclude that $(G,R)$ is $\epsilon$-achievable.
One-shot Achievability Bounds
While our goal is to bound $E_\gamma$ in the resolvability problem, we consider instead the problem of bounding the complementary relative information spectrum $F$, which, according to Proposition 5.2.4, is (up to constant factors) equivalent to bounding $E_\gamma$.
First, we study a simple special case where the target distribution matches the output distribution induced by the input distribution according to which the codewords are generated.

Theorem 5.2.6 (Softer-covering Lemma) Fix $Q_{UX} = Q_U Q_{X|U}$. For an arbitrary codebook $[c_1,\dots,c_M]$, define
\[
Q_X[c_M] := \frac{1}{M}\sum_{m=1}^{M} Q_{X|U=c_m}. \tag{5.166}
\]
Then for any $\gamma, \epsilon, \sigma > 0$ satisfying $\gamma - 1 > \epsilon + \sigma$ and $\tau \in \mathbb{R}$,
\begin{align}
\mathbb{E}[F_\gamma(Q_X[U^M] \,\|\, Q_X)] &\le \mathbb{P}[\imath_{U;X}(U;X) > \log M\sigma] \nonumber\\
&\quad + \frac{1}{\epsilon}\,\mathbb{P}[\imath_{U;X}(U;X) > \log M - \tau] \nonumber\\
&\quad + \frac{\exp(-\tau)}{(\gamma - 1 - \epsilon - \sigma)^2} \tag{5.167}
\end{align}
where $U^M \sim Q_U \times \dots \times Q_U$, $(U,X) \sim Q_U Q_{X|U}$, and the information density
\[
\imath_{U;X}(u;x) := \log\frac{\mathrm{d}Q_{X|U=u}}{\mathrm{d}Q_X}(x). \tag{5.168}
\]
Remark 5.2.3 Theorem 5.2.6 implies (the general asymptotic version of) the soft-covering lemma based on total variation (see [155],[171],[159]). Indeed if $M_n = \exp(nR)$ at blocklength $n$ where $R > \bar{I}(\mathbf{U};\mathbf{X})$, we can select $\tau_n \leftarrow \frac{n}{2}(R - \bar{I}(\mathbf{U};\mathbf{X}))$. Moreover for any $\gamma > 1$ we can pick constants $\epsilon, \sigma > 0$ such that $\gamma - 1 > \epsilon + \sigma$ in the theorem, to show that
\[
\lim_{n\to\infty} \mathbb{E}[F_\gamma(Q_{X^n}[U^{M_n}] \,\|\, Q_{X^n})] = 0, \tag{5.169}
\]
which, by (5.95) and by taking $\gamma \downarrow 1$, implies that
\[
\lim_{n\to\infty} \mathbb{E}|Q_{X^n}[U^{M_n}] - Q_{X^n}| = 0. \tag{5.170}
\]
We refer to Theorem 5.2.6 as the softer-covering lemma since for a larger $\gamma$ it allows us to use a smaller codebook to cover the output distribution more softly (i.e.\ approximate the target distribution under a weaker metric).
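The covering effect described in Remark 5.2.3 can be observed numerically even in one shot. The snippet below uses a hypothetical two-letter channel (all numbers are illustrative, not from the text) and estimates $\mathbb{E}|Q_X[U^M] - Q_X|$ over random codebooks, which shrinks as $M$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
Q_U = np.array([0.5, 0.5])            # codeword-generating distribution
Q = np.array([[0.9, 0.1],
              [0.2, 0.8]])            # hypothetical channel Q_{X|U}
Q_X = Q_U @ Q                         # matched target output distribution

def mean_tv(M, trials=1000):
    """Monte Carlo estimate of E|Q_X[U^M] - Q_X| over random codebooks."""
    total = 0.0
    for _ in range(trials):
        cb = rng.choice(2, size=M, p=Q_U)       # i.i.d. codewords ~ Q_U
        total += 0.5 * np.abs(Q[cb].mean(axis=0) - Q_X).sum()
    return total / trials
```

On this example the average gap decays roughly like $1/\sqrt{M}$, consistent with the intuition that more codewords cover the output distribution more finely.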
Proof Define the ``atypical'' set
\[
\mathcal{A}_\tau := \{(u,x) \colon \imath_{U;X}(u;x) \le \log M - \tau\}. \tag{5.171}
\]
Now, let $X^M$ be such that $(U^M, X^M) \sim Q_{UX} \times \dots \times Q_{UX}$. The joint distribution of $(U^M, X)$ is specified by letting $X \sim Q_X[c_M]$ conditioned on $U^M = c_M$. We perform a change-of-measure step using the symmetry of the random codebook:
\begin{align}
\mathbb{E}[F_\gamma(Q_X[U^M] \,\|\, Q_X)] &= \mathbb{P}\left[\frac{\mathrm{d}Q_X[U^M]}{\mathrm{d}Q_X}(X) > \gamma\right] \tag{5.172}\\
&= \frac{1}{M}\sum_m \mathbb{P}\left[\frac{\mathrm{d}Q_X[U^M]}{\mathrm{d}Q_X}(X_m) > \gamma\right] \tag{5.173}\\
&= \mathbb{P}\left[\frac{\mathrm{d}Q_X[U^M]}{\mathrm{d}Q_X}(X_1) > \gamma\right] \tag{5.174}
\end{align}
where (5.173) is because of (5.166), and (5.174) is because the summands in (5.173) are equal. Note that $X_1$ is correlated with only the first codeword $U_1$.
Next, because of the relation
\[
\frac{\mathrm{d}Q_X[c_M]}{\mathrm{d}Q_X}(x) = \frac{1}{M}\sum_m \exp(\imath_{U;X}(c_m;x)), \tag{5.175}
\]
we can upper-bound (5.174) by the union bound as
\begin{align}
&\mathbb{P}[\exp(\imath_{U;X}(U_1;X_1)) > M\sigma] \nonumber\\
&\quad + \mathbb{P}\left[\frac{1}{M}\sum_{m=2}^{M} \exp(\imath_{U;X}(U_m;X_1))\, 1_{\mathcal{A}_\tau^c}(U_m,X_1) > \epsilon\right] \nonumber\\
&\quad + \mathbb{P}\left[\frac{1}{M}\sum_{m=2}^{M} \exp(\imath_{U;X}(U_m;X_1))\, 1_{\mathcal{A}_\tau}(U_m,X_1) > \gamma - \epsilon - \sigma\right] \tag{5.176}
\end{align}
where we used the fact that $1_{\mathcal{A}_\tau} + 1_{\mathcal{A}_\tau^c} = 1$. Notice that the first term of (5.176) may be regarded as the probability of ``atypical'' events and accounts for the first term in (5.167). The second term of (5.176) can be upper-bounded with Markov's inequality:
\begin{align}
\frac{1}{M\epsilon}\sum_{m=2}^{M} \mathbb{E}[\exp(\imath_{U;X}(U_m;X_1))\, 1_{\mathcal{A}_\tau^c}(U_m,X_1)]
&\le \frac{1}{M\epsilon}\sum_{m=2}^{M} \mathbb{E}[1_{\mathcal{A}_\tau^c}(U,X)] \tag{5.177}\\
&\le \frac{1}{\epsilon}\,\mathbb{P}[\imath_{U;X}(U;X) > \log M - \tau], \tag{5.178}
\end{align}
accounting for the second term in (5.167), where (5.177) is a change-of-measure step using the fact that $(U_m, X_1) \sim Q_U \times Q_X$ for $m \ge 2$.
Finally we take care of the last term in (5.176), again using the independence of $U_m$ and $X_1$ for $m \ge 2$. Observe that for any $x \in \mathcal{X}$,
\begin{align}
\mu := \mathbb{E}\left[\frac{1}{M}\sum_{m=2}^{M} \exp(\imath_{U;X}(U_m;x))\, 1_{\mathcal{A}_\tau}(U_m,x)\right]
&\le \frac{1}{M}\sum_{m=2}^{M} \mathbb{E}[\exp(\imath_{U;X}(U_m;x))] \tag{5.179--5.180}\\
&= \frac{M-1}{M} \tag{5.181}\\
&\le 1, \tag{5.182}
\end{align}
whereas
\begin{align}
\mathrm{Var}\left(\frac{1}{M}\sum_{m=2}^{M} \exp(\imath_{U;X}(U_m;x))\, 1_{\mathcal{A}_\tau}(U_m,x)\right)
&= \frac{1}{M^2}\sum_{m=2}^{M} \mathrm{Var}\left(\exp(\imath_{U;X}(U_m;x))\, 1_{\mathcal{A}_\tau}(U_m,x)\right) \tag{5.183}\\
&\le \frac{1}{M}\,\mathrm{Var}\left(\exp(\imath_{U;X}(U;x))\, 1_{\mathcal{A}_\tau}(U,x)\right) \tag{5.184}\\
&\le \frac{1}{M}\,\mathbb{E}\left[\exp(2\imath_{U;X}(U;x))\, 1_{\mathcal{A}_\tau}(U,x)\right] \tag{5.185}\\
&\le \exp(-\tau)\,\mathbb{E}\left[\exp(\imath_{U;X}(U;x))\right] \tag{5.186}\\
&= \exp(-\tau), \tag{5.187}
\end{align}
where the change-of-measure step (5.186) uses (5.171). It then follows from Chebyshev's inequality that
\begin{align}
&\mathbb{P}\left[\frac{1}{M}\sum_{m=2}^{M} \exp(\imath_{U;X}(U_m;x))\, 1_{\mathcal{A}_\tau}(U_m,x) > \gamma - \epsilon - \sigma\right] \tag{5.188}\\
&\le \mathbb{P}\left[\frac{1}{M}\sum_{m=2}^{M} \exp(\imath_{U;X}(U_m;x))\, 1_{\mathcal{A}_\tau}(U_m,x) - \mu > \gamma - \epsilon - \sigma - 1\right] \tag{5.189}\\
&\le \frac{\exp(-\tau)}{(\gamma - \epsilon - \sigma - 1)^2}. \tag{5.190}
\end{align}
For the asymptotic analysis, we are interested in the regime where $M$ and $\gamma$ are growing exponentially. In this case, the right-hand side of (5.167) can be regarded as essentially
\[
\mathbb{P}[\imath_{U;X}(U;X) > \log M\gamma] \tag{5.191}
\]
modulo nuisance parameters. This can be seen from the choice of parameters in Corollary 5.2.8. Thus the sum rate of $M$ and $\gamma$ has to exceed the sup-information rate in order for the approximation error to vanish asymptotically.

Extending Theorem 5.2.6 to the more general scenario where the target distribution may not have any relation with the input distribution, we have the following result, where we allow $\pi_X$ to be an arbitrary positive measure.

Theorem 5.2.7 (Softer-covering Lemma: Unmatched Target Distribution)
Fix $\pi_X$ and $Q_{UX} = Q_U Q_{X|U}$. For an arbitrary codebook $[c_1,\dots,c_M]$, define $Q_X[c_M]$ as in (5.166). Then for any $\gamma, \epsilon, \sigma > 0$ satisfying $\gamma > \epsilon + \sigma$, $\tau \in \mathbb{R}$ and $0 < \delta < 1$,
\begin{align}
&\mathbb{E}[F_\gamma(Q_X[U^M] \,\|\, \pi_X)] \nonumber\\
&\le \mathbb{P}[\imath_{Q_X\|\pi_X}(X) > \log(\gamma - \sigma - \epsilon) \text{ or } \imath_{U;X}(U;X) + \imath_{Q_X\|\pi_X}(X) > \log \delta M\sigma] \nonumber\\
&\quad + \frac{\gamma - \epsilon - \sigma}{\epsilon}\,\mathbb{P}[\imath_{U;X}(U;X) > \log M - \tau] \nonumber\\
&\quad + \frac{\exp(-\tau)(\gamma - \epsilon - \sigma)^2}{(1-\delta)^2\sigma^2} \tag{5.192}
\end{align}
where $U^M \sim Q_U \times \dots \times Q_U$ and $(U,X) \sim Q_U Q_{X|U}$.
Proof Sketch Similarly to the proof of Theorem 5.2.6, we first use a symmetry argument and a change-of-measure step so that the random variable of the channel output is correlated only with the first codeword, to obtain
\[
\mathbb{E}[F_\gamma(Q_X[U^M] \,\|\, \pi_X)] \le \mathbb{P}\left[\frac{\mathrm{d}Q_X[U^M]}{\mathrm{d}\pi_X}(X_1) > \gamma\right]. \tag{5.193}
\]
Then in the next union bound step we have to take care of another ``atypical'' event, that $\frac{\mathrm{d}Q_X}{\mathrm{d}\pi_X}(X_1) > \gamma_2$, where
\[
\gamma_2 := \gamma - \epsilon - \sigma. \tag{5.194}
\]
More precisely, we have
\begin{align}
\mathbb{P}\left[\frac{\mathrm{d}Q_X[U^M]}{\mathrm{d}\pi_X}(X_1) > \gamma\right]
&\le \mathbb{P}[\xi(X_1) > \gamma_2 \text{ or } \eta(U_1, X_1) > \delta M\sigma] \nonumber\\
&\quad + \mathbb{P}\left[\frac{1}{M}\sum_{m=2}^{M} \eta(U_m, X_1)\, 1_{\mathcal{A}_\tau^c}(U_m, X_1) > \epsilon,\ \xi(X_1) \le \gamma_2\right] \nonumber\\
&\quad + \mathbb{P}\left[\frac{1}{M}\sum_{m=2}^{M} \eta(U_m, X_1)\, 1_{\mathcal{A}_\tau}(U_m, X_1) > \gamma - \epsilon - \delta\sigma,\ \xi(X_1) \le \gamma_2\right] \tag{5.195}
\end{align}
where we have defined
\begin{align}
\eta(u,x) &:= \frac{\mathrm{d}Q_{X|U=u}}{\mathrm{d}\pi_X}(x); \tag{5.196}\\
\xi(x) &:= \frac{\mathrm{d}Q_X}{\mathrm{d}\pi_X}(x). \tag{5.197}
\end{align}
As before, the first term of (5.195) may be regarded as the probability of ``atypical'' events and accounts for the first term in (5.192). The second and the third terms of (5.195) can be upper-bounded by
\begin{align}
\mathbb{P}\left[\frac{1}{M}\sum_{m=2}^{M} \frac{\eta(U_m, X_1)}{\xi(X_1)}\, 1_{\mathcal{A}_\tau^c}(U_m, X_1) > \frac{\epsilon}{\gamma_2}\right]
\le \mathbb{P}\left[\frac{1}{M}\sum_{m=2}^{M} \exp(\imath_{U;X}(U_m;X_1))\, 1_{\mathcal{A}_\tau^c}(U_m, X_1) > \frac{\epsilon}{\gamma_2}\right] \tag{5.198}
\end{align}
and
\begin{align}
\mathbb{P}\left[\frac{1}{M}\sum_{m=2}^{M} \frac{\eta(U_m, X_1)}{\xi(X_1)}\, 1_{\mathcal{A}_\tau}(U_m, X_1) > \frac{\gamma - \epsilon - \delta\sigma}{\gamma_2}\right]
\le \mathbb{P}\left[\frac{1}{M}\sum_{m=2}^{M} \exp(\imath_{U;X}(U_m;X_1))\, 1_{\mathcal{A}_\tau}(U_m, X_1) > 1 + \frac{(1-\delta)\sigma}{\gamma_2}\right]. \tag{5.199}
\end{align}
The rest of the proof is similar to that of Theorem 5.2.6 and is omitted.
For the purpose of asymptotic analysis in the stationary memoryless setting, the right-hand side of (5.192) can be regarded as essentially
\[
\mathbb{P}\left[\imath_{Q_X\|\pi_X}(X) > \log\gamma\right] + \mathbb{P}\left[\imath_{U;X}(U;X) + \imath_{Q_X\|\pi_X}(X) > \log M\gamma\right] \tag{5.200}
\]
modulo nuisance parameters.
Remark 5.2.4 By setting $\tau \leftarrow +\infty$ and letting $\delta \uparrow 1$, the bound in Theorem 5.2.7 can be weakened to the following slightly simpler form:
\begin{align}
\mathbb{E}[F_\gamma(Q_X[U^M] \,\|\, \pi_X)] \le \mathbb{P}\left[\frac{\mathrm{d}Q_X}{\mathrm{d}\pi_X}(X) > \gamma - \sigma - \epsilon\right] + \mathbb{P}\left[\frac{\mathrm{d}Q_{X|U}}{\mathrm{d}\pi_X}(X|U) \ge M\sigma\right] + \frac{\gamma - \sigma - \epsilon}{\epsilon}. \tag{5.201}
\end{align}
In fact, assuming $\tau = +\infty$ we can simplify the proof of the theorem and strengthen (5.201) to
\begin{align}
\mathbb{E}[F_\gamma(Q_X[U^M] \,\|\, \pi_X)] \le \mathbb{P}\left[\frac{\mathrm{d}Q_X}{\mathrm{d}\pi_X}(X) > \gamma_2\right] + \mathbb{P}\left[\frac{\mathrm{d}Q_{X|U}}{\mathrm{d}\pi_X}(X|U) > M(\gamma - \epsilon)\right] + \frac{\gamma_2}{\epsilon} \tag{5.202}
\end{align}
for any $\gamma_2 > 0$ and $0 < \epsilon < \gamma$. As we show in Corollary 5.2.8, the weakened bounds (5.201) and (5.202) are still asymptotically tight provided that the exponent with which the threshold $\gamma$ grows is strictly positive. However, when the exponent is zero (corresponding to the total variation case), we do need $\tau$ in the bound for asymptotic tightness.
Remark 5.2.5 In the case of $Q_X = \pi_X$, we can set $\gamma_2 \leftarrow 1$ and $\epsilon \leftarrow \frac{\gamma}{2}$ in (5.202) to obtain the simplification
\[
\mathbb{E}[F_\gamma(Q_X[U^M] \,\|\, Q_X)] \le \mathbb{P}\left[\imath_{U;X}(U;X) > \log\frac{M\gamma}{2}\right] + \frac{2}{\gamma}. \tag{5.203}
\]
The general one-shot achievability bound in Theorem 5.2.7 implies the following asymptotic result, which we will later prove to be tight.

Corollary 5.2.8 Fix per-letter distributions $\pi_{\mathsf{X}}$ on $\mathcal{X}$ and $Q_{\mathsf{UX}} = Q_{\mathsf{U}} Q_{\mathsf{X|U}}$ on $\mathcal{U}\times\mathcal{X}$, and $E, R \in (0,\infty)$. For each $n$, define $\gamma_n := \exp(nE)$ and $M_n = \lfloor\exp(nR)\rfloor$; let $U^{M_n} = (U_1,\dots,U_{M_n})$ have independent coordinates each distributed according to $Q_{\mathsf{U}}^{\otimes n}$. Given any $c_{M_n} = (c_m)_{m=1}^{M_n}$, where each $c_m \in \mathcal{U}^n$, define$^{12}$
\[
Q_X[c_{M_n}] := \frac{1}{M_n}\sum_{m=1}^{M_n} Q_{\mathsf{X|U}}^{\otimes n}(\cdot|c_m). \tag{5.204}
\]
Then
\begin{align}
\lim_{n\to\infty} \mathbb{E}[E_{\gamma_n}(Q_X[U^{M_n}] \,\|\, \pi_{\mathsf{X}}^{\otimes n})] &= \lim_{n\to\infty} \mathbb{E}[F_{\gamma_n}(Q_X[U^{M_n}] \,\|\, \pi_{\mathsf{X}}^{\otimes n})] \tag{5.205}\\
&= 0 \tag{5.206}
\end{align}
provided that
\[
E > D(Q_{\mathsf{X}}\|\pi_{\mathsf{X}}) + [I(Q_{\mathsf{U}}, Q_{\mathsf{X|U}}) - R]^+. \tag{5.207}
\]

Proof Choose $E'$ such that
\[
E > E' > D(Q_{\mathsf{X}}\|\pi_{\mathsf{X}}) + [I(Q_{\mathsf{U}}, Q_{\mathsf{X|U}}) - R]^+. \tag{5.208}
\]
Set $\delta = \frac{1}{2}$, $\epsilon_n = \exp(nE) - \exp(nE')$ and $\sigma_n = \frac{1}{2}(\gamma_n - \epsilon_n) = \frac{1}{2}\exp(nE')$, and apply (5.201). Notice that
\[
\mathbb{E}\left[\log\frac{\mathrm{d}Q_{X|U}}{\mathrm{d}\pi_X}(X|U)\right] = n[I(Q_{\mathsf{U}}, Q_{\mathsf{X|U}}) + D(Q_{\mathsf{X}}\|\pi_{\mathsf{X}})] \tag{5.209}
\]
where $(X,U) \sim Q_{\mathsf{XU}}^{\otimes n}$, $Q_{X|U} := Q_{\mathsf{X|U}}^{\otimes n}$, and $\pi_X := \pi_{\mathsf{X}}^{\otimes n}$. By the law of large numbers, the first and second terms in (5.201) vanish because
\begin{align}
D(Q_{\mathsf{X}}\|\pi_{\mathsf{X}}) &< E'; \tag{5.210}\\
I(Q_{\mathsf{U}}, Q_{\mathsf{X|U}}) + D(Q_{\mathsf{X}}\|\pi_{\mathsf{X}}) &< E' + R \tag{5.211}
\end{align}
are satisfied.

$^{12}$We define $Q_{\mathsf{X|U}}^{\otimes n}$ by $Q_{\mathsf{X|U}}^{\otimes n}(\cdot|u^n) := \prod_{i=1}^{n} Q_{\mathsf{X|U}=u_i}$ for any $u^n$. Also note that in this chapter we differentiate per-letter symbols such as $\mathsf{U}$ from one-shot/block symbols such as $U$ (so that $U = U^n$ in this corollary).
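The threshold on the right-hand side of (5.207) is a simple finite-dimensional quantity. The sketch below evaluates it for a small hypothetical channel; the particular distributions are made up for illustration and are not from the text.

```python
import numpy as np

def kl(p, q):
    """Relative entropy D(p||q) in nats for finite distributions with p << q."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

Q_U = np.array([0.5, 0.5])
Q = np.array([[0.9, 0.1],
              [0.2, 0.8]])            # hypothetical per-letter Q_{X|U}
pi_X = np.array([0.7, 0.3])           # hypothetical per-letter target
Q_X = Q_U @ Q                          # induced output distribution
I = sum(Q_U[u] * kl(Q[u], Q_X) for u in range(2))   # I(Q_U, Q_{X|U})

def E_threshold(R):
    """Right-hand side of (5.207): the minimum exponent of gamma_n at rate R."""
    return kl(Q_X, pi_X) + max(I - R, 0.0)
```

For rates above the mutual information the threshold saturates at $D(Q_{\mathsf X}\|\pi_{\mathsf X})$, reflecting that the residual exponent is purely the mismatch between the induced output and the target.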
The $Q_{\mathsf{U}}$ that minimizes the right-hand side of (5.207) generally does not satisfy $Q_{\mathsf{U}} \to Q_{\mathsf{X|U}} \to \pi_{\mathsf{X}}$. This means that in the large deviation analysis, for the best approximation of a target distribution in $E_\gamma$, we generally should not generate the codewords according to a distribution that corresponds to the target through the channel. This is a remarkable distinction from approximation in total variation distance, in which case an unmatched input distribution would result in the maximal total variation distance asymptotically. However, if we stick to matching input codeword distributions, then a simple and general asymptotic achievability bound can be obtained. Recall that [155] defined the sup-information rate
\[
\bar{I}(\mathbf{U};\mathbf{X}) := \inf\left\{R \colon \lim_{n\to\infty} \mathbb{P}\left[\frac{1}{n}\imath_{U^n;X^n}(U^n;X^n) > R\right] = 0\right\} \tag{5.212}
\]
and the inf-information rate
\[
\underline{I}(\mathbf{U};\mathbf{X}) := \sup\left\{R \colon \lim_{n\to\infty} \mathbb{P}\left[\frac{1}{n}\imath_{U^n;X^n}(U^n;X^n) < R\right] = 0\right\}. \tag{5.213}
\]
Theorem 5.2.9 For any channel $\mathbf{W} = (Q_{X^n|U^n})_{n=1}^{\infty}$, sequence of inputs $\mathbf{U} = (U^n)_{n=1}^{\infty}$, and $G > 0$, we have
\[
S(G,\mathbf{X}) \le \bar{I}(\mathbf{U};\mathbf{X}) - G \tag{5.214}
\]
where $\mathbf{X}$ is the output of $\mathbf{U}$ through the channel $\mathbf{W}$. As a consequence,
\[
S(G) \le \sup_{\mathbf{U}} \bar{I}(\mathbf{U};\mathbf{X}) - G. \tag{5.215}
\]
For channels satisfying the strong converse property, the right-hand side of (5.215) can be related to the channel capacity because of the relations [155][179]
\[
\sup_{\mathbf{U}} \underline{I}(\mathbf{U};\mathbf{X}) = \sup_{\mathbf{U}} \bar{I}(\mathbf{U};\mathbf{X}) = C(\mathbf{W}). \tag{5.216}
\]
We conclude the subsection by remarking that had we used the soft-covering lemma (see Remark 5.2.3) to bound the total variation distance and, in turn, bounded the complementary relative information spectrum with the total variation distance, we would not have obtained Theorem 5.2.9. Indeed, consider $M_n = \exp(nR)$ and let $V_1,\dots,V_{M_n}$ be i.i.d.\ according to $Q_{U^n}$. Regardless of how fast $\gamma_n$ grows, we cannot conclude that
\[
\mathbb{E}[F_{\gamma_n}(Q_{X^n}[V^{M_n}] \,\|\, \pi_{X^n})] \tag{5.217}
\]
vanishes unless we show that $\mathbb{E}|Q_{X^n}[V^{M_n}] - \pi_{X^n}|$ vanishes (the general and tight bounds between $E_\gamma$ and the total variation distance are characterized in [143]). Since the total variation distance vanishes only when $R > \bar{I}(\mathbf{U};\mathbf{X})$ by the conventional resolvability theorem, the conventional resolvability theorem only gives the upper bound $\bar{I}(\mathbf{U};\mathbf{X})$, which is looser than Theorem 5.2.9 when $G > 0$.
Tail Bound of Approximation Error for Random Codebooks
For applications such as secrecy and channel synthesis, it is sometimes desirable
to prove that the approximation error vanishes not only in expectation (e.g. Theo-
rem 5.2.7), but also with high probability (see Footnote 8), in the case of a random
codebook [168][177][178]. If the probability that the approximation error exceeds an
arbitrary positive number vanishes doubly exponentially in the blocklength, then the
analyses in these applications carry through because a union bound argument can be
applied to exponentially many events. Previous proofs (e.g.\ [168]) based on carefully applying Chernoff bounds to each $Q_X[U^M](x) - Q_X(x)$ and then taking the union bound over $x$ require finiteness of the alphabets.

Here we adopt a different approach. Using concentration inequalities we can directly bound the probability that the error $E_\gamma(Q_X[U^M] \,\|\, Q_X)$ deviates from its expectation, without any restrictions on the alphabet; in fact, the bound depends only on the number of codewords. Therefore, if the rate is high enough for the approximation error to vanish in expectation (by Theorem 5.2.7), we can also conclude that the error vanishes with high probability. The crux of the matter is thus resolved by the following one-shot result:

Theorem 5.2.10 Fix $\pi_X$ and $Q_{UX} = Q_U Q_{X|U}$. For an arbitrary codebook $[c_1,\dots,c_M]$, define $Q_X[c_M]$ as in (5.166). Then, for any $r > 0$,
\[
\mathbb{P}[E_\gamma(Q_X[U^M] \,\|\, \pi_X) - \mathbb{E}[E_\gamma(Q_X[U^M] \,\|\, \pi_X)] > r] \le \exp(-2Mr^2) \tag{5.218}
\]
where the probability and the expectation are with respect to $U^M \sim Q_U \times \dots \times Q_U$.

Proof Consider $f \colon c_M \mapsto E_\gamma(Q_X[c_M] \,\|\, \pi_X)$. By the definition (5.166) and the triangle inequality (5.90), we have the following uniform bound on the discrete derivative:
\[
\sup_{c, c' \in \mathcal{U}} |f(c_1^{i-1}, c, c_{i+1}^{M}) - f(c_1^{i-1}, c', c_{i+1}^{M})| \le \frac{1}{M}, \quad \forall i,\ c_M. \tag{5.219}
\]
The result then follows by McDiarmid's inequality (see e.g.\ [48, Theorem 2.2.3]).

Remark 5.2.6 If we are interested in bounding both the upper and the lower tails, then the right side of (5.218) gains a factor of 2. Other concentration inequalities may also be applied here; the transportation method gives the same bound in this example.
Converse for Stationary Memoryless Outputs
We now establish an asymptotic converse matching Corollary 5.2.8 for the $E_\gamma$-resolvability rate for stationary memoryless outputs and discrete memoryless channels.

Theorem 5.2.11 (Resolvability for Stationary Memoryless Outputs) For a DMC $Q_{\mathsf{X|U}}$ and a nonnegative finite measure $\pi_{\mathsf{X}}$,
\[
G_\epsilon(R, \pi_{\mathsf{X}}) = \min_{Q_{\mathsf{U}}}\left\{D(Q_{\mathsf{X}}\|\pi_{\mathsf{X}}) + [I(Q_{\mathsf{U}}, Q_{\mathsf{X|U}}) - R]^+\right\} \tag{5.220}
\]
where $Q_{\mathsf{U}} \to Q_{\mathsf{X|U}} \to Q_{\mathsf{X}}$, for any $0 < \epsilon < 1$.

Remark 5.2.7 When resolvability was introduced in [155], the resolvability rate (under total variation distance) was formulated for outputs of stationary memoryless inputs, rather than all the tensor power distributions on the output alphabet, because otherwise there is no guarantee that the output process can be approximated under total variation distance even with an arbitrarily large codebook. Here we can extend the scope because all stationary memoryless distributions on the output alphabet (satisfying the mild condition of being absolutely continuous with respect to some output) can be approximated under $E_\gamma$ as long as $\gamma$ is sufficiently large.

The achievability part of Theorem 5.2.11 is already shown in Corollary 5.2.8. For the converse, we need a notion of conditional typicality specially tailored to our problem, which differs from the definitions of conditional typicality in [25] or [180] (see also [18]) and can be viewed as intermediate between those two definitions.

Definition 5.2.8 (Moderate Conditional Typicality) The $\delta$-typical set of $u^n \in \mathcal{U}^n$ with respect to the discrete memoryless channel with per-letter conditional distribution $Q_{\mathsf{X|U}}$ is defined as
\[
\mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(u^n) := \{x^n \colon \forall (a,b),\ |\hat{P}_{u^n x^n}(a,b) - Q_{\mathsf{X|U}}(b|a)\hat{P}_{u^n}(a)| \le \delta Q_{\mathsf{X|U}}(b|a)\} \tag{5.221}
\]
where $\hat{P}_{u^n x^n}$ denotes the empirical distribution of $(u^n, x^n)$.

Remark 5.2.8 Definition 5.2.8 is of broad interest because of Lemma 5.2.12 and Lemma 5.2.13 ahead, and in particular it plays an important role in obtaining the uniform bound in Lemma 5.2.12. Note that the definition in [25] corresponds to replacing the term $\delta Q_{\mathsf{X|U}}(b|a)$ in (5.221) with $\delta$, in which case we cannot bound the probability of a sequence in the typical set as in Lemma 5.2.13. The ``robust typicality'' definition of [180] (see also [18]) corresponds to replacing this term with $\delta Q_{\mathsf{X|U}}(b|a)\hat{P}_{u^n}(a)$, which does not give the uniform lower bound on the probability of the conditional typical set as in Lemma 5.2.12.
Lemma 5.2.12 For fixed $\delta > 0$ and $Q_{\mathsf{X|U}}$, there exists a sequence $(\gamma_n)$ such that $\lim_{n\to\infty}\gamma_n = 0$ and
\[
Q_{\mathsf{X|U}}^{\otimes n}(\mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(u^n)|u^n) \ge 1 - \gamma_n \tag{5.222}
\]
for all $u^n \in \mathcal{U}^n$.

Proof We show that the statement holds with
\[
\gamma_n = \frac{|\mathcal{U}||\mathcal{X}|}{n\delta^2}\left(\frac{1}{q} - 1\right), \tag{5.223}
\]
where
\[
q := \min_{(a,b)\colon Q_{\mathsf{X|U}}(b|a)\neq 0} Q_{\mathsf{X|U}}(b|a). \tag{5.224}
\]
The number of occurrences $N(a,b|u^n,X^n)$ of $(a,b) \in \mathcal{U}\times\mathcal{X}$ in $(u^n, X^n)$, where $X^n \sim \prod_{i=1}^{n} Q_{\mathsf{X|U}=u_i}$, is binomial with mean $N(a|u^n)Q_{\mathsf{X|U}}(b|a)$ and variance $N(a|u^n)Q_{\mathsf{X|U}}(b|a)(1 - Q_{\mathsf{X|U}}(b|a))$. If $Q_{\mathsf{X|U}}(b|a) = 0$ the condition that defines the set in (5.221) is automatically true. Otherwise, by Chebyshev's inequality we have for each $(a,b)$,
\begin{align}
\mathbb{P}[|N(a,b|u^n,X^n) - N(a|u^n)Q_{\mathsf{X|U}}(b|a)| > n\delta Q_{\mathsf{X|U}}(b|a)]
&\le \frac{N(a|u^n)Q_{\mathsf{X|U}}(b|a)(1 - Q_{\mathsf{X|U}}(b|a))}{n^2\delta^2 Q^2_{\mathsf{X|U}}(b|a)} \tag{5.225}\\
&\le \frac{1}{n\delta^2}\left(\frac{1}{q} - 1\right), \tag{5.226}
\end{align}
and the claim follows by taking the union bound over the $|\mathcal{U}||\mathcal{X}|$ pairs $(a,b)$.
Lemma 5.2.13 For each $u^n$, and $x^n \in \mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(u^n)$, we have the bound
\[
Q_{\mathsf{X|U}}^{\otimes n}(x^n|u^n) \ge \exp(-n[H(Q_{\mathsf{X|U}}|\hat{P}_{u^n}) + \delta|\mathcal{U}|\log|\mathcal{X}|]). \tag{5.227}
\]

Proof Since
\[
Q_{\mathsf{X|U}}^{\otimes n}(x^n|u^n) = \prod_{a\in\mathcal{U},\,b\in\mathcal{X}} Q_{\mathsf{X|U}}(b|a)^{N(a,b|u^n,x^n)}, \tag{5.228}
\]
we have
\begin{align}
\frac{1}{n}\log\frac{1}{Q_{\mathsf{X|U}}^{\otimes n}(x^n|u^n)}
&= \sum_{a,b}\hat{P}_{u^n x^n}(a,b)\log\frac{1}{Q_{\mathsf{X|U}}(b|a)} \tag{5.229}\\
&\le \sum_{a,b}\left[Q_{\mathsf{X|U}}(b|a)\hat{P}_{u^n}(a) + \delta Q_{\mathsf{X|U}}(b|a)\right]\log\frac{1}{Q_{\mathsf{X|U}}(b|a)} \tag{5.230}\\
&= H(Q_{\mathsf{X|U}}|\hat{P}_{u^n}) + \delta\sum_a H(Q_{\mathsf{X|U}=a}) \tag{5.231}\\
&\le H(Q_{\mathsf{X|U}}|\hat{P}_{u^n}) + \delta|\mathcal{U}|\log|\mathcal{X}|. \tag{5.232}
\end{align}
Lemma 5.2.14 For any type $P_{\mathsf{X}}$ and sequence $u^n$,
\[
|\mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(u^n) \cap \mathcal{T}_{P_{\mathsf{X}}}| \le \exp(n[H_{[P_{\mathsf{X}}]\delta} + \delta|\mathcal{U}|\log|\mathcal{X}|]) \tag{5.233}
\]
where we have defined
\[
H_{[P_{\mathsf{X}}]\delta} := \max_{Q_{\mathsf{U}}\colon |Q_{\mathsf{X}} - P_{\mathsf{X}}| \le \delta|\mathcal{U}|} H(Q_{\mathsf{X|U}}|Q_{\mathsf{U}}) \tag{5.234}
\]
where $Q_{\mathsf{U}} \to Q_{\mathsf{X|U}} \to Q_{\mathsf{X}}$, and the maximum in (5.234) is understood as $-\infty$ if the set $\{Q_{\mathsf{U}} \colon |Q_{\mathsf{X}} - P_{\mathsf{X}}| \le \delta|\mathcal{U}|\}$ is empty.

Proof For any $u^n$, we have the upper bound
\begin{align}
|\mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(u^n) \cap \mathcal{T}_{P_{\mathsf{X}}}|
&\le |\mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(u^n)| \tag{5.235}\\
&\le \left(\min_{x^n \in \mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(u^n)} Q_{\mathsf{X|U}}^{\otimes n}(x^n|u^n)\right)^{-1} \tag{5.236}\\
&\le \exp(n[H(Q_{\mathsf{X|U}}|\hat{P}_{u^n}) + \delta|\mathcal{U}|\log|\mathcal{X}|]) \tag{5.237}
\end{align}
where we used Lemma 5.2.13 in (5.237). Moreover, if $u^n$ satisfies $|P_{\mathsf{X}} - Q_{\mathsf{X|U}}\circ\hat{P}_{u^n}| > \delta|\mathcal{U}|$, where $Q_{\mathsf{X|U}}\circ\hat{P}_{u^n} := \int Q_{\mathsf{X|U}=a}\,\mathrm{d}\hat{P}_{u^n}(a)$, then $\mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(u^n) \cap \mathcal{T}_{P_{\mathsf{X}}}$ is empty, because
\begin{align}
|\hat{P}_{x^n} - Q_{\mathsf{X|U}}\circ\hat{P}_{u^n}|
&\le \sum_{a,b}|\hat{P}_{u^n x^n}(a,b) - Q_{\mathsf{X|U}}(b|a)\hat{P}_{u^n}(a)| \tag{5.238}\\
&\le \sum_{a,b}\delta Q_{\mathsf{X|U}}(b|a) \tag{5.239}\\
&= \delta|\mathcal{U}| \tag{5.240}
\end{align}
implies that no $x^n$ in $\mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(u^n)$ has the type $P_{\mathsf{X}}$. Therefore the desired result follows by taking the maximum of (5.237) over types $Q_{\mathsf{U}} = \hat{P}_{u^n}$ satisfying $|Q_{\mathsf{X}} - P_{\mathsf{X}}| \le \delta|\mathcal{U}|$.
Proof [Proof of Converse of Theorem 5.2.11] Fix a codebook $(c_1,\dots,c_M)$ and a type $P_{\mathsf{X}}$. Define
\[
A_n := \bigcup_{m=1}^{M} \mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(c_m). \tag{5.241}
\]
Then
\begin{align}
\pi_{\mathsf{X}}^{\otimes n}(A_n \cap \mathcal{T}_{P_{\mathsf{X}}})
&= \sum_{x^n \in A_n \cap \mathcal{T}_{P_{\mathsf{X}}}} \pi_{\mathsf{X}}^{\otimes n}(x^n) \tag{5.242}\\
&= \exp(-n[H(P_{\mathsf{X}}) + D(P_{\mathsf{X}}\|\pi_{\mathsf{X}})]) \cdot |A_n \cap \mathcal{T}_{P_{\mathsf{X}}}| \tag{5.243}\\
&\le \exp(-n[H(P_{\mathsf{X}}) + D(P_{\mathsf{X}}\|\pi_{\mathsf{X}})]) \cdot \sum_{m=1}^{M} |\mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(c_m) \cap \mathcal{T}_{P_{\mathsf{X}}}| \tag{5.244}\\
&\le \exp(-n[H(P_{\mathsf{X}}) + D(P_{\mathsf{X}}\|\pi_{\mathsf{X}})]) \cdot M\exp(n[H_{[P_{\mathsf{X}}]\delta} + \delta|\mathcal{U}|\log|\mathcal{X}|]) \tag{5.245}\\
&= \exp(-n[D(P_{\mathsf{X}}\|\pi_{\mathsf{X}}) + H(P_{\mathsf{X}}) - H_{[P_{\mathsf{X}}]\delta} - R - \delta|\mathcal{U}|\log|\mathcal{X}|]), \tag{5.246}
\end{align}
where (5.245) is from Lemma 5.2.14. Whence (5.246) and the trivial bound
\begin{align}
\pi_{\mathsf{X}}^{\otimes n}(A_n \cap \mathcal{T}_{P_{\mathsf{X}}}) &\le \pi_{\mathsf{X}}^{\otimes n}(\mathcal{T}_{P_{\mathsf{X}}}) \tag{5.247}\\
&\le \exp(-nD(P_{\mathsf{X}}\|\pi_{\mathsf{X}})) \tag{5.248}\\
&\le \exp(-n[D(P_{\mathsf{X}}\|\pi_{\mathsf{X}}) - \delta|\mathcal{U}|\log|\mathcal{X}|]) \tag{5.249}
\end{align}
yield the bound
\[
\pi_{\mathsf{X}}^{\otimes n}(A_n \cap \mathcal{T}_{P_{\mathsf{X}}}) \le \exp(-n[f(\delta, P_{\mathsf{X}}) - \delta|\mathcal{U}|\log|\mathcal{X}|]), \tag{5.250}
\]
where we have defined the function
\[
f(\delta, P_{\mathsf{X}}) := D(P_{\mathsf{X}}\|\pi_{\mathsf{X}}) + [H(P_{\mathsf{X}}) - H_{[P_{\mathsf{X}}]\delta} - R]^+ \tag{5.251}
\]
for $\delta > 0$ and $P_{\mathsf{X}} \ll \pi_{\mathsf{X}}$. Define$^{13}$
\[
g(\delta) := \min_{P_{\mathsf{X}}} f(\delta, P_{\mathsf{X}}). \tag{5.252}
\]
Then
\begin{align}
\pi_{\mathsf{X}}^{\otimes n}(A_n) &= \sum_{P_{\mathsf{X}}} \pi_{\mathsf{X}}^{\otimes n}(A_n \cap \mathcal{T}_{P_{\mathsf{X}}}) \tag{5.253}\\
&\le \sum_{P_{\mathsf{X}}} \exp(-n[g(\delta) - \delta|\mathcal{U}|\log|\mathcal{X}|]) \tag{5.254}\\
&\le (n+1)^{|\mathcal{X}|} \exp(-n[g(\delta) - \delta|\mathcal{U}|\log|\mathcal{X}|]), \tag{5.255}
\end{align}
where the summation is over all types $P_{\mathsf{X}}$ absolutely continuous with respect to $\pi_{\mathsf{X}}$, and (5.254) is from (5.250). Then for any real number $G < g(\delta) - \delta|\mathcal{U}|\log|\mathcal{X}|$ we
$^{13}$The reason why we can write minimum in (5.252) is explained in Remark 5.2.9.
have
\begin{align}
E_{\exp(nG)}(Q_{X^n}[c_M] \,\|\, \pi_{\mathsf{X}}^{\otimes n})
&\ge Q_{X^n}[c_M](A_n) - \exp(nG)\,\pi_{\mathsf{X}}^{\otimes n}(A_n) \tag{5.256}\\
&\ge \frac{1}{M}\sum_{m=1}^{M} Q_{\mathsf{X|U}}^{\otimes n}(\mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(c_m)|c_m) - \exp(nG)\,\pi_{\mathsf{X}}^{\otimes n}(A_n) \tag{5.257}\\
&\ge 1 - \gamma_n - \exp(nG)\,\pi_{\mathsf{X}}^{\otimes n}(A_n) \tag{5.258}\\
&\to 1, \quad n\to\infty, \tag{5.259}
\end{align}
where
• (5.257) uses $\mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(c_m) \subseteq A_n$, and we used the notation of the tensor power for the conditional law $Q_{\mathsf{X|U}}^{\otimes n}(\cdot|u^n) := \prod_{i=1}^{n} Q_{\mathsf{X|U}=u_i}$;
• (5.258) is from Lemma 5.2.12;
• (5.259) uses (5.255).
Since $\delta > 0$ was arbitrary, we thus conclude
\begin{align}
G_\epsilon(R, \pi_{\mathsf{X}}) &\ge \sup_{\delta>0}\{g(\delta) - \delta|\mathcal{U}|\log|\mathcal{X}|\} \tag{5.260}\\
&\ge \liminf_{\delta\to 0} g(\delta) \tag{5.261}\\
&\ge g(0), \tag{5.262}
\end{align}
where (5.262) is from Lemma 5.2.15. Since $g(0)$ is the right side of (5.220), the converse bound is established.
Lemma 5.2.15 The functions $f$ and $g$ defined in (5.251) and (5.252) are both lower semicontinuous.

Remark 5.2.9 We can write minimum instead of infimum in (5.252), and hence in (5.220), because of the lower semicontinuity of $f$.
Proof Consider the lower semicontinuous function $\chi$ where $\chi(\delta, P_{\mathsf{X}}, Q_{\mathsf{U}})$ equals
\[
D(P_{\mathsf{X}}\|\pi_{\mathsf{X}}) + [H(P_{\mathsf{X}}) - H(Q_{\mathsf{X|U}}|Q_{\mathsf{U}}) - R]^+ \tag{5.263}
\]
if $|Q_{\mathsf{X}} - P_{\mathsf{X}}| \le \delta|\mathcal{U}|$ and $+\infty$ otherwise. Then $f(\delta, P_{\mathsf{X}}) = \min_{Q_{\mathsf{U}}} \chi(\delta, P_{\mathsf{X}}, Q_{\mathsf{U}})$ is lower semicontinuous, as it is the pointwise infimum of a lower semicontinuous function over a compact set (see for example the proof in [181, Lemma 9]). The lower semicontinuity of $g$ follows for the same reason.
The function $G(R, \pi_{\mathsf{X}})$ in (5.220) satisfies some nice properties.

Proposition 5.2.16 Denote $G(R, \pi_{\mathsf{X}})$ as $G(R, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}})$. Let $\mathcal{X}$ and $\mathcal{U}$ be arbitrary alphabets (measurable spaces).

1. The function being minimized in (5.220), denoted as $F(Q_{\mathsf{U}}, R, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}})$, is convex in $Q_{\mathsf{U}}$.

2. Additivity: for any $R > 0$, $\pi_{\mathsf{X}_i}$ and $Q_{\mathsf{X}_i|\mathsf{U}_i}$ ($i = 1,2$),
\begin{align}
G(R, \pi_{\mathsf{X}_1}\pi_{\mathsf{X}_2}, Q_{\mathsf{X}_1|\mathsf{U}_1}Q_{\mathsf{X}_2|\mathsf{U}_2})
= \min_{R_1, R_2\colon R_1 + R_2 \le R}\{G(R_1, \pi_{\mathsf{X}_1}, Q_{\mathsf{X}_1|\mathsf{U}_1}) + G(R_2, \pi_{\mathsf{X}_2}, Q_{\mathsf{X}_2|\mathsf{U}_2})\}, \tag{5.264}
\end{align}
where we have abbreviated $\pi_{\mathsf{X}_1}\times\pi_{\mathsf{X}_2}$ and $Q_{\mathsf{X}_1|\mathsf{U}_1}\times Q_{\mathsf{X}_2|\mathsf{U}_2}$ as $\pi_{\mathsf{X}_1}\pi_{\mathsf{X}_2}$ and $Q_{\mathsf{X}_1|\mathsf{U}_1}Q_{\mathsf{X}_2|\mathsf{U}_2}$.

3. $G(R, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}})$ is continuous in $R$.

4. $G(R, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}})$ is convex in $R$.
Proof 1. The function of interest is the maximum of the two functions $D(Q_{\mathsf{X}}\|\pi_{\mathsf{X}})$ and
\[
D(Q_{\mathsf{X}}\|\pi_{\mathsf{X}}) + I(Q_{\mathsf{U}}, Q_{\mathsf{X|U}}) - R = \mathbb{E}\left[\imath_{Q_{\mathsf{X|U}}\|\pi_{\mathsf{X}}}(X|U)\right] - R \tag{5.265}
\]
where $(U,X) \sim Q_{\mathsf{UX}}$, $Q_{\mathsf{U}} \to Q_{\mathsf{X|U}} \to Q_{\mathsf{X}}$, and the conditional relative information
\[
\imath_{Q_{\mathsf{X|U}}\|\pi_{\mathsf{X}}}(x|u) := \log\frac{\mathrm{d}Q_{\mathsf{X|U}=u}}{\mathrm{d}\pi_{\mathsf{X}}}(x), \quad \forall u, x. \tag{5.266}
\]
The former is convex and the latter is linear in $Q_{\mathsf{U}}$, so their maximum is convex.

2. The $\le$ direction is immediate from the single-letter formula (5.220) and the inequality
\[
[a]^+ + [b]^+ \ge [a+b]^+ \tag{5.267}
\]
for any $a, b \in \mathbb{R}$. For the $\ge$ direction, suppose $Q_{\mathsf{U}_1\mathsf{U}_2}$ achieves the minimum in the single-letter formula of $G(R, \pi_{\mathsf{X}_1}\pi_{\mathsf{X}_2}, Q_{\mathsf{X}_1|\mathsf{U}_1}Q_{\mathsf{X}_2|\mathsf{U}_2})$. Observe that
\begin{align}
&F(Q_{\mathsf{U}_1\mathsf{U}_2}, R, \pi_{\mathsf{X}_1}\pi_{\mathsf{X}_2}, Q_{\mathsf{X}_1|\mathsf{U}_1}Q_{\mathsf{X}_2|\mathsf{U}_2}) - F(Q_{\mathsf{U}_1}Q_{\mathsf{U}_2}, R, \pi_{\mathsf{X}_1}\pi_{\mathsf{X}_2}, Q_{\mathsf{X}_1|\mathsf{U}_1}Q_{\mathsf{X}_2|\mathsf{U}_2}) \nonumber\\
&= I(\mathsf{X}_1;\mathsf{X}_2) + [I(Q_{\mathsf{U}_1\mathsf{U}_2}, Q_{\mathsf{X}_1|\mathsf{U}_1}Q_{\mathsf{X}_2|\mathsf{U}_2}) - R]^+ - \left[\sum_{i=1}^{2} I(Q_{\mathsf{U}_i}, Q_{\mathsf{X}_i|\mathsf{U}_i}) - R\right]^+ \tag{5.268}\\
&\ge [I(\mathsf{X}_1;\mathsf{X}_2) + I(Q_{\mathsf{U}_1\mathsf{U}_2}, Q_{\mathsf{X}_1|\mathsf{U}_1}Q_{\mathsf{X}_2|\mathsf{U}_2}) - R]^+ - \left[\sum_{i=1}^{2} I(Q_{\mathsf{U}_i}, Q_{\mathsf{X}_i|\mathsf{U}_i}) - R\right]^+ \tag{5.269}\\
&\ge 0, \tag{5.270}
\end{align}
where (5.268) uses $D(Q_{\mathsf{X}_1\mathsf{X}_2}\|\pi_{\mathsf{X}_1}\pi_{\mathsf{X}_2}) - D(Q_{\mathsf{X}_1}\|\pi_{\mathsf{X}_1}) - D(Q_{\mathsf{X}_2}\|\pi_{\mathsf{X}_2}) = I(\mathsf{X}_1;\mathsf{X}_2)$, and (5.269) follows from (5.267). Therefore
\[
G(R, \pi_{\mathsf{X}_1}\pi_{\mathsf{X}_2}, Q_{\mathsf{X}_1|\mathsf{U}_1}Q_{\mathsf{X}_2|\mathsf{U}_2}) = F(Q_{\mathsf{U}_1}Q_{\mathsf{U}_2}, R, \pi_{\mathsf{X}_1}\pi_{\mathsf{X}_2}, Q_{\mathsf{X}_1|\mathsf{U}_1}Q_{\mathsf{X}_2|\mathsf{U}_2}). \tag{5.271}
\]
But clearly there exist $R_1$ and $R_2$ summing to $R$ such that
\begin{align}
\text{R.H.S. of (5.271)} &= F(Q_{\mathsf{U}_1}, R_1, \pi_{\mathsf{X}_1}, Q_{\mathsf{X}_1|\mathsf{U}_1}) + F(Q_{\mathsf{U}_2}, R_2, \pi_{\mathsf{X}_2}, Q_{\mathsf{X}_2|\mathsf{U}_2}) \tag{5.272}\\
&\ge G(R_1, \pi_{\mathsf{X}_1}, Q_{\mathsf{X}_1|\mathsf{U}_1}) + G(R_2, \pi_{\mathsf{X}_2}, Q_{\mathsf{X}_2|\mathsf{U}_2}) \tag{5.273}
\end{align}
and the result follows.

3. Fix any two numbers $0 \le R' < R$. Choose $Q_{\mathsf{U}}$ such that
\[
G(R, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}}) = F(Q_{\mathsf{U}}, R, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}}). \tag{5.274}
\]
Then
\begin{align}
0 &\le G(R', \pi_{\mathsf{X}}, Q_{\mathsf{X|U}}) - G(R, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}}) \tag{5.275}\\
&\le F(Q_{\mathsf{U}}, R', \pi_{\mathsf{X}}, Q_{\mathsf{X|U}}) - F(Q_{\mathsf{U}}, R, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}}) \tag{5.276}\\
&= [I(Q_{\mathsf{U}}, Q_{\mathsf{X|U}}) - R']^+ - [I(Q_{\mathsf{U}}, Q_{\mathsf{X|U}}) - R]^+ \tag{5.277}\\
&\le R - R', \tag{5.278}
\end{align}
where (5.275) follows because $G(\cdot, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}})$ is non-increasing, and (5.278) uses (5.267) again. Thus $G(R, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}})$ is in fact 1-Lipschitz continuous in $R$.
4. Fix $R_0, R_1 \ge 0$, $\alpha \in [0,1]$, and let $Q_{\mathsf{U}_i}$ minimize $F(\cdot, R_i, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}})$ for $i = 0, 1$. Define
\begin{align}
R_\alpha &:= (1-\alpha)R_0 + \alpha R_1; \tag{5.279}\\
Q_{\mathsf{U}_\alpha} &:= (1-\alpha)Q_{\mathsf{U}_0} + \alpha Q_{\mathsf{U}_1}. \tag{5.280}
\end{align}
In both the $I(Q_{\mathsf{U}_\alpha}, Q_{\mathsf{X|U}}) > R_\alpha$ and the $I(Q_{\mathsf{U}_\alpha}, Q_{\mathsf{X|U}}) \le R_\alpha$ cases one can explicitly calculate that
\[
F(Q_{\mathsf{U}_\alpha}, R_\alpha, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}}) \le (1-\alpha)F(Q_{\mathsf{U}_0}, R_0, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}}) + \alpha F(Q_{\mathsf{U}_1}, R_1, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}}) \tag{5.281}
\]
and the convexity follows.
5.2.4 Application to Lossy Source Coding
The simplest application of the new resolvability result (in particular, the softer-covering lemma) is to derive a one-shot achievability bound for lossy source coding, which is most fitting in the regime of low rate and exponentially decreasing success probability. The method is applicable to general sources. In the special case of i.i.d.\ sources, it recovers the ``success exponent'' in lossy source coding originally derived by the method of types [25] for discrete memoryless sources. The achievability bound in [160] can be viewed as the $\gamma = 1$ special case, which is capable of recovering the rate-distortion function, but cannot recover the exact rate-distortion-exponent tradeoff.

Theorem 5.2.17 Consider a source with distribution $\pi_X$ and a distortion function $d(\cdot,\cdot)$ on $\mathcal{U}\times\mathcal{X}$. For any joint distribution $Q_U Q_{X|U}$, $\gamma \ge 1$, $d > 0$ and integer $M$, there exists a random transformation $\pi_{U|X}$ (stochastic encoder) whose output takes at most $M$ values, such that
\[
\mathbb{P}[d(\bar{U},\bar{X}) \le d] \ge \frac{1}{\gamma}\left(\mathbb{P}[d(U,X) \le d] - \mathbb{E}[E_\gamma(Q_X[U^M] \,\|\, \pi_X)]\right) \tag{5.282}
\]
where $(\bar{U},\bar{X}) \sim \pi_{U|X}\pi_X$, $(U,X) \sim Q_{UX}$, and $U^M \sim Q_U^{\otimes M}$.

Proof Given a codebook $(c_1,\dots,c_M) \in \mathcal{U}^M$, let $P_U$ be the equiprobable distribution on $(c_1,\dots,c_M)$ and set
\[
P_{UX} := Q_{X|U}P_U. \tag{5.283}
\]
The likelihood encoder is then defined as the random transformation
\[
\pi_{U|X} := P_{U|X}, \tag{5.284}
\]
so that the joint distribution of the codeword selected and the source realization is
\[
\pi_{UX} = \pi_X P_{U|X}. \tag{5.285}
\]
From Proposition 5.2.2-2) and Proposition 5.2.2-4) we obtain
\begin{align}
\gamma\,\pi_{UX}(d(\cdot,\cdot) \le d) &\ge P_{UX}(d(\cdot,\cdot) \le d) - E_\gamma(P_{XU} \,\|\, \pi_{XU}) \tag{5.286}\\
&= P_{UX}(d(\cdot,\cdot) \le d) - E_\gamma(P_X \,\|\, \pi_X), \tag{5.287}
\end{align}
where $(\bar{U},\bar{X}) \sim \pi_{UX}$. Note that $P_{UX}$ and $\pi_{U|X}$ depend on the codebook $c_M$. Now consider a random codebook $c_M \leftarrow U^M$. Taking the expectation on both sides of (5.287) with respect to $U^M$, we have
\begin{align}
\gamma\,\mathbb{E}[\pi_{UX}(d(\cdot,\cdot) \le d)]
&\ge \mathbb{E}[P_{UX}(d(\cdot,\cdot) \le d)] - \mathbb{E}[E_\gamma(Q_X[U^M] \,\|\, \pi_X)] \tag{5.288}\\
&= \mathbb{P}[d(U,X) \le d] - \mathbb{E}[E_\gamma(Q_X[U^M] \,\|\, \pi_X)], \tag{5.289}
\end{align}
where in (5.289) we used the fact that $\mathbb{E}[P_{UX}] = Q_{UX}$. Finally we can choose one codebook (corresponding to one $\pi_{U|X}$) such that $\pi_{UX}(d(\cdot,\cdot) \le d)$ is at least its expectation.
Remark 5.2.10 In the i.i.d. setting, let $R(\pi_X,d)$ be the rate-distortion function when the source has per-letter distribution $\pi_X$. The distortion function for the block is derived from the per-letter distortion by
\[
d_n(u^n,x^n):=\frac{1}{n}\sum_{i=1}^{n}d(u_i,x_i). \tag{5.290}
\]
Let $(X^n,U^n)$ be the source-reconstruction pair distributed according to $\pi_{X^nU^n}$. If $0\le R<R(\pi_X,d)$, the maximal probability that the distortion does not exceed $d$ converges to zero with the exponent
\[
\lim_{n\to\infty}\frac{1}{n}\log\frac{1}{\mathbb{P}[d_n(U^n,X^n)\le d]}=G(R,d) \tag{5.291}
\]
where
\[
G(R,d):=\min_{Q}\left[D(Q\,\|\,\pi_X)+[R(Q,d)-R]^{+}\right]. \tag{5.292}
\]
A weaker achievability result than (5.292) was proved in [182, p. 168], whereas the final form (5.292) is given in [25, p. 158, Ex. 6] based on the method of types. Here we can easily prove the achievability part of (5.292) using Theorem 5.2.17 and Corollary 5.2.8 by setting $Q_X$ to be the minimizer of (5.292) and $Q_{U|X}$ to be such that
\begin{align}
\mathbb{E}[d(U,X)]&\le d, \tag{5.293}\\
I(Q_U,Q_{X|U})&\le R. \tag{5.294}
\end{align}
Then $\gamma_n=\exp(nE)$ with
\[
E>D(Q_X\,\|\,\pi_X)+[I(Q_U,Q_{X|U})-R]^{+} \tag{5.295}
\]
ensures that
\[
\mathbb{P}[d_n(U^n,X^n)\le d]\ge\frac{1}{2}\exp(-nE) \tag{5.296}
\]
for $n$ large enough, by the law of large numbers.
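For a concrete feel for the exponent (5.292), it can be evaluated numerically in the binary case, where $R(Q,d)=h(q)-h(d)$ for a Bern($q$) source under Hamming distortion. The following sketch (illustrative names; a grid search, not an exact optimization) computes $G(R,d)$ for a Bern($p$) source:

```python
import numpy as np

def h(p):
    # binary entropy in nats
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def dkl(q, p):
    # binary relative entropy D(Bern(q) || Bern(p)) in nats
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def rd_binary(q, d):
    # rate-distortion function of a Bern(q) source, Hamming distortion
    if d >= min(q, 1 - q):
        return 0.0
    return h(q) - h(d)

def G(R, d, p=0.5):
    # grid-search evaluation of (5.292): min_Q [D(Q||pi_X) + [R(Q,d)-R]^+]
    qs = np.linspace(1e-3, 1 - 1e-3, 2000)
    return min(dkl(q, p) + max(rd_binary(q, d) - R, 0.0) for q in qs)
```

As a sanity check, $G(R,d)=0$ whenever $R\ge R(\pi_X,d)$ (the minimizer is $Q=\pi_X$), and $G(R,d)>0$ below the rate-distortion function.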
Remark 5.2.11 Since the $E_\gamma$ metric reduces to the total variation distance when $\gamma=1$,
Theorem 5.2.17 generalizes the likelihood source encoder based on the standard soft-
covering/resolvability lemma [160]. In [160], the error exponent for the likelihood
source encoder at rates above the rate-distortion function is analyzed using the expo-
nential decay of total variation distance in the approximation of output statistics, and
the exponent does not match the optimal exponent found in [25]. It is also possible
to upper-bound the success exponent of the total variation distance-based likelihood
encoder at rates below the rate-distortion function by analyzing the exponential con-
vergence to 2 of total variation distance in the approximation of output statistics;
however that does not yield the optimal exponent (5.292) either. This application
illustrates one of the nice features of the Eγ-resolvability method: it converts a large
deviation analysis into a law of large numbers analysis, that is, we only care about
whether Eγ converges to 0, but not the speed, even when dealing with error exponent
problems.
5.2.5 Application to Mutual Covering Lemma
Another application of the softer-covering lemma is a one-shot generalization of the mutual covering lemma in network information theory [156]. The asymptotic mutual covering lemma states that, for a fixed (per-letter) joint distribution $P_{UV}$, if enough $U^n$-sequences and $V^n$-sequences are independently generated according to $P_{U^n}$ and $P_{V^n}$ respectively, then with high probability we can find one pair jointly typical with respect to $P_{UV}$. In the one-shot version, the "typical set" is replaced with an arbitrary high-probability (under the given joint distribution) set. The original proof of the asymptotic mutual covering lemma [156][18] used the second-order moment method.
The one-shot mutual covering lemma can be used to prove a one-shot version of
Marton’s inner bound for the broadcast channel with a common message14 without
time-sharing, improving the proof in [161] based on the basic covering lemma where
time-sharing is necessary. More discussions about the background and the derivation
of the one-shot Marton’s inner bound can be found in our conference paper [186].
For general discussions on single-shot covering lemmas, see [161][187]. To avoid using
time-sharing in the single-shot setting, [188] pursued a different approach to derive a
single-shot Marton’s inner bound. Moreover, a version of one-shot mutual covering
lemma can be distilled from their approach [189]. We compare their approach and
ours at the end of the section.
In our conference paper [186] we applied the one-shot mutual covering lemma to
derive a one-shot version of Marton’s inner bound for the broadcast channel with a
common message, without using time-sharing/common randomness. In our confer-
ence paper [190] we use a concentration inequality of Talagrand to prove that the
14More precisely, we are referring to the three auxiliary random variables version due to Liang and Kramer [183, Theorem 5] (see also [18, Theorem 8.4]), which is equivalent to an inner bound obtained by Gelfand and Pinsker [184] upon optimization (see [185] or [18, Remark 8.6]).
covering error probability is doubly exponentially vanishing in the blocklength. Since
those results are not directly related to Eγ, we exclude them from this thesis.
Now, we proceed to provide a simple derivation of a mutual covering lemma using the softer-covering lemma.
Lemma 5.2.18 Fix $P_{UV}$ and let
\[
P_{U^MV^L}:=\underbrace{P_U\times\cdots\times P_U}_{M}\times\underbrace{P_V\times\cdots\times P_V}_{L}. \tag{5.297}
\]
Then
\[
\mathbb{P}\left[\bigcap_{m=1,l=1}^{M,L}\{(U_m,V_l)\notin\mathcal{F}\}\right]
\le \mathbb{P}[(U,V)\notin\mathcal{F}]+\mathbb{P}[\imath_{U;V}(U;V)\ge\log ML-\tau]
+\frac{\exp(\tau)}{\max\{M,L\}}+e^{-\frac{1}{2}\exp(\tau)} \tag{5.298}
\]
for all $\tau>0$ and every event $\mathcal{F}$.
Proof Assume without loss of generality that $L\ge M$. For any $u\in\mathcal{U}$, define
\[
\mathcal{F}_u:=\{v\colon (u,v)\in\mathcal{F}\}, \tag{5.299}
\]
and for any $u^M\in\mathcal{U}^M$, define
\[
\mathcal{A}_{u^M}:=\bigcup_{m=1}^{M}\mathcal{F}_{u_m}. \tag{5.300}
\]
Now fix a $U$-codebook $c^M$ and observe that
\begin{align}
\gamma P_V(\mathcal{A}_{c^M})&\ge P_V[c^M](\mathcal{A}_{c^M})-E_\gamma(P_V[c^M]\,\|\,P_V) \tag{5.301}\\
&\ge\frac{1}{M}\sum_{m=1}^{M}P_{V|U=c_m}(\mathcal{F}_{c_m})-E_\gamma(P_V[c^M]\,\|\,P_V) \tag{5.302}
\end{align}
where we recall that
\[
P_V[c^M]:=\frac{1}{M}\sum_{m=1}^{M}P_{V|U=c_m}. \tag{5.303}
\]
Here (5.301) is from the definition of $E_\gamma$ and (5.302) is because $\mathcal{F}_{c_m}\subseteq\bigcup_{m=1}^{M}\mathcal{F}_{c_m}=\mathcal{A}_{c^M}$.
Denote by $\Gamma(c_1,\dots,c_M)$ the right side of (5.302), which is trivially upper-bounded by 1. Next, we show that
\[
\mathbb{P}\left[\bigcap_{m=1,l=1}^{M,L}\{(U_m,V_l)\notin\mathcal{F}\}\,\middle|\,U^M=c^M\right]\le 1-\Gamma(c^M)+e^{-L/\gamma} \tag{5.304}
\]
which is trivial when $\Gamma(c^M)<0$. In the case of $\Gamma(c^M)\in[0,1]$,
\begin{align}
\mathbb{P}\left[\bigcap_{m=1,l=1}^{M,L}\{(U_m,V_l)\notin\mathcal{F}\}\,\middle|\,U^M=c^M\right]
&=\left[1-P_V(\mathcal{A}_{c^M})\right]^{L} \tag{5.305}\\
&\le\left[1-\frac{\Gamma(c^M)\,L/\gamma}{L}\right]^{L} \tag{5.306}\\
&\le 1-\Gamma(c^M)+e^{-L/\gamma} \tag{5.307}
\end{align}
where (5.305) is from the definition of $\mathcal{A}_{c^M}$, and (5.306) is from (5.302). The last step (5.307) uses the basic inequality
\[
\left(1-\frac{p\alpha}{M}\right)^{M}\le 1-p+e^{-\alpha} \tag{5.308}
\]
for $M,\alpha>0$ and $0\le p\le 1$, which has been useful in the proofs of the basic covering lemma (see [3][161][43]). Integrating both sides of (5.304) over $c^M$ with respect to $P_U\times\cdots\times P_U$,
\[
\mathbb{P}\left[\bigcap_{m=1,l=1}^{M,L}\{(U_m,V_l)\notin\mathcal{F}\}\right]
\le P_{UV}(\mathcal{F}^{c})+\mathbb{E}[E_\gamma(P_V[U^M]\,\|\,P_V)]+e^{-L/\gamma}, \tag{5.309}
\]
where we have used the fact that $\mathbb{E}[P_{V|U}(\mathcal{F}_{U_m}|U_m)]=P_{UV}(\mathcal{F})$ for each $m$. Applying the "softer-covering lemma" as in Remark 5.2.5, the middle term on the right hand side of (5.309) is upper-bounded by
\[
\mathbb{P}\left[\imath_{V;U}(V;U)\ge\log\frac{M\gamma}{2}\right]+\frac{2}{\gamma} \tag{5.310}
\]
and the result follows by taking $\gamma\leftarrow 2L\exp(-\tau)$.
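The basic inequality (5.308) is easy to sanity-check numerically. The sketch below verifies it on a grid, restricted to $p\alpha\le M$ so that the base stays nonnegative (this is the regime arising in the proof above, where $\alpha/M=1/\gamma\le 1$):

```python
import math

def lhs(p, alpha, M):
    # left side of (5.308)
    return (1 - p * alpha / M) ** M

def rhs(p, alpha):
    # right side of (5.308)
    return 1 - p + math.exp(-alpha)

# grid check of (1 - p*alpha/M)^M <= 1 - p + e^{-alpha}
violations = [
    (p, a, M)
    for p in (0.0, 0.1, 0.5, 0.9, 1.0)
    for a in (0.1, 1.0, 3.0, 10.0)
    for M in (1, 2, 5, 50, 1000)
    if p * a <= M and lhs(p, a, M) > rhs(p, a) + 1e-12
]
```

The check relies on the standard chain $(1-p\alpha/M)^M\le e^{-p\alpha}\le 1-p+e^{-\alpha}$, the last step by convexity of $e^{-x\alpha}$ in $x$ over $[0,1]$.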
Remark 5.2.12 From the above derivation we see that for the proof of the basic covering lemma (the $M=1$ case) we need the "softest-covering lemma" (the case of one codeword) rather than the soft-covering lemma (the case of $\gamma=1$ and $L>1$ codewords). However, it is still possible to prove the basic covering lemma from the soft-covering lemma via a different argument; see the discussion in [189], which is essentially based on the idea in [159].
Lemma 5.2.19 below is a strengthened version of the one-shot mutual covering lemma, which improves Lemma 5.2.18 in terms of the error exponent. The proof of Lemma 5.2.19 essentially combines the proof of the achievability part of resolvability with the proof of Lemma 5.2.18, and the improvement results from not treating the two steps separately. The proof is not as conceptually simple as that of Lemma 5.2.18, since the complexities are no longer buried under the softer-covering lemma.
Lemma 5.2.19 Under the same assumptions as Lemma 5.2.18,
\[
\mathbb{P}\left[\bigcap_{m=1,l=1}^{M,L}\{(U_m,V_l)\notin\mathcal{F}\}\right]
\le \mathbb{P}\left[(U,V)\notin\mathcal{F}\ \text{or}\ \exp(\imath_{U;V}(U;V))>ML\exp(-\gamma)-\delta\right]
+\frac{\min\{M,L\}-1}{\delta}+e^{-\exp(\gamma)} \tag{5.311}
\]
for all $\delta,\gamma>0$ and every event $\mathcal{F}$.
Proof See Lemma 1 and Remark 4 in the conference version [191].
Remark 5.2.13 An advantage of Lemma 5.2.19 over Lemma 5.2.18 is that the upper bound in the former contains the probability of a union of two events, rather than the sum of the probabilities of the two events. This yields a strict improvement in the second-order rate analysis. Moreover, by setting $\delta\downarrow 0$ and $M=1$ we exactly recover the basic one-shot covering lemma in [161].
In terms of the second order rates, the one-shot Marton’s inner bound for broadcast
obtained from our one-shot mutual covering lemma ([186, Theorem 10]) is equiva-
lent to the achievability bound claimed in [37, Theorem 4] based on the stochastic
likelihood encoder. However, although it is not demonstrated explicitly in [186, The-
orem 10], we can improve the analysis of [186, Theorem 10] by using various nuisance
parameters rather than a single γ, to obtain a one-shot Marton’s bound which gives
strictly better error exponents than [37, Theorem 4]. The reason for such improve-
ment is that the third term in (5.311) is doubly exponential and the second term
converges to zero with a large exponent. On the other hand, the approach of [37,
Theorem 4] has the advantage of being easily extendable to the case of more than
two users (which would correspond to a multivariate mutual covering lemma).
5.2.6 Application to Wiretap Channels
Our final application of $E_\gamma$-resolvability (in particular, the softer-covering lemma) is to the wiretap channel, whose setup is depicted in Figure 5.4. The receiver and the eavesdropper observe $y\in\mathcal{Y}$ and $z\in\mathcal{Z}$, respectively. Given a codebook $c^{ML}$, the input to $P_{YZ|X}$ is $c_{wl}$, where $w\in\{1,\dots,M\}$ is the message to be sent and $l$ is equiprobably chosen from $\{1,\dots,L\}$ to randomize the eavesdropper's observation. We call such a $c^{ML}$ an $(M,L)$-code. Moreover, the eavesdropper's observation has the distribution $\pi_Z$ when no message is sent. In this setup, we do not need to assume a prior distribution on the message/non-message. We wish to design the codebook such that the receiver can decode the message (reliability) whereas the eavesdropper can neither detect whether a message is sent nor guess which message is sent (security). For general wiretap channels the performance may be enhanced by appending a conditioning channel $Q_{X|U}$ at the input of the original channel [171]. In that case the same analysis can be carried out for the new wiretap channel $Q_{YZ|U}$. Thus the model in Figure 5.4 entails no loss of generality.
Figure 5.4: The wiretap channel. The transmitter maps the message $W$ to the channel input $X$; through $P_{YZ|X}$, the receiver observes $Y$ and produces $\hat{W}$, while the eavesdropper observes $Z$.
In Wyner’s setup (see for example [18]), secrecy is measured in terms of the
conditional entropy of the message given the eavesdropper observation. In contrast,
we measure secrecy in terms of the size of the list that the eavesdropper has to declare
for the message to be included with high probability. Practically, the message W is
the compressed version of the plaintext. Assuming that the attacker knows which
compression algorithm is used, the plaintext can be recovered by running each of the
items in the eavesdropper output list through the decompressor and selecting the one
that is intelligible.
While there have been previous proofs for wiretap channels using the conventional
resolvability (soft-covering lemma) [171][172], the conventional resolvability based on
the total variation metric is only suitable when the communication rate is low enough
to achieve perfect secrecy. In contrast, Eγ-resolvability yields lower bounds on the
minimum size of the eavesdropper list for an arbitrary rate of communication. This
interpretation of security in terms of list size is reminiscent of equivocation [192], and indeed we recover the same formula in the asymptotic setting, even though there is no direct correspondence between the two criteria. Moreover, we also consider a more
general case where the eavesdropper wishes to reliably detect whether a message is
sent, while being able to produce a list including the actual message if it decides it is
present. This is a practical setup because “no message” may be valuable information
which the eavesdropper wants to ascertain reliably. The idea is reminiscent of the
stealth communication problem (see [193][194] and the references therein) which also
involves a hypothesis test on whether a message is sent. However, the setup and the
analysis (including the covering lemma) are quite different from [193] and [194]. In
comparison, our results are more suitable for the regime with higher communication
rates and lower secrecy demands. In the discrete memoryless case, we obtain single-letter expressions for the tradeoff between the transmission rate, the eavesdropper list size, and the exponent of the eavesdropper's false-alarm probability (i.e. declaring the presence of a message when there is none).
Summary of The Asymptotic Results
We need the following definitions to quantify the eavesdropper's ability to detect/decode messages.
Definition 5.2.9 For a fixed codebook and channel $P_{Z|X}$, we say the eavesdropper can perform $(A,T,\epsilon)$-decoding if, upon observing $Z$, it outputs a list which is either empty or of size $T$, such that

• $\mathbb{P}[\text{list}\ne\emptyset\,|\,\text{no message}]\le A^{-1}$.

• There exist $\epsilon_m\in[0,1]$, $m=1,\dots,M$, satisfying $\epsilon=\frac{1}{M}\sum_{m=1}^{M}\epsilon_m$, such that
\[
\mathbb{P}[m\notin\text{list}\,|\,W=m]\le\epsilon_m. \tag{5.312}
\]
Although the decoder in Definition 5.2.9 is reminiscent of erasure and list decoding
[21], for the former it is possible that actually no message is sent, and we treat the
undetected and detected errors together.
The logarithm of $T$ can be intuitively understood as the equivocation $H(W|Z)$ [192]. However, $\log T$ can be much smaller than $H(W|Z)$: a distribution can have 99% of its mass supported on a very small set, and yet have an arbitrarily large entropy.
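The gap between list size and entropy claimed above can be made concrete with a toy numerical illustration (the construction below is an example, not taken from the thesis):

```python
import numpy as np

def entropy_nats(p):
    # Shannon entropy in nats, ignoring zero-probability atoms
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# 99% of the mass on one point, the remaining 1% uniform over K points:
# a size-1 list already covers the outcome with probability 0.99, while
# the entropy grows like 0.01 * log(K), hence is unbounded in K.
K = 10**6
p = np.concatenate(([0.99], np.full(K, 0.01 / K)))
H = entropy_nats(p)
```

Increasing $K$ drives $H$ to infinity while the optimal list of any fixed size still misses the heavy atom with probability at most 0.01.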
The quantity $A>0$ characterizes how well the eavesdropper can detect that no message is sent, which is related to the notion of stealth communication [193][194]. The "non-stealth" is measured by $D(P_Z\,\|\,\pi_Z)$ in [193], and by $|P_Z-\pi_Z|$ in [194]. Although both the relative entropy and the total variation are related to the error probability in hypothesis testing, their results cannot be directly compared with ours, since they are interested in the regime where the non-stealth vanishes while the transmission rate is below the secrecy capacity. In contrast, we are mainly interested in the regime where $A$ grows exponentially in the blocklength (so that the "non-stealth" in their definition grows), but the transmission rate is above the secrecy capacity.
The asymptotic version of the eavesdropper achievability is as follows.
Definition 5.2.10 Fix a sequence of codebooks and an eavesdropper channel $(P_{Z^n|X^n})_{n=1}^{\infty}$. The rate pair $(\alpha,\tau)$ is $\epsilon$-achievable by the eavesdropper if there exist sequences $(A_n)$ and $(T_n)$ with
\begin{align}
\liminf_{n\to\infty}\frac{1}{n}\log A_n&\ge\alpha \tag{5.313}\\
\limsup_{n\to\infty}\frac{1}{n}\log T_n&\le\tau \tag{5.314}
\end{align}
such that for sufficiently large $n$, the eavesdropper can achieve $(A_n,T_n,\epsilon)$-decoding.
By the diagonalization argument [44, p. 56], the set of $\epsilon$-achievable $(\alpha,\tau)$ is closed. An $(M,L,Q_X)$-random code is defined as the ensemble of codebooks $c^{ML}$ where each codeword $c_{wl}$ is i.i.d. chosen according to $Q_X$, $w\in\{1,\dots,M\}$, $l\in\{1,\dots,L\}$. We shall focus on random codes, for which reliability is guaranteed by channel coding theorems, so we only need to consider the security condition.
First, we extend the notions of achievability to the case of a random ensemble of codes by taking the average: we say that for a random ensemble of codes the eavesdropper can perform $(A,T,\epsilon)$-decoding if there exists $\epsilon(c^{ML})$ such that for each $c^{ML}$ the eavesdropper can perform $(A,T,\epsilon(c^{ML}))$-decoding (in the sense of Definition 5.2.9), and the average of $\epsilon(c^{ML})$ with respect to the codebook distribution is upper-bounded by $\epsilon$. Similarly, Definition 5.2.10 can be extended to random codes. Then, the following is our main result, which characterizes the set of eavesdropper-achievable pairs for stationary memoryless channels.
Theorem 5.2.20 Fix any $Q_X$, $R$, $R_L$ and $0<\epsilon<1$. Consider $(\exp(nR),\exp(nR_L),Q_X^{\otimes n})$-random codes and a stationary memoryless channel with per-letter conditional distribution $Q_{Z|X}$. Then the pair $(\alpha,\tau)$ is $\epsilon$-achievable by the eavesdropper if and only if
\[
\begin{cases}
\alpha\le D(Q_Z\,\|\,\pi_Z)+[I(Q_X,P_{Z|X})-R-R_L]^{+};\\
\tau\ge R-[I(Q_X,P_{Z|X})-R_L]^{+},
\end{cases} \tag{5.315}
\]
where $Q_X\to Q_{Z|X}\to Q_Z$.
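As an illustration, the corner point of the region (5.315) can be evaluated in closed form for a binary-symmetric toy example. All modeling choices below are assumptions made for the illustration only: $X\sim\mathrm{Bern}(q)$, $Z=X\oplus N$ with $N\sim\mathrm{Bern}(e)$, and $\pi_Z=\mathrm{Bern}(e)$ (an all-zero input when no message is sent).

```python
import numpy as np

def h(p):
    # binary entropy in nats
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def dkl(a, b):
    # D(Bern(a) || Bern(b)) in nats
    a = np.clip(a, 1e-12, 1 - 1e-12)
    b = np.clip(b, 1e-12, 1 - 1e-12)
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

def eavesdropper_region(q, e, R, RL):
    """Corner point (alpha_max, tau_min) of (5.315) for the toy model."""
    s = q * (1 - e) + (1 - q) * e      # Q_Z = Bern(s)
    I = h(s) - h(e)                    # I(Q_X, P_{Z|X}) for the BSC
    Dz = dkl(s, e)                     # D(Q_Z || pi_Z)
    alpha_max = Dz + max(I - R - RL, 0.0)
    tau_min = R - max(I - RL, 0.0)
    return alpha_max, tau_min
```

Note that `tau_min` can be negative, in which case the list-size constraint is vacuous (any $\tau\ge 0$ is achievable by the eavesdropper).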
From the noisy channel coding theorem, the supremum of randomization rates $R_L$ such that the sender can reliably transmit messages at rate $R$ is $I(Q_X,P_{Y|X})-R$. The larger $R_L$, the less reliably the eavesdropper can decode, so the optimal encoder chooses $R_L$ as close to this supremum as possible. Thus Theorem 5.2.20 implies the following result.
Theorem 5.2.21 Given a stationary memoryless wiretap channel with per-letter conditional distribution $P_{YZ|X}$, there exists a sequence of codebooks such that messages at rate $R$ can be reliably transmitted to the intended receiver and $(\alpha,\tau)$ is not $\epsilon$-achievable by the eavesdropper, for any $\epsilon\in(0,1)$, if there exists some $Q_X$ such that
\[
R<I(Q_X,P_{Y|X}) \tag{5.316}
\]
and either
\[
\alpha>D(Q_Z\,\|\,\pi_Z)+[I(Q_X,P_{Z|X})-I(Q_X,P_{Y|X})]^{+} \tag{5.317}
\]
or
\[
\tau<R-[I(Q_X,P_{Z|X})-I(Q_X,P_{Y|X})+R]^{+}. \tag{5.318}
\]
Remark 5.2.14 In general the sender-receiver pair wants to minimize $\alpha$ and maximize $\tau$ subject to the tradeoff (5.317)-(5.318) by selecting $Q_X$. In the special case where $\alpha$ is of no importance and $R$ is larger than the secrecy capacity $C:=\sup_{Q_X}\{I(Q_X,P_{Y|X})-I(Q_X,P_{Z|X})\}$, we see from (5.318) that the supremum of $\tau$ is $C$. The formula for the supremum of $\tau$ is the same as the equivocation measure defined as $\frac{1}{n}H(W|Z^n)$ [192], but technically our result does not follow directly from the lower bound on the equivocation, since it may be possible that the a posteriori distribution of $W$ is concentrated on a small list but has a tail spread over an exponentially large set, resulting in a large equivocation.
We now prove the "only if" and "if" parts of Theorem 5.2.20.
Converse for the Eavesdropper: One-Shot Bounds
The “only if” part (the eavesdropper converse) of Theorem 5.2.20 follows by applying
the following non-asymptotic bounds to different regions and invoking Corollary 5.2.8.
Theorem 5.2.22 In the wiretap channel, fix an arbitrary distribution $\mu_Z$. Suppose the eavesdropper can either detect that no message is sent upon observing $z\in\mathcal{D}_0$, with
\[
\mu_Z(\mathcal{D}_0)\ge 1-A^{-1} \tag{5.319}
\]
for some $A\in[1,\infty)$, or output a list of $T(z)$ messages upon observing $z\notin\mathcal{D}_0$ that contains the actual message $m\in\{1,\dots,M\}$ with probability at least $1-\epsilon_m$, for some $\epsilon_m\in[0,1]$. Define the average quantities
\begin{align}
T&:=\frac{1}{\mu_Z(\mathcal{D}_0^{c})}\int_{\mathcal{D}_0^{c}}T(z)\,\mathrm{d}\mu_Z(z), \tag{5.320}\\
\epsilon&:=\frac{1}{M}\sum_{m=1}^{M}\epsilon_m. \tag{5.321}
\end{align}
Then for any $\gamma\in[1,+\infty)$,
\[
\frac{1}{A}\ge\frac{1}{\gamma}\left(1-\epsilon-E_\gamma(P_Z\,\|\,\pi_Z)\right), \tag{5.322}
\]
where we recall that $\pi_Z$ is the non-message distribution, $P_Z:=\frac{1}{M}\sum_{m=1}^{M}P_{Z|W=m}$, and $P_{Z|W=m}$ is the distribution of the eavesdropper observation for the message $m$ (assuming an arbitrary codebook is used). Moreover,
\[
\frac{T}{MA}\ge\frac{1}{\gamma}\left(1-\epsilon-\frac{1}{M}\sum_{m=1}^{M}E_\gamma(P_{Z|W=m}\,\|\,\mu_Z)\right). \tag{5.323}
\]
We will choose $\mu_Z=P_Z$ when we use Theorem 5.2.22 to prove Theorem 5.2.20, although (5.323) holds for any $\mu_Z$.

From the eavesdropper's viewpoint, a larger $A$ and a smaller $T$ are more desirable, since the eavesdropper will then be able to find out that no message is sent with smaller error probability, or to narrow down to a smaller list when a message is sent. This observation agrees with (5.322) and (5.323): a smaller $\gamma$ implies a higher degree of approximation, and hence a higher indistinguishability of the output distributions, which is to the eavesdropper's disadvantage.
Proof To see (5.322),
\begin{align}
\frac{1}{A}&\ge\pi_Z(\mathcal{D}_0^{c}) \tag{5.324}\\
&\ge\frac{1}{\gamma}\left(P_Z(\mathcal{D}_0^{c})-E_\gamma(P_Z\,\|\,\pi_Z)\right) \tag{5.325}\\
&=\frac{1}{\gamma}\left(\frac{1}{M}\sum_{m=1}^{M}P_{Z|W=m}(\mathcal{D}_0^{c})-E_\gamma(P_Z\,\|\,\pi_Z)\right) \tag{5.326}\\
&\ge\frac{1}{\gamma}\left(1-\epsilon-E_\gamma(P_Z\,\|\,\pi_Z)\right). \tag{5.327}
\end{align}
To see (5.323), let $\mathcal{D}_m$ be the set of outputs $z\in\mathcal{Z}$ for which the eavesdropper list contains $m\in\{1,\dots,M\}$. Then
\begin{align}
\frac{T}{MA}&\ge\frac{T}{M}\,\mu_Z(\mathcal{D}_0^{c}) \tag{5.328}\\
&=\frac{1}{M}\int_{\mathcal{D}_0^{c}}T(z)\,\mathrm{d}\mu_Z(z) \tag{5.329}\\
&=\frac{1}{M}\int\sum_{m=1}^{M}1\{z\in\mathcal{D}_m\}\,\mathrm{d}\mu_Z(z) \tag{5.330}\\
&=\frac{1}{M}\sum_{m=1}^{M}\int 1\{z\in\mathcal{D}_m\}\,\mathrm{d}\mu_Z(z) \tag{5.331}\\
&=\frac{1}{M}\sum_{m=1}^{M}\mu_Z(\mathcal{D}_m) \tag{5.332}\\
&\ge\frac{1}{M\gamma}\sum_{m=1}^{M}\left(P_{Z|W=m}(\mathcal{D}_m)-E_\gamma(P_{Z|W=m}\,\|\,\mu_Z)\right) \tag{5.333}\\
&\ge\frac{1}{\gamma}\left(1-\epsilon-\frac{1}{M}\sum_{m=1}^{M}E_\gamma(P_{Z|W=m}\,\|\,\mu_Z)\right). \tag{5.334}
\end{align}
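The steps (5.325) and (5.333) use only the defining property of $E_\gamma$ on finite alphabets, namely $P(\mathcal{A})-\gamma Q(\mathcal{A})\le E_\gamma(P\,\|\,Q)$ for every event $\mathcal{A}$. A quick randomized spot-check of this property (an illustration, not part of any proof):

```python
import numpy as np

rng = np.random.default_rng(0)

def E_gamma(p, q, gamma):
    # E_gamma(P || Q) = sum_z [P(z) - gamma*Q(z)]^+ on a finite alphabet
    return float(np.maximum(p - gamma * q, 0.0).sum())

# verify P(A) <= gamma*Q(A) + E_gamma(P||Q) for random P, Q and events A
k = 8
for _ in range(100):
    p = rng.dirichlet(np.ones(k))
    q = rng.dirichlet(np.ones(k))
    A = rng.random(k) < 0.5          # a random event (subset of alphabet)
    g = 1 + 3 * rng.random()
    assert p[A].sum() <= g * q[A].sum() + E_gamma(p, q, g) + 1e-12
```

At $\gamma=1$ this quantity is the (unnormalized) total variation distance, consistent with Remark 5.2.11.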
Next, we particularize Theorem 5.2.22 to the asymptotic setting.
Proof [Proof of "only if" in Theorem 5.2.20]

• Fix an arbitrary
\[
\alpha>D(Q_Z\,\|\,\pi_Z)+[I(Q_X,P_{Z|X})-R-R_L]^{+}. \tag{5.335}
\]
We will show that $(\alpha,\tau)$ is not $\epsilon$-achievable by the eavesdropper for any $\tau>0$ and $\epsilon\in(0,1)$. Pick $\sigma>0$ such that
\[
\alpha>D(Q_Z\,\|\,\pi_Z)+[I(Q_X,P_{Z|X})-R-R_L]^{+}+2\sigma \tag{5.336}
\]
and define
\[
\begin{cases}
A_n=\exp(n(\alpha-\sigma))\\
T_n=\exp(n(\tau+\sigma)).
\end{cases} \tag{5.337}
\]
Assume the eavesdropper can perform $(A_n,T_n,\epsilon(c^{ML}))$-decoding for a particular realization of the codebook $c^{ML}$. Applying Theorem 5.2.22 with
\[
\gamma_n=\exp(n(D(Q_Z\,\|\,\pi_Z)+[I(Q_X,P_{Z|X})-R-R_L]^{+}+\sigma)), \tag{5.338}
\]
we obtain
\begin{align}
\exp(n(D(Q_Z\,\|\,\pi_Z)+[I(Q_X,P_{Z|X})-R-R_L]^{+}-\alpha+2\sigma))
&=\frac{\gamma_n}{A_n} \tag{5.339}\\
&\ge 1-\epsilon(c^{ML})-E_{\gamma_n}(P_{Z^n}[c^{ML}]\,\|\,\pi_Z^{\otimes n}). \tag{5.340}
\end{align}
From (5.336), the above implies
\[
E_{\gamma_n}(P_{Z^n}[c^{ML}]\,\|\,\pi_Z^{\otimes n})\ge\frac{1-\epsilon(c^{ML})}{2} \tag{5.341}
\]
for sufficiently large $n$. By Corollary 5.2.8 and (5.338), the average of the left side converges to zero as $n\to\infty$, thus the average of the right side cannot be lower-bounded by $\frac{1-\epsilon}{2}$.

• Fix an arbitrary
\[
\tau<R-[I(Q_X,P_{Z|X})-R_L]^{+}. \tag{5.342}
\]
We will show that $(\alpha,\tau)$ is not $\epsilon$-achievable by the eavesdropper for any $\alpha>0$ and $\epsilon\in(0,1)$. Pick $\sigma>0$ such that
\[
\tau+2\sigma<R-[I(Q_X,P_{Z|X})-R_L]^{+} \tag{5.343}
\]
and again define $A_n$ and $T_n$ as in (5.337). Assume the eavesdropper can perform $(A_n,T_n,\epsilon(c^{ML}))$-decoding for a particular realization of the codebook $c^{ML}$. Applying Theorem 5.2.22 with
\begin{align}
\mu_Z&=Q_Z^{\otimes n}, \tag{5.344}\\
\gamma_n&=\exp(n([I(Q_X,P_{Z|X})-R_L]^{+}+\sigma)), \tag{5.345}
\end{align}
and noting that $A_n\ge 1$, we obtain
\begin{align}
\exp(n(\tau-R+[I(Q_X,P_{Z|X})-R_L]^{+}+2\sigma))
&=\frac{T_n\gamma_n}{M_n} \tag{5.346}\\
&\ge 1-\epsilon(c^{ML})-\frac{1}{M}\sum_{m=1}^{M}E_{\gamma_n}(P_{Z^n|W=m}\,\|\,Q_Z^{\otimes n}) \tag{5.347}\\
&=1-\epsilon(c^{ML})-\frac{1}{M}\sum_{m=1}^{M}E_{\gamma_n}(P_{Z^n}[c_{mL}]\,\|\,Q_Z^{\otimes n}) \tag{5.348}
\end{align}
where $c_{mL}:=(c_{ml})_{l=1}^{L}$. From (5.343), the above implies
\[
\frac{1}{M}\sum_{m=1}^{M}E_{\gamma_n}(P_{Z^n}[c_{mL}]\,\|\,Q_Z^{\otimes n})\ge\frac{1-\epsilon(c^{ML})}{2} \tag{5.349}
\]
for sufficiently large $n$. Invoking Corollary 5.2.8, we see that the average of the left side with respect to the codebook converges to zero as $n\to\infty$, and in particular cannot be lower-bounded by $\frac{1-\epsilon}{2}$.
Achievability of the Eavesdropper: Ensemble Tightness
The (eavesdropper) achievability part of Theorem 5.2.20 follows by analyzing the eavesdropper's list-decoding ability for different cases of the rates $(R,R_L)$. First, consider the following one-shot achievability bounds for channel coding with possibly no message sent:
Theorem 5.2.23 Consider a random transformation $P_{Z|X}$ and an $(M,L,Q_X)$-random code. Let $Q_X\to P_{Z|X}\to Q_Z$, and let $\pi_Z$ be the distribution of the eavesdropper observation when no message is sent. Define
\begin{align}
\imath_{Z;X}(z;x)&:=\log\frac{\mathrm{d}P_{Z|X=x}}{\mathrm{d}\pi_Z}(z); \tag{5.350}\\
\bar{\imath}_{Z;X}(z;x)&:=\log\frac{\mathrm{d}P_{Z|X=x}}{\mathrm{d}Q_Z}(z). \tag{5.351}
\end{align}
Let $\delta,\beta,A,T>0$. Then, there exist three list decoders such that for Decoder 1,
\begin{align}
\mathbb{E}_{\mathcal{C}}\,\mathbb{P}[\text{error}\,|\,\text{no message}]&\le\frac{1}{A}\exp(-\delta), \tag{5.352}\\
\mathbb{E}_{\mathcal{C}}\,\mathbb{P}[\text{error}\,|\,\text{message is }m]&\le\mathbb{P}[\imath_{Z;X}(Z;X)\le\log(LMA)+\delta]\nonumber\\
&\quad+\mathbb{P}\left[\bar{\imath}_{Z;X}(Z;X)\le\log\frac{LM}{T}+\delta\right]\nonumber\\
&\quad+\frac{1}{1+\beta}+e^{-(\beta+1)}+\beta\exp(-\delta). \tag{5.353}
\end{align}
Here an error in the case of no message means that a non-empty list is produced. An error in the case of message $m$ means either the list does not contain $m$, or the list size exceeds $T$.15 For Decoder 2,
\begin{align}
\mathbb{E}_{\mathcal{C}}\,\mathbb{P}[\text{error}\,|\,\text{no message}]&\le\frac{1}{A}; \tag{5.354}\\
\mathbb{E}_{\mathcal{C}}\,\mathbb{P}[\text{error}\,|\,\text{message is }m]&\le\mathbb{P}\left[\bar{\imath}_{Z;X}(Z;X)\le\log\frac{LM}{T}+\delta\right]+\mathbb{P}[\imath_{Q_Z\|\pi_Z}(Z)\le\log A]\nonumber\\
&\quad+\frac{1}{1+\beta}+e^{-(\beta+1)}+\beta\exp(-\delta), \tag{5.355}
\end{align}
where the error events are defined similarly to Decoder 1. Decoder 3 either outputs an empty list or a list of all messages, and
\begin{align}
\mathbb{E}_{\mathcal{C}}\,\mathbb{P}[\text{error}\,|\,\text{no message}]&\le\frac{1}{A}; \tag{5.356}\\
\mathbb{E}_{\mathcal{C}}\,\mathbb{P}[\text{error}\,|\,\text{message is }m]&\le\mathbb{P}[\imath_{Q_Z\|\pi_Z}(Z)\le\log A]. \tag{5.357}
\end{align}
Proof We defer the proof to the end of this Section.
Under various conditions, one of the three decoders is asymptotically optimal. By choosing appropriate parameters $\delta,\beta,A,T>0$, it is clear that Theorem 5.2.23 implies the following:
Corollary 5.2.24 Fix any $Q_X$, $R$, $R_L$ and $0<\epsilon<1$. Consider $(\exp(nR),\exp(nR_L),Q_X^{\otimes n})$-random codes and a stationary memoryless channel with per-letter conditional distribution $Q_{Z|X}$.
15Such a decoder is a variable list-size decoder. However, we can add a post-processor which declares no message if the list is empty, or outputs a list of fixed size $T$ otherwise (by arbitrarily deleting or adding messages to the list), resulting in a new decoder as considered in Definition 5.2.9, and the two types of error probability for the new decoder (i.e. the best values of $\frac{1}{A}$ and $\epsilon_m$ in Definition 5.2.9) do not exceed the two types of error probability for the original variable list-size decoder.
• When $R+R_L<I(Q_X,P_{Z|X})$, the rate pair $(\alpha,\tau)$ is $\epsilon$-achievable by Decoder 1${}^{16}$ if
\[
\begin{cases}
D(Q_Z\,\|\,\pi_Z)+I(Q_X,P_{Z|X})>R_L+R+\alpha;\\
I(Q_X,P_{Z|X})>R+R_L-\tau.
\end{cases} \tag{5.358}
\]

• When $R+R_L\ge I(Q_X,P_{Z|X})$ but $R_L<I(Q_X,P_{Z|X})$, the rate pair $(\alpha,\tau)$ is $\epsilon$-achievable by Decoder 2 if
\[
\begin{cases}
I(Q_X,P_{Z|X})>R_L+R-\tau;\\
\alpha<D(Q_Z\,\|\,\pi_Z).
\end{cases} \tag{5.359}
\]

• When $R_L\ge I(Q_X,P_{Z|X})$, the rate pair $(\alpha,\tau)$ is $\epsilon$-achievable by Decoder 3 if
\[
\begin{cases}
\tau\ge R;\\
\alpha<D(Q_Z\,\|\,\pi_Z).
\end{cases} \tag{5.360}
\]
Proof [Proof Sketch] Consider the first case. To see the achievability of $(\alpha,\tau)$ satisfying (5.358), choose
\begin{align}
\delta_n&:=n^{0.9}, \tag{5.361}\\
\beta_n&:=n, \tag{5.362}\\
A_n&:=\exp(n\alpha), \tag{5.363}\\
T_n&:=\exp(n\tau). \tag{5.364}
\end{align}
16By which we mean that it is possible to choose the $\delta,\beta,A,T$ parameters for the decoder to achieve the desired performance.
Then the right sides of (5.352) and (5.353) converge to zero as $n\to\infty$. The analyses of the other two cases are similar, using the same choice of parameters as above.

The eavesdropper's achievability ("if" part) of Theorem 5.2.20 then follows from Corollary 5.2.24 and an application of the standard diagonalization argument to show that the achievable region is closed (see [44]).
Proof of Theorem 5.2.23

• Codebook generation: $(c_{ij})_{1\le i\le M,1\le j\le L}$ i.i.d. according to $Q_X^{\otimes ML}$.

• Decoders: Fix an arbitrary constant $\delta>0$. Upon observing $z$, Decoder 1 outputs as its list all $1\le i\le M$ such that there exists $1\le j\le L$ satisfying
\[
\begin{cases}
\imath_{Z;X}(z;c_{ij})>\log(LMA)+\delta\\
\bar{\imath}_{Z;X}(z;c_{ij})>\log\frac{LM}{T}+\delta
\end{cases} \tag{5.365}
\]
if there is at least one such $i$, or declares that no message is sent (i.e. outputs an empty list) otherwise. Decoder 2 outputs as its list all $1\le i\le M$ such that there exists $1\le j\le L$ satisfying
\[
\bar{\imath}_{Z;X}(z;c_{ij})>\log\frac{LM}{T}+\delta \tag{5.366}
\]
if there is at least one such $i$ and, in addition,
\[
\imath_{Q_Z\|\pi_Z}(z)>\log A, \tag{5.367}
\]
or declares that no message is sent otherwise. Decoder 3 outputs $\{1,\dots,M\}$ as the list if (5.367) holds (so that the list size equals $M$), and otherwise declares no message.
• Error analysis: we denote by $\mathcal{L}$ the list of messages recovered by the eavesdropper.

Decoder 1:
\begin{align}
\mathbb{P}[\mathcal{L}\ne\emptyset\,|\,\text{no message}]
&\le\mathbb{P}\left[\max_{1\le m\le M,1\le l\le L}\imath_{Z;X}(Z;X_{ml})>\log(LMA)+\delta\right] \tag{5.368}\\
&\le\frac{1}{A}\exp(-\delta) \tag{5.369}
\end{align}
where the probability is averaged over the codebook, $(X^{ML},Z)\sim Q_X^{\otimes ML}\times\pi_Z$, and (5.369) used the packing lemma [161]. Moreover,
\begin{align}
&\mathbb{P}[1\notin\mathcal{L}\ \text{or}\ \mathcal{L}=\emptyset\,|\,W=1] \tag{5.370}\\
&\quad\le\mathbb{P}[\imath_{Z;X}(Z;X)\le\log(LMA)+\delta]
+\mathbb{P}\left[\bar{\imath}_{Z;X}(Z;X)\le\log\frac{LM}{T}+\delta\right] \tag{5.371}
\end{align}
where $W$ denotes the message sent, and $(X,Z)\sim Q_{XZ}$. Further,
\begin{align}
&\mathbb{P}[|\mathcal{L}|\ge T+1\,|\,W=1]\nonumber\\
&\quad\le\mathbb{P}\left[|\mathcal{L}|\ge T+1,\ \mathcal{L}\cap\left\{2,\dots,\tfrac{\beta M}{T}+1\right\}=\emptyset\,\middle|\,W=1\right]
+\mathbb{P}\left[\mathcal{L}\cap\left\{2,\dots,\tfrac{\beta M}{T}+1\right\}\ne\emptyset\,\middle|\,W=1\right] \tag{5.372}\\
&\quad\le\left(1-\frac{\beta M/T}{M}\right)^{T}
+\mathbb{P}\left[\mathcal{L}\cap\left\{2,\dots,\tfrac{\beta M}{T}+1\right\}\ne\emptyset\,\middle|\,W=1\right] \tag{5.373}\\
&\quad\le\frac{1}{1+\beta}+e^{-(\beta+1)}
+\mathbb{P}\left[\max_{2\le m\le\beta M/T+1,\,1\le l\le L}\bar{\imath}_{Z;X}(Z;X_{ml})>\log\frac{LM}{T}+\delta\right] \tag{5.374}\\
&\quad\le\frac{1}{1+\beta}+e^{-(\beta+1)}+\beta\exp(-\delta) \tag{5.375}
\end{align}
where

– To see (5.373), note that by the symmetry among the messages $2,\dots,M$, for any $t\ge T$,
\begin{align}
&\mathbb{P}\left[\mathcal{L}\cap\left\{2,\dots,\tfrac{\beta M}{T}+1\right\}=\emptyset\,\middle|\,W=1,\ |\mathcal{L}\setminus\{1\}|=t\right] \tag{5.376}\\
&\quad=\left(1-\frac{\beta M/T}{M-1}\right)^{t} \tag{5.377}\\
&\quad\le\left(1-\frac{\beta M/T}{M}\right)^{T}. \tag{5.378}
\end{align}

– In (5.374), $(X^{ML},Z)\sim Q_X^{\otimes ML}\times Q_Z$, and we used the inequality (5.308).

– (5.375) used the packing lemma [161].
In summary,
\[
\mathbb{P}[\text{error}\,|\,\text{no message}]\le\frac{1}{A}\exp(-\delta), \tag{5.379}
\]
and for each $m=1,\dots,M$, by the union bound and by the symmetry in codebook generation we have
\begin{align}
\mathbb{P}[\text{error}\,|\,W=m]&\le\mathbb{P}[\imath_{Z;X}(Z;X)\le\log(LMA)+\delta]
+\mathbb{P}\left[\bar{\imath}_{Z;X}(Z;X)\le\log\frac{LM}{T}+\delta\right]\nonumber\\
&\quad+\frac{1}{1+\beta}+e^{-(\beta+1)}+\beta\exp(-\delta). \tag{5.380}
\end{align}
Decoder 2:
\begin{align}
\mathbb{P}[\mathcal{L}\ne\emptyset\,|\,\text{no message}]&\le\mathbb{P}[\imath_{Q_Z\|\pi_Z}(Z)>\log A] \tag{5.381}\\
&\le\frac{1}{A}\,\mathbb{P}[\imath_{Q_Z\|\pi_Z}(\bar{Z})>\log A] \tag{5.382}\\
&\le\frac{1}{A} \tag{5.383}
\end{align}
where $Z\sim\pi_Z$ and $\bar{Z}\sim Q_Z$, and (5.382) used a change of measure. On the other hand,
\[
\mathbb{P}[\mathcal{L}\ne\emptyset,\ 1\notin\mathcal{L}\,|\,W=1]\le\mathbb{P}\left[\bar{\imath}_{Z;X}(Z;X)\le\log\frac{LM}{T}+\delta\right] \tag{5.384}
\]
and
\[
\mathbb{P}[\mathcal{L}=\emptyset\,|\,W=1]\le\mathbb{P}[\imath_{Q_Z\|\pi_Z}(Z)\le\log A]. \tag{5.385}
\]
Moreover, as in (5.375), we have
\[
\mathbb{P}[|\mathcal{L}|\ge T+1\,|\,W=1]\le\frac{1}{1+\beta}+e^{-(\beta+1)}+\beta\exp(-\delta). \tag{5.386}
\]
By the union bound,
\[
\mathbb{P}[\text{error}\,|\,\text{no message}]\le\frac{1}{A}; \tag{5.387}
\]
and for each $m=1,\dots,M$,
\begin{align}
\mathbb{P}[\text{error}\,|\,W=m]&\le\mathbb{P}\left[\bar{\imath}_{Z;X}(Z;X)\le\log\frac{LM}{T}+\delta\right]
+\mathbb{P}[\imath_{Q_Z\|\pi_Z}(Z)\le\log A]\nonumber\\
&\quad+\frac{1}{1+\beta}+e^{-(\beta+1)}+\beta\exp(-\delta). \tag{5.388}
\end{align}
• Decoder 3: The analysis is similar to that of Decoder 2, and the result follows from (5.382) and (5.385).
Chapter 6
The Counterpoint
While the previous chapters have demonstrated the uses of functional formulations of information measures and functional-analytic methods in the achievability and converse proofs of operational problems in information theory, the present chapter is devoted to the antithesis, that is, how entropic formulations and information-theoretic methods can be used to prove functional inequalities. Most notably, the Gaussian rotational invariance implies the Gaussian optimality in the Brascamp-Lieb inequality and its variations, and the proofs appear to be more conveniently carried out in the information-theoretic formulations than in their traditional functional versions [89].
The data processing property, tensorization, and convexity are studied in Section 6.1.
Section 6.2 proves the Gaussian optimality in a generalization of the Brascamp-Lieb
inequality where the deterministic linear maps are generalized by Gaussian random
transformations1. In most cases, we are able to prove the Gaussian extremality and
uniqueness of the minimizer under a certain non-degenerate assumption, while es-
tablishing the Gaussian exhaustibility in full generality2. In Section 6.3 we further
establish the Gaussian optimality in the forward-reverse Brascamp-Lieb inequality.
Section 6.4 discusses several implications of the Gaussian optimality results: some
quantities/rate regions arising in information theory can be efficiently computed by
1That is, a random transformation $x\mapsto Ax+w$ where $A$ is deterministic and $w$ is a Gaussian vector independent of $x$.
2See the beginning of Section 6.2 for precise definitions of extremisability and exhaustibility.
solving a finite dimensional optimization problem in the Gaussian cases. Examples
include multi-variate hypercontractivity, Wyner’s common information for multiple
variables, and certain secret key or common randomness generation problems. The
relationship between the Gaussian optimality in the forward-reverse Brascamp-Lieb
inequality and the transportation-cost inequalities for Gaussian measures is also dis-
cussed.
6.1 Data Processing, Tensorization and Convexity
Given $Q_X$ and $(Q_{Y_j|X})$, denote by $G_{\mathrm{BL}}(Q_X,(Q_{Y_j|X}))$ the set of $(d,(c_j))$ in Theorem 2.2.3 (the forward Brascamp-Lieb inequality) such that (2.14) holds. In this section we show that some elementary properties of $G_{\mathrm{BL}}(Q_X,(Q_{Y_j|X}))$ follow conveniently from the information-theoretic characterization (2.15).
6.1.1 Data Processing
Loosely speaking, the set $G_{\mathrm{BL}}(Q_X,(Q_{Y_j|X}))$ characterizes the level of "uncorrelatedness" between $X$ and $(Y_1,\dots,Y_m)$. The following data processing property captures this intuition:

Proposition 6.1.1 1. Given $Q_W$, $Q_{X|W}$ and $(Q_{Y_j|X})_{j=1}^{m}$, assume that $Q_{WXY_j}=Q_WQ_{X|W}Q_{Y_j|X}$ for each $j$. If $(0,(c_j))\in G_{\mathrm{BL}}(Q_X,(Q_{Y_j|X}))$, then $(0,(c_j))\in G_{\mathrm{BL}}(Q_W,(Q_{Y_j|W}))$.

2. Given $Q_X$, $(Q_{Y_j|X})_{j=1}^{m}$ and $(Q_{Z_j|Y_j})_{j=1}^{m}$, assume that $Q_{XY_jZ_j}=Q_XQ_{Y_j|X}Q_{Z_j|Y_j}$ for each $j$. Then $G_{\mathrm{BL}}(Q_X,(Q_{Y_j|X}))\subseteq G_{\mathrm{BL}}(Q_X,(Q_{Z_j|X}))$.
The proof is omitted since it follows immediately from the monotonicity of the relative
entropy and (2.15).
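The monotonicity (data processing) of relative entropy invoked in the omitted proof can be spot-checked numerically on finite alphabets (an illustrative sketch; the random-channel test is an assumption of this example, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(2)

def D(p, q):
    # relative entropy in nats, with the convention 0*log(0/q) = 0
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

# data-processing check: D(PK || QK) <= D(P || Q) for random channels K
k = 5
for _ in range(200):
    P = rng.dirichlet(np.ones(k))
    Q = rng.dirichlet(np.ones(k))
    K = rng.dirichlet(np.ones(k), size=k)   # row x is the kernel P_{Y|X=x}
    assert D(P @ K, Q @ K) <= D(P, Q) + 1e-12
```

Equality is attained for deterministic invertible channels (e.g. a permutation matrix $K$), consistent with the proposition being an inclusion rather than an equality.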
6.1.2 Tensorization
The term "tensorization" refers to the phenomenon of additivity/multiplicativity of certain functional inequalities under tensor products. In information theory this is a central feature of many converse proofs, and is closely related to the fact that some operational problems admit single-letter solutions. In functional analysis, it provides a "particularly cute" [195] tool for proving many inequalities in arbitrary dimensions. As a close example, Lieb's proof [89] of the Brascamp-Lieb inequality relies on a special case of Proposition 6.1.2 below, where the proof uses the (functional version of the) Brascamp-Lieb inequality and the Minkowski inequality. The original proof of the Brascamp-Lieb inequality [83] is also based on a tensor power construction.
Proposition 6.1.2 Suppose $(d^{(i)},(c_j))\in G_{\mathrm{BL}}(Q_X^{(i)},(Q_{Y_j|X}^{(i)}))$ for $i=1,2$. Then
\[
\left(d^{(1)}+d^{(2)},(c_j)\right)\in G_{\mathrm{BL}}\left(Q_X^{(1)}\times Q_X^{(2)},\left(Q_{Y_j|X}^{(1)}\times Q_{Y_j|X}^{(2)}\right)\right)
\]
where $d^{(1)}+d^{(2)}$ is defined as the function
\begin{align}
\mathcal{X}^{(1)}\times\mathcal{X}^{(2)}&\to\mathbb{R}; \tag{6.1}\\
(x_1,x_2)&\mapsto d^{(1)}(x_1)+d^{(2)}(x_2). \tag{6.2}
\end{align}
We provide a simple information-theoretic proof using the chain rules of the relative
entropy. Note that the algebraic expansions here are similar to the ones in the proof
of Gaussian optimality in Section 6.2 or the converse proof for the key generation
problem in Section 6.4.5.
Proof For an arbitrary $P_{X^{(1)}X^{(2)}}$, define $P_{X^{(1)}X^{(2)}Y_j^{(1)}Y_j^{(2)}} := P_{X^{(1)}X^{(2)}} Q_{Y_j|X}^{(1)} Q_{Y_j|X}^{(2)}$. Observe that

$$D(P_{X^{(1)}X^{(2)}} \| Q_X^{(1)} \times Q_X^{(2)}) = D(P_{X^{(1)}} \| Q_X^{(1)}) + D(P_{X^{(2)}|X^{(1)}} \| Q_X^{(2)} | P_{X^{(1)}}). \quad (6.3)$$

$$D(P_{Y_j^{(1)}Y_j^{(2)}} \| Q_{Y_j}^{(1)} \times Q_{Y_j}^{(2)}) = D(P_{Y_j^{(1)}} \| Q_{Y_j}^{(1)}) + D(P_{Y_j^{(2)}|Y_j^{(1)}} \| Q_{Y_j}^{(2)} | P_{Y_j^{(1)}}) \quad (6.4)$$
$$\le D(P_{Y_j^{(1)}} \| Q_{Y_j}^{(1)}) + D(P_{Y_j^{(2)}|X^{(1)}Y_j^{(1)}} \| Q_{Y_j}^{(2)} | P_{X^{(1)}Y_j^{(1)}}) \quad (6.5)$$
$$= D(P_{Y_j^{(1)}} \| Q_{Y_j}^{(1)}) + D(P_{Y_j^{(2)}|X^{(1)}} \| Q_{Y_j}^{(2)} | P_{X^{(1)}}) \quad (6.6)$$

where (6.5) uses Jensen's inequality, and (6.6) is from the Markov chain $Y_j^{(2)} - X^{(1)} - Y_j^{(1)}$, wherein $(X^{(i)}, Y_j^{(i)}) \sim P_{X^{(i)}Y_j^{(i)}}$ for $i = 1, 2$, $j = 1, \dots, m$. By the assumption and the law of total expectation,

$$D(P_{X^{(1)}} \| Q_X^{(1)}) + \mathbb{E}[d^{(1)}(X^{(1)})] \ge \sum_{j=1}^m c_j D(P_{Y_j^{(1)}} \| Q_{Y_j}^{(1)}); \quad (6.7)$$

$$D(P_{X^{(2)}|X^{(1)}} \| Q_X^{(2)} | P_{X^{(1)}}) + \mathbb{E}[d^{(2)}(X^{(2)})] \ge \sum_{j=1}^m c_j D(P_{Y_j^{(2)}|X^{(1)}} \| Q_{Y_j}^{(2)} | P_{X^{(1)}}). \quad (6.8)$$

Adding up (6.7) and (6.8) and applying (6.3) and (6.6), we obtain

$$D(P_{X^{(1)}X^{(2)}} \| Q_X^{(1)} \times Q_X^{(2)}) + \mathbb{E}[(d^{(1)} + d^{(2)})(X^{(1)}, X^{(2)})] \ge \sum_{j=1}^m c_j D(P_{Y_j^{(1)}Y_j^{(2)}} \| Q_{Y_j}^{(1)} \times Q_{Y_j}^{(2)}) \quad (6.9)$$

as desired.
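The chain-rule decomposition (6.3) that drives this proof is easy to check numerically for discrete distributions. The following sketch (with randomly generated distributions, purely illustrative and not from the text) verifies $D(P_{X_1 X_2} \| Q_1 \times Q_2) = D(P_{X_1} \| Q_1) + D(P_{X_2|X_1} \| Q_2 | P_{X_1})$:

```python
import numpy as np

def kl(p, q):
    # relative entropy D(p||q) in nats for discrete distributions
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(0)
# random joint distribution P on a 3x4 alphabet and product reference Q1 x Q2
P = rng.random((3, 4)); P /= P.sum()
Q1 = rng.random(3); Q1 /= Q1.sum()
Q2 = rng.random(4); Q2 /= Q2.sum()

lhs = kl(P.ravel(), np.outer(Q1, Q2).ravel())
P1 = P.sum(axis=1)  # marginal of the first coordinate
# conditional term D(P_{X2|X1} || Q2 | P_{X1}), averaged over the first coordinate
cond = sum(P1[i] * kl(P[i] / P1[i], Q2) for i in range(3))
rhs = kl(P1, Q1) + cond
assert abs(lhs - rhs) < 1e-10
```

The identity is exact; the tolerance only absorbs floating-point rounding.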
A functional proof of the tensorization of reverse Brascamp-Lieb inequalities can be given by generalizing the proof of the tensorization of the Prékopa-Leindler inequality (see for example [196]). Alternatively, information-theoretic proofs of the tensorization of these reverse-type inequalities can be extracted from the proof of Gaussian optimality in Theorem 6.3.1 ahead, and we omit the repetition here.
6.1.3 Convexity
Another property which follows conveniently from the information-theoretic characterization of $G_{\mathrm{BL}}(\cdot)$ is convexity:
Proposition 6.1.3 If $(d_i, (c_j^i)) \in G_{\mathrm{BL}}(Q_X, (Q_{Y_j|X}))$ for $i = 0, 1$, then $(d_\theta, (c_j^\theta)) \in G_{\mathrm{BL}}(Q_X, (Q_{Y_j|X}))$ for $\theta \in [0, 1]$, where we have defined

$$d_\theta := (1 - \theta) d_0 + \theta d_1, \quad (6.10)$$
$$c_j^\theta := (1 - \theta) c_j^0 + \theta c_j^1, \quad \forall j \in \{1, \dots, m\}. \quad (6.11)$$
Proof Follows immediately from (2.36) and taking convex combinations.
Note that by taking $m = 2$, $X = (Y_1, Y_2)$, $d_i(\cdot) = 0$ and $Q_{Y_j|X}$ to be the projection to the coordinates, we recover the Riesz-Thorin theorem on the interpolation of operator norms in the special case of nonnegative kernels. This information-theoretic proof (for this special case) is much simpler than the common proof of the Riesz-Thorin theorem in functional analysis based on the three-lines lemma, because the $c_j$'s only affect the right side of (2.15) as linear coefficients, rather than as tiltings of the distributions or functions.
6.2 Gaussian Optimality in the Forward Brascamp-Lieb Inequality
In this section we prove the Gaussian extremality in several information-theoretic
inequalities related to the forward Brascamp-Lieb inequality. Specifically, we first
establish this for an inequality involving conditional differential entropies, which im-
mediately implies the variants involving conditional mutual informations or differ-
ential entropies; the latter is directly connected to the Brascamp-Lieb inequality, as
Theorem 2.2.3 showed. These extremal inequalities have implications for certain op-
erational problems in information theory, and quite interestingly, the essential steps
in the proofs of these extremal inequalities follow the same patterns as the converse
proofs for the corresponding operational problems.
Of course, in the functional analysis literature there has been a lot of work supply-
ing various proofs of the Brascamp-Lieb inequality (Theorem 2.2.1), based on different
properties of the Gaussian distribution/function which we summarize below:
• [83] The tensor power of a one-dimensional Gaussian distribution is a multidi-
mensional Gaussian distribution, which is stable under Schwarz symmetrization
(i.e. spherically decreasing rearrangement).
• [89][197] Rotational invariance: if $f$ is a one-dimensional Gaussian function, then

$$f(x) f(y) = f\left(\frac{x - y}{\sqrt{2}}\right) f\left(\frac{x + y}{\sqrt{2}}\right). \quad (6.12)$$
• [90, Lemma 2] The convolution of Gaussian functions is Gaussian.
• [76] If a real valued random variable is added to an independent Gaussian noise,
then the derivative of the differential entropy of the sum with respect to the
variance of the noise is half the Fisher information (de Bruijn’s identity), and of
course the non-Gaussianness of the sum eventually disappears as the variance
goes to infinity.
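The rotational-invariance identity (6.12) can be confirmed directly: for $f(x) = e^{-x^2/2}$, both sides equal $e^{-(x^2+y^2)/2}$. A quick numeric sanity check (an illustration, not part of the thesis):

```python
import numpy as np

def f(x):
    # a centered one-dimensional Gaussian function (unnormalized)
    return np.exp(-x**2 / 2)

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = rng.standard_normal(1000)

# identity (6.12): f(x) f(y) = f((x - y)/sqrt(2)) f((x + y)/sqrt(2))
lhs = f(x) * f(y)
rhs = f((x - y) / np.sqrt(2)) * f((x + y) / np.sqrt(2))
assert np.allclose(lhs, rhs)
```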
Among these, the rotational invariance principle also forms the basis of the proof strategy in this thesis. Rotational invariance is a characterizing property of the Gaussian distribution: two independent random variables are both Gaussian if their sum is independent of their difference (cf. Cramér's theorem [198]). This rotation invariance argument³ has been used in establishing Gaussian extremality by Lieb [89], Carlen [197], and recently in information theory by Geng-Nair [92][199], Courtade-Jiao [200] and Courtade [201]. Some related ideas have also appeared in the literature on the
Brascamp-Lieb inequality, such as the observation, due to Ball, that convolution preserves the extremizers of the Brascamp-Lieb inequality [90, Lemma 2]. However, as keenly
³This argument was referred to as "$O(2)$-invariance" in [98] and the "doubling trick" in [197].
noted in [92], applying the rotation invariance/doubling trick on the information-
theoretic formulation has certain advantages. For example, the chain rules provide
convenient tools, and the establishment of the extremality usually follows similar steps
as the converse proofs of the corresponding operational problems in information the-
ory. Since the optimization problems we consider involve many information-theoretic
terms, we introduce a simplification/strengthening of the Geng-Nair approach by per-
turbing the coefficients in the objective function (see Remark 6.2.2), thus giving rise to identities which come in handy in the proof. A similar idea was used in [200], and it should be applicable to a wide range of other problems.
In this section, X ,Y1, . . . ,Ym are assumed to be Euclidean spaces of dimensions
n, n1, . . . , nm. To be specific about the notions of Gaussian optimality, we adopt some
terminologies from [98]:
Definition 6.2.1
• Extremisability: a certain supremization/infimization is finitely attained by some argument.
• Gaussian extremisability: a certain supremization/infimization is finitely attained by a Gaussian function/Gaussian distribution.
• Gaussian exhaustibility: the value of a certain supremization/infimization does not change when the arguments are restricted to the subclass of Gaussian functions/Gaussian distributions.
Most of the time, we will be able to prove Gaussian extremisability in a certain non-degenerate case, while showing Gaussian exhaustibility in general.
6.2.1 Optimization of Conditional Differential Entropies
Fix $\mathbf{M} \succeq 0$, $c_0 \in [0, \infty)$, $c_1, \dots, c_m \in (0, \infty)$, and Gaussian random transformations $Q_{\mathbf{Y}_j|\mathbf{X}}$ for $j \in \{1, \dots, m\}$. For each $P_{\mathbf{X}U}$,⁴ define

$$F(P_{\mathbf{X}U}) := h(\mathbf{X}|U) - \sum_{j=1}^m c_j h(\mathbf{Y}_j|U) - c_0 \operatorname{Tr}[\mathbf{M} \Sigma_{\mathbf{X}|U}], \quad (6.13)$$

where $\mathbf{X} \sim P_{\mathbf{X}}$ (the marginal of $P_{\mathbf{X}U}$) and $\mathbf{Y}_j$ has the distribution induced by $P_{\mathbf{X}} \to Q_{\mathbf{Y}_j|\mathbf{X}} \to P_{\mathbf{Y}_j}$. We have defined the differential entropy and the conditional differential entropies as

$$h(\mathbf{X}) := -D(P_{\mathbf{X}} \| \lambda); \quad (6.14)$$
$$h(\mathbf{X}|U = u) := -D(P_{\mathbf{X}|U=u} \| \lambda), \quad \forall u \in \mathcal{U}; \quad (6.15)$$
$$h(\mathbf{X}|U) := \int h(\mathbf{X}|U = u) \, \mathrm{d}P_U(u), \quad (6.16)$$

where $\lambda$ is the Lebesgue measure (of the same dimension as $\mathbf{X}$), and (6.16) is defined whenever the integral exists. Moreover, we have used the notation $\Sigma_{\mathbf{X}|U} := \mathbb{E}[\operatorname{Cov}(\mathbf{X}|U)]$ for the expectation of the conditional covariance matrix.
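As a sanity check on the sign convention $h(\mathbf{X}) = -D(P_{\mathbf{X}} \| \lambda)$ in (6.14), the following sketch (with an arbitrary illustrative variance, not from the text) integrates $-\int f \log f$ for a one-dimensional Gaussian density and compares with the closed form $\frac{1}{2}\log(2\pi e \sigma^2)$:

```python
import numpy as np

def trapezoid(y, x):
    # simple trapezoidal rule, avoiding dependence on a specific NumPy version
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

sigma2 = 1.7**2  # an arbitrary variance for illustration
xs = np.linspace(-12 * np.sqrt(sigma2), 12 * np.sqrt(sigma2), 200001)
pdf = np.exp(-xs**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# h(X) = -D(P_X || lambda) = -∫ f log f for P_X with density f w.r.t. Lebesgue
h_numeric = -trapezoid(pdf * np.log(pdf), xs)
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma2)
assert abs(h_numeric - h_closed) < 1e-6
```

The small discrepancy comes only from tail truncation and quadrature error.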
Definition 6.2.2 We say $(Q_{\mathbf{Y}_1|\mathbf{X}}, \dots, Q_{\mathbf{Y}_m|\mathbf{X}})$ is non-degenerate if each $Q_{\mathbf{Y}_j|\mathbf{X}=0}$ is an $n_j$-dimensional Gaussian distribution with an invertible covariance matrix.
In the non-degenerate case, we can show an extremal result for the following
optimization with a regularization on the covariance of the input.
Theorem 6.2.1 If $(Q_{\mathbf{Y}_1|\mathbf{X}}, \dots, Q_{\mathbf{Y}_m|\mathbf{X}})$ is non-degenerate, then $\sup_{P_{\mathbf{X}U}} \{F(P_{\mathbf{X}U}) : \Sigma_{\mathbf{X}|U} \preceq \Sigma\}$ is finite and is attained by a Gaussian $\mathbf{X}$ and a constant $U$. Moreover, the covariance of such an $\mathbf{X}$ is unique.
⁴In the case of standard Borel spaces, the conditional distribution $P_{\mathbf{X}|U}$ can be uniquely defined from the joint distribution $P_{\mathbf{X}U}$, $P_U$-almost surely; see e.g. [43].
In Theorem 6.2.1, we assume that the supremum is over $P_{\mathbf{X}U}$ such that $P_{\mathbf{X}|U=u}$ is absolutely continuous with respect to the Lebesgue measure (hence having a density function) for almost every $u$.⁵ Additionally, we adopt the following convention in all the optimization problems in Section 6.2 and Section 6.4, unless otherwise specified. This eliminates situations such as $\infty + \infty$ or $a + \infty$, which can be considered legitimate calculations but are technically difficult to deal with.
Convention 1 The sup or inf is taken over all arguments such that each term in the objective function (e.g. (6.13)) is well-defined and finite.
Proof of Theorem 6.2.1 Assume that both $P_{\mathbf{X}^{(1)}U^{(1)}}$ and $P_{\mathbf{X}^{(2)}U^{(2)}}$ are maximizers of (6.13) subject to $\Sigma_{\mathbf{X}|U} \preceq \Sigma$; the proof of the existence of a maximizer is deferred to Appendix 6.5. Let $(U^{(1)}, \mathbf{X}^{(1)}, \mathbf{Y}_1^{(1)}, \dots, \mathbf{Y}_m^{(1)}) \sim P_{\mathbf{X}^{(1)}U^{(1)}} Q_{\mathbf{Y}_1|\mathbf{X}} \cdots Q_{\mathbf{Y}_m|\mathbf{X}}$ and $(U^{(2)}, \mathbf{X}^{(2)}, \mathbf{Y}_1^{(2)}, \dots, \mathbf{Y}_m^{(2)}) \sim P_{\mathbf{X}^{(2)}U^{(2)}} Q_{\mathbf{Y}_1|\mathbf{X}} \cdots Q_{\mathbf{Y}_m|\mathbf{X}}$ be mutually independent. Define

$$\mathbf{X}^+ = \frac{1}{\sqrt{2}}\left(\mathbf{X}^{(1)} + \mathbf{X}^{(2)}\right), \qquad \mathbf{X}^- = \frac{1}{\sqrt{2}}\left(\mathbf{X}^{(1)} - \mathbf{X}^{(2)}\right). \quad (6.17)$$

Define $\mathbf{Y}_j^+$ and $\mathbf{Y}_j^-$ similarly for $j = 1, \dots, m$, and put $U = (U^{(1)}, U^{(2)})$. We now make three important observations:

1. First, due to the Gaussian nature of $Q_{\mathbf{Y}_j|\mathbf{X}}$, it is easily seen that $\mathbf{Y}_j^+ | \{\mathbf{X}^+ = \mathbf{x}^+, \mathbf{X}^- = \mathbf{x}^-, U = u\} \sim Q_{\mathbf{Y}_j|\mathbf{X}=\mathbf{x}^+}$, which is independent of $\mathbf{x}^-$. Thus $\mathbf{Y}_j^+ | \{\mathbf{X}^+ = \mathbf{x}, U = u\} \sim Q_{\mathbf{Y}_j|\mathbf{X}=\mathbf{x}}$ as well. Similarly, $\mathbf{Y}_j^- | \{\mathbf{X}^- = \mathbf{x}, U = u\} \sim Q_{\mathbf{Y}_j|\mathbf{X}=\mathbf{x}}$ for $j = 1, \dots, m$.
⁵Since the integral in (6.16) may not be well-defined in general, some authors have restricted attention to finite $\mathcal{U}$ in the optimization problems. In this chapter, we are allowed to drop this restriction as long as the integral in (6.16) is well-defined. These distinctions do not appear to make an essential difference for our purpose; see Footnote 13.
2. Second, observe that for each $u = (u_1, u_2)$ we can verify the algebra

$$\Sigma_{\mathbf{X}^+|U=u} = \mathbb{E}[(\mathbf{X}^+ - \mu_{\mathbf{X}^+|U})(\mathbf{X}^+ - \mu_{\mathbf{X}^+|U})^\top | U = u]$$
$$= \tfrac{1}{2}\mathbb{E}[(\mathbf{X}^{(1)} - \mu_{\mathbf{X}^{(1)}|U})(\mathbf{X}^{(1)} - \mu_{\mathbf{X}^{(1)}|U})^\top | U = u] + \tfrac{1}{2}\mathbb{E}[(\mathbf{X}^{(2)} - \mu_{\mathbf{X}^{(2)}|U})(\mathbf{X}^{(2)} - \mu_{\mathbf{X}^{(2)}|U})^\top | U = u] + \mathbb{E}[(\mathbf{X}^{(1)} - \mu_{\mathbf{X}^{(1)}|U})(\mathbf{X}^{(2)} - \mu_{\mathbf{X}^{(2)}|U})^\top | U = u]$$
$$= \tfrac{1}{2}\mathbb{E}[(\mathbf{X}^{(1)} - \mu_{\mathbf{X}^{(1)}|U^{(1)}})(\mathbf{X}^{(1)} - \mu_{\mathbf{X}^{(1)}|U^{(1)}})^\top | U = u] + \tfrac{1}{2}\mathbb{E}[(\mathbf{X}^{(2)} - \mu_{\mathbf{X}^{(2)}|U^{(2)}})(\mathbf{X}^{(2)} - \mu_{\mathbf{X}^{(2)}|U^{(2)}})^\top | U = u] + \mathbb{E}[(\mathbf{X}^{(1)} - \mu_{\mathbf{X}^{(1)}|U^{(1)}})(\mathbf{X}^{(2)} - \mu_{\mathbf{X}^{(2)}|U^{(2)}})^\top | U = u]. \quad (6.18)$$

The last term above vanishes upon averaging over $(u_1, u_2)$ because of the independence $(U^{(1)}, \mathbf{X}^{(1)}) \perp (U^{(2)}, \mathbf{X}^{(2)})$. Thus

$$\Sigma_{\mathbf{X}^+|U} = \tfrac{1}{2}\Sigma_{\mathbf{X}^{(1)}|U^{(1)}} + \tfrac{1}{2}\Sigma_{\mathbf{X}^{(2)}|U^{(2)}} \preceq \Sigma. \quad (6.19)$$

By the same token,

$$\Sigma_{\mathbf{X}^-|U} = \tfrac{1}{2}\Sigma_{\mathbf{X}^{(1)}|U^{(1)}} + \tfrac{1}{2}\Sigma_{\mathbf{X}^{(2)}|U^{(2)}} \preceq \Sigma. \quad (6.20)$$

These, combined with $\Sigma_{\mathbf{X}^-|\mathbf{X}^+ U} \preceq \Sigma_{\mathbf{X}^-|U}$, justify that both $P_{\mathbf{X}^+, U}$ and $P_{\mathbf{X}^-, U\mathbf{X}^+}$ satisfy the covariance constraint $\Sigma_{\mathbf{X}|U} \preceq \Sigma$ in the theorem.
3. Third, we have

$$\sum_{k=1}^2 \left[ h(\mathbf{X}^{(k)}|U^{(k)}) - \sum_{j=1}^m c_j h(\mathbf{Y}_j^{(k)}|U^{(k)}) \right] \quad (6.21)$$
$$= h(\mathbf{X}^{(1)}, \mathbf{X}^{(2)}|U) - \sum_{j=1}^m c_j h(\mathbf{Y}_j^{(1)}, \mathbf{Y}_j^{(2)}|U) \quad (6.22)$$
$$= h(\mathbf{X}^+, \mathbf{X}^-|U) - \sum_{j=1}^m c_j h(\mathbf{Y}_j^+, \mathbf{Y}_j^-|U) \quad (6.23)$$
$$= h(\mathbf{X}^+|U) - \sum_{j=1}^m c_j h(\mathbf{Y}_j^+|U) + h(\mathbf{X}^-|\mathbf{X}^+, U) - \sum_{j=1}^m c_j h(\mathbf{Y}_j^-|\mathbf{Y}_j^+, U) \quad (6.24)$$
$$\le h(\mathbf{X}^+|U) - \sum_{j=1}^m c_j h(\mathbf{Y}_j^+|U) + h(\mathbf{X}^-|\mathbf{X}^+, U) - \sum_{j=1}^m c_j h(\mathbf{Y}_j^-|\mathbf{X}^+, U), \quad (6.25)$$

where the final inequality follows from the Markov chain $\mathbf{Y}_j^- - \mathbf{X}^+ U - \mathbf{Y}_j^+$, which holds because the joint distribution factorizes as $P_{U\mathbf{X}^+\mathbf{X}^-\mathbf{Y}_j^+\mathbf{Y}_j^-} = P_{U\mathbf{X}^+\mathbf{X}^-} Q_{\mathbf{Y}_j|\mathbf{X}} Q_{\mathbf{Y}_j|\mathbf{X}}$.
Thus, we can conclude that

$$\sum_{k=1}^2 F(P_{\mathbf{X}^{(k)}U^{(k)}}) = \sum_{k=1}^2 \left[ h(\mathbf{X}^{(k)}|U^{(k)}) - \sum_{j=1}^m c_j h(\mathbf{Y}_j^{(k)}|U^{(k)}) - c_0 \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}^{(k)}|U^{(k)}}] \right] \quad (6.26)$$
$$\le h(\mathbf{X}^+|U) - \sum_{j=1}^m c_j h(\mathbf{Y}_j^+|U) - c_0 \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}^+|U}] + h(\mathbf{X}^-|\mathbf{X}^+, U) - \sum_{j=1}^m c_j h(\mathbf{Y}_j^-|\mathbf{X}^+, U) - c_0 \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}^-|\mathbf{X}^+, U}] \quad (6.27)$$
$$\le \sum_{k=1}^2 F(P_{\mathbf{X}^{(k)}U^{(k)}}), \quad (6.28)$$

where
where
• (6.27) follows from (6.19),(6.20) and (6.25);
• (6.28) follows since $P_{\mathbf{X}^+, U}$ and $P_{\mathbf{X}^-, U\mathbf{X}^+}$ are candidate optimizers of (6.13) subject to the given covariance constraint, whereas the $P_{\mathbf{X}^{(k)}, U^{(k)}}$ are the optimizers by assumption ($k = 1, 2$).

Then, equality must hold throughout (6.26)-(6.28), so both $P_{\mathbf{X}^+, U}$ and $P_{\mathbf{X}^-, U\mathbf{X}^+}$ (and also $P_{\mathbf{X}^+, U\mathbf{X}^-}$, by symmetry of the argument) are maximizers of (6.13) subject to $\Sigma_{\mathbf{X}|U} \preceq \Sigma$.
So far, we have considered fixed coefficients $c_0^m = (c_0, \dots, c_m)$. The same argument applies for coefficients on a line:

$$c_0^m(t) := t a_0^m, \quad t > 0 \quad (6.29)$$

for any fixed $a_0 \in [0, \infty)$, $a_1, \dots, a_m \in (0, \infty)$, and we next show several properties for a dense subset of this line. Applying Lemma 6.2.3 (following this proof) with

$$p(P_{\mathbf{X}U}) \leftarrow h(\mathbf{X}|U); \quad (6.30)$$
$$q(P_{\mathbf{X}U}) \leftarrow -\sum_{j=1}^m a_j h(\mathbf{Y}_j|U) - a_0 \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}|U}], \quad (6.31)$$
$$f(t) \leftarrow \max_{P_{\mathbf{X}U}} [p(P_{\mathbf{X}U}) + t q(P_{\mathbf{X}U})], \quad (6.32)$$

we obtain from the optimality of $P_{\mathbf{X}^+, U}$, $P_{\mathbf{X}^-, U\mathbf{X}^+}$ and $P_{\mathbf{X}^+, U\mathbf{X}^-}$ that

$$h(\mathbf{X}^+|U) = h(\mathbf{X}^-|\mathbf{X}^+, U) = h(\mathbf{X}^+|\mathbf{X}^-, U) \quad (6.33)$$

for almost all $t \in (0, \infty)$, where $P_{\mathbf{X}^+\mathbf{X}^-U}$ depends implicitly on $t$. Note that (6.33) implies that $I(\mathbf{X}^+; \mathbf{X}^-|U) = 0$, hence $\mathbf{X}^+$ and $\mathbf{X}^-$ are independent conditioned on $U$. Recall the following Skitovic-Darmois characterization of Gaussian distributions (with the extension to the vector Gaussian case in [92]):
Lemma 6.2.2 Let $\mathbf{A}_1$ and $\mathbf{A}_2$ be mutually independent $n$-dimensional random vectors. If $\mathbf{A}_1 + \mathbf{A}_2$ is independent of $\mathbf{A}_1 - \mathbf{A}_2$, then $\mathbf{A}_1$ and $\mathbf{A}_2$ are normally distributed with identical covariances.
Using Lemma 6.2.2, we can conclude that for almost all $t \in (0, \infty)$, $\mathbf{X}^{(i)}$ must be Gaussian, with covariance not depending on $U^{(i)}$, so $U^{(i)}$ can be chosen to be a constant ($i = 1, 2$). Thus for all such $t$,

$$f(t) = \max_{P_{\mathbf{X}U} : U = \text{const.}, \ \mathbf{X} \text{ Gaussian}} [p(P_{\mathbf{X}U}) + t q(P_{\mathbf{X}U})]. \quad (6.34)$$

Since both sides of (6.34) are concave in $t$, hence continuous on $(0, \infty)$, we see that (6.34) actually holds for all $t \in (0, \infty)$. The proof is completed since $a_0^m$ can be chosen arbitrarily.
Lemma 6.2.3 Let $p$ and $q$ be real-valued functions on an arbitrary set $D$. If $f(t) := \max_{x \in D}\{p(x) + t q(x)\}$ is always attained, then for almost all $t$, $f'(t)$ exists and

$$f'(t) = q(x^\star), \quad \forall x^\star \in \arg\max_{x \in D}\{p(x) + t q(x)\}. \quad (6.35)$$

In particular, for all such $t$, $q(x^\star) = q(x^{\star\star})$ and $p(x^\star) = p(x^{\star\star})$ for all $x^\star, x^{\star\star} \in \arg\max_{x \in D}\{p(x) + t q(x)\}$.

Geometrically, $f(t)$ is the support function [94] of the set $S := \{(p(x), q(x))\}_{x \in D}$ evaluated at $(1, t)$. Hence $f(\cdot)$ is convex, and the left and right derivatives are determined by the two extreme points of the intersection between $S$ and the supporting hyperplane.
Proof of Lemma 6.2.3 The function $f$ is convex since it is a pointwise supremum of affine functions, and is therefore differentiable almost everywhere. Moreover, $f'(t)$ (which is well-defined in the a.e. sense) is monotone increasing by convexity, and is therefore continuous almost everywhere.

Let $x_t^\star$ denote an arbitrary element of $\arg\max_{x \in D}\{p(x) + t q(x)\}$. By definition, for any $s \in \mathbb{R}$, $f(s) \ge p(x_t^\star) + s q(x_t^\star)$. Thus, for $\delta > 0$,

$$\frac{f(t + \delta) - f(t)}{\delta} \ge q(x_t^\star) \quad \text{and} \quad \frac{f(t) - f(t - \delta)}{\delta} \le q(x_t^\star). \quad (6.36)$$

If $f'(t)$ exists, that is, the left sides of the two inequalities above have the same limit $f'(t)$ as $\delta \downarrow 0$, then $f'(t) = q(x_t^\star)$. The second claim of the lemma follows immediately from the first.
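Lemma 6.2.3 is an envelope-type statement; on a finite domain it can be checked numerically. In the sketch below (random data, purely illustrative), the two-sided difference quotient of $f$ at a generic $t$ matches $q$ evaluated at the maximizer:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.standard_normal(50)  # p(x) on a 50-point domain D
q = rng.standard_normal(50)  # q(x) on the same domain

def f(t):
    # f(t) = max_{x in D} { p(x) + t q(x) }, always attained on a finite D
    return float(np.max(p + t * q))

t = 0.73  # a generic point; the piecewise-linear f is differentiable a.e.
x_star = int(np.argmax(p + t * q))
delta = 1e-6
numeric_derivative = (f(t + delta) - f(t - delta)) / (2 * delta)
assert abs(numeric_derivative - q[x_star]) < 1e-5  # f'(t) = q(x*)
```

Since $f$ is a maximum of finitely many affine functions, its slope between kinks is exactly $q(x^\star)$.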
Remark 6.2.1 Let us remark on an interesting connection between the above proof of the optimality of Gaussian random variables and Lieb's proof that Gaussian functions maximize Gaussian kernels. Recall that [89, Theorem 3.2] shows that for an operator $G$ given by a bivariate Gaussian kernel function and $p, q > 1$, the ratio $\frac{\|Gf\|_q}{\|f\|_p}$ is maximized by Gaussian $f$. First, a tensorization property is proved, implying that $f^\star(x_1) f^\star(x_2)$ is a maximizer for $G \otimes G$ if $f^\star$ is any maximizer of $G$. Then, Lieb made two important observations:

1. By a rotation invariance property of the Lebesgue measure/isotropic Gaussian measure, $f^\star\left(\frac{x_1 + x_2}{\sqrt{2}}\right) f^\star\left(\frac{x_1 - x_2}{\sqrt{2}}\right)$ is also a maximizer of $G \otimes G$.

2. An examination of the equality condition in the proof of the tensorization property reveals that any maximizer for $G \otimes G$ must be of a product form.

Thus Lieb concluded that $f^\star\left(\frac{x_1 + x_2}{\sqrt{2}}\right) f^\star\left(\frac{x_1 - x_2}{\sqrt{2}}\right) = \alpha(x_1) \beta(x_2)$ for some functions $\alpha$ and $\beta$, and $f^\star$ must be a Gaussian function. This is very similar to the above proof, once we think of $f^\star$ as the density function of $P^\star_{\mathbf{X}|U=u}$ in our proof.
Remark 6.2.2 Our proof technique essentially follows ideas of Geng and Nair [92][199], who established the Gaussian optimality for several information-theoretic regions. However, we also added the important ingredient of Lemma 6.2.3.⁶ That is,
⁶A similar idea of differentiating the coefficients has been used in [200].
by differentiating with respect to the linear coefficients, we can conveniently obtain information-theoretic identities which help us to conclude the conditional independence of $\mathbf{X}^+$ and $\mathbf{X}^-$ quickly. For fixed $m$, in principle, this may be avoided by trying various expansions of the two-letter quantities manually (e.g. as done in [199]), but that approach becomes increasingly complicated and unstructured as $m$ increases. Finally, we also note that a simple rotational invariance argument/doubling trick has been used to prove that the capacity-achieving distribution for an additive Gaussian channel is Gaussian (cf. [202, P36]); that argument does not involve Lemma 6.2.2, and its extension to problems involving auxiliary random variables is not clear.
If we drop the non-degeneracy assumption and the regularization $\Sigma_{\mathbf{X}|U} \preceq \Sigma$, it is quite possible that the optimization in Theorem 6.2.1 is nonfinite and/or not attained by any $P_{U\mathbf{X}}$. In this case, we can show that the optimization is exhausted by Gaussian distributions. To state the result conveniently, for any $P_{\mathbf{X}}$, define

$$F_0(P_{\mathbf{X}}) := h(\mathbf{X}) - \sum_{j=1}^m c_j h(\mathbf{Y}_j) - c_0 \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}}], \quad (6.37)$$

where $(\mathbf{X}, \mathbf{Y}_j) \sim P_{\mathbf{X}} Q_{\mathbf{Y}_j|\mathbf{X}}$. Clearly, $F(P_{\mathbf{X}U}) = F_0(P_{\mathbf{X}})$ when $U$ is constant.
Theorem 6.2.4 In the general (possibly degenerate) case,

1. For any given positive semidefinite $\Sigma$,

$$\sup_{P_{\mathbf{X}U} : \Sigma_{\mathbf{X}|U} \preceq \Sigma} F(P_{\mathbf{X}U}) = \sup_{P_{\mathbf{X}} \text{ Gaussian}, \ \Sigma_{\mathbf{X}} \preceq \Sigma} F_0(P_{\mathbf{X}}) \quad (6.38)$$

where the left side of (6.38) follows Convention 1.

2.

$$\sup_{P_{\mathbf{X}U}} F(P_{\mathbf{X}U}) = \sup_{P_{\mathbf{X}} \text{ Gaussian}} F_0(P_{\mathbf{X}}). \quad (6.39)$$
Remark 6.2.3 Theorem 6.2.4 reduces an infinite-dimensional optimization problem to a finite-dimensional one. In particular, in the degenerate case we can verify that the left side of (6.38) is extremisable (Definition 6.2.1) if the right side of (6.38) is extremisable.
Proof of Theorem 6.2.4 Note that the Gaussian random transformation $Q_{\mathbf{Y}_j|\mathbf{X}}$ can be realized as a linear transformation $\mathbf{B}_j$ to $\mathbb{R}^{n_j}$ followed by the addition of an independent Gaussian noise of covariance $\Sigma_j$. We will assume that $\Sigma_j$ is non-degenerate on the orthogonal complement of the image of $\mathbf{B}_j$, since otherwise both sides of (6.38) are $+\infty$. In particular, if $Q_{\mathbf{Y}_j|\mathbf{X}}$ is a deterministic linear transform $\mathbf{B}_j$, then it must be onto $\mathbb{R}^{n_j}$.
1. We can use the argument at the beginning of Appendix 6.5 to show that the left side of (6.38) equals

$$\sup_{P_{\mathbf{X}U} : \ \mathcal{U} = \{1, \dots, D\}, \ \Sigma_{\mathbf{X}|U} \preceq \Sigma} F(P_{\mathbf{X}U}) \quad (6.40)$$

where $D$ is a fixed integer (depending only on the dimension of $\Sigma$).

Next, we argue that it is without loss of generality to add the restriction to (6.40) that $P_{\mathbf{X}|U=u}$ has a smooth density with compact support for each $u \in \{1, \dots, D\}$, in which case each $P_{\mathbf{Y}_j|U=u}$ will have a smooth density by the assumption made at the beginning of the proof. For this, define $F_0(\cdot)$ as in (6.37); we will show that for any $P_{\mathbf{X}}$ absolutely continuous with respect to the Lebesgue measure and any $\varepsilon > 0$, we can choose a distribution $P_{\mathbf{X}''}$ whose density is smooth with compact support such that

$$F_0(P_{\mathbf{X}''}) \ge F_0(P_{\mathbf{X}}) - \varepsilon \quad (6.41)$$

and

$$\Sigma_{\mathbf{X}''} \preceq (1 + \varepsilon) \Sigma_{\mathbf{X}}. \quad (6.42)$$
First, by conditioning $\mathbf{X}$ on a large enough compact set and applying the dominated convergence theorem, there exists $P_{\mathbf{X}'}$ which is supported on a compact set and whose density $f_{\mathbf{X}'}$ is bounded, such that each term in the definition of $F_0(P_{\mathbf{X}})$ is well-approximated when $\mathbf{X}$ is replaced with $\mathbf{X}'$, so that

$$F_0(P_{\mathbf{X}'}) \ge F_0(P_{\mathbf{X}}) - \frac{\varepsilon}{2} \quad (6.43)$$

and

$$\Sigma_{\mathbf{X}'} \preceq \sqrt{1 + \varepsilon}\, \Sigma_{\mathbf{X}}. \quad (6.44)$$

Second, we can pick $P_{\mathbf{X}''}$ with smooth density with compact support such that $\|f_{\mathbf{X}''} - f_{\mathbf{X}'}\|_1$ can be made arbitrarily small, which implies that $\|f_{\mathbf{Y}_j''} - f_{\mathbf{Y}_j'}\|_1$ is small by Jensen's inequality, and that $\operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}''}] - \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}'}]$ is small by the fact that the densities are supported on a compact set. Since $x \mapsto x \log x$ is Lipschitz on a bounded interval, the differential entropy terms can be made close as well (here we used the boundedness of $f_{\mathbf{X}'}$). As a result, we can ensure that

$$F_0(P_{\mathbf{X}''}) \ge F_0(P_{\mathbf{X}'}) - \frac{\varepsilon}{2} \quad (6.45)$$

and

$$\Sigma_{\mathbf{X}''} \preceq \sqrt{1 + \varepsilon}\, \Sigma_{\mathbf{X}'}, \quad (6.46)$$
which, combined with (6.43) and (6.44), yield (6.41)-(6.42). This and the observation in (6.40) imply that

$$\inf_{\varepsilon > 0} \sup_{P_{\mathbf{X}U} \in \mathcal{C}_\varepsilon} F(P_{\mathbf{X}U}) \le \sup_{P_{\mathbf{X}U} : \Sigma_{\mathbf{X}|U} \preceq \Sigma} F(P_{\mathbf{X}U}) \quad (6.47)$$

where $\mathcal{C}_\varepsilon$ denotes the set of $P_{\mathbf{X}U}$ such that $\mathcal{U} = \{1, \dots, D\}$, $\Sigma_{\mathbf{X}|U} \preceq (1 + \varepsilon)\Sigma$, and the density of $P_{\mathbf{X}|U=u}$ is smooth with compact support for each $u$. In the final steps, we write $F$ as $F^Q$ to indicate its dependence on $(Q_{\mathbf{Y}_1|\mathbf{X}}, \dots, Q_{\mathbf{Y}_m|\mathbf{X}})$, and define $\mathcal{Q}$ as the set of non-degenerate $(Q_{\tilde{\mathbf{Y}}_1|\mathbf{X}}, \dots, Q_{\tilde{\mathbf{Y}}_m|\mathbf{X}})$ where each $\tilde{\mathbf{Y}}_j$ is obtained by adding an independent Gaussian noise with covariance $\tilde{\Sigma}_j$ to $\mathbf{Y}_j$ (here the random transformation from $\mathbf{X}$ to $\mathbf{Y}_j$ is fixed and $\tilde{\Sigma}_j$ runs over the set of positive definite matrices of the given dimension, $j = 1, \dots, m$). Then

$$\sup_{P_{\mathbf{X}U} \in \mathcal{C}_\varepsilon} F^Q(P_{\mathbf{X}U}) = \sup_{P_{\mathbf{X}U} \in \mathcal{C}_\varepsilon} \sup_{\tilde{Q} \in \mathcal{Q}} F^{\tilde{Q}}(P_{\mathbf{X}U}) \quad (6.48)$$
$$= \sup_{\tilde{Q} \in \mathcal{Q}} \sup_{P_{\mathbf{X}U} \in \mathcal{C}_\varepsilon} F^{\tilde{Q}}(P_{\mathbf{X}U}) \quad (6.49)$$
$$= \sup_{\tilde{Q} \in \mathcal{Q}} \sup_{P_{\mathbf{X}} \text{ Gaussian}, \ \Sigma_{\mathbf{X}} \preceq (1 + \varepsilon)\Sigma} F_0^{\tilde{Q}}(P_{\mathbf{X}}) \quad (6.50)$$
$$= \sup_{P_{\mathbf{X}} \text{ Gaussian}, \ \Sigma_{\mathbf{X}} \preceq (1 + \varepsilon)\Sigma} \sup_{\tilde{Q} \in \mathcal{Q}} F_0^{\tilde{Q}}(P_{\mathbf{X}}) \quad (6.51)$$
$$= \sup_{P_{\mathbf{X}} \text{ Gaussian}, \ \Sigma_{\mathbf{X}} \preceq (1 + \varepsilon)\Sigma} F_0^Q(P_{\mathbf{X}}) \quad (6.52)$$

where
where
• (6.48) and (6.52) are because, first, hpYjq ě hpYjq by the entropy power
inequality; second, infΣją0 hpYjq “ hpYjq. The proof of the second claim
is standard, since Yj has smooth density with fast (Gaussian like) decay,
242
and we can obtain pointwise convergence of the density function and apply
dominated convergence theorem.7
• (6.50) is from the Gaussian extremality in the non-degenerate case.
Then (6.47) and (6.52) give

$$\sup_{P_{\mathbf{X}U} : \Sigma_{\mathbf{X}|U} \preceq \Sigma} F(P_{\mathbf{X}U}) \le \sup_{\varepsilon > 0} \sup_{P_{\mathbf{X}} \text{ Gaussian}, \ \Sigma_{\mathbf{X}} \preceq (1 + \varepsilon)\Sigma} F_0^Q(P_{\mathbf{X}}) \quad (6.53)$$
$$= \sup_{P_{\mathbf{X}} \text{ Gaussian}, \ \Sigma_{\mathbf{X}} \preceq \Sigma} F_0^Q(P_{\mathbf{X}}) \quad (6.54)$$

where (6.54) is a property that can be verified without much difficulty for Gaussian distributions. Thus the $\le$ part of (6.38) is established. The other direction is immediate from the definition.
2. First, observe that it is without loss of generality to assume that $U$ is constant, that is,

$$\sup_{P_{\mathbf{X}U}} F(P_{\mathbf{X}U}) = \sup_{P_{\mathbf{X}}} F_0(P_{\mathbf{X}}). \quad (6.55)$$

Next, by the same argument as in 1), for any $P_{\mathbf{X}}$ and $\varepsilon > 0$, there exists $P_{\mathbf{X}'}$ such that $\|\mathbf{X}'\| < M$ with probability one for some finite $M > 0$, and such that

$$F_0(P_{\mathbf{X}'}) \ge F_0(P_{\mathbf{X}}) - \varepsilon. \quad (6.56)$$
⁷Such continuity in the variance of the additive noise can fail terribly when $f_{\mathbf{X}}$ does not have the decay properties; in [203, Proposition 4], Bobkov and Chistyakov provided an example of a random vector $\mathbf{X}$ with finite differential entropy such that $h(\mathbf{X} + \mathbf{Z}) = \infty$ for each $\mathbf{Z}$ independent of $\mathbf{X}$ and having finite differential entropy. In their example, $\sup_{x : \|x\| \ge r} f_{\mathbf{X}}(x) = 1$ for every $r > 0$, and we cannot obtain a dominating function for $|f_{\mathbf{X}+\mathbf{Z}} \log f_{\mathbf{X}+\mathbf{Z}}|$ to apply the dominated convergence theorem.
Then

$$F_0(P_{\mathbf{X}'}) \le \sup_{\Sigma \succeq 0} \sup_{P_{\mathbf{X}U} : \Sigma_{\mathbf{X}} \preceq \Sigma} F(P_{\mathbf{X}U}) \quad (6.57)$$
$$\le \sup_{\Sigma \succeq 0} \sup_{P_{\mathbf{X}} \text{ Gaussian}, \ \Sigma_{\mathbf{X}} \preceq \Sigma} F_0(P_{\mathbf{X}}) \quad (6.58)$$
$$= \sup_{P_{\mathbf{X}} \text{ Gaussian}} F_0(P_{\mathbf{X}}) \quad (6.59)$$

where (6.58) was established in (6.54). Finally, (6.55), (6.56), (6.59) and the arbitrariness of $\varepsilon$ give

$$\sup_{P_{\mathbf{X}U}} F(P_{\mathbf{X}U}) \le \sup_{P_{\mathbf{X}} \text{ Gaussian}} F_0(P_{\mathbf{X}}). \quad (6.60)$$

Thus the $\le$ part of (6.39) is established. The other direction is trivial from the definition.
6.2.2 Optimization of Mutual Informations
Let $\mathbf{X}, \mathbf{Y}_1, \dots, \mathbf{Y}_m$ be jointly Gaussian vectors, $a_0, \dots, a_m$ be nonnegative real numbers, and $\mathbf{M}$ be a positive-semidefinite matrix. We are interested in minimizing $I(U; \mathbf{X})$ over $P_{U|\mathbf{X}}$ (that is, the Markov chain $U - \mathbf{X} - \mathbf{Y}^m$ must hold) subject to $I(U; \mathbf{Y}_i) \ge a_i$, $i \in \{1, \dots, m\}$, and $\operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}|U}] \le a_0$. This is relevant to many problems in information theory, including the Gray-Wyner network [6] and some common randomness/key generation problems [101] (to be discussed in Section 6.4.5). We have the following result for the Lagrange dual of such an optimization problem:
Theorem 6.2.5 Fix $\mathbf{M} \succeq 0$, positive constants $c_0, c_1, \dots, c_m$, and jointly Gaussian vectors $(\mathbf{X}, \mathbf{Y}_1, \dots, \mathbf{Y}_m) \sim Q_{\mathbf{X}\mathbf{Y}_1 \dots \mathbf{Y}_m}$. Define the function

$$G(P_{U|\mathbf{X}}) := \sum_{j=1}^m c_j I(\mathbf{Y}_j; U) - I(\mathbf{X}; U) - c_0 \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}|U}]. \quad (6.61)$$

Then

1. If $(Q_{\mathbf{Y}_1|\mathbf{X}}, \dots, Q_{\mathbf{Y}_m|\mathbf{X}})$ is non-degenerate, then $\sup_{P_{U|\mathbf{X}}} G(P_{U|\mathbf{X}})$ is achieved by a $U^\star$ for which $\mathbf{X}|\{U^\star = u\}$ is normal for $P_{U^\star}$-a.e. $u$, with covariance not depending on $u$. This can be realized by a Gaussian vector $U^\star$ of the same dimension as $\mathbf{X}$.

2. In general, $\sup_{P_{U|\mathbf{X}}} G(P_{U|\mathbf{X}}) = \sup_{P_{U|\mathbf{X}} : \ (U, \mathbf{X}) \text{ is Gaussian}} G(P_{U|\mathbf{X}})$.
Proof Set $\Sigma := \Sigma_{\mathbf{X}}$. Note that for any $P_{U|\mathbf{X}}$, we have

$$\sum_{j=1}^m c_j I(\mathbf{Y}_j; U) - I(\mathbf{X}; U) - c_0 \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}|U}] \quad (6.62)$$
$$= \sum_{j=1}^m c_j h(\mathbf{Y}_j) - h(\mathbf{X}) + \left( h(\mathbf{X}|U) - \sum_{j=1}^m c_j h(\mathbf{Y}_j|U) - c_0 \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}|U}] \right) \quad (6.63)$$
$$\le \sum_{j=1}^m c_j h(\mathbf{Y}_j) - h(\mathbf{X}) + \sup_{P_{\mathbf{X}'U'} : \Sigma_{\mathbf{X}'|U'} \preceq \Sigma} F(P_{\mathbf{X}'U'}). \quad (6.64)$$

In the non-degenerate case, by Theorem 6.2.1, the supremum in the last line is attained by a constant $U'$ and $\mathbf{X}' \sim \mathcal{N}(0, \Sigma')$, where $\Sigma' \preceq \Sigma$. Hence, taking $U^\star \sim \mathcal{N}(0, \Sigma - \Sigma')$ and $P_{\mathbf{X}|U^\star = u} \sim \mathcal{N}(u, \Sigma')$, we have $P_{\mathbf{X}} \sim \mathcal{N}(0, \Sigma)$ as required, and (6.64) reads

$$\sum_{j=1}^m c_j I(\mathbf{Y}_j; U) - I(\mathbf{X}; U) - c_0 \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}|U}] \le \sum_{j=1}^m c_j I(\mathbf{Y}_j; U^\star) - I(\mathbf{X}; U^\star) - c_0 \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}|U^\star}], \quad (6.65)$$

and 1) follows since $P_{U|\mathbf{X}}$ is arbitrary. The claim for the general (possibly degenerate) case follows by invoking Theorem 6.2.4 and using a similar argument.
6.2.3 Optimization of Differential Entropies
The functional $F_0(\cdot)$ defined in Section 6.2.1 is a linear combination of differential entropies and second-order moments, which is closely related to the Brascamp-Lieb inequality. Immediately from Theorem 6.2.1 and Theorem 6.2.4, we have the following result regarding the maximization of $F_0(\cdot)$:

Corollary 6.2.6 For any $\Sigma \succeq 0$,

$$\sup_{P_{\mathbf{X}} : \Sigma_{\mathbf{X}} \preceq \Sigma} F_0(P_{\mathbf{X}}) = \sup_{P_{\mathbf{X}} \text{ Gaussian}, \ \Sigma_{\mathbf{X}} \preceq \Sigma} F_0(P_{\mathbf{X}}); \quad (6.66)$$
$$\sup_{P_{\mathbf{X}}} F_0(P_{\mathbf{X}}) = \sup_{P_{\mathbf{X}} \text{ Gaussian}} F_0(P_{\mathbf{X}}), \quad (6.67)$$

where $F_0(\cdot)$ is defined in (6.37). In addition, in the non-degenerate case (6.66) is always finite and (up to a translation) uniquely achieved by a Gaussian $P_{\mathbf{X}}$; (6.67) is finite and (up to a translation) uniquely achieved by a Gaussian $P_{\mathbf{X}}$ if $\mathbf{M}$ in (6.37) is positive definite.
Remark 6.2.4 Since the left side of (6.38) is obviously concave as a function of $\Sigma$, (6.38) and (6.66) combined show that the left side of (6.66) is concave in $\Sigma$. This is not obvious from the definition; in particular, it is not clear whether the concavity still holds for non-Gaussian $Q_{\mathbf{X}}$ and $Q_{\mathbf{Y}_j|\mathbf{X}}$.
6.3 Gaussian Optimality in the Forward-Reverse Brascamp-Lieb Inequality

In this section, some notation and terminology from Sections 2.4 and 6.2 will be used. Moreover, consider the following parameters/data:

• Fix Lebesgue measures $(\mu_j)_{j=1}^m$ and Gaussian measures $(\nu_i)_{i=1}^l$ on $\mathbb{R}$;

• non-degenerate (Definition 6.2.2) linear Gaussian random transformations $(P_{Y_j|X})_{j=1}^m$ (where $X := (X_1, \dots, X_l)$) associated with conditional expectation operators $(T_j)_{j=1}^m$;

• positive $(c_j)$ and $(b_i)$.

Given Borel measures $P_{X_i}$ on $\mathbb{R}$, $i = 1, \dots, l$, define

$$F_0((P_{X_i})) := \inf_{P_X} \sum_{j=1}^m c_j D(P_{Y_j} \| \mu_j) - \sum_{i=1}^l b_i D(P_{X_i} \| \nu_i) \quad (6.68)$$

where the infimum is over Borel measures $P_X$ that have $(P_{X_i})$ as marginals. The aim of this section is to prove the following:
Theorem 6.3.1 $\sup_{(P_{X_i})} F_0((P_{X_i}))$, where the supremum is over Borel measures $P_{X_i}$ on $\mathbb{R}$, $i = 1, \dots, l$, is achieved by some Gaussian $(P_{X_i})_{i=1}^l$.

Naturally, one would expect that Gaussian optimality can be established when $(\mu_j)_{j=1}^m$ and $(\nu_i)_{i=1}^l$ are either Gaussian or Lebesgue. We made the assumption that the former are Lebesgue and the latter are Gaussian so that certain technical conditions can be justified conveniently, while the crux of the matter, the tensorization steps, can still be demonstrated. More precisely, we have the following observation:
Proposition 6.3.2 $\sup_{(P_{X_i})} F_0((P_{X_i}))$ is finite, and there exist $\sigma_i^2 \in (0, \infty)$, $i = 1, \dots, l$, such that it equals

$$\sup_{(P_{X_i}) : \ \mathbb{E}[X_i^2] \le \sigma_i^2} F_0((P_{X_i})). \quad (6.69)$$

Proof When $\mu_j$ is Lebesgue and $P_{Y_j|X}$ is non-degenerate, $D(P_{Y_j} \| \mu_j) = -h(P_{Y_j}) \le -h(P_{Y_j|X})$ is bounded above (in terms of the variance of the additive noise of $P_{Y_j|X}$). Moreover, $D(P_{X_i} \| \nu_i) \ge 0$ when $\nu_i$ is Gaussian, so $\sup_{(P_{X_i})} F_0((P_{X_i})) < \infty$. Further, choosing $(P_{X_i}) = (\nu_i)$ and applying a covariance argument to lower bound the first term in (6.68) shows that $\sup_{(P_{X_i})} F_0((P_{X_i})) > -\infty$.

To see (6.69), notice that

$$D(P_{X_i} \| \nu_i) = D(P_{X_i} \| \nu_i') + \mathbb{E}[\imath_{\nu_i' \| \nu_i}(X_i)] \quad (6.70)$$
$$= D(P_{X_i} \| \nu_i') + D(\nu_i' \| \nu_i) \quad (6.71)$$
$$\ge D(\nu_i' \| \nu_i) \quad (6.72)$$

where $\nu_i'$ is the Gaussian distribution with the same first and second moments as $X_i \sim P_{X_i}$. Thus $D(P_{X_i} \| \nu_i)$ is bounded below by some function of the second moment of $X_i$ which tends to $\infty$ as the second moment of $X_i$ tends to $\infty$. Moreover, as argued in the preceding paragraph, the first term in (6.68) is bounded above by some constant depending only on $(P_{Y_j|X})$. Thus, we can choose $\sigma_i^2 > 0$, $i = 1, \dots, l$, large enough such that if $\mathbb{E}[X_i^2] > \sigma_i^2$ for some $i$, then $F_0((P_{X_i})) < \sup_{(P_{X_i})} F_0((P_{X_i}))$, irrespective of the choices of $P_{X_1}, \dots, P_{X_{i-1}}, P_{X_{i+1}}, \dots, P_{X_l}$. Then these $\sigma_1, \dots, \sigma_l$ are as desired in the proposition.
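The identity behind (6.70)-(6.71), namely $D(P \| \nu) = D(P \| \nu') + D(\nu' \| \nu)$ when $\nu'$ is Gaussian with the first two moments of $P$ and $\nu$ is Gaussian, holds because $\log \frac{\mathrm{d}\nu'}{\mathrm{d}\nu}$ is a quadratic whose expectation under $P$ depends only on those moments. A numeric sketch (with an arbitrary Gaussian-mixture $P$, not from the text):

```python
import numpy as np

def trapezoid(y, x):
    # simple trapezoidal rule for numerical integration
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def gauss(x, m, s2):
    return np.exp(-(x - m)**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

xs = np.linspace(-12.0, 12.0, 400001)
# P: an arbitrary two-component Gaussian mixture; nu: standard Gaussian
p = 0.3 * gauss(xs, -1.0, 0.5) + 0.7 * gauss(xs, 2.0, 1.5)
nu = gauss(xs, 0.0, 1.0)

# nu': the Gaussian with the same first and second moments as P
m1 = trapezoid(xs * p, xs)
s2 = trapezoid((xs - m1)**2 * p, xs)
nu_p = gauss(xs, m1, s2)

def D(a, b):
    # relative entropy between densities a and b on the grid
    return trapezoid(a * np.log(a / b), xs)

# D(P||nu) = D(P||nu') + D(nu'||nu), up to truncation/quadrature error
assert abs(D(p, nu) - (D(p, nu_p) + D(nu_p, nu))) < 1e-4
```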
As we saw in Section 6.2, the regularization ensures that the supremum is achieved,
although it might be possible to prove Gaussian exhaustibility results by taking limits.
Proposition 6.3.3 1. For any $(P_{X_i})_{i=1}^l$, the infimum in (6.68) is attained.

2. If $(P_{Y_j|X})_{j=1}^m$ are non-degenerate (Definition 6.2.2), then the supremum in (6.69) is achieved by some $(P_{X_i})_{i=1}^l$.

Proof 1. For any $\varepsilon > 0$, by the continuity of measure there exists $K > 0$ such that

$$P_{X_i}([-K, K]) \ge 1 - \frac{\varepsilon}{l}, \quad i = 1, \dots, l. \quad (6.73)$$

By the union bound,

$$P_X([-K, K]^l) \ge 1 - \varepsilon \quad (6.74)$$

whenever $P_X$ is a coupling of $(P_{X_i})$. Now let $P_X^{(n)}$, $n = 1, 2, \dots$, be such that

$$\lim_{n \to \infty} \sum_{j=1}^m c_j D(P_{Y_j}^{(n)} \| \mu_j) = \inf_{P_X} \sum_{j=1}^m c_j D(P_{Y_j} \| \mu_j) \quad (6.75)$$

where $P_{Y_j} := T_j^* P_X$, $j = 1, \dots, m$. The sequence $(P_X^{(n)})$ is tight by (6.74); thus, invoking Prokhorov's theorem and passing to a subsequence, we may assume that $(P_X^{(n)})$ converges weakly to some $P_X^\star$. Therefore $P_{Y_j}^{(n)}$ converges weakly to $P_{Y_j}^\star$, and by the semicontinuity property in Lemma 6.5.2 we have

$$\sum_{j=1}^m c_j D(P_{Y_j}^\star \| \mu_j) \le \lim_{n \to \infty} \sum_{j=1}^m c_j D(P_{Y_j}^{(n)} \| \mu_j), \quad (6.76)$$

establishing that $P_X^\star$ is an infimizer.
2. Suppose $(P_{X_i}^{(n)})_{1 \le i \le l, n \ge 1}$ is such that $\mathbb{E}[X_i^2] \le \sigma_i^2$, $X_i \sim P_{X_i}^{(n)}$, where $(\sigma_i)$ is as in Proposition 6.3.2, and

$$\lim_{n \to \infty} F_0\left((P_{X_i}^{(n)})_{i=1}^l\right) = \sup_{(P_{X_i}) : \ \mathbb{E}[X_i^2] \le \sigma_i^2} F_0((P_{X_i})_{i=1}^l). \quad (6.77)$$

The regularization on the covariance implies that for each $i$, $(P_{X_i}^{(n)})_{n \ge 1}$ is a tight sequence. Thus, upon the extraction of subsequences, we may assume that for each $i$, $(P_{X_i}^{(n)})_{n \ge 1}$ converges to some $P_{X_i}^\star$, and a simple truncation and min-max inequality argument (see e.g. (6.169)) shows that $\mathbb{E}[X_i^2] \le \sigma_i^2$, $X_i \sim P_{X_i}^\star$. Then by Lemma 6.5.2,

$$\sum_i b_i D(P_{X_i}^\star \| \nu_i) \le \lim_{n \to \infty} \sum_i b_i D(P_{X_i}^{(n)} \| \nu_i). \quad (6.78)$$

Under the covariance regularization and the non-degenerateness assumption, we showed in Proposition 6.3.2 that the value of (6.69) cannot be $+\infty$ or $-\infty$. This implies that we can assume (by passing to a subsequence) that $P_{X_i}^{(n)} \ll \lambda$, $i = 1, \dots, l$, since otherwise $F_0((P_{X_i})) = -\infty$. Moreover, since $\left(\sum_j c_j D(P_{Y_j}^{(n)} \| \mu_j)\right)_{n \ge 1}$ is bounded above under the non-degenerateness assumption, the sequence $\left(\sum_i b_i D(P_{X_i}^{(n)} \| \nu_i)\right)_{n \ge 1}$ must also be bounded from above, which implies, using (6.78), that

$$\sum_i b_i D(P_{X_i}^\star \| \nu_i) < \infty. \quad (6.79)$$

In particular, we have $P_{X_i}^\star \ll \lambda$ for each $i$. Let $P_X^\star$ be an infimizer as in Part 1) for the marginals $(P_{X_i}^\star)$. In view of Lemma 6.3.4, by possibly passing to subsequences we can assume that each $(P_{X_i}^{(n)})_{i=1}^l$ admits a coupling $P_X^{(n)}$ such that $P_X^{(n)} \to P_X^\star$ weakly as $n \to \infty$. In the non-degenerate case the output differential entropy is weakly continuous in the input distribution under the covariance constraint (see for example [92, Proposition 18]), which establishes that

$$\inf_{P_X : \ S_i^* P_X = P_{X_i}^\star} \sum_j c_j D(T_j^* P_X \| \mu_j) = \sum_j c_j D(P_{Y_j}^\star \| \mu_j) \quad (6.80)$$
$$= \lim_{n \to \infty} \sum_j c_j D(P_{Y_j}^{(n)} \| \mu_j) \quad (6.81)$$
$$\ge \lim_{n \to \infty} \inf_{P_X : \ S_i^* P_X = P_{X_i}^{(n)}} \sum_j c_j D(T_j^* P_X \| \mu_j). \quad (6.82)$$

Thus (6.78) and (6.82) show that $(P_{X_i}^\star)$ is in fact a maximizer.
Lemma 6.3.4 Suppose that for each $i = 1, \dots, l$ ($l \ge 2$), $P_{X_i}$ is a Borel measure on $\mathbb{R}$ and $P_{X_i}^{(n)}$ converges weakly to $P_{X_i}$ as $n \to \infty$. If $P_X$ is a coupling of $(P_{X_i})_{1 \le i \le l}$, then, upon extraction of a subsequence, there exist couplings $P_X^{(n)}$ of $(P_{X_i}^{(n)})_{1 \le i \le l}$ which converge weakly to $P_X$ as $n \to \infty$.
Remark 6.3.1 Lemma 6.3.4 will be used to show that the infimum of an upper semicontinuous (w.r.t. the joint distribution) functional over couplings is also upper semicontinuous (w.r.t. the marginal distributions), which is the key to the proof of Proposition 6.3.3. Another application is to prove the weak continuity of the optimal transport cost (the lower semicontinuity part being trivial) in the theory of optimal transportation, which may also be proved using the "stability of optimal transport" in [57, Theorem 5.20]. However, we note that the approach in [57, Theorem 5.20] relies on cyclic monotonicity, and hence cannot be extended to the setting of general upper semicontinuous functionals such as in Corollary 6.3.5.
Corollary 6.3.5 (Weak stability of optimal coupling) In the case of non-degenerate $(P_{Y_j|X})$, suppose for each $n \ge 1$, $P_{X_i}^{(n)}$ is a Borel measure on $\mathbb{R}$, $i = 1, \dots, l$, whose second moment is bounded by $\sigma_i^2 < \infty$. Assume that $P_X^{(n)}$ is a coupling of $(P_{X_i}^{(n)})$ that minimizes $\sum_{j=1}^m c_j D(P_{Y_j}^{(n)} \| \mu_j)$. If $P_X^{(n)}$ converges weakly to some $P_X^\star$, then $P_X^\star$ minimizes $\sum_{j=1}^m c_j D(P_{Y_j}^\star \| \mu_j)$ given the marginals $(P_{X_i}^\star)$.
Proof Lemma 6.3.4 and the fact that the differential entropy of the output of a non-degenerate Gaussian channel is weakly semicontinuous with respect to the input distribution under a moment constraint (see e.g. [92, Proposition 18], [204, Theorem 7], or [205, Theorem 1, Theorem 2]) imply that there exists a strictly increasing sequence of positive integers $(n_k)_{k \ge 1}$ such that

$$\min_{P_X : \ S_i^* P_X = P_{X_i}^\star} \sum_j c_j D(T_j^* P_X \| \mu_j) \ge \limsup_{k \to \infty} \min_{P_X : \ S_i^* P_X = P_{X_i}^{(n_k)}} \sum_j c_j D(T_j^* P_X \| \mu_j). \quad (6.83)$$

On the other hand, the assumption on the convergence of the optimal couplings and Lemma 6.5.2 imply that

$$\sum_j c_j D(T_j^* P_X^\star \| \mu_j) \le \liminf_{k \to \infty} \min_{P_X : \ S_i^* P_X = P_{X_i}^{(n_k)}} \sum_j c_j D(T_j^* P_X \| \mu_j), \quad (6.84)$$

so the left side of (6.84) is no larger than the left side of (6.83).
Proof of Lemma 6.3.4 For each integer $k \ge 1$, define the random variable $W_i^{[k]} := \phi_k(X_i)$, where $\phi_k$ is the following "dyadic quantization function":
$$\phi_k\colon x \in \mathbb{R} \mapsto \begin{cases} \lfloor 2^k x \rfloor & |x| \le k,\ x \notin 2^{-k}\mathbb{Z};\\ e & \text{otherwise}, \end{cases} \tag{6.85}$$
and let $W^{[k]} := (W_i^{[k]})_{i=1}^l$. Denote by $\mathcal{W}^{[k]} := \{-k2^k, \dots, k2^k - 1, e\}$ the alphabet of $W_i^{[k]}$.
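The quantizer (6.85) is elementary but easy to get wrong at the dyadic boundary; a minimal Python sketch is below (the function name `phi_k` and the string `'e'` standing in for the erasure symbol are our choices, not part of the proof):

```python
import math

def phi_k(x: float, k: int):
    """Dyadic quantizer of (6.85): on [-k, k], map x to floor(2^k x),
    except that dyadic points (x in 2^{-k} Z) and points with |x| > k
    are sent to the erasure symbol, encoded here as the string 'e'."""
    y = x * 2 ** k
    if abs(x) <= k and y != math.floor(y):   # i.e., x is not in 2^{-k} Z
        return math.floor(y)
    return 'e'
```

The non-erased cells $\phi_k^{-1}(w)$ are the open dyadic intervals $(w\,2^{-k}, (w+1)\,2^{-k})$ inside $[-k, k]$, consistent with the alphabet $\{-k2^k, \dots, k2^k - 1, e\}$.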
For simplicity of the presentation, we shall assume that the set of "dyadic points" has measure zero:
$$P_{X_i}\Big(\bigcup_{k=1}^{\infty} 2^{-k}\mathbb{Z}\Big) = 0, \quad i = 1, \dots, l. \tag{6.86}$$
This is not an essential restriction, since the only property of $2^{-k}\mathbb{Z}$ used in the proof is that it is a countable dense subset of $\mathbb{R}$. In general, $P_{X_i}$ can have positive mass on at most countably many points, and in particular there exists a point of zero mass in any open interval. Thus, we can always choose a countable dense subset of $\mathbb{R}$ with $P_{X_i}$-measure zero for each $i$, and appropriately modify (6.85) to quantize with points in this set instead.
Since $P^{(n)}_{X_i} \to P_{X_i}$ weakly and the assumption in the preceding paragraph precludes any positive mass on the quantization boundaries under $P_{X_i}$, for each $k \ge 1$ there exists some $n := n_k$ large enough such that
$$P^{(n)}_{W_i^{[k]}}(w) \ge \Big(1 - \frac{1}{k}\Big) P_{W_i^{[k]}}(w) \tag{6.87}$$
for each $i$ and $w \in \mathcal{W}^{[k]}$. Now define a coupling $P^{(n)}_{W^{[k]}}$ compatible with the marginals $\big(P^{(n)}_{W_i^{[k]}}\big)_{i=1}^l$ induced by $\big(P^{(n)}_{X_i}\big)_{i=1}^l$, as follows:
$$P^{(n)}_{W^{[k]}} := \Big(1 - \frac{1}{k}\Big) P_{W^{[k]}} + k^{l-1} \prod_{i=1}^l \Big(P^{(n)}_{W_i^{[k]}} - \Big(1 - \frac{1}{k}\Big) P_{W_i^{[k]}}\Big). \tag{6.88}$$
Observe that this is a well-defined probability measure because of (6.87), and indeed has $\big(P^{(n)}_{W_i^{[k]}}\big)_{i=1}^l$ as the marginals. Moreover, by the triangle inequality we have the following bound on the total variation distance:
$$\Big\| P^{(n)}_{W^{[k]}} - P_{W^{[k]}} \Big\| \le \frac{2}{k}. \tag{6.89}$$
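A quick numerical check of the construction (6.88) for $l = 2$: the alphabet sizes and the particular perturbation used to enforce (6.87) below are arbitrary test choices.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5

# A joint pmf P_W on a 3 x 4 alphabet, with marginals P_{W_1}, P_{W_2}.
P = rng.random((3, 4)); P /= P.sum()
P1, P2 = P.sum(axis=1), P.sum(axis=0)

# Marginals Q_i playing the role of P^{(n)}_{W_i}, satisfying (6.87):
# mix each P_{W_i} with a uniform pmf.
Q1 = (1 - 1 / (2 * k)) * P1 + (1 / (2 * k)) / 3
Q2 = (1 - 1 / (2 * k)) * P2 + (1 / (2 * k)) / 4

# The mixture (6.88) for l = 2: (1 - 1/k) P + k^{l-1} * product of residuals.
R1 = Q1 - (1 - 1 / k) * P1          # residuals, nonnegative by (6.87)
R2 = Q2 - (1 - 1 / k) * P2
Pn = (1 - 1 / k) * P + k * np.outer(R1, R2)

assert (Pn >= 0).all() and np.isclose(Pn.sum(), 1.0)          # a valid pmf
assert np.allclose(Pn.sum(axis=1), Q1)                         # marginal 1
assert np.allclose(Pn.sum(axis=0), Q2)                         # marginal 2
assert np.abs(Pn - P).sum() <= 2 / k + 1e-12                   # the bound (6.89)
```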
Next, construct⁸ $P^{(n)}_X$:
$$P^{(n)}_X := \sum_{w^l \in \mathcal{W}^{[k]} \times \cdots \times \mathcal{W}^{[k]}} \frac{P^{(n)}_{W^{[k]}}(w^l)}{\prod_{i=1}^l P^{(n)}_{W_i^{[k]}}(w_i)} \prod_{i=1}^l P^{(n)}_{X_i}\big|_{\phi_k^{-1}(w_i)}. \tag{6.90}$$
Observe that the $P^{(n)}_X$ so defined is compatible with the $P^{(n)}_{W^{[k]}}$ defined in (6.88), and indeed has $(P^{(n)}_{X_i})_{i=1}^l$ as the marginals. Since $n := n_k$ can be made increasing in $k$, we have constructed the desired sequence $(P^{(n_k)}_X)_{k=1}^\infty$ converging weakly to $P_X$. Indeed, for any bounded open dyadic cube⁹ $A$, using (6.89) and the assumption (6.86), we conclude
$$\liminf_{k\to\infty} P^{(n_k)}_X(A) \ge P_X(A). \tag{6.91}$$
Moreover, since bounded open dyadic cubes form a countable basis of the topology on $\mathbb{R}^l$, (6.91) actually holds for any open set $A$ (by writing $A$ as a countable union of dyadic cubes, using the continuity of measure to pass to a finite disjoint union, and then applying (6.91)), as desired.
Next, we need the following tensorization result:

Lemma 6.3.6 Fix $(P_{X_i^{(1)}})$, $(P_{X_i^{(2)}})$, $(\mu_j)$, $(T_j)$, and $(c_j) \in [0,\infty)^m$, and let the $S_i$ be induced by coordinate projections. Then
$$\inf_{P_{X^{(1,2)}}\colon S_i^{*\otimes 2} P_{X^{(1,2)}} = P_{X_i^{(1)}} \times P_{X_i^{(2)}}} \sum_{j=1}^m c_j D\big(P_{Y_j^{(1,2)}} \,\big\|\, \mu_j^{\otimes 2}\big) = \sum_{t=1,2}\, \inf_{P_{X^{(t)}}\colon S_i^* P_{X^{(t)}} = P_{X_i^{(t)}}} \sum_{j=1}^m c_j D\big(P_{Y_j^{(t)}} \,\big\|\, \mu_j\big) \tag{6.92}$$
⁸We use $P|_A$ to denote the restriction of a probability measure $P$ to a measurable set $A$, that is, $P|_A(B) := P(A \cap B)$ for any measurable $B$.
⁹That is, a cube whose corners have coordinates that are multiples of $2^{-k}$, where $k$ is some integer.
where for each $j$,
$$P_{Y_j^{(1,2)}} := T_j^{*\otimes 2} P_{X^{(1,2)}} \tag{6.93}$$
on the left side and
$$P_{Y_j^{(t)}} := T_j^{*} P_{X^{(t)}} \tag{6.94}$$
on the right side, $t = 1, 2$.
Proof We only need to prove the nontrivial "$\ge$" part. For any $P_{X^{(1,2)}}$ on the left side, choose $P_{X^{(t)}}$ on the right side by marginalization. Then
$$D\big(P_{Y_j^{(1,2)}} \,\big\|\, \mu_j^{\otimes 2}\big) - \sum_t D\big(P_{Y_j^{(t)}} \,\big\|\, \mu_j\big) = I\big(Y_j^{(1)}; Y_j^{(2)}\big) \tag{6.95}$$
$$\ge 0 \tag{6.96}$$
for each $j$.
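The identity (6.95) is just the chain-rule decomposition $D(P_{Y_1Y_2}\|\mu\otimes\mu) = I(Y_1;Y_2) + D(P_{Y_1}\|\mu) + D(P_{Y_2}\|\mu)$; a discrete numerical check (alphabet size and random seed are arbitrary):

```python
import numpy as np

def D(p, q):
    """Relative entropy D(p||q) in nats for discrete pmfs (supp p in supp q)."""
    p, q = np.asarray(p, float).ravel(), np.asarray(q, float).ravel()
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

rng = np.random.default_rng(1)
P = rng.random((4, 4)); P /= P.sum()      # joint pmf of (Y^(1), Y^(2))
P1, P2 = P.sum(axis=1), P.sum(axis=0)     # marginals
mu = rng.random(4); mu /= mu.sum()        # reference measure on the Y alphabet

I_Y = D(P, np.outer(P1, P2))              # mutual information I(Y1; Y2)
lhs = D(P, np.outer(mu, mu)) - D(P1, mu) - D(P2, mu)
assert np.isclose(lhs, I_Y) and I_Y >= 0  # (6.95) and (6.96)
```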
We are now in a position to prove the main result of this section.

Proof of Theorem 6.3.1

1. Assume that $(P_{X_i^{(1)}})$ and $(P_{X_i^{(2)}})$ are maximizers of $F_0$ (possibly equal). Let $P_{X_i^{(1,2)}} := P_{X_i^{(1)}} \times P_{X_i^{(2)}}$. Define
$$X^+ := \tfrac{1}{\sqrt{2}}\big(X^{(1)} + X^{(2)}\big); \tag{6.97}$$
$$X^- := \tfrac{1}{\sqrt{2}}\big(X^{(1)} - X^{(2)}\big). \tag{6.98}$$
Define $(Y_j^+)$ and $(Y_j^-)$ analogously. Then $Y_j^+ \mid \{X^+ = x^+, X^- = x^-\} \sim Q_{Y_j|X=x^+}$ is independent of $x^-$, and $Y_j^- \mid \{X^+ = x^+, X^- = x^-\} \sim Q_{Y_j|X=x^-}$ is independent of $x^+$.
2. Next we perform the same algebraic expansion as in the proof of tensorization:
\begin{align}
\sum_{t=1}^2 F_0\big((P_{X_i^{(t)}})\big)
&= \inf_{P_{X^{(1,2)}}\colon S_j^{*\otimes 2} P_{X^{(1,2)}} = P_{X_j^{(1,2)}}} \sum_j c_j D\big(P_{Y_j^{(1,2)}} \big\| \mu_j^{\otimes 2}\big) - \sum_i b_i D\big(P_{X_i^{(1,2)}} \big\| \nu_i^{\otimes 2}\big) \tag{6.99}\\
&= \inf_{P_{X^+X^-}\colon S_j^{*\otimes 2} P_{X^+X^-} = P_{X_j^+ X_j^-}} \sum_j c_j D\big(P_{Y_j^+ Y_j^-} \big\| \mu_j^{\otimes 2}\big) - \sum_i b_i D\big(P_{X_i^+ X_i^-} \big\| \nu_i^{\otimes 2}\big) \tag{6.100}\\
&\le \inf_{P_{X^+X^-}\colon S_j^{*\otimes 2} P_{X^+X^-} = P_{X_j^+ X_j^-}} \sum_j c_j \Big[ D\big(P_{Y_j^+} \big\| \mu_j\big) + D\big(P_{Y_j^-|X^+} \big\| \mu_j \,\big|\, P_{X^+}\big) \Big] - \sum_i b_i \Big[ D\big(P_{X_i^+} \big\| \nu_i\big) + D\big(P_{X_i^-|X_i^+} \big\| \nu_i \,\big|\, P_{X_i^+}\big) \Big] \tag{6.101}\\
&\le \sum_j c_j \Big[ D\big(P^\star_{Y_j^+} \big\| \mu_j\big) + D\big(P^\star_{Y_j^-|X^+} \big\| \mu_j \,\big|\, P^\star_{X^+}\big) \Big] - \sum_i b_i \Big[ D\big(P^\star_{X_i^+} \big\| \nu_i\big) + D\big(P^\star_{X_i^-|X^+} \big\| \nu_i \,\big|\, P^\star_{X^+}\big) \Big] \tag{6.102}\\
&= F_0\big((P^\star_{X_i^+})\big) + \int F_0\big((P^\star_{X_i^-|X^+})\big)\, \mathrm{d}P^\star_{X^+} \tag{6.103}\\
&\le \sum_{t=1}^2 F_0\big((P_{X_i^{(t)}})\big) \tag{6.104}
\end{align}
where
• (6.99) uses Lemma 6.3.6.
• (6.101) is because of the Markov chain $Y_j^+ - X^+ - Y_j^-$ (for any coupling).
• In (6.102) we selected a particular instance of the coupling $P_{X^+X^-}$, constructed as follows: first we select an optimal coupling $P_{X^+}$ for the given marginals $(P_{X_i^+})$. Then, for any $x^+ = (x_i^+)_{i=1}^l$, let $P_{X^-|X^+=x^+}$ be an optimal coupling of $(P_{X_i^-|X_i^+=x_i^+})$.¹⁰ With this construction, it is apparent that $X_i^+ - X^+ - X_i^-$ and hence
$$D\big(P_{X_i^-|X_i^+} \big\| \nu_i \,\big|\, P_{X_i^+}\big) = D\big(P_{X_i^-|X^+} \big\| \nu_i \,\big|\, P_{X^+}\big). \tag{6.105}$$
¹⁰Here we need to justify that we can select the optimal coupling $P_{X^-|X^+=x^+}$ in a way that $P_{X^-|X^+}$ is indeed a regular conditional probability distribution, or equivalently, that $x^+ \mapsto P_{X^-|X^+=x^+}$ is Borel measurable, where we endow the space of probability measures with the weak topology. This is justified by two observations: (a) $x^+ \mapsto (P_{X_i^-|X^+=x^+})$ is Borel measurable; (b) the marginalization map
• (6.103) is because in the above we have constructed the coupling optimally.
• (6.104) is because $(P_{X_i^{(t)}})$ maximizes $F_0$, $t = 1, 2$.
3. Thus in the expansions above, equalities are attained throughout. Using the differentiation technique as in the case of the forward inequality, for almost all $(b_i)$, $(c_j)$ we have
$$D\big(P_{X_i^-|X_i^+} \big\| \nu_i \,\big|\, P_{X_i^+}\big) = D\big(P_{X_i^+} \big\| \nu_i\big) \tag{6.106}$$
$$= D\big(P_{X_i^-} \big\| \nu_i\big), \quad \forall i, \tag{6.107}$$
where the last equality holds because, by symmetry, we can perform the algebraic expansions in a different way to show that $(P_{X_i^-})$ is also a maximizer of $F_0$. Then $I(X_i^+; X_i^-) = 0$, which, combined with $I(X_i^{(1)}; X_i^{(2)}) = 0$, shows that $X_i^{(1)}$ and $X_i^{(2)}$ are Gaussian with the same covariance. Lastly, using Lemma 6.3.6 and the doubling trick, one can show that the optimal coupling is also Gaussian.
$\phi\colon O \to \prod_i \mathcal{P}(\mathcal{X}_i)$, $P_X \mapsto (P_{X_i})$, admits a Borel right inverse, where $O$ is the set of optimal couplings (w.r.t. the relative entropy functional) whose marginals satisfy the second moment constraint. Part (a) is equivalent to the regularity of $P_{X_i^-|X^+}$, and the existence of such a regular conditional distribution is guaranteed for joint distributions on Polish spaces (whose measurable space structure is isomorphic to a standard Borel space); see e.g. [43]. Part (b) is justified by measurable selection theorems (see e.g. references in [57, Corollary 5.22]). In particular, an argument similar to [57, Corollary 5.22] can also be applied here, since Corollary 6.3.5 implies that $O$ is closed and the pre-images $\phi^{-1}((P_{X_i}))$ are all compact, and Proposition 6.3.3.1 justifies that $\phi$ is onto.
6.4 Consequences of the Gaussian Optimality
In this section, we demonstrate several implications of the entropic Gaussian optimal-
ity results in Sections 6.2-6.3 to functional inequalities, transportation-cost inequali-
ties, and network information theory.
6.4.1 Brascamp-Lieb Inequality with Gaussian Random Transformations: an Information-Theoretic Proof
We give a simple proof of an extension of the Brascamp-Lieb inequality using Corol-
lary 6.2.6. That is, we give an information-theoretic proof of the following result:
Theorem 6.4.1 Suppose $\mu$ is either a Gaussian measure or the Lebesgue measure on $\mathcal{X} = \mathbb{R}^n$. Let $Q_{Y_j|X}$ be Gaussian random transformations where $\mathcal{Y}_j = \mathbb{R}^{n_j}$, and $c_j \in (0,\infty)$ for $j \in \{1, \dots, m\}$. For any non-negative measurable functions $f_j\colon \mathcal{Y}_j \to \mathbb{R}$, $j \in \{1, \dots, m\}$, define
$$H(f_1, \dots, f_m) = \log \int \exp\Big(\sum_{j=1}^m \mathbb{E}[\log f_j(Y_j) \mid X = x]\Big)\, \mathrm{d}\mu(x) - \log \prod_{j=1}^m \|f_j\|_{\frac{1}{c_j}} \tag{6.108}$$
where the norm $\|f_j\|_{\frac{1}{c_j}}$ is with respect to the Lebesgue measure and the expectation is with respect to $Q_{Y_j|X=x}$. Then
$$\sup_{f_1, \dots, f_m} H(f_1, \dots, f_m) = \sup_{\text{Gaussian } f_1, \dots, f_m} H(f_1, \dots, f_m). \tag{6.109}$$
Moreover, if $\mu$ is Gaussian and $(Q_{Y_j|X})$ is non-degenerate in the sense of Definition 6.2.2, then the supremum in (6.109) is finite and uniquely attained by a set of Gaussian functions.
Proof We only prove the case of Gaussian $\mu$, since the proof for the Lebesgue case is similar. By translation and scaling invariance, it suffices to consider the case where $\mu$ is a centered Gaussian with density $\exp(-x^\top M x)$ for some $M \succeq 0$. Then
$$\imath_{\lambda\|\mu}(x) = x^\top M x \tag{6.110}$$
where $\lambda$ denotes the Lebesgue measure on $\mathbb{R}^n$. Therefore,
$$D(P_X \| \mu) + \sum_{j=1}^m c_j h(P_{Y_j}) = -h(P_X) + \sum_{j=1}^m c_j h(P_{Y_j}) + \mathbb{E}[\imath_{\lambda\|\mu}(X)] \tag{6.111}$$
$$= -h(P_X) + \sum_{j=1}^m c_j h(P_{Y_j}) + \mathbb{E}[X^\top M X], \tag{6.112}$$
where $X \sim P_X$. Then (6.109) is established by
\begin{align}
\sup_{f_1, \dots, f_m} H(f_1, \dots, f_m) &= -\inf_{P_X} \Big\{ D(P_X \| \mu) + \sum_{j=1}^m c_j h(P_{Y_j}) \Big\} \tag{6.113}\\
&= -\inf_{\text{Gaussian } P_X} \Big\{ D(P_X \| \mu) + \sum_{j=1}^m c_j h(P_{Y_j}) \Big\} \tag{6.114}\\
&= \sup_{\text{Gaussian } f_1, \dots, f_m} H(f_1, \dots, f_m) \tag{6.115}
\end{align}
where
• (6.113) is from Remark 2.2.3;
• (6.114) is from (6.112) and Corollary 6.2.6;
• (6.115) is essentially a “Gaussian version” of the equivalence of the two inequal-
ities in Theorem 2.2.3, which is easily shown with the same steps in the proof
of Theorem 2.2.3, noting that (2.17) sends Gaussian distributions to Gaussian
functions and (2.24) sends Gaussian functions to Gaussian distributions.
In the case of Gaussian $\mu$ and non-degenerate $(Q_{Y_j|X})$, Corollary 6.2.6 implies that (6.114) is finitely attained by a unique Gaussian $P_X$. By Remark 2.2.3, (6.109) is finitely attained by a unique set of Gaussian functions.
Remark 6.4.1 The proof of the Brascamp-Lieb inequality by Carlen and Cordero-Erausquin [76] also relies on a dual information-theoretic formulation. However, their proof of the differential entropy inequality uses a different approach based on the superadditivity of Fisher information. That approach applies to the case where each $Q_{Y_j|X}$ is a deterministic rank-one linear map, and it requires the problem to be first reduced to a special case called the geometric Brascamp-Lieb inequality (proposed by K. Ball [206]).
6.4.2 Multi-variate Gaussian Hypercontractivity

In this section we show a multivariate extension of Gaussian hypercontractivity. An $m$-tuple of random variables $(X_1, \dots, X_m) \sim Q_{X^m}$ is said to be $(p_1, \dots, p_m)$-hypercontractive, for $p_l \in [1, \infty]$, $l \in \{1, \dots, m\}$, if
$$\mathbb{E}\Big[\prod_{l=1}^m f_l(X_l)\Big] \le \prod_{l=1}^m \|f_l(X_l)\|_{p_l} \tag{6.116}$$
for all bounded real-valued measurable functions $f_l$ defined on $\mathcal{X}_l$, $l \in \{1, \dots, m\}$. Define the hypercontractivity region¹¹
$$\mathcal{R}(Q_{X^m}) := \{(p_1, \dots, p_m) \in \mathbb{R}_+^m \colon (X_1, \dots, X_m) \text{ is } (p_1, \dots, p_m)\text{-hypercontractive}\}. \tag{6.117}$$
By Theorem 2.2.3, the inequality (6.116) holds if and only if
$$D(P_{X^m} \| Q_{X^m}) \ge \sum_{j=1}^m \frac{1}{p_j} D(P_{X_j} \| Q_{X_j}) \tag{6.118}$$
¹¹Note that this definition is similar to the hypercontractivity ribbon defined in [9], but without taking the Hölder conjugate of one of the two exponents, which is more symmetrical and convenient in the multivariate setting.
holds for any $P_{X^m} \ll Q_{X^m}$. In fact, [9] showed that (6.118) is equivalent to the following (which hinges on the fact that the constant term $d = 0$ in (6.118)):
$$I(U; X^m) \ge \sum_{j=1}^m \frac{1}{p_j} I(U; X_j), \quad \forall U. \tag{6.119}$$
In the case of Gaussian $Q_{X^m}$, by Theorem 6.2.5 the inequality (6.119) holds if it holds for all $U$ jointly Gaussian with $X^m$ and having dimension at most $m$. When restricted to such $U$, (6.118) becomes an inequality involving the covariance matrices, and some elementary computations show:

Proposition 6.4.2 Suppose $Q_{X^m} = \mathcal{N}(0, \Sigma)$ where $\Sigma$ is a positive semidefinite matrix whose diagonal values are all 1. Then $p^m \in \mathcal{R}(Q_{X^m})$ if and only if
$$\mathrm{P} \succeq \Sigma \tag{6.120}$$
where $\mathrm{P}$ is the diagonal matrix with $p_j$, $j \in \{1, \dots, m\}$, as its diagonal entries.
Proof Let $A$ be the covariance matrix of $X^m$ conditioned on $U$, and put $C := \mathrm{P}^{-1}$. We will use lowercase letters such as $a_1, \dots, a_m$ to denote the diagonal entries of the corresponding matrices. Then, in view of (6.119), we see that the goal is to show that (6.120) is a necessary and sufficient condition for
$$\log \frac{|\Sigma|}{|A|} \ge \sum_j c_j \log \frac{1}{a_j}, \quad \forall\, 0 \preceq A \preceq \Sigma. \tag{6.121}$$
Let us first assume that $\Sigma$ is invertible. Defining $D := \Sigma - A$, (6.121) rewrites as
$$|\mathrm{I} - \Sigma^{-1} D| \le \prod_j (1 - d_j)^{c_j}, \quad \forall\, 0 \preceq D \preceq \Sigma. \tag{6.122}$$
If (6.120) holds, then $C \preceq \Sigma^{-1}$, and we have
\begin{align}
|\mathrm{I} - \Sigma^{-1} D| &\le |\mathrm{I} - CD| \tag{6.123}\\
&\le \prod_j (1 - c_j d_j) \tag{6.124}\\
&\le \prod_j (1 - d_j)^{c_j} \tag{6.125}
\end{align}
where (6.124) is because $\{1 - c_j d_j\}$ are the diagonal entries of $\mathrm{I} - CD$, which are majorized by the eigenvalues of that matrix. Inequality (6.125) is because $C \preceq \Sigma^{-1}$ and $D \preceq \Sigma$ imply the nonnegativity of $1 - c_j d_j$.

Conversely, if (6.122) holds, we can apply Taylor expansions to both sides of (6.122):
$$1 - \mathrm{Tr}(\Sigma^{-1} D) + o(\|D\|) \le 1 - \sum_j c_j d_j + o(\|D\|), \tag{6.126}$$
where $\|D\|$ can be chosen as, say, the trace norm. Comparing the first-order terms, we see that
$$\mathrm{Tr}\big((\Sigma^{-1} - C) D\big) \ge 0 \tag{6.127}$$
must hold for any $0 \preceq D \preceq \Sigma$. Therefore (6.120) holds.
More generally, if $\Sigma$ is not necessarily invertible, we can consider the inverse of its restriction $\Sigma|_V$, where $V$ is its column space. Then the above arguments still carry through with trivial modifications, and we can show that (6.121) holds if and only if
\begin{align}
&\Sigma|_V^{-1} \succeq C|_V \tag{6.128}\\
\iff\ & \Sigma|_V^{1/2}\, C|_V\, \Sigma|_V^{1/2} \preceq \mathrm{I} \tag{6.129}\\
\iff\ & \Sigma^{1/2} C\, \Sigma^{1/2} = \Sigma^{1/2} \mathrm{P}^{-1} \Sigma^{1/2} \preceq \mathrm{I} \tag{6.130}\\
\iff\ & \mathrm{P}^{-1/2}\, \Sigma\, \mathrm{P}^{-1/2} \preceq \mathrm{I} \tag{6.131}\\
\iff\ & \Sigma \preceq \mathrm{P} \tag{6.132}
\end{align}
where (6.131) is because $\Sigma^{1/2} \mathrm{P}^{-1} \Sigma^{1/2}$ and $\mathrm{P}^{-1/2} \Sigma \mathrm{P}^{-1/2}$ have the same set of eigenvalues.
Remark 6.4.2 When $m = 2$, Proposition 6.4.2 reduces to Nelson's hypercontractivity theorem for a pair of Gaussian scalar random variables $X_1$ and $X_2$: that is, $(p_1, p_2) \in \mathcal{R}(Q_{X^2})$ if and only if
$$(p_1 - 1)(p_2 - 1) \ge \rho^2(X_1; X_2), \tag{6.133}$$
where $\rho^2$ denotes the squared Pearson correlation coefficient.
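For $m = 2$ the equivalence between (6.120) and (6.133) can be verified mechanically: $\mathrm{P} - \Sigma$ has entries $p_1 - 1$, $p_2 - 1$ on the diagonal and $-\rho$ off the diagonal, and (for $p_1, p_2 \ge 1$) it is positive semidefinite exactly when its determinant $(p_1-1)(p_2-1) - \rho^2$ is nonnegative. A randomized check (ranges and seed are arbitrary test choices):

```python
import numpy as np

rng = np.random.default_rng(2)
for _ in range(1000):
    rho = rng.uniform(-0.99, 0.99)
    p1, p2 = rng.uniform(1.0, 4.0, size=2)
    Sigma = np.array([[1.0, rho], [rho, 1.0]])
    P = np.diag([p1, p2])
    psd = np.linalg.eigvalsh(P - Sigma).min() >= 0    # condition (6.120)
    nelson = (p1 - 1.0) * (p2 - 1.0) >= rho ** 2      # condition (6.133)
    assert psd == nelson
```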
6.4.3 $T_2$ Inequality for Gaussian Measures

Consider the Euclidean space endowed with the $\ell_2$-norm, $(\mathbb{R}^n, \|\cdot\|_2)$. Talagrand [207] showed that the standard Gaussian measure $Q = \mathcal{N}(0, \mathrm{I}_n)$ satisfies the $T_2(1)$ inequality (see Definition 2.5.1). Below we give a new proof using the Gaussian optimality in the forward-reverse Brascamp-Lieb inequality. By the tensorization of the $T_2$ inequality [137], it suffices to prove the $n = 1$ case. By continuity, it suffices to prove $T_2(\lambda)$ for any $\lambda > 1$. Moreover, as one can readily check, when $\lambda > 1$ and $Q = \mathcal{N}(0, 1)$, (2.168) is satisfied for all Gaussian $P$ (in which case the optimal coupling in (2.168) is also Gaussian), so it suffices to prove Gaussian extremisability in (2.168).
Let $\lambda \in (1, +\infty)$, $b_2 \in (0, +\infty)$, and
$$F_0(P_{X_1}, P_{X_2}) := \inf_{P_{X_1X_2}} \mathbb{E}[|X_1 - X_2|^2] - 2\lambda D(P_{X_1} \| Q) - b_2 D(P_{X_2} \| Q), \tag{6.134}$$
where the infimum is over couplings $P_{X_1X_2}$ of $P_{X_1}$ and $P_{X_2}$. Using the rotation invariance argument/doubling trick in the proof of Theorem 6.3.1, we can show that if $(P_{X_1}, P_{X_2})$ maximizes $F_0(P_{X_1}, P_{X_2})$, then they must be Gaussian, and the optimal coupling $P_{X_1X_2}$ is also Gaussian¹². Letting $b_2 \to \infty$, we see the Gaussian optimality in the $T_2$ inequality. To make the above argument rigorous, we need to take care of two technical issues:

(a) We approximated the functional (2.175) with $b_2 D(P_{X_2} \| Q)$ and let $b_2 \to \infty$, but we want the Gaussian optimality to continue to hold when the last term in (6.134) is exactly (2.175).

(b) We want to show the existence of a maximizer $(P_{X_1}, P_{X_2})$ for (6.134).

While it might be possible to provide a formal justification of the limit argument in (a), a slicker way is to circumvent it by working directly with (2.175) instead of $b_2 D(P_{X_2} \| Q)$ with $b_2 \to \infty$. From the tensorization of the functional (6.134), it is relatively easy to distill the tensorization of the $T_2$ inequality (see also [137] for a direct proof of the tensorization of the $T_2$ inequality), and then use the rotation invariance argument/doubling trick to conclude Gaussian optimality.
¹²In Theorem 6.3.1 the first term of the objective function is the infimum of the relative entropy, rather than the infimum of the expectation of a quadratic cost function. However, the argument in Theorem 6.3.1 also works in the latter case, since the expectation functional has a similar tensorization property, and the quadratic cost function also has a rotational invariance property.
As for (b), note that if $b_2 D(P_{X_2} \| Q)$ is replaced with (2.175), we want to show the existence of a maximizer $(P_{X_1}, Q)$ for (6.134). If $\sigma^2 := \mathbb{E}[|X_1|^2]$, then
\begin{align}
\inf_{P_{X_1X_2}} \mathbb{E}[|X_1 - X_2|^2] &\le \sigma^2 + \mathbb{E}[|X_2|^2] \tag{6.135}\\
&= \sigma^2 + 1. \tag{6.136}
\end{align}
On the other hand,
$$D(P_{X_1} \| Q) = D(P_{X_1} \| \mathcal{N}(0, \sigma^2)) + \mathbb{E}[\imath_{\mathcal{N}(0,\sigma^2)\|Q}(X_1)] \ge \frac{\sigma^2}{2} + o(\sigma^2). \tag{6.137}$$
Therefore, if $\lambda > 1$ and $(P_{X_1}^{(t)})_{t=1}^\infty$ is a supremizing sequence, then $(P_{X_1}^{(t)})_{t=1}^\infty$ must have bounded second moments, hence must be tight. Thus Prokhorov's theorem implies the existence of a subsequence weakly converging to some $P^\star_{X_1}$, and a semicontinuity argument similar to Proposition 6.3.3 shows that $P^\star_{X_1}$ is in fact a maximizer.
6.4.4 Wyner’s Common Information for m Dependent Ran-
dom Variables
Wyner’s common information for m dependent random variables X1, . . . , Xm is com-
monly defined as [208]
CpXmq :“ inf
UIpU ;Xm
q (6.138)
where the infimum is over PU |Xm such thatX1, . . . , Xm are independent conditioned on
U . Previously, to the best of our knowledge, the common information for m Gaussian
scalars X1, . . . , Xm could only be obtained in the special case where the correlation
coefficient between Xi and Xj are equal for all 1 ď i, j ď m [208, Corollary 1] via a
different approach.
Using Theorem 6.2.5 and setting $X \leftarrow X^m$ and $Y_j \leftarrow X_j$, we immediately obtain the following characterization of the multivariate common information for Gaussian sources:

Theorem 6.4.3 The common information of $m$ Gaussian scalar random variables $X_1, \dots, X_m$ with covariance matrix $\Sigma > 0$ is given by
$$C(X^m) = \frac{1}{2} \inf_{\Lambda} \log \frac{|\Sigma|}{|\Lambda|} \tag{6.139}$$
where the infimum is over all diagonal matrices $\Lambda$ satisfying $\Lambda \preceq \Sigma$. More generally, when $\Sigma$ is not necessarily invertible, $\Sigma$ and $\Lambda$ in (6.139) should be replaced by the restrictions $\Sigma|_V$ and $\Lambda|_V$ to the column space $V$ of $\Sigma$.
Remark 6.4.3 After we completed a draft of the journal version of this work [93], Jun Chen and Chandra Nair independently found and showed us a simple proof of Theorem 6.4.3 that does not invoke Theorem 6.2.5: using the fact that the Gaussian distribution maximizes differential entropy under a covariance constraint, and the concavity of the log-determinant function,
\begin{align}
I(U; X^m) &= h(X^m) - h(X^m \mid U) \tag{6.140}\\
&\ge \frac{1}{2} \log |\Sigma| - \mathbb{E}\Big[\frac{1}{2} \log |\mathrm{Cov}(X|U)|\Big] \tag{6.141}\\
&\ge \frac{1}{2} \log |\Sigma| - \frac{1}{2} \log |\Sigma_{X|U}|. \tag{6.142}
\end{align}
Since $\Sigma_{X|U} \preceq \Sigma$ and $\Sigma_{X|U}$ is diagonal, this establishes the nontrivial "$\ge$" part of (6.139) in the case of invertible $\Sigma$. This argument is also related to the proof of Theorem 6.2.5, in the sense that both convert a mutual information optimization problem (without a covariance constraint) into a conditional differential entropy optimization problem with a covariance constraint.
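For $m = 2$, (6.139) can be evaluated by brute force and compared against the known closed form $\frac{1}{2}\log\frac{1+\rho}{1-\rho}$ for a pair of unit-variance Gaussians with correlation $\rho$; the grid resolution below is an arbitrary test choice:

```python
import numpy as np

def common_information(rho: float, grid: int = 200) -> float:
    """Evaluate (6.139) for Sigma = [[1, rho], [rho, 1]] by brute force
    over diagonal Lambda = diag(a, b) with 0 < Lambda <= Sigma."""
    Sigma = np.array([[1.0, rho], [rho, 1.0]])
    best = np.inf
    for a in np.linspace(1e-3, 1.0, grid):
        for b in np.linspace(1e-3, 1.0, grid):
            # feasibility: Sigma - Lambda must be positive semidefinite
            if np.linalg.eigvalsh(Sigma - np.diag([a, b])).min() >= 0:
                best = min(best, 0.5 * np.log(np.linalg.det(Sigma) / (a * b)))
    return best

rho = 0.5
closed_form = 0.5 * np.log((1 + rho) / (1 - rho))
assert abs(common_information(rho) - closed_form) < 0.02
```

The infimum is attained at $\Lambda = (1-\rho)\,\mathrm{I}$, so $\frac12\log\frac{1-\rho^2}{(1-\rho)^2} = \frac12\log\frac{1+\rho}{1-\rho}$, which the grid search recovers up to discretization error.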
6.4.5 Key Generation with an Omniscient Helper
[Figure 6.1: Key generation with an omniscient helper. The omniscient helper $T_0$ observes $X^m$ and sends message $W_l$ to terminal $T_l$; terminal $T_l$ observes $X_l$ and computes key $K_l$, while $T_0$ computes $K$.]
As an example of applications to network information theory, we give a simple characterization of the achievable rate region for secret key generation with an omniscient helper [101] in the case of stationary memoryless Gaussian sources. This problem can be viewed as the omniscient CR generation problem in Chapter 2 with an additional secrecy constraint. For completeness we formulate the problem below. Let $Q_{X^m}$ be the per-letter joint distribution of the sources $X_1, \dots, X_m$. As in Figure 6.1, the terminals $T_1, \dots, T_m$ observe i.i.d. realizations of $X_1, \dots, X_m$, respectively, whereas the omniscient helper $T_0$ has access to all the sources. Suppose the terminals perform block coding with length $n$. The communicator computes the integers $W_1((X^m)^n), \dots, W_m((X^m)^n)$, possibly stochastically, and sends them to $T_1, \dots, T_m$, respectively. Then, all the terminals calculate integers $K((X^m)^n)$, $K_1((X_1)^n, W_1)$, ..., $K_m((X_m)^n, W_m)$, possibly stochastically. The goal is to make $K = K_1 = \cdots = K_m$ with high probability, and $K$ almost equiprobable and independent of each message $W_l$. In other words, we want to minimize the following quantities:
$$\epsilon_n = \max_{1 \le l \le m} \mathbb{P}[K \ne K_l], \tag{6.143}$$
$$\nu_n = \max_{1 \le l \le m} \{\log |\mathcal{K}| - H(K \mid W_l)\}. \tag{6.144}$$
Thus, the difference from the omniscient helper CR generation problem is that here $K$ needs to be (asymptotically) independent of $W_l$, $l = 1, \dots, m$.

An $(m+1)$-tuple $(R, R_1, \dots, R_m)$ is said to be achievable if a sequence of key generation schemes can be designed to fulfill the following conditions:
\begin{align}
\liminf_{n\to\infty} \frac{1}{n} \log |\mathcal{K}| &\ge R; \tag{6.145}\\
\limsup_{n\to\infty} \frac{1}{n} \log |\mathcal{W}_l| &\le R_l, \quad l \in \{1, \dots, m\}; \tag{6.146}\\
\lim_{n\to\infty} \epsilon_n &= 0; \tag{6.147}\\
\lim_{n\to\infty} \nu_n &= 0. \tag{6.148}
\end{align}
Notice that a small $\nu_n$ does not imply that $K$ is nearly independent of all the messages $W^m$ jointly; the problem appears to be harder to solve if the stronger requirement that
$$\lim_{n\to\infty} \big(\log |\mathcal{K}| - H(K \mid W^m)\big) = 0 \tag{6.149}$$
is imposed in place of (6.148) [101].
Theorem 6.4.4 [101] The set of achievable rates is the closure of
$$\bigcup_{Q_{U|X^m}} \left\{ (R, R_1, \dots, R_m)\ :\ \begin{array}{l} R \le \min\{I(U; X^m), H(X_1), \dots, H(X_m)\};\\ R_l \ge I(U; X^m) - I(U; X_l), \quad 1 \le l \le m \end{array} \right\}. \tag{6.150}$$
A priori, computing the rate region from (6.150) requires solving a possibly infinite-dimensional optimization. However, using Theorem 6.2.5 we easily see that the problem reduces to a matrix optimization in the case of Gaussian sources:
Theorem 6.4.5 For stationary memoryless sources whose per-letter joint distribution is jointly Gaussian with non-degenerate covariance matrix $\Sigma$, the achievable region for omniscient helper key generation can be represented as the closure of
$$\bigcup_{0 \preceq \Sigma' \preceq \Sigma} \left\{ (R, R_1, \dots, R_m)\ :\ \begin{array}{l} R \le \frac{1}{2} \log \frac{|\Sigma|}{|\Sigma'|};\\[2pt] R_l \ge \frac{1}{2} \log \frac{|\Sigma|}{|\Sigma'|} - \frac{1}{2} \log \frac{\Sigma_{ll}}{\Sigma'_{ll}}, \quad 1 \le l \le m \end{array} \right\}. \tag{6.151}$$
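To illustrate (6.151), one can evaluate the corner point produced by a particular feasible $\Sigma'$; the parametrization $\Sigma' = \alpha\Sigma$ with $0 < \alpha \le 1$ below is our arbitrary test choice, not a claim about the optimal $\Sigma'$:

```python
import numpy as np

def rate_point(Sigma, Sigma_p):
    """Corner of (6.151) for a fixed Sigma': the largest key rate R and
    the smallest communication rates R_l (in nats)."""
    R = 0.5 * np.log(np.linalg.det(Sigma) / np.linalg.det(Sigma_p))
    Rl = [R - 0.5 * np.log(Sigma[l, l] / Sigma_p[l, l])
          for l in range(Sigma.shape[0])]
    return R, Rl

Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
alpha = 0.25                          # Sigma' = alpha * Sigma is feasible
R, Rl = rate_point(Sigma, alpha * Sigma)
# Since det(alpha * Sigma) = alpha^2 det(Sigma) for m = 2, here
# R = log(1/alpha) and each R_l = (1/2) log(1/alpha).
```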
As alluded to, omniscient helper CR generation is the related problem in which there is no secrecy (independence) constraint. For CR generation, [16, Theorem 4.2] showed that the achievable region is similar to that of Theorem 6.4.4, except that the first bound in (6.150) is replaced by
$$R \le I(U; X^m). \tag{6.152}$$
Note that such a characterization is equivalent to the supporting hyperplane characterization we gave in (2.40). In particular, for continuous sources, which have infinite Shannon entropy, the achievable region is the same for CR generation and key generation!

We remark that despite the superficial similarity of the achievable regions for CR and key generation, the achievability part of Theorem 6.4.4 requires a more sophisticated coding technique to guarantee secrecy. We observe that Theorem 6.4.5 also applies to CR generation from Gaussian sources, because in that case $H(X_j) = \infty$ and hence does not effectively change the bound on $R$.
As we introduced in Chapter 2, a generalization of the omniscient helper problem is the one-communicator problem in [101], where in Figure 6.1 the terminal $T_0$ does not see all the random variables $X^m$ but instead another random variable $Z$ which can be arbitrarily correlated with $X^m$. The achievable rate region for one-communicator CR generation is known [16, Theorem 4.2] to be the closure of
$$\bigcup_{Q_{U|Z}} \left\{ (R, R_1, \dots, R_m)\ :\ \begin{array}{l} R \le I(U; Z);\\ R_l \ge I(U; Z) - I(U; X_l), \quad 1 \le l \le m \end{array} \right\} \tag{6.153}$$
and obviously we can use Theorem 6.2.5 to reduce (6.153) to a matrix optimization problem in the case of Gaussian $(Z, X_1, \dots, X_m)$. The achievable rate region for key generation with one communicator is also derived in [101], but the expression of that region is more complicated, involving $m+1$ auxiliary random variables, and it is not immediate to conclude that Gaussian auxiliary random variables suffice merely using Theorem 6.2.5.
6.5 Appendix: Existence of Maximizer in Theorem 6.2.1

Proposition 6.5.1 In the non-degenerate case, for any $\Sigma \succeq 0$,
$$\phi(\Sigma) := \sup_{P_{UX}\colon \Sigma_{X|U} \preceq \Sigma} F(P_{UX}) \tag{6.154}$$
is finite and is attained by some $P_{UX}$ with $|\mathcal{U}| < \infty$.

Proof First, observe that if we let $\bar{\phi}(\cdot)$ be the supremum in (6.154) with the additional restriction that $|\mathcal{U}| < \infty$, then $\bar{\phi}(\cdot)$ is a concave function on a convex set of finite dimension. Hence Jensen's inequality¹³ implies that $\phi(\cdot) \le \bar{\phi}(\cdot)$, while $\bar{\phi}(\cdot) \le \phi(\cdot)$ is obvious from the definition. Thus $\phi(\cdot) = \bar{\phi}(\cdot)$.

¹³Luckily, $\bar{\phi}$ is defined on a finite-dimensional set of matrices (rather than a possibly infinite-dimensional set of distributions $P_X$). In the infinite-dimensional case without further continuity assumptions, Jensen's inequality can fail; see the example in [209, equation (1.3)].
The set
$$\mathcal{C} := \bigcup_{P_X} \{(F_0(P_X), \mathrm{Cov}(X))\} \tag{6.155}$$
lies in a linear space of dimension $1 + \frac{\dim(\mathcal{X})(\dim(\mathcal{X})+1)}{2}$. By Carathéodory's theorem [94, Theorem 17.1], each point in the convex hull of $\mathcal{C}$ is a convex combination of at most $D := 2 + \frac{\dim(\mathcal{X})(\dim(\mathcal{X})+1)}{2}$ points in $\mathcal{C}$:
$$\bigcup_{P_{UX}\colon \mathcal{U} \text{ finite}} \{(F(P_{UX}), \Sigma_{X|U})\} = \bigcup_{P_{UX}\colon |\mathcal{U}| \le D} \{(F(P_{UX}), \Sigma_{X|U})\}, \tag{6.156}$$
hence
$$\phi(\Sigma) = \sup_{P_{UX}\colon \mathcal{U} = \{1, \dots, D\},\ \Sigma_{X|U} \preceq \Sigma} F(P_{UX}). \tag{6.157}$$
Now suppose $\{P_{U_n X_n}\}_{n\ge1}$ is a sequence satisfying $\Sigma_{X_n|U_n} \preceq \Sigma$, $\mathcal{U}_n = \{1, \dots, D\}$ for each $n$, and
$$\lim_{n\to\infty} F(P_{U_n X_n}) = \text{(6.157)}. \tag{6.158}$$
We can assume without loss of generality that $P_{U_n}$ converges to some $P_{U^\star}$, since otherwise we can pass to a convergent subsequence instead. Moreover, by translation invariance we can assume without loss of generality that
$$\mathbb{E}[X_n \mid U_n = u] = 0 \tag{6.159}$$
for each $u$ and $n$.
If $u \in \{1, \dots, D\}$ is such that $P_{U^\star}(u) > 0$, then for $n$ sufficiently large we have $P_{U_n}(u) > \frac{P_{U^\star}(u)}{2}$ and
$$\mathrm{Cov}(X_n \mid U_n = u) \preceq \frac{2\Sigma}{P_{U^\star}(u)}. \tag{6.160}$$
Thus $\{P_{X_n|U_n=u}\}_{n\ge1}$ is a tight sequence of measures by Chebyshev's inequality, and Prokhorov's theorem [109] guarantees the existence of a subsequence of $\{P_{X_n|U_n=u}\}_{n\ge1}$ converging weakly to some Borel measure $P_{X_u^\star}$. We might as well assume that $\{P_{X_n|U_n=u}\}_{n\ge1}$ converges to $P_{X_u^\star}$, since otherwise we pass to a convergent subsequence instead. This argument can be applied to each $u \in \{1, \dots, D\}$ satisfying $P_{U^\star}(u) > 0$ iteratively, hence we can assume the existence of the weak limits
$$\lim_{n\to\infty} P_{X_n|U_n=u} = P_{X_u^\star} \tag{6.161}$$
for all such $u$. Next, we show that
$$\limsup_{n\to\infty} F_0(P_{X_n|U_n=u}) \le F_0(P_{X_u^\star}) \tag{6.162}$$
for all such $u$. Using Lemma 6.5.2 below, we obtain
$$\limsup_{n\to\infty} h(X_n \mid U_n = u) \le h(X_u^\star). \tag{6.163}$$
Because of the moment constraint (6.160), the differential entropy of the output distribution, which is smoothed by the Gaussian kernel, enjoys weak continuity in the input distribution (see e.g. [92, Proposition 18], [204, Theorem 7], or [205, Theorems 1 and 2]):
$$\lim_{n\to\infty} h(Y_{jn} \mid U_n = u) = h(Y_j^\star \mid U^\star = u) \quad \text{for each } u \in \{1, \dots, D\} \tag{6.164}$$
where $(U_n, X_n, Y_{jn}) \sim P_{X_n U_n} Q_{Y_j|X}$ and $(U^\star, X^\star, Y_j^\star) \sim P_{X^\star U^\star} Q_{Y_j|X}$. As for the trace term, consider
\begin{align}
\liminf_{n\to\infty} \mathrm{Tr}[M\, \mathrm{Cov}(X_n \mid U_n = u)] &= \liminf_{n\to\infty} \mathbb{E}[\mathrm{Tr}[M X_n X_n^\top] \mid U_n = u] \tag{6.165}\\
&= \liminf_{n\to\infty} \sup_{K > 0} \mathbb{E}[\mathrm{Tr}[M X_n X_n^\top] \wedge K \mid U_n = u] \tag{6.166}\\
&\ge \sup_{K > 0} \liminf_{n\to\infty} \mathbb{E}[\mathrm{Tr}[M X_n X_n^\top] \wedge K \mid U_n = u] \tag{6.167}\\
&\ge \sup_{K > 0} \mathbb{E}[\mathrm{Tr}[M X_u^\star X_u^{\star\top}] \wedge K] \tag{6.168}\\
&= \mathbb{E}[\mathrm{Tr}[M X_u^\star X_u^{\star\top}]] \tag{6.169}
\end{align}
where "$\wedge$" takes the minimum of two numbers, (6.166) and (6.169) follow from the monotone convergence theorem, and (6.168) uses the weak convergence (6.161). The proof of (6.162) is finished by combining (6.163), (6.164) and (6.169).
The final step deals with any $u \in \{1, \dots, D\}$ satisfying $P_{U^\star}(u) = 0$. The variance constraint implies that
$$\mathrm{Cov}(X_n \mid U_n = u) \preceq \frac{1}{P_{U_n}(u)} \Sigma, \tag{6.170}$$
hence, by the fact that the Gaussian distribution maximizes differential entropy under a covariance constraint, we have the bound
\begin{align}
h(X_n \mid U_n = u) &\le \frac{\dim(\mathcal{X})}{2} \log(2\pi e) + \frac{1}{2} \log \Big| \frac{1}{P_{U_n}(u)} \Sigma \Big| \tag{6.171}\\
&= \frac{\dim(\mathcal{X})}{2} \log(2\pi e) + \frac{1}{2} \log |\Sigma| + \frac{\dim(\mathcal{X})}{2} \log \frac{1}{P_{U_n}(u)}. \tag{6.172}
\end{align}
This, combined with the fact that $h(Y_{jn} \mid U_n = u) \ge h(Y_{jn} \mid X_n)$ is bounded below for $j = 1, \dots, m$ in the non-degenerate case, implies that if $P_{U_n}(u)$ converges to zero, then
$$\limsup_{n\to\infty} P_{U_n}(u)\, F_0(P_{X_n|U_n=u}) \le 0. \tag{6.173}$$
Combining (6.162) and (6.173), we see
$$F(P_{U^\star X^\star}) \ge \limsup_{n\to\infty} F(P_{U_n X_n}), \tag{6.174}$$
where $P_{X^\star|U^\star=u} := P_{X_u^\star}$ for each $u = 1, \dots, D$.
Lemma 6.5.2 Suppose $(P_{X_n})$ is a sequence of distributions on $\mathbb{R}^d$ converging weakly to $P_{X^\star}$, and
$$\mathbb{E}[X_n X_n^\top] \preceq \Sigma \tag{6.175}$$
for all $n$. Then
$$\limsup_{n\to\infty} h(X_n) \le h(X^\star). \tag{6.176}$$

The result fails without the condition (6.175). Also, related results in which the weak convergence is replaced with pointwise convergence of density functions, under certain additional constraints, were shown in [205, Theorems 1 and 2] (see also the proof of [92, Theorem 5]). Those results are not applicable here, since the density functions of $X_n$ do not converge pointwise. They are applicable to the problems discussed in [92] because the density functions of the output of the Gaussian random transformation enjoy many nice properties due to the smoothing effect of the "good kernel".
Proof It is well known that in metric spaces and for probability measures, the relative entropy is weakly lower semicontinuous (cf. [43]). This fact and a scaling argument immediately show that, for any $r > 0$,
$$\limsup_{n\to\infty} h(X_n \mid \|X_n\| \le r) \le h(X^\star \mid \|X^\star\| \le r). \tag{6.177}$$
Let $p_n(r) := \mathbb{P}[\|X_n\| > r]$; then (6.175) implies
$$\mathbb{E}[X_n X_n^\top \mid \|X_n\| > r] \preceq \frac{1}{p_n(r)} \Sigma. \tag{6.178}$$
Therefore, since the Gaussian distribution maximizes differential entropy given a second moment upper bound, we have
$$h(X_n \mid \|X_n\| > r) \le \frac{1}{2} \log \frac{(2\pi e)^d\, |\Sigma|}{p_n^d(r)}. \tag{6.179}$$
Since $\lim_{r\to\infty} \sup_n p_n(r) = 0$ by (6.175) and Chebyshev's inequality, the above implies that
$$\lim_{r\to\infty} \sup_n\, p_n(r)\, h(X_n \mid \|X_n\| > r) = 0. \tag{6.180}$$
The desired result follows from (6.177), (6.180) and the fact that
$$h(X_n) = p_n(r)\, h(X_n \mid \|X_n\| > r) + (1 - p_n(r))\, h(X_n \mid \|X_n\| \le r) + h(p_n(r)), \tag{6.181}$$
where $h(\cdot)$ in the last term denotes the binary entropy function.
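The decomposition (6.181) can be checked numerically for a standard normal in $d = 1$ (the truncation at $\pm 12$, the threshold $r = 1$, and the grid size are test choices; $h(p)$ in the last term is the binary entropy):

```python
import numpy as np
from math import erf, log, pi, sqrt, e

def trap(y, x):
    # simple trapezoidal rule (self-contained, no scipy needed)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

f = lambda x: np.exp(-x ** 2 / 2) / sqrt(2 * pi)   # N(0,1) density
h_exact = 0.5 * log(2 * pi * e)                    # h(X) for X ~ N(0,1)

r = 1.0
p = 1 - erf(r / sqrt(2))                           # p(r) = P[|X| > r]

def h_region(intervals):
    """Differential entropy of X conditioned on a union of intervals,
    computed by numerical integration of the conditional density."""
    xs = [np.linspace(a, b, 100001) for a, b in intervals]
    mass = sum(trap(f(x), x) for x in xs)
    h = 0.0
    for x in xs:
        g = f(x) / mass                            # conditional density
        h += -trap(g * np.log(g), x)
    return h

h_in = h_region([(-r, r)])                         # h(X | |X| <= r)
h_out = h_region([(-12.0, -r), (r, 12.0)])         # h(X | |X| >  r)
hb = -p * log(p) - (1 - p) * log(1 - p)            # binary entropy h(p)
assert abs(h_exact - (p * h_out + (1 - p) * h_in + hb)) < 1e-5   # (6.181)
```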
Chapter 7
Conclusion and Future Work
Let us recall the main idea of the thesis: information measures (or their linear combi-
nation) generally have two equivalent characterizations: the functional representation
and the entropic representation. The existence of such equivalent characterizations
can be understood in terms of the duality of convex optimizations, even though tech-
nically speaking the optimization problem is not always convex (consider the case
where the linear combination of convex information measures may contain both pos-
itive and negative coefficients). As far as we have seen, the functional representation
is very useful in proving non-asymptotic bounds for operational problems. In particular, the reverse hypercontractivity argument introduced in Chapter 4 appears to be tailor-made for the functional representation. On the other hand, the entropic formulation appears more convenient in proving certain abstract properties such as data
processing, tensorization and the Gaussian optimality, leveraging tools familiar to
information theorists such as the chain rule.
Below, we mention two scenarios of common randomness generation where the
extension of the proposed functional approach appears challenging. These scenarios
are related to some of the author’s publications which have not been discussed in the
main text. The strong converse counterpart of these results appears to be open.
• Rate-limited interactive communication. In the literature, a very common protocol for common randomness generation allows terminals to communicate with each other [210], in contrast to the one-communicator protocol we discussed
(Figure 1.1) where only one terminal is allowed to send messages to others. We
remark that without communication rate constraints, the maximum secret key
rate has known single-letter expressions [210][211] and second-order rate results
[55]. However when the communication rate is limited, we can only characterize
the tradeoff between the communication rate and the secret key (or common
randomness) rate using the same number of auxiliary random variables as the
number of rounds of communications [102]. Does the strong converse hold in
this setting? The challenge appears to be that no corresponding functional in-
equality is known when two or more auxiliary random variables are involved. A
related open problem is the source-channel network with two or more helpers:
not only the strong converse but even the first-order rate region is generally
unknown [8].
• Demanding secrecy. If we further impose that the common randomness generated in the one-communicator problem (Figure 1.1) is independent of the public communications (i.e. the secret key generation problem), can we still
show a strong converse or even a second-order result? The secret key generation problem is formally related to the wiretap channel (see the remark in [18]). We note that the second-order rate for the wiretap channel has been open for non-degenerate channels [212], in which case the rate region involves an auxiliary random variable and is very similar to the rate region of the one-communicator problem, whose second-order rate we solved in Chapter 4.
Bibliography
[1] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, July 1948.
[2] C. E. Shannon, “The bandwagon,” IRE Trans. Inf. Theory, vol. 2, no. 1, p. 3, 1956.
[3] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2nd ed., 2012.
[4] S. Verdu, “Fifty years of Shannon theory,” IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2057–2078, Oct. 1998.
[5] P. Gacs and J. Korner, “Common information is far less than mutual information,” Problems of Control and Information Theory, vol. 2, no. 2, pp. 149–162, 1973.
[6] A. D. Wyner, “The common information of two dependent random variables,” IEEE Trans. Inf. Theory, vol. 21, no. 2, pp. 163–179, Mar. 1975.
[7] R. Ahlswede and P. Gacs, “Spreading of sets in product spaces and hypercontraction of the Markov operator,” The Annals of Probability, vol. 4, pp. 925–939, 1976.
[8] I. Csiszar and J. Korner, Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, 2nd ed., 2011.
[9] V. Anantharam, A. Gohari, S. Kamath, and C. Nair, “On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover,” arXiv preprint arXiv:1304.6133, 2013.
[10] H. O. Hirschfeld, “A connection between correlation and contingency,” Proc. Cambridge Philosophical Soc., vol. 31, pp. 520–524, 1935.
[11] H. Gebelein, “Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichungsrechnung,” Zeitschrift fur angew. Math. und Mech., vol. 21, pp. 364–379, 1941.
[12] A. Renyi, “On measures of dependence,” Acta Math. Hung., vol. 10, no. 3-4, pp. 441–451, 1959.
[13] C. E. Shannon, “Coding theorems for a discrete source with a fidelity criterion,” IRE Int. Conv. Rec., vol. 7, part 4, pp. 142–163, 1959.
[14] U. M. Maurer, “Secret key agreement by public discussion from common information,” IEEE Trans. Inf. Theory, vol. 39, no. 3, pp. 733–742, Mar. 1993.
[15] R. Ahlswede and I. Csiszar, “Common randomness in information theory and cryptography. I. Secret sharing,” IEEE Trans. Inf. Theory, vol. 39, no. 4, pp. 1121–1132, Apr. 1993.
[16] R. Ahlswede and I. Csiszar, “Common randomness in information theory and cryptography. II. CR capacity,” IEEE Trans. Inf. Theory, vol. 44, no. 1, pp. 225–240, Jan. 1998.
[17] C. E. Shannon, “Two-way communication channels,” in Proc. 4th Berkeley Symp. Math. Statist. Probab., vol. 1, pp. 611–644, 1961.
[18] A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge University Press, 2011.
[19] C. Shannon, “The zero error capacity of a noisy channel,” IRE Trans. Inf. Theory, vol. 2, no. 3, pp. 8–19, 1956.
[20] C. E. Shannon, R. G. Gallager, and E. Berlekamp, “Lower bounds to error probability for coding on discrete memoryless channels, I,” Information and Control, vol. 10, no. 1, pp. 65–103, 1967.
[21] G. Forney Jr., “Exponential error bounds for erasure, list, and decision feedback schemes,” IEEE Trans. Inf. Theory, vol. 14, no. 2, pp. 206–220, Mar. 1968.
[22] R. Ahlswede and G. Dueck, “Identification via channels,” IEEE Trans. Inf. Theory, vol. 35, no. 1, pp. 15–29, 1989.
[23] R. Ahlswede, N. Cai, and Z. Zhang, “Erasure, list, and detection zero error capacities for low noise and a relation to identification,” IEEE Trans. Inf. Theory, vol. 42, no. 1, pp. 55–62, Jan. 1996.
[24] J. Korner and A. Orlitsky, “Zero-error information theory,” IEEE Trans. Inf. Theory, vol. 44, pp. 2207–2229, Oct. 1998.
[25] I. Csiszar and J. Korner, Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, 1st ed., 1981.
[26] J. Wolfowitz, “Notes on a general strong converse,” Information and Control, vol. 12, pp. 1–4, 1968.
[27] R. Ahlswede, P. Gacs, and J. Korner, “Bounds on conditional probabilities with applications in multi-user communication,” Z. Wahrscheinlichkeitstheorie verw. Gebiete, vol. 34, no. 2, pp. 157–177, 1976. Correction (1977), ibid., vol. 39, no. 4, pp. 353–354.
[28] R. M. Fano, “Transmission of information,” unpublished course notes, Massachusetts Institute of Technology, Cambridge, MA, 1952.
[29] M. Braverman and A. Rao, “Information equals amortized communication,” IEEE Trans. Inf. Theory, vol. 60, no. 10, pp. 6058–6069, 2014.
[30] A. Orlitsky, N. Santhanam, and J. Zhang, “Relative redundancy for large alphabets,” in Proc. of IEEE International Symposium on Information Theory, pp. 2672–2676, Seattle, WA, July 2006.
[31] B. Kelly, Information Theory with Large Alphabets and Source Coding Error Exponents. PhD thesis, Cornell University, 2011.
[32] Y. Polyanskiy, “A perspective on massive random-access,” in Proc. IEEE International Symposium on Information Theory, Aachen, Germany, July 2017.
[33] V. Strassen, “Asymptotische Abschatzungen in Shannons Informationstheorie,” in Trans. 3rd Prague Conf. Inf. Theory, pp. 689–723, Prague, 1962.
[34] Y. Polyanskiy, H. V. Poor, and S. Verdu, “Channel coding rate in the finite blocklength regime,” IEEE Trans. Inf. Theory, vol. 56, no. 5, pp. 2307–2359, May 2010.
[35] M. Hayashi, “Information spectrum approach to second-order coding rate in channel coding,” IEEE Trans. Inf. Theory, vol. 55, pp. 4947–4966, Nov. 2009.
[36] V. Kostina and S. Verdu, “Fixed-length lossy compression in the finite blocklength regime,” IEEE Trans. Inf. Theory, vol. 58, no. 6, pp. 3309–3338, June 2012.
[37] M. H. Yassaee, M. R. Aref, and A. Gohari, “A technique for deriving one-shot achievability results in network information theory,” arXiv preprint arXiv:1303.0696, 2013.
[38] S. Watanabe, S. Kuzuoka, and V. Y. F. Tan, “Nonasymptotic and second-order achievability bounds for coding with side-information,” IEEE Trans. Inf. Theory, vol. 61, Apr. 2015.
[39] P. Cuff, Communication in Networks for Coordinating Behavior. PhD thesis, Stanford University, 2009.
[40] V. Y. F. Tan and O. Kosut, “On the dispersions of three network information theory problems,” IEEE Trans. Inf. Theory, vol. 60, pp. 881–903, Feb. 2014.
[41] S. Verdu and T. S. Han, “A general formula for channel capacity,” IEEE Trans. Inf. Theory, vol. 40, pp. 1147–1157, July 1994.
[42] Y. Polyanskiy and S. Verdu, “Arimoto channel coding converse and Renyi divergence,” in 48th Annual Allerton Conference on Communication, Control, and Computing, pp. 1327–1333, Monticello, Illinois, Oct. 2010.
[43] S. Verdu, Information Theory. Princeton University Press (in preparation).
[44] T. S. Han, Information-Spectrum Methods in Information Theory. Springer-Verlag Berlin Heidelberg, 2003.
[45] Y. Oohama, “Strong converse exponent for degraded broadcast channels at rates outside the capacity region,” in Proc. of IEEE International Symposium on Information Theory, pp. 939–943, Hong Kong, China, June 2015.
[46] R. Ahlswede, “Coloring hypergraphs: A new approach to multi-user source coding, 1,” Journal of Combinatorics, Information & System Sciences, vol. 4, no. 1, 1979.
[47] A. Ingber and Y. Kochman, “The dispersion of lossy source coding,” in Proc. of the Data Compression Conference (DCC), pp. 53–62, Snowbird, UT, Mar. 2011.
[48] M. Raginsky and I. Sason, “Concentration of measure inequalities in information theory, communications, and coding,” Foundations and Trends in Communications and Information Theory, vol. 10, no. 1-2, pp. 1–250, 2014.
[49] A. D. Wyner, “On source coding with side information at the decoder,” IEEE Trans. Inf. Theory, vol. 21, pp. 294–300, May 1975.
[50] V. Y. F. Tan, “Asymptotic estimates in information theory with non-vanishing error probabilities,” Foundations and Trends in Communications and Information Theory, vol. 11, no. 1-2, pp. 1–184, 2014.
[51] R. M. Gray and A. D. Wyner, “Source coding for a simple network,” Bell System Technical Journal, vol. 53, no. 9, pp. 1681–1721, Nov. 1974.
[52] S. Watanabe, “Second-order region for Gray-Wyner network,” IEEE Trans. Inf. Theory, vol. 63, pp. 1006–1018, Feb. 2017.
[53] R. Renner and S. Wolf, “Simple and tight bounds for information reconciliation and privacy amplification,” in Advances in Cryptology-ASIACRYPT 2005, pp. 199–216, Springer, 2005.
[54] M. Hayashi, H. Tyagi, and S. Watanabe, “Secret key agreement: General capacity and second-order asymptotics,” in Proc. IEEE International Symposium on Information Theory, pp. 1136–1140, Honolulu, HI, June-July 2014.
[55] H. Tyagi and S. Watanabe, “A bound for multiparty secret key agreement and implications for a problem of secure computing,” in Advances in Cryptology-EUROCRYPT 2014, pp. 369–386, Springer, 2014.
[56] E. Mossel, R. O’Donnell, O. Regev, J. E. Steif, and B. Sudakov, “Non-interactive correlation distillation, inhomogeneous Markov chains, and the reverse Bonami-Beckner inequality,” Israel Journal of Mathematics, vol. 154, no. 1, 2006.
[57] C. Villani, Optimal Transport: Old and New, vol. 338. Springer Science & Business Media, 2008.
[58] M. Talagrand, “A new look at independence,” The Annals of Probability, vol. 24, no. 1, pp. 1–34, 1996.
[59] V. D. Milman, “New proof of the theorem of A. Dvoretzky on intersections of convex bodies,” Functional Analysis and Its Applications, vol. 5, no. 4, pp. 288–295, 1971.
[60] W. T. Gowers, “The two cultures of mathematics,” Mathematics: Frontiers and Perspectives (V. Arnold et al., eds.), pp. 65–78, 2000.
[61] G. A. Margulis, “Veroyatnostniye characteristiki grafov s bolshoy svyaznostyu [in Russian],” Problems of Information Transmission, vol. 10, no. 2, pp. 174–179, 1974.
[62] “Interview with Katalin Marton,” slides available online at http://media.itsoc.org/marton-interview.pdf.
[63] K. Marton, “A simple proof of the blowing-up lemma,” IEEE Trans. Inf. Theory, vol. 24, pp. 857–866, 1986.
[64] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
[65] S. Varadhan, Large Deviations and Applications. Society for Industrial and Applied Mathematics (SIAM), 1984.
[66] M. D. Donsker and S. R. S. Varadhan, “Asymptotic evaluation of certain Markov process expectations for large time, I,” Comm. Pure Appl. Math., vol. 28, pp. 1–47, 1975.
[67] P. D. Lax, Functional Analysis. John Wiley & Sons, Inc., 2002.
[68] N. Bourbaki, Integration (Chaps. I-IV, Actualites Scientifiques et Industrielles, no. 1175). Paris: Hermann, 1952.
[69] E. Nelson, “The free Markov field,” J. Funct. Anal., vol. 12, no. 2, pp. 211–227, 1973.
[70] L. Gross, “Logarithmic Sobolev inequalities,” American Journal of Mathematics, vol. 97, no. 4, pp. 1061–1083, 1975.
[71] J. Kahn, G. Kalai, and N. Linial, “The influence of variables on Boolean functions,” in 29th Annual Symposium on Foundations of Computer Science, pp. 68–80, 1988.
[72] E. Mossel, R. O’Donnell, and K. Oleszkiewicz, “Noise stability of functions with low influences: Invariance and optimality,” Annals of Mathematics, vol. 171, no. 1, pp. 295–341, 2010.
[73] A. Xu and M. Raginsky, “Converses for distributed estimation via strong data processing inequalities,” in Proc. of the 2015 IEEE International Symposium on Information Theory (ISIT), pp. 2376–2380, Hong Kong, China, July 2015.
[74] C. Nair, “Equivalent formulations of hypercontractivity using information measures,” IZS Workshop, Feb. 2014.
[75] E. A. Carlen, E. H. Lieb, and M. Loss, “A sharp analog of Young’s inequality on S^N and related entropy inequalities,” The Journal of Geometric Analysis, vol. 14, no. 3, pp. 487–520, 2004.
[76] E. A. Carlen and D. Cordero-Erausquin, “Subadditivity of the entropy and its relation to Brascamp–Lieb type inequalities,” Geometric and Functional Analysis, vol. 19, no. 2, pp. 373–405, 2009.
[77] E. Mossel, K. Oleszkiewicz, and A. Sen, “On reverse hypercontractivity,” Geometric and Functional Analysis, vol. 23, no. 3, pp. 1062–1097, 2013.
[78] S. Kamath, “Reverse hypercontractivity using information measures,” in Proc. of the 53rd Annual Allerton Conference on Communications, Control and Computing, pp. 627–633, UIUC, Illinois, Sept. 30-Oct. 2, 2015.
[79] S. Beigi and C. Nair, “Equivalent characterization of reverse Brascamp-Lieb type inequalities using information measures,” in Proc. of IEEE International Symposium on Information Theory, pp. 1038–1042, Barcelona, Spain, July 2016.
[80] J. Liu, T. A. Courtade, P. Cuff, and S. Verdu, “Brascamp-Lieb inequality and its reverse: An information theoretic view,” in Proceedings of 2016 IEEE International Symposium on Information Theory, pp. 1048–1052, Barcelona, Spain, July 2016.
[81] M. Ledoux, “Isoperimetry and Gaussian analysis,” in Lectures on Probability Theory and Statistics, pp. 165–294, Springer, 1996.
[82] S. G. Bobkov, I. Gentil, and M. Ledoux, “Hypercontractivity of Hamilton-Jacobi equations,” J. Math. Pures Appl., vol. 80, no. 7, pp. 669–696, 2001.
[83] H. J. Brascamp and E. H. Lieb, “Best constants in Young’s inequality, its converse, and its generalization to more than three functions,” Advances in Mathematics, vol. 20, no. 2, pp. 151–173, 1976.
[84] F. Barthe, “The Brunn-Minkowski theorem and related geometric and functional inequalities,” in International Congress of Mathematicians, vol. 2, pp. 1529–1546, 2006.
[85] F. Avram, N. Leonenko, and L. Sakhno, “Harmonic analysis tools for statistical inference in the spectral domain,” Dependence in Probability and Statistics, pp. 59–70, 2010.
[86] A. Garg, L. Gurvits, R. Oliveira, and A. Wigderson, “Algorithmic aspects of Brascamp-Lieb inequalities,” arXiv preprint arXiv:1607.06711, 2016.
[87] A. Dembo, T. M. Cover, and J. A. Thomas, “Information theoretic inequalities,” IEEE Trans. Inf. Theory, vol. 37, no. 6, pp. 1501–1518, Nov. 1991.
[88] F. Barthe, “On a reverse form of the Brascamp-Lieb inequality,” Inventiones mathematicae, vol. 134, no. 2, pp. 335–361, 1998.
[89] E. H. Lieb, “Gaussian kernels have only Gaussian maximizers,” Inventiones mathematicae, vol. 102, no. 1, pp. 179–208, 1990.
[90] F. Barthe, “Optimal Young’s inequality and its converse: a simple proof,” Geometric & Functional Analysis, vol. 8, no. 2, pp. 234–242, 1998.
[91] F. Barthe, D. Cordero-Erausquin, M. Ledoux, and B. Maurey, “Correlation and Brascamp-Lieb inequalities for Markov semigroups,” International Mathematics Research Notices, pp. 2177–2216, 2011.
[92] Y. Geng and C. Nair, “The capacity region of the two-receiver Gaussian vector broadcast channel with private and common messages,” IEEE Trans. Inf. Theory, vol. 60, no. 4, pp. 2087–2104, 2014.
[93] J. Liu, T. A. Courtade, P. Cuff, and S. Verdu, “Information theoretic perspectives on Brascamp-Lieb inequality and its reverse,” arXiv:1702.06260.
[94] R. T. Rockafellar, Convex Analysis. Princeton University Press, 2015.
[95] H. J. Brascamp and E. H. Lieb, “On extensions of the Brunn-Minkowski and Prekopa-Leindler theorems, including inequalities for log concave functions, and with an application to the diffusion equation,” J. Funct. Anal., vol. 22, no. 4, 1976.
[96] S. G. Bobkov and M. Ledoux, “From Brunn-Minkowski to Brascamp-Lieb and to logarithmic Sobolev inequalities,” Geom. Funct. Anal., vol. 10, no. 5, pp. 1028–1052, 2000.
[97] D. Cordero-Erausquin, “Transport inequalities for log-concave measures, quantitative forms and applications,” arXiv preprint arXiv:1504.06147, 2015.
[98] J. Bennett, A. Carbery, M. Christ, and T. Tao, “The Brascamp–Lieb inequalities: finiteness, structure and extremals,” Geometric and Functional Analysis, vol. 17, no. 5, pp. 1343–1415, 2008.
[99] R. Gardner, “The Brunn-Minkowski inequality,” Bulletin of the American Mathematical Society, vol. 39, no. 3, pp. 355–405, 2002.
[100] W. Beckner, “Inequalities in Fourier analysis on R^n,” Proceedings of the National Academy of Sciences, vol. 72, no. 2, pp. 638–641, 1975.
[101] J. Liu, P. Cuff, and S. Verdu, “Secret key generation with one communicator and a one-shot converse via hypercontractivity,” in Proceedings of 2015 IEEE International Symposium on Information Theory, pp. 710–714, Hong Kong, China, June 2015.
[102] J. Liu, P. Cuff, and S. Verdu, “Secret key generation with limited interaction,” IEEE Trans. Inf. Theory, accepted, 2017.
[103] J. Lehec, “Representation formula for the entropy and functional inequalities,” arXiv preprint arXiv:1006.3028, 2010.
[104] C. Villani, Topics in Optimal Transportation. No. 58, American Mathematical Soc., 2003.
[105] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications, vol. 38. Springer Science & Business Media, 2009.
[106] T. Tao, “245B, Notes 12: Continuous functions on locally compact Hausdorff spaces.” https://terrytao.wordpress.com/2009/03/02/245b-notes-12-continuous-functions-on-locally-compact-hausdorff-spaces/, Mar. 2, 2009.
[107] S. Mac Lane, Categories for the Working Mathematician. Springer Science and Business Media, 1978.
[108] A. Hatcher, Algebraic Topology. Cambridge University Press, 2002.
[109] Y. V. Prokhorov, “Convergence of random processes and limit theorems in probability theory,” Theory of Probability & Its Applications, vol. 1, no. 2, pp. 157–214, 1956.
[110] R. Atar and N. Merhav, “Information-theoretic applications of the logarithmic probability comparison bound,” arXiv preprint arXiv:1412.6769, 2014.
[111] J. Radhakrishnan, “Entropy and counting,” Computational Mathematics, Modelling and Algorithms, vol. 146, 2003.
[112] M. Madiman and P. Tetali, “Information inequalities for joint distributions with interpretations and applications,” IEEE Trans. Inf. Theory, vol. 56, no. 6, pp. 2699–2713, June 2010.
[113] S. G. Bobkov and F. Gotze, “Exponential integrability and transportation cost related to logarithmic Sobolev inequalities,” J. Funct. Anal., vol. 163, no. 1, pp. 1–28, 1999.
[114] E. Erkip and T. M. Cover, “The efficiency of investment information,” IEEE Trans. Inf. Theory, vol. 44, no. 3, pp. 1026–1040, Mar. 1998.
[115] T. A. Courtade, “Outer bounds for multiterminal source coding via a strong data processing inequality,” in Proceedings of 2013 IEEE International Symposium on Information Theory, pp. 559–563, Istanbul, Turkey, July 2013.
[116] Y. Polyanskiy and Y. Wu, “Dissipation of information in channels with input constraints,” IEEE Trans. Inf. Theory, vol. 62, no. 1, pp. 35–55, Jan. 2016.
[117] Y. Polyanskiy and Y. Wu, “A note on the strong data-processing inequalities in Bayesian networks,” http://arxiv.org/pdf/1508.06025v1.pdf.
[118] J. Liu, P. Cuff, and S. Verdu, “Key capacity for product sources with application to stationary Gaussian processes,” IEEE Trans. Inf. Theory, vol. 62, pp. 984–1005, Feb. 2016.
[119] S. Kamath and V. Anantharam, “On non-interactive simulation of joint distributions,” arXiv preprint arXiv:1505.00769, 2015.
[120] A. Ganor, G. Kol, and R. Raz, “Exponential separation of information and communication,” in 2014 IEEE 55th Annual Symposium on Foundations of Computer Science (FOCS), pp. 176–185.
[121] Z. Dvir and G. Hu, “Sylvester-Gallai for arrangements of subspaces,” arXiv:1412.0795, 2014.
[122] M. Braverman, A. Garg, T. Ma, H. L. Nguyen, and D. P. Woodruff, “Communication lower bounds for statistical estimation problems via a distributed data processing inequality,” arXiv preprint arXiv:1506.07216, 2015.
[123] M. Talagrand, “On Russo’s approximate zero-one law,” The Annals of Probability, pp. 1576–1587, 1994.
[124] E. Friedgut, G. Kalai, and A. Naor, “Boolean functions whose Fourier transform is concentrated on the first two levels,” Advances in Applied Mathematics, vol. 29, no. 3, pp. 427–437, 2002.
[125] J. Bourgain, “On the distribution of the Fourier spectrum of Boolean functions,” Israel Journal of Mathematics, vol. 131, no. 1, pp. 269–276, 2002.
[126] C. Garban, G. Pete, and O. Schramm, “The Fourier spectrum of critical percolation,” Acta Mathematica, vol. 205, no. 1, pp. 19–104, 2010.
[127] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, “Local privacy and statistical minimax rates,” in IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pp. 429–438, 2013.
[128] K. Dvijotham and E. Todorov, “A unifying framework for linearly solvable control,” in Proc. 27th Conf. on Uncertainty in Artificial Intelligence, pp. 1–8.
[129] R. Atar, K. Chowdhary, and P. Dupuis, “Robust bounds on risk-sensitive functionals via Renyi divergence,” SIAM J. Uncertainty Quant., vol. 3, pp. 18–33, 2015.
[130] T. van Erven and P. Harremoes, “Renyi divergence and Kullback-Leibler divergence,” IEEE Trans. Inf. Theory, vol. 60, pp. 3797–3820, July 2014.
[131] O. Shayevitz, “A note on a characterization of Renyi measures and its relation to composite hypothesis testing,” arXiv:1012.4401v1, 2010.
[132] I. Sason and S. Verdu, “f-divergence inequalities,” IEEE Trans. Inf. Theory, vol. 62, no. 11, pp. 5973–6006, Nov. 2016.
[133] I. Sason and S. Verdu, “Arimoto-Renyi conditional entropy and Bayesian m-ary hypothesis testing,” IEEE Trans. Inf. Theory, Jan. 2018.
[134] L. H. Loomis and H. Whitney, “An inequality related to the isoperimetric inequality,” Bull. Amer. Math. Soc., vol. 55, pp. 961–962, 1949.
[135] F. R. K. Chung, R. L. Graham, P. Frankl, and J. B. Shearer, “Some intersection theorems for ordered sets and graphs,” J. Combinatorial Theory, Series A, vol. 43, no. 1, pp. 23–37, 1986.
[136] S. T. Rachev, Probability Metrics and the Stability of Stochastic Models. John Wiley & Sons Ltd., Chichester, 1991.
[137] K. Marton, “A simple proof of the blowing-up lemma,” IEEE Trans. Inf. Theory, vol. 32, no. 3, pp. 445–446, 1986.
[138] K. Marton, “Bounding d-distance by informational divergence: a method to prove measure concentration,” The Annals of Probability, vol. 24, no. 2, pp. 857–866, 1996.
[139] R. Impagliazzo and D. Zuckerman, “How to recycle random bits,” in Proc. 1989 Symposium on Foundations of Computer Science (FOCS), pp. 248–253, 1989.
[140] C. H. Bennett, G. Brassard, C. Crepeau, and U. Maurer, “Generalized privacy amplification,” IEEE Trans. Inf. Theory, vol. 41, no. 6, pp. 1915–1923, 1995.
[141] J. Hastad, R. Impagliazzo, L. A. Levin, and M. G. Luby, “A pseudorandom generator from any one-way function,” SIAM Journal on Computing, vol. 28, no. 4, pp. 1364–1396, 1999.
[142] S. Fehr and S. Berens, “On the conditional Renyi entropy,” IEEE Trans. Inf. Theory, vol. 60, pp. 6801–6810, Nov. 2014.
[143] J. Liu, P. Cuff, and S. Verdu, “Eγ-resolvability,” IEEE Trans. Inf. Theory, vol. 63, no. 5, pp. 2629–2658, May 2017.
[144] J. Liu, T. A. Courtade, P. Cuff, and S. Verdu, “Smoothing Brascamp-Lieb inequalities and strong converses for common randomness generation (extended),” http://www.princeton.edu/~jingbo/preprints/ISITsmoothBL2016.pdf.
[145] J. Liu, T. A. Courtade, P. Cuff, and S. Verdu, “Smoothing Brascamp-Lieb inequalities and strong converses for common randomness generation,” in Proc. of IEEE International Symposium on Information Theory, pp. 1043–1047, Barcelona, Spain, July 2016.
[146] Y. Polyanskiy, Channel Coding: Non-asymptotic Fundamental Limits. PhD thesis, Princeton University, Princeton, NJ, 2010.
[147] M. Ledoux and M. Talagrand, Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.
[148] C. Borell, “Positivity improving operators and hypercontractivity,” Mathematische Zeitschrift, vol. 180, no. 3, pp. 225–234, 1982.
[149] Y. Polyanskiy and S. Verdu, “Empirical distribution of good channel codes with nonvanishing error probability,” IEEE Trans. Inf. Theory, vol. 60, no. 1, pp. 5–21, 2014.
[150] J. Liu, P. Cuff, and S. Verdu, “On α-decodability and α-likelihood decoder,” in 55th Annual Allerton Conference on Communication, Control, and Computing, Monticello, Illinois, Oct. 2017.
[151] D. Guo, S. Shamai, and S. Verdu, “Mutual information and minimum mean-square error in Gaussian channels,” IEEE Trans. Inf. Theory, vol. 51, no. 4, pp. 1261–1282, 2005.
[152] S. L. Fong and V. Y. Tan, “A proof of the strong converse theorem for Gaussian broadcast channels via the Gaussian Poincare inequality,” arXiv preprint arXiv:1509.01380, 2015.
[153] P. Caputo, G. Menz, and P. Tetali, “Approximate tensorization of entropy at high temperature,” arXiv:1405.0608v3, 2015.
[154] H. Tyagi and P. Narayan, “The Gelfand-Pinsker channel: Strong converse and upper bound for the reliability function,” in Proc. of IEEE International Symposium on Information Theory, pp. 1954–1958, Seoul, Korea, June-July 2009.
[155] T. Han and S. Verdu, “Approximation theory of output statistics,” IEEE Trans. Inf. Theory, vol. 39, no. 3, pp. 752–772, May 1993.
[156] A. El Gamal and E. C. van der Meulen, “A proof of Marton’s coding theorem for the discrete memoryless broadcast channel,” IEEE Trans. Inf. Theory, vol. 27, no. 1, pp. 120–122, Jan. 1981.
[157] Y. Polyanskiy, private communication, June 2014.
[158] I. Csiszar, “Information-type measures of difference of probability distributions and indirect observation,” Studia Sci. Math. Hungar., vol. 2, pp. 229–318, 1967.
[159] P. Cuff, “Distributed channel synthesis,” IEEE Trans. Inf. Theory, vol. 59, no. 11, pp. 7071–7096, Nov. 2013.
[160] E. C. Song, P. Cuff, and H. V. Poor, “The likelihood encoder for lossy compression,” IEEE Trans. Inf. Theory, vol. 62, no. 4, pp. 1836–1849, Apr. 2016.
[161] S. Verdu, “Non-asymptotic achievability bounds in multiuser information theory,” in 50th Annual Allerton Conference on Communication, Control, and Computing, pp. 1–8, 2012.
[162] S. Verdu, “Total variation distance and the distribution of relative information,” in Workshop on Information Theory and Applications, UCSD, Feb. 10-15, 2014.
[163] A. Renyi, “On measures of entropy and information,” in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 547–561, 1961.
[164] N. A. Warsi, “One-shot bounds for various information theoretic problems using smooth min and max Renyi divergences,” in Proceedings of 2013 Information Theory Workshop (ITW), pp. 1–5, Sept. 2013.
[165] L. Wang, R. Colbeck, and R. Renner, “Simple channel coding bounds,” in Proceedings of 2009 IEEE International Symposium on Information Theory, pp. 1804–1808, Seoul, South Korea, June-July 2009.
[166] M. M. Wilde, Quantum Information Theory. Cambridge University Press, 2013.
[167] M. S. Pinsker, Information and Information Stability of Random Variables and Processes. San Francisco: Holden-Day, 1964; originally published in Russian in 1960.
[168] P. Cuff, “Soft covering with high probability,” in Proceedings of 2016 IEEE International Symposium on Information Theory, pp. 2963–2967, Barcelona, Spain, July 2016.
[169] Y. Steinberg and S. Verdu, “Simulation of random processes and rate-distortion theory,” IEEE Trans. Inf. Theory, vol. 42, no. 1, pp. 63–86, 1996.
[170] I. Csiszar, “Almost independence and secrecy capacity,” Problems Inf. Transmission, vol. 32, no. 1, pp. 40–47, 1996.
[171] M. Hayashi, “General nonasymptotic and asymptotic formulas in channel resolvability and identification capacity and their application to the wiretap channel,” IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1562–1575, Apr. 2006.
[172] M. Bloch and N. Laneman, “Strong secrecy from channel resolvability,” IEEE Trans. Inf. Theory, vol. 59, no. 12, pp. 8077–8098, Dec. 2013.
[173] T. S. Han, H. Endo, and M. Sasaki, “Reliability and security functions of the wiretap channel under cost constraint,” IEEE Trans. Inf. Theory, vol. 60, no. 11, pp. 6819–6843, 2014.
[174] Y. Steinberg and S. Verdu, “Channel simulation and coding with side information,” IEEE Trans. Inf. Theory, vol. 40, no. 3, pp. 634–646, May 1994.
[175] S. Shamai and S. Verdu, “The empirical distribution of good codes,” IEEE Trans. Inf. Theory, vol. 43, no. 3, pp. 836–846, 1997.
[176] S. Watanabe and M. Hayashi, “Strong converse and second-order asymptotics of channel resolvability,” in Proceedings of 2014 IEEE International Symposium on Information Theory, pp. 1882–1886, Honolulu, HI, USA, July 2014.
[177] Z. Goldfeld, P. Cuff, and H. H. Permuter, “Semantic-security capacity for wiretap channels of type II,” in Proceedings of 2016 IEEE International Symposium on Information Theory, pp. 2799–2803, Barcelona, Spain, July 2016.
[178] M. Tahmasbi and M. R. Bloch, “Second-order asymptotics of covert communications over noisy channels,” in Proceedings of 2016 IEEE International Symposium on Information Theory, pp. 2224–2228, Barcelona, Spain, July 2016.
[179] S. Verdu and T. S. Han, “A general formula for channel capacity,” IEEE Trans. Inf. Theory, vol. 40, no. 4, pp. 1147–1157, 1994.
[180] A. Orlitsky and J. R. Roche, “Coding for computing,” IEEE Trans. Inf. Theory, vol. 47, no. 3, pp. 903–917, Mar. 2001.
[181] J. Liu, J. Jin, and Y. Gu, “Robustness of sparse recovery via f-minimization: A topological viewpoint,” IEEE Trans. Inf. Theory, vol. 61, pp. 3996–4014, July 2015.
[182] J. K. Omura, “A lower bounding method for channel and source coding probabilities,” Information and Control, vol. 27, no. 2, pp. 148–177, 1975.
[183] Y. Liang and G. Kramer, “Rate regions for relay broadcast channels,” IEEE Trans. Inf. Theory, vol. 53, pp. 3517–3535, Oct. 2007.
[184] S. I. Gel’fand and M. S. Pinsker, “Capacity of a broadcast channel with one deterministic component,” Problems Inf. Transmission.
[185] Y. Liang, G. Kramer, and H. V. Poor, “On the equivalence of two achievable regions for the broadcast channel,” IEEE Trans. Inf. Theory, vol. 57, pp. 95–100, Jan. 2011.
[186] J. Liu, P. Cuff, and S. Verdu, “One-shot mutual covering lemma and Marton’sinner bound with a common message,” in Proceedings of 2015 IEEE Interna-tional Symposium on Information Theory, pp. 1457–1461, Hong Kong, China,June 2015.
[187] S. Verdu, “Non-asymptotic covering lemmas,” in Proc. 2015 IEEE InformationTheory Workshop, Jerusalem, Israel, April 26-May 1, 2015.
[188] M. H. Yassaee, M. R. Aref, and A. Gohari, “A technique for deriving one-shot achievability results in network information theory,” in Proceedings ofIEEE International Symposium on Information Theory, pp. 1287–1291, Istan-bul, Turkey, Jul. 2013.
[189] M. H. Yassaee, “One-shot achievability via fidelity,” in Proceedings of 2015IEEE International Symposium on Information Theory, pp. 301–305, HongKong, China, June 2015.
[190] M. H. Yassaee, J. Liu, and S. Verdu, “One-shot multivariate covering lemmasvia weighted sum and concentration inequalities,” in Proceedings of the 2017IEEE International Symposium on Information Theory, pp. 669–673, June 25–30, 2017.
[191] J. Liu, P. Cuff, and S. Verdu, “Key capacity with limited one-way communica-tion for product sources,” in Proceedings of 2014 IEEE International Symposiumon Information Theory, pp. 1146–1150, Honolulu, HI, USA, July 2014.
[192] A. D. Wyner, “The wire-tap channel,” The Bell System Technical Journal,vol. 54, no. 8, pp. 1355–1387, 1975.
[193] J. Hou and G. Kramer, “Effective secrecy: Reliability, confusion and stealth,”in Proceedings of 2014 IEEE International Symposium on Information Theory,pp. 601–605, Honolulu, HI, USA, July 2014.
[194] M. Bloch, “A channel resolvability perspective on stealth communications,” inProceedings of 2015 IEEE International Symposium on Information Theory,pp. 601–605, Hong Kong, June 2015.
[195] T. Tao, “Amplification, arbitrage, and the tensor power trick,”https://terrytao.wordpress.com/2007/09/05/amplification-arbitrage-and-the-tensor-power-trick/.
[196] T. Tao, “The Brunn-Minkowski inequality for nilpotent groups,” [On-line]. Available: https: // terrytao. wordpress. com/ 2011/ 09/ 16/
the-brunn-minkowski-inequality-for-nilpotent-groups/ , Sept. 16,2011.
[197] E. A. Carlen, “Superadditivity of Fisher’s information and logarithmic Sobolev inequalities,” Journal of Functional Analysis, vol. 101, no. 1, pp. 194–211, 1991.
[198] H. Cramér, “Über eine Eigenschaft der normalen Verteilungsfunktion,” Mathematische Zeitschrift, vol. 41, no. 1, pp. 405–414, 1936.
[199] C. Nair, “An extremal inequality related to hypercontractivity of Gaussian random variables,” ITA 2014, 2014.
[200] T. Courtade and J. Jiao, “An extremal inequality for long Markov chains,” arXiv preprint arXiv:1404.6984, 2014.
[201] T. A. Courtade, “Strengthening the entropy power inequality,” in Proc. of IEEE International Symposium on Information Theory, pp. 2294–2298, Barcelona, Spain, July 2016.
[202] Y. Polyanskiy and Y. Wu, “Lecture notes on information theory.” MIT (6.441), UIUC (ECE 563), 2012–2014. [Online]. Available: http://people.lids.mit.edu/yp/homepage/data/itlectures_v3.pdf.
[203] S. G. Bobkov and G. P. Chistyakov, “Entropy power inequality for the Rényi entropy,” IEEE Trans. Inf. Theory, vol. 61, no. 2, pp. 708–714, Feb. 2015.
[204] Y. Wu and S. Verdú, “Functional properties of minimum mean-square error and mutual information,” IEEE Trans. Inf. Theory, vol. 58, pp. 1289–1301, Mar. 2012.
[205] M. Godavarti and A. Hero, “Convergence of differential entropies,” IEEE Trans. Inf. Theory, vol. 50, no. 1, pp. 171–176, Jan. 2004.
[206] K. Ball, “Volumes of sections of cubes and related problems,” in Geometric aspects of functional analysis, pp. 251–260, Springer, 1989.
[207] M. Talagrand, “Transportation cost for Gaussian and other product measures,” Geometric & Functional Analysis, vol. 6, no. 3, pp. 587–600, 1996.
[208] G. Xu, W. Liu, and B. Chen, “Wyner’s common information: Generalizations and a new lossy source coding interpretation,” arXiv preprint arXiv:1301.2237, 2013.
[209] M. D. Perlman, “Jensen’s inequality for a convex vector-valued function on an infinite-dimensional space,” Journal of Multivariate Analysis, vol. 4, no. 1, pp. 52–65, 1974.
[210] I. Csiszár and P. Narayan, “Secrecy capacities for multiple terminals,” IEEE Trans. Inf. Theory, vol. 50, no. 12, pp. 3047–3061, Dec. 2004.
[211] C. Chan and L. Zheng, “Mutual dependence for secret key agreement,” in Proc. 44th Annual Conference on Information Sciences and Systems (CISS), pp. 1–6, Mar. 2010.
[212] W. Yang, R. F. Schaefer, and H. V. Poor, “Finite-blocklength bounds for wiretap channels,” in Proceedings of the 2016 IEEE International Symposium on Information Theory, pp. 3087–3091, July 2016.