Information Theory from A
Functional Viewpoint
Jingbo Liu
A Dissertation
Presented to the Faculty
of Princeton University
in Candidacy for the Degree
of Doctor of Philosophy
Recommended for Acceptance
by the Department of
Electrical Engineering
Advisers: Paul Cuff and Sergio Verdu
January 2018
© Copyright by Jingbo Liu, 2017.
All rights reserved.
Abstract
A perennial theme of information theory is to find new methods to determine the
fundamental limits of various communication systems, which potentially helps
engineers find better designs by eliminating the deficient ones. Traditional meth-
ods have focused on the notion of “sets”: the method of types concerns the cardinality
of subsets of the typical sets; the blowing-up lemma bounds the probability of the
neighborhood of decoding sets; the single-shot (information-spectrum) approach uses
the likelihood threshold to define sets. This thesis promotes the idea of deriving the
fundamental limits using functional inequalities, where the central notion is “func-
tions” instead of “sets”. A functional inequality follows from the entropic definition
of an information measure by convex duality. For example, the Gibbs variational
formula follows from the Legendre transform of the relative entropy.
As a first example, we propose a new methodology of deriving converse (i.e. im-
possibility) bounds based on convex duality and the reverse hypercontractivity of
Markov semigroups. This methodology is broadly applicable to network information
theory, and in particular resolves the optimal scaling of the second-order rate for
the previously open “side-information problems”. As a second example, we use the
functional inequality for the so-called Eγ metric to prove non-asymptotic achievability
(i.e. existence) bounds for several problems including source coding, wiretap channels
and mutual covering.
Along the way, we derive general convex duality results leading to a unified treatment
of many inequalities and information measures, such as the Brascamp-Lieb inequality
and its reverse, the strong data processing inequality, hypercontractivity and its
reverse, transportation-cost inequalities, and Renyi divergences. Capitalizing on such
dualities, we demonstrate information-theoretic approaches to certain properties of
functional inequalities, such as the Gaussian optimality. This is the antithesis of the
main thesis (functional approaches to information theory).
Acknowledgements
First, I thank my advisors Professor Paul Cuff and Professor Sergio Verdu for their
unceasing guidance and support over the past five years. I was new to information
theory at the start of my PhD, and Professor Cuff was generous with his time in
helping me get started in this field. The puzzle meetings with Professor Cuff
and other students were a great source of diversion and inspiration during our graduate
life. Professor Verdu demonstrated to us the beauty of research and the charisma of
teaching. He encouraged us to delve deeply into serious research topics, revise papers
carefully, and work hard even on holidays, by doing so himself.
Thanks to Professors Abbe, Braverman, Cuff, and Verdu for serving on my committee;
Professors Chen, Poor, and Verdu for being the readers; and the administrative staff
of ELE for their assistance over the years. Thanks to all faculty members of Princeton
who have taught me classes; what I learnt from them is reflected in every aspect of this
thesis. I am indebted to Princeton University for providing a wonderful environment
for studying and research.
I owe sincere gratitude to my mentors and collaborators: Sanket Satpathy, Sudeep
Kamath, and Wei Yang for numerous illuminating discussions; Professor Emmanuel Abbe
for letting me catch a glimpse of working on coding theory; Professor Thomas Cour-
tade for showing me how to prove Gaussian optimality; Professor Ramon van Handel
for advising me to look into semigroup operators, as opposed to the maximal operators
which I was previously stuck with.
I am thankful to all my friends at Princeton, my colleagues from the information
sciences and systems group, especially my roommates, Sen Tao, Lanqing Yu and Amir
Reza Asadi, for helping me out with many things in life.
Last but not least, I thank my parents for everything, and in particular for
providing me with a place where a large part of this thesis was written.
To my parents.
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
1 Introduction 1
1.1 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Information Measures and Operational Problems . . . . . . . 1
1.1.2 Channel Coding Revisited . . . . . . . . . . . . . . . . . . . . 3
1.1.3 More on Converses in Information Theory . . . . . . . . . . . 7
1.1.4 The Side Information Problems . . . . . . . . . . . . . . . . . 10
1.1.5 Common Randomness Generation . . . . . . . . . . . . . . . . 11
1.2 Concentration of Measure . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2.1 The Gromov-Milman Formulation . . . . . . . . . . . . . . . . 13
1.2.2 Is It Natural for Information Theorists? . . . . . . . . . . . . . 15
1.3 Functional Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3.1 In Large Deviation Theory . . . . . . . . . . . . . . . . . . . . 16
1.3.2 Hypercontractivity and Its Reverse . . . . . . . . . . . . . . . 17
1.3.3 Brascamp-Lieb Inequality . . . . . . . . . . . . . . . . . . . . 19
1.4 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 21
2 A Tale of Two Optimizations 23
2.1 The Duality between Functions and Measures . . . . . . . . . . . . . 24
2.2 Dual Formulations of the Forward Brascamp-Lieb Inequality . . . . . 26
2.3 Application to Common Randomness Generation . . . . . . . . . . . 33
2.3.1 The One Communicator Problem: the Setup . . . . . . . . . . 33
2.3.2 A Zero-Rate One-Shot Converse for the Omniscient Helper
Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4 Dual Formulation of a Forward-Reverse Brascamp-Lieb Inequality . . 42
2.4.1 Some Mathematical Preliminaries . . . . . . . . . . . . . . . . 43
2.4.2 Compact X . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.4.3 Noncompact X . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.4.4 Extension to General Convex Functionals . . . . . . . . . . . . 59
2.5 Some Special Cases of the Forward-Reverse Brascamp-Lieb Inequality 61
2.5.1 Variational Formula of Renyi Divergence . . . . . . . . . . . . 62
2.5.2 Strong Data Processing Constant . . . . . . . . . . . . . . . . 64
2.5.3 Loomis-Whitney Inequality and Shearer’s Lemma . . . . . . . 66
2.5.4 Hypercontractivity . . . . . . . . . . . . . . . . . . . . . . . . 68
2.5.5 Reverse Hypercontractivity (Positive Parameters) . . . . . . . 69
2.5.6 Reverse Hypercontractivity (One Negative Parameter) . . . . 70
2.5.7 Transportation-Cost Inequalities . . . . . . . . . . . . . . . . . 71
3 Smoothing 74
3.1 The BL Divergence and Its Smooth Version . . . . . . . . . . . . . . 74
3.2 Upper Bounds on dδ . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.2.1 Discrete Case . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.2.2 Gaussian Case . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.2.3 Continuous Density Case . . . . . . . . . . . . . . . . . . . . . 90
3.3 Application to Common Randomness Generation . . . . . . . . . . . 91
3.3.1 Second-order Converse for the Omniscient Helper Problem . . 91
4 The Pump 96
4.1 Prelude: Binary Hypothesis Testing . . . . . . . . . . . . . . . . . . . 97
4.1.1 Review of the Blowing-Up Method, and Its Second-order Sub-
optimality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.1.2 The Fundamental Sub-optimality of the Set-Inequality Ap-
proaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.1.3 The New Pumping-up Approach . . . . . . . . . . . . . . . . . 102
4.2 Optimal Second-order Fano’s Inequality . . . . . . . . . . . . . . . . . 107
4.2.1 Bounded Probability Density Case . . . . . . . . . . . . . . . 107
4.2.2 Gaussian Case . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.3 Optimal Second-order Change-of-Measure . . . . . . . . . . . . . . . 114
4.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.4.1 Output Distribution of Good Channel Codes . . . . . . . . . . 124
4.4.2 Broadcast Channels . . . . . . . . . . . . . . . . . . . . . . . . 125
4.4.3 Source Coding with Compressed Side Information . . . . . . . 126
4.4.4 One-communicator CR Generation . . . . . . . . . . . . . . . 131
4.5 About Extensions to Processes with Weak Correlation . . . . . . . . 136
5 More Dualities 137
5.1 Air on g* . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.1.1 Review of the Image-size Problem . . . . . . . . . . . . . . . . 138
5.1.2 g* and the Optimal Second-order Image Size Lemmas . . . . . 141
5.1.3 Relationship between g and g* . . . . . . . . . . . . . . . . . . 151
5.2 Variations on Eγ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
5.2.1 Eγ: Definition and Dual Formulation . . . . . . . . . . . . . . 155
5.2.2 The Relationship of Eγ with Excess Information and Smooth
Renyi Divergence . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.2.3 Eγ-Resolvability . . . . . . . . . . . . . . . . . . . . . . . . . . 170
5.2.4 Application to Lossy Source Coding . . . . . . . . . . . . . . . 198
5.2.5 Application to Mutual Covering Lemma . . . . . . . . . . . . 202
5.2.6 Application to Wiretap Channels . . . . . . . . . . . . . . . . 207
6 The Counterpoint 225
6.1 Data Processing, Tensorization and Convexity . . . . . . . . . . . . . 226
6.1.1 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 226
6.1.2 Tensorization . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.1.3 Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
6.2 Gaussian Optimality in the Forward Brascamp-Lieb Inequality . . . . 229
6.2.1 Optimization of Conditional Differential Entropies . . . . . . . 232
6.2.2 Optimization of Mutual Informations . . . . . . . . . . . . . . 244
6.2.3 Optimization of Differential Entropies . . . . . . . . . . . . . . 246
6.3 Gaussian Optimality in the Forward-Reverse Brascamp-Lieb Inequality 247
6.4 Consequences of the Gaussian Optimality . . . . . . . . . . . . . . . . 258
6.4.1 Brascamp-Lieb Inequality with Gaussian Random transforma-
tions: an Information-Theoretic Proof . . . . . . . . . . . . . . 258
6.4.2 Multi-variate Gaussian Hypercontractivity . . . . . . . . . . . 260
6.4.3 T2 Inequality for Gaussian Measures . . . . . . . . . . . . . . 263
6.4.4 Wyner’s Common Information for m Dependent Random Vari-
ables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
6.4.5 Key Generation with an Omniscient Helper . . . . . . . . . . 267
6.5 Appendix: Existence of Maximizer in Theorem 6.2.1 . . . . . . . . . . 270
7 Conclusion and Future Work 276
Bibliography 278
List of Tables
1.1 Comparison of previous converse proof techniques . . . . . . . . . . . 10
4.1 Comparison of the basic ideas in BUL and the new methodology . . . 107
List of Figures
1.1 CR generation with one-communicator . . . . . . . . . . . . . . . . . 12
2.1 Diagrams for Theorem 2.4.5. . . . . . . . . . . . . . . . . . . . . . . . 51
2.2 The “partial relations” between the inequalities discussed in this chap-
ter with respect to implication. . . . . . . . . . . . . . . . . . . . . . 61
2.3 Diagram for hypercontractivity (HC) . . . . . . . . . . . . . . . . . . 68
2.4 Diagram for reverse HC . . . . . . . . . . . . . . . . . . . . . . . . . 69
2.5 Diagram for reverse HC with one negative parameter . . . . . . . . . 70
4.1 Schematic comparison of 1_A, 1_{A^{nt}}, and T_t^{⊗n} 1_A, where 1_A is the
indicator function of a Hamming ball. . . . . . . . . . . . . . . . . . . 106
4.2 Illustration of the action of T_{x^n, t}. The original function (an indicator
function) is convolved with a Gaussian measure and then dilated (with
center x^n). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.3 A set A ⊆ Y and its “preimage” under Q_{Y|X} in X . . . . . . . . . . 114
4.4 Source coding with compressed side information . . . . . . . . . . . . 126
5.1 The tradeoff between the sizes of the two images. . . . . . . . . . . . 139
5.2 E_γ(P‖Q) as a function of γ, where P = Ber(0.1) and Q = Ber(0.5). . . 157
5.3 Setup for channel resolvability. . . . . . . . . . . . . . . . . . . . . . . 173
5.4 The wiretap channel . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
6.1 Key generation with an omniscient helper . . . . . . . . . . . . . . . 267
Chapter 1
Introduction
1.1 Information Theory
1.1.1 Information Measures and Operational Problems
The scope of a research field is not only defined by its very name, but also shaped
by the people who left their mark on it. Shannon’s ground-breaking 1948 paper [1]
characterized the capacity of transmitting messages through a channel in terms of
the maximum mutual information. Since then, information theory has been closely
associated with communication engineering, even though the ideas and the analytical
tools from information theory have found applications in many other research fields,
such as physics, statistics, computer science, networking, control theory and financial
engineering [2][3][4].
Roughly speaking, the research by Shannon and his successors may be split into
the following two categories:
• Information measures. Just as physicists are intrigued by the notion of a space,
information theorists are perpetually obsessed with finding a good notion of an
information measure, which is a quantity defined for a probability distribution
that characterizes, say, the complexity of a random object, or the correlation
between random objects. Examples of information measures include the relative
entropy, mutual information, common information [5][6], the strong data pro-
cessing coefficient [7][8][9], and the maximal correlation coefficient [10][11][12].
A good definition of an information measure entails certain mathematical con-
veniences while also answering interesting operational questions.
• Operational problems. Once a practical system, say from communication engi-
neering, is formulated as an abstract model, it then becomes a well-posed math-
ematical question to determine whether an operational task (e.g. data transmis-
sion) can be performed for a given parameter range of the model. The critical
value of the parameter is called the fundamental limit, which the information
theorists seek to express or bound in terms of the information measures. Ex-
amples of operational problems in communication engineering include channel
coding [1], source coding [13], randomness generation [14][15][16], their gener-
alizations to network settings [17][18], or their variants [19][20][21][22][23][24].
Research in the above two categories is not independent. Firstly, knowing the
properties of information measures certainly helps in computing the fundamental limits
of operational problems. Secondly, the study of operational problems often serves
as a source of motivation and ideas for proposing new information measures, or for the
proofs of information-theoretic inequalities.
This thesis investigates the functional representations¹ (sometimes called the
variational formulae) of various information measures, ubiquitous in information theory,
and their implications on the fundamental limits of operational problems. The most
basic example of such a functional representation is the Gibbs variational formula
¹In the mathematical literature a “functional” usually refers to a mapping from a vector space to the reals. In the scope of this thesis, this vector space is the set of all test functions (e.g. bounded continuous real-valued functions) on an alphabet.
(also known as the Donsker-Varadhan formula) of the relative entropy,

    D(P‖Q) = sup_f { ∫ f dP − log ∫ exp(f) dQ },    (1.1)
where P and Q are probability distributions and f is a real-valued measurable func-
tion, all on the same measurable space. As simple and well-known as (1.1) is, it
may be surprising that a new, yet powerful and canonical methodology of attack-
ing a wide range of operational problems in communication engineering lies beneath
these functional representations. A main objective of this thesis is to introduce this
new methodology, and apply it to certain problems regarded as challenging or open
by some researchers in the community. As another example, the Eγ metric to be
introduced in Section 5.2 has the following functional representation:
    E_γ(π‖Q) = sup_{f : 0 ≤ f ≤ 1} { ∫ f dπ − γ ∫ f dQ }.    (1.2)
While there are other ways of deriving (1.2), a very illuminating viewpoint, which
also matches the philosophy of this thesis, is to view E_γ(·‖Q) as the Legendre-Fenchel
dual of the convex functional

    f ↦ { γ ∫ f dQ,  if 0 ≤ f ≤ 1 a.e.;  +∞,  otherwise. }    (1.3)
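These two variational formulas are easy to check numerically on a finite alphabet. The sketch below (with made-up distributions, not from the thesis) verifies that the supremum in (1.1) is attained at f = log(dP/dQ), and that the optimizer in (1.2) is the indicator function of the set {P > γQ}:

```python
import math
import random

random.seed(0)
P = [0.5, 0.3, 0.2]  # illustrative distributions on a 3-letter alphabet
Q = [0.2, 0.3, 0.5]

# Relative entropy D(P||Q) from its entropic definition
D = sum(p * math.log(p / q) for p, q in zip(P, Q))

def gibbs(f):
    # Gibbs/Donsker-Varadhan functional of (1.1): int f dP - log int exp(f) dQ
    return sum(p * fi for p, fi in zip(P, f)) - math.log(
        sum(q * math.exp(fi) for q, fi in zip(Q, f)))

f_star = [math.log(p / q) for p, q in zip(P, Q)]  # optimizer of (1.1)
assert abs(gibbs(f_star) - D) < 1e-12
for _ in range(100):  # any other test function gives a smaller value
    assert gibbs([random.gauss(0, 1) for _ in range(3)]) <= D + 1e-12

# E_gamma of (1.2): the optimal 0 <= f <= 1 is the indicator of {P > gamma*Q},
# so E_gamma(P||Q) = sum_x max(P(x) - gamma*Q(x), 0)
gamma = 1.5
E_gamma = sum(max(p - gamma * q, 0.0) for p, q in zip(P, Q))
f_opt = [1.0 if p > gamma * q else 0.0 for p, q in zip(P, Q)]
val = sum(fi * (p - gamma * q) for fi, p, q in zip(f_opt, P, Q))
assert abs(val - E_gamma) < 1e-12
```

The convex-duality viewpoint is visible directly here: perturbing f away from the optimizers can only decrease the objectives.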
1.1.2 Channel Coding Revisited
To get a glimpse of the above ideas through a concrete example, let us recall the data
transmission (or channel coding) problem considered by Shannon [1], which consists
of the following (in the single-shot formulation):
• The input and output alphabets X and Y, which represent the sets of possible
values of the input and output signals for a given communication channel. The
communication channel is modeled as a random transformation (or transition
probability) P_{Y|X}.
• A set of messages M to be transmitted through the channel. The semantic meaning
of the messages is irrelevant to the solution of the operational problem, so it is
without loss of generality to consider M = {1, ..., M} for some integer M.
• The encoder is a mapping from M to X, which can be represented by a codebook
consisting of codewords c_1, ..., c_M ∈ X.
• The decoder is a rule for reconstructing a message m ∈ M based on the
observation y ∈ Y. A deterministic decoder can be specified by disjoint sets
D_1, ..., D_M ⊆ Y. A stochastic decoder can be represented by nonnegative
functions f_1, ..., f_M, where f_m : Y → [0, 1] gives the probability of declaring
message m upon observing y; hence ∑_{m=1}^{M} f_m = 1 (i.e. (f_m)_{m=1}^{M}
is a partition of unity).
The goal is to find the tradeoff between M and the maximum error probability²

    ε := 1 − min_{1 ≤ m ≤ M} P_{Y|X=c_m}[D_m].    (1.4)
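As a toy numerical illustration (not from the thesis), the sketch below evaluates (1.4) for a 3-fold repetition code over the product of three binary symmetric channels, with the deterministic decoding sets D_1, D_2 given by majority vote:

```python
from itertools import product

p = 0.1  # BSC crossover probability (illustrative value)
codebook = {1: (0, 0, 0), 2: (1, 1, 1)}  # M = 2 messages, X = {0,1}^3

def decode(y):
    # Majority vote; the decoding sets D_1, D_2 partition Y = {0,1}^3
    return 1 if sum(y) <= 1 else 2

def channel_prob(y, x):
    # P_{Y|X=x}(y) for the 3-fold product of BSC(p)
    prob = 1.0
    for yi, xi in zip(y, x):
        prob *= (1 - p) if yi == xi else p
    return prob

# Maximum error probability of (1.4): eps = 1 - min_m P_{Y|X=c_m}[D_m]
eps = 1 - min(
    sum(channel_prob(y, c) for y in product((0, 1), repeat=3) if decode(y) == m)
    for m, c in codebook.items()
)
# For the repetition code this equals 3 p^2 (1-p) + p^3
assert abs(eps - (3 * p**2 * (1 - p) + p**3)) < 1e-12
```

By symmetry both codewords have the same conditional error probability here, so the maximum and average error criteria coincide for this toy code.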
Of course, in general such a problem does not have a simple explicit answer. Shannon
considered a special case called the discrete stationary memoryless channel, where
X and Y are finite sets and the channel can be used repeatedly n times; in other
²To streamline the presentation we consider the maximum error probability here rather than the more popular average error probability. By a simple expurgation argument (see e.g. [25]) it is known that the choice between the two error probability criteria is irrelevant for the fundamental limit in an asymptotic sense.
words, one considers³

    X ← X^n,    (1.5)
    Y ← Y^n,    (1.6)
    P_{Y|X} ← P_{Y|X}^{⊗n}.    (1.7)
Then it is known [1][26] that, for any fixed ε ∈ (0, 1), one can send at most⁴

    M = exp(n(C + o_ε(1)))    (1.8)
messages as n → ∞, where

    C := max_{P_X} I(X; Y)    (1.9)
is the maximum mutual information over all input distributions. To establish such
a result, one needs to prove lower and upper bounds on M , which are called the
achievability and the converse, respectively. For the converse part, the classical Fano’s
inequality claims that

    I(X^n; Y^n) ≥ (1 − ε) log M − h(ε),    (1.10)

where X^n equiprobably takes the values c_1, ..., c_M, and Y^n is the corresponding
output. Information theorists call (1.10) a weak converse because it only shows
that lim_{ε↓0} lim_{n→∞} (1/n) log M ≤ C for the optimal codebook. In contrast, the classical
blowing-up lemma argument [27] (see Section 4.1.1 for an introduction) strengthens
³Conventionally, X^n := X × ⋯ × X denotes the Cartesian product. The product channel (tensor product) P_{Y|X}^{⊗n} is defined by P_{Y|X}^{⊗n}(y^n | x^n) := ∏_{i=1}^{n} P_{Y|X}(y_i | x_i) for any x^n ∈ X^n and y^n ∈ Y^n.
⁴The notation φ(ε, n) = o_ε(f(n)) means that lim_{n→∞} φ(ε, n)/f(n) = 0 for any ε > 0.
(1.10) to
    I(X^n; Y^n) ≥ log M − o_ε(√(n log n)).    (1.11)
Note that (1.11) implies

    M ≤ exp(n(C + o_ε(1))),    (1.12)

which is called a strong converse. However, the o_ε(1) term in (1.12) does not yet
capture the correct order of the second-order term. In this thesis we propose a new
converse method, further strengthening (1.12) to

    I(X^n; Y^n) ≥ log M − O(√(n log(1/(1−ε)))),    (1.13)

which, as we will see, is sharp both as ε ↑ 1 and as n → ∞! Although there are other
methods for obtaining sharp converse bounds for channel coding, to our knowledge
there was no previous method for proving the sharp Fano’s inequality in (1.13), which
is a versatile tool immediately applicable to a number of other operational problems
in network information theory.
The fundamental limits for most operational problems in network information
theory can be expressed in terms of the linear combinations of information measures,
which admit functional representations (analogous to the Gibbs variational formula)
via convex duality, so we can play a similar game. For certain network information
theory problems, we thus obtain by far the only existing method for a converse bound
which is sharp as n → ∞.
1.1.3 More on Converses in Information Theory
While the early work in information theory centered around first-order asymptotic
results, such as (1.8), information theorists have always been striving for improve-
ments:
• Non-asymptotic bounds where the blocklength n is assumed to be fixed. For
example, Fano’s inequality [28] provides a simple, though not very strong, non-
asymptotic bound. While traditionally information theory has focused on the
stationary memoryless settings, it is certainly of practical importance to con-
sider processes with memory, or sources and channels with only one realization
(as opposed to repeated independent realizations). In fact the latter setting
is often adopted in the computer science literature, for problems in the inter-
section of theoretical computer science and information theory (see e.g. [29]).
Moreover, even in the stationary memoryless setting, it is sometimes more de-
sirable to consider the scenario where the alphabet size or the number of users
is large compared to the number of channel/source realizations [30][31][32], per-
haps even more so in the so-called “big data” and “IoT” era. In those cases,
the traditional n → ∞ assumption appears inadequate, while a versatile non-
asymptotic bound may be applicable to a more relevant regime of asymptotics.
• More precise asymptotics. For example, in channel coding, it is possible to
improve (1.8) to the precise second-order asymptotic result:
    M = exp(nC − √(nV) Q⁻¹(ε) + o(√n)),    (1.14)

where Q(·) denotes the Gaussian tail probability and V is the channel dispersion.
The study of the second-order asymptotics in information theory was initiated by
Strassen [33], and recently
flourished thanks to the work of Polyanskiy et al. [34], Hayashi [35], Kostina et
al. [36], and others. When n is not too small, a good approximation of M can
be obtained from (1.14) (sometimes significantly better than the error exponent
approximation [34]). This is relevant to the practical setting where a short delay,
hence a short blocklength of the code, is demanded for data transmission.
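To see what (1.14) gives numerically, the sketch below evaluates the normal approximation (dropping the o(√n) term) for a binary symmetric channel, using the standard facts that C = log 2 − h(p) and that the dispersion of the BSC is V = p(1−p) log²((1−p)/p); the parameter values are illustrative:

```python
import math

def h(p):
    # Binary entropy in nats
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def Q_tail(x):
    # Gaussian tail probability Q(x) = P[N(0,1) > x]
    return 0.5 * math.erfc(x / math.sqrt(2))

def Q_inv(eps, lo=-10.0, hi=10.0):
    # Inverse of the (decreasing) Gaussian tail, by bisection
    for _ in range(200):
        mid = (lo + hi) / 2
        if Q_tail(mid) > eps:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# BSC(p): capacity C = log 2 - h(p), dispersion V = p(1-p) log^2((1-p)/p)
p, eps, n = 0.11, 1e-3, 1000
C = math.log(2) - h(p)
V = p * (1 - p) * math.log((1 - p) / p) ** 2

# Normal approximation to log M, cf. (1.14), without the o(sqrt(n)) term
logM = n * C - math.sqrt(n * V) * Q_inv(eps)
rate_bits = logM / (n * math.log(2))  # rate in bits per channel use
```

At blocklength n = 1000 the approximate rate is noticeably below the capacity of 0.5 bits per channel use, quantifying the backoff that first-order asymptotics miss.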
The key to achieving the above goals is to find new and better ways of proving
achievability and converse bounds. For example, suppose that the goal is to determine
the second-order asymptotics of a certain operational problem in network information
theory. The achievability (i.e. existence) part is rather well understood: for almost
all⁵ problems from network information theory with known first-order asymptotics,
an achievability bound with an O(√n) second-order term can be derived [37][38].
Indeed, since the achievability part is constructive, one can usually use random coding
to produce a joint distribution which is close to the ideal distribution appearing in the
formula of the first-order fundamental limit, so the second-order analysis is generally
reduced to Berry-Esseen central limit arguments. The problem of producing a desired
joint distribution is sometimes called coordination [39].
On the other hand, the converse (i.e. impossibility) part appears subtler since
the possibility of success by any code must be excluded. It is recognized within
the community that the converse part is usually the bottleneck for nailing down the
exact second-order asymptotics (see e.g. [40, Section V]). Below, we review the three
most influential ideas for proving converse results in information theory:
1. Information spectrum method: given probability measures P and Q on the same
alphabet, the information spectrum refers to the cumulative distribution func-
tion of the relative information log(dP/dQ)(X), where X ~ P. It is possible to bound
the error probability (maximum or average) of channel coding in terms of the
information spectrum [26][41]. The meta-converse approach (based on binary
hypothesis testing) and the Renyi divergence approach [42] are both closely
related to the information spectrum approach: it is easy to see that the knowl-
⁵Here we do not discuss variants such as universal coding or zero-error coding.
edge of the information spectrum is equivalent to the knowledge of the binary
hypothesis testing tradeoff region for P and Q (see e.g. [43]), and the Renyi
divergence between P and Q is recoverable from the information spectrum. Be-
sides being very general and not limited to the stationary memoryless systems,
the information spectrum method shows that the o(n) term in (1.12) can be
improved to O(√n), which is in fact the optimal central limit theorem rate (see
e.g. [34]). Other information-theoretic problems⁶ with known information spec-
trum converses include lossy source coding [36], Slepian-Wolf coding, multiple
access channels [44], and broadcast channels [45].
2. Type class analysis [8]: this method is restricted to, but very powerful for,
the discrete memoryless systems. The idea is to first look at, instead of a
stationary memoryless distribution, the equiprobable distribution on sequences
with the same empirical distribution (a.k.a. type). In other words, imagine that
a genie reveals to the encoder and the decoder the information of the type,
and a different code can be constructed for each type. The evaluation of the
probability then becomes a combinatorial (counting) problem, which has been
referred to as the “skeleton” or the “combinatorial kernel” of the information
theoretic problem [46]. It appears that the O(√n) second-order converses for
those problems proved by the information spectrum methods can also be proved
by the method of types (see [47] for the example of lossy source coding); this is
not surprising, since the type determines the value of the relative information,
and hence may be viewed as a finer resolution than the information spectrum.
For those problems, it is generally easy to see (by a combinatorial covering
or packing argument) that the error probability given the type of the source or
channel realization is either very close to 0 or 1, hence the task of estimating the
⁶By an “information-theoretic problem” we mean an operational task such as channel coding or source coding.
error probability in the O(√n) second-order rate regime is reduced to calculating
the probability of those “bad types”, which is an exercise in the central limit
theorem and the Taylor expansion of the information quantities. The type class
analysis has also been used extensively in [8] for deriving error exponents.
3. Blowing-up lemma (BUL): this method was introduced in [27] and exploited
extensively in the classical book [8]; see also the recent survey [48]. We review
the BUL method in Section 4.1.1. BUL draws on the nonlinear measure concen-
tration mechanism, and has been successful in establishing the strong converse
property in a wide range of multiuser information-theoretic problems (including
all settings in [8] with known single-letter⁷ rate regions; see [8, Ch. 16,
Discussion]).
                          Information spectrum   Type class   Blowing-up (image-size
                          (meta-converse)        analysis     characterization)
  Beyond |Y| < ∞?         yes                    no           no
  Second-order term       optimal                optimal      O_ε(√n log^{3/2} n)
  Extension to multiuser  selected               selected     all source/channel networks
                                                              with known first-order region

Table 1.1: Comparison of previous converse proof techniques
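The measure-concentration mechanism behind the blowing-up method can be illustrated on the Hamming cube {0,1}^n under the uniform measure (the parameters below are illustrative): a Hamming ball A of small probability, after enlarging by t = O(√n) in Hamming distance, has probability close to one.

```python
import math

def binom_cdf(n, p, k):
    # P[Bin(n, p) <= k], by accumulating the pmf iteratively
    total, term = 0.0, (1 - p) ** n  # term = P[Bin(n, p) = 0]
    for i in range(k + 1):
        total += term
        # pmf ratio: C(n, i+1)/C(n, i) * p/(1-p)
        term *= (n - i) / (i + 1) * p / (1 - p)
    return total

n = 1000
# A = {x : sum(x) <= n/2 - 2*sqrt(n)} is a Hamming ball of small probability
thr = n // 2 - int(2 * math.sqrt(n))
pA = binom_cdf(n, 0.5, thr)

# Its t-blowup {x : d_H(x, A) <= t} is the same ball enlarged by t coordinates
t = int(3 * math.sqrt(n))
pBlowup = binom_cdf(n, 0.5, thr + t)

assert pA < 0.01 and pBlowup > 0.9
```

A small set thus cannot stay small after an O(√n) blow-up, which is exactly the leverage that BUL converse proofs exploit; the price is the suboptimal logarithmic factor recorded in Table 1.1.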
1.1.4 The Side Information Problems
A class of problems involving side information (also called “helper” in [8]) are known
to be challenging in the converse part. An archetypical example is source coding
⁷A single-letter expression (as opposed to a multi-letter expression) for a stationary memoryless system is a formula involving only the per-letter distribution.
with compressed side information. Originally studied by Wyner [49] and by Ahlswede,
Gacs and Korner [27] in the 1970s, this problem is the first instance in network
information theory where the single-letter expression of the rate region requires an auxiliary
random variable. Other examples of the side information problems include common
randomness generation under the one-way communication protocol [15] and some
source and channel networks considered in [25]. For the side information problems,
BUL remained hitherto the only method for establishing the strong converse, hence
the precise second-order asymptotics has been recognized as a tough problem [50,
Section 9.2].
One reason such problems require BUL might be that their “combinatorial kernels”
(see the paragraph explaining the type class analysis above) are not easily solved,
hence it is not clear that the error probability can be estimated by simply computing
the error probability of certain “bad types”. We remark that the Gray-Wyner network
[51], although also requiring an auxiliary random variable in the expression of its
single-letter region, does not pose the same degree of difficulty as the side information
problems. In fact the
exact second-order asymptotics of the Gray-Wyner network can be derived using the
type class analysis described above [52].
1.1.5 Common Randomness Generation
Common randomness (CR) generation refers to the operational problem of generating
shared random bits among terminals. Although the converse methodology proposed
in this thesis is broadly applicable, we will use examples from the common random-
ness generation problem, simply as a consequence of the author’s personal research
background.
The scenario depicted in Figure 1.1, which we call CR generation with one com-
municator, is a typical example of the side information problem. The terminals T0,
. . . , Tm observe correlated sources. Terminal T0 is allowed to communicate to the
[Figure: terminal T_0 observes X and sends messages W_1, ..., W_m to terminals T_1, ..., T_m, which observe Y_1, ..., Y_m; the terminals then produce keys K, K_1, ..., K_m.]

Figure 1.1: CR generation with one communicator
other terminals, after which all terminals aim to produce a common random number
K. The goal is to characterize the tradeoff between the communication rates and
the rate at which the random numbers are produced, under a lower bound on the
probability that the random numbers produced by the terminals agree. The precise
formulation of the problem is given in Section 2.3.1.
The next three chapters of the thesis (Chapters 2-4) form a trilogy that offers the
three key ingredients for proving an O(√n) second-order converse for this problem
in the stationary memoryless setting, which, to the best of the author's knowledge,
was beyond the reach of previous techniques. More precisely, the three ingredients
allow one to accomplish the following, respectively:
• Chapter 2 presents equivalences between entropic optimizations and functional
optimizations, which allow one to derive a sharp converse bound for the case
where Y1, . . . , Ym are functions of X and when the communication rates are close
to zero.
• Chapter 3 shows that the behavior of an entropic optimization under a per-
turbation on the underlying measure is approximated by a mutual information
optimization. This forms the basis for deriving an O(√n) second-order converse bound when the vanishing communication rates assumption is dropped.
• To show an O(√n) second-order converse bound when further dropping the assumption that Y1, . . . , Ym are functions of X, we need the pumping-up technology in Chapter 4.
On the other hand, previous works have obtained one-shot converses using smooth Rényi entropy [53] or the meta-converse idea [54][55], whose asymptotic tightness is achieved in the other extreme of limited correlated sources but unlimited communication. Moreover, Mossel [56] used reverse hypercontractivity to bound the probability of agreement when there is no communication at all for binary symmetric sources, and derived the tight asymptotics when the number of terminals grows and the probability of agreement vanishes.
1.2 Concentration of Measure
1.2.1 The Gromov-Milman Formulation
Concentration of measure is a collection of tools and results from analysis and prob-
ability theory that have proved successful in many areas of pure and applied math-
ematics [57][48]. Concentration of a real valued random variable simply means that
the random variable is close to its “center” (e.g. its mean or its median) with high
probability.
The relevance of concentration of measure to non-asymptotic information theory
is easy to see: consider the example of channel coding. By the information spectrum
bounds [41], the error probability is essentially

\[
\varepsilon \approx \mathbb{P}[\imath_{X;Y}(X;Y) < \log M] \tag{1.15}
\]

where M denotes the number of messages, ı_{X;Y}(·;·) is a quantity called the information density, and X and Y are the input and the output, respectively. Note that the expectation of ı_{X;Y}(X;Y) equals the mutual information I(X;Y), hence bounding ε can be reduced to the concentration of ı_{X;Y}(X;Y).
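As an illustrative sketch (not from the thesis: the binary symmetric channel, its crossover probability, and the sample size below are arbitrary choices of ours), the information density of a BSC with uniform input has a simple closed form, and a Monte Carlo sample shows its empirical mean concentrating around I(X;Y):

```python
import math
import random

# Information density of a binary symmetric channel (BSC) with uniform input.
# Its mean is the mutual information I(X;Y) = 1 - h(delta); a Monte Carlo
# sample illustrates the concentration of the empirical mean around I(X;Y).
delta = 0.11
random.seed(0)

def i_density(x, y):
    # i_{X;Y}(x;y) = log2[ P(y|x) / P(y) ], and P(y) = 1/2 under uniform input.
    p_y_given_x = 1 - delta if x == y else delta
    return math.log2(p_y_given_x / 0.5)

h = -delta * math.log2(delta) - (1 - delta) * math.log2(1 - delta)
I = 1 - h  # mutual information of the BSC

n = 200_000
total = 0.0
for _ in range(n):
    x = random.randint(0, 1)
    y = x if random.random() > delta else 1 - x
    total += i_density(x, y)
mean = total / n
print(round(I, 4), round(mean, 4))
```

The empirical mean lands within a few thousandths of I(X;Y), which is the concentration that (1.15) exploits.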
Many standard tail bounds in statistics, such as Hoeffding’s inequality, Chernoff’s inequality, and Bernstein’s inequality, capture the concentration of sums of i.i.d. real-valued random variables. They are sometimes called “linear concentration inequalities” since the summation is linear.
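As a hedged numerical illustration (our own toy example; the parameters n, t, and the Bernoulli distribution are arbitrary), one can compare an empirical tail probability against the Hoeffding bound P(S_n/n − µ ≥ t) ≤ exp(−2nt²):

```python
import math
import random

# Hoeffding's inequality for i.i.d. X_i in [0, 1]:
#   P( S_n/n - mu >= t ) <= exp(-2 n t^2).
# Compare an empirical Bernoulli(1/2) tail with the bound.
random.seed(1)
n, t, p = 100, 0.1, 0.5
trials = 50_000
exceed = 0
for _ in range(trials):
    s = sum(1 for _ in range(n) if random.random() < p)
    if s / n - p >= t:
        exceed += 1
empirical = exceed / trials
bound = math.exp(-2 * n * t * t)
print(empirical, round(bound, 4))
```

The empirical tail (about 0.03 here) sits comfortably below the bound exp(−2) ≈ 0.135; the bound is loose but dimension-free.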
In the more general case, in which these random variables are not real valued,
there are no notions of “linear functions”, “median” or “mean” of these random
variables. However, intuitively we should still be able to describe the concentration
phenomenon of, say, a random variable taking values in a metric space. Indeed, there
are two widely accepted formulations of the concentration of measure phenomenon,
equivalent in a precise sense, for general metric measure spaces (see for example [57][58]
for a proof of the equivalence):
1. Any 1-Lipschitz function of the random variable deviates from its median with
small probability.
2. The blow-up of any set with probability at least 1/2 has a high probability.
Here, the blow-up of a set means the collection of all points whose distance to the set does not exceed a given threshold. Setting the threshold close to zero, we see
that the second formulation above implies an isoperimetric property (i.e. relationship
between the surface area and the volume of the set). Conversely, isoperimetry implies
concentration of measure by integration. While isoperimetry had long been considered merely a curiosity in geometry, Milman showed in the early 1970s [59] that isoperimetry on the sphere can be used to give a simpler proof of the Dvoretzky theorem, a fundamental result in convex geometry originally conjectured by Grothendieck. Since then, Milman vigorously promoted concentration of measure as a useful tool for attacking serious mathematical problems (see the discussions in [58][60]). Moreover,
Gromov studied concentration of measure on manifolds. Thus the above formulation of concentration of measure has been called the “Gromov-Milman” formulation (see e.g. [58]). Margulis investigated isoperimetry on the hypercube with application to a phase transition problem on random graphs in the 1970s [61]. In information theory, concentration of measure has also been called the blowing-up lemma [27][8].
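To make the blow-up formulation concrete, here is a small illustrative computation (our own toy example, not from the text) on the Hamming cube with the uniform measure: for A = {x : weight(x) ≤ n/2 − 1}, a set of probability just below 1/2, the r-blow-up under Hamming distance is {x : weight(x) ≤ n/2 − 1 + r}, so its probability is a binomial tail that rises rapidly toward 1 as r grows:

```python
import math

# Uniform measure on the Hamming cube {0,1}^n.  The r-blow-up of
# A = {x : weight(x) <= n/2 - 1} is {x : weight(x) <= n/2 - 1 + r},
# so all probabilities reduce to binomial tail sums.
n = 16

def binom_cdf(k, n):
    # P[ weight(X) <= k ] for X uniform on {0,1}^n
    return sum(math.comb(n, j) for j in range(k + 1)) / 2 ** n

p_A = binom_cdf(n // 2 - 1, n)
blowups = [binom_cdf(n // 2 - 1 + r, n) for r in range(4)]
print(round(p_A, 4), [round(p, 4) for p in blowups])
```

Already at r = 3 the blow-up covers roughly 89% of the cube, even though A itself has probability about 0.40: a small-dimensional glimpse of the blowing-up phenomenon.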
Peter Gács recalled the following (see [62]) about how the ideas of concentration of measure in the early 1970s, and in particular the result of Margulis, influenced his work (with Ahlswede and Körner) on the strong converse of the side information problem [27]:
Rudi Ahlswede, at a visit in Budapest, posed a question that I found
solvable. I remembered a Dobrushin seminar presentation of Margulis
about a certain phase transition in random graphs, in particular a certain
isoperimetric theorem used in the proof. This is how the Blowup Lemma
was born. We had a pleasant collaboration with János Körner in the
course of working it into a paper, along with a couple of other results.
In the 1980s, Marton [63] invented a transportation method for proving concentra-
tion of measure, yielding simpler proofs and stronger claims for some previous results
[64]. The heart of this methodology is the equivalence between the transportation-cost
inequality and the concentration inequality, which can also be viewed as an instance
of convex duality (see Section 2.5.7).
1.2.2 Is It Natural for Information Theorists?
Although the blowing-up lemma (the Gromov-Milman formulation of concentration
for metric-measure spaces) has been successfully applied to the strong converses of
many network information theory problems [8], the drawbacks are also evident: as
we alluded to before, the BUL approach does not yield the optimal second-order term, and is limited to the finite alphabet case.8 Perhaps more importantly, it is not logically satisfying for the converse argument to rely on a metric structure imposed on the alphabet, with the metric playing an active role in the proof.
Whether it be channel coding or the side information problem, it is clear that the
fundamental limit does not change if a measure-preserving map (which may very well
distort the metric structure) is applied to the alphabet.
The new converse approach we introduce in this thesis addresses these issues.
In particular, the Gromov-Milman formulation of concentration of measure is
sidestepped, and a metric structure need not be imposed on the alphabet. We now
proceed to review some elements in functional analysis that form the basis of this
new methodology.
1.3 Functional Analysis
1.3.1 In Large Deviation Theory
While most textbooks on information theory first define the relative entropy by

\[
D(P \| Q) := \mathbb{E}\left[ \log \frac{dP}{dQ}(X) \right], \tag{1.16}
\]

where X ~ P, and then prove the variational formula (1.1), Varadhan, the founder of the general theory of large deviations, chose the other way around: he defined the relative entropy via (1.1) as the convex dual (Legendre-Fenchel dual) of the cumulant generating function and then proved its equivalence to (1.16) [65, Section 10].9

8 Although BUL can be extended beyond finite alphabets, the BUL approach to strong converses involves a union bound step which requires finite alphabets.

The
“functional approach” of the present thesis follows the same spirit.
In fact, the idea of defining an object “dually” is very common in mathematics. For
example, in probability theory, some authors prefer to define a probability measure
as a functional on the linear space of continuous functions (i.e. using the claim of
the Riesz-Kakutani theorem as the definition), and then recover the usual definition
(assigning real numbers to measurable sets) and other properties [67][68]. Other
examples include cohomology in topology (dual of homology) and vector fields in
differential geometry (dual of “local functions” on a manifold). Although not the
most intuitive way of defining those objects, such “dual” definitions often turn out
to be more technically convenient in deeper studies of the subject.
1.3.2 Hypercontractivity and Its Reverse
Let L^p(Q_Y) denote the linear space of all measurable functions f equipped with the L^p norm:

\[
\|f\|_{L^p(Q_Y)} := \mathbb{E}^{\frac{1}{p}}\left[ f^p(Y) \right] \tag{1.17}
\]

where Y ~ Q_Y. Let T be a linear operator from L^p(Q_Y) to L^q(Q_X); if10

\[
\|T\|_{L^p(Q_Y) \to L^q(Q_X)} \le 1 \tag{1.18}
\]

for some q ≥ p ≥ 0, we say T is a hypercontraction (the case p = q is called a contraction).

9 While the equivalence of (1.16) and (1.1) follows immediately from the nonnegativity of the relative entropy when the f in (1.1) is assumed to be measurable, in large deviation theory the underlying space is usually assumed to be a metric-measure space (e.g. a Polish space), and the function f is assumed to be bounded and continuous. In that case, the proof of the equivalence of (1.16) and (1.1) (due to Donsker and Varadhan) is more involved [66].

10 Recall that the norm of an operator T is defined as the supremum of the norm of Tf over f whose norm is bounded by 1.
For information theorists, such an operator T arises naturally when a random transformation Q_{Y|X} is considered (in which case T is called a conditional expectation operator): given Q_{Y|X}, the corresponding T maps a function f on 𝒴 to

\[
x \mapsto \mathbb{E}[f(Y) \mid X = x], \tag{1.19}
\]

which is a function on 𝒳, where Y ~ Q_{Y|X=x} conditioned on X = x. For some delicate problems (e.g. the proof of the dual formulation of certain functional inequalities in Chapter 2), it is more technically convenient (and perhaps also more elegant) to define these objects “dually”: as alluded to before, a measure can be thought of as a linear functional on a suitable space of test functions. Thus, a random transformation Q_{Y|X} (yielding a linear mapping taking a measure on 𝒳 to a measure on 𝒴) can be thought of as the dual (or conjugate) of a conditional expectation operator T mapping functions on 𝒴 to functions on 𝒳.
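On a finite alphabet, T is simply the action of the stochastic matrix Q_{Y|X} on the vector of values of f. As a hedged sanity sketch (alphabet sizes, the exponent p, and the random seed are arbitrary choices of ours), one can verify numerically the contraction property ‖Tf‖_{L^p(Q_X)} ≤ ‖f‖_{L^p(Q_Y)} (the case p = q of (1.18)), which follows from Jensen's inequality when Q_Y is the output distribution induced by Q_X and Q_{Y|X}:

```python
import random

# Finite-alphabet conditional expectation operator: (Tf)(x) = E[f(Y) | X = x].
# With Q_Y the output distribution induced by Q_X and Q_{Y|X}, Jensen's
# inequality gives the contraction ||Tf||_{L^p(Q_X)} <= ||f||_{L^p(Q_Y)}, p >= 1.
random.seed(2)
nx, ny, p = 3, 4, 2.5

QX = [random.random() + 0.1 for _ in range(nx)]
QX = [q / sum(QX) for q in QX]
W = [[random.random() + 0.1 for _ in range(ny)] for _ in range(nx)]  # Q_{Y|X}
W = [[w / sum(row) for w in row] for row in W]
QY = [sum(QX[x] * W[x][y] for x in range(nx)) for y in range(ny)]

def norm(f, Q, p):
    return sum(q * abs(v) ** p for q, v in zip(Q, f)) ** (1 / p)

def T(f):
    return [sum(W[x][y] * f[y] for y in range(ny)) for x in range(nx)]

for _ in range(100):
    f = [random.random() for _ in range(ny)]
    assert norm(T(f), QX, p) <= norm(f, QY, p) + 1e-12
print("contraction verified for 100 random functions")
```

Hypercontractivity (q > p) and its reverse are strictly stronger statements and depend on the channel; this sketch only checks the baseline contraction.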
Hypercontractivity was initially considered in theoretical physics [69] and later
in functional analysis [70], theoretical computer science [71], and statistics [72][73].
In information theory, it was introduced by Ahlswede-Gács [7], and more recently received revived interest through the work of Anantharam et al. [9]. It is known [74][75][76] that the functional definition (1.18) is equivalent to an information-theoretic characterization, which we discuss in more detail in Section 2.5.4. The strong data processing constant [25] can be recovered from the region of hypercontractive parameters p and q via a limiting argument [7].
The operator T is said to satisfy reverse hypercontractivity if

\[
\|Tf\|_{L^q(Q_X)} \ge \|f\|_{L^p(Q_Y)}, \quad \forall f \ge 0, \tag{1.20}
\]

for some 0 ≤ q ≤ p ≤ 1. Initially studied by Borell in the Gaussian and Bernoulli cases in the 1970s, reverse hypercontractivity did not find many applications until the recent work of Mossel et al. [77]. Information-theoretic formulations of reverse hypercontractivity were studied in [78][79][80].
In this thesis, we show that the reverse hypercontractivity mechanism gives rise to a general approach to proving strong converses which uniformly improves on the traditional blowing-up methodology. In hindsight, this makes good sense, since literally, the reverse of “contraction” is nothing but “expansion” (i.e. blowing-up). However, this cute observation was not exploited before, even though Ahlswede, Gács and Körner had worked on both hypercontractivity [7] and the strong converse problem [27] around 1976.
We remark that various connections between hypercontractivity (and its reverse) and measure concentration have been discussed in certain special cases such as the Gaussian [81, p. 116][82] or Bernoulli [56] cases. However, our present application of reverse hypercontractivity relies on different arguments, and is also more canonical and convincing than those discussions, in terms of both the generality and the sharpness of our results.
1.3.3 Brascamp-Lieb Inequality
The Brascamp-Lieb inequality is a functional inequality that subsumes as special cases, or closely relates to, many other familiar inequalities, including Hölder’s inequality, the sharp Young inequality, the Loomis-Whitney inequality, hypercontractivity, the entropy power inequality, and the logarithmic Sobolev inequality. Originally studied by Brascamp and Lieb [83] in the 1970s, motivated by problems in particle physics, the Brascamp-Lieb inequality has now found applications in numerous other fields such as convex geometry [84], statistics [85], and theoretical computer science [86].
The connection to information theory was already observed in the paper of Cover
and Dembo [87] in the context of the entropy power inequality. However, the present thesis will investigate a more general connection to information theory via convex duality and its implications, as we explain next.
By convex duality, the (functional form of the) Brascamp-Lieb inequality has an equivalent formulation in terms of a bound on a linear combination of relative entropies. Such an equivalent formulation was observed in the work of Carlen and Cordero-Erausquin [76]. Alternative proofs (without using the Gibbs variational formula) and extensions were recently proposed by Nair [74] and Liu et al. [80]. Similarly, the reverse Brascamp-Lieb inequality [88] also has equivalent information-theoretic formulations; see [79][80].
In high dimensions, we show that a small perturbation of the underlying measure in the Brascamp-Lieb inequality changes the first-order asymptotics to a certain linear combination of mutual informations (instead of a linear combination of relative entropies). Such a perturbation idea generalizes that of the smooth entropy arising in quantum information theory [53]. Thus, we obtain a correspondence
between these mutual informations (which appear in the expression of single-letter
regions) and a functional inequality, which can be used in the strong converse proofs
if we take the functions to be the indicator functions of certain encoding or decoding
sets. In the case of finite alphabets, similar results were derived under the moniker
“image-size characterization” [8], using the monotonicity of relative entropy instead
of convex duality.
The Brascamp-Lieb inequality can be restated as an optimization problem over
some measurable functions. The original work of Brascamp and Lieb [83] proved that
Gaussian functions achieve the optimal value. Since then, various other proofs of the
Gaussian optimality have been found [89][90][75][76][91]. Among them, Lieb’s proof
[89] used the fact that two independent random variables whose sum and difference
are also independent must be Gaussian. Using the same property of the Gaussian dis-
tribution but working with information theoretic inequalities, Geng-Nair [92] proved
the Gaussian optimality in the single-letter region of several network information the-
ory problems. Liu et al. [93] gave information-theoretic proofs of the Brascamp-Lieb
inequality and its extensions with several improvements over the methodology of Lieb
[89] and Geng-Nair [92].
1.4 Organization of the Thesis
The trilogy proposing the functional approach to information-theoretic converses comprises the first three chapters of the thesis. Each of these chapters describes one of the three main ingredients: the convex duality of the functional and entropic optimizations; the behavior of the optimal value under a perturbation of the underlying measure; and the reverse hypercontractivity of Markov semigroups. In each of these chapters we will return to the one-communicator common randomness generation problem introduced in Section 1.1.5, and show how the introduction of the new ingredients gets us closer to the goal of obtaining an optimal O(√n) second-order converse. Towards the end of Chapter 4, we also discuss how to apply the proposed methodology to other problems in network information theory, such as source networks, broadcast channels, and the empirical distribution of good channel codes.
Chapter 2 contains additional results on convex duality which are of independent
interest, but not needed for the applications to common randomness generation.
Chapter 5 presents additional examples of information measures, their dual formulations, and their applications. Eγ is a generalization of the total variation distance. We get a glimpse of the power of Eγ through channel resolvability in the large deviation regime, and applications of the result to source coding, broadcast channels, and wiretap channels. g⋆ is a functional variant of the quantity g in the chapter on “image-size characterizations” in the book of Csiszár-Körner [8]. The proposed g⋆ quantity can substitute for g in the converse proofs of network information theory problems such as Gelfand-Pinsker coding, but the analysis of the former is cleaner and also yields the optimal scaling of the second-order term.
While Chapters 2-5 are concerned with the applications of functional inequalities
to problems in information theory, Chapter 6 offers the counterpoint that entropic
inequalities and standard techniques in information theory provide simpler alterna-
tive approaches to certain functional inequalities, such as establishing the Gaussian
optimality in the Brascamp-Lieb inequality.
Given the successes of functional inequalities in common randomness generation,
source coding and channel coding, it is tempting to extend this paradigm to the secret
key generation problem, which imposes security constraints on common randomness
generation, as well as the interactive common randomness generation problem where
terminals may communicate back-and-forth under a rate constraint. We discuss these
operational problems and the challenges of extending the functional approach to these
settings in Chapter 7.
With the understanding that readers from various backgrounds may be interested
in different aspects of the thesis, we provide the following recommended itineraries
for the first reading:
• Readers interested in strong converse and non-asymptotic information theory
may prefer to look closely at the first four chapters, excluding Sections 2.4-2.5.
• Readers interested in information security, especially common randomness generation or key agreement, may prefer to read Sections 2.3, 3.3, and 4.4.4 for the converse proof for the one-communicator common randomness generation model, and then visit Section 5.2.6 for the application of Eγ to wiretap channels. Moreover, a summary of some other results on secret key generation by the author can be found in Chapter 7.
• Readers interested in information measures or the proof of Gaussian optimal-
ity in the rate regions in network information theory may prefer to focus on
Chapters 2 and 6.
Chapter 2
A Tale of Two Optimizations
It is a functional optimization; it is an entropic optimization. We investigate which viewpoint is better for the various problems under consideration.
This chapter investigates the fundamental convex duality machinery that explains
the connection between the functional optimization and the entropic optimization.
To be concrete, we look at the example of the Brascamp-Lieb inequality from func-
tional analysis. More precisely, we consider a slight extension of the Brascamp-Lieb
inequality tailored to our information-theoretic applications. The entropic formula-
tion of such a functional inequality is proved in Sections 2.1-2.2. The result is then
applied to the common randomness generation problem in Section 2.3 where a zero-
rate single-shot converse bound is derived. The reverse Brascamp-Lieb inequality,
although formally symmetrical to the Brascamp-Lieb inequality, requires more so-
phistication in the proof of the equivalence to the entropic formulation (Section 2.4),
and may be skipped by readers who are only interested in the applications to network
information theory. Section 2.5 reviews several special cases of the Brascamp-Lieb
inequality and its reverse.
2.1 The Duality between Functions and Measures
We first introduce some general notation and definitions used in this thesis. An
alphabet (denoted by X , Y etc.) is a measurable space, that is, a set equipped with a
σ-algebra. The set of all real-valued measurable functions (denoted by f , g etc.) on
a given alphabet forms a linear space. We use Greek letters (e.g. µ) for unnormalized measures, and capital Roman letters (e.g. P and Q) for probability measures. A measure µ defines a linear functional on the space of functions via integration:

\[
f \mapsto \int f \, d\mu, \tag{2.1}
\]
hence the space of measures may be viewed as the dual space (see e.g. [67]) of the space
of “test functions”. Note that this is a general and informal statement, since we have not yet specified what the “test functions” are. When we deal with the reverse Brascamp-Lieb inequality in later sections of this chapter, it is technically necessary to choose a
“nice” space of functions (e.g. bounded continuous functions) so that a certain space of
measures indeed forms the set of all bounded linear functionals, meaning that the space
of measures is the dual of the space of functions in the precise mathematical sense [67].
However, in other places in this thesis, it is generally more convenient to consider the
ensembles of non-negative functions and non-negative measures instead. In this case,
the duality between the two linear spaces no longer holds in the conventional precise
sense of functional analysis (since µ may no longer be bounded as a linear functional),
but the integration (2.1) still always makes sense (possibly equal to +∞) thanks
to the nonnegativity assumption. Hence, except in sections related to the reverse-
type inequalities, no additional assumptions (e.g. a topological structure) need to be
imposed on the alphabet other than being a measurable space.
A fundamental quantity in information theory is the relative entropy. First, given two nonnegative σ-finite measures θ ≪ µ on 𝒳, let us define the relative information as the logarithm of the Radon-Nikodym derivative:

\[
\imath_{\theta \| \mu}(x) := \log \frac{d\theta}{d\mu}(x) \tag{2.2}
\]

where x ∈ 𝒳. Note that there is no assumption about |𝒳|, and recall that the right side of (2.2) is unique up to values on a set of µ-measure zero. Then, the relative entropy between a probability measure P and a σ-finite measure µ on the same measurable space is defined as

\[
D(P \| \mu) := \mathbb{E}\left[ \imath_{P \| \mu}(X) \right] \tag{2.3}
\]

where X ~ P, if P ≪ µ, and infinity otherwise.
A fundamental quantity in functional analysis is the norm: for p ∈ (0, ∞) and a real-valued measurable function f on 𝒳, define the p-th norm

\[
\|f\|_{L^p(\mu)} := \left( \int |f|^p \, d\mu \right)^{\frac{1}{p}}. \tag{2.4}
\]

Sometimes we abbreviate ‖f‖_{L^p(µ)} as ‖f‖_p, or as ‖f(X)‖_p if µ is a probability measure and X ~ µ. Note that if µ is a probability measure, and g is a real-valued function on 𝒳, then

\[
\log \|\exp(g)\|_p = \frac{1}{p} \log \mathbb{E}[\exp(p\, g(X))] \tag{2.5}
\]
\[
= \frac{1}{p} \Lambda_{g(X)}(p) \tag{2.6}
\]

where Λ_{g(X)}(p) denotes the cumulant generating function of the real-valued random variable g(X).
Using a bit of convex analysis (see e.g. [94]) one can verify that the convex functional

\[
P \mapsto \frac{1}{p} D(P \| \mu) \tag{2.7}
\]

is the Legendre dual of the convex functional

\[
g \mapsto \log \|\exp(g)\|_p. \tag{2.8}
\]

This basic fact hints that functional inequalities in functional analysis (typically involving integrals and norms) have dual correspondences with inequalities in information theory (typically involving the relative entropy). In the next section, we explore a very canonical instance of such a correspondence, in the context of the Brascamp-Lieb inequality.
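As a hedged finite-alphabet sanity check of this duality (the alphabet size, the exponent p, and the seed below are arbitrary choices of ours), one can verify numerically that the map g ↦ E_P[g] − log‖exp(g)‖_{L^p(µ)} is maximized at g⋆ = (1/p) log(dP/dµ), where its value equals (1/p)D(P‖µ):

```python
import math
import random

# Scaled Gibbs variational formula on a finite alphabet: for probability
# measures P << mu and p > 0,
#   (1/p) D(P||mu) = sup_g { E_P[g] - log ||exp(g)||_{L^p(mu)} },
# attained at g* = (1/p) log(dP/dmu).
random.seed(4)
n, p = 6, 1.3
mu = [random.random() + 0.05 for _ in range(n)]
mu = [m / sum(mu) for m in mu]
P = [random.random() + 0.05 for _ in range(n)]
P = [q / sum(P) for q in P]

def objective(g):
    lin = sum(q * gi for q, gi in zip(P, g))
    log_norm = math.log(sum(m * math.exp(p * gi) for m, gi in zip(mu, g))) / p
    return lin - log_norm

D = sum(q * math.log(q / m) for q, m in zip(P, mu))
g_star = [math.log(q / m) / p for q, m in zip(P, mu)]

assert abs(objective(g_star) - D / p) < 1e-9     # equality at the optimizer
for _ in range(200):
    g = [random.uniform(-2, 2) for _ in range(n)]
    assert objective(g) <= D / p + 1e-12          # never exceeded elsewhere
print("duality verified")
```

The substitution h = pg reduces this to the usual Gibbs variational formula, which is the p = 1 case.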
2.2 Dual Formulations of the Forward Brascamp-Lieb Inequality
The Brascamp-Lieb inequality1, originally motivated by problems from particle physics [83], can be viewed as a property of the “correlation” among random variables, and turns out to be naturally suited for the analysis of the common randomness generation problem. In the literature, the Brascamp-Lieb inequality usually refers to the fact that Gaussian functions saturate a certain functional inequality. To be concrete, let us look at a modern formulation of the result from Barthe’s paper [88]:2

1 Not to be confused with the “variance Brascamp-Lieb inequality” (cf. [95][96][97]), a different type of inequality that generalizes the Poincaré inequality.

2 [88, Theorem 1] actually contains additional assumptions, which make the best constants D and F positive and finite, but which are not really necessary for the conclusion to hold ([88, Remark 1]).
Theorem 2.2.1 ([88, Theorem 1]) Let E, E_1, . . . , E_m be Euclidean spaces, equipped with the Lebesgue measure, and B_i : E → E_i be linear maps. Let (c_i)_{i=1}^m and D be positive real numbers. Then the Brascamp-Lieb inequality

\[
\int \prod_{i=1}^{m} f_i(B_i x) \, dx \le D \prod_{i=1}^{m} \|f_i\|_{\frac{1}{c_i}}, \tag{2.9}
\]

where

\[
\|f_j\|_{\frac{1}{c_j}} := \mathbb{E}^{c_j}\!\left[ f_j^{\frac{1}{c_j}}(Y_j) \right], \tag{2.10}
\]

for all nonnegative measurable functions f_i on E_i, i = 1, . . . , m, holds if and only if it holds whenever f_i, i = 1, . . . , m, are centered Gaussian functions3.
Various proofs of the Gaussian optimality in the Brascamp-Lieb inequality have been
proposed since the original paper of Brascamp and Lieb [83]; we further discuss these
matters in Section 6.2. In this chapter, however, we are only interested in the form
of the inequality (2.9) rather than the Gaussian optimality result. In particular, we
do not restrict our attention to the setting of Euclidean spaces and linear maps: the
underlying spaces and the mappings between them can be arbitrary.
The form of the inequality (2.9) can be seen as a generalization of several other familiar inequalities, including Hölder’s inequality, the sharp Young inequality, the Loomis-Whitney inequality, the entropy power inequality, hypercontractivity and the logarithmic Sobolev inequality; we further discuss these special cases in Section 2.5 (see also [98][99][70]).
The Brascamp-Lieb inequality is known to imply several information-theoretic
inequalities, such as the entropy power inequality [87, Theorem 12] and the entropic
uncertainty principle [100]; see also the summary in [93]. In this thesis, however, we
3 A centered Gaussian function is of the form x ↦ exp(−xᵀAx), where A is a positive semidefinite matrix.
are mainly interested in the dual formulations of the inequality (2.9). Such duality
was noted by Carlen and Cordero-Erausquin [76, Theorem 2.1] (see also [75] for a
special case). We reproduce their result here:
Theorem 2.2.2 ([76, Theorem 2.1]) Let 𝒳, 𝒴_1, . . . , 𝒴_m be measurable spaces. Consider a probability measure Q_X, measurable maps φ_j : 𝒳 → 𝒴_j, j = 1, . . . , m, probability measures Q_{Y_1}, . . . , Q_{Y_m} (not necessarily induced by Q_X and (φ_j)_{j=1}^m), and c_1, . . . , c_m ∈ (0, ∞). For any d ∈ ℝ, the following statements are equivalent:

1. For any non-negative measurable functions f_j : 𝒴_j → ℝ, j = 1, . . . , m, it holds that

\[
\mathbb{E}\left[ \prod_{j=1}^{m} f_j(\phi_j(X)) \right] \le \exp(d) \prod_{j=1}^{m} \|f_j\|_{\frac{1}{c_j}} \tag{2.11}
\]

where X ~ Q_X and Y_j ~ Q_{Y_j}.

2. For any probability measure P_X ≪ Q_X,

\[
D(P_X \| Q_X) + d \ge \sum_{j=1}^{m} c_j D(P_{Y_j} \| Q_{Y_j}) \tag{2.12}
\]

where P_{Y_j} is induced by P_X and φ_j, j = 1, . . . , m.
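As a hedged numerical illustration of one direction of this equivalence (the alphabet sizes, maps φ_j, and exponents below are our own toy choices), consider the Hölder-type special case d = 0, c_1 + c_2 = 1, with Q_{Y_j} induced by Q_X and deterministic φ_j: then statement 1 is Hölder's inequality, so statement 2 must hold for every P_X, which the following sketch checks:

```python
import math
import random

# Sanity check of statement 2 in a Hoelder-type special case: d = 0,
# c1 + c2 = 1, Q_{Y_j} induced by Q_X and deterministic maps phi_j.
# Hoelder's inequality gives statement 1, so statement 2 must hold:
#   D(P_X||Q_X) >= c1 D(P_{Y_1}||Q_{Y_1}) + c2 D(P_{Y_2}||Q_{Y_2}).
random.seed(5)
nx = 6
phi1 = [x % 2 for x in range(nx)]   # Y_1 alphabet {0, 1}
phi2 = [x % 3 for x in range(nx)]   # Y_2 alphabet {0, 1, 2}
c1, c2 = 0.4, 0.6

def push(P, phi, ny):
    out = [0.0] * ny
    for x, q in enumerate(P):
        out[phi[x]] += q
    return out

def D(P, Q):
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

QX = [random.random() + 0.1 for _ in range(nx)]
QX = [q / sum(QX) for q in QX]
QY1, QY2 = push(QX, phi1, 2), push(QX, phi2, 3)

for _ in range(500):
    PX = [random.random() for _ in range(nx)]
    PX = [q / sum(PX) for q in PX]
    PY1, PY2 = push(PX, phi1, 2), push(PX, phi2, 3)
    assert D(PX, QX) + 1e-12 >= c1 * D(PY1, QY1) + c2 * D(PY2, QY2)
print("entropic dual inequality verified")
```

In this special case the entropic side also follows directly from the data processing inequality, since each D(P_{Y_j}‖Q_{Y_j}) ≤ D(P_X‖Q_X) and the c_j sum to one.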
In this section we prove an extension of the duality of Carlen and Cordero-Erausquin, with a functional inequality that generalizes (2.11) by allowing a cost function and random transformations in lieu of {φ_j}. Both generalizations are essential for certain information-theoretic applications. The “forward-reverse Brascamp-Lieb inequality” alluded to before will be introduced in later sections since, although the forward-reverse inequality essentially generalizes the forward inequality (2.11), the proof of its equivalent formulation is more involved and applies only to certain “regular” (though fairly general) spaces4.
Theorem 2.2.3 Fix a probability measure Q_X, an integer m ∈ {1, 2, . . . }, c_j ∈ (0, ∞), and random transformations Q_{Y_j|X}5 for each j ∈ {1, . . . , m}. Let (X, Y_j) ~ Q_X Q_{Y_j|X}. Assume that d : 𝒳 → (−∞, ∞] is a measurable function satisfying

\[
0 < \mathbb{E}[\exp(-d(X))] < \infty. \tag{2.13}
\]

The following statements are equivalent:

1. For any non-negative measurable functions f_j : 𝒴_j → ℝ, j ∈ {1, . . . , m}, it holds that

\[
\mathbb{E}\left[ \exp\left( \sum_{j=1}^{m} \mathbb{E}[\log f_j(Y_j) \mid X] - d(X) \right) \right] \le \prod_{j=1}^{m} \|f_j\|_{\frac{1}{c_j}} \tag{2.14}
\]

where the norm ‖f_j‖_{1/c_j} is with respect to Q_{Y_j}.

2. For any distribution P_X ≪ Q_X, it holds that

\[
D(P_X \| Q_X) + \mathbb{E}[d(X)] \ge \sum_{j=1}^{m} c_j D(P_{Y_j} \| Q_{Y_j}) \tag{2.15}
\]

where X ~ P_X, and P_X → Q_{Y_j|X} → P_{Y_j} for j ∈ {1, . . . , m}.
Proof
• 1) ⇒ 2): Define

\[
d_0 := \log \mathbb{E}\left[ \exp(-d(X)) \prod_{j=1}^{m} \exp\left( c_j \, \mathbb{E}[\imath_{P_{Y_j} \| Q_{Y_j}}(Y_j) \mid X] \right) \right]. \tag{2.16}
\]

4 More precisely, the “entropic ⇒ functional” direction is not more difficult than for the forward inequality, but the “functional ⇒ entropic” direction requires sophisticated min-max theorems and is only proved for Polish spaces. In the finite alphabet case, the latter difficulty can be circumvented by using the KKT conditions [80].

5 Throughout the thesis, with the exception of the discussions on the reverse Brascamp-Lieb inequality (mostly, Section 2.4), the definition of a random transformation is the same as the conventional definition of a regular conditional probability (see e.g. [43]).
Invoking statement 1) with

\[
f_j \leftarrow \left( \frac{dP_{Y_j}}{dQ_{Y_j}} \right)^{c_j} \tag{2.17}
\]

we obtain

\[
\exp(d_0) \le \prod_{j=1}^{m} \left( \mathbb{E}\left[ \frac{dP_{Y_j}}{dQ_{Y_j}}(Y_j) \right] \right)^{c_j} = 1. \tag{2.18}
\]

Now if 0 < exp(d_0) ≤ 1, then

\[
dS_X(x) := \exp(-d(x) - d_0) \prod_{j=1}^{m} \exp\left( c_j \, \mathbb{E}[\imath_{P_{Y_j} \| Q_{Y_j}}(Y_j) \mid X = x] \right) dQ_X(x) \tag{2.19}
\]

is a probability measure. Then (2.18) combined with the nonnegativity of relative entropy shows that

\[
\sum_{j=1}^{m} c_j D(P_{Y_j} \| Q_{Y_j}) \le D(P_X \| S_X) - d_0 + \sum_{j=1}^{m} c_j D(P_{Y_j} \| Q_{Y_j}) \tag{2.20}
\]
\[
= D(P_X \| Q_X) + \mathbb{E}[d(X)] \tag{2.21}
\]

and statement 2) holds. On the other hand, if exp(d_0) = 0, then for Q_X-almost all x,

\[
\exp(-d(x)) \prod_{j=1}^{m} \exp\left( c_j \, \mathbb{E}[\imath_{P_{Y_j} \| Q_{Y_j}}(Y_j) \mid X = x] \right) = 0. \tag{2.22}
\]

Taking logarithms on both sides and taking the expectation with respect to P_X, we have

\[
-\mathbb{E}[d(X)] + \sum_{j=1}^{m} c_j D(P_{Y_j} \| Q_{Y_j}) = -\infty \tag{2.23}
\]
and statement 2) also follows.
• 2) ⇒ 1): It suffices to prove the claim for f_j’s such that 0 < a < f_j < b < ∞ for some a and b, since the general case will then follow by taking limits (e.g. using the monotone convergence theorem). By this assumption and (2.13), we can always define P_X through

\[
\imath_{P_X \| Q_X}(x) = -d(x) - d_0 + \mathbb{E}\left[ \sum_{j=1}^{m} \log f_j(Y_j) \,\middle|\, X = x \right] \tag{2.24}
\]

and S_{Y_j} through

\[
\imath_{S_{Y_j} \| Q_{Y_j}}(y_j) := \frac{1}{c_j} \log f_j(y_j) - d_j, \tag{2.25}
\]

for each j ∈ {1, . . . , m}, where d_j ∈ ℝ, j ∈ {0, . . . , m}, are normalization constants; therefore

\[
\exp(d_0) = \mathbb{E}\left[ \exp\left( -d(X) + \mathbb{E}\left[ \sum_{j=1}^{m} \log f_j(Y_j) \,\middle|\, X \right] \right) \right]; \tag{2.26}
\]

\[
\exp(d_j) = \mathbb{E}\left[ \exp\left( \frac{1}{c_j} \log f_j(Y_j) \right) \right], \quad j \in \{1, . . . , m\}. \tag{2.27}
\]
But direct computation gives

\[
D(P_X \| Q_X) = -\mathbb{E}[d(X)] - d_0 + \mathbb{E}\left[ \sum_{j=1}^{m} \log f_j(Y_j) \right] \tag{2.28}
\]

\[
D(P_{Y_j} \| Q_{Y_j}) = D(P_{Y_j} \| S_{Y_j}) + \mathbb{E}\left[ \imath_{S_{Y_j} \| Q_{Y_j}}(Y_j) \right] \tag{2.29}
\]
\[
= D(P_{Y_j} \| S_{Y_j}) - d_j + \mathbb{E}\left[ \frac{1}{c_j} \log f_j(Y_j) \right] \tag{2.30}
\]
where Y_j ~ P_{Y_j}. Therefore statement 2) yields

\[
-d_0 + \mathbb{E}\left[ \sum_{j=1}^{m} \log f_j(Y_j) \right] \ge \sum_{j=1}^{m} c_j D(P_{Y_j} \| S_{Y_j}) - \sum_{j=1}^{m} c_j d_j + \mathbb{E}\left[ \sum_{j=1}^{m} \log f_j(Y_j) \right]. \tag{2.31}
\]
Since the f_j’s are assumed to be bounded, −∞ < 𝔼[ Σ_{j=1}^m log f_j(Y_j) ] < ∞, so we can cancel this term from the two sides of the inequality. It then follows from the non-negativity of relative entropy that

\[
d_0 \le \sum_{j=1}^{m} c_j d_j, \tag{2.32}
\]

which is equivalent to statement 1) in view of (2.26) and (2.27).
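As a hedged numerical illustration of the functional statement (2.14) (the alphabet sizes, channel, and test functions below are our own toy choices), consider the special case m = 1, c_1 = 1, d = 0, with Q_Y induced by Q_X and Q_{Y|X}: then (2.14) follows from Jensen's inequality applied to the inner conditional expectation, and can be checked directly on a finite alphabet:

```python
import math
import random

# Functional statement (2.14) in the special case m = 1, c_1 = 1, d = 0, with
# Q_Y induced by Q_X and Q_{Y|X}.  Jensen's inequality on the inner conditional
# expectation gives  E[ exp( E[log f(Y) | X] ) ] <= E_{Q_Y}[ f(Y) ] = ||f||_1.
random.seed(6)
nx, ny = 4, 5
QX = [random.random() + 0.1 for _ in range(nx)]
QX = [q / sum(QX) for q in QX]
W = [[random.random() + 0.1 for _ in range(ny)] for _ in range(nx)]  # Q_{Y|X}
W = [[w / sum(row) for w in row] for row in W]
QY = [sum(QX[x] * W[x][y] for x in range(nx)) for y in range(ny)]

def lhs(f):
    return sum(QX[x] * math.exp(sum(W[x][y] * math.log(f[y]) for y in range(ny)))
               for x in range(nx))

def rhs(f):
    return sum(QY[y] * f[y] for y in range(ny))

for _ in range(300):
    f = [random.uniform(0.1, 3.0) for _ in range(ny)]
    assert lhs(f) <= rhs(f) + 1e-12
print("functional inequality verified")
```

The general statement, with multiple channels, exponents c_j, and a cost function d, requires the full duality argument given in the proof above; this sketch only exercises the simplest instance.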
Remark 2.2.1 We prove Theorem 2.2.3 by defining certain auxiliary measures and
then reducing (2.14) or (2.15) to the nonnegativity of relative entropy. Alternatively,
Theorem 2.2.3 may be proved with a variational formula for the relative entropy, in a manner similar to the proof of [76, Theorem 2.1].
Remark 2.2.2 As a convention, if the left side of (2.15) is ∞ − ∞, it should be understood as 𝔼[ı_{P_X‖Q_X}(X) + d(X)] with X ~ P_X. Note that the assumption (2.13) ensures that

\[
dT_X(x) := \frac{\exp(-d(x)) \, dQ_X(x)}{\mathbb{E}[\exp(-d(X))]} \tag{2.33}
\]

defines a probability measure T_X, so that

\[
\mathbb{E}\left[ \imath_{P_X \| Q_X}(X) + d(X) \right] = D(P_X \| T_X) - \log \mathbb{E}[\exp(-d(X))] \tag{2.34}
\]

is always well-defined (finite or +∞).
Remark 2.2.3 In the statements of Theorem 2.2.3, Q_X is a probability measure, and Q_X and Q_{Y_j} are connected through the random transformation Q_{Y_j|X}. These choices help to keep our notation simple and suffice for most of our applications. However, from the proof it is clear that these restrictions are not really necessary. In other words, we have the extension of Theorem 2.2.3 that the following two statements are equivalent:

\[
\int \exp\left( \sum_{j=1}^{m} \mathbb{E}[\log f_j(Y_j) \mid X = x] - d(x) \right) d\nu(x) \le \prod_{j=1}^{m} \|f_j\|_{\frac{1}{c_j}}, \quad \forall f_1, . . . , f_m; \tag{2.35}
\]

\[
D(P_X \| \nu) + \int d(x) \, dP_X(x) \ge \sum_{j=1}^{m} c_j D(P_{Y_j} \| \mu_j), \quad \forall P_X, \tag{2.36}
\]

where again f_j : 𝒴_j → ℝ, j ∈ {1, . . . , m}, are nonnegative measurable functions, P_X ≪ ν, and P_X → Q_{Y_j|X} → P_{Y_j} for j ∈ {1, . . . , m}. Here the measures ν and µ_j need not be normalized and need not be connected by Q_{Y_j|X}, and ‖ · ‖_{1/c_j} is with respect to µ_j.
2.3 Application to Common Randomness Generation
We now use Theorem 2.2.3 to prove a preliminary converse bound for common randomness generation.

2.3.1 The One Communicator Problem: the Setup
We first formulate the problem in the one-shot setting. Let $Q_{XY^m}$ be the joint distribution of the sources $X, Y_1, \dots, Y_m$, observed by the terminals $T_0, \dots, T_m$ as shown in Figure 1.1. The communicator $T_0$ computes/encodes (possibly stochastically) the integers $W_1(X), \dots, W_m(X)$ and sends them to $T_1, \dots, T_m$, respectively. Then, the terminals $T_0, \dots, T_m$ compute/decode (either deterministically or stochastically, which will be specified in the statements of our results) integers $K(X)$, $K_1(Y_1, W_1)$, \dots, $K_m(Y_m, W_m)$. The goal is to produce $K = K_1 = \dots = K_m$ with high probability, with $K$ almost equiprobable.

In the stationary memoryless case where the sources have the per-letter distribution $Q_{XY^m}$, take $X \leftarrow X^n$, $Y_j \leftarrow Y_j^n$ in the above formulation.
Definition 2.3.1 The $(m+1)$-tuple $(R, R_1, \dots, R_m)$ is said to be achievable if a sequence of key generation schemes can be designed to fulfill the following conditions:

$$\liminf_{n\to\infty} \frac{1}{n} H(K) \ge R; \tag{2.37}$$

$$\limsup_{n\to\infty} \frac{1}{n} \log|\mathcal{W}_l| \le R_l, \quad l = 1,\dots,m; \tag{2.38}$$

$$\lim_{n\to\infty} \max_{1\le l\le m} \mathbb{P}[K \neq K_l] = 0. \tag{2.39}$$

Irrespective of the stochasticity of the decoders (cf. [16][101]), the achievable region is the set of $(R, R_1, \dots, R_m) \in [0,\infty)^{m+1}$ such that

$$\sup_{P_{U|X}} \left\{\sum_{j=1}^{m} c_j I(U;Y_j) - I(U;X)\right\} + \sum_j c_j R_j \ge \left(\sum_j c_j - 1\right) R \tag{2.40}$$

for all $c^m \in (0,\infty)^m$ (equivalently, for all $c^m \in (0,\infty)^m$ such that $\sum c_j > 1$), where $(U, X, Y_j) \sim P_{U|X} Q_{XY_j}$.
Remark 2.3.1 It is widely known in the information-theoretic secrecy literature (see e.g. [102] for a discussion) that the rate region (2.40) is rather "robust" under various definitions. For example, (2.40) does not change if one replaces (2.37) and (2.39) in the conventional definition with

$$\liminf_{n\to\infty} \frac{1}{n}\log|\mathcal{K}| \ge R; \tag{2.41}$$

$$\lim_{n\to\infty} |P_{KK^m} - T_{KK^m}| = 0, \tag{2.42}$$

where $T_{KK^m}$ denotes the ideal distribution:

$$T_{KK^m}(k, k^m) = \frac{1}{|\mathcal{K}|}\,1\{k = k_1 = \dots = k_m\}, \quad \forall k, k_1, \dots, k_m. \tag{2.43}$$

In fact, such a definition appears to be more natural for the second-order analysis in the nonvanishing error regime.
2.3.2 A Zero-Rate One-Shot Converse for the Omniscient Helper Problem
Although the single-letter rate region (2.40) is known for the one-communicator problem, the previous proof based on Fano's inequality only establishes a weak converse result; that is, the error probability is assumed to be vanishing. As mentioned before, a main goal of the first three chapters is to supply general converse tools to solve, in particular, the optimal scaling of the second-order rate for the one-communicator problem with nonvanishing error probability. In this section, however, we pursue an intermediate goal, which is more modest than the final goal in the following senses:

• The first-order asymptotics of the bound is tight only when the supremum in (2.40) is zero. In other words, we bound the maximum ratio of the log alphabet sizes of the key and the messages such that the key can be successfully generated in the omniscient helper problem. Since this ratio is supremized as the key rate and the communication rates tend to zero, such a converse bound may also be called a zero-rate converse. The bound is only asymptotically tight in the case of abundant correlated sources but limited communication rates.

• We consider only the omniscient helper special case of the one-communicator problem, where $X = Y^m$. Actually, the analysis should work as long as each $Y_j$ is a (deterministic) function of $X$, but let us consider $X = Y^m$ for simplicity.

We remark that the above two limitations will be remedied by techniques introduced in Chapters 3 and 4, respectively.
Now, the rate region (2.40) suggests that we should find a correspondence between linear combinations of mutual informations and a certain functional inequality. Suppose that the random variables $Y_1, \dots, Y_m$ and $c_1, \dots, c_m \in (0,\infty)$ satisfy

$$\mathbb{E}\left[\prod_{l=1}^{m} f_l(Y_l)\right] \le \prod_{l=1}^{m} \|f_l(Y_l)\|_{\frac{1}{c_l}} \tag{2.44}$$

for all bounded real-valued measurable functions $f_l$ defined on $\mathcal{Y}_l$, $l = 1,\dots,m$. Then Theorem 2.2.3 shows that (2.44) is equivalent to

$$D(P_{Y^m}\|Q_{Y^m}) \ge \sum_{l=1}^{m} c_l D(P_{Y_l}\|Q_{Y_l}) \tag{2.45}$$

for all $P_{Y^m} \ll Q_{Y^m}$. To further relate (2.45) to the rate region formula (2.40), we need the following result:

Proposition 2.3.1 For given $Q_{Y^m}$ and $c_1, \dots, c_m \in [1,+\infty)$, (2.45) holds for all $P_{Y^m} \ll Q_{Y^m}$ if and only if

$$I(U;Y^m) \ge \sum_{l=1}^{m} c_l I(U;Y_l) \tag{2.46}$$

holds for all $P_{U|Y^m}$.
Proof The (2.45)$\Rightarrow$(2.46) part follows immediately from the definition of mutual information. The less trivial direction (2.46)$\Rightarrow$(2.45) can be shown by constructing a random variable $U$ that takes two values; see for example [74].

In view of the rate region formula (2.40), a rate tuple $(R, R_1, \dots, R_m)$ is not achievable if

$$R < \sum_{l=1}^{m} c_l (R - R_l) \tag{2.47}$$

holds for some $c_1, \dots, c_m$ satisfying the assumption in Proposition 2.3.1. Conversely, if $r_1, \dots, r_m \in (0,\infty)$ satisfy the property that $1 \ge \sum_{l=1}^{m} c_l(1 - r_l)$ for all $(c_1, \dots, c_m)$ such that the assumption in Proposition 2.3.1 is satisfied, then for any $\delta > 0$ there exists an achievable $(R, R_1, R_2, \dots, R_m)$ such that $\frac{R_l}{R} < r_l + \delta$ for each $l = 1,\dots,m$. In other words, the set of admissible $(c_1, \dots, c_m)$ completely characterizes the set of achievable ratios $\frac{R_1}{R}, \dots, \frac{R_m}{R}$.
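For intuition, (2.45) can be spot-checked numerically in a concrete case. For the doubly symmetric binary source with correlation $\rho$, the hypercontractivity literature indicates that $c_1 = c_2 = \frac{1}{1+\rho}$ is an admissible choice; the sketch below tests the entropic inequality on random joint distributions under that assumption (a numerical illustration, not a proof).

```python
import numpy as np

# Spot-check of (2.45) for a doubly symmetric binary source Q_{Y1 Y2}:
#   D(P_{Y1Y2}||Q_{Y1Y2}) >= c * ( D(P_{Y1}||Q_{Y1}) + D(P_{Y2}||Q_{Y2}) )
# with c = 1/(1+rho), an admissible constant suggested by the
# hypercontractivity literature (assumed here, not derived).
rng = np.random.default_rng(0)
rho = 0.5
Q = np.array([[1 + rho, 1 - rho],
              [1 - rho, 1 + rho]]) / 4.0     # DSBS(rho), uniform marginals
c = 1.0 / (1.0 + rho)

def D(p, q):
    """Relative entropy with the convention 0 log 0 = 0."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

worst = np.inf
for _ in range(2000):
    P = rng.random((2, 2)); P /= P.sum()     # a random joint distribution
    gap = D(P, Q) - c * (D(P.sum(1), Q.sum(1)) + D(P.sum(0), Q.sum(0)))
    worst = min(worst, gap)
print(worst >= -1e-9)
```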
We are now ready to prove a zero-rate one-shot converse for the omniscient helper problem. Suppose the (possibly stochastic) encoder for the public messages is specified by $P_{W^m|Y^m}$ and the (possibly stochastic) decoder for the key is given by $\prod_{l=1}^{m} P_{K_l|Y_l W_l}$. Let $T_{K^m}$ be the correct distribution, under which $K_1 = K_2 = \dots = K_m$ is equiprobably distributed on $\mathcal{K}$. Clearly, a small total variation $|P_{K^m} - T_{K^m}|$ implies both uniformity of the key distribution and a small probability of key disagreement; in fact, in the stationary memoryless case it can be shown that this is asymptotically equivalent to imposing both the entropy constraint (2.37) and the error constraint (2.39).
Theorem 2.3.2 In the omniscient helper problem, if $Y^m$ and $c_1, \dots, c_m$ satisfy the condition in Proposition 2.3.1, then

$$\frac{1}{2}|P_{K^m} - T_{K^m}| \ge 1 - \frac{1}{|\mathcal{K}|} - \left[|\mathcal{K}|\prod_{l=1}^{m}\left(\frac{|\mathcal{W}_l|}{|\mathcal{K}|}\right)^{c_l}\right]^{\frac{1}{\sum c_l}}. \tag{2.48}$$
Proof For any $k \in \mathcal{K}$,

$$\mathbb{P}\left[\bigcap_{l=1}^{m}\{K_l = k\}\right] \tag{2.49}$$

$$= \int_{\mathcal{Y}^m} \sum_{w^m} \prod_{l=1}^{m} P_{K_l=k|Y_l W_l=w_l}\, P_{W^m|Y^m}\,\mathrm{d}P_{Y^m} \tag{2.50}$$

$$\le \int_{\mathcal{Y}^m} \max_{w^m} \prod_{l=1}^{m} P_{K_l=k|Y_l W_l=w_l}\,\mathrm{d}P_{Y^m} \tag{2.51}$$

$$\le \int_{\mathcal{Y}^m} \prod_{l=1}^{m} \max_{w_l} P_{K_l=k|Y_l W_l=w_l}\,\mathrm{d}P_{Y^m} \tag{2.52}$$

$$\le \prod_{l=1}^{m} \left[\int_{\mathcal{Y}_l} \left(\max_{w_l} P_{K_l=k|Y_l W_l=w_l}\right)^{\frac{1}{c_l}}\mathrm{d}P_{Y_l}\right]^{c_l} \tag{2.53}$$

$$\le \prod_{l=1}^{m} \left[\int_{\mathcal{Y}_l} \max_{w_l} P_{K_l=k|Y_l W_l=w_l}\,\mathrm{d}P_{Y_l}\right]^{c_l} \tag{2.54}$$

$$\le \prod_{l=1}^{m} \left[\sum_{w_l} \int_{\mathcal{Y}_l} P_{K_l=k|Y_l W_l=w_l}\,\mathrm{d}P_{Y_l}\right]^{c_l} \tag{2.55}$$

where

• (2.53) uses the definition of hypercontractivity;

• (2.54) uses $0 < c_l < 1$ and $\max_{w_l} P_{K_l=k|Y_l W_l=w_l} \le 1$.
Raising both sides of (2.55) to the power of $\frac{1}{\sum_{i=1}^{m} c_i}$, we obtain

$$\mathbb{P}\left[\bigcap_{l=1}^{m}\{K_l = k\}\right]^{\frac{1}{\sum_i c_i}} \le \prod_{l=1}^{m}\left[\sum_{w_l}\int_{\mathcal{Y}_l} P_{K_l=k|Y_l W_l=w_l}\,\mathrm{d}P_{Y_l}\right]^{\frac{c_l}{\sum_i c_i}}. \tag{2.56}$$
But the function $t^m \mapsto \prod_{l=1}^{m} t_l^{\frac{c_l}{\sum_i c_i}}$ is a concave function on $[0,\infty)^m$, so by Jensen's inequality,

$$\frac{1}{|\mathcal{K}|}\sum_{k=1}^{|\mathcal{K}|}\prod_{l=1}^{m}\left[\sum_{w_l}\int_{\mathcal{Y}_l} P_{K_l=k|Y_l W_l=w_l}\,\mathrm{d}P_{Y_l}\right]^{\frac{c_l}{\sum_i c_i}} \tag{2.57}$$

$$\le \prod_{l=1}^{m}\left[\sum_{w_l}\int_{\mathcal{Y}_l}\frac{1}{|\mathcal{K}|}\sum_{k=1}^{|\mathcal{K}|} P_{K_l=k|Y_l W_l=w_l}\,\mathrm{d}P_{Y_l}\right]^{\frac{c_l}{\sum_i c_i}} \tag{2.58}$$

$$= \prod_{l=1}^{m}\left[\sum_{w_l}\int_{\mathcal{Y}_l}\frac{1}{|\mathcal{K}|}\,\mathrm{d}P_{Y_l}\right]^{\frac{c_l}{\sum_i c_i}} \tag{2.59}$$

$$= \prod_{l=1}^{m}\left(\frac{|\mathcal{W}_l|}{|\mathcal{K}|}\right)^{\frac{c_l}{\sum_i c_i}}. \tag{2.60}$$
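The concavity claim behind the Jensen step can itself be spot-checked numerically: the exponents $\frac{c_l}{\sum_i c_i}$ sum to one, which makes the weighted geometric mean concave. A minimal sketch (the constants $c_l$ below are arbitrary):

```python
import numpy as np

# Midpoint-concavity check for t -> prod_l t_l^{c_l / sum_i c_i} on [0, inf)^m.
# The exponents theta_l = c_l / sum_i c_i sum to 1, which is what makes the
# weighted geometric mean concave. The c_l values are arbitrary.
rng = np.random.default_rng(0)
c = np.array([0.6, 0.7, 0.4])
theta = c / c.sum()
F = lambda t: np.prod(t ** theta)

for _ in range(1000):
    t1, t2 = rng.random(3), rng.random(3)
    assert F(0.5 * (t1 + t2)) >= 0.5 * F(t1) + 0.5 * F(t2) - 1e-12
print("midpoint concavity holds on all sampled pairs")
```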
Combining (2.56) and (2.60), we obtain

$$\frac{1}{|\mathcal{K}|}\sum_{k=1}^{|\mathcal{K}|}\mathbb{P}\left[\bigcap_{l=1}^{m}\{K_l = k\}\right]^{\frac{1}{\sum_i c_i}} \le \prod_{l=1}^{m}\left(\frac{|\mathcal{W}_l|}{|\mathcal{K}|}\right)^{\frac{c_l}{\sum_i c_i}}. \tag{2.61}$$
Now, if we can show the following bound

$$\frac{1}{2}|P_{K^m} - T_{K^m}| = \frac{1}{2}\left[\sum_{k=1}^{|\mathcal{K}|}\left|\mathbb{P}\left[\bigcap_{l=1}^{m}\{K_l = k\}\right] - \frac{1}{|\mathcal{K}|}\right| + 1 - \sum_{k=1}^{|\mathcal{K}|}\mathbb{P}\left[\bigcap_{l=1}^{m}\{K_l = k\}\right]\right] \tag{2.62}$$

$$\ge 1 - \frac{1}{|\mathcal{K}|} - |\mathcal{K}|^{\frac{1}{\sum_l c_l}-1}\sum_{k=1}^{|\mathcal{K}|}\mathbb{P}\left[\bigcap_{l=1}^{m}\{K_l = k\}\right]^{\frac{1}{\sum_l c_l}} \tag{2.63}$$

then the proof is finished by combining (2.61) and (2.63).
To finish the proof we need to find a lower bound on (2.62) in terms of the left side of (2.61). Consider the following optimization problem over nonnegative vectors $a^{|\mathcal{K}|}$:

$$\text{minimize } f(a^{|\mathcal{K}|}) := \sum_{k=1}^{|\mathcal{K}|}\left|\frac{1}{|\mathcal{K}|} - a_k\right|^{+} \tag{2.64}$$

$$\text{subject to } g(a^{|\mathcal{K}|}) := \frac{1}{|\mathcal{K}|}\sum_{k=1}^{|\mathcal{K}|} a_k^{\frac{1}{\sum_l c_l}} \le \lambda \tag{2.65}$$

where $\lambda > 0$ is some constant. Notice that if $a^{|\mathcal{K}|}$ and $b^{|\mathcal{K}|}$ satisfy

$$\frac{1}{|\mathcal{K}|}\sum_{k=1}^{|\mathcal{K}|} a_k = \frac{1}{|\mathcal{K}|}\sum_{k=1}^{|\mathcal{K}|} b_k \tag{2.66}$$

and, for each $k$, $a_k - \frac{1}{|\mathcal{K}|}$ and $b_k - \frac{1}{|\mathcal{K}|}$ have the same sign, then (because $\frac{1}{\sum c_l} \le 1$)

$$f\!\left(\tfrac{1}{2}(a^{|\mathcal{K}|} + b^{|\mathcal{K}|})\right) = \tfrac{1}{2}\left(f(a^{|\mathcal{K}|}) + f(b^{|\mathcal{K}|})\right) \tag{2.67}$$

$$g\!\left(\tfrac{1}{2}(a^{|\mathcal{K}|} + b^{|\mathcal{K}|})\right) \ge \tfrac{1}{2}\left(g(a^{|\mathcal{K}|}) + g(b^{|\mathcal{K}|})\right). \tag{2.68}$$

This implies that an optimizer $a^{|\mathcal{K}|}$ of (2.64) satisfies

$$\left|\left\{k : a_k \in \left(0, \tfrac{1}{|\mathcal{K}|}\right)\right\}\right| \le 1. \tag{2.69}$$

Moreover, for any $a^{|\mathcal{K}|}$, define $b_k := a_k 1\{a_k < \tfrac{1}{|\mathcal{K}|}\} + \tfrac{1}{|\mathcal{K}|} 1\{a_k \ge \tfrac{1}{|\mathcal{K}|}\}$; then

$$f(b^{|\mathcal{K}|}) = f(a^{|\mathcal{K}|}) \tag{2.70}$$

$$g(b^{|\mathcal{K}|}) \le g(a^{|\mathcal{K}|}). \tag{2.71}$$

Thus $a^{|\mathcal{K}|}$ can further be assumed to be of the following form:

$$a^{|\mathcal{K}|} = \left[\frac{1}{|\mathcal{K}|}, \frac{1}{|\mathcal{K}|}, \dots, \frac{1}{|\mathcal{K}|}, \eta, 0, 0, \dots, 0\right] \tag{2.72}$$

for some $0 < \eta \le \frac{1}{|\mathcal{K}|}$, and then it is elementary to show that the optimal value of (2.64) satisfies

$$f \ge 1 - \lambda|\mathcal{K}|^{\frac{1}{\sum_l c_l}} - \frac{1}{|\mathcal{K}|}. \tag{2.73}$$

Thus we have shown that

$$\frac{1}{2}|P_{K^m} - T_{K^m}| \ge 1 - \frac{1}{|\mathcal{K}|} - |\mathcal{K}|^{\frac{1}{\sum_l c_l}-1}\sum_{k=1}^{|\mathcal{K}|}\mathbb{P}[K_1 = \dots = K_m = k]^{\frac{1}{\sum_l c_l}} \tag{2.74}$$

as desired.
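The elementary last step (2.64)-(2.73) can be verified numerically: for any nonnegative vector $a^{|\mathcal{K}|}$, the bound $f(a^{|\mathcal{K}|}) \ge 1 - \lambda|\mathcal{K}|^{1/\sum_l c_l} - 1/|\mathcal{K}|$ with $\lambda = g(a^{|\mathcal{K}|})$ should hold. The sketch below tests random instances; the alphabet size and the value of $\sum_l c_l$ are arbitrary choices.

```python
import numpy as np

# Numerical check of the optimization bound (2.64)-(2.73): for nonnegative a,
#   f(a) = sum_k (1/K - a_k)_+   and   lam = g(a) = (1/K) sum_k a_k**(1/s),
# one should have  f(a) >= 1 - lam * K**(1/s) - 1/K  whenever s = sum_l c_l >= 1.
rng = np.random.default_rng(0)
K, s = 16, 1.7                         # arbitrary |K| and sum of the c_l
for _ in range(2000):
    scale = rng.choice([0.01, 1.0 / K, 1.0])
    a = scale * rng.random(K)          # a random nonnegative vector
    lam = np.mean(a ** (1.0 / s))      # g(a) as in (2.65)
    f = np.sum(np.clip(1.0 / K - a, 0.0, None))   # objective (2.64)
    assert f >= 1.0 - lam * K ** (1.0 / s) - 1.0 / K - 1e-12
print("bound (2.73) holds on all sampled instances")
```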
Remark 2.3.2 Theorem 2.3.2 is stronger than a conventional converse by Fano's inequality (which proves exactly the converse part of (2.40)) in the following senses:

• The conventional converse fails to bound the ratio of the number of common randomness bits to the communication bits if the communication rate is zero. In contrast, Theorem 2.3.2 still gives a tight bound on this ratio in the zero-rate case, where the log size of the key alphabet grows sub-linearly in the blocklength. In fact, as long as

$$\log|\mathcal{K}| - \sum_{l=1}^{m} c_l\left(\log|\mathcal{K}| - \log|\mathcal{W}_l|\right) \to -\infty \tag{2.75}$$

which is weaker than (2.47), Theorem 2.3.2 implies that $\frac{1}{2}|P_{K^m} - T_{K^m}|$ converges to 1.

• Even if (2.47) holds, the conventional Fano-based converse does not guarantee that the error probability tends to 1.
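A quick numerical illustration of the first bullet: under a toy scaling satisfying (2.75) (the exponents below are our own choice, not from the thesis), the right side of (2.48) indeed approaches 1 as the blocklength grows.

```python
import numpy as np

# Illustration of Remark 2.3.2: evaluate the right side of (2.48) under a toy
# scaling that satisfies (2.75). Here m = 2, c_1 = c_2 = 0.6 (so sum c_l = 1.2),
# log|K| = sqrt(n) log 2 and log|W_l| = 0.1 sqrt(n) log 2 -- arbitrary choices.
c = np.array([0.6, 0.6])
bounds = []
for n in [1e2, 1e4, 1e6]:
    logK = np.sqrt(n) * np.log(2.0)
    logW = 0.1 * np.sqrt(n) * np.log(2.0) * np.ones_like(c)
    logbr = logK - np.sum(c * (logK - logW))   # log of the bracket in (2.48)
    bounds.append(1.0 - np.exp(-logK) - np.exp(logbr / c.sum()))
print(bounds)                                  # increases toward 1
assert bounds[0] < bounds[1] < bounds[2] and bounds[2] > 0.999
```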
2.4 Dual Formulation of a Forward-Reverse Brascamp-Lieb Inequality

In this section we introduce a new type of functional inequality, which may be called the "forward-reverse Brascamp-Lieb inequality", and prove its equivalent entropic formulation. The forward-reverse Brascamp-Lieb inequality is a generalization of both the forward inequality (which we discussed in Section 2.2) and the reverse Brascamp-Lieb inequality. The proof of the equivalent formulation of the forward-reverse inequality is more technically demanding because of the "reverse" component. Readers who are only interested in applications to network information theory may skip this section. However, as shown in Section 2.5, the forward-reverse inequality is useful in providing a unified treatment of almost all duality results in the literature, previously proved by various other methods.
To begin with, let us review a common form of the reverse Brascamp-Lieb inequality, again from Barthe's paper [88]:

Theorem 2.4.1 Let $E, E_1, \dots, E_m$ be Euclidean spaces, and let $B_i\colon E \to E_i$ be linear maps. Let $(c_i)_{i=1}^{m}$ and $F$ be positive real numbers. The reverse Brascamp-Lieb inequality⁶

$$\int \sup_{(y_i)\colon \sum_{i=1}^{m} c_i B_i^{*} y_i = x}\ \prod_{i=1}^{m} f_i(y_i)\,\mathrm{d}x \ge F\prod_{i=1}^{m}\|f_i\|_{\frac{1}{c_i}}, \tag{2.76}$$

for all nonnegative measurable functions $f_i$ on $E_i$, $i = 1,\dots,m$, holds if and only if it holds for all centered Gaussian functions.
As in the forward case, in this chapter we are only interested in the form of the reverse Brascamp-Lieb inequality rather than the Gaussian optimality result. In particular, we consider general (possibly nonlinear) mappings between general (possibly non-Euclidean) spaces.

⁶$B_i^{*}$ denotes the adjoint of $B_i$. In other words, the matrix of $B_i^{*}$ is the transpose of the matrix of $B_i$.
Lehec [103, Theorem 18] essentially proved that (2.76) is implied by an entropic
inequality, but did not prove the converse implication (which, as we shall see, is the
more challenging direction).
2.4.1 Some Mathematical Preliminaries
Before we introduce the new forward-reverse inequality, some mathematical notations
and preliminaries are necessary: this is because we will prove the “functional implies
entropic” part of the equivalent formulation result, which is much more sophisticated
than the case of the forward inequality. Some regularity assumptions on the alphabets
and the measures appear to be necessary, and some definitions (e.g. that of a random
transformation) are slightly different than the other sections of the thesis in order
to make the proof work (although of course these definitions are equivalent for finite
alphabets).
Throughout this thesis, when the forward-reverse inequality is considered, we always assume that the alphabets are Polish spaces and that the measures are Borel measures⁷. Of course, this covers the cases where the alphabet is Euclidean or discrete (endowed with the Hamming metric, which induces the discrete topology, making every function on the discrete set continuous), among others. Readers interested in finite alphabets only may refer to the (much simpler) argument in [80] based on the KKT condition. However, the proof here for the general Polish space case better manifests the "function-measure duality" viewpoint emphasized in this thesis.

Notation 1 Let $\mathcal{X}$ be a topological space.

⁷A Polish space is a complete separable metric space. It enjoys several nice properties that we use heavily in this section, including the Prokhorov theorem and the Riesz-Kakutani theorem (the latter is related to the fact that every Borel probability measure on a Polish space is inner regular, hence a Radon measure). Short introductions to Polish spaces can be found in e.g. [104][105].
• $C_c(\mathcal{X})$ denotes the space of continuous functions on $\mathcal{X}$ with compact support;

• $C_0(\mathcal{X})$ denotes the space of all continuous functions $f$ on $\mathcal{X}$ that vanish at infinity (i.e. for any $\varepsilon > 0$ there exists a compact set $K \subseteq \mathcal{X}$ such that $|f(x)| < \varepsilon$ for $x \in \mathcal{X}\setminus K$);

• $C_b(\mathcal{X})$ denotes the space of bounded continuous functions on $\mathcal{X}$;

• $M(\mathcal{X})$ denotes the space of finite signed Borel measures on $\mathcal{X}$;

• $P(\mathcal{X})$ denotes the space of probability measures on $\mathcal{X}$.
We consider $C_c$, $C_0$ and $C_b$ as topological vector spaces, with the topology induced by the sup norm. The following theorem, usually attributed to Riesz, Markov and Kakutani, is well known in functional analysis and can be found in, e.g., [67][106].

Theorem 2.4.2 (Riesz-Markov-Kakutani) If $\mathcal{X}$ is a locally compact, $\sigma$-compact Polish space, the dual⁸ of both $C_c(\mathcal{X})$ and $C_0(\mathcal{X})$ is $M(\mathcal{X})$.
Remark 2.4.1 The dual space of $C_b(\mathcal{X})$ can be strictly larger than $M(\mathcal{X})$, since it also contains those linear functionals that depend on the "limit at infinity" of a function $f \in C_b(\mathcal{X})$ (originally defined for those $f$ that do have a limit at infinity, and then extended to the whole of $C_b(\mathcal{X})$ by the Hahn-Banach theorem; see e.g. [67]).
Of course, any $\mu \in M(\mathcal{X})$ is a continuous linear functional on $C_0(\mathcal{X})$ or $C_c(\mathcal{X})$, given by

$$f \mapsto \int f\,\mathrm{d}\mu \tag{2.77}$$

where $f$ is a function in $C_0(\mathcal{X})$ or $C_c(\mathcal{X})$. Remarkably, Theorem 2.4.2 states that the converse is also true under mild regularity assumptions on the space.

⁸The dual of a topological vector space consists of all continuous linear functionals on that space, which is, naturally, also a topological vector space (with the weak$^*$ topology).

Thus, we
can view measures as continuous linear functionals on a certain function space;⁹ this justifies the shorthand notation

$$\mu(f) := \int f\,\mathrm{d}\mu \tag{2.78}$$

which we use frequently in the rest of the thesis. This viewpoint is the most natural one for our setting, since in the proof of the equivalent formulation of the forward-reverse Brascamp-Lieb inequality we shall use the Hahn-Banach theorem to show the existence of certain linear functionals.
Definition 2.4.1 Let $\Lambda\colon C_b(\mathcal{X}) \to (-\infty,+\infty]$ be a lower semicontinuous, proper convex function. Its Legendre-Fenchel transform $\Lambda^{*}\colon C_b^{*}(\mathcal{X}) \to (-\infty,+\infty]$ is given by

$$\Lambda^{*}(\ell) := \sup_{u \in C_b(\mathcal{X})}\left[\ell(u) - \Lambda(u)\right]. \tag{2.79}$$

Let $\nu$ be a nonnegative finite Borel measure on a Polish space $\mathcal{X}$. The relative entropy has the following definition: for any $\mu \in M(\mathcal{X})$,

$$D(\mu\|\nu) := \sup_{f \in C_b(\mathcal{X})}\left[\mu(f) - \Lambda(f)\right] \tag{2.80}$$

where we have defined the convex functional on $C_b(\mathcal{X})$:

$$\Lambda(f) := \log\nu(\exp(f)) \tag{2.81}$$

$$= \log\int\exp(f)\,\mathrm{d}\nu. \tag{2.82}$$

The definition (2.80) agrees with the definition (2.3) when $\nu$ is a probability measure, by the Donsker-Varadhan formula (cf. [105, Lemma 6.2.13]). If $\mu$ is not a probability measure, then $D(\mu\|\nu)$ as defined in (2.80) is $+\infty$.

⁹In fact, some authors prefer to construct measure theory by defining a measure as a linear functional on a suitable function space; see Lax [67] or Bourbaki [68].
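On a finite alphabet, definition (2.80) can be checked against the direct formula for relative entropy: the supremum is attained at $f^{*} = \log\frac{\mathrm{d}\mu}{\mathrm{d}\nu}$, and no other $f$ exceeds it. A small numerical sketch, with an unnormalized $\nu$ as the definition allows:

```python
import numpy as np

# Check of (2.80): D(mu||nu) = sup_f [ mu(f) - log nu(exp f) ], with the
# supremum attained at f* = log(dmu/dnu). nu is an unnormalized finite measure.
rng = np.random.default_rng(0)
mu = rng.random(6); mu /= mu.sum()       # a probability measure
nu = 2.5 * rng.random(6)                 # a finite nonnegative measure
D = np.sum(mu * np.log(mu / nu))         # direct formula for D(mu||nu)
obj = lambda f: np.sum(mu * f) - np.log(np.sum(nu * np.exp(f)))

assert abs(obj(np.log(mu / nu)) - D) < 1e-10   # the optimizer achieves D
for _ in range(500):                           # random f never exceed D
    assert obj(rng.standard_normal(6)) <= D + 1e-10
print("Donsker-Varadhan variational formula verified on this example")
```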
Given a bounded linear operator $T\colon C_b(\mathcal{Y}) \to C_b(\mathcal{X})$, the dual operator $T^{*}\colon C_b(\mathcal{X})^{*} \to C_b(\mathcal{Y})^{*}$ is defined by

$$T^{*}\mu_X\colon f \in C_b(\mathcal{Y}) \mapsto \mu_X(Tf), \tag{2.83}$$

for any $\mu_X \in C_b(\mathcal{X})^{*}$. Since $P(\mathcal{X}) \subseteq M(\mathcal{X}) \subseteq C_b(\mathcal{X})^{*}$, we can define a conditional expectation operator as any $T$ such that $T^{*}P \in P(\mathcal{Y})$ for any $P \in P(\mathcal{X})$. A random transformation $T^{*}$ is defined as the dual of some conditional expectation operator.
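On finite alphabets these definitions are concrete: $T$ averages a function over a channel $Q_{Y|X}$, and its dual $T^{*}$ pushes a distribution on $\mathcal{X}$ forward through the channel. A minimal sketch checking the duality relation (2.83); the dimensions and the channel are arbitrary choices.

```python
import numpy as np

# Finite-alphabet instance of (2.83): T is the conditional expectation operator
#   (Tf)(x) = E[f(Y) | X = x] = sum_y Q(y|x) f(y),
# and its dual T* sends P_X to the output distribution P_Y = P_X Q.
rng = np.random.default_rng(0)
nx, ny = 4, 3
Q = rng.random((nx, ny)); Q /= Q.sum(axis=1, keepdims=True)  # channel Q_{Y|X}
T = lambda f: Q @ f                      # conditional expectation operator

P = rng.random(nx); P /= P.sum()         # P in P(X)
Pout = P @ Q                             # T* P, a distribution on Y
f = rng.standard_normal(ny)

assert abs(P @ T(f) - Pout @ f) < 1e-12  # duality: (T*P)(f) = P(Tf)
assert abs(Pout.sum() - 1.0) < 1e-12     # T* maps P(X) into P(Y)
print("T* acts as the pushforward by the channel, as (2.83) requires")
```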
Remark 2.4.2 From the viewpoint of category theory (see for example [107][108]), $C_b$ is a functor from the category of topological spaces to the category of topological vector spaces. It is contravariant because for any continuous $\phi\colon \mathcal{X} \to \mathcal{Y}$ (a morphism between topological spaces), we have $C_b(\phi)\colon C_b(\mathcal{Y}) \to C_b(\mathcal{X})$, $u \mapsto u \circ \phi$, where $u \circ \phi$ denotes the composition of two continuous functions; that is, the functor reverses the arrows (the morphisms). On the other hand, $M$ is a covariant functor, and $M(\phi)\colon M(\mathcal{X}) \to M(\mathcal{Y})$, $\mu \mapsto \mu \circ \phi^{-1}$, where $\mu \circ \phi^{-1}(B) := \mu(\phi^{-1}(B))$ for any Borel measurable $B \subseteq \mathcal{Y}$. "Duality" itself is a contravariant functor between the categories of topological vector spaces (note the reversal of arrows in Fig. 2.1). Moreover, $C_b(\mathcal{X})^{*} = M(\mathcal{X})$ and $C_b(\phi)^{*} = M(\phi)$ if $\mathcal{X}$ and $\mathcal{Y}$ are compact metric spaces and $\phi\colon \mathcal{X} \to \mathcal{Y}$ is continuous. Definition 2.4.2 below can therefore be viewed as the special case where $\phi$ is the projection map:
Definition 2.4.2 Suppose $\phi\colon \mathcal{Z}_1 \times \mathcal{Z}_2 \to \mathcal{Z}_1$, $(z_1, z_2) \mapsto z_1$ is the projection onto the first coordinate.

• $C_b(\phi)\colon C_b(\mathcal{Z}_1) \to C_b(\mathcal{Z}_1 \times \mathcal{Z}_2)$ is called a canonical map, whose action is almost trivial: it sends a function of $z_1$ to itself, viewed as a function of $(z_1, z_2)$.

• $M(\phi)\colon M(\mathcal{Z}_1 \times \mathcal{Z}_2) \to M(\mathcal{Z}_1)$ is called marginalization, which simply takes a joint distribution to a marginal distribution.
Next, recall the Fenchel-Rockafellar duality from convex analysis (see [104, Theorem 1.9], or [94] in the case of finite-dimensional vector spaces), which usually refers to the $k = 1$ special case of the following result:
Theorem 2.4.3 Assume that $A$ is a topological vector space whose dual is $A^{*}$. Let $\Theta_j\colon A \to \mathbb{R}\cup\{+\infty\}$, $j = 0, 1, \dots, k$, for some positive integer $k$. Suppose there exist some $(u_j)_{j=1}^{k}$ and $u_0 := -(u_1 + \dots + u_k)$ such that

$$\Theta_j(u_j) < \infty, \quad j = 0,\dots,k \tag{2.84}$$

and $\Theta_0$ is upper semicontinuous at $u_0$. Then

$$-\inf_{\ell \in A^{*}}\left[\sum_{j=0}^{k}\Theta_j^{*}(\ell)\right] = \inf_{u_1,\dots,u_k \in A}\left[\Theta_0\left(-\sum_{j=1}^{k} u_j\right) + \sum_{j=1}^{k}\Theta_j(u_j)\right]. \tag{2.85}$$
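In one dimension with $k = 1$, the two sides of (2.85) can be computed numerically on a grid; the quadratics below are arbitrary choices satisfying the hypotheses.

```python
import numpy as np

# Grid check of (2.85) with A = R, k = 1, Theta0(u) = u^2/2, Theta1(u) = (u-1)^2/2.
# Analytically both sides equal 1/4 (attained at u = 1/2 and l = -1/2).
u = np.linspace(-5.0, 5.0, 20001)
theta0 = 0.5 * u ** 2
theta1 = 0.5 * (u - 1.0) ** 2
conj = lambda theta, l: np.max(l * u - theta)  # Legendre-Fenchel transform on the grid

# Left side: -inf_l [ Theta0*(l) + Theta1*(l) ]
lhs = -min(conj(theta0, l) + conj(theta1, l) for l in np.linspace(-3.0, 3.0, 601))
# Right side: inf_u [ Theta0(-u) + Theta1(u) ]
rhs = np.min(0.5 * (-u) ** 2 + 0.5 * (u - 1.0) ** 2)

assert abs(lhs - 0.25) < 1e-4 and abs(rhs - 0.25) < 1e-9
print("strong duality (2.85) verified numerically:", lhs, rhs)
```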
For completeness, we provide a proof of this result, which is based on the Hahn-Banach theorem (Theorem 2.4.4) and is similar to the proof of [104, Theorem 1.9].

Proof Let $m_0$ be the right side of (2.85). The $\le$ part of (2.85) follows trivially from the (weak) min-max inequality, since

$$m_0 = \inf_{u_0,\dots,u_k \in A}\ \sup_{\ell \in A^{*}}\left\{\sum_{j=0}^{k}\Theta_j(u_j) - \ell\left(\sum_{j=0}^{k} u_j\right)\right\} \tag{2.86}$$

$$\ge \sup_{\ell \in A^{*}}\ \inf_{u_0,\dots,u_k \in A}\left\{\sum_{j=0}^{k}\Theta_j(u_j) - \ell\left(\sum_{j=0}^{k} u_j\right)\right\} \tag{2.87}$$

$$= -\inf_{\ell \in A^{*}}\left[\sum_{j=0}^{k}\Theta_j^{*}(\ell)\right]. \tag{2.88}$$
It remains to prove the $\ge$ part, and it suffices to assume without loss of generality that $m_0 > -\infty$. Note that (2.84) also implies that $m_0 < +\infty$. Define the convex sets

$$C_j := \{(u, r) \in A\times\mathbb{R} : r > \Theta_j(u)\}, \quad j = 0,\dots,k; \tag{2.89}$$

$$D := \{(0, m) \in A\times\mathbb{R} : m \le m_0\}. \tag{2.90}$$

Observe that these are nonempty sets by the assumption (2.84). Also, $C_0$ has nonempty interior by the assumption that $\Theta_0$ is upper semicontinuous at $u_0$. Thus, the Minkowski sum

$$C := C_0 + \dots + C_k \tag{2.91}$$

is a convex set with a nonempty interior. Moreover, $C \cap D = \emptyset$. By the Hahn-Banach theorem (Theorem 2.4.4), there exists $(\ell, s) \in A^{*}\times\mathbb{R}$ such that

$$sm \le \ell\left(\sum_{j=0}^{k} u_j\right) + s\sum_{j=0}^{k} r_j \tag{2.92}$$

for any $m \le m_0$ and $(u_j, r_j) \in C_j$, $j = 0,\dots,k$. From (2.90) we see that (2.92) can only hold when $s \ge 0$. Moreover, from (2.84) and the upper semicontinuity of $\Theta_0$ at $u_0$, we see that the $\sum_{j=0}^{k} u_j$ in (2.92) can take values in a neighbourhood of $0 \in A$; hence $s \neq 0$. Thus, dividing both sides of (2.92) by $s$ and setting $\ell \leftarrow -\frac{\ell}{s}$, we see that

$$m_0 \le \inf_{u_0,\dots,u_k \in A}\left[-\ell\left(\sum_{j=0}^{k} u_j\right) + \sum_{j=0}^{k}\Theta_j(u_j)\right] \tag{2.93}$$

$$= -\left[\sum_{j=0}^{k}\Theta_j^{*}(\ell)\right] \tag{2.94}$$

which establishes the $\ge$ part of (2.85).
Theorem 2.4.4 (Hahn-Banach) Let $C$ and $D$ be convex, nonempty, disjoint subsets of a topological vector space $A$. If the interior of $C$ is nonempty, then there exists $\ell \in A^{*}$, $\ell \neq 0$, such that

$$\sup_{u \in D}\ell(u) \le \inf_{u \in C}\ell(u). \tag{2.95}$$

Remark 2.4.3 The assumption in Theorem 2.4.4 that $C$ has nonempty interior is only necessary in the infinite-dimensional case. However, even if $A$ in Theorem 2.4.3 is finite-dimensional, the assumption in Theorem 2.4.3 that $\Theta_0$ is upper semicontinuous at $u_0$ is still necessary, because this assumption was used not only in applying the Hahn-Banach theorem, but also in concluding that $s \neq 0$ in (2.92).
2.4.2 Compact $\mathcal{X}$

We first state a duality theorem for the case of compact alphabets to streamline the proof. Later we show that the argument can be extended to a particular non-compact case.¹⁰ Our proof, based on the Legendre-Fenchel duality (Theorem 2.4.3), was inspired by the proof of the Kantorovich duality in the theory of optimal transportation (see [104, Chapter 1], where the idea is credited to Brenier).

¹⁰Theorem 2.4.5 is not included in the conference paper [80], but was announced in the conference presentation.

Theorem 2.4.5 (Dual formulation of the forward-reverse Brascamp-Lieb inequality) Assume that

• $m$ and $l$ are positive integers, $d \in \mathbb{R}$, and $\mathcal{X}$ is a compact metric space (hence also a Polish space);

• for each $i = 1,\dots,l$, $b_i \in (0,\infty)$, $\nu_i$ is a finite Borel measure on a Polish space $\mathcal{Z}_i$, and $Q_{Z_i|X} = S_i^{*}$ is a random transformation;

• for each $j = 1,\dots,m$, $c_j \in (0,\infty)$, $\mu_j$ is a finite Borel measure on a Polish space $\mathcal{Y}_j$, and $Q_{Y_j|X} = T_j^{*}$ is a random transformation;

• for any $(P_{Z_i})$ such that $D(P_{Z_i}\|\nu_i) < \infty$, $i = 1,\dots,l$, there exists $P_X \in \bigcap_i (S_i^{*})^{-1} P_{Z_i}$ such that $\sum_{j=1}^{m} c_j D(P_{Y_j}\|\mu_j) < \infty$, where $P_{Y_j} := T_j^{*} P_X$.

Then the following two statements are equivalent:

1. If nonnegative continuous functions $(g_i)$, $(f_j)$ are bounded away from 0 and such that

$$\sum_{i=1}^{l} b_i S_i \log g_i \le \sum_{j=1}^{m} c_j T_j \log f_j \tag{2.96}$$

then (see (2.78) for the notation of the integral)

$$\prod_{i=1}^{l}\left[\nu_i(g_i)\right]^{b_i} \le \exp(d)\prod_{j=1}^{m}\left[\mu_j(f_j)\right]^{c_j}. \tag{2.97}$$

2. For any $(P_{Z_i})$ such that $D(P_{Z_i}\|\nu_i) < \infty$¹¹, $i = 1,\dots,l$,

$$\sum_{i=1}^{l} b_i D(P_{Z_i}\|\nu_i) + d \ge \inf_{P_X}\ \sum_{j=1}^{m} c_j D(P_{Y_j}\|\mu_j) \tag{2.98}$$

where the infimum is over $P_X$ such that $S_i^{*} P_X = P_{Z_i}$, $i = 1,\dots,l$, and $P_{Y_j} := T_j^{*} P_X$, $j = 1,\dots,m$.

¹¹Of course, this assumption is not essential (once we adopt the convention that the infimum in (2.98) is $+\infty$ when it runs over an empty set).

Proof We can safely assume $d = 0$ below without loss of generality (since otherwise we can substitute $\mu_1 \leftarrow \exp\left(\frac{d}{c_1}\right)\mu_1$).
Figure 2.1: Diagrams for Theorem 2.4.5 (shown for $l = m = 2$): the maps $S_i\colon C_b(\mathcal{Z}_i) \to C_b(\mathcal{X})$ and $T_j\colon C_b(\mathcal{Y}_j) \to C_b(\mathcal{X})$ act on the function spaces (containing $g_i$ and $f_j$), while the dual maps $S_i^{*}\colon P(\mathcal{X}) \to P(\mathcal{Z}_i)$ and $T_j^{*}\colon P(\mathcal{X}) \to P(\mathcal{Y}_j)$ act on the spaces of probability measures (sending $P_X$ to $P_{Z_i}$ and $P_{Y_j}$).
1)⇒2) This is the nontrivial direction, which relies on certain (strong) min-max type results. In Theorem 2.4.3, put

$$\Theta_0\colon u \in C_b(\mathcal{X}) \mapsto \begin{cases} 0 & u \le 0; \\ +\infty & \text{otherwise.}\end{cases} \tag{2.99}$$

Then

$$\Theta_0^{*}\colon \pi \in M(\mathcal{X}) \mapsto \begin{cases} 0 & \pi \ge 0; \\ +\infty & \text{otherwise.}\end{cases} \tag{2.100}$$

For each $j = 1,\dots,m$, set

$$\Theta_j(u) = c_j\inf\log\mu_j\left(\exp\left(\frac{1}{c_j}v\right)\right) \tag{2.101}$$

where the infimum is over $v \in C_b(\mathcal{Y}_j)$ such that $u = T_j v$; if there is no such $v$, then $\Theta_j(u) := +\infty$ as a convention. Observe that

• $\Theta_j$ is convex;

• $\Theta_j(u) > -\infty$ for any $u \in C_b(\mathcal{X})$. If otherwise, for any $P_X$ and $P_{Y_j} := T_j^{*} P_X$ we would have

$$D(P_{Y_j}\|\mu_j) = \sup_{v}\{P_{Y_j}(v) - \log\mu_j(\exp(v))\} \tag{2.102}$$

$$= \sup_{v}\{P_X(T_j v) - \log\mu_j(\exp(v))\} \tag{2.103}$$

$$= \sup_{u \in C_b(\mathcal{X})}\left\{P_X(u) - \frac{1}{c_j}\Theta_j(c_j u)\right\} \tag{2.104}$$

$$= +\infty \tag{2.105}$$

which contradicts the assumption in the theorem;

• From the steps (2.102)-(2.104), we see that $\Theta_j^{*}(\pi) = c_j D(T_j^{*}\pi\|\mu_j)$ for any $\pi \in M(\mathcal{X})$, where the definition of $D(\cdot\|\mu_j)$ is extended using the Donsker-Varadhan formula (that is, it is infinite when the argument is not a probability measure).

Finally, for the given $(P_{Z_i})_{i=1}^{l}$, choose

$$\Theta_{m+1}\colon u \in C_b(\mathcal{X}) \mapsto \begin{cases}\displaystyle\sum_{i=1}^{l} P_{Z_i}(w_i) & \text{if } u = \sum_{i=1}^{l} S_i w_i \text{ for some } w_i \in C_b(\mathcal{Z}_i); \\ +\infty & \text{otherwise.}\end{cases} \tag{2.106}$$

Notice that

• $\Theta_{m+1}$ is convex;
• $\Theta_{m+1}$ is well-defined (that is, the choice of $(w_i)$ in (2.106) is inconsequential). Indeed, if $(w_i)_{i=1}^{l}$ is such that $\sum_{i=1}^{l} S_i w_i = 0$, then

$$\sum_{i=1}^{l} P_{Z_i}(w_i) = \sum_{i=1}^{l} S_i^{*} P_X(w_i) \tag{2.107}$$

$$= \sum_{i=1}^{l} P_X(S_i w_i) \tag{2.108}$$

$$= 0, \tag{2.109}$$

where $P_X$ is such that $S_i^{*} P_X = P_{Z_i}$, $i = 1,\dots,l$, whose existence is guaranteed by the assumption of the theorem. This also shows that $\Theta_{m+1} > -\infty$.

• We have

$$\Theta_{m+1}^{*}(\pi) := \sup_{u}\{\pi(u) - \Theta_{m+1}(u)\} \tag{2.110}$$

$$= \sup_{w_1,\dots,w_l}\left\{\pi\left(\sum_{i=1}^{l} S_i w_i\right) - \sum_{i=1}^{l} P_{Z_i}(w_i)\right\} \tag{2.111}$$

$$= \sup_{w_1,\dots,w_l}\left\{\sum_{i=1}^{l} S_i^{*}\pi(w_i) - \sum_{i=1}^{l} P_{Z_i}(w_i)\right\} \tag{2.112}$$

$$= \begin{cases} 0 & \text{if } S_i^{*}\pi = P_{Z_i},\ i = 1,\dots,l; \\ +\infty & \text{otherwise.}\end{cases} \tag{2.113}$$
Invoking Theorem 2.4.3 (where the $u_j$ in Theorem 2.4.3 can be chosen as the constant function $u_j \equiv 1$, $j = 1,\dots,m+1$):

$$\inf_{\pi:\ \pi\ge 0,\ S_i^{*}\pi = P_{Z_i}}\ \sum_{j=1}^{m} c_j D(T_j^{*}\pi\|\mu_j) \tag{2.114}$$

$$= -\inf_{v^m, w^l:\ \sum_{j=1}^{m} T_j v_j + \sum_{i=1}^{l} S_i w_i \ge 0}\left[\sum_{j=1}^{m} c_j\log\mu_j\left(\exp\left(\frac{1}{c_j}v_j\right)\right) + \sum_{i=1}^{l} P_{Z_i}(w_i)\right]. \tag{2.115}$$

Note that the left side of (2.115) is exactly the right side of (2.98). For any $\varepsilon > 0$, choose $v_j \in C_b(\mathcal{Y}_j)$, $j = 1,\dots,m$ and $w_i \in C_b(\mathcal{Z}_i)$, $i = 1,\dots,l$, such that $\sum_{j=1}^{m} T_j v_j + \sum_{i=1}^{l} S_i w_i \ge 0$ and

$$\varepsilon - \sum_{j=1}^{m} c_j\log\mu_j\left(\exp\left(\frac{1}{c_j}v_j\right)\right) - \sum_{i=1}^{l} P_{Z_i}(w_i) > \inf_{\pi:\ \pi\ge 0,\ S_i^{*}\pi = P_{Z_i}}\ \sum_{j=1}^{m} c_j D(T_j^{*}\pi\|\mu_j). \tag{2.116}$$

Now, invoking (2.97) with $f_j := \exp\left(\frac{1}{c_j}v_j\right)$, $j = 1,\dots,m$, and $g_i := \exp\left(-\frac{1}{b_i}w_i\right)$, $i = 1,\dots,l$, we upper bound the left side of (2.116) by

$$\varepsilon - \sum_{i=1}^{l} b_i\log\nu_i(g_i) + \sum_{i=1}^{l} b_i P_{Z_i}(\log g_i) \le \varepsilon + \sum_{i=1}^{l} b_i D(P_{Z_i}\|\nu_i) \tag{2.117}$$

where the last step follows by the Donsker-Varadhan formula. Therefore (2.98) is established, since $\varepsilon > 0$ is arbitrary.
2)⇒1) Since $\nu_i$ is finite and $g_i$ is bounded by assumption, we have $\nu_i(g_i) < \infty$, $i = 1,\dots,l$. Moreover, (2.97) is trivially true when $\nu_i(g_i) = 0$ for some $i$, so we will assume below that $\nu_i(g_i) \in (0,\infty)$ for each $i$. Define $P_{Z_i}$ by

$$\frac{\mathrm{d}P_{Z_i}}{\mathrm{d}\nu_i} = \frac{g_i}{\nu_i(g_i)}, \quad i = 1,\dots,l. \tag{2.118}$$

Then for any $\varepsilon > 0$,

$$\sum_{i=1}^{l} b_i\log\nu_i(g_i) = \sum_{i=1}^{l} b_i\left[P_{Z_i}(\log g_i) - D(P_{Z_i}\|\nu_i)\right] \tag{2.119}$$

$$< \sum_{j=1}^{m} c_j P_{Y_j}(\log f_j) + \varepsilon - \sum_{j=1}^{m} c_j D(P_{Y_j}\|\mu_j) \tag{2.120}$$

$$\le \varepsilon + \sum_{j=1}^{m} c_j\log\mu_j(f_j) \tag{2.121}$$

where

• (2.120) uses the Donsker-Varadhan formula, and we have chosen $P_X$, $P_{Y_j} := T_j^{*} P_X$, $j = 1,\dots,m$, such that

$$\sum_{i=1}^{l} b_i D(P_{Z_i}\|\nu_i) > \sum_{j=1}^{m} c_j D(P_{Y_j}\|\mu_j) - \varepsilon; \tag{2.122}$$

• (2.121) also follows from the Donsker-Varadhan formula.

The result follows since $\varepsilon > 0$ can be arbitrary.
Remark 2.4.4 The infimum in (2.98) is in fact achieved: for any $(P_{Z_i})$, there exists a $P_X$ that minimizes $\sum_{j=1}^{m} c_j D(P_{Y_j}\|\mu_j)$ subject to the constraints $S_i^{*} P_X = P_{Z_i}$, $i = 1,\dots,l$, where $P_{Y_j} := T_j^{*} P_X$, $j = 1,\dots,m$. Indeed, since the singleton $\{P_{Z_i}\}$ is weak$^*$-closed and $S_i^{*}$ is weak$^*$-continuous¹², the set $\bigcap_{i=1}^{l}(S_i^{*})^{-1} P_{Z_i}$ is weak$^*$-closed in $M(\mathcal{X})$; hence its intersection with $P(\mathcal{X})$ is weak$^*$-compact in $P(\mathcal{X})$, because $P(\mathcal{X})$ is weak$^*$-compact by (a simple version, for the setting of a compact underlying space $\mathcal{X}$, of) the Prokhorov theorem [109]. Moreover, by the weak$^*$-lower semicontinuity of $D(\cdot\|\mu_j)$ (easily seen from the variational formula/Donsker-Varadhan formula for the relative entropy, cf. [43]) and the weak$^*$-continuity of $T_j^{*}$, $j = 1,\dots,m$, we see that $\sum_{j=1}^{m} c_j D(T_j^{*} P_X\|\mu_j)$ is weak$^*$-lower semicontinuous in $P_X$, and hence the existence of a minimizing $P_X$ is established.

Remark 2.4.5 Abusing the terminology from min-max theory, Theorem 2.4.5 may be interpreted as a "strong duality" result, which establishes the equivalence of two optimization problems. The 1)⇒2) part is the nontrivial direction, which requires regularity of the spaces. In contrast, the 2)⇒1) direction can be thought of as a "weak duality", which establishes only a partial relation but holds for more general spaces.

¹²Generally, if $T\colon A \to B$ is a continuous map between two topological vector spaces, then $T^{*}\colon B^{*} \to A^{*}$ is a weak$^*$-continuous map between the dual spaces. Indeed, if $y_n \to y$ is a weak$^*$-convergent sequence in $B^{*}$, meaning $y_n(b) \to y(b)$ for any $b \in B$, then we must have $T^{*}y_n(a) = y_n(Ta) \to y(Ta) = T^{*}y(a)$ for any $a \in A$, meaning that $T^{*}y_n$ converges to $T^{*}y$ in the weak$^*$ topology.
Remark 2.4.6 The equivalent formulations of the forward Brascamp-Lieb inequality (Theorem 2.2.3) can be recovered from Theorem 2.4.5 by taking $l = 1$, $\mathcal{Z}_1 = \mathcal{X}$, and letting $S_1$ be the identity map/isomorphism, except that Theorem 2.2.3 is established for completely general alphabets. In other words, the forward Brascamp-Lieb inequality is the special case of the forward-reverse Brascamp-Lieb inequality in which there is only one reverse channel, and that channel is the identity.
2.4.3 Noncompact $\mathcal{X}$

Our proof of 1)⇒2) in Theorem 2.4.5 makes use of the Hahn-Banach theorem, and hence relies crucially on the fact that the measure space is the dual of the function space. Naively, one might want to extend the proof to the case of locally compact $\mathcal{X}$ by considering $C_0(\mathcal{X})$ instead of $C_b(\mathcal{X})$, so that the dual space is still $M(\mathcal{X})$. However, this would not work: consider the case when $\mathcal{X} = \mathcal{Z}_1\times\dots\times\mathcal{Z}_l$ and each $S_i$ is the canonical map. Then $\Theta_{m+1}(u)$ as defined in (2.106) is $+\infty$ unless $u \equiv 0$, and thus $\Theta_{m+1}^{*} \equiv 0$. Luckily, we can still work with $C_b(\mathcal{X})$; in this case $\ell \in C_b(\mathcal{X})^{*}$ may not be a measure, but we can decompose it as $\ell = \pi + R$, where $\pi \in M(\mathcal{X})$ and $R$ is a linear functional "supported at infinity". Below we use the techniques in [104, Chapter 1.3] to prove a particular extension of Theorem 2.4.5 to a non-compact case.
Theorem 2.4.6 Theorem 2.4.5 still holds if

• the assumption that $\mathcal{X}$ is a compact metric space is relaxed to the assumption that it is a locally compact and $\sigma$-compact Polish space;

• $\mathcal{X} = \prod_{i=1}^{l}\mathcal{Z}_i$ and $S_i\colon C_b(\mathcal{Z}_i) \to C_b(\mathcal{X})$, $i = 1,\dots,l$, are canonical maps (see Definition 2.4.2).
Proof The proof of the "weak duality" part 2)⇒1) still works in the noncompact case, so we only need to explain what changes need to be made in the proof of the 1)⇒2) part. Let $\Theta_0$ be defined as before, in (2.99). Then for any $\ell \in C_b(\mathcal{X})^{*}$,

$$\Theta_0^{*}(\ell) = \sup_{u \le 0}\ell(u) \tag{2.123}$$

which is 0 if $\ell$ is nonnegative (in the sense that $\ell(u) \ge 0$ for every $u \ge 0$), and $+\infty$ otherwise. This means that when computing the infimum on the left side of (2.85), we only need to take into account the nonnegative $\ell$.

Next, let $\Theta_{m+1}$ also be defined as before. Then, directly from the definition, we have, for any $\ell \in C_b^{*}(\mathcal{X})$,

$$\Theta_{m+1}^{*}(\ell) = \begin{cases} 0 & \text{if } \ell\left(\sum_i S_i w_i\right) = \sum_i P_{Z_i}(w_i),\ \forall w_i \in C_b(\mathcal{Z}_i),\ i = 1,\dots,l; \\ +\infty & \text{otherwise.}\end{cases} \tag{2.124}$$

In general, the condition in the first line of (2.124) does not imply that $\ell$ is a measure. However, if $\ell$ is also nonnegative, then using a technical result in [104, Lemma 1.25] we can simplify further:

$$\Theta_{m+1}^{*}(\ell) = \begin{cases} 0 & \text{if } \ell \in M(\mathcal{X}) \text{ and } S_i^{*}\ell = P_{Z_i},\ i = 1,\dots,l; \\ +\infty & \text{otherwise.}\end{cases} \tag{2.125}$$

This further shows that when we compute the left side of (2.85), the infimum can be taken over $\ell$ which is a coupling of $(P_{Z_i})$. In particular, if $\ell$ is a probability measure, then $\Theta_j^{*}(\ell) = c_j D(T_j^{*}\ell\|\mu_j)$ still holds with the $\Theta_j$ defined in (2.101), $j = 1,\dots,m$. Thus the rest of the proof can proceed as before.
Remark 2.4.7 The second assumption is made in order to achieve (2.125) in the proof.

Remark 2.4.8 In [80] we studied a version of the "reverse Brascamp-Lieb inequality" which is the special case of Theorem 2.4.6 with only one forward channel: in the setting of Theorem 2.4.5, consider $m = 1$, $c_1 = 1$, $\mathcal{X} = \mathcal{Z}_1\times\dots\times\mathcal{Z}_l$. Let $S_i$ be the canonical map and $g_i \leftarrow g_i^{\frac{1}{b_i}}$, $i = 1,\dots,l$. Then (2.97) becomes

$$\prod_{i=1}^{l}\|g_i\|_{\frac{1}{b_i}} \le \exp(d)\,\mu_1(f_1) \tag{2.126}$$

for any nonnegative continuous $(g_i)_{i=1}^{l}$ and $f_1$ bounded away from 0 and $+\infty$ such that

$$\sum_{i=1}^{l}\log g_i(z_i) \le \mathbb{E}[\log f_1(Y_1)|Z^l = z^l], \quad \forall z^l. \tag{2.127}$$

Note that (2.127) can be simplified in the deterministic special case: let $\phi\colon \mathcal{X} \to \mathcal{Y}_1$ be any continuous function, and let $T_1 \leftarrow C_b(\phi)$ (that is, $T_1$ sends a function $f$ on $\mathcal{Y}_1$ to the function $f \circ \phi$ on $\mathcal{X}$). Then (2.127) becomes

$$\prod_{i=1}^{l} g_i(z_i) \le f_1(\phi(z_1,\dots,z_l)), \quad \forall z_1,\dots,z_l. \tag{2.128}$$

Then the optimal choice of $f_1$ admits an explicit formula, since for any given $(g_i)$, to verify (2.126) we only need to consider

$$f_1(y) := \sup_{\phi(z_1,\dots,z_l) = y}\left\{\prod_i g_i(z_i)\right\}, \quad \forall y. \tag{2.129}$$

Thus, when $\phi$ is a linear function, (2.126) is essentially Barthe's formulation of the reverse BL inequality (2.76) (the exception being that Theorem 2.4.6, in contrast to (2.76), restricts attention to finite $\nu_i$, $i = 1,\dots,l$, and $\mu_1$). The more straightforward part of the duality (entropic inequality ⇒ functional inequality) has essentially been proved by Lehec [103, Theorem 18] in a special setting.
2.4.4 Extension to General Convex Functionals
For certain applications (e.g. the transportation-cost inequalities, see Section 2.5.7
ahead), we may be interested in convex functionals beyond the relative entropy. Recall
the definition of the Legendre-Fenchel transform in Definition 2.4.1. Then we have,
from the theory of convex analysis (see for example [105, Lemma 4.5.8]),
$$
\Lambda(u) = \sup_{\ell \in C_b(\mathcal{X})^*} [\ell(u) - \Lambda^*(\ell)], \tag{2.130}
$$
for any $u \in C_b(\mathcal{X})$. Moreover, if $\Lambda^*(\ell) = +\infty$ for any $\ell \notin \mathcal{P}(\mathcal{X})$, then from (2.130) we must also have
$$
\Lambda(u) = \sup_{\ell \in \mathcal{P}(\mathcal{X})} [\ell(u) - \Lambda^*(\ell)]. \tag{2.131}
$$
For example, the function Λ defined in (2.82) satisfies the property in (2.131). We
need this property in the proof of Theorem 2.4.5 because of step (2.119). From the
proof of Theorem 2.4.5 we see that we can obtain the following generalization to
convex functionals with no additional cost. An application of this generalization to
transportation-cost inequalities is given in Section 2.5.7.
Theorem 2.4.7 Assume that

• $m$ and $l$ are positive integers, $d \in \mathbb{R}$, $\mathcal{X}$ is a compact metric space (hence also a Polish space);

• For each $i = 1,\dots,l$, $\mathcal{Z}_i$ is a Polish space, $\Lambda_i \colon C_b(\mathcal{Z}_i) \to \mathbb{R} \cup \{+\infty\}$ is proper convex such that $\Lambda_i^*(\ell) = +\infty$ for $\ell \notin \mathcal{P}(\mathcal{Z}_i)$, and $S_i \colon C_b(\mathcal{Z}_i) \to C_b(\mathcal{X})$ is a conditional expectation operator;

• For each $j = 1,\dots,m$, $\mathcal{Y}_j$ is a Polish space, $\Theta_j \colon C_b(\mathcal{Y}_j) \to \mathbb{R} \cup \{+\infty\}$ is proper convex such that $\Theta_j(u) < \infty$ for some $u \in C_b(\mathcal{Y}_j)$ which is bounded below, and $T_j \colon C_b(\mathcal{Y}_j) \to C_b(\mathcal{X})$ is a conditional expectation operator;

• For any $\ell_{Z_i} \in \mathcal{M}(\mathcal{Z}_i)$ such that $\Lambda_i^*(\ell_{Z_i}) < \infty$, $i = 1,\dots,l$, there exists $\ell_X \in \bigcap_i (S_i^*)^{-1} \ell_{Z_i}$ such that $\sum_{j=1}^m \Theta_j^*(\ell_{Y_j}) < \infty$, where $\ell_{Y_j} := T_j^* \ell_X$.

Then the following two statements are equivalent:

1. If $g_i \in C_b(\mathcal{Z}_i)$, $f_j \in C_b(\mathcal{Y}_j)$, $i = 1,\dots,l$, $j = 1,\dots,m$ satisfy
$$
\sum_{i=1}^l S_i g_i \le \sum_{j=1}^m T_j f_j \tag{2.132}
$$
then
$$
\sum_{i=1}^l \Lambda_i(g_i) \le \sum_{j=1}^m \Theta_j(f_j). \tag{2.133}
$$

2. For any^{13} $\ell_{Z_i} \in \mathcal{M}(\mathcal{Z}_i)$, $i = 1,\dots,l$,
$$
\sum_{i=1}^l \Lambda_i^*(\ell_{Z_i}) \ge \inf_{\ell_X} \sum_{j=1}^m \Theta_j^*(\ell_{Y_j}) \tag{2.134}
$$
where the infimum is over $\ell_X$ such that $S_i^* \ell_X = \ell_{Z_i}$, $i = 1,\dots,l$, and $\ell_{Y_j} := T_j^* \ell_X$, $j = 1,\dots,m$.

Remark 2.4.9 Just like Theorem 2.4.6, it is possible to extend Theorem 2.4.7 to the case of noncompact $\mathcal{X}$, provided that $\mathcal{X} = \mathcal{Z}_1 \times \cdots \times \mathcal{Z}_l$ and $S_i$, $i = 1,\dots,l$ are canonical maps.

^{13} Since by assumption $\Lambda_i^*(\ell_{Z_i}) = +\infty$ when $\ell_{Z_i} \notin \mathcal{P}(\mathcal{Z}_i)$, in which case (2.134) is trivially true, it is equivalent to assume here that $\ell_{Z_i} \in \mathcal{P}(\mathcal{Z}_i)$.
2.5 Some Special Cases of the Forward-Reverse
Brascamp-Lieb Inequality
In this section we discuss some notable special cases of the duality results for the
forward-reverse Brascamp-Lieb Inequality (Theorems 2.2.3-2.4.7). The relationship
among the involved inequalities is illustrated in Figure 2.2. The equivalent formulations of the Rényi divergence, the strong data processing inequality, hypercontractivity and its reverse (with positive or negative parameters), the Loomis-Whitney inequality/Shearer's lemma, and the transportation-cost inequalities have been proved by different methods (see for example [110][7][111][112][74][79][113]). In some of these examples (e.g. strong data processing [7]) the previous approaches rely heavily on the finiteness of the alphabet, whereas the present approach (essentially based on the nonnegativity of the relative entropy) is simpler and holds for general alphabets.
Figure 2.2: The "partial relations" between the inequalities discussed in this chapter with respect to implication. [Diagram relating: forward-reverse Brascamp-Lieb (2.97); forward Brascamp-Lieb (2.14); reverse Brascamp-Lieb (2.126); strong data processing inequality (2.144); hypercontractivity (2.161); reverse hypercontractivity with positive parameters (2.163); reverse hypercontractivity with one negative parameter (2.165).]
Due in part to their utility in establishing impossibility bounds, these functional
inequalities have attracted a lot of attention in information theory [114][115][116]
[117][118][101][73][119], theoretical computer science [71][120][121][122][86], and
statistics [123][124][125][72][126][127], to name only a small subset of the literature.
61
2.5.1 Variational Formula of Rényi Divergence

As the first example, we show how (2.14) recovers the variational formula of Rényi divergence [128][129] in a special case. A prototype of the variational formula of Rényi divergence appeared in the context of control theory [128] as a technical lemma. Its utility in information theory was then noticed by [129][110], which further developed the result and elaborated on its applications in other areas of probability theory.
Suppose $R$ and $Q$ are nonnegative measures on $(\mathcal{X}, \mathcal{F})$, $\alpha \in (0,1) \cup (1,\infty)$, and $g \colon \mathcal{X} \to \mathbb{R}$ is a bounded measurable function. Also, let $T$ be a probability measure such that $R, Q \ll T$. Define the Rényi divergence
$$
D_\alpha(Q\|R) := \frac{1}{\alpha-1} \log\left(E\left[\exp\left(\alpha\, \imath_{Q\|T}(X) + (1-\alpha)\, \imath_{R\|T}(X)\right)\right]\right) \tag{2.135}
$$
where $X \sim T$; this definition is independent of the particular choice of the reference measure $T$ [130]. Then the variational formula of Rényi divergence [129, Remark 2.2] can be equivalently stated as the functional inequality^{14}
$$
\frac{1}{\alpha-1} \log E[\exp((\alpha-1) g(\bar{X}))] - \frac{1}{\alpha} \log E[\exp(\alpha g(X))] \le \frac{1}{\alpha} D_\alpha(Q\|R) \tag{2.136}
$$
where $X \sim R$ and $\bar{X} \sim Q$, with equality achieved when
$$
\frac{dQ}{dR}(x) = \frac{\exp(g(x))}{E[\exp(g(X))]}. \tag{2.137}
$$
The well-known variational formula of the relative entropy (see e.g. [43]) can be recovered by taking $\alpha \to 1$. In the $\alpha \in (1,\infty)$ case, we can choose $\exp(g)$ to be the indicator function of an arbitrary measurable set $A \subseteq \mathcal{X}$, to obtain the logarithmic probability comparison bound (LPCB) [129]^{15}
$$
\frac{1}{\alpha-1} \log Q(A) - \frac{1}{\alpha} \log R(A) \le \frac{1}{\alpha} D_\alpha(Q\|R). \tag{2.138}
$$

^{14} Note that our definition of Rényi divergence differs from that of [129] by a factor of $\alpha$.
Now we give a new proof of the functional inequality (2.136) using a well-known entropic inequality in information theory. First consider $\alpha \in (1,\infty)$. In Theorem 2.2.3, set $m \leftarrow 1$, $\mathcal{Y}_1 = \mathcal{X}$, $c \leftarrow \frac{\alpha-1}{\alpha}$, $P_{Y_1|X} = \mathrm{id}$ (the identity mapping). We may assume without loss of generality that $Q \ll R$, since otherwise $D_\alpha(Q\|R) = \infty$ and (2.136) always holds. Thus, setting the cost function as
$$
d(x) \leftarrow -\imath_{Q\|R}(x) + \frac{\alpha-1}{\alpha} D_\alpha(Q\|R) \tag{2.139}
$$
we see that (2.15) reduces to
$$
D(P\|R) + E\left[-\imath_{Q\|R}(X)\right] + \frac{\alpha-1}{\alpha} D_\alpha(Q\|R) \ge \frac{\alpha-1}{\alpha} D(P\|R) \tag{2.140}
$$
which, by our convention in Remark 2.2.2, can be simplified to
$$
D(P\|Q) + \frac{\alpha-1}{\alpha} D_\alpha(Q\|R) \ge \frac{\alpha-1}{\alpha} D(P\|R). \tag{2.141}
$$
It is a well-known result that (2.141) holds for all $P$ absolutely continuous with respect to $Q$ and $R$ (see for example [130, Theorem 30], [131, Theorem 1], [132, Corollary 2]), due to its relation to the fundamental problem of characterizing the error exponents in binary hypothesis testing. By Theorem 2.2.3 with $m = 1$ we translate (2.140) into the functional inequality
$$
E\left[\exp\left(\log f(X) + \imath_{Q\|R}(X) - \frac{\alpha-1}{\alpha} D_\alpha(Q\|R)\right)\right] \le \left(E\left[f^{\frac{\alpha}{\alpha-1}}(X)\right]\right)^{\frac{\alpha-1}{\alpha}} \tag{2.142}
$$
for all nonnegative measurable $f$, where $X \sim R$. Finally, by taking logarithms, dividing both sides by $\alpha - 1$, and setting $f \leftarrow \exp((\alpha-1)g)$, the functional inequality (2.136) is recovered.

^{15} [129] focuses on the case of probability measures, but (2.138) continues to hold if $Q$ and $R$ are replaced by any unnormalized nonnegative measures.
Note that the choice of $d(\cdot)$ in (2.139) is simply for the purpose of change of measure, since the two relative entropy terms in (2.141) have different reference measures $Q$ and $R$. Thus, an alternative proof is to take $d(\cdot) = 0$ but invoke the extension of Theorem 2.2.3 in Remark 2.2.3 with $\mu \leftarrow Q$ and $\nu \leftarrow R$.

The case of $\alpha \in (0,1)$ can be proved in a similar fashion using Theorem 2.4.5 with $m \leftarrow 2$, $l \leftarrow 1$, $\mathcal{X} = \mathcal{Y}_1 = \mathcal{Y}_2 \leftarrow \mathcal{X}$, $Q_{Y_1|X} = Q_{Y_2|X} = \mathrm{id}$, $Q_{Y_1} \leftarrow Q$, $Q_{Y_2} \leftarrow R$, and $|\mathcal{Z}| = 1$; we omit the details here.

Note that the original proofs of the functional inequality (2.136) in [128][129] were based on Hölder's inequality, whereas the present proof relies on the duality between functional inequalities and entropic inequalities, and the property (2.141) (which amounts to the nonnegativity of relative entropy). Moreover, the weaker probability version (2.138) can be easily proved by a data processing argument; see for example [42, Section II.B][133].
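For finite alphabets, the variational formula (2.136) and the equality condition (2.137) are easy to check numerically. The following sketch (assuming NumPy; the random instances and variable names are ours, not from the text) verifies that random bounded $g$ never exceed the bound and that $g = \log\frac{dQ}{dR}$ attains it:

```python
import numpy as np

def renyi_div(q, r, alpha):
    # D_alpha(Q || R) in nats for strictly positive discrete distributions.
    return float(np.log(np.sum(q**alpha * r**(1 - alpha))) / (alpha - 1))

rng = np.random.default_rng(0)
alpha = 2.0
r = rng.random(5); r /= r.sum()   # X ~ R
q = rng.random(5); q /= q.sum()   # Xbar ~ Q

rhs = renyi_div(q, r, alpha) / alpha   # right side of (2.136)

def lhs(g):
    # left side of (2.136)
    t1 = np.log(np.sum(q * np.exp((alpha - 1) * g))) / (alpha - 1)
    t2 = np.log(np.sum(r * np.exp(alpha * g))) / alpha
    return float(t1 - t2)

for _ in range(100):               # (2.136) holds for random bounded g
    assert lhs(rng.normal(size=5)) <= rhs + 1e-12

g_star = np.log(q / r)             # the choice (2.137) attains equality
assert abs(lhs(g_star) - rhs) < 1e-9
```

The equality check reflects that with $g = \log\frac{dQ}{dR}$ we have $E_R[e^g] = 1$, so (2.137) holds exactly.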
2.5.2 Strong Data Processing Constant
The strong data processing inequality (SDPI) [7][8][9] has received considerable interest recently. It has proved fruitful in providing impossibility bounds in various problems; see [117] for a recent list of its applications. It generally refers to an inequality of the form
$$
D(P_X\|Q_X) \ge c\, D(P_Y\|Q_Y), \quad \text{for all } P_X \ll Q_X \tag{2.143}
$$
where $P_X \to Q_{Y|X} \to P_Y$, and we have fixed $Q_{XY} = Q_X Q_{Y|X}$. The conventional data processing inequality corresponds to the case of $c = 1$. The study of the best (largest) constant $c$ for (2.143) to hold can be traced to Ahlswede and Gács [7], who showed, among other things, its equivalence to the functional inequality
$$
E[\exp(E[\log f(Y)\,|\,X])] \le \|f\|_{1/c}, \quad \text{for all nonnegative } f. \tag{2.144}
$$
The strong data processing inequality can be viewed as a special case of the forward-reverse Brascamp-Lieb inequality where there is only one forward channel and one identity reverse channel. In other words, the equivalence between (2.143) and (2.144) can be readily seen from either Theorem 2.2.3 or Remark 2.4.8. The original proof of this equivalence [7, Theorem 5], on the other hand, relies on a limiting property of hypercontractivity, which, in turn, relies heavily on the finiteness of the alphabet, and the proof is quite technical even in that case.

As we saw in Section 2.5.1, a functional inequality often implies an inequality between probabilities of sets when specialized to indicator functions. In the case of (2.144), however, a more judicious choice is
$$
f(y) = \left(1_A(y) + Q_Y[A]\, 1_{\bar{A}}(y)\right)^c \tag{2.145}
$$
where $A$ is an arbitrary measurable subset of $\mathcal{Y}$ and $\bar{A} := \mathcal{Y} \setminus A$. Then using (2.144),
$$
\begin{aligned}
Q_Y^{c\varepsilon}[A]\, Q_X\left[x : Q_{Y|X=x}[A] \ge 1-\varepsilon\right]
&= Q_Y^{c\varepsilon}[A]\, Q_X\left[x : Q_{Y|X=x}[\bar{A}] \le \varepsilon\right] &&(2.146)\\
&\le \int \exp\left(\log Q_Y[A]^{\,c\, Q_{Y|X=x}[\bar{A}]}\right) dQ_X(x) &&(2.147)\\
&= E[\exp(E[\log f(Y)\,|\,X])] &&(2.148)\\
&\le E^c[f^{1/c}(Y)] &&(2.149)\\
&\le 2^c\, Q_Y^c[A]. &&(2.150)
\end{aligned}
$$
Rearranging, we obtain the following bound on conditional probabilities:
$$
Q_X\left[x : Q_{Y|X=x}[A] \ge 1-\varepsilon\right] \le 2^c\, Q_Y^{c(1-\varepsilon)}[A] \tag{2.151}
$$
which, by a blowing-up lemma argument (cf. [27]), would imply the asymptotic result of [27, Theorem 1], a useful tool in establishing strong converses in source coding problems. Note that [27, Section 2] proved a result essentially the same as (2.151) by working with (2.143) rather than (2.144).^{16}
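As a sanity check, (2.143) can be verified numerically for a binary symmetric channel with crossover $\delta$ and uniform input, for which the contraction coefficient of relative entropy is classically known to be $(1-2\delta)^2$, i.e., the best constant in (2.143) is $c = (1-2\delta)^{-2}$. A minimal sketch (assuming NumPy; the specific channel and grid are our choice, not from the text):

```python
import numpy as np

def kl(p, q):
    # Relative entropy D(p || q) in nats for strictly positive vectors.
    return float(np.sum(p * np.log(p / q)))

delta = 0.2                                   # BSC crossover probability
W = np.array([[1 - delta, delta],
              [delta, 1 - delta]])            # rows: Q_{Y|X=x}
qx = np.array([0.5, 0.5])                     # uniform Q_X
qy = qx @ W
c = (1 - 2 * delta) ** -2                     # best constant in (2.143), classical fact

for p in np.linspace(0.01, 0.99, 197):        # sweep input distributions P_X
    px = np.array([p, 1 - p])
    py = px @ W                               # P_X -> Q_{Y|X} -> P_Y
    assert kl(px, qx) >= c * kl(py, qy) - 1e-12
```

The infimum of the ratio $D(P_X\|Q_X)/D(P_Y\|Q_Y)$ is approached only as $P_X \to Q_X$, which is why the inequality is tight but never violated on the grid.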
2.5.3 Loomis-Whitney Inequality and Shearer’s Lemma
The duality between Loomis-Whitney Inequality and Shearer’s Lemma is yet another
special case of Theorem 2.2.3. This is already contained in the duality theorem of
Carlen and Cordero-Erausquin [76], but we briefly discuss it here.
The combinatorial Loomis-Whitney inequality [134, Theorem 2] says that if $A$ is a subset of $\mathcal{A}^m$, where $\mathcal{A}$ is a finite or countably infinite set, then
$$
|A| \le \prod_{j=1}^m |\pi_j(A)|^{\frac{1}{m-1}} \tag{2.152}
$$
where we defined the projection
$$
\pi_j \colon \mathcal{A}^m \to \mathcal{A}^{m-1}, \tag{2.153}
$$
$$
(x_1,\dots,x_m) \mapsto (x_1,\dots,x_{j-1},x_{j+1},\dots,x_m), \tag{2.154}
$$
for each $j \in \{1,\dots,m\}$. The combinatorial inequality (2.152) can be recovered from the following integral inequality: let $\mu$ be the counting measure on $\mathcal{A}^m$; then
$$
\int_{\mathcal{A}^m} \prod_{j=1}^m f_j(\pi_j(x))\, d\mu(x) \le \prod_{j=1}^m \|f_j\|_{m-1} \tag{2.155}
$$
for all nonnegative $f_j$'s, where the norm on the right side is with respect to the counting measure on $\mathcal{A}^{m-1}$. We claim that (2.155) is an inequality of the form (2.14). To see how (2.155) recovers (2.152), let $f_j$ be the indicator function of $\pi_j(A)$ for each $j$. Then the left side of (2.155) upper-bounds the left side of (2.152), while the right side of (2.155) is equal to the right side of (2.152). Now we invoke Remark 2.2.3 with $\mathcal{X} \leftarrow \mathcal{A}^m$, $\nu$ and $\mu_j$ being the counting measures on $\mathcal{A}^m$ and $\mathcal{A}^{m-1}$, respectively, $Q_{Y_j|X}$ being the projection mappings in (2.153)-(2.154), and let $(X_1,\dots,X_m)$ be distributed according to a given $P_X$, to obtain
$$
\begin{aligned}
-H(X_1,\dots,X_m) &= D(P_X\|\nu) &&(2.156)\\
&\ge \sum_{j=1}^m \frac{1}{m-1} D(P_{Y_j}\|\mu_j) &&(2.157)\\
&\ge -\sum_{j=1}^m \frac{1}{m-1} H(X_1,\dots,X_{j-1},X_{j+1},\dots,X_m) &&(2.158)
\end{aligned}
$$
where $H(\cdot)$ is the Shannon entropy. This is a special case of Shearer's Lemma [135][112].

^{16} Another difference is that [27, Theorem 1] involves an auxiliary r.v. $U$ with $|\mathcal{U}| \le 3$, where the cardinality bound comes from convexifying a subset in $\mathbb{R}^2$. Here (2.151) holds whenever (2.143) does, and (2.143) is slightly simpler in that it involves only relative entropy terms, because we are essentially working with the supporting lines of the convex hull, and the supporting line of a set is the same as the supporting line of its convex hull.
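The combinatorial bound (2.152) can be tested directly. A minimal sketch (plain Python standard library; the random subset and the box example are our choice), which also illustrates that (2.152) holds with equality for product sets:

```python
import itertools, math, random

def loomis_whitney_holds(points, m):
    # Check |A| <= prod_j |pi_j(A)|^(1/(m-1)) as in (2.152).
    rhs = 1.0
    for j in range(m):
        proj = {p[:j] + p[j + 1:] for p in points}   # pi_j(A)
        rhs *= len(proj) ** (1.0 / (m - 1))
    return len(points) <= rhs + 1e-9

random.seed(1)
m, n = 3, 6
universe = list(itertools.product(range(n), repeat=m))
A = set(random.sample(universe, 40))
assert loomis_whitney_holds(A, m)

# Equality for a box B1 x B2 x B3: |A| = 24 and the bound is exactly 24.
box = set(itertools.product(range(2), range(3), range(4)))
assert loomis_whitney_holds(box, m)
```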
Similarly, the continuous Loomis-Whitney inequality for the Lebesgue measure, that is,
$$
\int_{\mathbb{R}^m} \prod_{j=1}^m f_j(\pi_j(x))\, dx \le \prod_{j=1}^m \|f_j\|_{m-1} \tag{2.159}
$$
is the dual of a continuous version of Shearer's lemma involving differential entropies:
$$
h(X^m) \le \sum_{j=1}^m \frac{1}{m-1} h(X_1,\dots,X_{j-1},X_{j+1},\dots,X_m). \tag{2.160}
$$
2.5.4 Hypercontractivity

[Figure 2.3: Diagram for hypercontractivity (HC): $\mathcal{P}(\mathcal{Z}_1) \cong \mathcal{P}(\mathcal{Y}_1 \times \mathcal{Y}_2)$, with $T_1^*$ mapping to $\mathcal{P}(\mathcal{Y}_1)$ and $T_2^*$ mapping to $\mathcal{P}(\mathcal{Y}_2)$.]

In Theorem 2.4.5, take $l \leftarrow 1$, $m \leftarrow 2$, $b_1 \leftarrow 1$, $d \leftarrow 0$, $f_1 \leftarrow f_1^{1/c_1}$, $f_2 \leftarrow f_2^{1/c_2}$, $\nu_1 \leftarrow Q_{Y_1 Y_2}$, $\mu_1 \leftarrow Q_{Y_1}$, $\mu_2 \leftarrow Q_{Y_2}$. Also, put $Z_1 = X = (Y_1, Y_2)$, and let $T_1$ and $T_2$ be the canonical maps (Definition 2.4.2). We obtain the equivalence between^{17}
$$
\|f_1\|_{1/c_1} \|f_2\|_{1/c_2} \ge E[f_1(Y_1) f_2(Y_2)], \quad \forall f_1 \in L^{1/c_1}(Q_{Y_1}),\ f_2 \in L^{1/c_2}(Q_{Y_2}) \tag{2.161}
$$
and
$$
\forall P_{Y_1 Y_2}, \quad D(P_{Y_1 Y_2}\|Q_{Y_1 Y_2}) \ge c_1 D(P_{Y_1}\|Q_{Y_1}) + c_2 D(P_{Y_2}\|Q_{Y_2}). \tag{2.162}
$$
This equivalence can also be obtained from Theorem 2.2.3. By Hölder's inequality, (2.161) is equivalent to saying that the norm of the linear operator sending $f_1 \in L^{1/c_1}(Q_{Y_1})$ to $E[f_1(Y_1) \mid Y_2 = \cdot\,] \in L^{1/(1-c_2)}(Q_{Y_2})$ does not exceed 1. The interesting case is $\frac{1}{1-c_2} > \frac{1}{c_1}$, hence the name hypercontractivity. The equivalent formulation of hypercontractivity was shown in [74] using a different proof via the method of types/typicality, which relies on the finite nature of the alphabet. In contrast, the proof based on the nonnegativity of relative entropy removes this constraint, allowing one to prove Nelson's Gaussian hypercontractivity from the information-theoretic formulation (see Section 6.4.2).

^{17} By a standard dense-subspace argument, we see that it is inconsequential that $f_1$ and $f_2$ in (2.161) are not assumed to be continuous.
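The functional side (2.161) can be probed numerically on a doubly symmetric binary source with correlation $\rho$. By Bonami's classical two-point inequality, (2.161) holds in that setting whenever $\rho^2 \le \frac{(1-c_1)(1-c_2)}{c_1 c_2}$; this sufficient condition and the parameter values below are assumptions we bring in, not derived in the text above. A sketch assuming NumPy:

```python
import numpy as np

rho, c1, c2 = 0.5, 0.6, 0.6
assert rho**2 <= (1 - c1) * (1 - c2) / (c1 * c2)   # Bonami's condition

# Doubly symmetric binary source Q_{Y1 Y2}; marginals are uniform.
Q = np.array([[(1 + rho) / 4, (1 - rho) / 4],
              [(1 - rho) / 4, (1 + rho) / 4]])
q1, q2 = Q.sum(axis=1), Q.sum(axis=0)

def lp_norm(f, q, p):
    # ||f||_p with respect to the probability vector q, for f >= 0.
    return float(np.sum(q * f**p)) ** (1 / p)

rng = np.random.default_rng(0)
for _ in range(500):
    f1, f2 = rng.random(2), rng.random(2)          # nonnegative functions
    lhs = float(f1 @ Q @ f2)                       # E[f1(Y1) f2(Y2)]
    rhs = lp_norm(f1, q1, 1 / c1) * lp_norm(f2, q2, 1 / c2)
    assert lhs <= rhs + 1e-12                      # (2.161)
```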
2.5.5 Reverse Hypercontractivity (Positive Parameters)

[Figure 2.4: Diagram for reverse HC: $\mathcal{P}(\mathcal{Y}_1) \cong \mathcal{P}(\mathcal{Z}_1 \times \mathcal{Z}_2)$, with $S_1^*$ mapping to $\mathcal{P}(\mathcal{Z}_1)$ and $S_2^*$ mapping to $\mathcal{P}(\mathcal{Z}_2)$.]

In Theorem 2.4.5, take $l \leftarrow 2$, $m \leftarrow 1$, $c_1 \leftarrow 1$, $d \leftarrow 0$, $g_1 \leftarrow g_1^{1/b_1}$, $g_2 \leftarrow g_2^{1/b_2}$, $\mu_1 \leftarrow Q_{Z_1 Z_2}$, $\nu_1 \leftarrow Q_{Z_1}$, $\nu_2 \leftarrow Q_{Z_2}$. Also, put $Y_1 = X = (Z_1, Z_2)$, and let $S_1$ and $S_2$ be the canonical maps (Definition 2.4.2). Note that in the title, by "positive parameters" we mean that $b_1$ and $b_2$ in (2.164) are positive. We obtain the equivalence between
$$
\|g_1\|_{1/b_1} \|g_2\|_{1/b_2} \le E[g_1(Z_1) g_2(Z_2)], \quad \forall g_1 \in L^{1/b_1}(Q_{Z_1}),\ g_2 \in L^{1/b_2}(Q_{Z_2}) \tag{2.163}
$$
and
$$
\forall P_{Z_1}, P_{Z_2},\ \exists P_{Z_1 Z_2}, \quad D(P_{Z_1 Z_2}\|Q_{Z_1 Z_2}) \le b_1 D(P_{Z_1}\|Q_{Z_1}) + b_2 D(P_{Z_2}\|Q_{Z_2}). \tag{2.164}
$$
Note that the function $f_1$ in (2.97) disappears since its "optimal value" is easily computed in terms of $g_1$ and $g_2$ from (2.96) in this special setting. Moreover, in this set-up, if $\mathcal{Z}_1$ and $\mathcal{Z}_2$ are finite, then the condition in the last bullet in Theorem 2.4.5 is equivalent to $Q_{Z_1 Z_2} \ll Q_{Z_1} \times Q_{Z_2}$. The equivalent formulations of reverse hypercontractivity were observed in [78], where the proof is based on a method of types argument.
2.5.6 Reverse Hypercontractivity (One Negative Parameter)

[Figure 2.5: Diagram for reverse HC with one negative parameter: $\mathcal{P}(\mathcal{Y}_1) \cong \mathcal{P}(\mathcal{Z}_1 \times \mathcal{Y}_2)$, with $S_1^*$ mapping to $\mathcal{P}(\mathcal{Z}_1)$ and $T_2^*$ mapping to $\mathcal{P}(\mathcal{Y}_2)$.]

In Theorem 2.4.5, take $l \leftarrow 1$, $m \leftarrow 2$, $c_1 \leftarrow 1$, $d \leftarrow 0$, $g_1 \leftarrow g^{1/b_1}$, $f_2 \leftarrow f^{-1/c_2}$, $\mu_1 \leftarrow Q_{Z_1 Y_2}$, $\nu_1 \leftarrow Q_{Z_1}$, $\mu_2 \leftarrow Q_{Y_2}$. Also, put $Y_1 = X = (Z_1, Y_2)$, and let $S_1$ and $T_2$ be the canonical maps (Definition 2.4.2). Note that in the title, by "one negative parameter" we mean that $b_1$ is positive and $-c_2$ is negative in (2.166). We obtain the equivalence between
$$
\|f\|_{-1/c_2} \|g\|_{1/b_1} \le E[f(Y_2) g(Z_1)], \quad \forall f \in L^{-1/c_2}(Q_{Y_2}),\ g \in L^{1/b_1}(Q_{Z_1}) \tag{2.165}
$$
and
$$
\forall P_{Z_1},\ \exists P_{Z_1 Y_2}, \quad D(P_{Z_1 Y_2}\|Q_{Z_1 Y_2}) \le b_1 D(P_{Z_1}\|Q_{Z_1}) + (-c_2) D(P_{Y_2}\|Q_{Y_2}). \tag{2.166}
$$
Again, the function $f_1$ in (2.97) does not appear in (2.165) since its "optimal choice"
$$
f_1(z_1, y_2) = \frac{g_1^{b_1}(z_1)}{f_2^{c_2}(y_2)} = f(y_2)\, g(z_1), \quad \forall y_2, z_1 \tag{2.167}
$$
is easily computed in terms of $g_1$ and $f_2$ from (2.96) in this special setting. Inequality (2.165) is called reverse hypercontractivity with a negative parameter in [79], where an equivalent formulation is established for finite alphabets using the method of types. Multiterminal extensions of (2.165) and (2.166) (called reverse Brascamp-Lieb type inequalities with negative parameters in [79]) can also be recovered from Theorem 2.4.5 in the same fashion, i.e., we move all negative parameters to the other side of the inequality so that all parameters become positive.

In summary, from the viewpoint of Theorem 2.4.5, the results in Sections 2.5.4, 2.5.5 and 2.5.6 are degenerate special cases, in the sense that in any of the three cases the optimal choice of one of the functions in (2.97) can be explicitly expressed in terms of the other functions, hence this "hidden function" disappears in (2.161), (2.163) or (2.165).
2.5.7 Transportation-Cost Inequalities
Definition 2.5.1 (see for example [57]) We say that a probability measure $Q$ on a metric space $(\mathcal{Z}, d)$ satisfies the $T_p(\lambda)$ inequality, $p \in [1,\infty)$, $\lambda \in (0,\infty)$, if
$$
\inf_\pi E^{1/p}[d^p(X, Y)] \le \sqrt{2\lambda D(P\|Q)} \tag{2.168}
$$
for every $P \ll Q$, where the infimum is over all couplings $\pi$ of $P$ and $Q$, and $(X, Y) \sim \pi$.

It suffices to focus on the case of $\lambda = 1$, since results for general $\lambda \in (0,\infty)$ can usually be obtained by a scaling argument.
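Definition 2.5.1 can be made concrete in the simplest possible case: on the two-point Hamming space with $Q = \mathrm{Bern}(1/2)$, the $W_1$ distance to $\mathrm{Bern}(p)$ is $|p - 1/2|$, so $T_1(1/4)$ (and hence $T_1(1)$) reduces to Pinsker's inequality. A quick numerical check (assuming NumPy; this example is ours, not from the text):

```python
import numpy as np

def bin_kl(p, q):
    # D(Bern(p) || Bern(q)) in nats, with the convention 0 log 0 = 0.
    out = 0.0
    for a, b in ((p, q), (1 - p, 1 - q)):
        if a > 0:
            out += a * np.log(a / b)
    return out

q = 0.5
for p in np.linspace(0.0, 1.0, 101):
    w1 = abs(p - q)   # optimal-coupling cost under the Hamming metric
    # T_1(1/4) form of (2.168): W1 <= sqrt(2 * (1/4) * D(P||Q))
    assert w1 <= np.sqrt(2 * 0.25 * bin_kl(p, q)) + 1e-12
```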
As a consequence of Theorem 2.4.7 and Remark 2.4.9, we have

Corollary 2.5.1 Let $(\mathcal{Z}, d)$ be a locally compact, $\sigma$-compact Polish space.

(a) A probability measure $Q$ on $\mathcal{Z}$ satisfies the $T_2(1)$ inequality if and only if for any $f \in C_b(\mathcal{Z})$,
$$
\log Q\left(\exp\left(\inf_{z \in \mathcal{Z}}\left[f(z) + \frac{d^2(\cdot, z)}{2}\right]\right)\right) \le Q(f). \tag{2.169}
$$

(b) A probability measure $Q$ on $\mathcal{Z}$ satisfies the $T_p(1)$ inequality, $p \in [1,2)$, if and only if
$$
\log Q\left(\exp\left(t \inf_{z \in \mathcal{Z}}\left[f(z) + \frac{d^p(\cdot, z)}{p}\right]\right)\right) \le \left(\frac{1}{p} - \frac{1}{2}\right) t^{\frac{2}{2-p}} + t\, Q(f), \quad \forall t \in [0,\infty),\ f \in C_b(\mathcal{Z}). \tag{2.170}
$$
Proof (a) In Theorem 2.4.7, put $l = 2$, $m = 1$, $\mathcal{Z}_1 = \mathcal{Z}_2 \leftarrow \mathcal{Z}$, $\mathcal{X} = \mathcal{Y}_1 \leftarrow \mathcal{Z} \times \mathcal{Z}$, and
$$
\Lambda_1(u) := 2 \log Q\left(\exp\left(\frac{u}{2}\right)\right); \tag{2.171}
$$
$$
\Lambda_2(u) := Q(u); \tag{2.172}
$$
$$
\Theta_1(u) :=
\begin{cases}
0 & u \le d^2;\\
+\infty & \text{otherwise.}
\end{cases} \tag{2.173}
$$
For $\ell \in \mathcal{M}(\mathcal{X})$, we can compute
$$
\Lambda_1^*(\ell) = 2 D(\ell\|Q); \tag{2.174}
$$
$$
\Lambda_2^*(\ell) =
\begin{cases}
0 & \ell = Q;\\
+\infty & \text{otherwise;}
\end{cases} \tag{2.175}
$$
$$
\Theta_1^*(\ell) =
\begin{cases}
\ell(d^2) & \ell \ge 0;\\
+\infty & \text{otherwise.}
\end{cases} \tag{2.176}
$$
We also have $\Lambda_i^*(\ell) = +\infty$ for any $\ell \notin \mathcal{P}(\mathcal{X})$, $i = 1, 2$. Thus by Theorem 2.4.7, (2.169) is equivalent to the following: for any $f_1 \in C_b(\mathcal{Z}_1 \times \mathcal{Z}_2)$, $g_1 \in C_b(\mathcal{Z}_1)$, $g_2 \in C_b(\mathcal{Z}_2)$ such that $g_1 + g_2 \le f_1$, it holds that
$$
\Lambda_1(g_1) + \Lambda_2(g_2) \le \Theta_1(f_1). \tag{2.177}
$$
By the monotonicity of $\Lambda_1$, this is equivalent to
$$
\Lambda_1\left(\inf_z [d^2(\cdot, z) - g_2(z)]\right) + \Lambda_2(g_2) \le 0 \tag{2.178}
$$
for any $g_2 \in C_b(\mathcal{Z})$, which is the same as (2.169).

(b) The proof is similar to Part (a), except that we now pick
$$
\Theta_1(f) :=
\begin{cases}
2^{-\frac{2}{2-p}}\, p^{\frac{p}{2-p}}\, (2-p)\, \sup^{\frac{2}{2-p}}\!\left(\frac{f}{d^p}\right) & \text{if } \sup f \ge 0;\\
0 & \text{otherwise,}
\end{cases} \tag{2.179}
$$
so that for any $\ell \ge 0$,
$$
\Theta_1^*(\ell) = \left[\ell(d^p)\right]^{2/p}. \tag{2.180}
$$
Remark 2.5.1 Actually, the proof of Corollary 2.5.1 does not use the assumption that $d$ is a metric (other than that it is a continuous function which is bounded below). The equivalent formulation of the $T_1$ inequality (a special case of (2.170)) was known to Rachev [136] and Bobkov and Götze [113] (who actually slightly simplified the formula using the fact that $d$ is a metric). The equivalent formulation of the $T_2$ inequality in (2.169) also appeared in [113], and was employed in [96][82] to show a connection to the logarithmic Sobolev inequality. The equivalent formulation of the $T_p$ inequality, $p \in [1,2)$, in (2.170) appeared in [57, Proposition 22.3].

Transportation-cost inequalities have been fruitful in obtaining measure concentration results (since [137][138]). Section 6.4.3 contains further discussions on $T_2$ inequalities in the Gaussian case.
Chapter 3

Smoothing

In Chapter 2 we studied the fundamentals of the functional-entropic duality and used the result to derive a strong converse bound for the omniscient helper common randomness generation problem. The bound is only (first-order) asymptotically tight for vanishing communication rates, because the mutual information optimization arising from the single-letter rate region coincides with the corresponding relative entropy optimization only in that case. In this chapter we introduce a machinery called smoothing to overcome this issue. By perturbing a product distribution by a tiny bit in the total variation distance, we regain the (first-order asymptotic) equivalence between the relative entropy optimization and the mutual information optimization in rather general settings. Analysis of the second-order asymptotics is carried out in the discrete and the Gaussian cases, implying the optimal $O(\sqrt{n})$ second-order converse for the whole rate region of the omniscient-helper problem in those cases.
3.1 The BL Divergence and Its Smooth Version
This section introduces some necessary definitions and notations, and summarizes the
main goal of this chapter.
Definition 3.1.1 Given a nonnegative finite measure $\mu$ on $\mathcal{X}$, nonnegative $\sigma$-finite measures $\nu_1,\dots,\nu_m$ on $\mathcal{Y}_1,\dots,\mathcal{Y}_m$, random transformations $Q_{Y_1|X},\dots,Q_{Y_m|X}$, and $c_1,\dots,c_m \in [0,\infty)$, define the Brascamp-Lieb divergence
$$
d(\mu, (Q_{Y_j|X}), (\nu_j), c^m) := \sup_{P_X} \left\{\sum_{j=1}^m c_j D(P_{Y_j}\|\nu_j) - D(P_X\|\mu)\right\} \tag{3.1}
$$
where the supremum is over $P_X \ll \mu$ such that $D(P_X\|\mu)$ and $D(P_{Y_j}\|\nu_j)$, $j = 1,\dots,m$ are finite, and $P_X \to Q_{Y_j|X} \to P_{Y_j}$.

We remark that the optimization in (3.1) is generally not convex. Moreover, as a convention of this thesis, if the feasible set for a supremum is empty, then we set the supremum equal to $-\infty$.
By Theorem 2.2.3, we have the following alternative expression for the Brascamp-Lieb (BL) divergence:

Proposition 3.1.1 Under the assumptions in Definition 3.1.1,
$$
d(\mu, (Q_{Y_j|X}), (\nu_j), c^m) = \sup_{f_1,\dots,f_m} \left\{\log \int \exp\left(\sum_{j=1}^m E[\log f_j(Y_j) \mid X = x]\right) d\mu(x) - \sum_{j=1}^m \log \|f_j\|_{1/c_j}\right\} \tag{3.2}
$$
where the supremum is over nonnegative functions on $\mathcal{Y}_1,\dots,\mathcal{Y}_m$, and $Y_j \sim Q_{Y_j|X=x}$ conditioned on $X = x$.

The entropic definition (3.1) is closer to the answer (the single-letter formula of a network information theory problem, usually expressed using a linear combination of mutual informations), whereas the functional formula (3.2) is more "operational" in the derivation of bounds.
When the other parameters are fixed, the BL divergence $d(\mu, (Q_{Y_j|X}), (Q_{Y_j}), c^m)$ can be viewed as a function of $\mu$. Loosely speaking, the "smoothing" operation in the title of the chapter can be thought of as the "semicontinuous regularization" of a function on the measure space in the total variation distance. That is, we infimize (or supremize) a function within a neighbourhood of its argument. The rationale for considering such a regularization is that a small perturbation of the distribution in the total variation distance implies only a small change of the observed error probability, but the asymptotic behavior of the value of the function can be vastly different. For example, in cryptography, it has long been known that the collision entropy (Rényi entropy of order 2) bounds the performance of privacy amplification (see e.g. [139][140][141][142]). However, Renner and Wolf [53] showed that the smooth Rényi entropy has the asymptotic behavior of the Shannon entropy (Rényi entropy of order 1). Hence, the correct rate formula for privacy amplification is still expressed in terms of Shannon information quantities.
Now, let us try to apply the same smoothing operation to the BL divergence.
Instead of using the total variation distance, however, we shall consider a very similar,
but mathematically more convenient, quantity:
Definition 3.1.2 For nonnegative measures $\nu$ and $\mu$ on the same measurable space $(\mathcal{X}, \mathcal{F})$ where $\nu(\mathcal{X}) < \infty$, and $\gamma \in [1,\infty)$, the $E_\gamma$ divergence is defined as
$$
E_\gamma(\nu\|\mu) := \sup_{A \in \mathcal{F}} \{\nu(A) - \gamma\,\mu(A)\}. \tag{3.3}
$$
Note that under this definition $E_1(P\|\mu) = \frac{1}{2}|P - \mu|$ if and only if $\mu$ is a probability measure. We will return to some other applications of $E_\gamma$ in Chapter 5.
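For discrete measures the supremum in (3.3) is attained at $A = \{i : \nu_i > \gamma \mu_i\}$, so $E_\gamma$ is a sum of positive parts. A minimal sketch (assuming NumPy; the random instances are ours) also checks the stated relation to total variation at $\gamma = 1$ and the monotonicity in $\gamma$:

```python
import numpy as np

def e_gamma(nu, mu, gamma):
    # (3.3) for discrete measures: optimal A = {i : nu_i > gamma * mu_i}.
    return float(np.maximum(nu - gamma * mu, 0.0).sum())

rng = np.random.default_rng(3)
p = rng.random(6); p /= p.sum()
q = rng.random(6); q /= q.sum()

# gamma = 1 recovers half the total variation when mu is a probability measure.
assert abs(e_gamma(p, q, 1.0) - 0.5 * np.abs(p - q).sum()) < 1e-12

# E_gamma is nonincreasing in gamma.
vals = [e_gamma(p, q, g) for g in (1.0, 1.5, 2.0, 3.0)]
assert all(a >= b for a, b in zip(vals, vals[1:]))
```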
Now let us proceed to define the smooth BL divergence. Since in most of our usage the random transformations $Q_{Y_1|X},\dots,Q_{Y_m|X}$ are fixed, we sometimes omit these random transformations from the argument of the BL divergences, to keep the notation simple.

Definition 3.1.3 For $\delta \in [0,1)$, $Q_X$, $(Q_{Y_j|X})$, $(\nu_j)$ and $c^m \in [0,\infty)^m$, define
$$
d^\delta(Q_X, (\nu_j), c^m) := \inf_{\mu \,:\, E_1(Q_X\|\mu) \le \delta} d(\mu, (\nu_j), c^m) \tag{3.4}
$$
where $d(\cdot)$ is as defined in Definition 3.1.1.
Remark 3.1.1 According to the discussion in the previous chapter, the BL divergence is a generalization of several information measures, including the strong data processing constant, hypercontractivity, and the Rényi divergence. For example, for $\alpha \in (1,\infty)$, the Rényi divergence between two probability measures $P$ and $Q$ on the same alphabet can be expressed in terms of the BL divergence:
$$
D_\alpha(P\|Q) = \frac{\alpha}{\alpha-1}\, d\left(P, Q, \frac{\alpha-1}{\alpha}\right), \tag{3.5}
$$
where the random transformation in the definition of the BL divergence is the identity. Consequently, the smooth Rényi divergence [53] can be expressed in terms of a smooth BL divergence:
$$
D_\alpha^\delta(P\|Q) = \frac{\alpha}{\alpha-1}\, d^\delta\left(P, Q, \frac{\alpha-1}{\alpha}\right) \tag{3.6}
$$
for $\delta \in (0,1)$.
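Relation (3.5) can be verified numerically for finite alphabets: with the identity channel, the supremum in (3.1) is attained by the tilted distribution $r^* \propto P^\alpha Q^{1-\alpha}$ (a standard Lagrangian computation, not spelled out in the text). A sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, n = 2.0, 5
p = rng.random(n); p /= p.sum()
q = rng.random(n); q /= q.sum()
c = (alpha - 1) / alpha

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

def inner(r):
    # the objective c D(r||Q) - D(r||P) inside (3.1), identity channel
    return c * kl(r, q) - kl(r, p)

d_alpha = float(np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1))

r_star = p**alpha * q**(1 - alpha); r_star /= r_star.sum()   # tilted maximizer
assert abs(inner(r_star) - c * d_alpha) < 1e-9               # matches (3.5)

for _ in range(200):                 # no random r beats the tilted one
    r = rng.random(n); r /= r.sum()
    assert inner(r) <= inner(r_star) + 1e-9
```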
Remark 3.1.2 If we restrict $\mu$ in (3.4) to be a probability measure, then all asymptotic results remain the same. However, allowing unnormalized measures avoids the unnecessary step of normalization in the proof, and is in accordance with the literature on smooth Rényi entropy, where such a relaxation generally gives rise to nicer properties and tighter non-asymptotic bounds, cf. [53][143].
We now introduce a quantity which resembles the BL divergence and plays a central role in the characterization of the asymptotic performance of the smooth BL divergence.

Definition 3.1.4 Given $Q_X$, $(Q_{Y_j|X})$, $(\nu_j)$ and $c^m \in (0,\infty)^m$ such that $|D(Q_{Y_j}\|\nu_j)| < \infty$, $j = 1,\dots,m$, define
$$
d^\star(Q_X, (\nu_j), c^m) := \sup_{P_{UX} \,:\, P_X = Q_X} \left[\sum_{j=1}^m c_j D(P_{Y_j|U}\|\nu_j|P_U) - I(U; X)\right], \tag{3.7}
$$
where $(U, X) \sim P_{UX}$, $P_{Y_j|U}$ is the concatenation of $P_{X|U}$ and $Q_{Y_j|X}$, and in the supremum we also assume that $P_{UX}$ is such that each term in (3.7) is well-defined and finite.

Remark 3.1.3 In Definition 3.1.4, it is without loss of generality to impose that $|\mathcal{U}| < \infty$, which can be shown using the fact that the mutual information equals its supremum over finite partitions (see [144] for more details).
In later applications to the common randomness generation problem, we take the reference measures $\nu_j = Q_{Y_j}$, $j = 1,\dots,m$, whereas in other applications we may take $\nu_j$ equal to the equiprobable distribution. Note that in the case of $\nu_j = Q_{Y_j}$, we have
$$
d^\star(Q_X, (Q_{Y_j}), c^m) = \sup_{P_{U|X}} \left\{\sum_{l=1}^m c_l I(U; Y_l) - I(U; X)\right\} \tag{3.8}
$$
which is the quantity appearing in the single-letter region of the common randomness generation problem (Section 2.3.1)! Moreover, as Proposition 2.3.1 shows, we must have $d^\star = d$ if one of them is known to be 0. However, in general such an equivalence may not hold, which is why our converse bound in the previous chapter is not (first-order asymptotically) tight; this is the issue that we try to fix in this chapter.
Notice that from these definitions and the tensorization property of $d(\cdot)$ (Section 6.1), we have
$$
d(Q_X, (Q_{Y_j}), c^m) = \frac{1}{n}\, d(Q_X^{\otimes n}, (Q_{Y_j}^{\otimes n}), c^m) \tag{3.9}
$$
$$
\ge \frac{1}{n}\, d^\delta(Q_X^{\otimes n}, (Q_{Y_j}^{\otimes n}), c^m). \tag{3.10}
$$
The goal of this chapter is to study the asymptotics of $d^\delta(Q_X^{\otimes n}, (Q_{Y_j}^{\otimes n}), c^m)$ and discuss its implications for some coding theorems.
The converse for the omniscient-helper common randomness (CR) generation problem that we will derive using $d^\delta$ implies the general lower bound
$$
d^\delta(Q_X^{\otimes n}, (Q_{Y_j}^{\otimes n}), c^m) \ge n\, d^\star(Q_X, (Q_{Y_j}), c^m) + O(\sqrt{n}) \tag{3.11}
$$
for any $\delta \in (0,1)$, since otherwise the converse bound would contradict the known achievability bound for the CR generation problem.
Upper bounding $d^\delta(Q_X^{\otimes n}, (Q_{Y_j}^{\otimes n}), c^m)$ is the more nontrivial part. We would still expect $d^\star(Q_X, (Q_{Y_j}), c^m)$ to emerge as the asymptotic limit in rather general settings. The results are summarized as follows:

• In the cases of 1) finite $\mathcal{X}$, 2) jointly Gaussian $Q_{XY^m}$, the second-order term is of order $O(\sqrt{n})$:
$$
d^\delta(Q_X^{\otimes n}, (Q_{Y_j}^{\otimes n}), c^m) = n\, d^\star(Q_X, (Q_{Y_j}), c^m) + O(\sqrt{n}). \tag{3.12}
$$

• We have conclusive results on the first-order asymptotics for a rather large class of continuous distributions. More precisely, if $\mathcal{X} = \mathcal{Y}^m$, each $\nu_j$ is an atomless^1 nonnegative $\sigma$-finite measure on a Polish space^2, and a certain relative information quantity is continuous and absolutely integrable, then
$$
\lim_{n\to\infty} \frac{1}{n}\, d^\delta(Q_X^{\otimes n}, (Q_{Y_j}^{\otimes n}), c^m) = d^\star(Q_X, (Q_{Y_j}), c^m). \tag{3.13}
$$

^{1} That is, for any measurable $A \subseteq \mathcal{Y}_j$ with $\nu_{Y_j}(A) > 0$, there exists a measurable $B \subseteq A$ such that $0 < \nu_{Y_j}(B) < \nu_{Y_j}(A)$.
The second-order asymptotic result (3.12) will lead to second-order converses for coding theorems, whereas the first-order result (3.13) will lead to a strong converse result. The proposed approach to strong converses has several advantages compared with existing approaches, such as the method of types approach in [8], which are nicely illustrated by our example of CR generation:

• The approach is applicable to continuous sources satisfying certain regularity conditions, for which the method of types is futile.

• In the omniscient helper CR generation problem, our approach covers possibly stochastic encoders and decoders.^3 Stochastic encoders and decoders are tricky to handle with the method of types or change-of-measure techniques, since they cannot be described by encoding and decoding sets.

Strong converses for a continuous source when the rate region involves auxiliaries are rare [50]. The smooth BL divergence offers a canonical method for this situation.

^{2} A Polish space is a complete separable metric space. It enjoys nice properties that allow us to perform some technical steps smoothly; see e.g. [105][104] for brief introductions.

^{3} The (asymptotic) rate region with stochastic encoders can be strictly larger than with deterministic encoders, since in the former case the CR rate is unbounded whereas in the latter case it is bounded by the entropy of the sources. Regarding the decoders, we argue in Remark 3.3.2 that allowing stochasticity can strictly decrease the (one-shot) error, but within a constant factor.

3.2 Upper Bounds on $d^\delta$

3.2.1 Discrete Case

The main result for the discrete case can be summarized as follows:
Theorem 3.2.1 Fix $Q_X$, $(Q_{Y_j|X})$, $(\nu_j)$ and $c^m \in (0,\infty)^m$. Assume that $\mathcal{X}$ is finite, and
$$
D(P_{Y_j}\|\nu_j) < +\infty, \quad \forall P_X \ll Q_X,\ j = 1,\dots,m; \tag{3.14}
$$
$$
d^\star(Q_X, (\nu_j), c^m) > -\infty. \tag{3.15}
$$
Then
$$
d^\delta(Q_X^{\otimes n}, (\nu_j^{\otimes n}), c^m) \le n\, d^\star(Q_X, (\nu_j), c^m) + O(\sqrt{n}). \tag{3.16}
$$
We prove a non-asymptotic version of Theorem 3.2.1 here. For simplicity we only consider the $m = 1$ case, but the analysis goes through in the general case as well.
Lemma 3.2.2 Consider $Q_X$, $Q_{Y|X}$, $\nu$, $\delta \in (0,1)$, $c \in (0,\infty)$. Define $\beta_X$ and $\alpha_Y$ as in Lemma 3.2.3. Then there exists some $\mathcal{C}_n$ with $Q_X^{\otimes n}[\mathcal{C}_n] \ge 1-\delta$, such that $\mu_n := Q_X^{\otimes n}|_{\mathcal{C}_n}$ satisfies
$$
d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) \le n\, d^\star(Q_X, Q_{Y|X}, \nu, c) + \ln(\alpha_Y^c \beta_X^{c+1}) \sqrt{3 n \beta_X \ln\frac{|\mathcal{X}|}{\delta}} \tag{3.17}
$$
for all $n > 3\beta_X \ln\frac{|\mathcal{X}|}{\delta}$.
The intuition behind the second-order estimate in Lemma 3.2.2 is not hard to comprehend: we can choose
$$
\mathcal{C}_n = \{x^n : |\hat{P}_{x^n} - Q_X| = O(n^{-1/2})\} \tag{3.18}
$$
where $\hat{P}_{x^n}$ denotes the empirical measure of $x^n$, to guarantee that $Q_X^{\otimes n}(\mathcal{C}_n) \ge 1-\delta$ (in Lemma 3.2.2 a slightly more sophisticated $\mathcal{C}_n$ is chosen). A simple single-letterization argument shows that $\frac{1}{n} d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c)$ can be bounded using only the empirical distributions of the sequences on the support of $\mu_n$, so that the $O(n^{-1/2})$ term in (3.18) contributes a second-order correction term in (3.61).
Proof Let $\hat{P}_{X^n}$ denote the empirical distribution of $X^n \sim Q_X^{\otimes n}$. For $n > 3\beta_X \ln\frac{|\mathcal{X}|}{\delta}$, define
$$
\varepsilon_n := \sqrt{\frac{3\beta_X}{n} \ln\frac{|\mathcal{X}|}{\delta}} \in (0,1). \tag{3.19}
$$
For each $x$, we have, from the Chernoff bound for Bernoulli random variables (see below),
$$
\Pr[\hat{P}_{X^n}(x) > (1+\varepsilon_n) Q_X(x)] \le e^{-\frac{n}{3} Q_X(x) \varepsilon_n^2}, \tag{3.20}
$$
therefore by the union bound,
$$
\Pr[\hat{P}_{X^n} \le (1+\varepsilon_n) Q_X] \ge 1-\delta. \tag{3.21}
$$
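The Bernoulli Chernoff bound invoked in (3.20), $\Pr[\mathrm{Bin}(n,q) > (1+\varepsilon)nq] \le e^{-nq\varepsilon^2/3}$ for $\varepsilon \in (0,1)$, can be checked against the exact binomial tail. A minimal sketch (Python standard library only; the parameter values are our choice, not from the text):

```python
import math

def binom_tail(n, q, k):
    # Exact P[Bin(n, q) > k].
    return sum(math.comb(n, i) * q**i * (1 - q)**(n - i)
               for i in range(k + 1, n + 1))

n, q, eps = 200, 0.3, 0.2
chernoff = math.exp(-n * q * eps**2 / 3)                  # bound as in (3.20)
exact = binom_tail(n, q, math.floor((1 + eps) * n * q))   # P[hat P > (1+eps) q]
assert exact <= chernoff
```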
Finally, we note that for any $P_{X^n}$ supported on
$$
\mathcal{C}_n := \{x^n : \hat{P}_{x^n} \le (1+\varepsilon_n) Q_X\}, \tag{3.22}
$$
we have
$$
\begin{aligned}
c D(P_{Y^n}\|\nu^{\otimes n}) - D(P_{X^n}\|\mu_n)
&= c D(P_{Y^n}\|\nu^{\otimes n}) - D(P_{X^n}\|Q_X^{\otimes n}) &&(3.23)\\
&= c \sum_{i=1}^n D(P_{Y_i|Y^{i-1}}\|\nu|P_{Y^{i-1}}) - \sum_{i=1}^n D(P_{X_i|X^{i-1}}\|Q_X|P_{X^{i-1}}) &&(3.24)\\
&\le c \sum_{i=1}^n D(P_{Y_i|X^{i-1}}\|\nu|P_{X^{i-1}}) - \sum_{i=1}^n D(P_{X_i|X^{i-1}}\|Q_X|P_{X^{i-1}}) &&(3.25)\\
&= n\left[c D(P_{Y_I|I X^{I-1}}\|\nu|P_{I X^{I-1}}) - D(P_{X_I|I X^{I-1}}\|Q_X|P_{I X^{I-1}})\right] &&(3.26)\\
&\le n \sup_{P_X : P_X \le (1+\varepsilon_n) Q_X} \varphi(P_X) &&(3.27)
\end{aligned}
$$
where (3.25) follows since $Y_i - X^{i-1} - Y^{i-1}$ under $P$; in (3.26) we defined $I$ to be a random variable equiprobably distributed on $\{1,\dots,n\}$ and independent of everything else; and in (3.27) $\varphi$ is defined in Lemma 3.2.3. Note that (3.23)-(3.27) is essentially the same as the single-letterization property [145, Lemma 9].

Since the probability of the set in (3.22) is lower bounded by (3.21), the result follows by taking the supremum over $P_{X^n} \ll \mu_n$ in (3.23)-(3.27) and using Lemma 3.2.3 below.
The following technical result was used in the proof of the main result. For notational simplicity, logarithms in this lemma are taken to the natural base.

Lemma 3.2.3 Fix $(Q_X, Q_{Y|X}, \nu)$ and $c \in (0,\infty)$. Define $\beta_X := (\min_x Q_X(x))^{-1}$ and
$$\alpha_Y = |\nu|\,\left\|\frac{dQ_Y}{d\nu}\right\|_\infty. \tag{3.28}$$
Let
$$\phi(P_X) := \sup_{P_{U|X}}\big\{cD(P_{Y|U}\|\nu|P_U) - D(P_{X|U}\|Q_X|P_U)\big\}, \tag{3.29}$$
for any $P_X \ll Q_X$. Then
$$\phi(P_X) \le \phi(Q_X) + \ln\big(\beta_X^{c+1}\alpha_Y^c\big)\,\varepsilon \tag{3.30}$$
for any $P_X \le (1+\varepsilon)Q_X$, $\varepsilon \in [0,1)$.
Proof Given an arbitrary $P_X \le (1+\varepsilon)Q_X$, suppose that $P_{U|X}$ is a maximizer for (3.29) (if a maximizer does not exist, the argument still carries through by approximation). Then consider $\bar{P}_{\bar{U}X}$ on $\bar{\mathcal{U}} \times \mathcal{X}$, where $\bar{\mathcal{U}} = \mathcal{U} \cup \{\star\}$, defined as follows:
$$\bar{P}_{\bar{U}} := \frac{1}{1+\varepsilon}P_U + \frac{\varepsilon}{1+\varepsilon}\delta_\star; \tag{3.31}$$
$$\bar{P}_{X|\bar{U}=u} := P_{X|U=u}, \qquad \forall u \in \mathcal{U}; \tag{3.32}$$
$$\bar{P}_{X|\bar{U}=\star} := \frac{1+\varepsilon}{\varepsilon}\left(Q_X - \frac{1}{1+\varepsilon}P_X\right) \tag{3.33}$$
where $\delta_\star$ is the one-point distribution on $\star$ (note that (3.33) is a valid probability distribution precisely because $P_X \le (1+\varepsilon)Q_X$). Then observe that $\bar{P}_X = Q_X$, and
$$D(\bar{P}_{X|\bar{U}}\|Q_X|\bar{P}_{\bar{U}}) = \frac{1}{1+\varepsilon}D(P_{X|U}\|Q_X|P_U) + \frac{\varepsilon}{1+\varepsilon}D\!\left(\frac{1+\varepsilon}{\varepsilon}Q_X - \frac{1}{\varepsilon}P_X \,\middle\|\, Q_X\right) \tag{3.34}$$
and
$$D(\bar{P}_{Y|\bar{U}}\|\nu|\bar{P}_{\bar{U}}) = \frac{1}{1+\varepsilon}D(P_{Y|U}\|\nu|P_U) + \frac{\varepsilon}{1+\varepsilon}D\!\left(\frac{1+\varepsilon}{\varepsilon}Q_Y - \frac{1}{\varepsilon}P_Y \,\middle\|\, \nu\right), \tag{3.35}$$
which imply that
$$D(P_{X|U}\|Q_X|P_U) \ge D(\bar{P}_{X|\bar{U}}\|Q_X|\bar{P}_{\bar{U}}) - \varepsilon D\!\left(\frac{1+\varepsilon}{\varepsilon}Q_X - \frac{1}{\varepsilon}P_X\,\middle\|\,Q_X\right) \tag{3.36}$$
$$\ge D(\bar{P}_{X|\bar{U}}\|Q_X|\bar{P}_{\bar{U}}) - \varepsilon\ln\beta_X; \tag{3.37}$$
$$D(P_{Y|U}\|\nu|P_U) = (1+\varepsilon)D(\bar{P}_{Y|\bar{U}}\|\nu|\bar{P}_{\bar{U}}) - \varepsilon D\!\left(\frac{1+\varepsilon}{\varepsilon}Q_Y - \frac{1}{\varepsilon}P_Y\,\middle\|\,\nu\right) \tag{3.38}$$
$$\le D(\bar{P}_{Y|\bar{U}}\|\nu|\bar{P}_{\bar{U}}) + \varepsilon D(\bar{P}_{Y|\bar{U}}\|\nu|\bar{P}_{\bar{U}}) + \varepsilon\ln|\nu| \tag{3.39}$$
$$= D(\bar{P}_{Y|\bar{U}}\|\nu|\bar{P}_{\bar{U}}) + \varepsilon D(\bar{P}_{Y|\bar{U}}\|Q_Y|\bar{P}_{\bar{U}}) + \varepsilon\mathbb{E}[\imath_{Q_Y\|\nu}(\bar{Y})] + \varepsilon\ln|\nu| \tag{3.40}$$
$$\le D(\bar{P}_{Y|\bar{U}}\|\nu|\bar{P}_{\bar{U}}) + \varepsilon\ln(\beta_X\alpha_Y) \tag{3.41}$$
where $\imath_{Q_Y\|\nu} := \ln\frac{dQ_Y}{d\nu}$, $\bar{Y} \sim \bar{P}_Y = Q_Y$, and the last step used the data processing inequality
$$D(\bar{P}_{Y|\bar{U}}\|Q_Y|\bar{P}_{\bar{U}}) \le D(\bar{P}_{X|\bar{U}}\|Q_X|\bar{P}_{\bar{U}}) \le \ln\beta_X. \tag{3.42}$$
Combining (3.37) and (3.41), and recalling $\bar{P}_X = Q_X$, we obtain
$$\phi(P_X) \le cD(\bar{P}_{Y|\bar{U}}\|\nu|\bar{P}_{\bar{U}}) - D(\bar{P}_{X|\bar{U}}\|Q_X|\bar{P}_{\bar{U}}) + \varepsilon\ln\big(\beta_X^{c+1}\alpha_Y^c\big) \le \phi(Q_X) + \varepsilon\ln\big(\beta_X^{c+1}\alpha_Y^c\big),$$
which is (3.30).
The proof of the main result used the following standard large deviation bound [64]:

Theorem 3.2.4 (Chernoff bound for Bernoulli random variables) Assume that $X_1,\dots,X_n$ are i.i.d. $\mathrm{Bern}(p)$. Then for any $\varepsilon \in (0,\infty)$,
$$P\left[\sum_{i=1}^n X_i \ge (1+\varepsilon)np\right] \le e^{-nD((1+\varepsilon)p\|p)} \tag{3.43}$$
$$\le e^{-\frac{\min\{\varepsilon^2,\varepsilon\}np}{3}}. \tag{3.44}$$
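As a quick numerical sanity check (not part of the original development), one can compare the relaxed bound (3.44) against Monte Carlo estimates of the binomial tail; the function names and parameters below are illustrative choices:

```python
import numpy as np

# Check of the Chernoff bound (3.44):
#   P[ sum X_i >= (1+eps) n p ] <= exp(-min(eps^2, eps) * n * p / 3)
# for i.i.d. Bern(p), via Monte Carlo. Parameters are arbitrary.
def chernoff_bound(n, p, eps):
    return np.exp(-min(eps**2, eps) * n * p / 3.0)

def empirical_tail(n, p, eps, trials=200_000, seed=0):
    rng = np.random.default_rng(seed)
    sums = rng.binomial(n, p, size=trials)   # sum of n Bernoulli(p) draws
    return np.mean(sums >= (1 + eps) * n * p)

n, p = 200, 0.3
for eps in (0.2, 0.5, 1.0):
    assert empirical_tail(n, p, eps) <= chernoff_bound(n, p, eps)
```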
3.2.2 Gaussian Case

The main result for the Gaussian case can be summarized as follows:

Theorem 3.2.5 Fix a Gaussian measure $Q_{\mathbf{X}}$, Gaussian random transformations $(Q_{\mathbf{Y}_j|\mathbf{X}})$⁴, Lebesgue or Gaussian measures $(\nu_j)$, and $c^m \in (0,\infty)^m$. Then
$$d_\delta\big(Q_{\mathbf{X}}^{\otimes n}, (\nu_j^{\otimes n}), c^m\big) \le n\, d^\star(Q_{\mathbf{X}}, (\nu_j), c^m) + O(\sqrt{n}). \tag{3.45}$$

In the following we prove a non-asymptotic version of the theorem. For simplicity we consider the setting where $m = 1$ and $\nu_j$ is the Lebesgue measure, but the analysis carries over to the general case. Such a setting is useful in proving the second-order converse for certain distributed lossy source compression problems with quadratic cost functions. Logarithms in this lemma are taken to the natural base.

Lemma 3.2.6 Let $Q_X = \mathcal{N}(0,\sigma^2)$, $Q_{Y|X=x} = \mathcal{N}(x,1)$, and suppose that $\nu = \lambda$ is the Lebesgue measure. Then for any $c \in (0,\infty)$, $\delta \in (0,1)$ and $n \ge 24\ln\frac{2}{\delta}$,
$$d_\delta\big(Q_X^{\otimes n}, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c\big) \le n\, d^\star(Q_X, Q_{Y|X}, \nu, c) + \sqrt{6n\ln\frac{2}{\delta}}. \tag{3.46}$$
Proof Define two kinds of typical sets:
$$\mathcal{S}^n_{\varepsilon_1} = \left\{x^n : \frac{1}{n}\|x^n\|^2 \le (1+\varepsilon_1)\sigma^2\right\}, \tag{3.47}$$
and
$$\mathcal{T}^n_{\varepsilon_2} = \left\{x^n : \frac{1}{n}\sum_{i=1}^n \imath_{Q_X\|\lambda}(x_i) \le -h(Q_X) + \frac{\varepsilon_2}{2}\right\} \tag{3.48}$$
$$= \left\{x^n : \frac{1}{n}\|x^n\|^2 \ge (1-\varepsilon_2)\sigma^2\right\}. \tag{3.49}$$

⁴By a "Gaussian random transformation" we mean that the output is a linear transformation of the input plus independent Gaussian noise.
If we choose $\varepsilon_i = \frac{A_i}{\sqrt{n}}$, $i = 1,2$, then since
$$\ln\mathbb{E}\left[e^{\lambda\|X^n\|^2/\sigma^2}\right] = \frac{n}{2}\ln\frac{1}{1-2\lambda}, \qquad \forall\lambda \in (-\infty, \tfrac12), \tag{3.50}$$
we have
$$P\left[\sum_{i=1}^n X_i^2 > n(1+\varepsilon_1)\sigma^2\right] \le e^{-n\sup_{\lambda\in(0,\frac12)}\{\frac12\ln(1-2\lambda)+\lambda(1+\varepsilon_1)\}} \tag{3.51}$$
$$= e^{-n(-\frac12\ln(1+\varepsilon_1)+\frac{\varepsilon_1}{2})} \tag{3.52}$$
$$\le e^{-n(\frac{\varepsilon_1^2}{4}-\frac{\varepsilon_1^3}{6})}; \tag{3.53}$$
$$P\left[\sum_{i=1}^n X_i^2 < n(1-\varepsilon_2)\sigma^2\right] \le e^{-n\sup_{\lambda\in(-\infty,0)}\{\frac12\ln(1-2\lambda)+\lambda(1-\varepsilon_2)\}} \tag{3.54}$$
$$= e^{-n(-\frac12\ln(1-\varepsilon_2)-\frac{\varepsilon_2}{2})} \tag{3.55}$$
$$\le e^{-\frac{n\varepsilon_2^2}{4}}. \tag{3.56}$$
Therefore by the union bound,
$$1 - Q_X^{\otimes n}\big[\mathcal{S}^n_{\varepsilon_1}\cap\mathcal{T}^n_{\varepsilon_2}\big] \le e^{-\frac{A_1^2}{4}+\frac{A_1^3}{6\sqrt{n}}} + e^{-\frac{A_2^2}{4}}. \tag{3.57}$$
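For completeness, here is the elementary optimization behind the step from (3.51) to (3.52), which the text leaves implicit:

```latex
\[
g(\lambda) := \tfrac12\ln(1-2\lambda) + \lambda(1+\varepsilon_1),\qquad
g'(\lambda) = -\frac{1}{1-2\lambda} + 1 + \varepsilon_1 = 0
\ \Longrightarrow\
\lambda^\star = \frac{\varepsilon_1}{2(1+\varepsilon_1)} \in \big(0,\tfrac12\big),
\]
\[
\sup_{\lambda\in(0,1/2)} g(\lambda) = g(\lambda^\star)
= \tfrac12\ln\frac{1}{1+\varepsilon_1} + \frac{\varepsilon_1}{2}
= -\tfrac12\ln(1+\varepsilon_1) + \frac{\varepsilon_1}{2}.
\]
```

The analogous computation over $\lambda \in (-\infty,0)$, with $1-\varepsilon_2$ in place of $1+\varepsilon_1$, yields (3.55).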
Now, suppose that we can show the following result: if $\mu_n$ is the restriction of $Q_X^{\otimes n}$ to $\mathcal{S}^n_{\varepsilon_1}\cap\mathcal{T}^n_{\varepsilon_2}$, then
$$d\big(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c\big) \le n\, d^\star(Q_X, Q_{Y|X}, \nu, c) + \sqrt{n}\left(\frac{1-c}{2}A_1 + \frac{A_2}{2}\right). \tag{3.58}$$
Then, taking
$$A_1 = A_2 = \sqrt{6\ln\frac{2}{\delta}} \tag{3.59}$$
we see that for $n \ge 24\ln\frac{2}{\delta}$, the right side of (3.57) is bounded above by $\delta$ and (3.46) holds.
It remains to prove (3.58). Without complicating the notation too much, let us switch to the case of general $m$ for the convenience of future use. Given $(Q_{\mathbf{Y}_j|\mathbf{X}})_{j=1}^m$ and $c^m$, let $\lambda$ and $(\lambda_j)_{j=1}^m$ denote the Lebesgue measures on the Euclidean spaces $\mathcal{X}$ and $(\mathcal{Y}_j)_{j=1}^m$, respectively. Define
$$F(\mathbf{M}) := \sup\Big\{-\sum c_j h(\mathbf{Y}_j|U) + h(\mathbf{X}|U)\Big\} \tag{3.60}$$
$$= \sup\Big\{\sum c_j D(P_{\mathbf{Y}_j|U}\|\lambda_j|P_U) - D(P_{\mathbf{X}|U}\|\lambda|P_U)\Big\} \tag{3.61}$$
where the suprema are over $P_{U\mathbf{X}}$ such that $\Sigma_{\mathbf{X}} \preceq \mathbf{M}$ for the given positive semidefinite matrix $\mathbf{M}$. In Chapter 6 we will discuss results about Gaussian optimality in this type of optimization. In particular, we will be able to show that $F(\mathbf{M})$ equals the supremum in (3.61) restricted to constant $U$ and Gaussian $P_{\mathbf{X}}$, which implies that
$$F(\Sigma) + C = d^\star(Q_{\mathbf{X}}, (\nu_j), c^m) \tag{3.62}$$
for
$$C := -\sum_j c_j\mathbb{E}\big[\imath_{\nu_j\|\lambda_j}(\mathbf{Y}_j)\big] - h(\mathbf{X}) \tag{3.63}$$
where $(\mathbf{Y}_j,\mathbf{X}) \sim Q_{\mathbf{Y}_j|\mathbf{X}}Q_{\mathbf{X}}$. For $\mu_n$ defined as the restriction of $Q_X^{\otimes n}$ to $\mathcal{S}^n_{\varepsilon_1}\cap\mathcal{T}^n_{\varepsilon_2}$ and $P_{X^n} \ll \mu_n$, we have⁵
$$\frac{1}{n}\left[\sum_j c_j D(P_{Y_j{}^n}\|\lambda_j^{\otimes n}) - D(P_{X^n}\|\lambda^{\otimes n})\right] = \frac{1}{n}\sum_{i=1}^n\left[\sum_j c_j D(P_{Y_{ji}|Y_j{}^{i-1}}\|\lambda_j|P_{Y_j{}^{i-1}}) - D(P_{X_i|X^{i-1}}\|\lambda|P_{X^{i-1}})\right] \tag{3.64}$$
$$\le \frac{1}{n}\sum_{i=1}^n\left[\sum_j c_j D(P_{Y_{ji}|X^{i-1}}\|\lambda_j|P_{X^{i-1}}) - D(P_{X_i|X^{i-1}}\|\lambda|P_{X^{i-1}})\right] \tag{3.65}$$
$$\le F\big((1+\varepsilon_1)\Sigma\big). \tag{3.66}$$
Since $P_{X^n}$ is supported on $\mathcal{T}^n_{\varepsilon_2}$, we also have
$$\frac{1}{n}\left[\sum_j c_j D(P_{Y_j{}^n}\|\lambda_j^{\otimes n}) - D(P_{X^n}\|\lambda^{\otimes n})\right] + C \ge \frac{1}{n}\left[\sum_j c_j D(P_{Y_j{}^n}\|\nu_j^{\otimes n}) - D(P_{X^n}\|Q_X^{\otimes n})\right] - \frac{\varepsilon_2}{2}. \tag{3.67}$$
Hence from (3.66)-(3.67) we conclude
$$\frac{1}{n}\left[\sum_j c_j D(P_{Y_j{}^n}\|\nu_j^{\otimes n}) - D(P_{X^n}\|\mu_n)\right] \le F\big((1+\varepsilon_1)\Sigma\big) + C + \frac{\varepsilon_2}{2} \tag{3.68}$$
where we used $D(P_{X^n}\|Q_X^{\otimes n}) = D(P_{X^n}\|\mu_n)$. Also, a simple scaling property of the differential entropy shows that
$$F\big((1+\varepsilon_1)\Sigma\big) = F(\Sigma) + \frac{\ln(1+\varepsilon_1)}{2}\Big(m - \sum c_j\Big) \tag{3.69}$$
$$\le F(\Sigma) + \frac{1}{2\sqrt{n}}\Big(m - \sum c_j\Big)A_1. \tag{3.70}$$

⁵As is clear from the context, in $Y_j{}^n$ the $n$ indexes the blocklength and the $j$ indexes the terminal, so $Y_j{}^n$ is not an abbreviation for $(Y_j,\dots,Y_n)$.
Continuing (3.68), we see
$$\frac{1}{n}\left[\sum_j c_j D(P_{Y_j{}^n}\|\nu_j^{\otimes n}) - D(P_{X^n}\|\mu_n)\right] \le F(\Sigma) + \frac{1}{2\sqrt{n}}\Big(m - \sum c_j\Big)A_1 + C + \frac{A_2}{2\sqrt{n}} \tag{3.71}$$
$$\le d^\star(Q_{\mathbf{X}}, (\nu_j), c^m) + \frac{1}{2\sqrt{n}}\Big(m - \sum c_j\Big)A_1 + \frac{A_2}{2\sqrt{n}} \tag{3.72}$$
where the last step used (3.62). Thus (3.58) is established, as desired.
3.2.3 Continuous Density Case

As alluded to before, the first-order asymptotics of the BL divergence is known when $\mathcal{X} = \mathcal{Y}^m$ and $Y^m$ has a continuous distribution. This result is sufficient for establishing the strong converse of omniscient-helper CR generation with continuous distributions, but the second-order analysis remains elusive at this point.

The main result of this subsection is the following:

Theorem 3.2.7 Assume that $\mathcal{Y}_1,\dots,\mathcal{Y}_m$ are Polish spaces, $\nu_{Y_1},\dots,\nu_{Y_m}$ and $Q_{Y^m}$ are Borel measures on $\mathcal{Y}_1,\dots,\mathcal{Y}_m$ and $\mathcal{Y}^m$, respectively, and $\imath_{Q_{Y^m}\|\nu_{Y^m}}$ is a (finite-valued) continuous function on $\mathcal{Y}^m$, where $\nu_{Y^m} := \nu_{Y_1}\times\cdots\times\nu_{Y_m}$, such that $\mathbb{E}\big[|\imath_{Q_{Y^m}\|\nu_{Y^m}}(Y^m)|\big] < \infty$, where $Y^m \sim Q_{Y^m}$. Then for any $\delta \in (0,1)$ and $c_1,\dots,c_m \in (0,\infty)$,
$$\liminf_{n\to\infty}\frac{1}{n}\, d_\delta\big(Q_{Y^m}^{\otimes n}, (\nu_{Y_j}^{\otimes n}), c^m\big) \le d^\star\big(Q_{Y^m}, (\nu_{Y_j}), c^m\big). \tag{3.73}$$

The proof of this result can be found in [144]. It is technically involved, so we omit it here. The proof idea is outlined as follows: first, we show that if $c_j > 1$ for any $j \in \{1,\dots,m\}$, then the result is trivial since both sides of the desired inequality equal $+\infty$. Next we prove the theorem in the special case of compact $\mathcal{Y}_1,\dots,\mathcal{Y}_m$, using quantization and the result for discrete distributions (Lemma 3.2.3). Finally, we extend the result to general Polish alphabets using the fact that a probability measure on a Polish space can be approximated by its restriction to a compact set.

The regularity condition $\mathbb{E}\big[|\imath_{Q_{Y^m}\|\nu_{Y^m}}(Y^m)|\big] < \infty$ is rather mild. For example, if $\nu_{Y_j}$ is the Lebesgue measure, then the condition is equivalent to assuming that the positive and negative parts of the integral defining the differential entropy of $Y^m$ are both finite. When $\nu_{Y_j} = Q_{Y_j}$ and $m = 2$, the condition translates into the assumption that $I(Y_1;Y_2) < \infty$.
3.3 Application to Common Randomness Generation

The smooth BL divergence can be used to give single-shot converse bounds for problems in multiuser information theory. In this section we focus on the example of omniscient-helper common randomness generation. The application to the discrete and Gaussian Gray-Wyner source coding problems, which we do not discuss here, is similar and can be found in our journal version [144].
3.3.1 Second-order Converse for the Omniscient Helper Problem

The following result can be proved based on Theorem 2.3.2 and the smoothing idea.

Theorem 3.3.1 (one-shot converse for omniscient helper CR generation) Fix $Q_{Y^m}$, $\delta \in [0,1)$, and $c^m \in (0,\infty)^m$ such that $\sum_{j=1}^m c_j > 1$. Let $Q_{K^m}$ be the actual CR distribution in a coding scheme for omniscient helper CR generation, using stochastic encoders and deterministic decoders (or stochastic decoders, if $c_j \le 1$, $j = 1,\dots,m$). Then
$$\frac12|Q_{K^m} - T_{K^m}| \ge 1 - \frac{1}{|\mathcal{K}|} - \frac{\prod_{j=1}^m |\mathcal{W}_j|^{\frac{c_j}{\sum c_i}}}{|\mathcal{K}|^{1-\frac{1}{\sum c_i}}}\exp\left(\frac{d_\delta(Q_{Y^m}, (Q_{Y_j}), c^m)}{\sum c_i}\right) - \delta, \tag{3.74}$$
where
$$T_{K^m}(k^m) := \frac{1}{|\mathcal{K}|}\mathbb{1}\{k_1 = \cdots = k_m\} \tag{3.75}$$
is the target CR distribution and $\mathcal{K}$ and $(\mathcal{W}_j)_{j=1}^m$ denote the CR and message alphabets.
Proof The $\delta = 0$ special case of the result was established in Theorem 2.3.2. Note that the argument can be extended to the case where the source probability distribution is replaced by an unnormalized measure (see Proposition 4.4.5, which holds for unnormalized measures). In other words, suppose $\mu_{Y^m}$ is an unnormalized measure for the source variables, and let $\mu_{K^m}$ denote the corresponding measure for the CR when the given encoders and decoders are applied. Then we have
$$E_1(T_{K^m}\|\mu_{K^m}) \ge 1 - \frac{1}{|\mathcal{K}|} - \frac{\prod_{l=1}^m |\mathcal{W}_l|^{\frac{c_l}{\sum c_i}}}{|\mathcal{K}|^{1-\frac{1}{\sum c_i}}}\exp\left(\frac{d(\mu_{Y^m}, (Q_{Y_j}), c^m)}{\sum c_i}\right). \tag{3.76}$$
Now Theorem 3.3.1 follows using
$$\frac12|Q_{K^m} - T_{K^m}| = E_1(T_{K^m}\|Q_{K^m}) \tag{3.77}$$
$$\ge E_1(T_{K^m}\|\mu_{K^m}) - E_1(Q_{K^m}\|\mu_{K^m}) \tag{3.78}$$
$$\ge E_1(T_{K^m}\|\mu_{K^m}) - E_1(Q_{Y^m}\|\mu_{Y^m}). \tag{3.79}$$
Remark 3.3.1 Let $Q_{KK^m}$ and
$$T_{KK^m}(k, k^m) := \frac{1}{|\mathcal{K}|}\mathbb{1}\{k = k_1 = \cdots = k_m\} \tag{3.80}$$
denote the actual and the target distributions of the CR generated by $\mathsf{T}_0, \mathsf{T}_1,\dots,\mathsf{T}_m$, respectively. Since
$$|Q_{KK^m} - T_{KK^m}| \ge |Q_{K^m} - T_{K^m}|, \tag{3.81}$$
Theorem 3.3.1 also provides a lower bound on $|Q_{KK^m} - T_{KK^m}|$. Actually, if the decoders are deterministic, $\mathsf{T}_0$ can always produce $K$ such that the two total variations are equal, because $\mathsf{T}_0$ is aware of the CR produced by the other terminals.

Remark 3.3.2 Allowing stochastic decoders can strictly lower $\frac12|Q_{K^m} - T_{K^m}|$: consider the special case where $m = 2$, $X$ and $Y$ are constant, and no messages are sent. Then the minimum $\frac12|Q_{K^m} - T_{K^m}|$ achieved by deterministic decoders is $1 - \frac{1}{|\mathcal{K}|}$. On the other hand, $\mathsf{T}_1$ and $\mathsf{T}_2$ can each independently output an equiprobable integer in $\{1,\dots,\sqrt{|\mathcal{K}|}\}$, achieving $\frac12|Q_{K^m} - T_{K^m}| = 1 - \frac{1}{\sqrt{|\mathcal{K}|}}$. Nevertheless, we can argue that allowing stochastic decoders can reduce the error by at most a factor of 4, as follows: suppose $\frac12|Q_{K^m} - T_{K^m}| \le \delta$ for some stochastic decoders; then $\frac12|Q_{KK_1} - T_{KK_1}| \le \delta$ and $Q(K_1 = K_2 = \cdots = K_m) \ge 1-\delta$. We can then remove the stochasticity of the decoders at $\mathsf{T}_2,\dots,\mathsf{T}_m$ while retaining the last two properties. This is possible since each $K_j$, $j = 2,\dots,m$, is independent of all other CR conditioned on the observation of $\mathsf{T}_j$, and we used the fact that the average is no greater than the maximum. Thus $\frac12|Q_{KK^m} - T_{KK^m}| \le 2\delta$ is achievable with deterministic decoders at $\mathsf{T}_2,\dots,\mathsf{T}_m$. Applying a similar argument again we can further remove the stochasticity of the decoder at $\mathsf{T}_1$, at the cost of another factor of 2.
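The two total-variation values in the $m = 2$ example of Remark 3.3.2 can be verified by exact computation (an illustrative check, not from the thesis; the helper function name is ours):

```python
from fractions import Fraction
from itertools import product

# Exact computation for m = 2, constant observations, no messages.
def tv_to_target(q, K):
    """(1/2) * sum |q(k1,k2) - T(k1,k2)| with T uniform on the diagonal."""
    total = Fraction(0)
    for k1, k2 in product(range(K), repeat=2):
        t = Fraction(1, K) if k1 == k2 else Fraction(0)
        total += abs(q.get((k1, k2), Fraction(0)) - t)
    return total / 2

K, r = 16, 4  # K = |K|, r = sqrt(|K|)

# Deterministic decoders: both terminals output the same fixed value.
q_det = {(0, 0): Fraction(1)}
assert tv_to_target(q_det, K) == 1 - Fraction(1, K)

# Stochastic decoders: independent uniform outputs on {0, ..., r-1}.
q_sto = {(k1, k2): Fraction(1, r * r) for k1 in range(r) for k2 in range(r)}
assert tv_to_target(q_sto, K) == 1 - Fraction(1, r)
```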
Using Theorem 3.3.1 and the definition of the BL divergence, we obtain the following:

Corollary 3.3.2 (Strong converse for omniscient helper CR generation) Suppose that $Q_{Y^m}$, $(Q_{Y_j|Y^m})$ and $c^m \in (0,\infty)^m$ satisfy
$$\liminf_{n\to\infty}\frac{1}{n}\, d_\delta\big(Q_{Y^m}^{\otimes n}, (Q_{Y_j}^{\otimes n}), c^m\big) \le d^\star\big(Q_{Y^m}, (Q_{Y_j}), c^m\big) \tag{3.82}$$
and that $(R, R_1,\dots,R_m) \in (0,\infty)^{m+1}$ satisfies
$$\left(\sum_j c_j - 1\right)R > d^\star\big(Q_{Y^m}, (Q_{Y_j}), c^m\big) + \sum_j c_j R_j. \tag{3.83}$$
Then, any omniscient helper CR generation scheme for the stationary memoryless source with per-letter distribution $Q_{Y^m}$, allowing stochastic encoders and decoders, at the rates $(R, R_1,\dots,R_m)$ must satisfy
$$\lim_{n\to\infty}\frac12|Q_{K^m} - T_{K^m}| = 1, \tag{3.84}$$
where $Q_{K^m}$ denotes the actual CR distribution and $T_{K^m}$ denotes the ideal distribution under which $K_1 = \cdots = K_m$ is equiprobable.
Note that (3.82) is satisfied in the settings discussed in the previous section. Moreover, in the case of finite alphabets or Gaussian distributions, the second-order upper bounds on the smooth BL divergence in the previous section imply second-order converses for the omniscient helper CR generation problem:

Corollary 3.3.3 (Second-order converse) Assume that the per-letter source distribution $Q_{Y^m}$ is either

1. $m$-dimensional Gaussian with a non-degenerate covariance matrix, or

2. a probability distribution on a finite alphabet.

Suppose that there is a sequence of CR generation schemes such that
$$\liminf_{n\to\infty}\frac12|Q_{K^m_n} - T_{K^m_n}| < 1. \tag{3.85}$$
Then,
$$\left(\sum_j c_j - 1\right)R_n \le \sum_j c_j R_{jn} + d^\star\big(Q_{Y^m}, (Q_{Y_j}), c^m\big) + O\!\left(\frac{1}{\sqrt{n}}\right) \tag{3.86}$$
where
$$R_n := \frac{1}{n}\log|\mathcal{K}|; \tag{3.87}$$
$$R_{jn} := \frac{1}{n}\log|\mathcal{W}_j|, \qquad j = 1,\dots,m \tag{3.88}$$
are the rates at blocklength $n$.
Chapter 4

The Pump

This chapter is the third part of the trilogy on the functional approach to information-theoretic converses (Chapters 2-4), and also one of the highlights of the thesis. We introduce the "pumping-up" argument, where we design an operator $T$ with the property that for any function $f$ taking values in $[0,1]$ whose 1-norm is not too small, the 0-norm of $Tf$ is not too small either. Markov semigroups provide an excellent source of such operators. This method is reminiscent of the philosophy of the classical blowing-up method (concentration of measure in the Hamming space), and can indeed substitute for the parts of strong converse proofs that used the blowing-up argument. However, in contrast to the blowing-up lemma, which yields a sub-optimal $O_\varepsilon(\sqrt{n}\log^{3/2}n)$ second-order term for discrete memoryless systems in the regime of non-vanishing error probability, the proposed method yields an $O\big(\sqrt{n\log\frac{1}{1-\varepsilon}}\big)$ bound on the second-order term which is sharp in both the blocklength $n$ and the error probability $\varepsilon \uparrow 1$. Moreover, the new approach extends beyond stationary memoryless settings: we discuss how it works under the assumptions of bounded probability density, Gaussian distribution, and processes with memory and weak correlation. The optimality and the wide applicability of the new method suggest that it may be the "right" approach to multiuser second-order converses, and perhaps, that the classical blowing-up method works for strong converse purposes simply because the blowing-up operation approximates the optimal semigroup operation.
While Chapters 2 and 3 proved the $O(\sqrt{n})$ second-order converse for the omniscient helper CR generation problem, the "pumping-up" machinery allows us to extend the result to the general one-communicator problem, which is a canonical example of the side-information problems.
4.1 Prelude: Binary Hypothesis Testing

In this section we review the classical blowing-up approach and then introduce the proposed pumping-up approach through the basic example of binary hypothesis testing. Although this is a rather simple setting where many alternative approaches are applicable, it is rich enough to demonstrate many features of the new approach.

A few words about the notation of this chapter: $\mathcal{H}_+(\mathcal{Y})$ denotes the set of nonnegative measurable functions on $\mathcal{Y}$, and $\mathcal{H}_{[0,1]}(\mathcal{Y})$ is the subset of $\mathcal{H}_+(\mathcal{Y})$ with range in $[0,1]$. For a measure $\nu$ and $f \in \mathcal{H}_+(\mathcal{Y})$, we write $\nu(f) := \int f\,d\nu$ and $\|f\|_p^p = \|f\|_{L^p(\nu)}^p = \int|f|^p\,d\nu$, while the measure of a set is denoted as $\nu[\mathcal{A}]$. A random transformation $Q_{Y|X}$, mapping measures on $\mathcal{X}$ to measures on $\mathcal{Y}$, is also viewed as an operator mapping $\mathcal{H}_+(\mathcal{Y})$ to $\mathcal{H}_+(\mathcal{X})$ according to $Q_{Y|X}(f) := \mathbb{E}[f(Y)\,|\,X = \cdot]$, where $(X,Y) \sim Q_XQ_{Y|X}$.
4.1.1 Review of the Blowing-Up Method, and Its Second-order Sub-optimality

Many converses in information theory (most notably, the meta-converse [34]) rely on the analysis of certain binary hypothesis tests (BHT). Consider probability distributions $P$ and $Q$ on $\mathcal{Y}$. Let $f \in \mathcal{H}_{[0,1]}(\mathcal{Y})$ be the probability of deciding the hypothesis $P$ upon observing $y$. Denote by $\pi_{P|Q} = Q(f)$ the probability that $P$ is decided when $Q$ is true, and vice versa for $\pi_{Q|P}$. By the data processing property of the relative entropy,
$$D(P\|Q) \ge d\big(\pi_{Q|P}\,\big\|\,1-\pi_{P|Q}\big) \tag{4.1}$$
$$\ge (1-\pi_{Q|P})\log\frac{1}{\pi_{P|Q}} - h(\pi_{Q|P}), \tag{4.2}$$
where $d(\cdot\|\cdot)$ is the binary relative entropy function on $[0,1]^2$. In the special case of product measures $P \leftarrow P^{\otimes n}$, $Q \leftarrow Q^{\otimes n}$, and $\pi_{Q^{\otimes n}|P^{\otimes n}} \le \varepsilon \in (0,1)$, (4.2) yields
$$\pi_{P^{\otimes n}|Q^{\otimes n}} \ge \exp\left(-\frac{nD(P\|Q)}{1-\varepsilon} - O(1)\right), \tag{4.3}$$
which is not sufficiently powerful to yield the converse part of the Chernoff-Stein lemma [3] (because of the factor $\frac{1}{1-\varepsilon}$).
In the case of deterministic tests ($f = 1_{\mathcal{A}}$ for some $\mathcal{A} \subseteq \mathcal{Y}^n$) we show how to improve (4.3) by means of a remarkable property enjoyed by product measures: a small blowup of a set of nonvanishing probability suffices to increase its probability to nearly 1 [27]. The following "modern version" is due to Marton [63]; see also [48, Lemma 3.6.2].

Lemma 4.1.1 (Blowing-up) Denote the $r$-blowup of $\mathcal{A} \subseteq \mathcal{Y}^n$ by
$$\mathcal{A}^r := \{v^n \in \mathcal{Y}^n : d_n(v^n, \mathcal{A}) \le r\}, \tag{4.4}$$
where $d_n$ is the Hamming distance on $\mathcal{Y}^n$. Then, for any $c > 0$,
$$P^{\otimes n}[\mathcal{A}^r] \ge 1 - e^{-c^2} \tag{4.5}$$
where
$$r = \sqrt{\frac{n}{2}}\left(\sqrt{\ln\frac{1}{P^{\otimes n}[\mathcal{A}]}} + c\right). \tag{4.6}$$
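As an illustrative numerical check (not from the thesis), Lemma 4.1.1 can be tested exactly for a Hamming ball, whose $r$-blowup is again a Hamming ball; all parameters below are arbitrary:

```python
import math

# Exact check of the blowing-up lemma (4.5)-(4.6) for a Hamming-ball set
# under an i.i.d. Bernoulli(theta) measure. For A = {x^n : weight <= w},
# the r-blowup is {x^n : weight <= w + r}.
def binom_cdf(n, theta, w):
    return sum(math.comb(n, k) * theta**k * (1 - theta)**(n - k)
               for k in range(0, min(w, n) + 1))

n, theta, w = 30, 0.5, 12
PA = binom_cdf(n, theta, w)            # P^n[A]
for c in (0.5, 1.0, 2.0):
    r = math.ceil(math.sqrt(n / 2) * (math.sqrt(math.log(1 / PA)) + c))
    assert binom_cdf(n, theta, w + r) >= 1 - math.exp(-c**2)
```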
Moreover, as every element of $\mathcal{A}^r$ is obtained from an element of $\mathcal{A}$ by changing at most $r$ coordinates, a simple counting argument [27, Lemma 5] shows that
$$Q^{\otimes n}[\mathcal{A}^r] \le C^r(r+1)\binom{n}{r}\,Q^{\otimes n}[\mathcal{A}] \tag{4.7}$$
$$= \exp\Big(nh\Big(\frac{r}{n}\Big) + O(r)\Big)\,Q^{\otimes n}[\mathcal{A}] \tag{4.8}$$
for $r \in \Omega(\sqrt{n})\cap o(n)$, where $C = \frac{|\mathcal{Y}|}{\min_y Q(y)}$.
Proposition 4.1.2 Assume $|\mathcal{Y}| < \infty$ and $\pi_{Q^{\otimes n}|P^{\otimes n}} \le \varepsilon \in (0,1)$. Then any deterministic test between $P^{\otimes n}$ and $Q^{\otimes n}$ on $\mathcal{Y}^n$ satisfies¹
$$\pi_{P^{\otimes n}|Q^{\otimes n}} \ge \exp\big(-nD(P\|Q) - O_\varepsilon(\sqrt{n}\log^{3/2}n)\big). \tag{4.9}$$

Proof Fix $\mathcal{A} \subseteq \mathcal{Y}^n$. If, in lieu of (4.1), we use
$$nD(P\|Q) \ge d\big(P^{\otimes n}[\mathcal{A}^r]\,\big\|\,Q^{\otimes n}[\mathcal{A}^r]\big), \tag{4.10}$$
then similarly to (4.3) we obtain that
$$Q^{\otimes n}[\mathcal{A}^r] \ge \exp\left(-\frac{nD(P\|Q)}{P^{\otimes n}[\mathcal{A}^r]} - O(1)\right). \tag{4.11}$$
Now, applying (4.5) and (4.8), and taking the (optimal) $r = \sqrt{\alpha n\log n}$ for any fixed $\alpha \in (\frac14,\infty)$, we have
$$P^{\otimes n}[\mathcal{A}^r] \ge 1 - o_\varepsilon(n^{-\frac12}); \tag{4.12}$$
$$Q^{\otimes n}[\mathcal{A}^r] \le \exp\big(O(\sqrt{n}\log^{3/2}n)\big)\,Q^{\otimes n}[\mathcal{A}] \tag{4.13}$$
$$= \exp\big(O(\sqrt{n}\log^{3/2}n)\big)\,\pi_{P^{\otimes n}|Q^{\otimes n}}. \tag{4.14}$$

¹The subscript in $O_\varepsilon$ indicates that the implicit constant in the big-$O$ notation depends on $\varepsilon$. Later we may omit this subscript when the discussion focuses on the asymptotics in $n$.
99
Then (4.9) follows from (4.11), (4.12) and (4.13).

Although Proposition 4.1.2 is sufficient to obtain the converse part of the Chernoff-Stein lemma, there are far easier and more general methods to accomplish that goal. Moreover, the sublinear term in the exponent of (4.9) is not optimal. Indeed, our aim is to lower bound $Q^{\otimes n}[\mathcal{A}]$ given $P^{\otimes n}[\mathcal{A}] \ge 1-\varepsilon$. By the Neyman-Pearson lemma, the extremal $\mathcal{A}$ is a sublevel set of $\imath_{P^{\otimes n}\|Q^{\otimes n}} := \log\frac{dP^{\otimes n}}{dQ^{\otimes n}}$, which is a sum of i.i.d. random variables under $P^{\otimes n}$. The optimal rate of the second-order term in (4.9) is thus the central limit theorem rate $O(\sqrt{n})$ (see for example the analysis in [146, Lemma 2.15] via the Chebyshev inequality). Such an optimal second-order rate is fundamentally beyond the blowing-up method: simple examples show that neither (4.5) nor (4.8) can be essentially improved, and the choice $r = \Theta(\sqrt{n\log n})$ is optimal. In other classical applications of BUL, such as the settings in Sections 4.2 and 4.3, the suboptimal $O(\sqrt{n}\log^{3/2}n)$ second-order term emerges for a similar reason.

It can be seen that the present setting of BHT is a special case of the change-of-measure problem in Section 4.3 by taking $\mathcal{X}$ to be a singleton. However, in contrast to BHT, the extremal set for the problem in Section 4.3 cannot be so simply characterized, and BUL appears to be the only previous method to achieve the second-order strengthening (to a suboptimal $O(\sqrt{n}\log^{3/2}n)$ order).
4.1.2 The Fundamental Sub-optimality of the Set-Inequality Approaches

Previously we argued that the blowing-up lemma only achieves the $\sqrt{n}\log^{3/2}n$ second-order rate. Here we remark that the optimal $\sqrt{n}$ rate is, in fact, fundamentally beyond the reach of the idea of transforming sets and applying the data processing argument (e.g. (4.1) and (4.10)): suppose, generalizing the blowing-up operation, we apply a certain transformation of the set $\mathcal{A} \mapsto \tilde{\mathcal{A}}$, and then proceed with the same argument as in Proposition 4.1.2. It is not hard to see that, in order to achieve the $\sqrt{n}$ rate, such a set transformation must have the property that for any $\mathcal{A}$ with $P^{\otimes n}[\mathcal{A}]$ bounded away from 0 and 1, say $P^{\otimes n}[\mathcal{A}] \in [\frac13,\frac23]$ for all $n$, we have

• Lower bound:
$$P^{\otimes n}[\tilde{\mathcal{A}}] \ge 1 - O(n^{-\frac12}). \tag{4.15}$$

• Upper bound:
$$Q^{\otimes n}[\tilde{\mathcal{A}}] \le \exp(O(\sqrt{n}))\,Q^{\otimes n}[\mathcal{A}]. \tag{4.16}$$

We remark that for proving a strong converse (i.e. an $o(n)$ second-order rate), it suffices to relax (4.15) and (4.16) to
$$P^{\otimes n}[\tilde{\mathcal{A}}] \ge 1 - o(1); \tag{4.17}$$
$$Q^{\otimes n}[\tilde{\mathcal{A}}] \le \exp(o(n))\,Q^{\otimes n}[\mathcal{A}], \tag{4.18}$$
which are of course satisfied by the blowing-up operation; see (4.12) and (4.13).

Now we will assume that (4.15) and (4.16) indeed hold, and we will reach a contradiction. Upon choosing $\mathcal{A}$ such that $P^{\otimes n}[\mathcal{A}] \in [\frac13,\frac23]$ and $Q^{\otimes n}[\mathcal{A}] = \exp(-nD(P\|Q) + O(\sqrt{n}))$ we get
$$P^{\otimes n}[\tilde{\mathcal{A}}] \ge 1 - O(n^{-\frac12}), \tag{4.19}$$
$$Q^{\otimes n}[\tilde{\mathcal{A}}] \le \exp\big(-nD(P\|Q) + O(\sqrt{n})\big). \tag{4.20}$$
However, under the constraint (4.19), $Q^{\otimes n}[\tilde{\mathcal{A}}]$ is minimized by $\tilde{\mathcal{A}}$ of the form
$$\tilde{\mathcal{A}} := \big\{x^n : \imath_{P^{\otimes n}\|Q^{\otimes n}}(x^n) > nD(P\|Q) - C\sqrt{n\ln n}\big\} \tag{4.21}$$
for some $C > 0$, which can be seen from the Neyman-Pearson lemma and the Gaussian approximation of a normalized i.i.d. sum. However, for any $C' \in (0,C)$,
$$Q^{\otimes n}[\tilde{\mathcal{A}}] \ge P\Big[C'\sqrt{n\ln n} < nD(P\|Q) - \imath_{P^{\otimes n}\|Q^{\otimes n}}(Y^n) < C\sqrt{n\ln n}\Big] \tag{4.22}$$
$$\ge \exp\big(-nD(P\|Q) + C'\sqrt{n\ln n}\big)\cdot P\Big[C'\sqrt{n\ln n} < nD(P\|Q) - \imath_{P^{\otimes n}\|Q^{\otimes n}}(X^n) < C\sqrt{n\ln n}\Big] \tag{4.23}$$
where $X^n \sim P^{\otimes n}$ and $Y^n \sim Q^{\otimes n}$. We can show [147, Lemma 8.1] that the second factor in (4.23) is of the order of $\exp(-O(\log n))$. Then (4.23) contradicts (4.20). In fact, by refining the above analysis we see that the best possible second-order rate that can be obtained from the data processing argument is $\sqrt{n\log n}$.
4.1.3 The New Pumping-up Approach

In this subsection we propose the pumping-up approach, which is a gentler alternative to BUL that achieves the optimal $O(\sqrt{n})$ second-order rate, while retaining the applicability of BUL to multiuser information theory problems. Instead of applying a set transformation $\mathcal{A} \mapsto \tilde{\mathcal{A}}$, we will apply a linear transformation $T$ to a function $f \in \mathcal{H}_{[0,1]}$ (which can be thought of as the indicator function of $\mathcal{A}$, but not necessarily). In contrast to (4.15) and (4.16), which cannot be achieved simultaneously as we have shown, we will find $T$ with the following properties:

• Lower bound: for any $f \in \mathcal{H}_{[0,1]}(\mathcal{Y})$ with, say², $P^{\otimes n}(f) \ge \frac12$,
$$\|Tf\|_{L^0(P^{\otimes n})} \ge \exp(-O(\sqrt{n})). \tag{4.24}$$

²We define $\|f\|_{L^0(P)} := \lim_{q\downarrow 0}\|f\|_{L^q(P)} = \exp(P(\ln f))$.

• Upper bound: for any $f \in \mathcal{H}_{[0,1]}(\mathcal{Y})$,
$$Q^{\otimes n}(Tf) \le \exp(O(\sqrt{n}))\,Q^{\otimes n}(f). \tag{4.25}$$

We now explain how (4.24) and (4.25) can lead to an $O(\sqrt{n})$ second-order converse. Instead of applying the data processing argument as in (4.1) and (4.10), we note that, by the variational formula for the relative entropy (e.g. [48, (2.4.67)])³,
$$D(P\|Q) \ge P(\ln g) - \ln Q(g), \qquad \forall g \in \mathcal{H}_+(\mathcal{Y}). \tag{4.26}$$
Taking $\mathcal{Y} \leftarrow \mathcal{Y}^n$, $P \leftarrow P^{\otimes n}$ and $Q \leftarrow Q^{\otimes n}$ in (4.26), we see that (4.24) and (4.25) lead to
$$Q^{\otimes n}(f) \ge e^{-nD(P\|Q)-O(\sqrt{n})}. \tag{4.27}$$

³While the bases of the information-theoretic quantities, $\log$ and $\exp$, were arbitrary (but consistent) up to here (see the convention in [43]), henceforth they are natural.

Next, we need to find the operator $T$ satisfying (4.24) and (4.25). Markov semigroups appear to be tailor-made for this purpose. We say $(T_t)_{t\ge0}$ is a simple semigroup⁴ with stationary measure $P$ if
$$T_t : \mathcal{H}_+(\mathcal{Y}) \to \mathcal{H}_+(\mathcal{Y}), \qquad f \mapsto e^{-t}f + (1-e^{-t})P(f). \tag{4.28}$$
In the i.i.d. case $P \leftarrow P^{\otimes n}$ we consider their tensor product
$$T_t := \big[e^{-t} + (1-e^{-t})P\big]^{\otimes n}. \tag{4.29}$$

⁴Markov semigroups arise from the study of Markov processes. For example, the semigroup in (4.29) has the following Markov process interpretation: a particle sits at $x^n$ at time $t = 0$. Whenever a Poisson clock of rate $n$ ticks, we equiprobably select one coordinate among $\{1,\dots,n\}$ and replace its value by a random sample according to $P$. Let $P_{x^n,t}$ be the distribution at time $t$. Then $(T_tf)(x^n) := P_{x^n,t}(f)$ for any $f$.
Note that (4.24) implies that $T$ is a positivity-improving operator; that is, it maps a nonnegative, not identically zero function to a strictly positive function. The positivity-improving property of $T_t$ is precisely quantified by the reverse hypercontractivity phenomenon discovered by Borell [148] and recently generalized by Mossel et al. [77].

Theorem 4.1.3 [77] Let $(T_t)_{t\ge0}$ be a simple semigroup (defined in (4.28)) or an arbitrary tensor product of simple semigroups. Then for all $0 < q < p < 1$, $f \in \mathcal{H}_+$ and $t \ge \ln\frac{1-q}{1-p}$,
$$\|T_tf\|_q \ge \|f\|_p. \tag{4.30}$$
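Theorem 4.1.3 can be probed numerically for a single simple semigroup on a small finite space; this is an illustrative sketch (the stationary measure and exponents are arbitrary choices, not from the thesis):

```python
import numpy as np

# Numerical check of the reverse hypercontractive inequality (4.30),
#   ||T_t f||_q >= ||f||_p for 0 < q < p < 1 and t >= ln((1-q)/(1-p)),
# for a simple semigroup on a 3-point space with stationary measure P.
rng = np.random.default_rng(1)
P = np.array([0.2, 0.5, 0.3])

def lp_norm(f, p, P):
    return (P @ f**p) ** (1.0 / p)

def T(f, t, P):
    # simple semigroup (4.28): T_t f = e^{-t} f + (1 - e^{-t}) P(f)
    return np.exp(-t) * f + (1 - np.exp(-t)) * (P @ f)

q, p = 0.3, 0.7
t = np.log((1 - q) / (1 - p))          # the critical time in Theorem 4.1.3
for _ in range(100):
    f = rng.uniform(0.01, 1.0, size=3) # strictly positive test functions
    assert lp_norm(T(f, t, P), q, P) >= lp_norm(f, p, P) - 1e-12
```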
When $T_t$ is defined by (4.29), for any $f \in \mathcal{H}_{[0,1]}(\mathcal{Y})$,
$$P^{\otimes n}(\ln T_tf) = \ln\|T_tf\|_{L^0(P^{\otimes n})} \tag{4.31}$$
$$\ge \ln\|f\|_{L^{1-e^{-t}}(P^{\otimes n})} \tag{4.32}$$
$$\ge \frac{1}{1-e^{-t}}\ln P^{\otimes n}(f) \tag{4.33}$$
$$\ge \left(1+\frac{1}{t}\right)\ln P^{\otimes n}(f), \tag{4.34}$$
where (4.34) follows from $e^t \ge 1+t$. On the other hand, instead of the counting argument (4.8), in the present setting we use
$$Q^{\otimes n}(T_tf) = Q^{\otimes n}\big((e^{-t}+(1-e^{-t})P)^{\otimes n}f\big) \tag{4.35}$$
$$\le Q^{\otimes n}\big((e^{-t}+\alpha(1-e^{-t})Q)^{\otimes n}f\big) \tag{4.36}$$
$$= \big(e^{-t}+\alpha(1-e^{-t})\big)^n\,Q^{\otimes n}(f) \tag{4.37}$$
$$\le e^{(\alpha-1)nt}\,Q^{\otimes n}(f), \tag{4.38}$$
for all $f \in \mathcal{H}_+(\mathcal{Y})$, where $\alpha := \big\|\frac{dP}{dQ}\big\|_\infty \ge 1$.
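The operator bound (4.38) can be verified exactly on a toy product space by tensor contraction (an illustrative sketch with arbitrary parameters):

```python
import numpy as np

# Exact verification of (4.38):
#   Q^{tensor n}(T_t f) <= exp((alpha - 1) n t) Q^{tensor n}(f)
# on {0,1}^n, for the tensorized simple semigroup (4.29).
P = np.array([0.7, 0.3])
Q = np.array([0.4, 0.6])
alpha = np.max(P / Q)                  # || dP/dQ ||_inf
n, t = 3, 0.5
rng = np.random.default_rng(0)

def apply_Tt(f, t):
    """Apply e^{-t} I + (1 - e^{-t}) P coordinate by coordinate."""
    g = f.copy()
    for axis in range(g.ndim):
        mean = np.tensordot(P, g, axes=([0], [axis]))  # P-average of axis
        g = np.exp(-t) * g + (1 - np.exp(-t)) * np.expand_dims(mean, axis)
    return g

def expect_Qn(g):
    """Integrate g against the product measure Q^{tensor n}."""
    for _ in range(g.ndim):
        g = np.tensordot(Q, g, axes=([0], [0]))
    return float(g)

for _ in range(50):
    f = rng.uniform(0, 1, size=(2,) * n)
    assert expect_Qn(apply_Tt(f, t)) <= np.exp((alpha - 1) * n * t) * expect_Qn(f) + 1e-12
```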
Theorem 4.1.4 If $\pi_{Q^{\otimes n}|P^{\otimes n}} \le \varepsilon \in (0,1)$, any (possibly stochastic) test between $P^{\otimes n}$ and $Q^{\otimes n}$ satisfies
$$\pi_{P^{\otimes n}|Q^{\otimes n}} \ge (1-\varepsilon)\exp\left(-nD(P\|Q) - 2\sqrt{n\left\|\frac{dP}{dQ}\right\|_\infty\ln\frac{1}{1-\varepsilon}}\right). \tag{4.39}$$

Proof Take $g \leftarrow T_tf$ in (4.26), with $f(y)$ the probability that the test decides $P$ upon observing $y$. Apply (4.34) and (4.38). The result follows by optimizing over $t > 0$.

Remark 4.1.1 In the setting of Theorem 4.1.4, the likelihood ratio test is optimal by the Neyman-Pearson lemma, and using the central limit theorem it is easy to show (see e.g. [50, Proposition 2.3]) that
$$\pi_{P^{\otimes n}|Q^{\otimes n}} \ge \exp\Big(-nD(P\|Q) + \sqrt{n\operatorname{Var}(\imath_{P\|Q}(X))}\,Q^{-1}(\varepsilon) - o(\sqrt{n})\Big) \tag{4.40}$$
where $X \sim P$ and $Q^{-1}$ denotes the inverse of the Gaussian complementary distribution function; this is tight up to the $\sqrt{n}$ second-order term. Since $Q^{-1}(\varepsilon) \sim -\sqrt{2\ln\frac{1}{1-\varepsilon}}$ as $\varepsilon \uparrow 1$, we see that the second-order term in (4.39) is tight up to a constant factor depending only on $P$ and $Q$ for $\varepsilon$ close to 1.
Remark 4.1.2 It is instructive to compare the blowing-up and semigroup operations. Note that
$$1_{\mathcal{A}^r}(x) = \sup_{|S|\le r}\ \sup_{(z_i)_{i\in S}} 1_{\mathcal{A}}\big((z_i)_{i\in S}, (x_i)_{i\in S^c}\big), \tag{4.41}$$
while expanding (4.29) gives (cf. (4.51))
$$T_t1_{\mathcal{A}}(x) = \mathbb{E}\big[1_{\mathcal{A}}\big((X_i)_{i\in S}, (x_i)_{i\in S^c}\big)\big] \tag{4.42}$$
where the set $S$ is uniformly distributed over sets of size $|S| \sim \mathrm{Binom}(n, 1-e^{-t})$ and $X_i \sim P$ are i.i.d. Thus the semigroup operation can be viewed as an "average" counterpart of blowing-up (with $r \approx n(1-e^{-t})$). In contrast to the maximum, averaging increases the small values of $f$ (positivity-improving) while preserving the total mass $P^{\otimes n}(T_tf) = P^{\otimes n}(f)$, so that $Q^{\otimes n}(T_tf)$ does not increase too much.

Figure 4.1: Schematic comparison of $1_{\mathcal{A}}$, $1_{\mathcal{A}^{nt}}$ and $T_t^{\otimes n}1_{\mathcal{A}}$, plotted against the Hamming weight of $x^n$, where $\mathcal{A}$ is a Hamming ball.
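The subset-resampling representation (4.42) can be checked exactly on a small example by expanding the tensor product (an illustrative sketch with arbitrary parameters):

```python
import numpy as np
from itertools import product, combinations

# Exact check of (4.42): for the tensorized simple semigroup (4.29),
# T_t f(x) = sum over subsets S of e^{-t(n-|S|)}(1-e^{-t})^{|S|} times
# the P-average of f with the coordinates in S resampled.
P = np.array([0.6, 0.4])
n, t = 3, 0.8
a, b = np.exp(-t), 1 - np.exp(-t)

def f(x):                       # arbitrary test function on {0,1}^3
    return float((x[0] ^ x[1]) | x[2])

x = (1, 0, 1)

# direct product-operator evaluation, coordinate by coordinate
g = np.zeros((2, 2, 2))
for idx in product(range(2), repeat=n):
    g[idx] = f(idx)
for axis in range(n):
    mean = np.tensordot(P, g, axes=([0], [axis]))
    g = a * g + b * np.expand_dims(mean, axis)
direct = g[x]

# subset-expansion evaluation
expanded = 0.0
for k in range(n + 1):
    for S in combinations(range(n), k):
        w = a ** (n - k) * b ** k
        e = 0.0
        for zs in product(range(2), repeat=k):
            y, pz = list(x), 1.0
            for i, z in zip(S, zs):
                y[i] = z
                pz *= P[z]
            e += pz * f(tuple(y))
        expanded += w * e

assert abs(direct - expanded) < 1e-12
```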
We note that the new method draws on a philosophy similar to BUL, but enjoys the following advantages:

• It achieves the optimal $O\big(\sqrt{n\ln\frac{1}{1-\varepsilon}}\big)$ second-order rate, which is sharp in both $n$ (as $n\to\infty$) and $\varepsilon$ (as $\varepsilon\to1$);

• It is purely measure-theoretic in nature: finiteness of the alphabet is sufficient, but not necessary, for Theorem 4.1.4. In fact, the boundedness of the Radon-Nikodym derivative is sometimes not necessary if we apply semigroups other than (4.29): see Section 4.2.2 for a Gaussian example. In contrast, while analogues of the blowing-up property (4.5) exist for many other measures, no analogue of the counting estimate (4.8) can hold in continuous settings, as the blow-up of a set of measure zero may have positive measure.

A comparison between the BUL method and the pumping-up method is given in Table 4.1.
                                    BUL approach          Pumping-up approach
 Connecting information measures    Data processing       Convex duality
 and observables                    property
 Lower bound with respect to        Blowing-up lemma      Reverse hypercontractivity
 a given measure
 Upper bound with respect to        A counting            Change of measure (4.36)
 the reference measure              argument (4.8)        (bounded density case);
                                                          scaling argument (4.61)
                                                          (Gaussian case)

Table 4.1: Comparison of the basic ideas in BUL and the new methodology
4.2 Optimal Second-order Fano's Inequality

4.2.1 Bounded Probability Density Case

Consider a random transformation $P_{Y|X}$. For any $x \in \mathcal{X}$, denote by $(T_{x,t})_{t\ge0}$ the simple Markov semigroup
$$T_{x,t} := e^{-t} + (1-e^{-t})P_{Y|X=x}. \tag{4.43}$$
Motivated by the steps (4.35)-(4.38), for $\alpha \in [1,\infty)$, $t \in [0,\infty)$ and a probability measure $\nu$ on $\mathcal{Y}$, define a linear operator $\Lambda^\nu_{\alpha,t} : \mathcal{H}_+(\mathcal{Y}^n) \to \mathcal{H}_+(\mathcal{Y}^n)$ by
$$\Lambda^\nu_{\alpha,t} := \prod_{i=1}^n\big(e^{-t} + \alpha(1-e^{-t})\nu^{(i)}\big) \tag{4.44}$$
where $\nu^{(i)} : \mathcal{H}(\mathcal{Y}^n) \to \mathcal{H}(\mathcal{Y}^n)$ is the linear operator that integrates the $i$-th coordinate with respect to $\nu$. Since $\nu^{(i)}1 = 1$, we see from the binomial formula that
$$\Lambda^\nu_{\alpha,t}1 = \sum_{k=0}^n e^{-(n-k)t}\alpha^k(1-e^{-t})^k\binom{n}{k} \tag{4.45}$$
$$= \big(e^{-t}+\alpha(1-e^{-t})\big)^n \tag{4.46}$$
$$\le e^{(\alpha-1)nt}. \tag{4.47}$$
Lemma 4.2.1 Fix $(P_{Y|X}, \nu, t)$. Suppose that
$$\alpha := \sup_x\left\|\frac{dP_{Y|X=x}}{d\nu}\right\|_\infty \in [1,\infty); \tag{4.48}$$
$$T_{x^n,t} := \otimes_{i=1}^n T_{x_i,t}. \tag{4.49}$$
Then for $n \ge 1$ and $f \in \mathcal{H}_+(\mathcal{Y}^n)$,
$$\sup_{x^n} T_{x^n,t}f \le \Lambda^\nu_{\alpha,t}f. \tag{4.50}$$

Proof For any $x^n \in \mathcal{X}^n$, observe that
$$T_{x^n,t}f = \prod_{i=1}^n\big[e^{-t}+(1-e^{-t})P^{(i)}_{Y|X=x_i}\big]f \tag{4.51}$$
$$= \sum_{S\subseteq\{1,\dots,n\}} e^{-|S^c|t}(1-e^{-t})^{|S|}\left(\prod_{i\in S}P^{(i)}_{Y|X=x_i}\right)(f).$$
The result then follows from (4.44) and (4.48).
Theorem 4.2.2 Fix $P_{Y|X}$ and positive integers $n$ and $M$. Suppose (4.48) holds for some probability measure $\nu$ on $\mathcal{Y}$. If there exist $c_1,\dots,c_M \in \mathcal{X}^n$ and disjoint $\mathcal{D}_1,\dots,\mathcal{D}_M \subseteq \mathcal{Y}^n$ such that
$$\prod_{m=1}^M P_{Y^n|X^n=c_m}^{\frac{1}{M}}[\mathcal{D}_m] \ge 1-\varepsilon, \tag{4.52}$$
then
$$I(X^n;Y^n) \ge \ln M - 2\sqrt{(\alpha-1)n\ln\frac{1}{1-\varepsilon}} - \ln\frac{1}{1-\varepsilon}, \tag{4.53}$$
where $X^n$ is equiprobable on $\{c_1,\dots,c_M\}$, $Y^n$ is its output through $P_{Y^n|X^n} := P_{Y|X}^{\otimes n}$, and $\alpha$ is defined in (4.48).
Proof Let $f_m := 1_{\mathcal{D}_m}$, $m = 1,\dots,M$. Fix some $t > 0$ to be optimized later. Observe that
$$I(X^n;Y^n) = \frac{1}{M}\sum_{m=1}^M D(P_{Y^n|X^n=c_m}\|P_{Y^n}) \tag{4.54}$$
$$\ge \frac{1}{M}\sum_{m=1}^M P_{Y^n|X^n=c_m}\big(\ln\Lambda^\nu_{\alpha,t}f_m\big) - \frac{1}{M}\sum_{m=1}^M \ln P_{Y^n}\big(\Lambda^\nu_{\alpha,t}f_m\big) \tag{4.55}$$
where (4.55) is from the variational formula (4.26). We can lower bound the first term of (4.55) by
$$\frac{1}{M}\sum_{m=1}^M \ln\big\|\Lambda^\nu_{\alpha,t}f_m\big\|_{L^0(P_{Y^n|X^n=c_m})} \ge \frac{1}{M}\sum_{m=1}^M \ln\big\|T_{c_m,t}f_m\big\|_{L^0(P_{Y^n|X^n=c_m})} \tag{4.56}$$
$$\ge -\left(\ln\frac{1}{1-\varepsilon}\right)\left(1+\frac{1}{t}\right), \tag{4.57}$$
where (4.56)-(4.57) follow similarly to (4.31)-(4.34). For the second term on the right side of (4.55), using Jensen's inequality and (4.47),
$$-\frac{1}{M}\sum_{m=1}^M \ln P_{Y^n}\big(\Lambda^\nu_{\alpha,t}f_m\big) \ge -\ln P_{Y^n}\left(\frac{1}{M}\sum_{m=1}^M \Lambda^\nu_{\alpha,t}f_m\right) \tag{4.58}$$
$$\ge \ln M - (\alpha-1)nt. \tag{4.59}$$
The result then follows by optimizing $t$.
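The final optimization over $t$, left implicit above, can be carried out as follows:

```latex
\[
I(X^n;Y^n) \ \ge\ \ln M - (\alpha-1)nt - \Big(1+\frac{1}{t}\Big)\ln\frac{1}{1-\varepsilon};
\]
\[
\text{choosing } t = \sqrt{\frac{\ln\frac{1}{1-\varepsilon}}{(\alpha-1)n}}
\ \text{ equalizes the two } t\text{-dependent penalties, giving }
(\alpha-1)nt + \frac{1}{t}\ln\frac{1}{1-\varepsilon}
= 2\sqrt{(\alpha-1)n\ln\frac{1}{1-\varepsilon}},
\]
```

which is exactly the second-order term in (4.53).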
Remark 4.2.1 If $|\mathcal{Y}| < \infty$ then (4.48) holds with $\nu$ equiprobable and $\alpha = |\mathcal{Y}|$. The "geometric average criterion" in (4.52) is weaker than the maximal error criterion but stronger than the average error criterion. Under the average error criterion, one cannot expect a bound like $I(X^n;Y^n) \ge \ln M - o(n)$, since it would contradict [149, Remark 5]. In [150] we introduce the notion of "$\lambda$-decodability", which subsumes the geometric average criterion, the average error criterion, and the maximum error criterion as the $\lambda = 0, 1, -\infty$ special cases, and show that $\lambda = 0$ is the critical value for such a strengthening of the Fano inequality to hold.

Remark 4.2.2 Using the blowing-up lemma, one can get a weaker version of Theorem 4.2.2 in the case of finite alphabets under the maximal error criterion, with the $O(\sqrt{n})$ second-order term replaced by $O(\sqrt{n}\log^{3/2}n)$. This is essentially contained in the analysis in [27], [63]; see also [149, Theorem 6].
4.2.2 Gaussian Case

We now find an analogue of Theorem 4.2.2 for Gaussian channels, i.e., $P_{Y|X=x} = \mathcal{N}(x,1)$. The previous argument does not immediately carry through, due to the bounded density assumption (4.48). To surmount this problem, it will be convenient to replace (4.49) in the Gaussian setting by the Ornstein-Uhlenbeck semigroup with stationary measure $\mathcal{N}(x^n, \mathbf{I}_n)$:
$$T_{x^n,t}f(y^n) := \mathbb{E}\big[f\big(e^{-t}y^n + (1-e^{-t})x^n + \sqrt{1-e^{-2t}}\,V^n\big)\big] \tag{4.60}$$
for $f \in \mathcal{H}_+(\mathbb{R}^n)$, where $V^n \sim \mathcal{N}(0^n, \mathbf{I}_n)$. In this setting, (4.30) holds under the even weaker assumption $t \ge \frac12\ln\frac{1-q}{1-p}$ [148].

Figure 4.2: Illustration of the action of $T_{x^n,t}$. The original function (an indicator function) is convolved with a Gaussian measure and then dilated (with center $x^n$).
The proof proceeds a little differently from the discrete case. Here, the analogue of $\Lambda^\nu_{\alpha,t}$ in (4.44) is simply $T_{0^n,t}$. Instead of Lemma 4.2.1, we will exploit a simple change-of-variable formula: for any $f \ge 0$, $t > 0$ and $x^n \in \mathbb{R}^n$, we have
$$P_{Y^n|X^n=x^n}(\ln T_{0^n,t}f) = P_{Y^n|X^n=e^{-t}x^n}(\ln T_{e^{-t}x^n,t}f), \qquad (4.61)$$
which can be verified from the definition in (4.60). For later applications in broadcast channels, we consider a slight extension of the setting of Theorem 4.2.2 to allow stochastic encoders.
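Identity (4.61) can be checked in closed form in dimension $n=1$ on the exponential test function $f(y)=e^{ay}$, for which the Gaussian average in (4.60) is explicit (a sanity-check sketch, not part of the thesis; the numeric values of $x$, $t$, $a$ are arbitrary):

```python
import math

def lnT(x_center, t, a, y):
    """ln of the Ornstein-Uhlenbeck action (4.60) in dimension n=1 on the
    test function f(y) = exp(a*y): the Gaussian average is computed exactly
    via the moment generating function of V ~ N(0,1), so ln T is affine in y."""
    return (a * (math.exp(-t) * y + (1 - math.exp(-t)) * x_center)
            + 0.5 * a**2 * (1 - math.exp(-2 * t)))

def E_lnT_under_channel(x_center, channel_mean, t, a):
    """E_{Y ~ N(channel_mean, 1)}[ln T_{x_center,t} f(Y)] for f = exp(a*.).
    Since ln T is affine in y, the expectation is ln T evaluated at the mean."""
    return lnT(x_center, t, a, channel_mean)

# Verify the change-of-variable identity (4.61):
#   P_{Y|X=x}(ln T_{0,t} f) = P_{Y|X=e^{-t}x}(ln T_{e^{-t}x, t} f)
x, t, a = 1.7, 0.3, 0.8
lhs = E_lnT_under_channel(0.0, x, t, a)
rhs = E_lnT_under_channel(math.exp(-t) * x, math.exp(-t) * x, t, a)
print(abs(lhs - rhs))  # difference is at floating-point rounding level
```

Both sides reduce to $a e^{-t}x + \frac{a^2}{2}(1-e^{-2t})$, confirming the identity for this family of test functions.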
Theorem 4.2.3 Let $P_{Y|X=x} = \mathcal{N}(x,\sigma^2)$. Assume that there exist
$$\varphi\colon \{1,\dots,M\}\times\{1,\dots,L\} \to \mathbb{R}^n, \qquad (4.62)$$
and disjoint sets $(D_w)_{w=1}^{M}$ such that
$$\prod_{w,v} P^{\frac{1}{ML}}_{Y^n|X^n=\varphi(w,v)}[D_w] \ge 1-\varepsilon \qquad (4.63)$$
where $\varepsilon\in(0,1)$. Then
$$I(W;Y^n) \ge \ln M - \sqrt{2n\ln\frac{1}{1-\varepsilon}} - \ln\frac{1}{1-\varepsilon} \qquad (4.64)$$
where $(W,V)$ is equiprobable on $\{1,\dots,M\}\times\{1,\dots,L\}$, $X^n := \varphi(W,V)$, and $Y^n$ is the output from $P_{Y^n|X^n} := P_{Y|X}^{\otimes n}$.
Remark 4.2.3 Under the maximal error probability criterion, one can also derive a second-order strengthening of the Gaussian Fano inequality (4.64) by applying Poincaré's inequality to an information spectrum bound; see [149, Theorem 8].
Proof By the scaling invariance of the bound it suffices to consider the case of $\sigma^2 = 1$. Let $f_w = 1_{D_w}$ for $w\in\{1,\dots,M\}$. Put $\bar X^n = e^t X^n$ and $\bar Y^n$ the corresponding output from the same channel. Note that for each $w$,
$$D(P_{\bar Y^n|W=w}\|P_{\bar Y^n}) \ge \frac{1}{L}\sum_{v=1}^{L} P_{Y^n|X^n=e^t\varphi(w,v)}(\ln T_{0^n,t}f_w) - \ln P_{\bar Y^n}(T_{0^n,t}f_w) \qquad (4.65)$$
$$= \frac{1}{L}\sum_{v=1}^{L} P_{Y^n|X^n=\varphi(w,v)}(\ln T_{\varphi(w,v),t}f_w) - \ln P_{\bar Y^n}(T_{0^n,t}f_w) \qquad (4.66)$$
where the key step (4.66) used (4.61). The summand in the first term of (4.66) can be bounded using reverse hypercontractivity for the Ornstein–Uhlenbeck semigroup (4.60) as
$$P_{Y^n|X^n=\varphi(w,v)}(\ln T_{\varphi(w,v),t}f_w) \ge \frac{1}{1-e^{-2t}}\ln P_{Y^n|X^n=\varphi(w,v)}(f_w);$$
thus from the assumption (4.63),
$$\frac{1}{ML}\sum_{w,v} P_{Y^n|X^n=e^t\varphi(w,v)}(\ln T_{0^n,t}f_w) \ge -\frac{1}{1-e^{-2t}}\ln\frac{1}{1-\varepsilon}.$$
On the other hand, by Jensen's inequality,
$$\frac{1}{M}\sum_{w=1}^{M}\ln P_{\bar Y^n}(T_{0^n,t}f_w) \le \ln P_{\bar Y^n}\left(T_{0^n,t}\frac{1}{M}\sum_{w=1}^{M}f_w\right) \le \ln\frac{1}{M}$$
where the last step uses $\sum_{w=1}^{M} f_w \le 1$. Thus, taking $\frac{1}{M}\sum_{w=1}^{M}$ on both sides of (4.66), we find
$$I(W;\bar Y^n) \ge \ln M - \frac{1}{1-e^{-2t}}\ln\frac{1}{1-\varepsilon}. \qquad (4.67)$$
Moreover, letting $G^n \sim \mathcal{N}(0^n, I_n)$ be independent of $X^n$,
$$h(\bar Y^n) = h(e^t X^n + G^n) \qquad (4.68)$$
$$= h(X^n + e^{-t}G^n) + nt \qquad (4.69)$$
$$\le h(Y^n) + nt, \qquad (4.70)$$
where (4.70) can be seen from the entropy power inequality. On the other hand, for each $w$ we have
$$h(\bar Y^n|W=w) - h(Y^n|W=w) = I(\bar Y^n;X^n|W=w) - I(Y^n;X^n|W=w) \ge 0 \qquad (4.71)$$
where (4.71) can be seen from [151, Theorem 1]. The result follows from (4.67), (4.70), (4.71) and by optimizing $t$.
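The entropy computations (4.68)-(4.70) can be sanity-checked in closed form when the input itself is Gaussian (a toy special case, not the setting of the theorem), since then every differential entropy is of the form $\frac12\ln(2\pi e v)$:

```python
import math

# Closed-form check (Gaussian input, n=1; illustration only, values arbitrary)
# of the identities behind (4.68)-(4.70): with X ~ N(0, s), G ~ N(0, 1)
# independent,
#   h(e^t X + G) = h(X + e^{-t} G) + t,  and  h(X + e^{-t} G) <= h(X + G),
# using h(N(0, v)) = 0.5 * ln(2*pi*e*v).
def h_gauss(v):
    return 0.5 * math.log(2 * math.pi * math.e * v)

s, t = 2.0, 0.4
lhs = h_gauss(math.exp(2 * t) * s + 1)           # h(e^t X + G)
mid = h_gauss(s + math.exp(-2 * t)) + t          # h(X + e^{-t} G) + t
assert abs(lhs - mid) < 1e-12                    # scaling identity (4.68)-(4.69)
assert h_gauss(s + math.exp(-2 * t)) <= h_gauss(s + 1)  # EPI-type step (4.70)
print("entropy identities verified")
```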
4.3 Optimal Second-order Change-of-Measure
We now revisit a change-of-measure problem considered in [27] in proving the strong converse of source coding with side information (cf. Section 4.4.3). For a "combinatorial" formulation, which amounts to assigning an equiprobable distribution on the strongly typical set, see [8, Theorem 15.10]. Variations on such an idea have been applied to the "source and channel networks" in [8, Chapter 16] under the name "image-size characterizations". Given a random transformation $Q_{Y|X}$, measures $\nu$ on $\mathcal{Y}$ and $\mu_n$ on $\mathcal{X}^n$, we want to lower bound the $\nu^{\otimes n}$-measure of $A \subseteq \mathcal{Y}^n$ in terms of the $\mu_n$-measure of its "$\varepsilon$-preimage" under $Q_{Y^n|X^n} := Q_{Y|X}^{\otimes n}$. More precisely, for given $c > 0$, $\varepsilon\in(0,1)$, find an upper bound on
$$\sup_{A}\left\{\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}[A] > 1-\varepsilon] - c\ln\nu^{\otimes n}[A]\right\}.$$
Figure 4.3: A set $A \subseteq \mathcal{Y}$ and its "preimage" $\{x : Q_{Y|X=x}[A] \ge 1-\varepsilon\}$ under $Q_{Y|X}$ in $\mathcal{X}$.
Definition 4.3.1 Fix $\mu$ on $\mathcal{X}$, $\nu$ on $\mathcal{Y}$, and $Q_{Y|X}$. For $c\in(0,\infty)$ and $P_X \to Q_{Y|X} \to P_Y$,
$$d(\mu, Q_{Y|X}, \nu, c) := \sup_{P_X : P_X \ll \mu}\{cD(P_Y\|\nu) - D(P_X\|\mu)\} \qquad (4.72)$$
$$= \sup_{f\in\mathcal{H}_+(\mathcal{Y})}\left\{\ln\mu\left(e^{cQ_{Y|X}(\ln f)}\right) - c\ln\nu(f)\right\}, \qquad (4.73)$$
where (4.73) follows from, e.g., Theorem 2.2.3. In Definitions 4.3.1 and 4.3.2 we adopt the convention $\infty-\infty = -\infty$.
Observe that given $Q_{XY}$, the largest $c > 0$ for which $d(Q_X, Q_{Y|X}, Q_Y, c) = 0$ is the reciprocal of the strong data processing constant; see the references in [80]. If in definition (4.72) for $d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c)$ we choose $P_{X^n}$ to be $\mu_n$ conditioned on $B := \{x^n : Q_{Y^n|X^n=x^n}[A] > 1-\varepsilon\}$, i.e., $P_{X^n}[C] := \frac{\mu_n[B\cap C]}{\mu_n[B]}$ for all $C$, then [27, (19)-(21)] showed that when $|\nu| = 1$,
$$\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}[A] > 1-\varepsilon] - c(1-\varepsilon)\ln\nu^{\otimes n}[A] \le d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) + c\ln 2. \qquad (4.74)$$
We quickly recall the derivation of (4.74) here: define $P_{X^n}$ by
$$P_{X^n}[C] := \frac{\mu_n[B\cap C]}{\mu_n[B]}, \quad \forall C. \qquad (4.75)$$
Then
$$D(P_{X^n}\|\mu_n) = \ln\frac{1}{\mu_n[B]}, \qquad (4.76)$$
$$D(P_{Y^n}\|\nu^{\otimes n}) \ge P_{Y^n}[A]\ln\frac{P_{Y^n}[A]}{\nu^{\otimes n}[A]} + P_{Y^n}[A^c]\ln\frac{P_{Y^n}[A^c]}{\nu^{\otimes n}[A^c]} \qquad (4.77)$$
$$\ge -h(P_{Y^n}[A]) + P_{Y^n}[A]\ln\frac{1}{\nu^{\otimes n}[A]} \qquad (4.78)$$
$$\ge -\ln 2 + (1-\varepsilon)\ln\frac{1}{\nu^{\otimes n}[A]}. \qquad (4.79)$$
Thus
$$d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) \ge cD(P_{Y^n}\|\nu^{\otimes n}) - D(P_{X^n}\|\mu_n) \qquad (4.80)$$
$$\ge -c\ln 2 + c(1-\varepsilon)\ln\frac{1}{\nu^{\otimes n}[A]} - \ln\frac{1}{\mu_n[B]}, \qquad (4.81)$$
which is (4.74).
Note the undesired second $\varepsilon$ in (4.74), which would result in a weak converse. For finite $\mathcal{Y}$, [27] used the blowing-up lemma to strengthen (4.74). For the same reason discussed in Section 4.1, even using modern results on concentration of measure, one can only obtain an $O(\sqrt{n}\ln^{3/2} n)$ second-order term:
$$\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}[A] > 1-\varepsilon] - c\ln\nu^{\otimes n}[A] \le d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) + O(\sqrt{n}\ln^{3/2} n). \qquad (4.82)$$
If $\mu_n = Q_X^{\otimes n}$, then by the tensorization of the BL divergence (Chapter 6),
$$d(Q_X^{\otimes n}, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) = n\,d(Q_X, Q_{Y|X}, \nu, c), \qquad (4.83)$$
we see that the right side of (4.74) grows linearly with slope $d(Q_X, Q_{Y|X}, \nu, c)$, which is larger than desired (i.e., when applied to the source coding problem in Section 4.4.3 it would only result in an outer bound). Luckily, when $|\mathcal{X}| < \infty$, it was noted in [27] that if $\mu_n$ is the restriction of $Q_X^{\otimes n}$ to the $Q_X$-strongly typical set⁵, then the linear growth rate of $d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c)$ is the following (desired) quantity:
Definition 4.3.2 Given $Q_X$, $Q_{Y|X}$, a measure $\nu$ on $\mathcal{Y}$ and $c\in(0,\infty)$, define
$$d_\star(Q_X, Q_{Y|X}, \nu, c) := \sup_{P_{UX} : P_X = Q_X}\left\{cD(P_{Y|U}\|\nu|P_U) - D(P_{X|U}\|Q_X|P_U)\right\}$$
where $P_{UXY} := P_{UX}Q_{Y|X}$, and $D(P_{X|U}\|Q_X|P_U) := \int D(P_{X|U=u}\|Q_X)\,dP_U(u)$ denotes the conditional relative entropy. It follows from Definitions 4.3.1 and 4.3.2 that $d_\star(Q_X, Q_{Y|X}, \nu, c) \le d(Q_X, Q_{Y|X}, \nu, c)$.
For general alphabets, we can extend the idea and let $\mu_n = Q_X^{\otimes n}\big|_{\mathcal{C}_n}$ for some $\mathcal{C}_n$ with $1 - Q_X^{\otimes n}[\mathcal{C}_n] \le \delta$ for some $\delta\in(0,1)$ independent of $n$; this will not affect its information-theoretic applications in the non-vanishing error probability regime. In Chapter 3, we have shown, in the discrete and the Gaussian cases, that we can choose $\mathcal{C}_n$ so that
$$d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) \le n\,d_\star(Q_X, Q_{Y|X}, \nu, c) + O(\sqrt{n}). \qquad (4.84)$$
Using the semigroup method, we can also improve the $O(\sqrt{n}\ln^{3/2} n)$ term in (4.82) to $O(\sqrt{n})$ in the discrete and the Gaussian cases. This combined with (4.84) implies
$$\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}[A] > 1-\varepsilon] - c\ln\nu^{\otimes n}[A] \le n\,d_\star(Q_X, Q_{Y|X}, \nu, c) + O(\sqrt{n}). \qquad (4.85)$$

⁵The restriction $\mu|_{\mathcal{C}}$ of a measure $\mu$ to a set $\mathcal{C}$ is $\mu|_{\mathcal{C}}[D] := \mu[\mathcal{C}\cap D]$.
While the original proof in [27] used a data processing argument, as in (4.2), to get (4.74), the present approach uses the functional inequality (4.73), as in (4.26). In the discrete case we have the following result:
Theorem 4.3.1 Consider $Q_X$ a probability measure on a finite set $\mathcal{X}$, $\nu$ a probability measure on $\mathcal{Y}$, and $Q_{Y|X}$. Let
$$\beta_X := \frac{1}{\min_x Q_X(x)} \in [1,\infty), \qquad (4.86)$$
$$\alpha := \sup_x\left\|\frac{dQ_{Y|X=x}}{d\nu}\right\|_\infty \in [1,\infty). \qquad (4.87)$$
Let $c\in(0,\infty)$, $\eta,\delta\in(0,1)$ and $n > 3\beta_X\ln\frac{|\mathcal{X}|}{\delta}$. We can choose some set $\mathcal{C}_n$ with $Q_X^{\otimes n}[\mathcal{C}_n] \ge 1-\delta$, such that for $\mu_n := Q_X^{\otimes n}\big|_{\mathcal{C}_n}$ we have
$$\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}(f) \ge \eta] - c\ln\nu^{\otimes n}(f) \le n\,d_\star(Q_X, Q_{Y|X}, \nu, c) + A\sqrt{n} + c\ln\frac{1}{\eta} \qquad (4.88)$$
for any $f\in\mathcal{H}_{[0,1]}(\mathcal{Y}^n)$, where
$$A := \ln\left(\alpha^c\beta_X^{c+1}\right)\sqrt{3\beta_X\ln\frac{|\mathcal{X}|}{\delta}} + 2c\sqrt{(\alpha-1)\ln\frac{1}{\eta}}. \qquad (4.89)$$
The proof of Theorem 4.3.1 relies on some ideas in the proof of Theorem 4.2.2; in particular, the properties (4.47) and (4.50) of the operator $\Lambda^\nu_{\alpha,t}$ play a critical role.
Proof We first establish the following claim: for any $n \ge 1$, $f\in\mathcal{H}_{[0,1]}(\mathcal{Y}^n)$ and $\mu_n$ a measure on $\mathcal{X}^n$,
$$\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}(f) \ge \eta] - c\ln\nu^{\otimes n}(f) \le d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) + 2c\sqrt{(\alpha-1)n\ln\frac{1}{\eta}} + c\ln\frac{1}{\eta}, \qquad (4.90)$$
where $Q_{Y^n|X^n} := Q_{Y|X}^{\otimes n}$. Notice that for any $g\in\mathcal{H}_+(\mathcal{Y}^n)$, from the variational formula of the relative entropy (4.26) we have
$$\int \|g\|_{L_0(Q_{Y^n|X^n=x^n})}\,d\mu_n(x^n) = \int e^{Q_{Y|X}^{\otimes n}(\ln g)}\,d\mu_n \qquad (4.91)$$
$$= e^{-D(P_{X^n}\|\mu_n) + \int Q_{Y|X}^{\otimes n}(\ln g)\,dP_{X^n}} \qquad (4.92)$$
$$\le e^{d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) - cD(P_{Y^n}\|\nu^{\otimes n}) + cP_{Y^n}(\ln g^{1/c})} \qquad (4.93)$$
$$\le e^{d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c)}\,\|g\|_{L_{1/c}(\nu^{\otimes n})}, \qquad (4.94)$$
where we defined $P_{X^n}$ via $\frac{dP_{X^n}}{d\mu_n} = e^{Q_{Y|X}^{\otimes n}(\ln g)}\left(\int e^{Q_{Y|X}^{\otimes n}(\ln g)}\,d\mu_n\right)^{-1}$. Note that (4.94) can also be recovered as a particular instance of [80, Theorem 1]. Now take $g = (\Lambda^\nu_{\alpha,t}f)^c$ and observe that by (4.47),
$$\|g\|_{L_{1/c}(\nu^{\otimes n})} \le e^{c(\alpha-1)nt}\left[\nu^{\otimes n}(f)\right]^c. \qquad (4.95)$$
Moreover,
$$\int \|g\|_{L_0(Q_{Y^n|X^n=x^n})}\,d\mu_n(x^n) = \int \left\|\Lambda^\nu_{\alpha,t}f\right\|^c_{L_0(Q_{Y^n|X^n=x^n})}\,d\mu_n(x^n) \qquad (4.96)$$
$$\ge \int \|f\|^c_{L_{1-e^{-t}}(Q_{Y^n|X^n=x^n})}\,d\mu_n(x^n) \qquad (4.97)$$
$$\ge \int Q_{Y^n|X^n=x^n}(f)^{\frac{c}{1-e^{-t}}}\,d\mu_n(x^n) \qquad (4.98)$$
$$\ge \eta^{\frac{c}{1-e^{-t}}}\,\mu_n[Q_{Y^n|X^n}(f) \ge \eta], \qquad (4.99)$$
where (4.97) used Lemma 4.2.1 and Theorem 4.1.3, and (4.98) used the assumption $f \le 1$. Then (4.90) follows by using
$$\frac{1}{1-e^{-t}} \le \frac{1}{t} + 1, \quad \forall t\in(0,\infty), \qquad (4.100)$$
and optimizing over $t > 0$.
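The final optimization over $t$ is elementary AM-GM; the sketch below (with arbitrary, hypothetical parameter values) confirms numerically that after applying (4.100), the bound $c(\alpha-1)nt + c(\frac{1}{t}+1)\ln\frac{1}{\eta}$ is minimized at $t = \sqrt{\ln(1/\eta)/((\alpha-1)n)}$ and equals the $2c\sqrt{(\alpha-1)n\ln\frac{1}{\eta}} + c\ln\frac{1}{\eta}$ term appearing in (4.90):

```python
import math

# Hypothetical parameters (not from the thesis) illustrating the last step of
# the proof of (4.90).
c, alpha, n, eta = 2.0, 1.5, 10_000, 0.1
L = math.log(1 / eta)

def bound(t):
    # the bound after applying (4.100): c(alpha-1)nt + c(1/t + 1) ln(1/eta)
    return c * (alpha - 1) * n * t + c * (1 / t + 1) * L

t_star = math.sqrt(L / ((alpha - 1) * n))                  # AM-GM optimizer
closed_form = 2 * c * math.sqrt((alpha - 1) * n * L) + c * L

grid_min = min(bound(t_star * s) for s in [0.25, 0.5, 0.9, 1.0, 1.1, 2.0, 4.0])
print(grid_min, closed_form)  # the grid minimum matches the closed form
```

The $O(\sqrt{n})$ scaling of this term is exactly what the second-order converses of this chapter exploit.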
Having established (4.90), the rest of the proof follows from the estimate of $d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c)$ in Lemma 3.2.2, noting that the $\alpha_Y$ defined therein satisfies $\alpha_Y \le \alpha$.
We now proceed to find an analogue of Theorem 4.3.1 in the case of Gaussian $Q_X$, $Q_{Y|X}$ and Lebesgue $\nu$. Again we consider the Ornstein–Uhlenbeck semigroup. The proof of the following result uses some ideas already touched upon in Section 4.2.2, in particular the change-of-variable trick in (4.61).
Theorem 4.3.2 Suppose $Q_X = \mathcal{N}(0,\sigma^2)$, $Q_{Y|X=x} = \mathcal{N}(x,1)$, $\nu$ is the Lebesgue measure on $\mathbb{R}$, and $\eta,\delta\in(0,1)$. For $n \ge 24\ln\frac{2}{\delta}$, there exists $\mathcal{C}_n \subseteq \mathbb{R}^n$ with $Q_X^{\otimes n}[\mathcal{C}_n] \ge 1-\delta$ such that for $\mu_n = Q_X^{\otimes n}\big|_{\mathcal{C}_n}$,
$$\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}(f) > \eta] - c\ln\nu^{\otimes n}(f) \le n\,d_\star(Q_X, Q_{Y|X}, \nu, c) + c\sqrt{2n\ln\frac{1}{\eta}} + \sqrt{6n\ln\frac{2}{\delta}} + c\ln\frac{1}{\eta} \qquad (4.101)$$
for any $f\in\mathcal{H}_{[0,1]}(\mathbb{R}^n)$. Note that the value of $\sigma$ does not affect the bound.
Proof We first prove the following claim: for any nonnegative finite measure $\mu_n$ on $\mathbb{R}^n$,
$$\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}(f) > \eta] - c\ln\nu^{\otimes n}(f) \le d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) + c\sqrt{2n\ln\frac{1}{\eta}} + c\ln\frac{1}{\eta} \qquad (4.102)$$
for any $f\in\mathcal{H}_{[0,1]}(\mathbb{R}^n)$.
Let $\bar\mu_n$ be the dilation of $\mu_n$ by the factor $e^t$, that is,
$$\bar\mu_n[e^t C] = \mu_n[C] \qquad (4.103)$$
for any measurable $C$. As before, the functional inequality gives
$$\int e^{Q_{Y|X}^{\otimes n}(\ln g)}\,d\bar\mu_n \le \exp\left(d(\bar\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c)\right)\|g\|_{L_{1/c}(\mathbb{R}^n)}, \quad \forall g \ge 0. \qquad (4.104)$$
Now substitute
$$g = (T_{0^n,t}f)^c. \qquad (4.105)$$
Observe that
$$\|g\|_{L_{1/c}(\mathbb{R}^n)} = \|T_{0^n,t}f\|^c_{L_1(\mathbb{R}^n)} \qquad (4.106)$$
$$= e^{cnt}\|f\|^c_{L_1(\mathbb{R}^n)} \qquad (4.107)$$
$$= e^{cnt}\left[\nu^{\otimes n}(f)\right]^c, \qquad (4.108)$$
where (4.107) can be verified using Fubini's theorem and (4.60).
$$\int e^{Q_{Y|X}^{\otimes n}(\ln g)}\,d\bar\mu_n = \int e^{cQ_{Y^n|X^n=x^n}(\ln T_{0^n,t}f)}\,d\bar\mu_n(x^n) \qquad (4.109)$$
$$= \int e^{cQ_{Y^n|X^n=e^{-t}x^n}(\ln T_{e^{-t}x^n,t}f)}\,d\bar\mu_n(x^n) \qquad (4.110)$$
$$= \int \left\|T_{e^{-t}x^n,t}f\right\|^c_{L_0(Q_{Y^n|X^n=e^{-t}x^n})}\,d\bar\mu_n(x^n) \qquad (4.111)$$
$$\ge \int \|f\|^c_{L_{1-e^{-2t}}(Q_{Y^n|X^n=e^{-t}x^n})}\,d\bar\mu_n(x^n) \qquad (4.112)$$
$$\ge \int Q_{Y^n|X^n=e^{-t}x^n}(f)^{\frac{c}{1-e^{-2t}}}\,d\bar\mu_n(x^n) \qquad (4.113)$$
$$\ge \eta^{\frac{c}{1-e^{-2t}}}\,\bar\mu_n[x^n : Q_{Y^n|X^n=e^{-t}x^n}(f) > \eta] \qquad (4.114)$$
$$= \eta^{\frac{c}{1-e^{-2t}}}\,\mu_n[x^n : Q_{Y^n|X^n=x^n}(f) > \eta], \qquad (4.115)$$
where (4.110) used (4.61); (4.112) is from (4.30); (4.113) used the fact that $f \le 1$; and (4.115) is from (4.103). Combining (4.104), (4.108), and (4.115), we obtain
$$\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}(f) > \eta] - c\ln\nu^{\otimes n}(f) \le d(\bar\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) + \inf_{t>0}\left\{\left(\ln\frac{1}{\eta}\right)\frac{c}{1-e^{-2t}} + cnt\right\} \qquad (4.116)$$
$$\le d(\bar\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) + c\sqrt{2\left(\ln\frac{1}{\eta}\right)n} + c\ln\frac{1}{\eta}, \qquad (4.117)$$
where the last step used (4.100). Using an argument similar to (4.71), we find
$$d(\bar\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) \le d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c). \qquad (4.118)$$
Indeed, by definition
$$d(\bar\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) = \sup_{P_{X^n}}\{cD(P_{Y^n}\|\nu^{\otimes n}) - D(P_{X^n}\|\bar\mu_n)\}. \qquad (4.119)$$
When $P_{X^n}$ and $\mu_n$ are both scaled by $e^t$, $D(P_{X^n}\|\mu_n)$ remains unchanged. On the other hand, $D(P_{Y^n}\|\nu^{\otimes n}) = -h(P_{Y^n})$, so by the same argument as (4.71) we have $D(P_{\bar Y^n}\|\nu^{\otimes n}) \le D(P_{Y^n}\|\nu^{\otimes n})$. This establishes (4.118). Combining (4.117) and (4.118), we have established the claim (4.102). The rest of the proof of Theorem 4.3.2 follows from the bound on $d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c)$ in Lemma 3.2.6.
Remark 4.3.1 If $\nu = \mathcal{N}(0,\sigma^2)$ ($\sigma \ge 1$) instead, then an $O(\sqrt{n})$ second-order bound still holds. Indeed, in this case notice that
$$\|T_{0^n,t}f\|_{L_1(\nu^{\otimes n})} = \mathbb{E}\left[f\left(e^{-t}W^n + \sqrt{1-e^{-2t}}\,V^n\right)\right] \qquad (4.120)$$
$$= \mathbb{E}[f(U^n)] \qquad (4.121)$$
$$= \mathbb{E}\left[f(W^n)\frac{dP_{U^n}}{dP_{W^n}}(W^n)\right] \qquad (4.122)$$
$$\le \mathbb{E}\left[f(W^n)\left(\frac{\sigma^2}{e^{-2t}\sigma^2+1-e^{-2t}}\right)^{\frac{n}{2}}\right] \qquad (4.123)$$
$$\le \mathbb{E}\left[f(W^n)e^{nt}\right] \qquad (4.124)$$
$$= e^{nt}\|f\|_{L_1(\nu^{\otimes n})}, \qquad (4.125)$$
where $V^n \sim \mathcal{N}(0^n, I_n)$ and $W^n \sim \mathcal{N}(0^n, \sigma^2 I_n)$ are independent, and $U^n \sim \mathcal{N}(0^n, (e^{-2t}\sigma^2+1-e^{-2t})I_n)$. Therefore the equality in (4.107) will be replaced by $\le$, so the analysis can still go through. On the other hand, the third term on the right side of (4.101), which is the approximation error for $d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c)$, will be slightly changed, but is still of the order $n^{1/2}$. As for (4.118), notice that
$$D(P_{\bar Y^n}\|\nu^{\otimes n}) - D(P_{Y^n}\|\nu^{\otimes n}) \le \frac{1}{2\sigma^2}\mathbb{E}[\|\bar Y^n\|^2] - \frac{1}{2\sigma^2}\mathbb{E}[\|Y^n\|^2] \qquad (4.126)$$
$$= \frac{1}{2\sigma^2}\mathbb{E}[\|\bar X^n\|^2] - \frac{1}{2\sigma^2}\mathbb{E}[\|X^n\|^2] \qquad (4.127)$$
$$= (e^{2t}-1)\frac{1}{2\sigma^2}\mathbb{E}[\|X^n\|^2]. \qquad (4.128)$$
Therefore if $\mu_n$ (hence $P_{X^n}$) is supported on $\{x^n : \|x^n\| \le O(\sqrt{n})\}$, then
$$d(\bar\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) \le d(\mu_n, Q_{Y|X}^{\otimes n}, \nu^{\otimes n}, c) + O(\sqrt{n}) \qquad (4.129)$$
as long as $t = \Theta(n^{-1/2})$.
4.4 Applications
4.4.1 Output Distribution of Good Channel Codes
Consider a stationary memoryless channel $P_{Y|X}$ with capacity $C$. If $|\mathcal{Y}| < \infty$, then by the steps of [149, (64)-(66)] and our Theorem 4.2.2, we conclude that an $(n,M,\varepsilon)$ code under the maximal error criterion with deterministic encoders satisfies the converse bound
$$D(P_{Y^n}\|P^\star_{Y^n}) \le nC - \ln M + 2\sqrt{|\mathcal{Y}|\,n\ln\frac{1}{1-\varepsilon}} + \ln\frac{1}{1-\varepsilon}. \qquad (4.130)$$
This implies that the Burnashev condition
$$\sup_{x,x'}\left\|\frac{dP_{Y|X=x}}{dP_{Y|X=x'}}\right\|_\infty < \infty \qquad (4.131)$$
in [149, Theorem 6] (cf. [48, Theorem 3.6.6]) is not necessary. Using the blowing-up lemma, [149, Theorem 7] bounded $D(P_{Y^n}\|P^\star_{Y^n})$ without requiring (4.131), but with a suboptimal $\sqrt{n}(\ln n)^{3/2}$ second-order term. Also note that in our approach $|\mathcal{Y}| < \infty$ can be weakened to a bounded density assumption (4.48), and the maximal error criterion assumption can be weakened to the geometric average criterion (4.52).
4.4.2 Broadcast Channels
Consider a Gaussian broadcast channel where the SNRs of the two component channels are $S_1, S_2\in(0,\infty)$. Suppose there exists an $(n,M_1,M_2,\varepsilon)$-maximal error code. Using Theorem 4.2.3 and the same steps as in the proof of the weak converse (see e.g. [18, Theorem 5.3]), we immediately obtain
$$\ln M_1 \le n\mathsf{C}(\alpha S_1) + \sqrt{2n\ln\frac{1}{1-\varepsilon}} + \ln\frac{1}{1-\varepsilon}; \qquad (4.132)$$
$$\ln M_2 \le n\mathsf{C}\left(\frac{(1-\alpha)S_2}{\alpha S_2+1}\right) + \sqrt{2n\ln\frac{1}{1-\varepsilon}} + \ln\frac{1}{1-\varepsilon}, \qquad (4.133)$$
for some $\alpha\in[0,1]$, where $\mathsf{C}(t) := \frac12\ln(1+t)$. An alternative proof of the strong converse via the information spectrum method, which does not give simple explicit bounds on the sublinear terms, is given in [152]. Converse bounds under the average error criterion can be obtained by codebook expurgation (e.g. [18, Problem 8.11]).
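For concreteness, the bounds (4.132)-(4.133) are easy to evaluate numerically; the sketch below (blocklength, error probability, and SNRs are hypothetical) traces the per-letter rate caps as the power-split parameter $\alpha$ varies:

```python
import math

# Numeric sketch of the second-order converse (4.132)-(4.133) for the
# Gaussian broadcast channel: for each power split alpha, the per-user log
# code sizes are capped by n*C(.) plus the slack sqrt(2n ln 1/(1-eps)) +
# ln 1/(1-eps), where C(t) = (1/2) ln(1+t).
def C(t):
    return 0.5 * math.log1p(t)

def rate_caps(n, eps, S1, S2, alpha):
    slack = math.sqrt(2 * n * math.log(1 / (1 - eps))) + math.log(1 / (1 - eps))
    lnM1 = n * C(alpha * S1) + slack
    lnM2 = n * C((1 - alpha) * S2 / (alpha * S2 + 1)) + slack
    return lnM1, lnM2

n, eps, S1, S2 = 1000, 0.1, 4.0, 2.0
for alpha in (0.0, 0.5, 1.0):
    lnM1, lnM2 = rate_caps(n, eps, S1, S2, alpha)
    print(alpha, lnM1 / n, lnM2 / n)  # per-letter caps; slack is O(1/sqrt(n))
```

Dividing by $n$ shows the slack contributes $O(1/\sqrt{n})$ to the per-letter rates, which is the content of the second-order converse.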
Next, let us prove an $O(\sqrt{n})$ second-order counterpart for the discrete broadcast channel. Consider a degraded broadcast channel $(P_{Y|X}, P_{Z|X})$ with $|\mathcal{Y}|, |\mathcal{Z}| < \infty$. Suppose that there exists an $(n,M_1,M_2,\varepsilon)$ code under the maximal error criterion. We can generalize Theorem 4.2.2 to allow stochastic encoders as in Theorem 4.2.3, which, combined with the steps in the proof of the weak converse for degraded broadcast channels (e.g. [18, Theorem 5.2]), shows that there exists $P_{UX}$ such that
$$\ln M_1 \le nI(X;Y|U) + 2\sqrt{|\mathcal{Y}|\,n\ln\frac{1}{1-\varepsilon}} + \ln\frac{1}{1-\varepsilon}; \qquad (4.134)$$
$$\ln M_2 \le nI(U;Z) + 2\sqrt{|\mathcal{Z}|\,n\ln\frac{1}{1-\varepsilon}} + \ln\frac{1}{1-\varepsilon}. \qquad (4.135)$$
This improves the best $O(\sqrt{n}\ln^{3/2} n)$ second-order term obtained by the blowing-up argument ([27], [48, Theorem 3.6.4]). Again, with our approach the finite alphabet assumption may be weakened to (4.48). Converse bounds under the average error criterion can be obtained by an expurgation argument (e.g. [18, Problem 8.11]).
4.4.3 Source Coding with Compressed Side Information
Figure 4.4: Source coding with compressed side information (Encoder 1 compresses $X^n$ into $W_1$, Encoder 2 compresses $Y^n$ into $W_2$, and the decoder outputs $\hat Y^n$).
As alluded to in Chapter 1, source coding with compressed side information [49], [27] is a canonical example of the side-information problem, whose second-order converse was open [50]. See Figure 4.4 for the setup. Using Theorem 4.3.1 plus essentially the same arguments as in [27, Theorem 3], we can show the following second-order converse:
Theorem 4.4.1 Consider a stationary memoryless discrete source with per-letter distribution $Q_{XY}$. Let $\varepsilon\in(0,1)$ and $n \ge 3\beta_X\ln\frac{2(1+\varepsilon)|\mathcal{X}|}{1-\varepsilon}$, where $\beta_X$ is defined in (4.86). Suppose that there exist encoders $f\colon\mathcal{X}^n\to\mathcal{W}_1$ and $g\colon\mathcal{Y}^n\to\mathcal{W}_2$ and a decoder $V\colon\mathcal{W}_1\times\mathcal{W}_2\to\mathcal{Y}^n$ such that
$$\mathbb{P}[\hat Y^n \ne Y^n] \le \varepsilon. \qquad (4.136)$$
Then for any $c\in(0,\infty)$,
$$\ln|\mathcal{W}_1| + c\ln|\mathcal{W}_2| \ge n\inf_{U : U-X-Y}\{cH(Y|U) + I(U;X)\} - \sqrt{n}\left(\ln\left(|\mathcal{Y}|^c\beta_X^{c+1}\right)\sqrt{3\beta_X\ln\frac{4|\mathcal{X}|}{1-\varepsilon}} + 2c\sqrt{|\mathcal{Y}|\ln\frac{2}{1-\varepsilon}}\right) - (1+c)\ln\frac{2}{1-\varepsilon}. \qquad (4.137)$$
Proof The first part of the proof parallels the proof of [27, Theorem 3]. For any $w\in\mathcal{W}_1$, define the "correctly decodable set"
$$B_w := \{y^n : y^n = V(w, g(y^n))\}. \qquad (4.138)$$
Then
$$\mathbb{E}\left[Q_{Y^n|X^n}[B_{f(X^n)}|X^n]\right] \ge 1-\varepsilon, \qquad (4.139)$$
where $X^n \sim Q_X^{\otimes n}$. Choose any $\varepsilon'\in(\varepsilon,1)$. By the Markov inequality,
$$Q_X^{\otimes n}[x^n : Q_{Y^n|X^n=x^n}[B_{f(x^n)}] \ge 1-\varepsilon'] \ge 1-\frac{\varepsilon}{\varepsilon'}. \qquad (4.140)$$
Take $\nu$ to be the equiprobable measure on $\mathcal{Y}$. Applying Theorem 4.3.1 with $\delta = \frac{\varepsilon'-\varepsilon}{2\varepsilon'}$ and $\eta = 1-\varepsilon'$, for $n > 3\beta_X\ln\frac{2\varepsilon'|\mathcal{X}|}{\varepsilon'-\varepsilon}$, we find $\mu_n$ such that
$$\mu_n[x^n : Q_{Y^n|X^n=x^n}[B_{f(x^n)}] \ge 1-\varepsilon'] \ge 1-\frac{\varepsilon}{\varepsilon'}-\delta \qquad (4.141)$$
$$= \frac{\varepsilon'-\varepsilon}{2\varepsilon'} \qquad (4.142)$$
and
$$\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}[B_{f(x^n)}] \ge 1-\varepsilon'] - c\ln\nu^{\otimes n}[B_{f(x^n)}] \le n\,d_\star(Q_X, Q_{Y|X}, \nu, c) + A\sqrt{n} + c\ln\frac{1}{\eta}. \qquad (4.143)$$
Since $f$ takes at most $|\mathcal{W}_1|$ possible values, from (4.142), there exists a $w^\star$ such that
$$\mu_n[x^n : Q_{Y^n|X^n=x^n}[B_{w^\star}] \ge 1-\varepsilon'] \ge \frac{\varepsilon'-\varepsilon}{2\varepsilon'}|\mathcal{W}_1|^{-1}. \qquad (4.144)$$
Moreover, since $B_{w^\star} \subseteq \bigcup_{w\in\mathcal{W}_2}\{y^n : y^n = V(w^\star, w)\}$, by the union bound,
$$\nu^{\otimes n}[B_{w^\star}] = |\mathcal{Y}|^{-n}|B_{w^\star}| \le |\mathcal{W}_2|\,|\mathcal{Y}|^{-n}. \qquad (4.145)$$
Comparing (4.143), (4.144) and (4.145), we obtain
$$\ln|\mathcal{W}_1| + c\ln|\mathcal{W}_2| \ge -n\sup_{P_{UX} : P_X=Q_X}\{cD(P_{Y|U}\|\nu|P_U) - I(U;X)\} + nc\ln|\mathcal{Y}| - A\sqrt{n} - c\ln\frac{1}{\eta} + \ln\frac{\varepsilon'-\varepsilon}{2\varepsilon'} \qquad (4.146)$$
$$\ge n\inf_{U : U-X-Y}\{cH(Y|U) + I(U;X)\} - A\sqrt{n} - c\ln\frac{1}{1-\varepsilon'} - \ln\frac{2\varepsilon'}{\varepsilon'-\varepsilon}; \qquad (4.147)$$
(4.137) then follows by taking $\varepsilon' = \frac{1+\varepsilon}{2}$.
Note that the first term on the right side of (4.137) corresponds to the rate region (see e.g. [18, Theorem 10.2]). Using the BUL method [27], on the other hand, one can only bound the second term as $O(\sqrt{n}\ln^{3/2} n)$, which is suboptimal.
A counterpart for Gaussian sources under the quadratic distortion is the following (which is a special case of the Berger–Tung problem), for which we also derive the $O(\sqrt{n})$ second-order converse. The setup is still as in Figure 4.4, except that we seek to approximate $Y^n$ by $\hat Y^n$ in quadratic error, instead of seeking exact reconstruction.
Theorem 4.4.2 Consider a stationary memoryless Gaussian source where the per-letter distribution $Q_{XY}$ is jointly Gaussian with covariance matrix $\begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix}$. Suppose $\varepsilon\in(0,1)$ and $n \ge 24\ln\frac{4(1+\varepsilon)}{1-\varepsilon}$, and that there exist encoders $f\colon\mathcal{X}^n\to\mathcal{W}_1$ and $g\colon\mathcal{Y}^n\to\mathcal{W}_2$ and a decoder $V\colon\mathcal{W}_1\times\mathcal{W}_2\to\mathcal{Y}^n$ ($\mathcal{X}=\mathcal{Y}=\hat{\mathcal{Y}}=\mathbb{R}$) such that for some $D\in(0,\infty)$,
$$\mathbb{P}\left[\|\hat Y^n - Y^n\|^2 > nD\right] \le \varepsilon. \qquad (4.148)$$
Then for any $c\in(0,\infty)$,
$$\ln|\mathcal{W}_1| + c\ln|\mathcal{W}_2| \ge n\inf_{\bar\rho\in[0,1]}\left\{\frac{c}{2}\ln\frac{1-\rho^2\bar\rho^2}{D} + \frac12\ln\frac{1}{1-\bar\rho^2}\right\} - c\sqrt{2n\ln\frac{2}{1-\varepsilon}} - \sqrt{6n\ln\frac{4(1+\varepsilon)}{1-\varepsilon}} - c\ln\frac{2}{1-\varepsilon} - \ln\frac{2(1+\varepsilon)}{1-\varepsilon}. \qquad (4.149)$$
Proof The first part of the proof parallels the proof of [27, Theorem 3], with adaptations to include the distortion function. For any $w\in\mathcal{W}_1$, define the "correctly decodable set"
$$B_w := \left\{y^n : \|y^n - V(w, g(y^n))\|^2 \le nD\right\}. \qquad (4.150)$$
Then
$$\mathbb{E}\left[Q_{Y^n|X^n}[B_{f(X^n)}|X^n]\right] \ge 1-\varepsilon, \qquad (4.151)$$
where $X^n \sim Q_X^{\otimes n}$. Choose any $\varepsilon'\in(\varepsilon,1)$. By the Markov inequality,
$$Q_X^{\otimes n}[x^n : Q_{Y^n|X^n=x^n}[B_{f(x^n)}] \ge 1-\varepsilon'] \ge 1-\frac{\varepsilon}{\varepsilon'}. \qquad (4.152)$$
Applying Theorem 4.3.2 with $\delta = \frac{\varepsilon'-\varepsilon}{2\varepsilon'}$ and $\varepsilon' = 1-\eta$, for $n \ge 24\ln\frac{4\varepsilon'}{\varepsilon'-\varepsilon}$, we find $\mu_n$ such that
$$\mu_n[x^n : Q_{Y^n|X^n=x^n}[B_{f(x^n)}] \ge 1-\varepsilon'] \ge 1-\frac{\varepsilon}{\varepsilon'}-\delta = \frac{\varepsilon'-\varepsilon}{2\varepsilon'} \qquad (4.153)$$
and
$$\ln\mu_n[x^n : Q_{Y^n|X^n=x^n}[B_{f(x^n)}] > 1-\varepsilon'] - c\ln\nu^{\otimes n}[B_{f(x^n)}] \le n\,d_\star(Q_X, Q_{Y|X}, \nu, c) + c\sqrt{2n\ln\frac{1}{1-\varepsilon'}} + \sqrt{6n\ln\frac{4\varepsilon'}{\varepsilon'-\varepsilon}} + c\ln\frac{1}{1-\varepsilon'}. \qquad (4.154)$$
Since $f$ takes at most $|\mathcal{W}_1|$ possible values, from (4.153), there exists a $w^\star$ such that
$$\mu_n[x^n : Q_{Y^n|X^n=x^n}[B_{w^\star}] \ge 1-\varepsilon'] \ge \frac{\varepsilon'-\varepsilon}{2\varepsilon'}|\mathcal{W}_1|^{-1}. \qquad (4.155)$$
Moreover, since $B_{w^\star} \subseteq \bigcup_{w\in\mathcal{W}_2}\{y^n : \|y^n - V(w^\star, w)\|^2 \le nD\}$, by the union bound and the volume estimate of the $n$-dimensional ball,
$$\nu^{\otimes n}[B_{w^\star}] \le |\mathcal{W}_2|(2\pi eD)^{\frac{n}{2}}. \qquad (4.156)$$
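The volume estimate used in (4.156) — a Euclidean ball of radius $\sqrt{nD}$ in $\mathbb{R}^n$ has volume at most $(2\pi eD)^{n/2}$ — follows from $\Gamma(n/2+1) \ge (n/(2e))^{n/2}$ and can be checked numerically:

```python
import math

# A Euclidean ball of radius sqrt(n*D) in R^n has Lebesgue volume
# pi^{n/2} (nD)^{n/2} / Gamma(n/2 + 1), which is at most (2*pi*e*D)^{n/2}.
# We compare log-volumes to avoid overflow; D = 0.25 is an arbitrary value.
def log_ball_volume(n, D):
    r2 = n * D  # squared radius
    return 0.5 * n * math.log(math.pi * r2) - math.lgamma(0.5 * n + 1)

def log_bound(n, D):
    return 0.5 * n * math.log(2 * math.pi * math.e * D)

for n in (1, 2, 10, 100, 1000):
    assert log_ball_volume(n, 0.25) <= log_bound(n, 0.25) + 1e-9
print("volume bound verified")
```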
Comparing (4.154), (4.155) and (4.156), we obtain
$$\ln|\mathcal{W}_1| + c\ln|\mathcal{W}_2| \ge n\inf_{\bar\rho\in[0,1]}\left\{\frac{c}{2}\ln\frac{1-\rho^2\bar\rho^2}{D} + \frac12\ln\frac{1}{1-\bar\rho^2}\right\} - c\sqrt{2n\ln\frac{1}{1-\varepsilon'}} - \sqrt{6n\ln\frac{4\varepsilon'}{\varepsilon'-\varepsilon}} - c\ln\frac{1}{1-\varepsilon'} - \ln\frac{2\varepsilon'}{\varepsilon'-\varepsilon}. \qquad (4.157)$$
(4.149) then follows by taking $\varepsilon' = \frac{1+\varepsilon}{2}$.
Remark: The weak converse version of (4.149) is the following result, which is a special case of [18, Theorem 12.3]. The first term on the right side of (4.149) corresponds to the known (first-order) single-letter region, which can be obtained by taking $D_2\to\infty$ in [18, Theorem 12.3].
Theorem 4.4.3 (see [18, Theorem 12.3] and the references therein) Consider lossy source coding with compressed side information where the per-letter distribution $Q_{XY}$ is jointly Gaussian with covariance matrix $\begin{pmatrix}1 & \rho\\ \rho & 1\end{pmatrix}$ and the distortion function is quadratic. Then $(R_1, R_2, D)\in[0,\infty)^3$ is achievable if and only if
$$R_2 \ge \frac12\ln\frac{1-\rho^2+\rho^2\exp(-2R_1)}{D}. \qquad (4.158)$$
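As a consistency check (with hypothetical parameter values), the first term of (4.149) is exactly the weighted-sum form of the region (4.158) under the substitution $e^{-2R_1} = 1-\bar\rho^2$:

```python
import math

# Substituting exp(-2*R1) = 1 - b^2 turns the weighted sum
#   R1 + (c/2) ln((1 - rho^2 + rho^2 e^{-2 R1}) / D)
# from the region (4.158) into
#   (1/2) ln(1/(1-b^2)) + (c/2) ln((1 - rho^2 b^2) / D),
# the objective inside the infimum of (4.149). Parameters are illustrative.
rho, D, c = 0.8, 0.1, 1.5

def via_R1(R1):
    return R1 + 0.5 * c * math.log((1 - rho**2 + rho**2 * math.exp(-2 * R1)) / D)

def via_b(b):
    return 0.5 * math.log(1 / (1 - b**2)) + 0.5 * c * math.log((1 - rho**2 * b**2) / D)

for R1 in (0.0, 0.2, 1.0, 3.0):
    b = math.sqrt(1 - math.exp(-2 * R1))
    assert abs(via_R1(R1) - via_b(b)) < 1e-12
print("parametrizations agree")
```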
4.4.4 One-communicator CR Generation
We are now ready (finally!) to give an $O(\sqrt{n})$ second-order converse for the one-communicator CR generation problem in Section 2.3.1. Armed with the optimal second-order change-of-measure lemma of the present chapter, we now only need the following result relating the change-of-measure lemma to the performance of CR generation. For simplicity $m=1$ is assumed, while the method should carry over to the case of arbitrarily many receivers.
Lemma 4.4.4 Consider the single-shot setup as depicted in Figure 1.1 with $m=1$. Suppose that there exist $\delta_1,\delta_2\in(0,1)$, a stochastic encoder $Q_{W|X}$, and deterministic decoders $Q_{K|X}$ and $Q_{K_1|WY}$, such that
$$\mathbb{P}[K = K_1] \ge 1-\delta_1; \qquad (4.159)$$
$$\tfrac12|Q_K - T_K| \le \delta_2, \qquad (4.160)$$
where $T_K$ denotes the equiprobable distribution. Also, suppose that there exist $\mu_X$, $\delta,\varepsilon'\in(0,1)$ and $c\in(1,\infty)$, $d\in(0,\infty)$ such that
$$E_1(Q_X\|\mu_X) \le \delta; \qquad (4.161)$$
$$\mu_X\left(x : Q_{Y|X=x}(A) \ge 1-\varepsilon'\right) \le 2^c\exp(d)\,Q_Y^c(A) \qquad (4.162)$$
for any $A\subseteq\mathcal{Y}$. Then, for any $\delta_3,\delta_4\in(0,1)$ such that $\delta_3\delta_4 = \delta_1+\delta$, we have
$$\delta_2 \ge 1-\delta-\delta_3-\frac{1}{|\mathcal{K}|} - \frac{2\exp\left(\frac{d}{c}\right)|\mathcal{W}|}{(\varepsilon'-\delta_4)^{\frac1c}\,|\mathcal{K}|^{1-\frac1c}}. \qquad (4.163)$$
Proof As we assumed $m=1$, let us write $\hat K := K_1$. Define the joint measure
$$\mu_{XYWK\hat K} := \mu_X Q_{Y|X} Q_{W|X} Q_{K|X} Q_{\hat K|YW}, \qquad (4.164)$$
which we sometimes abbreviate as $\mu$. Since $E_1(Q\|\mu) = E_1(Q_X\|\mu_X) \le \delta$, (4.159) implies
$$\mu(K \ne \hat K) \le \delta_1+\delta. \qquad (4.165)$$
Put
$$\mathcal{J} := \{k : \mu_{\hat K|K}(k|k) \ge 1-\delta_4\}. \qquad (4.166)$$
The Markov inequality implies that $\mu_K(\mathcal{J}) \ge 1-\delta_3$. Now for each $k\in\mathcal{J}$, we have
$$(1-\delta_4)\mu_K(k) \le \mu_{K\hat K}(k,k) \qquad (4.167)$$
$$\le \int_{F_k} Q_{Y|X=x}\Big(\bigcup_{w} A_{kw}\Big)\,d\mu_X(x) \qquad (4.168)$$
$$\le (1-\varepsilon')\mu_K(k) + \mu_X\Big(x : Q_{Y|X=x}\Big(\bigcup_{w} A_{kw}\Big) \ge 1-\varepsilon'\Big) \qquad (4.169)$$
$$\le (1-\varepsilon')\mu_K(k) + 2^c\exp(d)\,Q_Y^c\Big(\bigcup_{w} A_{kw}\Big), \qquad (4.170)$$
where $F_k\subseteq\mathcal{X}$ is the decoding set for $K$, and $A_{kw}$ is the decoding set for $\hat K$ upon receiving $w$. Rearranging,
$$(\varepsilon'-\delta_4)^{\frac1c}\,\mu_K^{\frac1c}(k) \le 2\exp\Big(\frac{d}{c}\Big)\,Q_Y\Big(\bigcup_{w} A_{kw}\Big). \qquad (4.171)$$
Now let $\bar\mu$ be the restriction of $\mu_K$ to $\mathcal{J}$. Then, summing both sides of (4.171) over $k\in\mathcal{J}$, applying the union bound, and noting that $\{A_{kw}\}_k$ is a partition of $\mathcal{Y}$ for each $w$, we obtain
$$D_{\frac1c}(\bar\mu\|T) \ge \ln|\mathcal{K}| - \frac{1}{1-\frac1c}\ln\frac{2|\mathcal{W}|}{(\varepsilon'-\delta_4)^{\frac1c}} - \frac{d}{c-1}. \qquad (4.172)$$
The proof is completed by invoking Proposition 4.4.5 below and noting that
$$E_1(Q_K\|\bar\mu) \le E_1(Q_K\|\mu_K) + E_1(\mu_K\|\bar\mu) \le \delta + \delta_3. \qquad (4.173)$$
Proposition 4.4.5 Suppose $T$ is equiprobable on $\{1,\dots,M\}$ and $\mu$ is a nonnegative measure on the same alphabet. For any $\alpha\in(0,1)$,
$$E_1(T\|\mu) \ge 1-\frac{1}{M}-\exp(-(1-\alpha)D_\alpha(\mu\|T)). \qquad (4.174)$$
The special case of Proposition 4.4.5 where $\mu$ is a probability measure was proved at the end of the proof of Theorem 2.3.2. The extension to unnormalized $\mu$ can easily be proved in a similar way.
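As a quick numeric sanity check (not from the thesis), (4.174) can be verified directly on random unnormalized measures, using the identities $\exp(-(1-\alpha)D_\alpha(\mu\|T)) = \sum_k \mu_k^\alpha (1/M)^{1-\alpha}$ and $E_1(T\|\mu) = \sum_k (1/M - \mu_k)_+$:

```python
import random

# Verify the bound (4.174) on 1000 random unnormalized measures mu on a
# 50-letter alphabet, with T equiprobable and alpha = 1/2.
random.seed(0)
M, alpha = 50, 0.5
for _ in range(1000):
    mu = [random.expovariate(1.0) / M for _ in range(M)]  # unnormalized
    E1 = sum(max(1 / M - m, 0.0) for m in mu)
    renyi_term = sum(m**alpha * (1 / M) ** (1 - alpha) for m in mu)
    assert E1 >= 1 - 1 / M - renyi_term - 1e-12
print("bound (4.174) holds on all random trials")
```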
Now, specializing the single-shot result of Lemma 4.4.4 to the discrete memoryless case and using Theorem 4.3.1, we obtain the following non-asymptotic second-order converse for the CR generation problem. Note that the right side of (4.181) is essentially
$$\frac{\exp\left(\frac{n d_\star}{c} + \frac{A}{c}\sqrt{n}\right)|\mathcal{W}|}{|\mathcal{K}|^{1-\frac1c}}, \qquad (4.175)$$
where the $\sqrt{n}$ term characterizes the correct second-order coding rate.
Theorem 4.4.6 Consider a stationary memoryless source with per-letter distribution $Q_{XY}$. Let
$$\beta_X := \frac{1}{\min_x Q_X(x)} \in [1,\infty); \qquad (4.176)$$
$$\alpha := \sup_x\left\|\frac{dQ_{Y|X=x}}{dQ_Y}\right\|_\infty \in [1,\infty); \qquad (4.177)$$
$$c \in (1,\infty). \qquad (4.178)$$
Suppose that there exists a CR generation scheme (see the setup in Figure 1.1) for which
$$\mathbb{P}[K = K_1] \ge 1-\delta_1 \in (0,1); \qquad (4.179)$$
$$\tfrac12|Q_K - T_K| \le \delta_2 \in (0,1). \qquad (4.180)$$
Then for any $\delta_1'\in(\delta_1,1)$ and $n \ge 3\beta_X\ln\frac{3|\mathcal{X}|}{\delta_1'-\delta_1}$, we have
$$1-\delta_2-\delta_1' \le \frac{1}{|\mathcal{K}|} + \frac{\exp\left(\frac{n d_\star}{c} + \frac{A}{c}\sqrt{n} + \ln\frac{4\delta_1'+2\delta_1}{\delta_1'-\delta_1}\right)|\mathcal{W}|}{\left(\frac{\delta_1'-\delta_1}{4\delta_1'+2\delta_1}\right)^{\frac1c}|\mathcal{K}|^{1-\frac1c}}, \qquad (4.181)$$
where we have defined
$$A := \ln\left(\alpha^c\beta_X^{c+1}\right)\sqrt{3\beta_X\ln\frac{3|\mathcal{X}|}{\delta_1'-\delta_1}} + 2c\sqrt{(\alpha-1)\ln\frac{4\delta_1'+2\delta_1}{\delta_1'-\delta_1}}, \qquad (4.182)$$
$$d_\star := d_\star(Q_X, Q_{Y|X}, Q_Y, c). \qquad (4.183)$$
Proof Specialize Lemma 4.4.4 to the stationary memoryless setting:
$$Q_{XY} \leftarrow Q_{XY}^{\otimes n}. \qquad (4.184)$$
Suppose we first pick the parameter $\delta_3\in(\delta_1,1)$ in Lemma 4.4.4 arbitrarily, which in turn yields the following choices of the other parameters therein:
$$\delta = \frac{\delta_3-\delta_1}{2}; \qquad (4.185)$$
$$\delta_4 = \frac{\delta_1+\delta_3}{2\delta_3}. \qquad (4.186)$$
Moreover, let $\varepsilon' = \frac{1+\delta_4}{2}$ in Lemma 4.4.4. Then we invoke Theorem 4.3.1, with $\nu = Q_Y$ and
$$\eta = 1-\varepsilon'. \qquad (4.187)$$
Then (4.163) implies that
$$1-\delta_2-\frac{3\delta_3-\delta_1}{2} \le \frac{1}{|\mathcal{K}|} + \frac{\exp\left(\frac{n d_\star}{c} + \frac{A}{c}\sqrt{n} + \ln\frac{4\delta_3}{\delta_3-\delta_1}\right)|\mathcal{W}|}{\left(\frac{\delta_3-\delta_1}{4\delta_3}\right)^{\frac1c}|\mathcal{K}|^{1-\frac1c}}, \qquad (4.188)$$
which is equivalent to (4.181) upon setting $\delta_1' = \frac{3\delta_3-\delta_1}{2}$.
4.5 About Extensions to Processes with Weak Correlation
The majority of this chapter focuses on stationary memoryless settings, and in particular, we succeeded in using the pumping-up approach to determine that the second-order rate is the square root of the blocklength in such settings for many problems in network information theory. However, the potential of the pumping-up approach is not limited to the stationary memoryless setting.

Recall that the pumping-up approach consists of two ingredients: one is reverse hypercontractivity (4.24), and the other is to show that the integral of the function does not increase too much after applying the operator (4.25). Regarding the first ingredient, we note that reverse hypercontractivity follows from a modified log-Sobolev inequality (also known as the 1-log-Sobolev inequality) [77]. There have already been results on modified log-Sobolev inequalities for "weakly correlated processes" (e.g. [153, Theorem 2.1]). The second ingredient can also be extended to certain weakly correlated processes. For example, consider the steps (4.35)-(4.38) where $P^{\otimes n}$ is replaced with some $P_{Y^n}$ which is not necessarily stationary memoryless. Those steps continue to hold under the assumption that
$$\alpha := \left\|\frac{dP_{Y_i|Y^{\setminus i}=y^{\setminus i}}}{dQ}\right\|_\infty < \infty, \quad \forall i = 1,\dots,n,\ y^{\setminus i}\in\mathcal{Y}^{n-1}. \qquad (4.189)$$
Chapter 5
More Dualities
In Chapters 2-4 we have introduced the three ingredients of the functional approach to information-theoretic converses: duality, smoothing, and reverse hypercontractivity, and we have focused on the example of the Brascamp-Lieb divergence to illustrate these ideas. The present chapter is a case study of two other information measures, their dual representations, and their applications to the achievability and converse parts of coding theorems.
5.1 Air on $g^\star$
The section title only bears a superficial connection to Bach’s “Air on the G String”.
We will revisit the image-size characterization, a technique in the classical book [8,
Chap 16] which is built on the blowing-up lemma and succeeded in showing the strong
converse property of all source-channel networks with known first-order rate region in
[8]. The problem considered in Section 4.3 is essentially a basic example of the image-size problem, consisting only of a "forward channel". The purpose of this section is to further demonstrate that the pumping-up argument is capable of substituting for the blowing-up argument whenever the latter is used for image-size analysis, yielding the optimal second-order bounds for the related coding theorems. To achieve this goal, we must show how the pumping-up approach applies to a more involved image-size problem containing a "reverse channel". In particular, we introduce an information-theoretic quantity, denoted by $g^\star$, to replace the $g$-function for the image size of a set used in [8] in some intermediate steps. Informally speaking, [8] upper bounds the image size $g(\mathcal{C})$ in terms of $g(\mathcal{B})$, where $\mathcal{B}$ and $\mathcal{C}$ are the images of a set $\mathcal{A}$ under two channels, as shown in Figure 5.1. In contrast, we upper bound $g^\star(\mathcal{C})$ in terms of $g(\mathcal{B})$. However, the end result needed for the converse proofs of the operational problems in [8] is an upper bound on $|\mathcal{A}|$ in terms of $g(\mathcal{B})$ when $\mathcal{A}$ is a good codebook for the $Z$-channel. We prove Theorem 5.1.9, a second-order sharpening of such an end result, which can be readily plugged into the converse proofs of the various operational problems in [8] to obtain $O(\sqrt{n})$ second-order rates.
In addition to obtaining the optimal second-order bounds, we find that the new approach also has the advantage of simplifying certain technical aspects in dealing with the reverse channel; for example, we introduce a simpler definition of the image of a set, so that the "maximal code lemma" used in [8, Chap 16] is no longer needed in the proofs of certain results. Once again, this is made possible by the fact that we are looking at functions (not necessarily indicator functions of sets) instead of restricting our attention to sets.
We start by reviewing some relevant concepts from [8] in Section 5.1.1. Then we introduce $g^\star$ in Section 5.1.2. In Section 5.1.3 we show that in the discrete memoryless case, $g$ and $g^\star$ differ by a sub-exponential factor.
5.1.1 Review of the Image-size Problem
Definition 5.1.1 ([8, Definition 6.2]) Consider $Q_{Z|X}$, a nonnegative measure $\nu_Z$, and $\eta\in(0,1)$. An $\eta$-image of $\mathcal{A}\subseteq\mathcal{X}$ is any set $\mathcal{B}\subseteq\mathcal{Z}$ for which $Q_{Z|X}[\mathcal{B}]\big|_{\mathcal{A}} \ge \eta$.¹

¹$Q_{Z|X}[\mathcal{B}]\big|_{\mathcal{A}}$ denotes the restriction of the function $Q_{Z|X}(1_{\mathcal{B}})$ to $\mathcal{A}$.
Figure 5.1: The tradeoff between the sizes of the two images $\mathcal{B}\subseteq\mathcal{Y}$ and $\mathcal{C}\subseteq\mathcal{Z}$ of a set $\mathcal{A}\subseteq\mathcal{X}$.
Define the $\eta$-image size of $\mathcal{A}$:
$$g_{Z|X}(\mathcal{A}, \nu_Z, \eta) := \inf_{\mathcal{B}\,:\,\eta\text{-image of }\mathcal{A}} \nu_Z[\mathcal{B}]. \qquad (5.1)$$
Fundamental properties of $g$ are discussed in [8], which we informally summarize below (for precise statements see [8, Lemma 15.2]).

Proposition 5.1.1 ([8, Lemma 15.2]) Consider $Q_{Z|X}^{\otimes n}$, the counting measure $\nu_Z^{\otimes n}$, and any $\eta\in(0,1)$. For any $\mathcal{A}\subseteq\mathcal{X}^n$,
$$\frac1n\log g_{Z^n|X^n}(\mathcal{A}, \nu_Z^{\otimes n}, \eta) \ge \frac1n H(Z^n) - o(1), \qquad (5.2)$$
where $Z^n$ is the channel output of $X^n$, which is equiprobably distributed on $\mathcal{A}$.
Proposition 5.1.2 ([8, Lemma 15.2]) In the setting of Proposition 5.1.1, if, additionally, $\mathcal{A}$ is the codebook of a certain $(n,\varepsilon)$ fixed-composition code under the average error criterion, then
$$\frac1n\log g_{Z^n|X^n}(\mathcal{A}, \nu_Z^{\otimes n}, \eta) \le \frac1n H(Z^n) + \varepsilon\log|\mathcal{X}| + o(1). \qquad (5.3)$$
Proposition 5.1.3 ([8, Theorem 15.10]) Consider $Q_X$, $Q_{Y|X}$ and $Q_{Z|X}$ where $|\mathcal{X}| < \infty$. Let $\eta\in(0,1)$ and let $\nu_Y$ and $\nu_Z$ be the counting measures. Then for any $\mathcal{A}\subseteq\mathcal{X}^n$,
$$\frac1n\log g_{Z^n|X^n}(\mathcal{A}\cap\mathcal{C}_n, \nu_Z^{\otimes n}, \eta) - \frac1n\log g_{Y^n|X^n}(\mathcal{A}\cap\mathcal{C}_n, \nu_Y^{\otimes n}, \eta) \le \sup_{P_{UX} : P_X=Q_X}\{H(Z|U) - H(Y|U)\} + o(1) \qquad (5.4)$$
where $\mathcal{C}_n := \{x^n : |\hat P_{x^n} - Q_X| \in \omega(n^{-\frac12})\cap o(1)\}$ denotes the typical set.
Remark 5.1.1 In the proof of Proposition 5.1.3, [8] used a technical result called the maximal code lemma [8, Lemma 6.3]. Roughly speaking, the proof therein proceeded by connecting $g$ with the entropy of the output for the equiprobable distribution on the input set (Propositions 5.1.1-5.1.2). Since the bound in Proposition 5.1.1 is generally not tight when $\mathcal{A}$ is not a good codebook, [8] used the maximal code lemma to find a good codebook $\bar{\mathcal{A}}\subseteq\mathcal{A}$ such that $g_{Z^n|X^n}(\bar{\mathcal{A}}, \nu_Z^{\otimes n}, \eta) \approx g_{Z^n|X^n}(\mathcal{A}, \nu_Z^{\otimes n}, \eta)$, and then finished the proof by focusing on $\bar{\mathcal{A}}$ instead. On the other hand, our approach below using convex duality sidesteps the maximal code lemma.
Since $g$ is defined as the minimum size of an image, (5.4) of course continues to hold when $g_{Y^n|X^n}(\mathcal{A}\cap\mathcal{C}_n, \nu_Y^{\otimes n}, \eta)$ is replaced by the size of any image of $\mathcal{A}\cap\mathcal{C}_n$. (For example, in the proof of the strong converse for the asymmetric broadcast channel with a common message [8], $\mathcal{A}$ is taken as the set of inputs for a fixed common message; its decoding set is then an image of $\mathcal{A}\cap\mathcal{C}_n$.) But (5.4) no longer holds when $g_{Z^n|X^n}(\mathcal{A}\cap\mathcal{C}_n, \nu_Z^{\otimes n}, \eta)$ is replaced by the size of an arbitrary image; therefore, when (5.4) is used in the proof of coding theorems, we never substitute a decoding set for $g_{Z^n|X^n}(\mathcal{A}\cap\mathcal{C}_n, \nu_Z^{\otimes n}, \eta)$. Instead, one uses the following property of the image size when $\mathcal{A}$ is a good codebook of $P$-typical sequences:
$$g_{Z^n|X^n}(\mathcal{A}, \nu_Z^{\otimes n}, \eta) \ge |\mathcal{A}|\,e^{-nD(Q_{Z|X}\|\nu_Z|P)+o(n)}; \qquad (5.5)$$
see [8, Lemma 6.4] for the precise statements, and see Proposition 5.1.6 below for a refinement. Using (5.5) and Proposition 5.1.3 we have the following.
Proposition 5.1.4 In the setting of Proposition 5.1.3, let Cn be the QX-typical set
and let A be an pn, εq-maximal error codebook for QbnZ|X . Then
1
nlog |AX Cn| ´
1
nlog gY n|XnpAX Cn, νbnY , ηq
ď supPUX : PX“QX
tIpZ;X|Uq ´HpY |Uqu ` op1q (5.6)
Proposition 5.1.4 is the key argument used in [8] for the strong converse proof of
coding theorems including2
• Asymmetric broadcast channel with a common message [8].
• Zigzag network [8].
• Gelfand-Pinsker channel coding [154].
In the remainder of this chapter, we use the pumping-up approach to sharpen Proposition 5.1.4 (see Theorem 5.1.9 below), which implies converses for the above operational problems with an $O(\sqrt{n})$ second-order term.
5.1.2 g‹ and the Optimal Second-order Image Size Lemmas
²In the Gelfand-Pinsker case, [154] defined a deterministic channel $Q_{Z^n|X^n}$, for which any input set is obviously a good codebook.
We now introduce a variant of the image size, which is technically more convenient (e.g., it avoids the use of the maximal code lemma [8, Lemma 6.3] when we characterize the exponents of the image sizes), but can be applied to the converse proof of operational problems in the same way. At the end of this subsection, we present the main result, Theorem 5.1.9, which is a second-order refinement of Proposition 5.1.4, allowing one to show an $O(\sqrt{n})$ second-order converse for the network information theory problems listed at the end of the previous subsection.
Definition 5.1.2 Consider $Q_{Z|X}$ and a nonnegative measure $\nu_Z$. Define
$$g^\star_{Z|X}(A,\nu_Z) := \inf_{g\in H_{[0,\infty)}(\mathcal{Z})\colon Q_{Z|X}(\ln g)|_A\ge 0} \nu_Z(g). \quad (5.7)$$
In our approach to converse proofs, we will use $g^\star$ instead of $g$ since the former is mathematically more convenient (e.g., it satisfies an exact tensorization property). However, in Section 5.1.3, we show that $g$ and $g^\star$ differ by a factor of $e^{o(n)}$ for discrete memoryless channels.
The following exact formula for $g^\star$ in terms of the relative entropy can be viewed as an improvement of Propositions 5.1.1-5.1.2.
Proposition 5.1.5 Fix $Q_{Z|X}$, $\nu_Z$, and let $C$ be an arbitrary subset of $\mathcal{X}$. Then
$$g^\star_{Z|X}(C,\nu_Z) = \exp\Big({-}\inf_{P_X\colon \operatorname{supp}(P_X)\subseteq C} D(P_Z\|\nu_Z)\Big) \quad (5.8)$$
where $P_X \to Q_{Z|X} \to P_Z$.
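Proposition 5.1.5 can be sanity-checked numerically on a toy instance. The sketch below (our own example; the BSC parameters, grids and variable names are not from the thesis) compares the entropic side of (5.8), $\exp(-\inf_{P_X} D(P_Z\|\nu_Z))$, with a direct grid-search minimization of $\nu_Z(g)$ subject to $Q_{Z|X}(\ln g)\ge 0$ on $C$, as in (5.7), for a BSC with crossover $0.1$, the counting measure on $\mathcal{Z}=\{0,1\}$, and $C=\mathcal{X}$.

```python
import math

# Toy instance (ours, not from the thesis): Q_{Z|X} is a BSC(0.1),
# nu_Z is the counting measure on Z = {0,1}, and C = X = {0,1}.
Q = {0: [0.9, 0.1], 1: [0.1, 0.9]}   # Q_{Z|X=x}(z)
nu = [1.0, 1.0]                      # counting measure on Z

def D(p, nu):
    # relative entropy D(p || nu) in nats (nu need not be normalized)
    return sum(pi * math.log(pi / ni) for pi, ni in zip(p, nu) if pi > 0)

# Entropic side of (5.8): exp(-min over P_X of D(P_Z || nu_Z)),
# by grid search over binary input distributions P_X = (p, 1-p).
best = min(
    D([p * Q[0][z] + (1 - p) * Q[1][z] for z in range(2)], nu)
    for p in (k / 10000 for k in range(10001))
)
entropic = math.exp(-best)

# Functional side (5.7): minimize nu_Z(g) over g >= 0 such that
# Q_{Z|X=x}(ln g) >= 0 for both x; grid search over (ln g(0), ln g(1)).
grid = [i / 100 for i in range(-200, 201)]
primal = min(
    math.exp(a) + math.exp(b)
    for a in grid for b in grid
    if 0.9 * a + 0.1 * b >= 0 and 0.1 * a + 0.9 * b >= 0
)

print(entropic, primal)  # both close to 2: the uniform output maximizes H(P_Z)
```

For the counting measure, $D(P_Z\|\nu_Z)=-H(P_Z)$, so both sides equal $\exp(\max H(P_Z))=2$ here; the agreement is a finite-alphabet instance of the convex duality behind Theorem 5.1.7.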
While it is possible to provide a direct proof of Proposition 5.1.5, we shall derive it as a special case of Theorem 5.1.7 below. We now lower bound $g^\star$ in terms of $|A|$ in the case of a good codebook:
Proposition 5.1.6 Consider $|\mathcal{Z}|<\infty$, $Q_{Z|X}$, $\nu_Z$. Let $A\subseteq\mathcal{X}^n$ be an $(n,\epsilon)$-codebook for $Q_{Z|X}^{\otimes n}$ under the geometric criterion (i.e., the codebook satisfies the assumption in Theorem 4.2.2). Then
$$\ln g^\star_{Z^n|X^n}(A,\nu_Z^{\otimes n}) \ge \ln|A| - D(Q_{Z|X}^{\otimes n}\|\nu_Z^{\otimes n}|P_{X^n}) - 2\sqrt{(\alpha-1)n\ln\tfrac{1}{1-\epsilon}} - \ln\tfrac{1}{1-\epsilon}, \quad (5.9)$$
where $P_{X^n}$ is the equiprobable distribution on $A$.
Proof
$$\ln g^\star_{Z^n|X^n}(A,\nu_Z^{\otimes n}) \ge -D(P_{Z^n}\|\nu_Z^{\otimes n}) \quad (5.10)$$
$$= -D(Q_{Z|X}^{\otimes n}\|\nu_Z^{\otimes n}|P_{X^n}) + I(X^n;Z^n) \quad (5.11)$$
$$\ge \ln|A| - D(Q_{Z|X}^{\otimes n}\|\nu_Z^{\otimes n}|P_{X^n}) - 2\sqrt{(\alpha-1)n\ln\tfrac{1}{1-\epsilon}} - \ln\tfrac{1}{1-\epsilon}, \quad (5.12)$$
where (5.10) used Proposition 5.1.5, and (5.12) applied the sharp Fano inequality (Theorem 4.2.2).
The following key observation, which follows from convex duality, establishes the equivalence of a functional inequality and an entropic inequality. Note that the $Z=X$ (i.e., $Q_{Z|X}$ is the identity) special case has been considered in the previous section.
Theorem 5.1.7 Consider finite alphabets $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{Z}$, random transformations $Q_{Y|X}$, $Q_{Z|X}$, and nonnegative measures $\nu_Z$ fully supported on $\mathcal{Z}$ and $\nu_Y$ fully supported on $\mathcal{Y}$. For $d\in\mathbb{R}$, the following statements are equivalent:
$$\inf_{g\colon Q_{Z|X}(\ln g)\ge Q_{Y|X}(\ln f)} \nu_Z(g) \le e^d\,\nu_Y(f), \quad \forall f\ge 0; \quad (5.13)$$
$$D(P_Z\|\nu_Z) + d \ge D(P_Y\|\nu_Y), \quad \forall P_X, \quad (5.14)$$
where $P_X\to Q_{Y|X}\to P_Y$ and $P_X\to Q_{Z|X}\to P_Z$.
Proof (5.13)$\Rightarrow$(5.14). For any $P_X$ and $\epsilon\in(0,\infty)$,
$$D(P_Y\|\nu_Y) = P_X P_{Y|X}(\ln f) - \ln\nu_Y(f) \quad (5.15)$$
$$\le P_X P_{Z|X}(\ln g) - \ln\nu_Z(g) + \epsilon + d \quad (5.16)$$
$$= P_Z(\ln g) - \ln\nu_Z(g) + \epsilon + d \quad (5.17)$$
$$\le D(P_Z\|\nu_Z) + \epsilon + d, \quad (5.18)$$
where
• In (5.15) we have chosen $f := \frac{dP_Y}{d\nu_Y}$.
• By (5.13) we can choose $g$ such that (5.16) holds.
• (5.18) uses the variational formula of the relative entropy.
(5.14)$\Rightarrow$(5.13). For simplicity of presentation we shall assume, without loss of generality, that $\nu_Z$ is a probability measure. We first note that the infimum in (5.13) is achieved. Let $a := \max_x (Q_{Y|X}\ln f)(x) \in [-\infty,\infty)$. Then by considering $g := e^a$ we see that
$$\inf_{g\colon Q_{Z|X}(\ln g)\ge Q_{Y|X}(\ln f)} \nu_Z(g) \le e^a. \quad (5.19)$$
But if $g(z_1) > e^a \max_z \nu_Z^{-1}(z)$ for some $z_1\in\mathcal{Z}$, then $\nu_Z(g) > e^a$. This means that
$$\inf_{g\colon Q_{Z|X}(\ln g)\ge Q_{Y|X}(\ln f)} \nu_Z(g) = \inf_{g\in\mathcal{C}} \nu_Z(g), \quad (5.20)$$
where $\mathcal{C} = \mathcal{C}_1\cap\mathcal{C}_2$ and
$$\mathcal{C}_1 := \{g\in H_+(\mathcal{Z})\colon Q_{Z|X}(\ln g) \ge Q_{Y|X}(\ln f)\}; \quad (5.21)$$
$$\mathcal{C}_2 := H_{[0,\, e^a \max_z \nu_Z^{-1}(z)]}(\mathcal{Z}). \quad (5.22)$$
Note that the closed set $\mathcal{C}_1$ is the domain of the infimum on the left side of (5.20), and $\mathcal{C}_2$ is compact. Therefore $\mathcal{C}$ is compact, and the infimum on the right side of (5.20) is achieved at some $g^\star\in\mathcal{C}$. The infimum on the left side of (5.20) is thus also achieved at $g^\star$.
Next, define a probability measure $P_Z$ by $\frac{dP_Z}{d\nu_Z} = \frac{1}{\nu_Z(g^\star)}g^\star$, and we will argue that there exists some $P_X$ such that $P_X\to Q_{Z|X}\to P_Z$. Indeed, apply the Lagrange multiplier method to the minimization in (5.13):
$$\min_g\{\nu_Z(g) - \lambda_X[Q_{Z|X}(\ln g) - Q_{Y|X}(\ln f)]\}, \quad (5.23)$$
where $\lambda_X$ is a nonnegative measure on $\mathcal{X}$. Then there must be some $\lambda_X$ for which $g^\star$ is a stationary point of the functional in (5.23). The claim then follows since the desired $P_X$ is $\lambda_X$ normalized to a probability measure. From the complementary slackness condition, we also see that
$$Q_{Z|X}\ln g^\star = Q_{Y|X}\ln f, \quad P_X\text{-a.s.} \quad (5.24)$$
With these preparations the rest of the proof follows immediately:
$$\ln\nu_Z(g^\star) = P_X Q_{Z|X}(\ln g^\star) - D(P_Z\|\nu_Z) \quad (5.25)$$
$$\le P_X Q_{Y|X}(\ln f) - D(P_Y\|\nu_Y) + d \quad (5.26)$$
$$\le \ln\nu_Y(f) + d, \quad (5.27)$$
where (5.26) uses (5.24) and (5.14), and (5.27) uses the variational formula of the relative entropy.
Proof of Proposition 5.1.5 We first note that it is without loss of generality to consider the case $\mathcal{X} = C$. The claim then follows from Theorem 5.1.7 by taking $\mathcal{Y}$ to be a singleton.
Applying the now familiar reverse hypercontractivity argument to Theorem 5.1.7,
we obtain the following result, which can be viewed as a second order sharpening of
Proposition 5.1.3.
Theorem 5.1.8 For $(Q_{Z|X}, Q_{Y|X}, \nu_Z, \nu_Y, C)$, where $C\subseteq\mathcal{X}$, define
$$d(Q_{Z|X},Q_{Y|X},\nu_Z,\nu_Y,C) := \sup_{P_X\colon \operatorname{supp}(P_X)\subseteq C}\{D(P_Y\|\nu_Y) - D(P_Z\|\nu_Z)\}, \quad (5.28)$$
where $P_X\to Q_{Y|X}\to P_Y$ and $P_X\to Q_{Z|X}\to P_Z$. Let $\alpha_Y := \sup_x \big\|\frac{dQ_{Y|X=x}}{d\nu_Y}\big\|_\infty$. In the stationary memoryless case, for any $\eta\in(0,1)$ and $\mathcal{C}_n\subseteq\mathcal{X}^n$, we have
$$\ln g^\star_{Z^n|X^n}\big(\mathcal{C}_n\cap\{Q_{Y|X}^{\otimes n}(f)\ge\eta\},\nu_Z^{\otimes n}\big) \le \ln\nu_Y^{\otimes n}(f) + d(Q_{Z|X}^{\otimes n},Q_{Y|X}^{\otimes n},\nu_Z^{\otimes n},\nu_Y^{\otimes n},\mathcal{C}_n) + 2\sqrt{(\alpha_Y-1)n\ln\tfrac{1}{\eta}} + \ln\tfrac{1}{\eta}. \quad (5.29)$$
In particular, given $Q_X$ and $\delta\in(0,1)$, for $n > 3\beta_X\ln\frac{|\mathcal{X}|}{\delta}$ we can find $\mathcal{C}_n$ with $Q_X^{\otimes n}[\mathcal{C}_n]\ge 1-\delta$ so that for any $A\subseteq\mathcal{C}_n$,
$$\ln g^\star_{Z^n|X^n}(A,\nu_Z^{\otimes n}) \le \ln g_{Y^n|X^n}(A,\nu_Y^{\otimes n},\eta) + n\sup_{P_{XU}\colon P_X=Q_X}\{D(P_{Y|U}\|\nu_Y|P_U) - D(P_{Z|U}\|\nu_Z|P_U)\} + \sqrt{3n\beta_X\ln\tfrac{|\mathcal{X}|}{\delta}}\,\ln(\alpha_Y\alpha_Z\beta_X) + 2\sqrt{(\alpha_Y-1)n\ln\tfrac{1}{\eta}} + \ln\tfrac{1}{\eta}, \quad (5.30)$$
where $P_{X|U=u}\to Q_{Z|X}\to P_{Z|U=u}$ and $P_{X|U=u}\to Q_{Y|X}\to P_{Y|U=u}$ for each $u\in\mathcal{U}$, and
$$\beta_X := \big(\min_x Q_X(x)\big)^{-1}; \quad (5.31)$$
$$\alpha_Z := |\nu_Z|\,\sup_x\Big\|\frac{dQ_{Z|X=x}}{d\nu_Z}\Big\|_\infty. \quad (5.32)$$
Proof For $t\in[0,\infty)$ and $\alpha\in[1,\infty)$, put
$$\Lambda^{\nu_Y}_{\alpha,t} = \bigotimes_{i=1}^n\big(e^{-t} + \alpha(1-e^{-t})\nu_Y\big). \quad (5.33)$$
Consider any $f\in H_{[0,1]}(\mathcal{Y}^n)$, and denote $d := d(Q_{Z|X}^{\otimes n},Q_{Y|X}^{\otimes n},\nu_Z^{\otimes n},\nu_Y^{\otimes n},\mathcal{C}_n)$ for simplicity. Then
$$e^{(\alpha_Y-1)nt+d}\,\nu_Y^{\otimes n}(f) \ge e^d\,\nu_Y^{\otimes n}(\Lambda^{\nu_Y}_{\alpha,t}f) \quad (5.34)$$
$$\ge \inf_{g\colon Q_{Z|X}^{\otimes n}(\ln g)|_{\mathcal{C}_n}\,\ge\, Q_{Y|X}^{\otimes n}(\ln\Lambda^{\nu_Y}_{\alpha,t}f)|_{\mathcal{C}_n}} \nu_Z^{\otimes n}(g) \quad (5.35)$$
$$\ge \inf_{g\colon Q_{Z|X}^{\otimes n}(\ln g)|_{\mathcal{C}_n\cap A}\,\ge\,\frac{1}{1-e^{-t}}\ln\eta} \nu_Z^{\otimes n}(g) \quad (5.36)$$
$$= \inf_{g\colon Q_{Z|X}^{\otimes n}(\ln g)|_{\mathcal{C}_n\cap A}\,\ge\,0} \nu_Z^{\otimes n}(g)\cdot\eta^{\frac{1}{1-e^{-t}}}, \quad (5.37)$$
where we defined
$$A := \{x^n\colon Q_{Y^n|X^n=x^n}(f)\ge\eta\}. \quad (5.38)$$
This proves (5.29).
The proof of (5.30) follows by applying the single-letterization argument to (5.29).
More specifically, for any $P_{X^n}$ supported on
$$\mathcal{C}_n := \{x^n\colon \hat{P}_{x^n}\le(1+\epsilon_n)Q_X\}, \quad (5.39)$$
we have
$$D(P_{Y^n}\|\nu_Y^{\otimes n}) - D(P_{Z^n}\|\nu_Z^{\otimes n})$$
$$= \sum_{i=1}^n\big[D(P_{Y^iZ_{i+1}^n}\|\nu_Y^{\otimes i}\otimes\nu_Z^{\otimes(n-i)}) - D(P_{Y^{i-1}Z_i^n}\|\nu_Y^{\otimes(i-1)}\otimes\nu_Z^{\otimes(n-i+1)})\big] \quad (5.40)$$
$$= \sum_{i=1}^n\big[D(P_{Y^iZ_{i+1}^n}\|\nu_Y\otimes P_{Y^{i-1}Z_{i+1}^n}) - D(P_{Y^{i-1}Z_i^n}\|P_{Y^{i-1}Z_{i+1}^n}\otimes\nu_Z)\big] \quad (5.41)$$
$$= n\big[D(P_{Y_I|Y^{I-1}Z_{I+1}^n}\|\nu_Y|P_{IY^{I-1}Z_{I+1}^n}) - D(P_{Z_I|Y^{I-1}Z_{I+1}^n}\|\nu_Z|P_{IY^{I-1}Z_{I+1}^n})\big] \quad (5.42)$$
$$\le n\sup_{P_{UX}\colon P_X\le(1+\epsilon_n)Q_X}\{D(P_{Y|U}\|\nu_Y|P_U) - D(P_{Z|U}\|\nu_Z|P_U)\} \quad (5.43)$$
$$\le n\sup_{P_{UX}\colon P_X=Q_X}\{D(P_{Y|U}\|\nu_Y|P_U) - D(P_{Z|U}\|\nu_Z|P_U)\} + n\epsilon_n\ln(\beta_X\alpha_Y\alpha_Z), \quad (5.44)$$
where $I$ is an equiprobable random variable on $\{1,\dots,n\}$ independent of $X^nY^nZ^n$, and
• (5.42) follows from the definition of the conditional relative entropy.
• To see (5.43), note that $U - X_I - Y_IZ_I$ forms a Markov chain if $U \leftarrow (I,Y^{I-1},Z_{I+1}^n)$, and that $P_{X_I}\le(1+\epsilon_n)Q_X$.
• The proof of (5.44) follows from some simple estimates of the relative entropy. Abbreviate $\epsilon_n$ as $\epsilon$. Consider $\bar{P}_{UX}$ where $\bar{\mathcal{U}} = \mathcal{U}\cup\{\star\}$, $\bar{\mathcal{X}} = \mathcal{X}$, defined as follows:
$$\bar{P}_U := \frac{1}{1+\epsilon}P_U + \frac{\epsilon}{1+\epsilon}\delta_\star; \quad (5.45)$$
$$\bar{P}_{X|U=u} := P_{X|U=u}, \quad \forall u\in\mathcal{U}; \quad (5.46)$$
$$\bar{P}_{X|U=\star} := \frac{1+\epsilon}{\epsilon}\Big(Q_X - \frac{1}{1+\epsilon}P_X\Big), \quad (5.47)$$
where $\delta_\star$ is the one-point distribution on $\star$. Then observe that $\bar{P}_X = Q_X$, and
$$D(\bar{P}_{Z|U}\|\nu_Z|\bar{P}_U) - D(P_{Z|U}\|\nu_Z|P_U) = D\Big(\bar{P}_{Z|U}\Big\|\frac{1}{|\nu_Z|}\nu_Z\Big|\bar{P}_U\Big) - D\Big(P_{Z|U}\Big\|\frac{1}{|\nu_Z|}\nu_Z\Big|P_U\Big) \quad (5.48)$$
$$\ge -\epsilon D\Big(\frac{1+\epsilon}{\epsilon}Q_Z - \frac{1}{\epsilon}P_Z\Big\|\frac{1}{|\nu_Z|}\nu_Z\Big) \quad (5.49)$$
$$\ge -\epsilon\ln\alpha_Z. \quad (5.50)$$
An upper bound on $D(\bar{P}_{Y|U}\|\nu_Y|\bar{P}_U)$ can also be obtained; we omit the steps here.
The proof of (5.30) is completed by choosing $\epsilon_n = \sqrt{\frac{3\beta_X}{n}\ln\frac{|\mathcal{X}|}{\delta}}$ and using
$$\Pr[\hat{P}_{X^n}\le(1+\epsilon_n)Q_X] \ge 1-\delta. \quad (5.51)$$
Finally, applying Proposition 5.1.6 to (5.30), we obtain a second-order sharpening of Proposition 5.1.4:
Theorem 5.1.9 Fix $Q_X$, $Q_{Y|X}$, $Q_{Z|X}$, and $\eta,\delta,\epsilon\in(0,1)$. Then for $n > 3\beta_X\ln\frac{2|\mathcal{X}|}{\delta}$, there exists $\mathcal{C}_n\subseteq\mathcal{X}^n$ with $Q_X^{\otimes n}[\mathcal{C}_n]\ge 1-\delta$ such that for any $A\subseteq\mathcal{C}_n$ which is an $(n,\epsilon)$-codebook under the geometric error criterion,
$$\log|A| - \log g_{Y^n|X^n}(A,\nu_Y^{\otimes n},\eta) \le n\sup_{P_{UX}\colon P_X=Q_X}\{I(Z;X|U) + D(P_{Y|U}\|\nu_Y|P_U)\} + D_{\frac12}\sqrt{n} + D_0, \quad (5.52)$$
where
$$D_{\frac12} := \sqrt{3\beta_X\ln\tfrac{2|\mathcal{X}|}{\delta}}\,\ln(\alpha_Y\alpha_Z\beta_X) + 2\sqrt{(\alpha_Y-1)\ln\tfrac{1}{\eta}} + \sqrt{\tfrac12\ln\tfrac{2}{\delta}}\,\ln\alpha_Z + 2\sqrt{(\alpha_Z-1)\ln\tfrac{1}{1-\epsilon}}, \quad (5.53)$$
$$D_0 := \ln\tfrac{1}{1-\epsilon} + \ln\tfrac{1}{\eta}, \quad (5.54)$$
and αZ , αY and βX are as defined in Theorem 5.1.8.
Proof Consider
$$\mathcal{D}_n := \Big\{x^n\colon \sum_{i=1}^n D(Q_{Z|X=x_i}\|\nu_Z) \le nD(Q_{Z|X}\|\nu_Z|Q_X) + \sqrt{\tfrac{n}{2}\ln\tfrac{1}{\delta}}\,\ln\alpha_Z\Big\}. \quad (5.55)$$
In Proposition 5.1.6, for $A\subseteq\mathcal{D}_n$, we have
$$\ln g^\star_{Z^n|X^n}(A,\nu_Z^{\otimes n}) \ge \ln|A| - nD(Q_{Z|X}\|\nu_Z|Q_X) - \sqrt{\tfrac{n}{2}\ln\tfrac{1}{\delta}}\,\ln\alpha_Z - 2\sqrt{(\alpha_Z-1)n\ln\tfrac{1}{1-\epsilon}} - \ln\tfrac{1}{1-\epsilon}. \quad (5.56)$$
Moreover, since
$$\sup_x D(Q_{Z|X}(\cdot|x)\|\nu_Z) - \inf_x D(Q_{Z|X}(\cdot|x)\|\nu_Z) \le \ln\alpha_Z, \quad (5.57)$$
by Hoeffding's inequality we have
$$Q_X^{\otimes n}[\mathcal{D}_n] \ge 1-\delta. \quad (5.58)$$
If $\mathcal{C}_n$ is as in Theorem 5.1.8 for which (5.30) holds, then for any $A\subseteq\mathcal{C}_n\cap\mathcal{D}_n$, by (5.56) and (5.30) we have
$$\ln|A| \le \ln g_{Y^n|X^n}(A,\nu_Y^{\otimes n},\eta) + nD(Q_{Z|X}\|\nu_Z|Q_X) + n\sup_{P_{XU}\colon P_X=Q_X}\{D(P_{Y|U}\|\nu_Y|P_U) - D(P_{Z|U}\|\nu_Z|P_U)\} + \sqrt{3n\beta_X\ln\tfrac{|\mathcal{X}|}{\delta}}\,\ln(\alpha_Y\alpha_Z\beta_X) + 2\sqrt{(\alpha_Y-1)n\ln\tfrac{1}{\eta}} + \sqrt{\tfrac{n}{2}\ln\tfrac{1}{\delta}}\,\ln\alpha_Z + 2\sqrt{(\alpha_Z-1)n\ln\tfrac{1}{1-\epsilon}} + \ln\tfrac{1}{1-\epsilon} + \ln\tfrac{1}{\eta}. \quad (5.59)$$
5.1.3 Relationship between g and g‹
Proposition 5.1.10 Consider $Q_{Z|X}$, the counting measure $\nu_Z$ on the finite alphabet $\mathcal{Z}$, $\eta\in(0,1)$, $\lambda\in(0,1-\eta)$, and $A\subseteq\mathcal{X}$. We claim that in the definition of $g_{Z|X}$, the assumption $g\in H_{\{0,1\}}$ can be changed to $g\in H_{[0,1]}$ without essential difference. More precisely,
$$\inf_{g\in H_{[0,1]}(\mathcal{Z})\colon Q_{Z|X}(g)|_A\ge\eta} \nu_Z(g) \le g_{Z|X}(A,\nu_Z,\eta), \quad (5.60)$$
$$g_{Z|X}(A,\nu_Z,\eta) \le \lambda^{-1}\inf_{g\in H_{[0,1]}(\mathcal{Z})\colon Q_{Z|X}(g)|_A\ge\eta+\lambda} \nu_Z(g). \quad (5.61)$$
Moreover, $g_{Z|X}$ would be upper-bounded by $g^\star_{Z|X}$ if the assumption $g\in H_{[0,\infty)}$ in the definition of $g^\star_{Z|X}$ were replaced by $H_{[0,1]}$:
$$\inf_{g\in H_{[0,1]}(\mathcal{Z})\colon Q_{Z|X}(g)|_A\ge\eta} \nu_Z(g) \le \eta\inf_{g\in H_{[0,1]}(\mathcal{Z})\colon Q_{Z|X}(\ln g)|_A\ge 0} \nu_Z(g). \quad (5.62)$$
Further, in the stationary memoryless setting with $|\mathcal{X}|<\infty$ and $A\subseteq\mathcal{X}^n$, $g_{Z^n|X^n}$ and $g^\star_{Z^n|X^n}$ are equivalent up to a sub-exponential factor. More precisely, we also have
$$g^\star_{Z^n|X^n}(A,\nu_Z^{\otimes n}) \le \frac{1}{\eta}\,e^{2\sqrt{(\alpha-1)n\ln\frac{1}{\eta}}}\inf_{g\in H_{[0,1]}(\mathcal{Z}^n)\colon Q_{Z|X}^{\otimes n}(g)|_A\ge\eta} \nu_Z^{\otimes n}(g), \quad (5.63)$$
where $\alpha := \sup_x\big\|\frac{dQ_{Z|X=x}}{d\nu_Z}\big\|_\infty$, and
$$g_{Z^n|X^n}(A,\nu_Z^{\otimes n},\eta) \le e^{o(n)}\cdot g^\star_{Z^n|X^n}(A,\nu_Z^{\otimes n}). \quad (5.64)$$
Proof By definition,
$$g_{Z|X}(A,\nu_Z,\eta) = \inf_{g\in H_{\{0,1\}}(\mathcal{Z})\colon Q_{Z|X}(g)|_A\ge\eta} \nu_Z(g), \quad (5.65)$$
so (5.60) follows because $H_{\{0,1\}}(\mathcal{Z})\subseteq H_{[0,1]}(\mathcal{Z})$.
To see (5.61), consider an arbitrary $g\in H_{[0,1]}(\mathcal{Z})$, and put $\bar{g} := 1\{z\colon g(z)\ge\lambda\}$. Note that for any $x$ such that $Q_{Z|X=x}(g)\ge\eta+\lambda$, we have
$$Q_{Z|X=x}(\bar{g}) \ge Q_{Z|X=x}(g) - Q_{Z|X=x}(g-\bar{g}) \quad (5.66)$$
$$\ge \eta+\lambda - \sup_z\big(g(z)-\bar{g}(z)\big) \quad (5.67)$$
$$\ge \eta. \quad (5.68)$$
But $\lambda\bar{g}\le g$, which establishes (5.61).
The proof of (5.62) follows from the fact that for any $g\in H_{[0,1]}(\mathcal{Z})$ and $x$,
$$Q_{Z|X=x}(\eta g) \ge \eta e^{Q_{Z|X=x}(\ln g)}, \quad (5.69)$$
and obviously $\eta g\in H_{[0,1]}(\mathcal{Z})$.
To see (5.63), given $g\in H_{[0,1]}(\mathcal{Z}^n)$ such that $Q_{Z|X}^{\otimes n}(g)|_A\ge\eta$, let $\bar{g} := \big(\frac{1}{\eta}\big)^{1+\frac{1}{t}}\Lambda^{\nu_Z}_{\alpha,t}g$ (see (4.44)), with $t := \sqrt{\frac{1}{(\alpha-1)n}\ln\frac{1}{\eta}}$. Then by the same argument as in the proof of Theorem 4.2.2,
$$Q_{Z|X}^{\otimes n}(\ln\bar{g})\big|_A \ge 0, \quad (5.70)$$
$$\nu_Z^{\otimes n}(\bar{g}) \le \frac{1}{\eta}\,e^{2\sqrt{(\alpha-1)n\ln\frac{1}{\eta}}}\,\nu_Z^{\otimes n}(g). \quad (5.71)$$
Finally, to see (5.64), consider
$$g_{Z^n|X^n}(A,\nu_Z^{\otimes n},\eta) \le \sum_{P\colon n\text{-type}} g_{Z^n|X^n}(A\cap T_P^n,\nu_Z^{\otimes n},\eta) \quad (5.72)$$
$$\le \operatorname{poly}(n)\cdot\max_P g_{Z^n|X^n}(A\cap T_P^n,\nu_Z^{\otimes n},\eta) \quad (5.73)$$
$$\le \operatorname{poly}(n)\cdot M\cdot e^{nH(Q_{Z|X}|P)+n(\epsilon-\eta)} \quad (5.74)$$
$$\le \operatorname{poly}(n)\cdot e^{H(Z^nX^n)+n(\epsilon-\eta)} \quad (5.75)$$
$$\le \operatorname{poly}(n)\cdot e^{H(Z^n)+O(\sqrt{n})+n(\epsilon-\eta)}, \quad (5.76)$$
where
• In (5.74) we used [8, Lemma 6.3] to find an $\epsilon$-maximal error code whose codebook $\bar{A}\subseteq A\cap T_P^n$ is of size $M$, where $\epsilon\in(\eta,1)$ is arbitrary; we have chosen $P$ to be one that achieves the maximum in (5.73).
• In (5.75) we set $X^n$ to be equiprobable on $\bar{A}$ and let $Z^n$ be the corresponding output.
• (5.76) used the sharp Fano inequality (Theorem 4.2.2).
The proof is completed by noting that $\epsilon$ can be arbitrarily close to $\eta$ and using
$$e^{H(Z^n)} \le \sup_{P_{X^n}\colon\operatorname{supp}(P_{X^n})\subseteq\bar{A}} e^{H(P_{Z^n})} \quad (5.77)$$
$$= g^\star_{Z^n|X^n}(\bar{A},\nu_Z^{\otimes n}) \quad (5.78)$$
$$\le g^\star_{Z^n|X^n}(A,\nu_Z^{\otimes n}). \quad (5.79)$$
5.2 Variations on Eγ
Fundamental to the information spectrum approach [44][43] to information theory is the quantity $\Pr[\imath_{P\|Q}(X)>\lambda]$, where $X\sim P$ and $\lambda\in\mathbb{R}$. The knowledge of $\Pr[\imath_{P\|Q}(X)>\lambda]$ for all parameters $\lambda$ is equivalent to the knowledge of the binary hypothesis testing region for the probability measures $P$ and $Q$ (see e.g. [43]). In this section, we study another information theoretic quantity, the knowledge of which (for all parameters) is also equivalent to the knowledge of the binary hypothesis testing region. We show how its variational formula (i.e. the functional representation of this quantity) is used in the achievability and converse proofs of some operational problems.
We define $E_\gamma$ and discuss its mathematical properties in Section 5.2.1. Section 5.2.2 discusses the relationship between $E_\gamma$ and the smooth Renyi divergence, which suggests the potential of $E_\gamma$ in non-asymptotic information theory and large deviation analysis. To illustrate such potential, in Section 5.2.3 we generalize the channel resolvability problem [155] by replacing the total variation metric with the $E_\gamma$ metric, and show how results on $E_\gamma$-resolvability can be applied to source coding, the mutual covering lemma [156] and wiretap channels in Sections 5.2.4-5.2.6.
5.2.1 Eγ: Definition and Dual Formulation
Definition 5.2.1 Given probability distributions $P\ll Q$ on the same alphabet and $\gamma\ge1$, define³
$$E_\gamma(P\|Q) := \Pr[\imath_{P\|Q}(X)>\log\gamma] - \gamma\Pr[\imath_{P\|Q}(Y)>\log\gamma], \quad (5.80)$$
where $X\sim P$ and $Y\sim Q$. Then we see that $E_\gamma$ is an $f$-divergence [158] with
$$f(x) = (x-\gamma)^+. \quad (5.81)$$
We now find a variational formula for $E_\gamma$ via convex duality. Recall that the Legendre-Fenchel dual of a convex functional $\phi(\cdot)$ on the set of measurable functions is defined by
$$\phi^\star(\pi) := \sup_f\Big\{\int f\,d\pi - \phi(f)\Big\} \quad (5.82)$$
for any finite measure $\pi$, and under mild regularity conditions (see e.g. [105, Lemma 4.5.8]), we have
$$\phi(f) = \sup_\pi\Big\{\int f\,d\pi - \phi^\star(\pi)\Big\}. \quad (5.83)$$
³We found the $E_\gamma$ notation in an email communication with Yury Polyanskiy [157]. However, we are uncertain about the genesis of such a notation.
Now, definition (5.80) can be extended to the set of all finite (not necessarily nonnegative and normalized) measures by
$$E_\gamma(\pi\|Q) = \int\Big(\frac{d\pi}{dQ}-\gamma\Big)^+ dQ, \quad (5.84)$$
which is convex in $\pi$. Then for any bounded function $f$, it is easy to verify that
$$\sup_\pi\Big\{\int f\,d\pi - E_\gamma(\pi\|Q)\Big\} = \sup_\pi\Big\{\int\Big[\frac{d\pi}{dQ}f - \Big(\frac{d\pi}{dQ}-\gamma\Big)^+\Big]dQ\Big\} \quad (5.85)$$
$$= \begin{cases}\gamma\int f\,dQ, & 0\le f\le 1 \text{ a.e.};\\ +\infty, & \text{otherwise.}\end{cases} \quad (5.86)$$
We thus obtain:
Theorem 5.2.1 For a given $\gamma\in[1,\infty)$ and a probability measure $Q$, $E_\gamma(\cdot\|Q)$ is the Legendre-Fenchel dual of the functional
$$f\mapsto\begin{cases}\gamma\int f\,dQ, & 0\le f\le 1 \text{ a.e.};\\ +\infty, & \text{otherwise.}\end{cases} \quad (5.87)$$
Thus
$$E_\gamma(\pi\|Q) = \sup_{f\colon 0\le f\le 1}\Big\{\int f\,d\pi - \gamma\int f\,dQ\Big\}. \quad (5.88)$$
The variational formula (5.88) can also be derived from (5.80) by other means; the present derivation emphasizes the dual optimization viewpoint of this thesis.
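For finite alphabets, both sides of (5.88) are easy to evaluate: the integral form (5.84) directly, and the supremum by enumerating indicator functions, which attain it. The sketch below (our own check; the function names are ours) verifies the agreement for the pair $P=\mathrm{Ber}(0.1)$, $Q=\mathrm{Ber}(0.5)$ plotted in Figure 5.2.

```python
import itertools

# P = Ber(0.1), Q = Ber(0.5): the pair plotted in Figure 5.2.
P = [0.9, 0.1]
Q = [0.5, 0.5]

def E_gamma_density(P, Q, g):
    # (5.84): integral of (dP/dQ - gamma)^+ dQ
    return sum(max(p - g * q, 0.0) for p, q in zip(P, Q))

def E_gamma_variational(P, Q, g):
    # (5.88): on a finite alphabet the sup over 0 <= f <= 1 is attained at an
    # indicator, so enumerate all events A and take max P(A) - gamma * Q(A).
    n = len(P)
    return max(
        sum(P[i] for i in A) - g * sum(Q[i] for i in A)
        for r in range(n + 1)
        for A in itertools.combinations(range(n), r)
    )

for g in [1.0, 1.3, 1.8, 4.0]:
    assert abs(E_gamma_density(P, Q, g) - E_gamma_variational(P, Q, g)) < 1e-12

print(E_gamma_density(P, Q, 1.0))  # 0.4, matching Figure 5.2 at gamma = 1
print(E_gamma_density(P, Q, 1.8))  # 0.0: E_gamma vanishes once gamma >= max dP/dQ
```

The second printed value illustrates why the curve in Figure 5.2 hits zero at $\gamma = 1.8 = \max dP/dQ$.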
Below, we prove some basic properties of Eγ useful for later sections. Additional
properties of Eγ can be found in [132][146, Theorem 21][43].
Proposition 5.2.2 Assume that $P\ll S\ll Q$ are probability distributions on the same alphabet, and $\gamma,\gamma_1,\gamma_2\ge1$.
Figure 5.2: $E_\gamma(P\|Q)$ as a function of $\gamma$, where $P = \mathrm{Ber}(0.1)$ and $Q = \mathrm{Ber}(0.5)$.
1. $[1,\infty)\to[0,\infty)\colon \gamma\mapsto E_\gamma(P\|Q)$ is convex, non-increasing, and continuous.
2. For any event $A$,
$$Q(A) \ge \frac{1}{\gamma}\big(P(A) - E_\gamma(P\|Q)\big). \quad (5.89)$$
3. Triangle inequalities:
$$E_{\gamma_1\gamma_2}(P\|Q) \le E_{\gamma_1}(P\|S) + \gamma_1 E_{\gamma_2}(S\|Q), \quad (5.90)$$
$$E_\gamma(P\|Q) + E_\gamma(P\|S) \ge \frac{\gamma}{2}|S-Q| + 1 - \gamma. \quad (5.91)$$
4. Monotonicity: if $P_{XY} = P_XP_{Y|X}$ and $Q_{XY} = Q_XQ_{Y|X}$ are joint distributions on $\mathcal{X}\times\mathcal{Y}$, then
$$E_\gamma(P_X\|Q_X) \le E_\gamma(P_{XY}\|Q_{XY}), \quad (5.92)$$
where equality holds for all $\gamma\ge1$ if and only if $P_{Y|X} = Q_{Y|X}$.
5. Given $P_X$, $P_{Y|X}$ and $Q_{Y|X}$, define
$$E_\gamma(P_{Y|X}\|Q_{Y|X}|P_X) := \mathbb{E}\big[E_\gamma(P_{Y|X}(\cdot|X)\|Q_{Y|X}(\cdot|X))\big], \quad (5.93)$$
where the expectation is w.r.t. $X\sim P_X$. Then we have
$$E_\gamma(P_XP_{Y|X}\|P_XQ_{Y|X}) = E_\gamma(P_{Y|X}\|Q_{Y|X}|P_X). \quad (5.94)$$
6.
$$1 - \gamma\Big(1 - \frac12|P-Q|\Big) \le E_\gamma(P\|Q) \le \frac12|P-Q|. \quad (5.95)$$
Proof The proofs of 1), 2), 4), 5) and the second inequality in (5.95) are omitted
since they follow either directly from the definition of Eγ or are similar to the proofs
of the corresponding properties for total variation distance.
For 3), observe that
$$E_{\gamma_1\gamma_2}(P\|Q) = \max_A\big(P(A) - \gamma_1S(A) + \gamma_1S(A) - \gamma_1\gamma_2Q(A)\big) \quad (5.96)$$
$$\le \max_A\big(P(A)-\gamma_1S(A)\big) + \max_A\big(\gamma_1S(A)-\gamma_1\gamma_2Q(A)\big) \quad (5.97)$$
$$= E_{\gamma_1}(P\|S) + \gamma_1E_{\gamma_2}(S\|Q), \quad (5.98)$$
and that
$$E_\gamma(P\|Q) + E_\gamma(P\|S) = \max_A\big(P(A)-\gamma Q(A)\big) + \max_A\big(1-P(A)-\gamma+\gamma S(A)\big) \quad (5.99)$$
$$\ge \max_A\big(P(A)-\gamma Q(A) + 1-P(A)-\gamma+\gamma S(A)\big) \quad (5.100)$$
$$= \gamma\max_A\big(S(A)-Q(A)\big) + 1-\gamma \quad (5.101)$$
$$= \frac{\gamma}{2}|S-Q| + 1-\gamma. \quad (5.102)$$
As for 6), note that the left inequality in (5.95) follows by setting $S = P$ in (5.91).
The $E_\gamma$ metric provides a very convenient tool for change-of-measure purposes: if $E_\gamma(P\|Q)$ is small, and the probability of some event is large under $P$, then by (5.89) the probability of this event under $Q$ is essentially lower-bounded by $1/\gamma$ times its probability under $P$. In the special case of $\gamma=1$ (total variation distance), this change-of-measure trick has been widely used, see [159][160], but the general $\gamma\ge1$ case is more versatile, since $P$ and $Q$ need not be essentially the same.
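The change-of-measure bound (5.89) can be checked exhaustively on a small alphabet, since it must hold simultaneously for every event. Below is such a check (the four-point distributions are toy values of our own):

```python
import itertools

# Toy four-point distributions (ours, not from the thesis).
P = [0.5, 0.3, 0.15, 0.05]
Q = [0.1, 0.2, 0.3, 0.4]

def E_gamma(P, Q, g):
    # (5.84) specialized to a finite alphabet
    return sum(max(p - g * q, 0.0) for p, q in zip(P, Q))

gamma = 2.0
eg = E_gamma(P, Q, gamma)

# (5.89): Q(A) >= (P(A) - E_gamma(P||Q)) / gamma for every event A.
for r in range(len(P) + 1):
    for A in itertools.combinations(range(len(P)), r):
        pA = sum(P[i] for i in A)
        qA = sum(Q[i] for i in A)
        assert qA >= (pA - eg) / gamma - 1e-12

print(eg)
```

The bound is in fact equivalent to the variational formula: since $E_\gamma = \max_A(P(A)-\gamma Q(A))$, rearranging gives (5.89) for every $A$ at once.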
5.2.2 The Relationship of Eγ with Excess Information and
Smooth Renyi Divergence
This subsection shows that $E_\gamma$ (as a function of $\gamma$) is the inverse function of the smooth Renyi divergence [53] of order $+\infty$. Various bounds among $E_\gamma$ and other popular information measures used in non-asymptotic information theory are presented. These facts buttress the potential of $E_\gamma$ in non-asymptotic analysis and large deviation theory.
Definition 5.2.2 For $\gamma>0$, a probability measure $P$, and a nonnegative $\sigma$-finite measure $\mu$ with $P\ll\mu$, we define the complementary relative information spectrum metric with threshold $\gamma$:
$$\mathcal{F}_\gamma(P\|\mu) := \Pr[\imath_{P\|\mu}(X)>\log\gamma], \quad (5.103)$$
where $X\sim P$. As such, the complementary relative information spectrum can be expressed in terms of the relative information spectrum (see [43]) as
$$\mathcal{F}_\gamma(P\|\mu) = 1 - \mathbb{F}_{P\|\mu}(\log\gamma). \quad (5.104)$$
Also, it is clear from (5.103) that $\mathcal{F}_\gamma$ upper-bounds $E_\gamma$. Note that (5.103) is nonnegative and vanishes when $P=\mu$ provided that $\gamma>1$, and can therefore be considered as a measure of the discrepancy between $P$ and $\mu$. In addition to playing an important role in one-shot analysis (see [161]), (5.103) provides richer information than the relative entropy measure, since
$$D(P\|\mu) := \mathbb{E}[\imath_{P\|\mu}(X)] \quad (5.105)$$
$$= \int_{[0,+\infty)}\Pr[\imath_{P\|\mu}(X)>\tau]\,d\tau - \int_{(-\infty,0]}\big(1-\Pr[\imath_{P\|\mu}(X)>\tau]\big)\,d\tau. \quad (5.106)$$
Moreover, the complementary relative information spectrum is also related to the total variation distance, since for probability measures $P$ and $Q$,
$$\frac12|P-Q| = \Pr[\imath_{P\|Q}(X)>0] - \Pr[\imath_{P\|Q}(Y)>0], \quad (5.107)$$
where $X\sim P$ and $Y\sim Q$ [162]. However, perhaps surprisingly, the complementary relative information spectrum metric does not satisfy the data processing inequality, in contrast to the relative entropy and total variation distance; the proof of such a result, along with other properties of the complementary relative information spectrum, can be found in [143].
Another information measure closely related to $E_\gamma$ is the smooth Renyi divergence. Let us first generalize some of our definitions to allow nonnegative finite measures that are not necessarily probability measures:
Definition 5.2.3 For $\gamma\ge1$ and nonnegative finite measures $\mu\ll\nu$ on $\mathcal{X}$,
$$|\mu-\nu| := \int|d\mu-d\nu|, \quad (5.108)$$
and⁴
$$E_\gamma(\mu\|\nu) := \sup_A\{\mu(A)-\gamma\nu(A)\} \quad (5.109)$$
$$= (\mu-\gamma\nu)(\imath_{\mu\|\nu}>\log\gamma) \quad (5.110)$$
$$= \frac12(\mu-\gamma\nu)(\mathcal{X}) - \frac12(\mu-\gamma\nu)(\imath_{\mu\|\nu}\le\log\gamma) + \frac12(\mu-\gamma\nu)(\imath_{\mu\|\nu}>\log\gamma) \quad (5.111)$$
$$= \frac12\mu(\mathcal{X}) - \frac{\gamma}{2}\nu(\mathcal{X}) + \frac12|\mu-\gamma\nu|. \quad (5.112)$$
Note that $E_1(P\|\mu)\ne\frac12|P-\mu|$ when $\mu$ is not a probability measure.
The following result is a generalization of the $\gamma_1=1$ case of (5.90) to unnormalized measures; the proof is immediate from the definition (5.109) and the subadditivity of the sup operator.
⁴Following established usage in measure theory, we use $\lambda(\imath_{\mu\|\nu}>\log\gamma)$ as an abbreviation of $\lambda(\{x\colon\imath_{\mu\|\nu}(x)>\log\gamma\})$ for an arbitrary signed measure $\lambda$.
Proposition 5.2.3 (Triangle inequality) If $\mu$, $\nu$ and $\theta$ are nonnegative finite measures on the same alphabet with $\mu\ll\theta\ll\nu$, then
$$E_\gamma(\mu\|\nu) \le E_1(\mu\|\theta) + E_\gamma(\theta\|\nu). \quad (5.113)$$
The Renyi divergence is not an $f$-divergence, but is a monotonic function of the Hellinger distance [42]. We have seen the definition of the Renyi divergence between two probability measures in Section 2.5.1. Below we extend the definition to the case of unnormalized measures:
Definition 5.2.4 (Renyi $\alpha$-divergence) Let $\mu$ be a nonnegative finite measure and $Q$ a probability measure on $\mathcal{X}$, $\mu\ll Q$, $X\sim Q$. For $\alpha\in(0,1)\cup(1,+\infty)$,
$$D_\alpha(\mu\|Q) := \frac{1}{\alpha-1}\log\mathbb{E}\Big[\Big(\frac{d\mu}{dQ}(X)\Big)^\alpha\Big], \quad (5.114)$$
and
$$D_0(\mu\|Q) := \log\frac{1}{Q(\imath_{\mu\|Q}>-\infty)}, \quad (5.115)$$
$$D_\infty(\mu\|Q) := \mu\text{-}\operatorname{ess\,sup}\,\imath_{\mu\|Q}, \quad (5.116)$$
which agree with the limits as $\alpha\downarrow0$ and $\alpha\uparrow\infty$. $D_\alpha(P\|Q)$ is non-negative and monotonically increasing in $\alpha$. More properties of the Renyi divergence can be found, e.g., in [163][130][43].
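For finite alphabets, Definition 5.2.4 and the monotonicity in $\alpha$ just noted can be verified directly. The sketch below (toy values ours) also includes the $\alpha\to\infty$ limit (5.116), which for a discrete probability measure is $\log\max dP/dQ$:

```python
import math

# Renyi alpha-divergence (5.114) for finite probability vectors, with the
# alpha -> infinity limit (5.116).  Toy distributions are our own choice.
def D_alpha(P, Q, alpha):
    if alpha == float("inf"):
        # mu-ess sup of the relative information = log max dP/dQ on supp(P)
        return max(math.log(p / q) for p, q in zip(P, Q) if p > 0)
    s = sum(q * (p / q) ** alpha for p, q in zip(P, Q) if p > 0)
    return math.log(s) / (alpha - 1)

P = [0.7, 0.2, 0.1]
Q = [0.3, 0.3, 0.4]

# Monotonicity in alpha, as stated after Definition 5.2.4.
alphas = [0.25, 0.5, 0.9, 1.5, 2.0, 10.0, float("inf")]
vals = [D_alpha(P, Q, a) for a in alphas]
assert all(x <= y + 1e-12 for x, y in zip(vals, vals[1:]))

print(vals[-1])  # D_infinity = log(0.7/0.3)
```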
Definition 5.2.5 (smooth Renyi $\alpha$-divergence) For $\alpha\in[0,1)$ and $\epsilon\in(0,1)$,
$$D^{+\epsilon}_\alpha(P\|Q) := \sup_{\mu\in B^\epsilon(P)} D_\alpha(\mu\|Q); \quad (5.117)$$
for $\alpha\in(1,\infty]$ and $\epsilon\in(0,1)$,
$$D^{-\epsilon}_\alpha(P\|Q) := \inf_{\mu\in B^\epsilon(P)} D_\alpha(\mu\|Q), \quad (5.118)$$
where $B^\epsilon(P) := \{\mu\text{ nonnegative}\colon E_1(P\|\mu)\le\epsilon\}$ is the $\epsilon$-neighborhood of $P$ in $E_1$.
The definition of the smooth $\infty$-divergence given here agrees with the smooth max Renyi divergence in [164], although the definitions look different. However, our smooth $0$-divergence is different from the smooth min Renyi divergences in [165] and [164, Definition 1], except for non-atomic measures.⁵
Remark 5.2.1 The smooth Renyi divergence is a natural extension of the smooth Renyi entropy $H^\epsilon_\alpha$ defined in [53] (which can be viewed as the special case where the reference measure is the counting measure). In [164] the smooth min and max Renyi divergences are introduced, which correspond to the $\alpha=0$ and $\alpha=+\infty$ cases of Definition 5.2.5. Moreover, we have introduced the superscripts $+$ and $-$ in the notation to emphasize the difference between the two possible ways of smoothing in (5.117) and (5.118).
The following quantity, which characterizes the binary hypothesis testing error, is
a g-divergence but not a monotonic function of any f -divergence [42]. We will see in
Proposition 5.2.4 that it is a monotonic function of the 0-smooth Renyi divergence.
Definition 5.2.6 For nonnegative finite measures $\mu$ and $\nu$ on $\mathcal{X}$, define
$$\beta_\alpha(\mu,\nu) := \min_{A\colon\mu(A)\ge\alpha}\nu(A). \quad (5.119)$$
⁵With the definition of the smooth min Renyi divergence in [165] and [164, Definition 1], Proposition 5.2.4-7) would hold with $\beta_\alpha$ as defined in (5.120) rather than (5.119).
Remark 5.2.2 In the literature, the definition of $\beta_\alpha(P,Q)$ is usually restricted to probability measures and allows randomized tests:
$$\beta_\alpha(P_W,Q_W) := \min\int P_{Z|W}(1|w)\,dQ_W(w), \quad (5.120)$$
where the minimization is over all random transformations $P_{Z|W}$ with $Z\in\{0,1\}$ such that
$$\int P_{Z|W}(1|w)\,dP_W(w) \ge \alpha. \quad (5.121)$$
In contrast to $E_\gamma$, allowing randomization in the definition can change the value of $\beta_\alpha$, except for non-atomic measures. Nevertheless, many important properties of $\beta_\alpha$ are not affected by this difference.
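On a small alphabet, the set-based $\beta_\alpha$ of (5.119) can be evaluated by brute force over all events, which makes its basic behavior easy to inspect. A sketch (toy values ours):

```python
import itertools

# Set-based beta_alpha of (5.119): minimize Q(A) over events with P(A) >= alpha.
def beta(alpha, P, Q):
    n = len(P)
    best = float("inf")
    for r in range(n + 1):
        for A in itertools.combinations(range(n), r):
            if sum(P[i] for i in A) >= alpha:
                best = min(best, sum(Q[i] for i in A))
    return best

# Toy distributions (ours, not from the thesis).
P = [0.5, 0.3, 0.15, 0.05]
Q = [0.1, 0.2, 0.3, 0.4]

# beta_alpha is nondecreasing in alpha, and beta_1 forces A to cover supp(P).
grid = [0.1 * k for k in range(11)]
vals = [beta(a, P, Q) for a in grid]
assert all(x <= y + 1e-12 for x, y in zip(vals, vals[1:]))
assert abs(vals[-1] - sum(Q)) < 1e-12   # supp(P) is the whole alphabet here

print(vals)
```

For larger alphabets one would instead sort by the likelihood ratio (the Neyman-Pearson structure of the optimal test), but brute force suffices to illustrate the definition.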
We conclude this section with some inequalities relating the distance measures we have discussed. These results show the asymptotic equivalence among these metrics: $E_\gamma$ is essentially equivalent to $\mathcal{F}_\gamma$, and the value of $\gamma$ at which they vanish is essentially the exponential of the relative entropy; the smooth Renyi divergence is also asymptotically equivalent to the relative entropy.
Proposition 5.2.4 Suppose $\mu$ is a finite nonnegative measure and $P$ and $Q$ are probability measures, all on $\mathcal{X}$, $X\sim P$, $\epsilon\in(0,1)$, and $\gamma\ge1$.
1. For $a>1$,
$$E_\gamma(P\|Q) \le \mathcal{F}_\gamma(P\|Q) \le \frac{a}{a-1}E_{\frac{\gamma}{a}}(P\|Q). \quad (5.122)$$
2. $D^{+\epsilon}_\alpha(P\|Q)$ is increasing in $\alpha\in[0,1]$ and $D^{-\epsilon}_\alpha(P\|Q)$ is increasing in $\alpha\in[1,\infty]$.
3. If $P$ is a probability measure, then
$$D(P\|Q) \le \int_0^\infty\Pr[\imath_{P\|Q}(X)>\tau]\,d\tau; \quad (5.123)$$
$$D(P\|Q) \ge E_\gamma(P\|Q)\log\gamma - 2e^{-1}\log e. \quad (5.124)$$
4. If $\alpha\in[0,1)$ then⁶
$$D_\alpha(\mu\|Q) \le \log\gamma - \frac{1}{1-\alpha}\log\big(\mu(\mathcal{X})-E_\gamma(\mu\|Q)\big). \quad (5.125)$$
If $\alpha\in(1,\infty)$ then
$$D_\alpha(\mu\|Q) \ge \log\gamma + \frac{1}{\alpha-1}\log E_\gamma(\mu\|Q). \quad (5.126)$$
5. Suppose $\alpha\in[0,1)$ and $E_\gamma(P\|Q)<1-\epsilon$; then
$$D^{+\epsilon}_\alpha(P\|Q) \le \log\gamma - \frac{1}{1-\alpha}\log\big(1-\epsilon-E_\gamma(P\|Q)\big). \quad (5.127)$$
Suppose $\alpha\in(1,\infty]$ and $E_\gamma(P\|Q)>\epsilon$; then
$$D^{-\epsilon}_\alpha(P\|Q) \ge \log\gamma + \frac{1}{\alpha-1}\log\big(E_\gamma(P\|Q)-\epsilon\big). \quad (5.128)$$
6. $D^{-\epsilon}_\infty(P\|Q)\le\log\gamma \iff E_\gamma(P\|Q)\le\epsilon$. That is, $D^{-\epsilon}_\infty(P\|Q) = \log\inf\{\gamma\colon E_\gamma(P\|Q)\le\epsilon\}$.
7. $D^{+\epsilon}_0(P\|Q) = -\log\beta_{1-\epsilon}(P,Q)$.
8. Fix $\tau\in\mathbb{R}$ and $\epsilon\ge\Pr[\imath_{P\|Q}(X)\le\tau]$, where $X\sim P$ and $P$ is non-atomic. Then
$$D^{+\epsilon}_0(P\|Q) \ge \tau + \log\frac{1}{1-\epsilon}. \quad (5.129)$$
Moreover, $D^{+\epsilon}_0(P\|Q)\ge\tau$ holds when $P$ is not necessarily non-atomic.
⁶The special case of (5.125) for $\alpha=\frac12$, $\mu(\mathcal{X})=1$ and $\gamma=1$ is equivalent to a well-known bound on a quantity called fidelity in terms of the total variation distance; see [166].
Proof 1. The first inequality in (5.122) is evident from the definition. For the second inequality in (5.122), consider the event
$$A := \{x\in\mathcal{X}\colon\imath_{P\|Q}(x)>\log\gamma\}. \quad (5.130)$$
Then
$$E_{\frac{\gamma}{a}}(P\|Q) \ge P(A) - \frac{\gamma}{a}Q(A)$$
$$\ge \Pr[\imath_{P\|Q}(X)>\log\gamma] - \frac{\gamma}{a}\cdot\frac{1}{\gamma}\Pr[\imath_{P\|Q}(X)>\log\gamma] \quad (5.131)$$
$$= \frac{a-1}{a}\Pr[\imath_{P\|Q}(X)>\log\gamma], \quad (5.132)$$
and the result follows by rearrangement.
2. Direct from the monotonicity of the Renyi divergences in $\alpha$ and the definitions of their smooth versions.
3. The bound (5.123) follows from $D(P\|Q) = \mathbb{E}[\imath_{P\|Q}(X)]$, and (5.124) can be seen from
$$D(P\|Q) \ge \mathbb{E}[|\imath_{P\|Q}(X)|] - 2e^{-1}\log e \quad (5.133)$$
$$\ge \log\gamma\,\Pr[\imath_{P\|Q}(X)>\log\gamma] - 2e^{-1}\log e \quad (5.134)$$
$$\ge \log\gamma\,E_\gamma(P\|Q) - 2e^{-1}\log e, \quad (5.135)$$
where (5.133) is due to Pinsker [167, (2.3.2)], (5.134) uses Markov's inequality, and (5.135) follows from the definition (5.80).
4. By considering the $\frac{d\mu}{dQ}>\gamma$ and $\frac{d\mu}{dQ}\le\gamma$ cases separately, we can check the following (homogeneous) inequalities: for each $\alpha<1$,
$$\Big|\frac{d\mu}{dQ}-\gamma\Big| \ge \frac{d\mu}{dQ} - 2\gamma^{1-\alpha}\Big(\frac{d\mu}{dQ}\Big)^\alpha + \gamma, \quad (5.136)$$
and for each $\alpha>1$,
$$\Big|\frac{d\mu}{dQ}-\gamma\Big| \le -\frac{d\mu}{dQ} + 2\gamma^{1-\alpha}\Big(\frac{d\mu}{dQ}\Big)^\alpha + \gamma. \quad (5.137)$$
Integrating both sides of (5.136) with respect to $dQ$, we obtain
$$|\mu-\gamma Q| \ge \mu(\mathcal{X}) - 2\gamma^{1-\alpha}\int\Big(\frac{d\mu}{dQ}\Big)^\alpha dQ + \gamma, \quad (5.138)$$
and (5.125) follows by rearrangement. Integrating both sides of (5.137) with respect to $dQ$, we obtain
$$|\mu-\gamma Q| \le -\mu(\mathcal{X}) + 2\gamma^{1-\alpha}\int\Big(\frac{d\mu}{dQ}\Big)^\alpha dQ + \gamma, \quad (5.139)$$
and (5.126) follows by rearrangement.
5. Immediate from the previous result, the definition of the smooth Renyi divergence, and the triangle inequality (5.90).
6. $\Rightarrow$: By assumption there exists a nonnegative finite measure $\mu$ such that $E_1(P\|\mu)\le\epsilon$ and $\mu\le\gamma Q$. Then from Proposition 5.2.3,
$$E_\gamma(P\|Q) \le E_1(P\|\mu) + E_\gamma(\mu\|Q) \le \epsilon + 0.$$
$\Leftarrow$: Define $\mu$ by $\frac{d\mu}{dQ} := \min\{\frac{dP}{dQ},\gamma\}$. Since $\{\frac{dP}{dQ}>\gamma\} = \{\frac{dP}{d\mu}>1\}$,
$$E_1(P\|\mu) = P\Big(\frac{dP}{dQ}>\gamma\Big) - \mu\Big(\frac{dP}{dQ}>\gamma\Big) = P\Big(\frac{dP}{dQ}>\gamma\Big) - \gamma Q\Big(\frac{dP}{dQ}>\gamma\Big) \le E_\gamma(P\|Q).$$
Then $D_\infty(\mu\|Q)\le\log\gamma$ implies that $D^{-\epsilon}_\infty(P\|Q)\le\log\gamma$.
7. "$\ge$": Let $A$ be a set achieving the minimum in (5.119), and let $\mu$ be the restriction of $P$ to $A$. Then by definition (5.119) we have
$$E_1(P\|\mu) = P(A^c) \le \epsilon, \quad (5.140)$$
and therefore
$$D^{+\epsilon}_0(P\|Q) \ge -\log\beta_{1-\epsilon}(P,Q). \quad (5.141)$$
"$\le$": Fix an arbitrary $\delta>0$ and let $\mu$ be a nonnegative measure satisfying
$$D^{+\epsilon}_0(P\|Q) < D_0(\mu\|Q) + \delta. \quad (5.142)$$
Define $A := \operatorname{supp}(\mu)$. Then
$$P(A^c) = P(A^c) - \mu(A^c) \quad (5.143)$$
$$\le E_1(P\|\mu) \quad (5.144)$$
$$\le \epsilon, \quad (5.145)$$
hence
$$D^{+\epsilon}_0(P\|Q) - \delta \le D_0(\mu\|Q) \quad (5.146)$$
$$= -\log Q(A) \quad (5.147)$$
$$\le -\log\beta_{1-\epsilon}(P,Q), \quad (5.148)$$
and the result follows by letting $\delta\downarrow0$.
8. In the case of non-atomic $P$, there exists a set $A\subseteq\{x\colon\imath_{P\|Q}(x)>\tau\}$ such that $P(A) = 1-\epsilon$. Then
$$\beta_{1-\epsilon}(P,Q) \le Q(A) \quad (5.149)$$
$$\le \exp(-\tau)P(A). \quad (5.150)$$
We obtain (5.129) by taking logarithms on both sides of the above and invoking Part 7. When $P$ is not necessarily non-atomic, since $P(\imath_{P\|Q}>\tau)\ge1-\epsilon$,
$$\beta_{1-\epsilon}(P,Q) \le Q(\imath_{P\|Q}>\tau) \quad (5.151)$$
$$\le \exp(-\tau)P(\imath_{P\|Q}>\tau) \quad (5.152)$$
$$\le \exp(-\tau), \quad (5.153)$$
and again the result follows by taking logarithms on both sides of the above.
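The sandwich (5.122) between $E_\gamma$ and $\mathcal{F}_\gamma$ is easy to verify numerically for finite distributions; the sketch below (toy values and function names ours) checks it over a small range of $\gamma$ and $a$:

```python
# Numeric check of the sandwich (5.122):
#   E_gamma <= F_gamma <= a/(a-1) * E_{gamma/a},  a > 1.
P = [0.6, 0.25, 0.1, 0.05]
Q = [0.2, 0.2, 0.3, 0.3]

def E_gamma(P, Q, g):
    # (5.84): integral of (dP/dQ - gamma)^+ dQ
    return sum(max(p - g * q, 0.0) for p, q in zip(P, Q))

def F_gamma(P, Q, g):
    # (5.103) with mu = Q: P-mass of the set where dP/dQ > gamma
    return sum(p for p, q in zip(P, Q) if p > g * q)

for g in [1.0, 1.5, 2.0, 3.0]:
    for a in [1.5, 2.0, 4.0]:
        lhs = E_gamma(P, Q, g)
        mid = F_gamma(P, Q, g)
        rhs = a / (a - 1) * E_gamma(P, Q, g / a)
        assert lhs <= mid + 1e-12
        assert mid <= rhs + 1e-9

print(F_gamma(P, Q, 1.0))
```

The check mirrors the proof of Part 1: on the event $\{\imath_{P\|Q}>\log\gamma\}$ one has $Q\le P/\gamma$, which yields the factor $a/(a-1)$.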
5.2.3 Eγ-Resolvability
Given the close connection between $E_\gamma$ and the smooth Renyi divergence, and the variational formula (5.88), we expect that $E_\gamma$ has great potential in various venues of non-asymptotic information theory. However, in this thesis we only discuss the usage of $E_\gamma$ in the setting of channel resolvability.
After an introduction to the resolvability problem and the formulation of its extension to the $E_\gamma$ setting, we present a general one-shot achievability bound on $E_\gamma$-resolvability, which is the basis of all information-theoretic applications of $E_\gamma$ that we will discuss in this thesis. Then, a refinement of such a result using concentration inequalities leads to a one-shot tail bound on the approximation error for random codebooks, which is useful for some problems in the information-theoretic literature [168]. Finally, we argue the asymptotic tightness of the one-shot bounds in the discrete memoryless setting using the method of types.
Later, we will discuss the applications of the results on $E_\gamma$-resolvability in non-asymptotic information theory (Sections 5.2.4-5.2.6).
The Resolvability Problem: An Introduction
Channel resolvability, introduced by Han and Verdu [155], is defined as the minimum
randomness rate required to synthesize an input so that its corresponding output dis-
tribution approximates a target output distribution. While the resolvability problem
itself differs from classical topics in information theory such as data compression and
transmission, [155] unveils its potential utility in operational problems through the
solution of the strong converse problem of identification coding [22]. Other applica-
tions of distribution approximation in information theory include common randomness of two random variables [6], strong converse in identification through channels
[155], random process simulation [169], secrecy [170][171][172][173], channel synthesis
[174][159], lossless and lossy source coding [169][155][160], and the empirical distri-
bution of a capacity-achieving code [175][149]. The achievability part of resolvability
(also known as the soft-covering lemma in [159]) is particularly useful, and coding the-
orems via resolvability have certain advantages over what is obtained from traditional
typicality-based approaches (see e.g. [172]).
If the channel is stationary memoryless and the target output distribution is in-
duced by a stationary memoryless input, then the resolvability is the minimum mutual
information over all input distributions inducing the (per-letter) target output dis-
tribution, no matter whether the approximation error is measured in total variation
distance [155][176, Theorem 1], normalized relative entropy [6, Theorem 6.3][155],
or unnormalized relative entropy [173]. In contrast, relatively few measures for the
quality of the approximation of output statistics have been proposed for which the
resolvability can be strictly smaller than mutual information. As shown by Steinberg
and Verdu [169], one exception is the Wasserstein distance measure, in which case
the finite precision resolvability for the identity channel can be related to the rate-
distortion function of the source where the distortion is the metric in the definition
of the Wasserstein distance [169].
We remark that in view of convex duality, channel resolvability in the relative
entropy is equivalent to (our new definition of) the image-size problem in a precise
sense (Proposition 5.1.5): there exists an input distribution supported on a set of a
given size whose output approximates a given output distribution if and only if the
“image” of that set takes up almost all of that output distribution.
In this section we generalize the theory of resolvability by considering a distance
measure, Eγ, defined in Section 5.2.1, of which the total variation distance is a special
case where $\gamma = 1$. The $E_\gamma$ metric$^7$ is more forgiving than total variation distance
when $\gamma > 1$, and the larger $\gamma$ is, the less randomness is needed at the input for
approximation in $E_\gamma$. Various achievability and converse bounds for resolvability in
the $E_\gamma$ metric will be derived.
• In the case of a discrete memoryless channel with a given stationary memoryless target output, we provide a single-letter characterization of the minimum exponential growth rate of $\gamma$ to achieve approximation in $E_\gamma$.
• In addition to achievability results in terms of the expectation of the approximation error over a random codebook, we also prove achievability under the high-probability criterion$^8$ for a random codebook. The implications of the latter problem for secrecy have been noted by several authors [168][177][178]. Here we adopt a simple non-asymptotic approach based on concentration inequalities, dispensing with the finiteness or stationarity assumptions on the alphabet required by the previous proof method [168] based on Chernoff bounds.
• An asymptotically matching converse bound is derived for the stationary memoryless case using the method of types. The analysis is different from the previous proof for the $\gamma = 1$ (total variation) special case and, in particular, we need to introduce a slightly new definition of conditional typicality [18] for this proof.
We remark that while the conventional resolvability theory ($\gamma = 1$) also considers
the amount of randomness needed to approximate the worst case output distribution,
we do not discuss the worst case counterpart for general γ in this thesis. However,
upper and lower bounds for the worst case Eγ-resolvability can be found in our journal
version [143].
7"Metric" or "distance" are used informally since, other than nonnegativity, $E_\gamma$, in general, does not satisfy any of the other three requirements for a metric.
8More precisely, by "the high-probability criterion" we mean that the approximation error satisfies a Gaussian concentration result (which ensures a doubly exponential decay in blocklength when the single-shot result is applied to the multi-letter setting).
Problem Formulation
The setup in Figure 5.3 is the same as in the original paper on channel resolvability
under total variation distance [155]. Given a random transformation $Q_{X|U}$ and a
target distribution $\pi_X$, we wish to minimize the size $M$ of a codebook $c_M = (c_m)_{m=1}^{M}$
such that when the codewords are equiprobably selected, the output distribution
\[
Q_X[c_M] := \frac{1}{M}\sum_{m=1}^{M} Q_{X|U=c_m} \tag{5.154}
\]
approximates $\pi_X$. The difference from [155] is that we use $E_\gamma$ (and other metrics) to
measure the level of the approximation.
Figure 5.3: Setup for channel resolvability: codewords $(c_m)_{m=1}^{M}$, channel $Q_{X|U}$, output $Q_X[c_M] \approx \pi_X$.
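As a toy illustration of (5.154), the following sketch computes the codebook-induced output distribution and its total variation gap to a target. The binary channel, target, and codebook here are hypothetical numbers chosen for illustration, not taken from the text.

```python
import numpy as np

def mixture_output(Q_X_given_U, codebook):
    """Output distribution Q_X[c_M] = (1/M) sum_m Q_{X|U=c_m}, cf. (5.154)."""
    return np.mean([Q_X_given_U[c] for c in codebook], axis=0)

# Hypothetical binary-input binary-output channel: row u is Q_{X|U=u}.
Q = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi_X = np.array([0.5, 0.5])            # target output distribution
codebook = [0, 1, 1, 0]                # M = 4 equiprobable codewords
Q_out = mixture_output(Q, codebook)
tv = 0.5 * np.abs(Q_out - pi_X).sum()  # total variation gap to the target
```

Here `Q_out` is exactly the equal-weight mixture of the rows selected by the codebook, so it always sums to one; the codebook design problem is to make `tv` (or the weaker $E_\gamma$ gap) small with as few codewords as possible.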
The fundamental one-shot tradeoff is the minimum M required for a prespecified
degree of approximation. One-shot bounds are general in that no structural assump-
tions on either the random transformation or the target distribution are imposed.
The corresponding asymptotic results can usually be recovered quite simply from the
one-shot bounds using, say, the law of large numbers in the memoryless case. Asymp-
totic limits are of interest because of the compactness of the expressions, and because
good one-shot bounds are not always known, especially in the converse parts.
Unless otherwise stated, we do not restrict the alphabets to be finite or countable.
All one-shot achievability bounds for $E_\gamma$-resolvability apply to general alphabets; some asymptotic converse bounds assume finite input alphabets or discrete memoryless channels (DMC)$^9$.
9A DMC is a stationary memoryless channel whose input and output alphabets are finite, which is denoted by the corresponding per-letter random transformation (such as $Q_{X|U}$) in this chapter.
Next, we define the achievable regions in the general asymptotic setting (when
sources and channels are arbitrary, see [155][171]) as well as the case of stationary
memoryless channels and memoryless outputs. Boldface letters such as $\mathbf{X}$ denote a general sequence of random variables $(X^n)_{n=1}^{\infty}$, and sans-serif letters such as $\mathsf{X}$ denote the generic distributions in i.i.d.\ settings.
Definition 5.2.7 Given a channel$^{10}$ $(Q_{X^n|U^n})_{n=1}^{\infty}$ and a sequence of target distributions $(\pi_{X^n})_{n=1}^{\infty}$, the triple $(G,R,\mathbf{X})$ is $\epsilon$-achievable ($0<\epsilon<1$) if there exist $(c_{M_n})_{n=1}^{\infty}$, where $c_{M_n} := (c_1,\dots,c_{M_n})$ and $c_m \in \mathcal{U}^n$ for each $m = 1,\dots,M_n$, and $(\gamma_n)_{n=1}^{\infty}$, so that
\begin{align}
\limsup_{n\to\infty} \frac{1}{n}\log M_n &\le R; \tag{5.155}\\
\limsup_{n\to\infty} \frac{1}{n}\log \gamma_n &\le G; \tag{5.156}\\
\sup_n E_{\gamma_n}(Q_{X^n}[c_{M_n}] \,\|\, \pi_{X^n}) &\le \epsilon. \tag{5.157}
\end{align}
Moreover, $(G,R,\mathbf{X})$ is said to be achievable if it is $\epsilon$-achievable for all $0<\epsilon<1$.
Define the asymptotic fundamental limits$^{11}$
\begin{align}
G_\epsilon(R,\mathbf{X}) &:= \min\{g \colon (g,R,\mathbf{X}) \text{ is } \epsilon\text{-achievable}\}; \tag{5.158}\\
G(R,\mathbf{X}) &:= \sup_{\epsilon>0} G_\epsilon(R,\mathbf{X}), \tag{5.159}
\end{align}
and
\begin{align}
S_\epsilon(G,\mathbf{X}) &:= \min\{r \colon (G,r,\mathbf{X}) \text{ is } \epsilon\text{-achievable}\}; \tag{5.160}\\
S(G,\mathbf{X}) &:= \sup_{\epsilon>0} S_\epsilon(G,\mathbf{X}), \tag{5.161}
\end{align}
$^{10}$In this setting a ``channel'' refers to a sequence of random transformations.
$^{11}$We can write min instead of inf in (5.158) and (5.160) since for fixed $\mathbf{X}$ the set of $(G,R)$ such that $(G,R,\mathbf{X})$ is $\epsilon$-achievable is necessarily closed.
which, in keeping with [155], we refer to as the resolvability function. In the special case of $Q_{X^n|U^n} = Q_{\mathsf{X|U}}^{\otimes n}$ and target $\pi_{X^n} = \pi_{\mathsf{X}}^{\otimes n}$, we may write the quantities in (5.159) and (5.161) as $G(R, \pi_{\mathsf{X}})$ and $S(G, \pi_{\mathsf{X}})$.
Note that by [155][176], $(0, R, \pi_{\mathsf{X}})$ is achievable if and only if $R \ge I(P_{\mathsf{U}}, Q_{\mathsf{X|U}})$ for some $P_{\mathsf{U}}$ satisfying $P_{\mathsf{U}} \to Q_{\mathsf{X|U}} \to \pi_{\mathsf{X}}$.
A useful property is that the approximation error (5.157) actually converges uniformly. A similar observation was made in the proof of [155, Lemma 6] in the context of resolvability in total variation distance.
Proposition 5.2.5 If $(G,R)$ is $\epsilon$-achievable for $(Q_{X^n|U^n})_{n=1}^{\infty}$, then there exist $(\gamma_n)_{n=1}^{\infty}$ and $(M_n)_{n=1}^{\infty}$ such that
\begin{align}
\limsup_{n\to\infty} \frac{1}{n}\log \gamma_n &\le G; \tag{5.162}\\
\limsup_{n\to\infty} \frac{1}{n}\log M_n &\le R; \tag{5.163}\\
\sup_n \sup_{Q_{U^n}} \inf_{c_{M_n}} E_{\gamma_n}(Q_{X^n}[c_{M_n}] \,\|\, Q_{X^n}) &\le \epsilon, \tag{5.164}
\end{align}
where $Q_{U^n} \to Q_{X^n|U^n} \to Q_{X^n}$.

Proof Fix arbitrary $G' > G$ and $R' > R$. Observe that the sequence
\[
\sup_{Q_{U^n}} \inf_{c_{M_n}} E_{\exp(nG')}(Q_{X^n}[c_{M_n}] \,\|\, Q_{X^n}), \tag{5.165}
\]
where $M_n := \exp(nR')$, must be upper-bounded by $\epsilon$ for large enough $n$. For if otherwise, there would be a sequence $(Q_{U^n})_{n=1}^{\infty}$ such that the infimum in (5.165) is not upper-bounded by $\epsilon$ for large enough $n$, which is a contradiction since we can find $(c_{M_n})_{n=1}^{\infty}$ and $(\gamma_n)_{n=1}^{\infty}$ in Definition 5.2.7 such that $M_n < \exp(nR')$ and $\gamma_n \le \exp(nG')$ for $n$ large enough, and apply the monotonicity of $E_\gamma$ in $\gamma$. Finally, since $(G', R')$ can be arbitrarily close to $(G,R)$ and the $\epsilon$-achievable region is a closed set, we conclude that $(G,R)$ is $\epsilon$-achievable.
One-shot Achievability Bounds
While our goal is to bound $E_\gamma$ in the resolvability problem, we consider instead the problem of bounding the complementary relative information spectrum $F$, which, according to Proposition 5.2.4, is (up to constant factors) equivalent to bounding $E_\gamma$.
First, we study a simple special case where the target distribution matches the output distribution induced by the input distribution according to which the codewords are generated.

Theorem 5.2.6 (Softer-covering Lemma) Fix $Q_{UX} = Q_U Q_{X|U}$. For an arbitrary codebook $[c_1,\dots,c_M]$, define
\[
Q_X[c_M] := \frac{1}{M}\sum_{m=1}^{M} Q_{X|U=c_m}. \tag{5.166}
\]
Then for any $\gamma, \epsilon, \sigma > 0$ satisfying $\gamma - 1 > \epsilon + \sigma$ and $\tau \in \mathbb{R}$,
\begin{align}
\mathbb{E}[F_\gamma(Q_X[U^M] \,\|\, Q_X)] &\le \mathbb{P}[\imath_{U;X}(U;X) > \log M\sigma] \nonumber\\
&\quad + \frac{1}{\epsilon}\,\mathbb{P}[\imath_{U;X}(U;X) > \log M - \tau] \nonumber\\
&\quad + \frac{\exp(-\tau)}{(\gamma - 1 - \epsilon - \sigma)^2} \tag{5.167}
\end{align}
where $U^M \sim Q_U \times \dots \times Q_U$, $(U,X) \sim Q_U Q_{X|U}$, and the information density
\[
\imath_{U;X}(u;x) := \log\frac{\mathrm{d}Q_{X|U=u}}{\mathrm{d}Q_X}(x). \tag{5.168}
\]
Remark 5.2.3 Theorem 5.2.6 implies (the general asymptotic version of) the soft-covering lemma based on total variation (see [155],[171],[159]). Indeed if $M_n = \exp(nR)$ at blocklength $n$ where $R > \bar{I}(\mathbf{U};\mathbf{X})$, we can select $\tau_n \leftarrow \frac{n}{2}(R - \bar{I}(\mathbf{U};\mathbf{X}))$. Moreover for any $\gamma > 1$ we can pick constants $\epsilon, \sigma > 0$ such that $\gamma - 1 > \epsilon + \sigma$ in the theorem, to show that
\[
\lim_{n\to\infty} \mathbb{E}[F_\gamma(Q_{X^n}[U^{M_n}] \,\|\, Q_{X^n})] = 0, \tag{5.169}
\]
which, by (5.95) and by taking $\gamma \downarrow 1$, implies that
\[
\lim_{n\to\infty} \mathbb{E}|Q_{X^n}[U^{M_n}] - Q_{X^n}| = 0. \tag{5.170}
\]
We refer to Theorem 5.2.6 as the softer-covering lemma since for a larger $\gamma$ it allows us to use a smaller codebook to cover the output distribution more softly (i.e.\ approximate the target distribution under a weaker metric).
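The covering effect described in Remark 5.2.3 can be observed numerically even in one shot. The snippet below uses a hypothetical two-letter channel (all numbers are illustrative, not from the text) and estimates $\mathbb{E}|Q_X[U^M] - Q_X|$ over random codebooks, which shrinks as $M$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
Q_U = np.array([0.5, 0.5])            # codeword-generating distribution
Q = np.array([[0.9, 0.1],
              [0.2, 0.8]])            # hypothetical channel Q_{X|U}
Q_X = Q_U @ Q                         # matched target output distribution

def mean_tv(M, trials=1000):
    """Monte Carlo estimate of E|Q_X[U^M] - Q_X| over random codebooks."""
    total = 0.0
    for _ in range(trials):
        cb = rng.choice(2, size=M, p=Q_U)       # i.i.d. codewords ~ Q_U
        total += 0.5 * np.abs(Q[cb].mean(axis=0) - Q_X).sum()
    return total / trials
```

On this example the average gap decays roughly like $1/\sqrt{M}$, consistent with the intuition that more codewords cover the output distribution more finely.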
Proof Define the ``atypical'' set
\[
\mathcal{A}_\tau := \{(u,x) \colon \imath_{U;X}(u;x) \le \log M - \tau\}. \tag{5.171}
\]
Now, let $X^M$ be such that $(U^M, X^M) \sim Q_{UX} \times \dots \times Q_{UX}$. The joint distribution of $(U^M, X)$ is specified by letting $X \sim Q_X[c_M]$ conditioned on $U^M = c_M$. We perform a change-of-measure step using the symmetry of the random codebook:
\begin{align}
\mathbb{E}[F_\gamma(Q_X[U^M] \,\|\, Q_X)] &= \mathbb{P}\left[\frac{\mathrm{d}Q_X[U^M]}{\mathrm{d}Q_X}(X) > \gamma\right] \tag{5.172}\\
&= \frac{1}{M}\sum_m \mathbb{P}\left[\frac{\mathrm{d}Q_X[U^M]}{\mathrm{d}Q_X}(X_m) > \gamma\right] \tag{5.173}\\
&= \mathbb{P}\left[\frac{\mathrm{d}Q_X[U^M]}{\mathrm{d}Q_X}(X_1) > \gamma\right] \tag{5.174}
\end{align}
where (5.173) is because of (5.166), and (5.174) is because the summands in (5.173) are equal. Note that $X_1$ is correlated with only the first codeword $U_1$.
Next, because of the relation
\[
\frac{\mathrm{d}Q_X[c_M]}{\mathrm{d}Q_X}(x) = \frac{1}{M}\sum_m \exp(\imath_{U;X}(c_m;x)), \tag{5.175}
\]
we can upper-bound (5.174) by the union bound as
\begin{align}
&\mathbb{P}[\exp(\imath_{U;X}(U_1;X_1)) > M\sigma] \nonumber\\
&\quad + \mathbb{P}\left[\frac{1}{M}\sum_{m=2}^{M} \exp(\imath_{U;X}(U_m;X_1))\, 1_{\mathcal{A}_\tau^c}(U_m,X_1) > \epsilon\right] \nonumber\\
&\quad + \mathbb{P}\left[\frac{1}{M}\sum_{m=2}^{M} \exp(\imath_{U;X}(U_m;X_1))\, 1_{\mathcal{A}_\tau}(U_m,X_1) > \gamma - \epsilon - \sigma\right] \tag{5.176}
\end{align}
where we used the fact that $1_{\mathcal{A}_\tau} + 1_{\mathcal{A}_\tau^c} = 1$. Notice that the first term of (5.176) may be regarded as the probability of ``atypical'' events and accounts for the first term in (5.167). The second term of (5.176) can be upper-bounded with Markov's inequality:
\begin{align}
\frac{1}{M\epsilon}\sum_{m=2}^{M} \mathbb{E}[\exp(\imath_{U;X}(U_m;X_1))\, 1_{\mathcal{A}_\tau^c}(U_m,X_1)]
&\le \frac{1}{M\epsilon}\sum_{m=2}^{M} \mathbb{E}[1_{\mathcal{A}_\tau^c}(U,X)] \tag{5.177}\\
&\le \frac{1}{\epsilon}\,\mathbb{P}[\imath_{U;X}(U;X) > \log M - \tau], \tag{5.178}
\end{align}
accounting for the second term in (5.167), where (5.177) is a change-of-measure step using the fact that $(U_m, X_1) \sim Q_U \times Q_X$ for $m \ge 2$.
Finally we take care of the last term in (5.176), again using the independence of $U_m$ and $X_1$ for $m \ge 2$. Observe that for any $x \in \mathcal{X}$,
\begin{align}
\mu := \mathbb{E}\left[\frac{1}{M}\sum_{m=2}^{M} \exp(\imath_{U;X}(U_m;x))\, 1_{\mathcal{A}_\tau}(U_m,x)\right]
&\le \frac{1}{M}\sum_{m=2}^{M} \mathbb{E}[\exp(\imath_{U;X}(U_m;x))] \tag{5.179--5.180}\\
&= \frac{M-1}{M} \tag{5.181}\\
&\le 1, \tag{5.182}
\end{align}
whereas
\begin{align}
\mathrm{Var}\left(\frac{1}{M}\sum_{m=2}^{M} \exp(\imath_{U;X}(U_m;x))\, 1_{\mathcal{A}_\tau}(U_m,x)\right)
&= \frac{1}{M^2}\sum_{m=2}^{M} \mathrm{Var}\left(\exp(\imath_{U;X}(U_m;x))\, 1_{\mathcal{A}_\tau}(U_m,x)\right) \tag{5.183}\\
&\le \frac{1}{M}\,\mathrm{Var}\left(\exp(\imath_{U;X}(U;x))\, 1_{\mathcal{A}_\tau}(U,x)\right) \tag{5.184}\\
&\le \frac{1}{M}\,\mathbb{E}\left[\exp(2\imath_{U;X}(U;x))\, 1_{\mathcal{A}_\tau}(U,x)\right] \tag{5.185}\\
&\le \exp(-\tau)\,\mathbb{E}\left[\exp(\imath_{U;X}(U;x))\right] \tag{5.186}\\
&= \exp(-\tau), \tag{5.187}
\end{align}
where the change-of-measure step (5.186) uses (5.171). It then follows from Chebyshev's inequality that
\begin{align}
&\mathbb{P}\left[\frac{1}{M}\sum_{m=2}^{M} \exp(\imath_{U;X}(U_m;x))\, 1_{\mathcal{A}_\tau}(U_m,x) > \gamma - \epsilon - \sigma\right] \tag{5.188}\\
&\le \mathbb{P}\left[\frac{1}{M}\sum_{m=2}^{M} \exp(\imath_{U;X}(U_m;x))\, 1_{\mathcal{A}_\tau}(U_m,x) - \mu > \gamma - \epsilon - \sigma - 1\right] \tag{5.189}\\
&\le \frac{\exp(-\tau)}{(\gamma - \epsilon - \sigma - 1)^2}. \tag{5.190}
\end{align}
For the asymptotic analysis, we are interested in the regime where $M$ and $\gamma$ are growing exponentially. In this case, the right-hand side of (5.167) can be regarded as essentially
\[
\mathbb{P}[\imath_{U;X}(U;X) > \log M\gamma] \tag{5.191}
\]
modulo nuisance parameters. This can be seen from the choice of parameters in Corollary 5.2.8. Thus the sum rate of $M$ and $\gamma$ has to exceed the sup-information rate in order for the approximation error to vanish asymptotically.

Extending Theorem 5.2.6 to the more general scenario where the target distribution may not have any relation with the input distribution, we have the following result, where we allow $\pi_X$ to be an arbitrary positive measure.

Theorem 5.2.7 (Softer-covering Lemma: Unmatched Target Distribution)
Fix $\pi_X$ and $Q_{UX} = Q_U Q_{X|U}$. For an arbitrary codebook $[c_1,\dots,c_M]$, define $Q_X[c_M]$ as in (5.166). Then for any $\gamma, \epsilon, \sigma > 0$ satisfying $\gamma > \epsilon + \sigma$, $\tau \in \mathbb{R}$ and $0 < \delta < 1$,
\begin{align}
&\mathbb{E}[F_\gamma(Q_X[U^M] \,\|\, \pi_X)] \nonumber\\
&\le \mathbb{P}[\imath_{Q_X\|\pi_X}(X) > \log(\gamma - \sigma - \epsilon) \text{ or } \imath_{U;X}(U;X) + \imath_{Q_X\|\pi_X}(X) > \log \delta M\sigma] \nonumber\\
&\quad + \frac{\gamma - \epsilon - \sigma}{\epsilon}\,\mathbb{P}[\imath_{U;X}(U;X) > \log M - \tau] \nonumber\\
&\quad + \frac{\exp(-\tau)(\gamma - \epsilon - \sigma)^2}{(1-\delta)^2\sigma^2} \tag{5.192}
\end{align}
where $U^M \sim Q_U \times \dots \times Q_U$ and $(U,X) \sim Q_U Q_{X|U}$.
Proof Sketch Similarly to the proof of Theorem 5.2.6, we first use a symmetry argument and a change-of-measure step so that the random variable of the channel output is correlated only with the first codeword, to obtain
\[
\mathbb{E}[F_\gamma(Q_X[U^M] \,\|\, \pi_X)] \le \mathbb{P}\left[\frac{\mathrm{d}Q_X[U^M]}{\mathrm{d}\pi_X}(X_1) > \gamma\right]. \tag{5.193}
\]
Then in the next union bound step we have to take care of another ``atypical'' event, that $\frac{\mathrm{d}Q_X}{\mathrm{d}\pi_X}(X_1) > \gamma_2$, where
\[
\gamma_2 := \gamma - \epsilon - \sigma. \tag{5.194}
\]
More precisely, we have
\begin{align}
\mathbb{P}\left[\frac{\mathrm{d}Q_X[U^M]}{\mathrm{d}\pi_X}(X_1) > \gamma\right]
&\le \mathbb{P}[\xi(X_1) > \gamma_2 \text{ or } \eta(U_1, X_1) > \delta M\sigma] \nonumber\\
&\quad + \mathbb{P}\left[\frac{1}{M}\sum_{m=2}^{M} \eta(U_m, X_1)\, 1_{\mathcal{A}_\tau^c}(U_m, X_1) > \epsilon,\ \xi(X_1) \le \gamma_2\right] \nonumber\\
&\quad + \mathbb{P}\left[\frac{1}{M}\sum_{m=2}^{M} \eta(U_m, X_1)\, 1_{\mathcal{A}_\tau}(U_m, X_1) > \gamma - \epsilon - \delta\sigma,\ \xi(X_1) \le \gamma_2\right] \tag{5.195}
\end{align}
where we have defined
\begin{align}
\eta(u,x) &:= \frac{\mathrm{d}Q_{X|U=u}}{\mathrm{d}\pi_X}(x); \tag{5.196}\\
\xi(x) &:= \frac{\mathrm{d}Q_X}{\mathrm{d}\pi_X}(x). \tag{5.197}
\end{align}
As before, the first term of (5.195) may be regarded as the probability of ``atypical'' events and accounts for the first term in (5.192). The second and the third terms of (5.195) can be upper-bounded by
\begin{align}
\mathbb{P}\left[\frac{1}{M}\sum_{m=2}^{M} \frac{\eta(U_m, X_1)}{\xi(X_1)}\, 1_{\mathcal{A}_\tau^c}(U_m, X_1) > \frac{\epsilon}{\gamma_2}\right]
\le \mathbb{P}\left[\frac{1}{M}\sum_{m=2}^{M} \exp(\imath_{U;X}(U_m;X_1))\, 1_{\mathcal{A}_\tau^c}(U_m, X_1) > \frac{\epsilon}{\gamma_2}\right] \tag{5.198}
\end{align}
and
\begin{align}
\mathbb{P}\left[\frac{1}{M}\sum_{m=2}^{M} \frac{\eta(U_m, X_1)}{\xi(X_1)}\, 1_{\mathcal{A}_\tau}(U_m, X_1) > \frac{\gamma - \epsilon - \delta\sigma}{\gamma_2}\right]
\le \mathbb{P}\left[\frac{1}{M}\sum_{m=2}^{M} \exp(\imath_{U;X}(U_m;X_1))\, 1_{\mathcal{A}_\tau}(U_m, X_1) > 1 + \frac{(1-\delta)\sigma}{\gamma_2}\right]. \tag{5.199}
\end{align}
The rest of the proof is similar to that of Theorem 5.2.6 and is omitted.
For the purpose of asymptotic analysis in the stationary memoryless setting, the right-hand side of (5.192) can be regarded as essentially
\[
\mathbb{P}\left[\imath_{Q_X\|\pi_X}(X) > \log\gamma\right] + \mathbb{P}\left[\imath_{U;X}(U;X) + \imath_{Q_X\|\pi_X}(X) > \log M\gamma\right] \tag{5.200}
\]
modulo nuisance parameters.
Remark 5.2.4 By setting $\tau \leftarrow +\infty$ and letting $\delta \uparrow 1$, the bound in Theorem 5.2.7 can be weakened to the following slightly simpler form:
\begin{align}
\mathbb{E}[F_\gamma(Q_X[U^M] \,\|\, \pi_X)] \le \mathbb{P}\left[\frac{\mathrm{d}Q_X}{\mathrm{d}\pi_X}(X) > \gamma - \sigma - \epsilon\right] + \mathbb{P}\left[\frac{\mathrm{d}Q_{X|U}}{\mathrm{d}\pi_X}(X|U) \ge M\sigma\right] + \frac{\gamma - \sigma - \epsilon}{\epsilon}. \tag{5.201}
\end{align}
In fact, assuming $\tau = +\infty$ we can simplify the proof of the theorem and strengthen (5.201) to
\begin{align}
\mathbb{E}[F_\gamma(Q_X[U^M] \,\|\, \pi_X)] \le \mathbb{P}\left[\frac{\mathrm{d}Q_X}{\mathrm{d}\pi_X}(X) > \gamma_2\right] + \mathbb{P}\left[\frac{\mathrm{d}Q_{X|U}}{\mathrm{d}\pi_X}(X|U) > M(\gamma - \epsilon)\right] + \frac{\gamma_2}{\epsilon} \tag{5.202}
\end{align}
for any $\gamma_2 > 0$ and $0 < \epsilon < \gamma$. As we show in Corollary 5.2.8, the weakened bounds (5.201) and (5.202) are still asymptotically tight provided that the exponent with which the threshold $\gamma$ grows is strictly positive. However, when the exponent is zero (corresponding to the total variation case), we do need $\tau$ in the bound for asymptotic tightness.
Remark 5.2.5 In the case of $Q_X = \pi_X$, we can set $\gamma_2 \leftarrow 1$ and $\epsilon \leftarrow \frac{\gamma}{2}$ in (5.202) to obtain the simplification
\[
\mathbb{E}[F_\gamma(Q_X[U^M] \,\|\, Q_X)] \le \mathbb{P}\left[\imath_{U;X}(U;X) > \log\frac{M\gamma}{2}\right] + \frac{2}{\gamma}. \tag{5.203}
\]
The general one-shot achievability bound in Theorem 5.2.7 implies the following asymptotic result, which we will later prove to be tight.

Corollary 5.2.8 Fix per-letter distributions $\pi_{\mathsf{X}}$ on $\mathcal{X}$ and $Q_{\mathsf{UX}} = Q_{\mathsf{U}} Q_{\mathsf{X|U}}$ on $\mathcal{U}\times\mathcal{X}$, and $E, R \in (0,\infty)$. For each $n$, define $\gamma_n := \exp(nE)$ and $M_n = \lfloor\exp(nR)\rfloor$; let $U^{M_n} = (U_1,\dots,U_{M_n})$ have independent coordinates each distributed according to $Q_{\mathsf{U}}^{\otimes n}$. Given any $c_{M_n} = (c_m)_{m=1}^{M_n}$, where each $c_m \in \mathcal{U}^n$, define$^{12}$
\[
Q_X[c_{M_n}] := \frac{1}{M_n}\sum_{m=1}^{M_n} Q_{\mathsf{X|U}}^{\otimes n}(\cdot|c_m). \tag{5.204}
\]
Then
\begin{align}
\lim_{n\to\infty} \mathbb{E}[E_{\gamma_n}(Q_X[U^{M_n}] \,\|\, \pi_{\mathsf{X}}^{\otimes n})] &= \lim_{n\to\infty} \mathbb{E}[F_{\gamma_n}(Q_X[U^{M_n}] \,\|\, \pi_{\mathsf{X}}^{\otimes n})] \tag{5.205}\\
&= 0 \tag{5.206}
\end{align}
provided that
\[
E > D(Q_{\mathsf{X}}\|\pi_{\mathsf{X}}) + [I(Q_{\mathsf{U}}, Q_{\mathsf{X|U}}) - R]^+. \tag{5.207}
\]

Proof Choose $E'$ such that
\[
E > E' > D(Q_{\mathsf{X}}\|\pi_{\mathsf{X}}) + [I(Q_{\mathsf{U}}, Q_{\mathsf{X|U}}) - R]^+. \tag{5.208}
\]
Set $\delta = \frac{1}{2}$, $\epsilon_n = \exp(nE) - \exp(nE')$ and $\sigma_n = \frac{1}{2}(\gamma_n - \epsilon_n) = \frac{1}{2}\exp(nE')$, and apply (5.201). Notice that
\[
\mathbb{E}\left[\log\frac{\mathrm{d}Q_{X|U}}{\mathrm{d}\pi_X}(X|U)\right] = n[I(Q_{\mathsf{U}}, Q_{\mathsf{X|U}}) + D(Q_{\mathsf{X}}\|\pi_{\mathsf{X}})] \tag{5.209}
\]
where $(X,U) \sim Q_{\mathsf{XU}}^{\otimes n}$, $Q_{X|U} := Q_{\mathsf{X|U}}^{\otimes n}$, and $\pi_X := \pi_{\mathsf{X}}^{\otimes n}$. By the law of large numbers, the first and second terms in (5.201) vanish because
\begin{align}
D(Q_{\mathsf{X}}\|\pi_{\mathsf{X}}) &< E'; \tag{5.210}\\
I(Q_{\mathsf{U}}, Q_{\mathsf{X|U}}) + D(Q_{\mathsf{X}}\|\pi_{\mathsf{X}}) &< E' + R \tag{5.211}
\end{align}
are satisfied.

$^{12}$We define $Q_{\mathsf{X|U}}^{\otimes n}$ by $Q_{\mathsf{X|U}}^{\otimes n}(\cdot|u^n) := \prod_{i=1}^{n} Q_{\mathsf{X|U}=u_i}$ for any $u^n$. Also note that in this chapter we differentiate per-letter symbols such as $\mathsf{U}$ from one-shot/block symbols such as $U$ (so that $U = U^n$ in this corollary).
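The threshold on the right-hand side of (5.207) is a simple finite-dimensional quantity. The sketch below evaluates it for a small hypothetical channel; the particular distributions are made up for illustration and are not from the text.

```python
import numpy as np

def kl(p, q):
    """Relative entropy D(p||q) in nats for finite distributions with p << q."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

Q_U = np.array([0.5, 0.5])
Q = np.array([[0.9, 0.1],
              [0.2, 0.8]])            # hypothetical per-letter Q_{X|U}
pi_X = np.array([0.7, 0.3])           # hypothetical per-letter target
Q_X = Q_U @ Q                          # induced output distribution
I = sum(Q_U[u] * kl(Q[u], Q_X) for u in range(2))   # I(Q_U, Q_{X|U})

def E_threshold(R):
    """Right-hand side of (5.207): the minimum exponent of gamma_n at rate R."""
    return kl(Q_X, pi_X) + max(I - R, 0.0)
```

For rates above the mutual information the threshold saturates at $D(Q_{\mathsf X}\|\pi_{\mathsf X})$, reflecting that the residual exponent is purely the mismatch between the induced output and the target.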
The $Q_{\mathsf{U}}$ that minimizes the right-hand side of (5.207) generally does not satisfy $Q_{\mathsf{U}} \to Q_{\mathsf{X|U}} \to \pi_{\mathsf{X}}$. This means that in the large deviation analysis, for the best approximation of a target distribution in $E_\gamma$, we generally should not generate the codewords according to a distribution that corresponds to the target through the channel. This is a remarkable distinction from approximation in total variation distance, in which case an unmatched input distribution would result in the maximal total variation distance asymptotically. However, if we stick to matching input codeword distributions, then a simple and general asymptotic achievability bound can be obtained. Recall that [155] defined the sup-information rate
\[
\bar{I}(\mathbf{U};\mathbf{X}) := \inf\left\{R \colon \lim_{n\to\infty} \mathbb{P}\left[\frac{1}{n}\imath_{U^n;X^n}(U^n;X^n) > R\right] = 0\right\} \tag{5.212}
\]
and the inf-information rate
\[
\underline{I}(\mathbf{U};\mathbf{X}) := \sup\left\{R \colon \lim_{n\to\infty} \mathbb{P}\left[\frac{1}{n}\imath_{U^n;X^n}(U^n;X^n) < R\right] = 0\right\}. \tag{5.213}
\]
Theorem 5.2.9 For any channel $\mathbf{W} = (Q_{X^n|U^n})_{n=1}^{\infty}$, sequence of inputs $\mathbf{U} = (U^n)_{n=1}^{\infty}$, and $G > 0$, we have
\[
S(G,\mathbf{X}) \le \bar{I}(\mathbf{U};\mathbf{X}) - G \tag{5.214}
\]
where $\mathbf{X}$ is the output of $\mathbf{U}$ through the channel $\mathbf{W}$. As a consequence,
\[
S(G) \le \sup_{\mathbf{U}} \bar{I}(\mathbf{U};\mathbf{X}) - G. \tag{5.215}
\]
For channels satisfying the strong converse property, the right-hand side of (5.215) can be related to the channel capacity because of the relations [155][179]
\[
\sup_{\mathbf{U}} \underline{I}(\mathbf{U};\mathbf{X}) = \sup_{\mathbf{U}} \bar{I}(\mathbf{U};\mathbf{X}) = C(\mathbf{W}). \tag{5.216}
\]
We conclude the subsection by remarking that had we used the soft-covering lemma (see Remark 5.2.3) to bound the total variation distance and, in turn, bounded the complementary relative information spectrum with the total variation distance, we would not have obtained Theorem 5.2.9. Indeed, consider $M_n = \exp(nR)$ and let $V_1,\dots,V_{M_n}$ be i.i.d.\ according to $Q_{U^n}$. Regardless of how fast $\gamma_n$ grows, we cannot conclude that
\[
\mathbb{E}[F_{\gamma_n}(Q_{X^n}[V^{M_n}] \,\|\, \pi_{X^n})] \tag{5.217}
\]
vanishes unless we show that $\mathbb{E}|Q_{X^n}[V^{M_n}] - \pi_{X^n}|$ vanishes (the general and tight bounds between $E_\gamma$ and the total variation distance are characterized in [143]). Since the total variation distance vanishes only when $R > \bar{I}(\mathbf{U};\mathbf{X})$ by the conventional resolvability theorem, the conventional resolvability theorem only gives the upper bound $\bar{I}(\mathbf{U};\mathbf{X})$, which is looser than Theorem 5.2.9 when $G > 0$.
Tail Bound of Approximation Error for Random Codebooks
For applications such as secrecy and channel synthesis, it is sometimes desirable
to prove that the approximation error vanishes not only in expectation (e.g. Theo-
rem 5.2.7), but also with high probability (see Footnote 8), in the case of a random
codebook [168][177][178]. If the probability that the approximation error exceeds an
arbitrary positive number vanishes doubly exponentially in the blocklength, then the
analyses in these applications carry through because a union bound argument can be
applied to exponentially many events. Previous proofs (e.g.\ [168]) based on carefully applying Chernoff bounds to each $Q_X[U^M](x) - Q_X(x)$ and then taking the union bound over $x$ require finiteness of the alphabets.

Here we adopt a different approach. Using concentration inequalities we can directly bound the probability that the error $E_\gamma(Q_X[U^M] \,\|\, Q_X)$ deviates from its expectation, without any restrictions on the alphabet; in fact, the bound depends only on the number of codewords. Therefore, if the rate is high enough for the approximation error to vanish in expectation (by Theorem 5.2.7), we can also conclude that the error vanishes with high probability. The crux of the matter is thus resolved by the following one-shot result:

Theorem 5.2.10 Fix $\pi_X$ and $Q_{UX} = Q_U Q_{X|U}$. For an arbitrary codebook $[c_1,\dots,c_M]$, define $Q_X[c_M]$ as in (5.166). Then, for any $r > 0$,
\[
\mathbb{P}[E_\gamma(Q_X[U^M] \,\|\, \pi_X) - \mathbb{E}[E_\gamma(Q_X[U^M] \,\|\, \pi_X)] > r] \le \exp(-2Mr^2) \tag{5.218}
\]
where the probability and the expectation are with respect to $U^M \sim Q_U \times \dots \times Q_U$.

Proof Consider $f \colon c_M \mapsto E_\gamma(Q_X[c_M] \,\|\, \pi_X)$. By the definition (5.166) and the triangle inequality (5.90), we have the following uniform bound on the discrete derivative:
\[
\sup_{c, c' \in \mathcal{U}} |f(c_1^{i-1}, c, c_{i+1}^{M}) - f(c_1^{i-1}, c', c_{i+1}^{M})| \le \frac{1}{M}, \quad \forall i,\ c_M. \tag{5.219}
\]
The result then follows by McDiarmid's inequality (see e.g.\ [48, Theorem 2.2.3]).

Remark 5.2.6 If we are interested in bounding both the upper and the lower tails, then the right side of (5.218) gains a factor of 2. Other concentration inequalities may also be applied here; the transportation method gives the same bound in this example.
Converse for Stationary Memoryless Outputs
We now establish an asymptotic converse matching Corollary 5.2.8 for the $E_\gamma$-resolvability rate for stationary memoryless outputs and discrete memoryless channels.

Theorem 5.2.11 (Resolvability for Stationary Memoryless Outputs) For a DMC $Q_{\mathsf{X|U}}$ and a nonnegative finite measure $\pi_{\mathsf{X}}$,
\[
G_\epsilon(R, \pi_{\mathsf{X}}) = \min_{Q_{\mathsf{U}}}\left\{D(Q_{\mathsf{X}}\|\pi_{\mathsf{X}}) + [I(Q_{\mathsf{U}}, Q_{\mathsf{X|U}}) - R]^+\right\} \tag{5.220}
\]
where $Q_{\mathsf{U}} \to Q_{\mathsf{X|U}} \to Q_{\mathsf{X}}$, for any $0 < \epsilon < 1$.

Remark 5.2.7 When resolvability was introduced in [155], the resolvability rate (under total variation distance) was formulated for outputs of stationary memoryless inputs, rather than all the tensor power distributions on the output alphabet, because otherwise there is no guarantee that the output process can be approximated under total variation distance even with an arbitrarily large codebook. Here we can extend the scope because all stationary memoryless distributions on the output alphabet (satisfying the mild condition of being absolutely continuous with respect to some output) can be approximated under $E_\gamma$ as long as $\gamma$ is sufficiently large.

The achievability part of Theorem 5.2.11 is already shown in Corollary 5.2.8. For the converse, we need a notion of conditional typicality specially tailored to our problem, which differs from the definitions of conditional typicality in [25] or [180] (see also [18]) and can be viewed as intermediate between those two definitions.

Definition 5.2.8 (Moderate Conditional Typicality) The $\delta$-typical set of $u^n \in \mathcal{U}^n$ with respect to the discrete memoryless channel with per-letter conditional distribution $Q_{\mathsf{X|U}}$ is defined as
\[
\mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(u^n) := \{x^n \colon \forall (a,b),\ |\hat{P}_{u^n x^n}(a,b) - Q_{\mathsf{X|U}}(b|a)\hat{P}_{u^n}(a)| \le \delta Q_{\mathsf{X|U}}(b|a)\} \tag{5.221}
\]
where $\hat{P}_{u^n x^n}$ denotes the empirical distribution of $(u^n, x^n)$.

Remark 5.2.8 Definition 5.2.8 is of broad interest because of Lemma 5.2.12 and Lemma 5.2.13 ahead, and in particular it plays an important role in obtaining the uniform bound in Lemma 5.2.12. Note that the definition in [25] corresponds to replacing the term $\delta Q_{\mathsf{X|U}}(b|a)$ in (5.221) with $\delta$, in which case we cannot bound the probability of a sequence in the typical set as in Lemma 5.2.13. The ``robust typicality'' definition of [180] (see also [18]) corresponds to replacing this term with $\delta Q_{\mathsf{X|U}}(b|a)\hat{P}_{u^n}(a)$, which does not give the uniform lower bound on the probability of the conditional typical set as in Lemma 5.2.12.
Lemma 5.2.12 For fixed $\delta > 0$ and $Q_{\mathsf{X|U}}$, there exists a sequence $(\gamma_n)$ such that $\lim_{n\to\infty}\gamma_n = 0$ and
\[
Q_{\mathsf{X|U}}^{\otimes n}(\mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(u^n)|u^n) \ge 1 - \gamma_n \tag{5.222}
\]
for all $u^n \in \mathcal{U}^n$.

Proof We show that the statement holds with
\[
\gamma_n = \frac{|\mathcal{U}||\mathcal{X}|}{n\delta^2}\left(\frac{1}{q} - 1\right), \tag{5.223}
\]
where
\[
q := \min_{(a,b)\colon Q_{\mathsf{X|U}}(b|a)\neq 0} Q_{\mathsf{X|U}}(b|a). \tag{5.224}
\]
The number of occurrences $N(a,b|u^n,X^n)$ of $(a,b) \in \mathcal{U}\times\mathcal{X}$ in $(u^n, X^n)$, where $X^n \sim \prod_{i=1}^{n} Q_{\mathsf{X|U}=u_i}$, is binomial with mean $N(a|u^n)Q_{\mathsf{X|U}}(b|a)$ and variance $N(a|u^n)Q_{\mathsf{X|U}}(b|a)(1 - Q_{\mathsf{X|U}}(b|a))$. If $Q_{\mathsf{X|U}}(b|a) = 0$ the condition that defines the set in (5.221) is automatically true. Otherwise, by Chebyshev's inequality we have for each $(a,b)$,
\begin{align}
\mathbb{P}[|N(a,b|u^n,X^n) - N(a|u^n)Q_{\mathsf{X|U}}(b|a)| > n\delta Q_{\mathsf{X|U}}(b|a)]
&\le \frac{N(a|u^n)Q_{\mathsf{X|U}}(b|a)(1 - Q_{\mathsf{X|U}}(b|a))}{n^2\delta^2 Q^2_{\mathsf{X|U}}(b|a)} \tag{5.225}\\
&\le \frac{1}{n\delta^2}\left(\frac{1}{q} - 1\right), \tag{5.226}
\end{align}
and the claim follows by taking the union bound over the $|\mathcal{U}||\mathcal{X}|$ pairs $(a,b)$.
Lemma 5.2.13 For each $u^n$, and $x^n \in \mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(u^n)$, we have the bound
\[
Q_{\mathsf{X|U}}^{\otimes n}(x^n|u^n) \ge \exp(-n[H(Q_{\mathsf{X|U}}|\hat{P}_{u^n}) + \delta|\mathcal{U}|\log|\mathcal{X}|]). \tag{5.227}
\]

Proof Since
\[
Q_{\mathsf{X|U}}^{\otimes n}(x^n|u^n) = \prod_{a\in\mathcal{U},\,b\in\mathcal{X}} Q_{\mathsf{X|U}}(b|a)^{N(a,b|u^n,x^n)}, \tag{5.228}
\]
we have
\begin{align}
\frac{1}{n}\log\frac{1}{Q_{\mathsf{X|U}}^{\otimes n}(x^n|u^n)}
&= \sum_{a,b}\hat{P}_{u^n x^n}(a,b)\log\frac{1}{Q_{\mathsf{X|U}}(b|a)} \tag{5.229}\\
&\le \sum_{a,b}\left[Q_{\mathsf{X|U}}(b|a)\hat{P}_{u^n}(a) + \delta Q_{\mathsf{X|U}}(b|a)\right]\log\frac{1}{Q_{\mathsf{X|U}}(b|a)} \tag{5.230}\\
&= H(Q_{\mathsf{X|U}}|\hat{P}_{u^n}) + \delta\sum_a H(Q_{\mathsf{X|U}=a}) \tag{5.231}\\
&\le H(Q_{\mathsf{X|U}}|\hat{P}_{u^n}) + \delta|\mathcal{U}|\log|\mathcal{X}|. \tag{5.232}
\end{align}
Lemma 5.2.14 For any type $P_{\mathsf{X}}$ and sequence $u^n$,
\[
|\mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(u^n) \cap \mathcal{T}_{P_{\mathsf{X}}}| \le \exp(n[H_{[P_{\mathsf{X}}]\delta} + \delta|\mathcal{U}|\log|\mathcal{X}|]) \tag{5.233}
\]
where we have defined
\[
H_{[P_{\mathsf{X}}]\delta} := \max_{Q_{\mathsf{U}}\colon |Q_{\mathsf{X}} - P_{\mathsf{X}}| \le \delta|\mathcal{U}|} H(Q_{\mathsf{X|U}}|Q_{\mathsf{U}}) \tag{5.234}
\]
where $Q_{\mathsf{U}} \to Q_{\mathsf{X|U}} \to Q_{\mathsf{X}}$, and the maximum in (5.234) is understood as $-\infty$ if the set $\{Q_{\mathsf{U}} \colon |Q_{\mathsf{X}} - P_{\mathsf{X}}| \le \delta|\mathcal{U}|\}$ is empty.

Proof For any $u^n$, we have the upper bound
\begin{align}
|\mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(u^n) \cap \mathcal{T}_{P_{\mathsf{X}}}|
&\le |\mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(u^n)| \tag{5.235}\\
&\le \left(\min_{x^n \in \mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(u^n)} Q_{\mathsf{X|U}}^{\otimes n}(x^n|u^n)\right)^{-1} \tag{5.236}\\
&\le \exp(n[H(Q_{\mathsf{X|U}}|\hat{P}_{u^n}) + \delta|\mathcal{U}|\log|\mathcal{X}|]) \tag{5.237}
\end{align}
where we used Lemma 5.2.13 in (5.237). Moreover, if $u^n$ satisfies $|P_{\mathsf{X}} - Q_{\mathsf{X|U}}\circ\hat{P}_{u^n}| > \delta|\mathcal{U}|$, where $Q_{\mathsf{X|U}}\circ\hat{P}_{u^n} := \int Q_{\mathsf{X|U}=a}\,\mathrm{d}\hat{P}_{u^n}(a)$, then $\mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(u^n) \cap \mathcal{T}_{P_{\mathsf{X}}}$ is empty, because
\begin{align}
|\hat{P}_{x^n} - Q_{\mathsf{X|U}}\circ\hat{P}_{u^n}|
&\le \sum_{a,b}|\hat{P}_{u^n x^n}(a,b) - Q_{\mathsf{X|U}}(b|a)\hat{P}_{u^n}(a)| \tag{5.238}\\
&\le \sum_{a,b}\delta Q_{\mathsf{X|U}}(b|a) \tag{5.239}\\
&= \delta|\mathcal{U}| \tag{5.240}
\end{align}
implies that no $x^n$ in $\mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(u^n)$ has the type $P_{\mathsf{X}}$. Therefore the desired result follows by taking the maximum of (5.237) over types $Q_{\mathsf{U}} = \hat{P}_{u^n}$ satisfying $|Q_{\mathsf{X}} - P_{\mathsf{X}}| \le \delta|\mathcal{U}|$.
Proof [Proof of Converse of Theorem 5.2.11] Fix a codebook $(c_1,\dots,c_M)$ and a type $P_{\mathsf{X}}$. Define
\[
A_n := \bigcup_{m=1}^{M} \mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(c_m). \tag{5.241}
\]
Then
\begin{align}
\pi_{\mathsf{X}}^{\otimes n}(A_n \cap \mathcal{T}_{P_{\mathsf{X}}})
&= \sum_{x^n \in A_n \cap \mathcal{T}_{P_{\mathsf{X}}}} \pi_{\mathsf{X}}^{\otimes n}(x^n) \tag{5.242}\\
&= \exp(-n[H(P_{\mathsf{X}}) + D(P_{\mathsf{X}}\|\pi_{\mathsf{X}})]) \cdot |A_n \cap \mathcal{T}_{P_{\mathsf{X}}}| \tag{5.243}\\
&\le \exp(-n[H(P_{\mathsf{X}}) + D(P_{\mathsf{X}}\|\pi_{\mathsf{X}})]) \cdot \sum_{m=1}^{M} |\mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(c_m) \cap \mathcal{T}_{P_{\mathsf{X}}}| \tag{5.244}\\
&\le \exp(-n[H(P_{\mathsf{X}}) + D(P_{\mathsf{X}}\|\pi_{\mathsf{X}})]) \cdot M\exp(n[H_{[P_{\mathsf{X}}]\delta} + \delta|\mathcal{U}|\log|\mathcal{X}|]) \tag{5.245}\\
&= \exp(-n[D(P_{\mathsf{X}}\|\pi_{\mathsf{X}}) + H(P_{\mathsf{X}}) - H_{[P_{\mathsf{X}}]\delta} - R - \delta|\mathcal{U}|\log|\mathcal{X}|]), \tag{5.246}
\end{align}
where (5.245) is from Lemma 5.2.14. Whence (5.246) and the trivial bound
\begin{align}
\pi_{\mathsf{X}}^{\otimes n}(A_n \cap \mathcal{T}_{P_{\mathsf{X}}}) &\le \pi_{\mathsf{X}}^{\otimes n}(\mathcal{T}_{P_{\mathsf{X}}}) \tag{5.247}\\
&\le \exp(-nD(P_{\mathsf{X}}\|\pi_{\mathsf{X}})) \tag{5.248}\\
&\le \exp(-n[D(P_{\mathsf{X}}\|\pi_{\mathsf{X}}) - \delta|\mathcal{U}|\log|\mathcal{X}|]) \tag{5.249}
\end{align}
yield the bound
\[
\pi_{\mathsf{X}}^{\otimes n}(A_n \cap \mathcal{T}_{P_{\mathsf{X}}}) \le \exp(-n[f(\delta, P_{\mathsf{X}}) - \delta|\mathcal{U}|\log|\mathcal{X}|]), \tag{5.250}
\]
where we have defined the function
\[
f(\delta, P_{\mathsf{X}}) := D(P_{\mathsf{X}}\|\pi_{\mathsf{X}}) + [H(P_{\mathsf{X}}) - H_{[P_{\mathsf{X}}]\delta} - R]^+ \tag{5.251}
\]
for $\delta > 0$ and $P_{\mathsf{X}} \ll \pi_{\mathsf{X}}$. Define$^{13}$
\[
g(\delta) := \min_{P_{\mathsf{X}}} f(\delta, P_{\mathsf{X}}). \tag{5.252}
\]
Then
\begin{align}
\pi_{\mathsf{X}}^{\otimes n}(A_n) &= \sum_{P_{\mathsf{X}}} \pi_{\mathsf{X}}^{\otimes n}(A_n \cap \mathcal{T}_{P_{\mathsf{X}}}) \tag{5.253}\\
&\le \sum_{P_{\mathsf{X}}} \exp(-n[g(\delta) - \delta|\mathcal{U}|\log|\mathcal{X}|]) \tag{5.254}\\
&\le (n+1)^{|\mathcal{X}|} \exp(-n[g(\delta) - \delta|\mathcal{U}|\log|\mathcal{X}|]), \tag{5.255}
\end{align}
where the summation is over all types $P_{\mathsf{X}}$ absolutely continuous with respect to $\pi_{\mathsf{X}}$, and (5.254) is from (5.250). Then for any real number $G < g(\delta) - \delta|\mathcal{U}|\log|\mathcal{X}|$ we
$^{13}$The reason why we can write minimum in (5.252) is explained in Remark 5.2.9.
have
\begin{align}
E_{\exp(nG)}(Q_{X^n}[c_M] \,\|\, \pi_{\mathsf{X}}^{\otimes n})
&\ge Q_{X^n}[c_M](A_n) - \exp(nG)\,\pi_{\mathsf{X}}^{\otimes n}(A_n) \tag{5.256}\\
&\ge \frac{1}{M}\sum_{m=1}^{M} Q_{\mathsf{X|U}}^{\otimes n}(\mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(c_m)|c_m) - \exp(nG)\,\pi_{\mathsf{X}}^{\otimes n}(A_n) \tag{5.257}\\
&\ge 1 - \gamma_n - \exp(nG)\,\pi_{\mathsf{X}}^{\otimes n}(A_n) \tag{5.258}\\
&\to 1, \quad n\to\infty, \tag{5.259}
\end{align}
where
• (5.257) uses $\mathcal{T}^n_{[Q_{\mathsf{X|U}}]\delta}(c_m) \subseteq A_n$, and we used the notation of the tensor power for the conditional law $Q_{\mathsf{X|U}}^{\otimes n}(\cdot|u^n) := \prod_{i=1}^{n} Q_{\mathsf{X|U}=u_i}$;
• (5.258) is from Lemma 5.2.12;
• (5.259) uses (5.255).
Since $\delta > 0$ was arbitrary, we thus conclude
\begin{align}
G_\epsilon(R, \pi_{\mathsf{X}}) &\ge \sup_{\delta>0}\{g(\delta) - \delta|\mathcal{U}|\log|\mathcal{X}|\} \tag{5.260}\\
&\ge \liminf_{\delta\to 0} g(\delta) \tag{5.261}\\
&\ge g(0), \tag{5.262}
\end{align}
where (5.262) is from Lemma 5.2.15. Since $g(0)$ is the right side of (5.220), the converse bound is established.
Lemma 5.2.15 The functions $f$ and $g$ defined in (5.251) and (5.252) are both lower semicontinuous.

Remark 5.2.9 We can write minimum instead of infimum in (5.252), and hence in (5.220), because of the lower semicontinuity of $f$.
Proof Consider the lower semicontinuous function $\chi$ where $\chi(\delta, P_{\mathsf{X}}, Q_{\mathsf{U}})$ equals
\[
D(P_{\mathsf{X}}\|\pi_{\mathsf{X}}) + [H(P_{\mathsf{X}}) - H(Q_{\mathsf{X|U}}|Q_{\mathsf{U}}) - R]^+ \tag{5.263}
\]
if $|Q_{\mathsf{X}} - P_{\mathsf{X}}| \le \delta|\mathcal{U}|$ and $+\infty$ otherwise. Then $f(\delta, P_{\mathsf{X}}) = \min_{Q_{\mathsf{U}}} \chi(\delta, P_{\mathsf{X}}, Q_{\mathsf{U}})$ is lower semicontinuous, as it is the pointwise infimum of a lower semicontinuous function over a compact set (see for example the proof in [181, Lemma 9]). The lower semicontinuity of $g$ follows for the same reason.
The function $G(R, \pi_{\mathsf{X}})$ in (5.220) satisfies some nice properties.

Proposition 5.2.16 Denote $G(R, \pi_{\mathsf{X}})$ as $G(R, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}})$. Let $\mathcal{X}$ and $\mathcal{U}$ be arbitrary alphabets (measurable spaces).

1. The function being minimized in (5.220), denoted as $F(Q_{\mathsf{U}}, R, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}})$, is convex in $Q_{\mathsf{U}}$.

2. Additivity: for any $R > 0$, $\pi_{\mathsf{X}_i}$ and $Q_{\mathsf{X}_i|\mathsf{U}_i}$ ($i = 1,2$),
\begin{align}
G(R, \pi_{\mathsf{X}_1}\pi_{\mathsf{X}_2}, Q_{\mathsf{X}_1|\mathsf{U}_1}Q_{\mathsf{X}_2|\mathsf{U}_2})
= \min_{R_1, R_2\colon R_1 + R_2 \le R}\{G(R_1, \pi_{\mathsf{X}_1}, Q_{\mathsf{X}_1|\mathsf{U}_1}) + G(R_2, \pi_{\mathsf{X}_2}, Q_{\mathsf{X}_2|\mathsf{U}_2})\}, \tag{5.264}
\end{align}
where we have abbreviated $\pi_{\mathsf{X}_1}\times\pi_{\mathsf{X}_2}$ and $Q_{\mathsf{X}_1|\mathsf{U}_1}\times Q_{\mathsf{X}_2|\mathsf{U}_2}$ as $\pi_{\mathsf{X}_1}\pi_{\mathsf{X}_2}$ and $Q_{\mathsf{X}_1|\mathsf{U}_1}Q_{\mathsf{X}_2|\mathsf{U}_2}$.

3. $G(R, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}})$ is continuous in $R$.

4. $G(R, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}})$ is convex in $R$.
Proof 1. The function of interest is the maximum of the two functions $D(Q_{\mathsf{X}}\|\pi_{\mathsf{X}})$ and
\[
D(Q_{\mathsf{X}}\|\pi_{\mathsf{X}}) + I(Q_{\mathsf{U}}, Q_{\mathsf{X|U}}) - R = \mathbb{E}\left[\imath_{Q_{\mathsf{X|U}}\|\pi_{\mathsf{X}}}(X|U)\right] - R \tag{5.265}
\]
where $(U,X) \sim Q_{\mathsf{UX}}$, $Q_{\mathsf{U}} \to Q_{\mathsf{X|U}} \to Q_{\mathsf{X}}$, and the conditional relative information
\[
\imath_{Q_{\mathsf{X|U}}\|\pi_{\mathsf{X}}}(x|u) := \log\frac{\mathrm{d}Q_{\mathsf{X|U}=u}}{\mathrm{d}\pi_{\mathsf{X}}}(x), \quad \forall u, x. \tag{5.266}
\]
The former is convex and the latter is linear in $Q_{\mathsf{U}}$, so their maximum is convex.

2. The $\le$ direction is immediate from the single-letter formula (5.220) and the inequality
\[
[a]^+ + [b]^+ \ge [a+b]^+ \tag{5.267}
\]
for any $a, b \in \mathbb{R}$. For the $\ge$ direction, suppose $Q_{\mathsf{U}_1\mathsf{U}_2}$ achieves the minimum in the single-letter formula of $G(R, \pi_{\mathsf{X}_1}\pi_{\mathsf{X}_2}, Q_{\mathsf{X}_1|\mathsf{U}_1}Q_{\mathsf{X}_2|\mathsf{U}_2})$. Observe that
\begin{align}
&F(Q_{\mathsf{U}_1\mathsf{U}_2}, R, \pi_{\mathsf{X}_1}\pi_{\mathsf{X}_2}, Q_{\mathsf{X}_1|\mathsf{U}_1}Q_{\mathsf{X}_2|\mathsf{U}_2}) - F(Q_{\mathsf{U}_1}Q_{\mathsf{U}_2}, R, \pi_{\mathsf{X}_1}\pi_{\mathsf{X}_2}, Q_{\mathsf{X}_1|\mathsf{U}_1}Q_{\mathsf{X}_2|\mathsf{U}_2}) \nonumber\\
&= I(\mathsf{X}_1;\mathsf{X}_2) + [I(Q_{\mathsf{U}_1\mathsf{U}_2}, Q_{\mathsf{X}_1|\mathsf{U}_1}Q_{\mathsf{X}_2|\mathsf{U}_2}) - R]^+ - \left[\sum_{i=1}^{2} I(Q_{\mathsf{U}_i}, Q_{\mathsf{X}_i|\mathsf{U}_i}) - R\right]^+ \tag{5.268}\\
&\ge [I(\mathsf{X}_1;\mathsf{X}_2) + I(Q_{\mathsf{U}_1\mathsf{U}_2}, Q_{\mathsf{X}_1|\mathsf{U}_1}Q_{\mathsf{X}_2|\mathsf{U}_2}) - R]^+ - \left[\sum_{i=1}^{2} I(Q_{\mathsf{U}_i}, Q_{\mathsf{X}_i|\mathsf{U}_i}) - R\right]^+ \tag{5.269}\\
&\ge 0, \tag{5.270}
\end{align}
where (5.268) uses $D(Q_{\mathsf{X}_1\mathsf{X}_2}\|\pi_{\mathsf{X}_1}\pi_{\mathsf{X}_2}) - D(Q_{\mathsf{X}_1}\|\pi_{\mathsf{X}_1}) - D(Q_{\mathsf{X}_2}\|\pi_{\mathsf{X}_2}) = I(\mathsf{X}_1;\mathsf{X}_2)$, and (5.269) follows from (5.267). Therefore
\[
G(R, \pi_{\mathsf{X}_1}\pi_{\mathsf{X}_2}, Q_{\mathsf{X}_1|\mathsf{U}_1}Q_{\mathsf{X}_2|\mathsf{U}_2}) = F(Q_{\mathsf{U}_1}Q_{\mathsf{U}_2}, R, \pi_{\mathsf{X}_1}\pi_{\mathsf{X}_2}, Q_{\mathsf{X}_1|\mathsf{U}_1}Q_{\mathsf{X}_2|\mathsf{U}_2}). \tag{5.271}
\]
But clearly there exist $R_1$ and $R_2$ summing to $R$ such that
\begin{align}
\text{R.H.S. of (5.271)} &= F(Q_{\mathsf{U}_1}, R_1, \pi_{\mathsf{X}_1}, Q_{\mathsf{X}_1|\mathsf{U}_1}) + F(Q_{\mathsf{U}_2}, R_2, \pi_{\mathsf{X}_2}, Q_{\mathsf{X}_2|\mathsf{U}_2}) \tag{5.272}\\
&\ge G(R_1, \pi_{\mathsf{X}_1}, Q_{\mathsf{X}_1|\mathsf{U}_1}) + G(R_2, \pi_{\mathsf{X}_2}, Q_{\mathsf{X}_2|\mathsf{U}_2}) \tag{5.273}
\end{align}
and the result follows.

3. Fix any two numbers $0 \le R' < R$. Choose $Q_{\mathsf{U}}$ such that
\[
G(R, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}}) = F(Q_{\mathsf{U}}, R, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}}). \tag{5.274}
\]
Then
\begin{align}
0 &\le G(R', \pi_{\mathsf{X}}, Q_{\mathsf{X|U}}) - G(R, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}}) \tag{5.275}\\
&\le F(Q_{\mathsf{U}}, R', \pi_{\mathsf{X}}, Q_{\mathsf{X|U}}) - F(Q_{\mathsf{U}}, R, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}}) \tag{5.276}\\
&= [I(Q_{\mathsf{U}}, Q_{\mathsf{X|U}}) - R']^+ - [I(Q_{\mathsf{U}}, Q_{\mathsf{X|U}}) - R]^+ \tag{5.277}\\
&\le R - R', \tag{5.278}
\end{align}
where (5.275) follows because $G(\cdot, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}})$ is non-increasing, and (5.278) uses (5.267) again. Thus $G(R, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}})$ is in fact 1-Lipschitz continuous in $R$.
4. Fix $R_0, R_1 \ge 0$, $\alpha \in [0,1]$, and let $Q_{\mathsf{U}_i}$ minimize $F(\cdot, R_i, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}})$ for $i = 0, 1$. Define
\begin{align}
R_\alpha &:= (1-\alpha)R_0 + \alpha R_1; \tag{5.279}\\
Q_{\mathsf{U}_\alpha} &:= (1-\alpha)Q_{\mathsf{U}_0} + \alpha Q_{\mathsf{U}_1}. \tag{5.280}
\end{align}
In both the $I(Q_{\mathsf{U}_\alpha}, Q_{\mathsf{X|U}}) > R_\alpha$ and the $I(Q_{\mathsf{U}_\alpha}, Q_{\mathsf{X|U}}) \le R_\alpha$ cases one can explicitly calculate that
\[
F(Q_{\mathsf{U}_\alpha}, R_\alpha, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}}) \le (1-\alpha)F(Q_{\mathsf{U}_0}, R_0, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}}) + \alpha F(Q_{\mathsf{U}_1}, R_1, \pi_{\mathsf{X}}, Q_{\mathsf{X|U}}) \tag{5.281}
\]
and the convexity follows.
5.2.4 Application to Lossy Source Coding
The simplest application of the new resolvability result (in particular, the softer-covering lemma) is to derive a one-shot achievability bound for lossy source coding, which is most fitting in the regime of low rate and exponentially decreasing success probability. The method is applicable to general sources. In the special case of i.i.d.\ sources, it recovers the ``success exponent'' in lossy source coding originally derived by the method of types [25] for discrete memoryless sources. The achievability bound in [160] can be viewed as the $\gamma = 1$ special case, which is capable of recovering the rate-distortion function, but cannot recover the exact rate-distortion-exponent tradeoff.

Theorem 5.2.17 Consider a source with distribution $\pi_X$ and a distortion function $d(\cdot,\cdot)$ on $\mathcal{U}\times\mathcal{X}$. For any joint distribution $Q_U Q_{X|U}$, $\gamma \ge 1$, $d > 0$ and integer $M$, there exists a random transformation $\pi_{U|X}$ (stochastic encoder) whose output takes at most $M$ values, such that
\[
\mathbb{P}[d(\bar{U},\bar{X}) \le d] \ge \frac{1}{\gamma}\left(\mathbb{P}[d(U,X) \le d] - \mathbb{E}[E_\gamma(Q_X[U^M] \,\|\, \pi_X)]\right) \tag{5.282}
\]
where $(\bar{U},\bar{X}) \sim \pi_{U|X}\pi_X$, $(U,X) \sim Q_{UX}$, and $U^M \sim Q_U^{\otimes M}$.

Proof Given a codebook $(c_1,\dots,c_M) \in \mathcal{U}^M$, let $P_U$ be the equiprobable distribution on $(c_1,\dots,c_M)$ and set
\[
P_{UX} := Q_{X|U}P_U. \tag{5.283}
\]
The likelihood encoder is then defined as the random transformation
\[
\pi_{U|X} := P_{U|X}, \tag{5.284}
\]
so that the joint distribution of the codeword selected and the source realization is
\[
\pi_{UX} = \pi_X P_{U|X}. \tag{5.285}
\]
From Proposition 5.2.2-2) and Proposition 5.2.2-4) we obtain
\begin{align}
\gamma\,\pi_{UX}(d(\cdot,\cdot) \le d) &\ge P_{UX}(d(\cdot,\cdot) \le d) - E_\gamma(P_{XU} \,\|\, \pi_{XU}) \tag{5.286}\\
&= P_{UX}(d(\cdot,\cdot) \le d) - E_\gamma(P_X \,\|\, \pi_X), \tag{5.287}
\end{align}
where $(\bar{U},\bar{X}) \sim \pi_{UX}$. Note that $P_{UX}$ and $\pi_{U|X}$ depend on the codebook $c_M$. Now consider a random codebook $c_M \leftarrow U^M$. Taking the expectation on both sides of (5.287) with respect to $U^M$, we have
\begin{align}
\gamma\,\mathbb{E}[\pi_{UX}(d(\cdot,\cdot) \le d)]
&\ge \mathbb{E}[P_{UX}(d(\cdot,\cdot) \le d)] - \mathbb{E}[E_\gamma(Q_X[U^M] \,\|\, \pi_X)] \tag{5.288}\\
&= \mathbb{P}[d(U,X) \le d] - \mathbb{E}[E_\gamma(Q_X[U^M] \,\|\, \pi_X)], \tag{5.289}
\end{align}
where in (5.289) we used the fact that $\mathbb{E}[P_{UX}] = Q_{UX}$. Finally we can choose one codebook (corresponding to one $\pi_{U|X}$) such that $\pi_{UX}(d(\cdot,\cdot) \le d)$ is at least its expectation.
Remark 5.2.10 In the i.i.d. setting, let $R(\pi_X,d)$ be the rate-distortion function when the source has per-letter distribution $\pi_X$. The distortion function for the block is derived from the per-letter distortion by
\[
d_n(u^n,x^n):=\frac{1}{n}\sum_{i=1}^{n}d(u_i,x_i). \tag{5.290}
\]
Let $(X^n,U^n)$ be the source-reconstruction pair distributed according to $\pi_{X^nU^n}$. If $0\le R<R(\pi_X,d)$, the maximal probability that the distortion does not exceed $d$ converges to zero with the exponent
\[
\lim_{n\to\infty}\frac{1}{n}\log\frac{1}{\mathbb{P}[d_n(U^n,X^n)\le d]}=G(R,d) \tag{5.291}
\]
where
\[
G(R,d):=\min_{Q}\left[D(Q\,\|\,\pi_X)+[R(Q,d)-R]^{+}\right]. \tag{5.292}
\]
A weaker achievability result than (5.292) was proved in [182, p. 168], whereas the final form (5.292) is given in [25, p. 158, Ex. 6] based on the method of types. Here we can easily prove the achievability part of (5.292) using Theorem 5.2.17 and Corollary 5.2.8 by setting $Q_X$ to be the minimizer of (5.292) and $Q_{U|X}$ to be such that
\begin{align}
\mathbb{E}[d(U,X)]&\le d, \tag{5.293}\\
I(Q_U,Q_{X|U})&\le R. \tag{5.294}
\end{align}
Then $\gamma_n=\exp(nE)$ with
\[
E>D(Q_X\,\|\,\pi_X)+[I(Q_U,Q_{X|U})-R]^{+} \tag{5.295}
\]
ensures that
\[
\mathbb{P}[d_n(U^n,X^n)\le d]\ge\frac{1}{2}\exp(-nE) \tag{5.296}
\]
for $n$ large enough, by the law of large numbers.
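For a concrete feel for the exponent (5.292), it can be evaluated numerically in the binary case, where $R(Q,d)=h(q)-h(d)$ for a Bern($q$) source under Hamming distortion. The following sketch (illustrative names; a grid search, not an exact optimization) computes $G(R,d)$ for a Bern($p$) source:

```python
import numpy as np

def h(p):
    # binary entropy in nats
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def dkl(q, p):
    # binary relative entropy D(Bern(q) || Bern(p)) in nats
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def rd_binary(q, d):
    # rate-distortion function of a Bern(q) source, Hamming distortion
    if d >= min(q, 1 - q):
        return 0.0
    return h(q) - h(d)

def G(R, d, p=0.5):
    # grid-search evaluation of (5.292): min_Q [D(Q||pi_X) + [R(Q,d)-R]^+]
    qs = np.linspace(1e-3, 1 - 1e-3, 2000)
    return min(dkl(q, p) + max(rd_binary(q, d) - R, 0.0) for q in qs)
```

As a sanity check, $G(R,d)=0$ whenever $R\ge R(\pi_X,d)$ (the minimizer is $Q=\pi_X$), and $G(R,d)>0$ below the rate-distortion function.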
Remark 5.2.11 Since the $E_\gamma$ metric reduces to the total variation distance when $\gamma=1$,
Theorem 5.2.17 generalizes the likelihood source encoder based on the standard soft-
covering/resolvability lemma [160]. In [160], the error exponent for the likelihood
source encoder at rates above the rate-distortion function is analyzed using the expo-
nential decay of total variation distance in the approximation of output statistics, and
the exponent does not match the optimal exponent found in [25]. It is also possible
to upper-bound the success exponent of the total variation distance-based likelihood
encoder at rates below the rate-distortion function by analyzing the exponential con-
vergence to 2 of total variation distance in the approximation of output statistics;
however that does not yield the optimal exponent (5.292) either. This application
illustrates one of the nice features of the Eγ-resolvability method: it converts a large
deviation analysis into a law of large numbers analysis, that is, we only care about
whether Eγ converges to 0, but not the speed, even when dealing with error exponent
problems.
5.2.5 Application to Mutual Covering Lemma
Another application of the softer-covering lemma is a one-shot generalization of the mutual covering lemma in network information theory [156]. The asymptotic mutual covering lemma states that, for a fixed (per-letter) joint distribution $P_{UV}$, if enough $U^n$-sequences and $V^n$-sequences are independently generated according to $P_{U^n}$ and $P_{V^n}$ respectively, then with high probability we can find one pair jointly typical with respect to $P_{UV}$. In the one-shot version, the "typical set" is replaced with an arbitrary high-probability (under the given joint distribution) set. The original proof of the asymptotic mutual covering lemma [156][18] used the second-order moment method.
The one-shot mutual covering lemma can be used to prove a one-shot version of
Marton’s inner bound for the broadcast channel with a common message14 without
time-sharing, improving the proof in [161] based on the basic covering lemma where
time-sharing is necessary. More discussions about the background and the derivation
of the one-shot Marton’s inner bound can be found in our conference paper [186].
For general discussions on single-shot covering lemmas, see [161][187]. To avoid using
time-sharing in the single-shot setting, [188] pursued a different approach to derive a
single-shot Marton’s inner bound. Moreover, a version of one-shot mutual covering
lemma can be distilled from their approach [189]. We compare their approach and
ours at the end of the section.
In our conference paper [186] we applied the one-shot mutual covering lemma to
derive a one-shot version of Marton’s inner bound for the broadcast channel with a
common message, without using time-sharing/common randomness. In our confer-
ence paper [190] we use a concentration inequality of Talagrand to prove that the
14More precisely, we are referring to the three auxiliary random variables version due to Liang and Kramer [183, Theorem 5] (see also [18, Theorem 8.4]), which is equivalent to an inner bound obtained by Gelfand and Pinsker [184] upon optimization (see [185] or [18, Remark 8.6]).
covering error probability is doubly exponentially vanishing in the blocklength. Since
those results are not directly related to Eγ, we exclude them from this thesis.
Now, we proceed to provide a simple derivation of a mutual covering lemma using the softer-covering lemma.
Lemma 5.2.18 Fix $P_{UV}$ and let
\[
P_{U^MV^L}:=\underbrace{P_U\times\cdots\times P_U}_{M}\times\underbrace{P_V\times\cdots\times P_V}_{L}. \tag{5.297}
\]
Then
\[
\mathbb{P}\left[\bigcap_{m=1,l=1}^{M,L}\{(U_m,V_l)\notin\mathcal{F}\}\right]
\le \mathbb{P}[(U,V)\notin\mathcal{F}]+\mathbb{P}[\imath_{U;V}(U;V)\ge\log ML-\tau]
+\frac{\exp(\tau)}{\max\{M,L\}}+e^{-\frac{1}{2}\exp(\tau)} \tag{5.298}
\]
for all $\tau>0$ and every event $\mathcal{F}$.
Proof Assume without loss of generality that $L\ge M$. For any $u\in\mathcal{U}$, define
\[
\mathcal{F}_u:=\{v\colon (u,v)\in\mathcal{F}\}, \tag{5.299}
\]
and for any $u^M\in\mathcal{U}^M$, define
\[
\mathcal{A}_{u^M}:=\bigcup_{m=1}^{M}\mathcal{F}_{u_m}. \tag{5.300}
\]
Now fix a $U$-codebook $c^M$ and observe that
\begin{align}
\gamma P_V(\mathcal{A}_{c^M})&\ge P_V[c^M](\mathcal{A}_{c^M})-E_\gamma(P_V[c^M]\,\|\,P_V) \tag{5.301}\\
&\ge\frac{1}{M}\sum_{m=1}^{M}P_{V|U=c_m}(\mathcal{F}_{c_m})-E_\gamma(P_V[c^M]\,\|\,P_V) \tag{5.302}
\end{align}
where we recall that
\[
P_V[c^M]:=\frac{1}{M}\sum_{m=1}^{M}P_{V|U=c_m}. \tag{5.303}
\]
Here (5.301) is from the definition of $E_\gamma$ and (5.302) is because $\mathcal{F}_{c_m}\subseteq\bigcup_{m=1}^{M}\mathcal{F}_{c_m}=\mathcal{A}_{c^M}$.
Denote by $\Gamma(c_1,\dots,c_M)$ the right side of (5.302), which is trivially upper-bounded by 1. Next, we show that
\[
\mathbb{P}\left[\bigcap_{m=1,l=1}^{M,L}\{(U_m,V_l)\notin\mathcal{F}\}\,\middle|\,U^M=c^M\right]\le 1-\Gamma(c^M)+e^{-L/\gamma} \tag{5.304}
\]
which is trivial when $\Gamma(c^M)<0$. In the case of $\Gamma(c^M)\in[0,1]$,
\begin{align}
\mathbb{P}\left[\bigcap_{m=1,l=1}^{M,L}\{(U_m,V_l)\notin\mathcal{F}\}\,\middle|\,U^M=c^M\right]
&=\left[1-P_V(\mathcal{A}_{c^M})\right]^{L} \tag{5.305}\\
&\le\left[1-\frac{\Gamma(c^M)\,L/\gamma}{L}\right]^{L} \tag{5.306}\\
&\le 1-\Gamma(c^M)+e^{-L/\gamma} \tag{5.307}
\end{align}
where (5.305) is from the definition of $\mathcal{A}_{c^M}$, and (5.306) is from (5.302). The last step (5.307) uses the basic inequality
\[
\left(1-\frac{p\alpha}{M}\right)^{M}\le 1-p+e^{-\alpha} \tag{5.308}
\]
for $M,\alpha>0$ and $0\le p\le 1$, which has been useful in the proofs of the basic covering lemma (see [3][161][43]). Integrating both sides of (5.304) over $c^M$ with respect to $P_U\times\cdots\times P_U$,
\[
\mathbb{P}\left[\bigcap_{m=1,l=1}^{M,L}\{(U_m,V_l)\notin\mathcal{F}\}\right]
\le P_{UV}(\mathcal{F}^{c})+\mathbb{E}[E_\gamma(P_V[U^M]\,\|\,P_V)]+e^{-L/\gamma}, \tag{5.309}
\]
where we have used the fact that $\mathbb{E}[P_{V|U}(\mathcal{F}_{U_m}|U_m)]=P_{UV}(\mathcal{F})$ for each $m$. Applying the "softer-covering lemma" as in Remark 5.2.5, the middle term on the right hand side of (5.309) is upper-bounded by
\[
\mathbb{P}\left[\imath_{V;U}(V;U)\ge\log\frac{M\gamma}{2}\right]+\frac{2}{\gamma} \tag{5.310}
\]
and the result follows by taking $\gamma\leftarrow 2L\exp(-\tau)$.
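The basic inequality (5.308) is easy to sanity-check numerically. The sketch below verifies it on a grid, restricted to $p\alpha\le M$ so that the base stays nonnegative (this is the regime arising in the proof above, where $\alpha/M=1/\gamma\le 1$):

```python
import math

def lhs(p, alpha, M):
    # left side of (5.308)
    return (1 - p * alpha / M) ** M

def rhs(p, alpha):
    # right side of (5.308)
    return 1 - p + math.exp(-alpha)

# grid check of (1 - p*alpha/M)^M <= 1 - p + e^{-alpha}
violations = [
    (p, a, M)
    for p in (0.0, 0.1, 0.5, 0.9, 1.0)
    for a in (0.1, 1.0, 3.0, 10.0)
    for M in (1, 2, 5, 50, 1000)
    if p * a <= M and lhs(p, a, M) > rhs(p, a) + 1e-12
]
```

The check relies on the standard chain $(1-p\alpha/M)^M\le e^{-p\alpha}\le 1-p+e^{-\alpha}$, the last step by convexity of $e^{-x\alpha}$ in $x$ over $[0,1]$.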
Remark 5.2.12 From the above derivation we see that for the proof of the basic covering lemma (the $M=1$ case) we need the "softest-covering lemma" (the case of one codeword) rather than the soft-covering lemma (the case of $\gamma=1$ and $L>1$ codewords). However, it is still possible to prove the basic covering lemma from the soft-covering lemma via a different argument; see the discussion in [189], which is essentially based on the idea in [159].
Lemma 5.2.19 below is a strengthened version of the one-shot mutual covering lemma, which improves Lemma 5.2.18 in terms of the error exponent. The proof of Lemma 5.2.19 essentially combines the proof of the achievability part of resolvability with the proof of Lemma 5.2.18, and the improvement results from not treating the two steps separately. The proof is not as conceptually simple as that of Lemma 5.2.18, since the complexities are no longer buried under the softer-covering lemma.
Lemma 5.2.19 Under the same assumptions as Lemma 5.2.18,
\[
\mathbb{P}\left[\bigcap_{m=1,l=1}^{M,L}\{(U_m,V_l)\notin\mathcal{F}\}\right]
\le \mathbb{P}\left[(U,V)\notin\mathcal{F}\ \text{or}\ \exp(\imath_{U;V}(U;V))>ML\exp(-\gamma)-\delta\right]
+\frac{\min\{M,L\}-1}{\delta}+e^{-\exp(\gamma)} \tag{5.311}
\]
for all $\delta,\gamma>0$ and every event $\mathcal{F}$.
Proof See Lemma 1 and Remark 4 in the conference version [191].
Remark 5.2.13 An advantage of Lemma 5.2.19 over Lemma 5.2.18 is that the upper bound in the former contains the probability of a union of two events, rather than the sum of the probabilities of the two events. This yields a strict improvement in the second-order rate analysis. Moreover, by setting $\delta\downarrow 0$ and $M=1$ we exactly recover the basic one-shot covering lemma in [161].
In terms of the second order rates, the one-shot Marton’s inner bound for broadcast
obtained from our one-shot mutual covering lemma ([186, Theorem 10]) is equiva-
lent to the achievability bound claimed in [37, Theorem 4] based on the stochastic
likelihood encoder. However, although it is not demonstrated explicitly in [186, The-
orem 10], we can improve the analysis of [186, Theorem 10] by using various nuisance
parameters rather than a single γ, to obtain a one-shot Marton’s bound which gives
strictly better error exponents than [37, Theorem 4]. The reason for such improve-
ment is that the third term in (5.311) is doubly exponential and the second term
converges to zero with a large exponent. On the other hand, the approach of [37,
Theorem 4] has the advantage of being easily extendable to the case of more than
two users (which would correspond to a multivariate mutual covering lemma).
5.2.6 Application to Wiretap Channels
Our final application of $E_\gamma$-resolvability (in particular, the softer-covering lemma) is to the wiretap channel, whose setup is depicted in Figure 5.4. The receiver and the eavesdropper observe $y\in\mathcal{Y}$ and $z\in\mathcal{Z}$, respectively. Given a codebook $c^{ML}$, the input to $P_{YZ|X}$ is $c_{wl}$, where $w\in\{1,\dots,M\}$ is the message to be sent and $l$ is equiprobably chosen from $\{1,\dots,L\}$ to randomize the eavesdropper's observation. We call such a $c^{ML}$ an $(M,L)$-code. Moreover, the eavesdropper's observation has the distribution $\pi_Z$ when no message is sent. In this setup, we do not need to assume a prior distribution on the message/non-message. We wish to design the codebook such that the receiver can decode the message (reliability) whereas the eavesdropper can neither detect whether a message is sent nor guess which message is sent (security). For general wiretap channels the performance may be enhanced by appending a conditioning channel $Q_{X|U}$ at the input of the original channel [171]. In that case the same analysis can be carried out for the new wiretap channel $Q_{YZ|U}$. Thus the model in Figure 5.4 entails no loss of generality.
Figure 5.4: The wiretap channel. The transmitter maps the message $W$ to the channel input $X$; through $P_{YZ|X}$, the receiver observes $Y$ and produces $\hat{W}$, while the eavesdropper observes $Z$.
In Wyner’s setup (see for example [18]), secrecy is measured in terms of the
conditional entropy of the message given the eavesdropper observation. In contrast,
we measure secrecy in terms of the size of the list that the eavesdropper has to declare
for the message to be included with high probability. Practically, the message W is
the compressed version of the plaintext. Assuming that the attacker knows which
compression algorithm is used, the plaintext can be recovered by running each of the
items in the eavesdropper output list through the decompressor and selecting the one
that is intelligible.
While there have been previous proofs for wiretap channels using the conventional
resolvability (soft-covering lemma) [171][172], the conventional resolvability based on
the total variation metric is only suitable when the communication rate is low enough
to achieve perfect secrecy. In contrast, Eγ-resolvability yields lower bounds on the
minimum size of the eavesdropper list for an arbitrary rate of communication. This
interpretation of security in terms of list size is reminiscent of equivocation [192], and indeed we recover the same formula in the asymptotic setting, even though there is no direct correspondence between the two criteria. Moreover, we also consider a more
general case where the eavesdropper wishes to reliably detect whether a message is
sent, while being able to produce a list including the actual message if it decides it is
present. This is a practical setup because “no message” may be valuable information
which the eavesdropper wants to ascertain reliably. The idea is reminiscent of the
stealth communication problem (see [193][194] and the references therein) which also
involves a hypothesis test on whether a message is sent. However, the setup and the
analysis (including the covering lemma) are quite different from [193] and [194]. In
comparison, our results are more suitable for the regime with higher communication
rates and lower secrecy demands. In the discrete memoryless case, we obtain single-letter expressions for the tradeoff between the transmission rate, the eavesdropper list size, and the exponent of the eavesdropper's false-alarm probability (i.e. declaring the presence of a message when there is none).
Summary of The Asymptotic Results
We need the following definitions to quantify the eavesdropper's ability to detect/decode messages.
Definition 5.2.9 For a fixed codebook and channel $P_{Z|X}$, we say the eavesdropper can perform $(A,T,\epsilon)$-decoding if, upon observing $Z$, it outputs a list which is either empty or of size $T$, such that

• $\mathbb{P}[\text{list}\ne\emptyset\,|\,\text{no message}]\le A^{-1}$.

• There exist $\epsilon_m\in[0,1]$, $m=1,\dots,M$, satisfying $\epsilon=\frac{1}{M}\sum_{m=1}^{M}\epsilon_m$, such that
\[
\mathbb{P}[m\notin\text{list}\,|\,W=m]\le\epsilon_m. \tag{5.312}
\]
Although the decoder in Definition 5.2.9 is reminiscent of erasure and list decoding
[21], for the former it is possible that actually no message is sent, and we treat the
undetected and detected errors together.
The logarithm of $T$ can be intuitively understood as the equivocation $H(W|Z)$ [192]. However, $\log T$ can be much smaller than $H(W|Z)$: a distribution can have 99% of its mass supported on a very small set, and yet have an arbitrarily large entropy.
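The gap between list size and entropy claimed above can be made concrete with a toy numerical illustration (the construction below is an example, not taken from the thesis):

```python
import numpy as np

def entropy_nats(p):
    # Shannon entropy in nats, ignoring zero-probability atoms
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# 99% of the mass on one point, the remaining 1% uniform over K points:
# a size-1 list already covers the outcome with probability 0.99, while
# the entropy grows like 0.01 * log(K), hence is unbounded in K.
K = 10**6
p = np.concatenate(([0.99], np.full(K, 0.01 / K)))
H = entropy_nats(p)
```

Increasing $K$ drives $H$ to infinity while the optimal list of any fixed size still misses the heavy atom with probability at most 0.01.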
The quantity $A>0$ characterizes how well the eavesdropper can detect that no message is sent, which is related to the notion of stealth communication [193][194]. The "non-stealth" is measured by $D(P_Z\,\|\,\pi_Z)$ in [193], and by $|P_Z-\pi_Z|$ in [194]. Although both the relative entropy and the total variation are related to the error probability in hypothesis testing, their results cannot be directly compared with ours, since they are interested in the regime where the non-stealth vanishes while the transmission rate is below the secrecy capacity. In contrast, we are mainly interested in the regime where $A$ grows exponentially in the blocklength (so that the "non-stealth" in their definition grows), but the transmission rate is above the secrecy capacity.
The asymptotic version of the eavesdropper achievability is as follows.
Definition 5.2.10 Fix a sequence of codebooks and an eavesdropper channel $(P_{Z^n|X^n})_{n=1}^{\infty}$. The rate pair $(\alpha,\tau)$ is $\epsilon$-achievable by the eavesdropper if there exist sequences $(A_n)$ and $(T_n)$ with
\begin{align}
\liminf_{n\to\infty}\frac{1}{n}\log A_n&\ge\alpha \tag{5.313}\\
\limsup_{n\to\infty}\frac{1}{n}\log T_n&\le\tau \tag{5.314}
\end{align}
such that for sufficiently large $n$, the eavesdropper can achieve $(A_n,T_n,\epsilon)$-decoding.
By the diagonalization argument [44, p. 56], the set of $\epsilon$-achievable $(\alpha,\tau)$ is closed. An $(M,L,Q_X)$-random code is defined as the ensemble of codebooks $c^{ML}$ where each codeword $c_{wl}$ is i.i.d. chosen according to $Q_X$, $w\in\{1,\dots,M\}$, $l\in\{1,\dots,L\}$. We shall focus on random codes, for which reliability is guaranteed by channel coding theorems, so we only need to consider the security condition.
First, we extend the notions of achievability to the case of a random ensemble of codes by taking the average: we say that for a random ensemble of codes the eavesdropper can perform $(A,T,\epsilon)$-decoding if there exists $\epsilon(c^{ML})$ such that for each $c^{ML}$ the eavesdropper can perform $(A,T,\epsilon(c^{ML}))$-decoding (in the sense of Definition 5.2.9), and the average of $\epsilon(c^{ML})$ with respect to the codebook distribution is upper-bounded by $\epsilon$. Similarly, Definition 5.2.10 can be extended to random codes. Then, the following is our main result, which characterizes the set of eavesdropper-achievable pairs for stationary memoryless channels.
Theorem 5.2.20 Fix any $Q_X$, $R$, $R_L$ and $0<\epsilon<1$. Consider $(\exp(nR),\exp(nR_L),Q_X^{\otimes n})$-random codes and a stationary memoryless channel with per-letter conditional distribution $Q_{Z|X}$. Then the pair $(\alpha,\tau)$ is $\epsilon$-achievable by the eavesdropper if and only if
\[
\begin{cases}
\alpha\le D(Q_Z\,\|\,\pi_Z)+[I(Q_X,P_{Z|X})-R-R_L]^{+};\\
\tau\ge R-[I(Q_X,P_{Z|X})-R_L]^{+},
\end{cases} \tag{5.315}
\]
where $Q_X\to Q_{Z|X}\to Q_Z$.
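As an illustration, the corner point of the region (5.315) can be evaluated in closed form for a binary-symmetric toy example. All modeling choices below are assumptions made for the illustration only: $X\sim\mathrm{Bern}(q)$, $Z=X\oplus N$ with $N\sim\mathrm{Bern}(e)$, and $\pi_Z=\mathrm{Bern}(e)$ (an all-zero input when no message is sent).

```python
import numpy as np

def h(p):
    # binary entropy in nats
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def dkl(a, b):
    # D(Bern(a) || Bern(b)) in nats
    a = np.clip(a, 1e-12, 1 - 1e-12)
    b = np.clip(b, 1e-12, 1 - 1e-12)
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

def eavesdropper_region(q, e, R, RL):
    """Corner point (alpha_max, tau_min) of (5.315) for the toy model."""
    s = q * (1 - e) + (1 - q) * e      # Q_Z = Bern(s)
    I = h(s) - h(e)                    # I(Q_X, P_{Z|X}) for the BSC
    Dz = dkl(s, e)                     # D(Q_Z || pi_Z)
    alpha_max = Dz + max(I - R - RL, 0.0)
    tau_min = R - max(I - RL, 0.0)
    return alpha_max, tau_min
```

Note that `tau_min` can be negative, in which case the list-size constraint is vacuous (any $\tau\ge 0$ is achievable by the eavesdropper).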
From the noisy channel coding theorem, the supremum of randomization rates $R_L$ such that the sender can reliably transmit messages at rate $R$ is $I(Q_X,P_{Y|X})-R$. The larger $R_L$, the less reliably the eavesdropper can decode, so the optimal encoder chooses $R_L$ as close to this supremum as possible. Thus Theorem 5.2.20 implies the following result.
Theorem 5.2.21 Given a stationary memoryless wiretap channel with per-letter conditional distribution $P_{YZ|X}$, there exists a sequence of codebooks such that messages at rate $R$ can be reliably transmitted to the intended receiver and $(\alpha,\tau)$ is not $\epsilon$-achievable by the eavesdropper, for any $\epsilon\in(0,1)$, if there exists some $Q_X$ such that
\[
R<I(Q_X,P_{Y|X}) \tag{5.316}
\]
and either
\[
\alpha>D(Q_Z\,\|\,\pi_Z)+[I(Q_X,P_{Z|X})-I(Q_X,P_{Y|X})]^{+} \tag{5.317}
\]
or
\[
\tau<R-[I(Q_X,P_{Z|X})-I(Q_X,P_{Y|X})+R]^{+}. \tag{5.318}
\]
Remark 5.2.14 In general the sender-receiver pair wants to minimize $\alpha$ and maximize $\tau$ subject to the tradeoff (5.317)-(5.318) by selecting $Q_X$. In the special case where $\alpha$ is of no importance and $R$ is larger than the secrecy capacity $C:=\sup_{Q_X}\{I(Q_X,P_{Y|X})-I(Q_X,P_{Z|X})\}$, we see from (5.318) that the supremum of $\tau$ is $C$. The formula for the supremum of $\tau$ is the same as the equivocation measure defined as $\frac{1}{n}H(W|Z^n)$ [192], but technically our result does not follow directly from the lower bound on the equivocation, since it may be possible that the a posteriori distribution of $W$ is concentrated on a small list but has a tail spread over an exponentially large set, resulting in a large equivocation.
We now prove the "only if" and "if" parts of Theorem 5.2.20.
Converse for the Eavesdropper: One-Shot Bounds
The “only if” part (the eavesdropper converse) of Theorem 5.2.20 follows by applying
the following non-asymptotic bounds to different regions and invoking Corollary 5.2.8.
Theorem 5.2.22 In the wiretap channel, fix an arbitrary distribution $\mu_Z$. Suppose the eavesdropper can either detect that no message is sent upon observing $z\in\mathcal{D}_0$, with
\[
\mu_Z(\mathcal{D}_0)\ge 1-A^{-1} \tag{5.319}
\]
for some $A\in[1,\infty)$, or output a list of $T(z)$ messages upon observing $z\notin\mathcal{D}_0$ that contains the actual message $m\in\{1,\dots,M\}$ with probability at least $1-\epsilon_m$, for some $\epsilon_m\in[0,1]$. Define the average quantities
\begin{align}
T&:=\frac{1}{\mu_Z(\mathcal{D}_0^{c})}\int_{\mathcal{D}_0^{c}}T(z)\,\mathrm{d}\mu_Z(z), \tag{5.320}\\
\epsilon&:=\frac{1}{M}\sum_{m=1}^{M}\epsilon_m. \tag{5.321}
\end{align}
Then for any $\gamma\in[1,+\infty)$,
\[
\frac{1}{A}\ge\frac{1}{\gamma}\left(1-\epsilon-E_\gamma(P_Z\,\|\,\pi_Z)\right), \tag{5.322}
\]
where we recall that $\pi_Z$ is the non-message distribution, $P_Z:=\frac{1}{M}\sum_{m=1}^{M}P_{Z|W=m}$, and $P_{Z|W=m}$ is the distribution of the eavesdropper observation for the message $m$ (assuming an arbitrary codebook is used). Moreover,
\[
\frac{T}{MA}\ge\frac{1}{\gamma}\left(1-\epsilon-\frac{1}{M}\sum_{m=1}^{M}E_\gamma(P_{Z|W=m}\,\|\,\mu_Z)\right). \tag{5.323}
\]
We will choose $\mu_Z=P_Z$ when we use Theorem 5.2.22 to prove Theorem 5.2.20, although (5.323) holds for any $\mu_Z$.

From the eavesdropper's viewpoint, a larger $A$ and a smaller $T$ are more desirable, since the eavesdropper will then be able to find out that no message is sent with smaller error probability, or to narrow down to a smaller list when a message is sent. This observation agrees with (5.322) and (5.323): a smaller $\gamma$ implies a higher degree of approximation, and hence a higher indistinguishability of the output distributions, which is to the eavesdropper's disadvantage.
Proof To see (5.322),
\begin{align}
\frac{1}{A}&\ge\pi_Z(\mathcal{D}_0^{c}) \tag{5.324}\\
&\ge\frac{1}{\gamma}\left(P_Z(\mathcal{D}_0^{c})-E_\gamma(P_Z\,\|\,\pi_Z)\right) \tag{5.325}\\
&=\frac{1}{\gamma}\left(\frac{1}{M}\sum_{m=1}^{M}P_{Z|W=m}(\mathcal{D}_0^{c})-E_\gamma(P_Z\,\|\,\pi_Z)\right) \tag{5.326}\\
&\ge\frac{1}{\gamma}\left(1-\epsilon-E_\gamma(P_Z\,\|\,\pi_Z)\right). \tag{5.327}
\end{align}
To see (5.323), let $\mathcal{D}_m$ be the set of outputs $z\in\mathcal{Z}$ for which the eavesdropper list contains $m\in\{1,\dots,M\}$. Then
\begin{align}
\frac{T}{MA}&\ge\frac{T}{M}\,\mu_Z(\mathcal{D}_0^{c}) \tag{5.328}\\
&=\frac{1}{M}\int_{\mathcal{D}_0^{c}}T(z)\,\mathrm{d}\mu_Z(z) \tag{5.329}\\
&=\frac{1}{M}\int\sum_{m=1}^{M}1\{z\in\mathcal{D}_m\}\,\mathrm{d}\mu_Z(z) \tag{5.330}\\
&=\frac{1}{M}\sum_{m=1}^{M}\int 1\{z\in\mathcal{D}_m\}\,\mathrm{d}\mu_Z(z) \tag{5.331}\\
&=\frac{1}{M}\sum_{m=1}^{M}\mu_Z(\mathcal{D}_m) \tag{5.332}\\
&\ge\frac{1}{M\gamma}\sum_{m=1}^{M}\left(P_{Z|W=m}(\mathcal{D}_m)-E_\gamma(P_{Z|W=m}\,\|\,\mu_Z)\right) \tag{5.333}\\
&\ge\frac{1}{\gamma}\left(1-\epsilon-\frac{1}{M}\sum_{m=1}^{M}E_\gamma(P_{Z|W=m}\,\|\,\mu_Z)\right). \tag{5.334}
\end{align}
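The steps (5.325) and (5.333) use only the defining property of $E_\gamma$ on finite alphabets, namely $P(\mathcal{A})-\gamma Q(\mathcal{A})\le E_\gamma(P\,\|\,Q)$ for every event $\mathcal{A}$. A quick randomized spot-check of this property (an illustration, not part of any proof):

```python
import numpy as np

rng = np.random.default_rng(0)

def E_gamma(p, q, gamma):
    # E_gamma(P || Q) = sum_z [P(z) - gamma*Q(z)]^+ on a finite alphabet
    return float(np.maximum(p - gamma * q, 0.0).sum())

# verify P(A) <= gamma*Q(A) + E_gamma(P||Q) for random P, Q and events A
k = 8
for _ in range(100):
    p = rng.dirichlet(np.ones(k))
    q = rng.dirichlet(np.ones(k))
    A = rng.random(k) < 0.5          # a random event (subset of alphabet)
    g = 1 + 3 * rng.random()
    assert p[A].sum() <= g * q[A].sum() + E_gamma(p, q, g) + 1e-12
```

At $\gamma=1$ this quantity is the (unnormalized) total variation distance, consistent with Remark 5.2.11.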
Next, we particularize Theorem 5.2.22 to the asymptotic setting.
Proof [Proof of "only if" in Theorem 5.2.20]

• Fix an arbitrary
\[
\alpha>D(Q_Z\,\|\,\pi_Z)+[I(Q_X,P_{Z|X})-R-R_L]^{+}. \tag{5.335}
\]
We will show that $(\alpha,\tau)$ is not $\epsilon$-achievable by the eavesdropper for any $\tau>0$ and $\epsilon\in(0,1)$. Pick $\sigma>0$ such that
\[
\alpha>D(Q_Z\,\|\,\pi_Z)+[I(Q_X,P_{Z|X})-R-R_L]^{+}+2\sigma \tag{5.336}
\]
and define
\[
\begin{cases}
A_n=\exp(n(\alpha-\sigma))\\
T_n=\exp(n(\tau+\sigma)).
\end{cases} \tag{5.337}
\]
Assume the eavesdropper can perform $(A_n,T_n,\epsilon(c^{ML}))$-decoding for a particular realization of the codebook $c^{ML}$. Applying Theorem 5.2.22 with
\[
\gamma_n=\exp(n(D(Q_Z\,\|\,\pi_Z)+[I(Q_X,P_{Z|X})-R-R_L]^{+}+\sigma)), \tag{5.338}
\]
we obtain
\begin{align}
\exp(n(D(Q_Z\,\|\,\pi_Z)+[I(Q_X,P_{Z|X})-R-R_L]^{+}-\alpha+2\sigma))
&=\frac{\gamma_n}{A_n} \tag{5.339}\\
&\ge 1-\epsilon(c^{ML})-E_{\gamma_n}(P_{Z^n}[c^{ML}]\,\|\,\pi_Z^{\otimes n}). \tag{5.340}
\end{align}
From (5.336), the above implies
\[
E_{\gamma_n}(P_{Z^n}[c^{ML}]\,\|\,\pi_Z^{\otimes n})\ge\frac{1-\epsilon(c^{ML})}{2} \tag{5.341}
\]
for sufficiently large $n$. By Corollary 5.2.8 and (5.338), the average of the left side converges to zero as $n\to\infty$, thus the average of the right side cannot be lower-bounded by $\frac{1-\epsilon}{2}$.

• Fix an arbitrary
\[
\tau<R-[I(Q_X,P_{Z|X})-R_L]^{+}. \tag{5.342}
\]
We will show that $(\alpha,\tau)$ is not $\epsilon$-achievable by the eavesdropper for any $\alpha>0$ and $\epsilon\in(0,1)$. Pick $\sigma>0$ such that
\[
\tau+2\sigma<R-[I(Q_X,P_{Z|X})-R_L]^{+} \tag{5.343}
\]
and again define $A_n$ and $T_n$ as in (5.337). Assume the eavesdropper can perform $(A_n,T_n,\epsilon(c^{ML}))$-decoding for a particular realization of the codebook $c^{ML}$. Applying Theorem 5.2.22 with
\begin{align}
\mu_Z&=Q_Z^{\otimes n}, \tag{5.344}\\
\gamma_n&=\exp(n([I(Q_X,P_{Z|X})-R_L]^{+}+\sigma)), \tag{5.345}
\end{align}
and noting that $A_n\ge 1$, we obtain
\begin{align}
\exp(n(\tau-R+[I(Q_X,P_{Z|X})-R_L]^{+}+2\sigma))
&=\frac{T_n\gamma_n}{M_n} \tag{5.346}\\
&\ge 1-\epsilon(c^{ML})-\frac{1}{M}\sum_{m=1}^{M}E_{\gamma_n}(P_{Z^n|W=m}\,\|\,Q_Z^{\otimes n}) \tag{5.347}\\
&=1-\epsilon(c^{ML})-\frac{1}{M}\sum_{m=1}^{M}E_{\gamma_n}(P_{Z^n}[c_{mL}]\,\|\,Q_Z^{\otimes n}) \tag{5.348}
\end{align}
where $c_{mL}:=(c_{ml})_{l=1}^{L}$. From (5.343), the above implies
\[
\frac{1}{M}\sum_{m=1}^{M}E_{\gamma_n}(P_{Z^n}[c_{mL}]\,\|\,Q_Z^{\otimes n})\ge\frac{1-\epsilon(c^{ML})}{2} \tag{5.349}
\]
for sufficiently large $n$. Invoking Corollary 5.2.8, we see that the average of the left side with respect to the codebook converges to zero as $n\to\infty$, and in particular cannot be lower-bounded by $\frac{1-\epsilon}{2}$.
Achievability of the Eavesdropper: Ensemble Tightness
The (eavesdropper) achievability part of Theorem 5.2.20 follows by analyzing the eavesdropper's list-decoding ability for different cases of the rates $(R,R_L)$. First, consider the following one-shot achievability bounds for channel coding with possibly no message sent:
Theorem 5.2.23 Consider a random transformation $P_{Z|X}$ and an $(M,L,Q_X)$-random code. Let $Q_X\to P_{Z|X}\to Q_Z$, and let $\pi_Z$ be the distribution of the eavesdropper observation when no message is sent. Define
\begin{align}
\imath_{Z;X}(z;x)&:=\log\frac{\mathrm{d}P_{Z|X=x}}{\mathrm{d}\pi_Z}(z); \tag{5.350}\\
\bar{\imath}_{Z;X}(z;x)&:=\log\frac{\mathrm{d}P_{Z|X=x}}{\mathrm{d}Q_Z}(z). \tag{5.351}
\end{align}
Let $\delta,\beta,A,T>0$. Then, there exist three list decoders such that for Decoder 1,
\begin{align}
\mathbb{E}_{\mathcal{C}}\,\mathbb{P}[\text{error}\,|\,\text{no message}]&\le\frac{1}{A}\exp(-\delta), \tag{5.352}\\
\mathbb{E}_{\mathcal{C}}\,\mathbb{P}[\text{error}\,|\,\text{message is }m]&\le\mathbb{P}[\imath_{Z;X}(Z;X)\le\log(LMA)+\delta]\nonumber\\
&\quad+\mathbb{P}\left[\bar{\imath}_{Z;X}(Z;X)\le\log\frac{LM}{T}+\delta\right]\nonumber\\
&\quad+\frac{1}{1+\beta}+e^{-(\beta+1)}+\beta\exp(-\delta). \tag{5.353}
\end{align}
Here an error in the case of no message means that a non-empty list is produced. An error in the case of message $m$ means either the list does not contain $m$, or the list size exceeds $T$.15 For Decoder 2,
\begin{align}
\mathbb{E}_{\mathcal{C}}\,\mathbb{P}[\text{error}\,|\,\text{no message}]&\le\frac{1}{A}; \tag{5.354}\\
\mathbb{E}_{\mathcal{C}}\,\mathbb{P}[\text{error}\,|\,\text{message is }m]&\le\mathbb{P}\left[\bar{\imath}_{Z;X}(Z;X)\le\log\frac{LM}{T}+\delta\right]+\mathbb{P}[\imath_{Q_Z\|\pi_Z}(Z)\le\log A]\nonumber\\
&\quad+\frac{1}{1+\beta}+e^{-(\beta+1)}+\beta\exp(-\delta), \tag{5.355}
\end{align}
where the error events are defined similarly to Decoder 1. Decoder 3 either outputs an empty list or a list of all messages, and
\begin{align}
\mathbb{E}_{\mathcal{C}}\,\mathbb{P}[\text{error}\,|\,\text{no message}]&\le\frac{1}{A}; \tag{5.356}\\
\mathbb{E}_{\mathcal{C}}\,\mathbb{P}[\text{error}\,|\,\text{message is }m]&\le\mathbb{P}[\imath_{Q_Z\|\pi_Z}(Z)\le\log A]. \tag{5.357}
\end{align}
Proof We defer the proof to the end of this Section.
Under various conditions, one of the three decoders is asymptotically optimal. By choosing appropriate parameters $\delta,\beta,A,T>0$, it is clear that Theorem 5.2.23 implies the following:
Corollary 5.2.24 Fix any $Q_X$, $R$, $R_L$ and $0<\epsilon<1$. Consider $(\exp(nR),\exp(nR_L),Q_X^{\otimes n})$-random codes and a stationary memoryless channel with per-letter conditional distribution $Q_{Z|X}$.
15Such a decoder is a variable list-size decoder. However, we can add a post-processor which declares no message if the list is empty, or outputs a list of fixed size $T$ otherwise (by arbitrarily deleting or adding messages to the list), resulting in a new decoder as considered in Definition 5.2.9, and the two types of error probability for the new decoder (i.e. the best values of $\frac{1}{A}$ and $\epsilon_m$ in Definition 5.2.9) do not exceed the two types of error probability for the original variable list-size decoder.
• When $R+R_L<I(Q_X,P_{Z|X})$, the rate pair $(\alpha,\tau)$ is $\epsilon$-achievable by Decoder 1${}^{16}$ if
\[
\begin{cases}
D(Q_Z\,\|\,\pi_Z)+I(Q_X,P_{Z|X})>R_L+R+\alpha;\\
I(Q_X,P_{Z|X})>R+R_L-\tau.
\end{cases} \tag{5.358}
\]

• When $R+R_L\ge I(Q_X,P_{Z|X})$ but $R_L<I(Q_X,P_{Z|X})$, the rate pair $(\alpha,\tau)$ is $\epsilon$-achievable by Decoder 2 if
\[
\begin{cases}
I(Q_X,P_{Z|X})>R_L+R-\tau;\\
\alpha<D(Q_Z\,\|\,\pi_Z).
\end{cases} \tag{5.359}
\]

• When $R_L\ge I(Q_X,P_{Z|X})$, the rate pair $(\alpha,\tau)$ is $\epsilon$-achievable by Decoder 3 if
\[
\begin{cases}
\tau\ge R;\\
\alpha<D(Q_Z\,\|\,\pi_Z).
\end{cases} \tag{5.360}
\]
Proof [Proof Sketch] Consider the first case. To see the achievability of $(\alpha,\tau)$ satisfying (5.358), choose
\begin{align}
\delta_n&:=n^{0.9}, \tag{5.361}\\
\beta_n&:=n, \tag{5.362}\\
A_n&:=\exp(n\alpha), \tag{5.363}\\
T_n&:=\exp(n\tau). \tag{5.364}
\end{align}
16By which we mean that it is possible to choose the $\delta,\beta,A,T$ parameters for the decoder to achieve the desired performance.
Then the right sides of (5.352) and (5.353) converge to zero as $n\to\infty$. The analyses of the other two cases are similar, using the same choice of parameters as above.

The eavesdropper's achievability ("if" part) of Theorem 5.2.20 then follows from Corollary 5.2.24 and an application of the standard diagonalization argument to show that the achievable region is closed (see [44]).
Proof of Theorem 5.2.23

• Codebook generation: $(c_{ij})_{1\le i\le M,1\le j\le L}$ i.i.d. according to $Q_X^{\otimes ML}$.

• Decoders: Fix an arbitrary constant $\delta>0$. Upon observing $z$, Decoder 1 outputs as its list all $1\le i\le M$ such that there exists $1\le j\le L$ satisfying
\[
\begin{cases}
\imath_{Z;X}(z;c_{ij})>\log(LMA)+\delta\\
\bar{\imath}_{Z;X}(z;c_{ij})>\log\frac{LM}{T}+\delta
\end{cases} \tag{5.365}
\]
if there is at least one such $i$, or declares that no message is sent (i.e. outputs an empty list) otherwise. Decoder 2 outputs as its list all $1\le i\le M$ such that there exists $1\le j\le L$ satisfying
\[
\bar{\imath}_{Z;X}(z;c_{ij})>\log\frac{LM}{T}+\delta \tag{5.366}
\]
if there is at least one such $i$ and, in addition,
\[
\imath_{Q_Z\|\pi_Z}(z)>\log A, \tag{5.367}
\]
or declares that no message is sent otherwise. Decoder 3 outputs $\{1,\dots,M\}$ as the list if (5.367) holds (so that the list size equals $M$), and otherwise declares no message.
• Error analysis: we denote by $\mathcal{L}$ the list of messages recovered by the eavesdropper.

Decoder 1:
\begin{align}
\mathbb{P}[\mathcal{L}\ne\emptyset\,|\,\text{no message}]
&\le\mathbb{P}\left[\max_{1\le m\le M,1\le l\le L}\imath_{Z;X}(Z;X_{ml})>\log(LMA)+\delta\right] \tag{5.368}\\
&\le\frac{1}{A}\exp(-\delta) \tag{5.369}
\end{align}
where the probability is averaged over the codebook, $(X^{ML},Z)\sim Q_X^{\otimes ML}\times\pi_Z$, and (5.369) used the packing lemma [161]. Moreover,
\begin{align}
&\mathbb{P}[1\notin\mathcal{L}\ \text{or}\ \mathcal{L}=\emptyset\,|\,W=1] \tag{5.370}\\
&\quad\le\mathbb{P}[\imath_{Z;X}(Z;X)\le\log(LMA)+\delta]
+\mathbb{P}\left[\bar{\imath}_{Z;X}(Z;X)\le\log\frac{LM}{T}+\delta\right] \tag{5.371}
\end{align}
where $W$ denotes the message sent, and $(X,Z)\sim Q_{XZ}$. Further,
\begin{align}
&\mathbb{P}[|\mathcal{L}|\ge T+1\,|\,W=1]\nonumber\\
&\quad\le\mathbb{P}\left[|\mathcal{L}|\ge T+1,\ \mathcal{L}\cap\left\{2,\dots,\tfrac{\beta M}{T}+1\right\}=\emptyset\,\middle|\,W=1\right]
+\mathbb{P}\left[\mathcal{L}\cap\left\{2,\dots,\tfrac{\beta M}{T}+1\right\}\ne\emptyset\,\middle|\,W=1\right] \tag{5.372}\\
&\quad\le\left(1-\frac{\beta M/T}{M}\right)^{T}
+\mathbb{P}\left[\mathcal{L}\cap\left\{2,\dots,\tfrac{\beta M}{T}+1\right\}\ne\emptyset\,\middle|\,W=1\right] \tag{5.373}\\
&\quad\le\frac{1}{1+\beta}+e^{-(\beta+1)}
+\mathbb{P}\left[\max_{2\le m\le\beta M/T+1,\,1\le l\le L}\bar{\imath}_{Z;X}(Z;X_{ml})>\log\frac{LM}{T}+\delta\right] \tag{5.374}\\
&\quad\le\frac{1}{1+\beta}+e^{-(\beta+1)}+\beta\exp(-\delta) \tag{5.375}
\end{align}
where

– To see (5.373), note that by the symmetry among the messages $2,\dots,M$, for any $t\ge T$,
\begin{align}
&\mathbb{P}\left[\mathcal{L}\cap\left\{2,\dots,\tfrac{\beta M}{T}+1\right\}=\emptyset\,\middle|\,W=1,\ |\mathcal{L}\setminus\{1\}|=t\right] \tag{5.376}\\
&\quad=\left(1-\frac{\beta M/T}{M-1}\right)^{t} \tag{5.377}\\
&\quad\le\left(1-\frac{\beta M/T}{M}\right)^{T}. \tag{5.378}
\end{align}

– In (5.374), $(X^{ML},Z)\sim Q_X^{\otimes ML}\times Q_Z$, and we used the inequality (5.308).

– (5.375) used the packing lemma [161].
In summary,
\[
\mathbb{P}[\text{error}\,|\,\text{no message}]\le\frac{1}{A}\exp(-\delta), \tag{5.379}
\]
and for each $m=1,\dots,M$, by the union bound and by the symmetry in codebook generation we have
\begin{align}
\mathbb{P}[\text{error}\,|\,W=m]&\le\mathbb{P}[\imath_{Z;X}(Z;X)\le\log(LMA)+\delta]
+\mathbb{P}\left[\bar{\imath}_{Z;X}(Z;X)\le\log\frac{LM}{T}+\delta\right]\nonumber\\
&\quad+\frac{1}{1+\beta}+e^{-(\beta+1)}+\beta\exp(-\delta). \tag{5.380}
\end{align}
Decoder 2:
\begin{align}
\mathbb{P}[\mathcal{L}\ne\emptyset\,|\,\text{no message}]&\le\mathbb{P}[\imath_{Q_Z\|\pi_Z}(Z)>\log A] \tag{5.381}\\
&\le\frac{1}{A}\,\mathbb{P}[\imath_{Q_Z\|\pi_Z}(\bar{Z})>\log A] \tag{5.382}\\
&\le\frac{1}{A} \tag{5.383}
\end{align}
where $Z\sim\pi_Z$ and $\bar{Z}\sim Q_Z$, and (5.382) used a change of measure. On the other hand,
\[
\mathbb{P}[\mathcal{L}\ne\emptyset,\ 1\notin\mathcal{L}\,|\,W=1]\le\mathbb{P}\left[\bar{\imath}_{Z;X}(Z;X)\le\log\frac{LM}{T}+\delta\right] \tag{5.384}
\]
and
\[
\mathbb{P}[\mathcal{L}=\emptyset\,|\,W=1]\le\mathbb{P}[\imath_{Q_Z\|\pi_Z}(Z)\le\log A]. \tag{5.385}
\]
Moreover, as in (5.375), we have
\[
\mathbb{P}[|\mathcal{L}|\ge T+1\,|\,W=1]\le\frac{1}{1+\beta}+e^{-(\beta+1)}+\beta\exp(-\delta). \tag{5.386}
\]
By the union bound,
\[
\mathbb{P}[\text{error}\,|\,\text{no message}]\le\frac{1}{A}; \tag{5.387}
\]
and for each $m=1,\dots,M$,
\begin{align}
\mathbb{P}[\text{error}\,|\,W=m]&\le\mathbb{P}\left[\bar{\imath}_{Z;X}(Z;X)\le\log\frac{LM}{T}+\delta\right]
+\mathbb{P}[\imath_{Q_Z\|\pi_Z}(Z)\le\log A]\nonumber\\
&\quad+\frac{1}{1+\beta}+e^{-(\beta+1)}+\beta\exp(-\delta). \tag{5.388}
\end{align}
• Decoder 3: The analysis is similar to that of Decoder 2, and the result follows from (5.382) and (5.385).
Chapter 6
The Counterpoint
While the previous chapters have demonstrated the uses of functional formulations of information measures and functional-analytic methods in the achievability and converse proofs of operational problems in information theory, the present chapter is devoted to the antithesis, that is, how entropic formulations and information-theoretic methods can be used to prove functional inequalities. Most notably, the Gaussian rotational invariance implies the Gaussian optimality in the Brascamp-Lieb inequality and its variations, and the proofs appear to be more conveniently carried out in the information-theoretic formulations than in their traditional functional versions [89].
The data processing property, tensorization, and convexity are studied in Section 6.1.
Section 6.2 proves the Gaussian optimality in a generalization of the Brascamp-Lieb
inequality where the deterministic linear maps are generalized by Gaussian random
transformations1. In most cases, we are able to prove the Gaussian extremality and
uniqueness of the minimizer under a certain non-degenerate assumption, while es-
tablishing the Gaussian exhaustibility in full generality2. In Section 6.3 we further
establish the Gaussian optimality in the forward-reverse Brascamp-Lieb inequality.
Section 6.4 discusses several implications of the Gaussian optimality results: some
quantities/rate regions arising in information theory can be efficiently computed by
1That is, a random transformation $x\mapsto Ax+w$ where $A$ is deterministic and $w$ is a Gaussian vector independent of $x$.
2See the beginning of Section 6.2 for precise definitions of extremisability and exhaustibility.
solving a finite dimensional optimization problem in the Gaussian cases. Examples
include multi-variate hypercontractivity, Wyner’s common information for multiple
variables, and certain secret key or common randomness generation problems. The
relationship between the Gaussian optimality in the forward-reverse Brascamp-Lieb
inequality and the transportation-cost inequalities for Gaussian measures is also dis-
cussed.
6.1 Data Processing, Tensorization and Convexity
Given $Q_X$ and $(Q_{Y_j|X})$, denote by $G_{\mathrm{BL}}(Q_X,(Q_{Y_j|X}))$ the set of $(d,(c_j))$ in Theorem 2.2.3 (the forward Brascamp-Lieb inequality) such that (2.14) holds. In this section we show that some elementary properties of $G_{\mathrm{BL}}(Q_X,(Q_{Y_j|X}))$ follow conveniently from the information-theoretic characterization (2.15).
6.1.1 Data Processing
Loosely speaking, the set $G_{\mathrm{BL}}(Q_X,(Q_{Y_j|X}))$ characterizes the level of "uncorrelatedness" between $X$ and $(Y_1,\dots,Y_m)$. The following data processing property captures this intuition:

Proposition 6.1.1 1. Given $Q_W$, $Q_{X|W}$ and $(Q_{Y_j|X})_{j=1}^{m}$, assume that $Q_{WXY_j}=Q_WQ_{X|W}Q_{Y_j|X}$ for each $j$. If $(0,(c_j))\in G_{\mathrm{BL}}(Q_X,(Q_{Y_j|X}))$, then $(0,(c_j))\in G_{\mathrm{BL}}(Q_W,(Q_{Y_j|W}))$.

2. Given $Q_X$, $(Q_{Y_j|X})_{j=1}^{m}$ and $(Q_{Z_j|Y_j})_{j=1}^{m}$, assume that $Q_{XY_jZ_j}=Q_XQ_{Y_j|X}Q_{Z_j|Y_j}$ for each $j$. Then $G_{\mathrm{BL}}(Q_X,(Q_{Y_j|X}))\subseteq G_{\mathrm{BL}}(Q_X,(Q_{Z_j|X}))$.
The proof is omitted since it follows immediately from the monotonicity of the relative
entropy and (2.15).
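The monotonicity (data processing) of relative entropy invoked in the omitted proof can be spot-checked numerically on finite alphabets (an illustrative sketch; the random-channel test is an assumption of this example, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(2)

def D(p, q):
    # relative entropy in nats, with the convention 0*log(0/q) = 0
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

# data-processing check: D(PK || QK) <= D(P || Q) for random channels K
k = 5
for _ in range(200):
    P = rng.dirichlet(np.ones(k))
    Q = rng.dirichlet(np.ones(k))
    K = rng.dirichlet(np.ones(k), size=k)   # row x is the kernel P_{Y|X=x}
    assert D(P @ K, Q @ K) <= D(P, Q) + 1e-12
```

Equality is attained for deterministic invertible channels (e.g. a permutation matrix $K$), consistent with the proposition being an inclusion rather than an equality.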
6.1.2 Tensorization
The term "tensorization" refers to the phenomenon of additivity/multiplicativity of certain functional inequalities under tensor products. In information theory this is a central feature of many converse proofs, and is closely related to the fact that some operational problems admit single-letter solutions. In functional analysis, it provides a "particularly cute" [195] tool for proving many inequalities in arbitrary dimensions. As a close example, Lieb's proof [89] of the Brascamp-Lieb inequality relies on a special case of Proposition 6.1.2 below, where the proof uses the (functional version of the) Brascamp-Lieb inequality and the Minkowski inequality. The original proof of the Brascamp-Lieb inequality [83] is also based on a tensor power construction.
Proposition 6.1.2 Suppose $(d^{(i)},(c_j))\in G_{\mathrm{BL}}(Q_X^{(i)},(Q_{Y_j|X}^{(i)}))$ for $i=1,2$. Then
\[
\left(d^{(1)}+d^{(2)},(c_j)\right)\in G_{\mathrm{BL}}\left(Q_X^{(1)}\times Q_X^{(2)},\left(Q_{Y_j|X}^{(1)}\times Q_{Y_j|X}^{(2)}\right)\right)
\]
where $d^{(1)}+d^{(2)}$ is defined as the function
\begin{align}
\mathcal{X}^{(1)}\times\mathcal{X}^{(2)}&\to\mathbb{R}; \tag{6.1}\\
(x_1,x_2)&\mapsto d^{(1)}(x_1)+d^{(2)}(x_2). \tag{6.2}
\end{align}
We provide a simple information-theoretic proof using the chain rules of the relative
entropy. Note that the algebraic expansions here are similar to the ones in the proof
of Gaussian optimality in Section 6.2 or the converse proof for the key generation
problem in Section 6.4.5.
Proof For an arbitrary $P_{X^{(1)}X^{(2)}}$, define $P_{X^{(1)}X^{(2)}Y_j^{(1)}Y_j^{(2)}} := P_{X^{(1)}X^{(2)}} Q_{Y_j|X}^{(1)} Q_{Y_j|X}^{(2)}$. Observe that

$$D(P_{X^{(1)}X^{(2)}} \| Q_X^{(1)} \times Q_X^{(2)}) = D(P_{X^{(1)}} \| Q_X^{(1)}) + D(P_{X^{(2)}|X^{(1)}} \| Q_X^{(2)} | P_{X^{(1)}}). \quad (6.3)$$

$$D(P_{Y_j^{(1)}Y_j^{(2)}} \| Q_{Y_j}^{(1)} \times Q_{Y_j}^{(2)}) = D(P_{Y_j^{(1)}} \| Q_{Y_j}^{(1)}) + D(P_{Y_j^{(2)}|Y_j^{(1)}} \| Q_{Y_j}^{(2)} | P_{Y_j^{(1)}}) \quad (6.4)$$
$$\le D(P_{Y_j^{(1)}} \| Q_{Y_j}^{(1)}) + D(P_{Y_j^{(2)}|X^{(1)}Y_j^{(1)}} \| Q_{Y_j}^{(2)} | P_{X^{(1)}Y_j^{(1)}}) \quad (6.5)$$
$$= D(P_{Y_j^{(1)}} \| Q_{Y_j}^{(1)}) + D(P_{Y_j^{(2)}|X^{(1)}} \| Q_{Y_j}^{(2)} | P_{X^{(1)}}) \quad (6.6)$$

where (6.5) uses Jensen's inequality, and (6.6) is from the Markov chain $Y_j^{(2)} - X^{(1)} - Y_j^{(1)}$, wherein $(X^{(i)}, Y_j^{(i)}) \sim P_{X^{(i)}Y_j^{(i)}}$ for $i = 1, 2$, $j = 1, \dots, m$. By the assumption and the law of total expectation,

$$D(P_{X^{(1)}} \| Q_X^{(1)}) + \mathbb{E}[d^{(1)}(X^{(1)})] \ge \sum_{j=1}^m c_j D(P_{Y_j^{(1)}} \| Q_{Y_j}^{(1)}); \quad (6.7)$$

$$D(P_{X^{(2)}|X^{(1)}} \| Q_X^{(2)} | P_{X^{(1)}}) + \mathbb{E}[d^{(2)}(X^{(2)})] \ge \sum_{j=1}^m c_j D(P_{Y_j^{(2)}|X^{(1)}} \| Q_{Y_j}^{(2)} | P_{X^{(1)}}). \quad (6.8)$$

Adding up (6.7) and (6.8) and applying (6.3) and (6.6), we obtain

$$D(P_{X^{(1)}X^{(2)}} \| Q_X^{(1)} \times Q_X^{(2)}) + \mathbb{E}[(d^{(1)} + d^{(2)})(X^{(1)}, X^{(2)})] \ge \sum_{j=1}^m c_j D(P_{Y_j^{(1)}Y_j^{(2)}} \| Q_{Y_j}^{(1)} \times Q_{Y_j}^{(2)}) \quad (6.9)$$

as desired.
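The chain-rule decomposition (6.3) that drives this proof is easy to check numerically for discrete distributions. The following sketch (with randomly generated distributions, purely illustrative and not from the text) verifies $D(P_{X_1 X_2} \| Q_1 \times Q_2) = D(P_{X_1} \| Q_1) + D(P_{X_2|X_1} \| Q_2 | P_{X_1})$:

```python
import numpy as np

def kl(p, q):
    # relative entropy D(p||q) in nats for discrete distributions
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(0)
# random joint distribution P on a 3x4 alphabet and product reference Q1 x Q2
P = rng.random((3, 4)); P /= P.sum()
Q1 = rng.random(3); Q1 /= Q1.sum()
Q2 = rng.random(4); Q2 /= Q2.sum()

lhs = kl(P.ravel(), np.outer(Q1, Q2).ravel())
P1 = P.sum(axis=1)  # marginal of the first coordinate
# conditional term D(P_{X2|X1} || Q2 | P_{X1}), averaged over the first coordinate
cond = sum(P1[i] * kl(P[i] / P1[i], Q2) for i in range(3))
rhs = kl(P1, Q1) + cond
assert abs(lhs - rhs) < 1e-10
```

The identity is exact; the tolerance only absorbs floating-point rounding.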
A functional proof of the tensorization of reverse Brascamp-Lieb inequalities can be given by generalizing the proof of the tensorization of the Prékopa-Leindler inequality (see for example [196]). Alternatively, information-theoretic proofs of the tensorization of these reverse-type inequalities can be extracted from the proof of Gaussian optimality in Theorem 6.3.1 ahead, and we omit the repetition here.
6.1.3 Convexity
Another property which follows conveniently from the information-theoretic characterization of $G_{\mathrm{BL}}(\cdot)$ is convexity:
Proposition 6.1.3 If $(d_i, (c_j^i)) \in G_{\mathrm{BL}}(Q_X, (Q_{Y_j|X}))$ for $i = 0, 1$, then $(d_\theta, (c_j^\theta)) \in G_{\mathrm{BL}}(Q_X, (Q_{Y_j|X}))$ for $\theta \in [0, 1]$, where we have defined

$$d_\theta := (1 - \theta) d_0 + \theta d_1, \quad (6.10)$$
$$c_j^\theta := (1 - \theta) c_j^0 + \theta c_j^1, \quad \forall j \in \{1, \dots, m\}. \quad (6.11)$$
Proof Follows immediately from (2.36) and taking convex combinations.
Note that by taking $m = 2$, $X = (Y_1, Y_2)$, $d_i(\cdot) = 0$ and $Q_{Y_j|X}$ to be the projection to the coordinates, we recover the Riesz-Thorin theorem on the interpolation of operator norms in the special case of nonnegative kernels. This information-theoretic proof (for this special case) is much simpler than the common proof of the Riesz-Thorin theorem in functional analysis based on the three-lines lemma, because the $c_j$'s only affect the right side of (2.15) as linear coefficients, rather than as tiltings of the distributions or functions.
6.2 Gaussian Optimality in the Forward Brascamp-Lieb Inequality
In this section we prove the Gaussian extremality in several information-theoretic
inequalities related to the forward Brascamp-Lieb inequality. Specifically, we first
establish this for an inequality involving conditional differential entropies, which im-
mediately implies the variants involving conditional mutual informations or differ-
ential entropies; the latter is directly connected to the Brascamp-Lieb inequality, as
Theorem 2.2.3 showed. These extremal inequalities have implications for certain op-
erational problems in information theory, and quite interestingly, the essential steps
in the proofs of these extremal inequalities follow the same patterns as the converse
proofs for the corresponding operational problems.
Of course, in the functional analysis literature there has been a lot of work supply-
ing various proofs of the Brascamp-Lieb inequality (Theorem 2.2.1), based on different
properties of the Gaussian distribution/function which we summarize below:
• [83] The tensor power of a one-dimensional Gaussian distribution is a multidi-
mensional Gaussian distribution, which is stable under Schwarz symmetrization
(i.e. spherically decreasing rearrangement).
• [89][197] Rotational invariance: if $f$ is a one-dimensional Gaussian function, then

$$f(x) f(y) = f\left(\frac{x - y}{\sqrt{2}}\right) f\left(\frac{x + y}{\sqrt{2}}\right). \quad (6.12)$$
• [90, Lemma 2] The convolution of Gaussian functions is Gaussian.
• [76] If a real valued random variable is added to an independent Gaussian noise,
then the derivative of the differential entropy of the sum with respect to the
variance of the noise is half the Fisher information (de Bruijn’s identity), and of
course the non-Gaussianness of the sum eventually disappears as the variance
goes to infinity.
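The rotational-invariance identity (6.12) can be confirmed directly: for $f(x) = e^{-x^2/2}$, both sides equal $e^{-(x^2+y^2)/2}$. A quick numeric sanity check (an illustration, not part of the thesis):

```python
import numpy as np

def f(x):
    # a centered one-dimensional Gaussian function (unnormalized)
    return np.exp(-x**2 / 2)

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = rng.standard_normal(1000)

# identity (6.12): f(x) f(y) = f((x - y)/sqrt(2)) f((x + y)/sqrt(2))
lhs = f(x) * f(y)
rhs = f((x - y) / np.sqrt(2)) * f((x + y) / np.sqrt(2))
assert np.allclose(lhs, rhs)
```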
Among these, the rotational invariance principle also forms the basis of the proof strategy in this thesis. Rotational invariance is a characterizing property of the Gaussian distribution: two independent random variables are both Gaussian if their sum is independent of their difference (cf. Cramér's theorem [198]). This rotation invariance argument³ has been used in establishing Gaussian extremality by Lieb [89], Carlen [197], and recently in information theory by Geng-Nair [92][199], Courtade-Jiao [200] and Courtade [201]. Some related ideas have also appeared in the literature on the
Brascamp-Lieb inequality, such as the observation, due to Ball, that convolution preserves the extremizers of the Brascamp-Lieb inequality [90, Lemma 2]. However, as keenly
³This argument was referred to as "$O(2)$-invariance" in [98] and the "doubling trick" in [197].
noted in [92], applying the rotation invariance/doubling trick on the information-
theoretic formulation has certain advantages. For example, the chain rules provide
convenient tools, and the establishment of the extremality usually follows similar steps
as the converse proofs of the corresponding operational problems in information the-
ory. Since the optimization problems we consider involve many information-theoretic
terms, we introduce a simplification/strengthening of the Geng-Nair approach by per-
turbing the coefficients in the objective function (see Remark 6.2.2), thus giving rise to identities which come in handy in the proof. A similar idea was used in [200], and it should be applicable to a wide range of other problems.
In this section, X ,Y1, . . . ,Ym are assumed to be Euclidean spaces of dimensions
n, n1, . . . , nm. To be specific about the notions of Gaussian optimality, we adopt some
terminologies from [98]:
Definition 6.2.1
• Extremisability: a certain supremization/infimization is finitely attained by some argument.
• Gaussian extremisability: a certain supremization/infimization is finitely attained by a Gaussian function/Gaussian distribution.
• Gaussian exhaustibility: the value of a certain supremization/infimization does not change when the arguments are restricted to the subclass of Gaussian functions/Gaussian distributions.
Most of the time, we will be able to prove Gaussian extremisability in a certain non-degenerate case, while showing Gaussian exhaustibility in general.
6.2.1 Optimization of Conditional Differential Entropies
Fix $\mathbf{M} \succeq 0$, $c_0 \in [0, \infty)$, $c_1, \dots, c_m \in (0, \infty)$, and Gaussian random transformations $Q_{\mathbf{Y}_j|\mathbf{X}}$ for $j \in \{1, \dots, m\}$. For each $P_{\mathbf{X}U}$,⁴ define

$$F(P_{\mathbf{X}U}) := h(\mathbf{X}|U) - \sum_{j=1}^m c_j h(\mathbf{Y}_j|U) - c_0 \operatorname{Tr}[\mathbf{M} \Sigma_{\mathbf{X}|U}], \quad (6.13)$$

where $\mathbf{X} \sim P_{\mathbf{X}}$ (the marginal of $P_{\mathbf{X}U}$) and $\mathbf{Y}_j$ has the distribution induced by $P_{\mathbf{X}} \to Q_{\mathbf{Y}_j|\mathbf{X}} \to P_{\mathbf{Y}_j}$. We have defined the differential entropy and the conditional differential entropies as

$$h(\mathbf{X}) := -D(P_{\mathbf{X}} \| \lambda); \quad (6.14)$$
$$h(\mathbf{X}|U = u) := -D(P_{\mathbf{X}|U=u} \| \lambda), \quad \forall u \in \mathcal{U}; \quad (6.15)$$
$$h(\mathbf{X}|U) := \int h(\mathbf{X}|U = u) \, \mathrm{d}P_U(u), \quad (6.16)$$

where $\lambda$ is the Lebesgue measure (of the same dimension as $\mathbf{X}$), and (6.16) is defined whenever the integral exists. Moreover, we have used the notation $\Sigma_{\mathbf{X}|U} := \mathbb{E}[\operatorname{Cov}(\mathbf{X}|U)]$ for the expectation of the conditional covariance matrix.
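As a sanity check on the sign convention $h(\mathbf{X}) = -D(P_{\mathbf{X}} \| \lambda)$ in (6.14), the following sketch (with an arbitrary illustrative variance, not from the text) integrates $-\int f \log f$ for a one-dimensional Gaussian density and compares with the closed form $\frac{1}{2}\log(2\pi e \sigma^2)$:

```python
import numpy as np

def trapezoid(y, x):
    # simple trapezoidal rule, avoiding dependence on a specific NumPy version
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

sigma2 = 1.7**2  # an arbitrary variance for illustration
xs = np.linspace(-12 * np.sqrt(sigma2), 12 * np.sqrt(sigma2), 200001)
pdf = np.exp(-xs**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# h(X) = -D(P_X || lambda) = -∫ f log f for P_X with density f w.r.t. Lebesgue
h_numeric = -trapezoid(pdf * np.log(pdf), xs)
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma2)
assert abs(h_numeric - h_closed) < 1e-6
```

The small discrepancy comes only from tail truncation and quadrature error.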
Definition 6.2.2 We say $(Q_{\mathbf{Y}_1|\mathbf{X}}, \dots, Q_{\mathbf{Y}_m|\mathbf{X}})$ is non-degenerate if each $Q_{\mathbf{Y}_j|\mathbf{X}=0}$ is an $n_j$-dimensional Gaussian distribution with an invertible covariance matrix.
In the non-degenerate case, we can show an extremal result for the following
optimization with a regularization on the covariance of the input.
Theorem 6.2.1 If $(Q_{\mathbf{Y}_1|\mathbf{X}}, \dots, Q_{\mathbf{Y}_m|\mathbf{X}})$ is non-degenerate, then $\sup_{P_{\mathbf{X}U}} \{F(P_{\mathbf{X}U}) : \Sigma_{\mathbf{X}|U} \preceq \Sigma\}$ is finite and is attained by a Gaussian $\mathbf{X}$ and a constant $U$. Moreover, the covariance of such an $\mathbf{X}$ is unique.
⁴In the case of standard Borel spaces, the conditional distribution $P_{\mathbf{X}|U}$ can be uniquely defined from the joint distribution $P_{\mathbf{X}U}$, $P_U$-almost surely; see e.g. [43].
In Theorem 6.2.1, we assume that the supremum is over $P_{\mathbf{X}U}$ such that $P_{\mathbf{X}|U=u}$ is absolutely continuous with respect to the Lebesgue measure (hence having a density function) for almost every $u$.⁵ Additionally, we adopt the following convention in all the optimization problems in Section 6.2 and Section 6.4, unless otherwise specified. This eliminates situations such as $\infty + \infty$ or $a + \infty$, which can be considered legitimate calculations but are technically difficult to deal with.
Convention 1 The sup or inf is taken over all arguments such that each term in the objective function (e.g. (6.13)) is well-defined and finite.
Proof of Theorem 6.2.1 Assume that both $P_{\mathbf{X}^{(1)}U^{(1)}}$ and $P_{\mathbf{X}^{(2)}U^{(2)}}$ are maximizers of (6.13) subject to $\Sigma_{\mathbf{X}|U} \preceq \Sigma$; the proof of the existence of a maximizer is deferred to Appendix 6.5. Let $(U^{(1)}, \mathbf{X}^{(1)}, \mathbf{Y}_1^{(1)}, \dots, \mathbf{Y}_m^{(1)}) \sim P_{\mathbf{X}^{(1)}U^{(1)}} Q_{\mathbf{Y}_1|\mathbf{X}} \cdots Q_{\mathbf{Y}_m|\mathbf{X}}$ and $(U^{(2)}, \mathbf{X}^{(2)}, \mathbf{Y}_1^{(2)}, \dots, \mathbf{Y}_m^{(2)}) \sim P_{\mathbf{X}^{(2)}U^{(2)}} Q_{\mathbf{Y}_1|\mathbf{X}} \cdots Q_{\mathbf{Y}_m|\mathbf{X}}$ be mutually independent. Define

$$\mathbf{X}^+ = \frac{1}{\sqrt{2}}\left(\mathbf{X}^{(1)} + \mathbf{X}^{(2)}\right), \qquad \mathbf{X}^- = \frac{1}{\sqrt{2}}\left(\mathbf{X}^{(1)} - \mathbf{X}^{(2)}\right). \quad (6.17)$$

Define $\mathbf{Y}_j^+$ and $\mathbf{Y}_j^-$ similarly for $j = 1, \dots, m$, and put $U = (U^{(1)}, U^{(2)})$. We now make three important observations:

1. First, due to the Gaussian nature of $Q_{\mathbf{Y}_j|\mathbf{X}}$, it is easily seen that $\mathbf{Y}_j^+ | \{\mathbf{X}^+ = \mathbf{x}^+, \mathbf{X}^- = \mathbf{x}^-, U = u\} \sim Q_{\mathbf{Y}_j|\mathbf{X}=\mathbf{x}^+}$, which is independent of $\mathbf{x}^-$. Thus $\mathbf{Y}_j^+ | \{\mathbf{X}^+ = \mathbf{x}, U = u\} \sim Q_{\mathbf{Y}_j|\mathbf{X}=\mathbf{x}}$ as well. Similarly, $\mathbf{Y}_j^- | \{\mathbf{X}^- = \mathbf{x}, U = u\} \sim Q_{\mathbf{Y}_j|\mathbf{X}=\mathbf{x}}$ for $j = 1, \dots, m$.
⁵Since the integral in (6.16) may not be well-defined in general, some authors have restricted attention to finite $\mathcal{U}$ in the optimization problems. In this chapter, we are allowed to drop this restriction as long as the integral in (6.16) is well-defined. These distinctions do not appear to make an essential difference for our purpose; see Footnote 13.
2. Second, observe that for each $u = (u_1, u_2)$ we can verify the algebra

$$\Sigma_{\mathbf{X}^+|U=u} = \mathbb{E}[(\mathbf{X}^+ - \mu_{\mathbf{X}^+|U})(\mathbf{X}^+ - \mu_{\mathbf{X}^+|U})^\top | U = u]$$
$$= \tfrac{1}{2}\mathbb{E}[(\mathbf{X}^{(1)} - \mu_{\mathbf{X}^{(1)}|U})(\mathbf{X}^{(1)} - \mu_{\mathbf{X}^{(1)}|U})^\top | U = u] + \tfrac{1}{2}\mathbb{E}[(\mathbf{X}^{(2)} - \mu_{\mathbf{X}^{(2)}|U})(\mathbf{X}^{(2)} - \mu_{\mathbf{X}^{(2)}|U})^\top | U = u] + \mathbb{E}[(\mathbf{X}^{(1)} - \mu_{\mathbf{X}^{(1)}|U})(\mathbf{X}^{(2)} - \mu_{\mathbf{X}^{(2)}|U})^\top | U = u]$$
$$= \tfrac{1}{2}\mathbb{E}[(\mathbf{X}^{(1)} - \mu_{\mathbf{X}^{(1)}|U^{(1)}})(\mathbf{X}^{(1)} - \mu_{\mathbf{X}^{(1)}|U^{(1)}})^\top | U = u] + \tfrac{1}{2}\mathbb{E}[(\mathbf{X}^{(2)} - \mu_{\mathbf{X}^{(2)}|U^{(2)}})(\mathbf{X}^{(2)} - \mu_{\mathbf{X}^{(2)}|U^{(2)}})^\top | U = u] + \mathbb{E}[(\mathbf{X}^{(1)} - \mu_{\mathbf{X}^{(1)}|U^{(1)}})(\mathbf{X}^{(2)} - \mu_{\mathbf{X}^{(2)}|U^{(2)}})^\top | U = u]. \quad (6.18)$$

The last term above vanishes upon averaging over $(u_1, u_2)$ because of the independence $(U^{(1)}, \mathbf{X}^{(1)}) \perp (U^{(2)}, \mathbf{X}^{(2)})$. Thus

$$\Sigma_{\mathbf{X}^+|U} = \tfrac{1}{2}\Sigma_{\mathbf{X}^{(1)}|U^{(1)}} + \tfrac{1}{2}\Sigma_{\mathbf{X}^{(2)}|U^{(2)}} \preceq \Sigma. \quad (6.19)$$

By the same token,

$$\Sigma_{\mathbf{X}^-|U} = \tfrac{1}{2}\Sigma_{\mathbf{X}^{(1)}|U^{(1)}} + \tfrac{1}{2}\Sigma_{\mathbf{X}^{(2)}|U^{(2)}} \preceq \Sigma. \quad (6.20)$$

These, combined with $\Sigma_{\mathbf{X}^-|\mathbf{X}^+ U} \preceq \Sigma_{\mathbf{X}^-|U}$, justify that both $P_{\mathbf{X}^+, U}$ and $P_{\mathbf{X}^-, U\mathbf{X}^+}$ satisfy the covariance constraint $\Sigma_{\mathbf{X}|U} \preceq \Sigma$ in the theorem.
3. Third, we have

$$\sum_{k=1}^2 \left[ h(\mathbf{X}^{(k)}|U^{(k)}) - \sum_{j=1}^m c_j h(\mathbf{Y}_j^{(k)}|U^{(k)}) \right] \quad (6.21)$$
$$= h(\mathbf{X}^{(1)}, \mathbf{X}^{(2)}|U) - \sum_{j=1}^m c_j h(\mathbf{Y}_j^{(1)}, \mathbf{Y}_j^{(2)}|U) \quad (6.22)$$
$$= h(\mathbf{X}^+, \mathbf{X}^-|U) - \sum_{j=1}^m c_j h(\mathbf{Y}_j^+, \mathbf{Y}_j^-|U) \quad (6.23)$$
$$= h(\mathbf{X}^+|U) - \sum_{j=1}^m c_j h(\mathbf{Y}_j^+|U) + h(\mathbf{X}^-|\mathbf{X}^+, U) - \sum_{j=1}^m c_j h(\mathbf{Y}_j^-|\mathbf{Y}_j^+, U) \quad (6.24)$$
$$\le h(\mathbf{X}^+|U) - \sum_{j=1}^m c_j h(\mathbf{Y}_j^+|U) + h(\mathbf{X}^-|\mathbf{X}^+, U) - \sum_{j=1}^m c_j h(\mathbf{Y}_j^-|\mathbf{X}^+, U), \quad (6.25)$$

where the final inequality follows from the Markov chain $\mathbf{Y}_j^- - \mathbf{X}^+ U - \mathbf{Y}_j^+$, which holds because the joint distribution factorizes as $P_{U\mathbf{X}^+\mathbf{X}^-\mathbf{Y}_j^+\mathbf{Y}_j^-} = P_{U\mathbf{X}^+\mathbf{X}^-} Q_{\mathbf{Y}_j|\mathbf{X}} Q_{\mathbf{Y}_j|\mathbf{X}}$.
Thus, we can conclude that

$$\sum_{k=1}^2 F(P_{\mathbf{X}^{(k)}U^{(k)}}) = \sum_{k=1}^2 \left[ h(\mathbf{X}^{(k)}|U^{(k)}) - \sum_{j=1}^m c_j h(\mathbf{Y}_j^{(k)}|U^{(k)}) - c_0 \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}^{(k)}|U^{(k)}}] \right] \quad (6.26)$$
$$\le h(\mathbf{X}^+|U) - \sum_{j=1}^m c_j h(\mathbf{Y}_j^+|U) - c_0 \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}^+|U}] + h(\mathbf{X}^-|\mathbf{X}^+, U) - \sum_{j=1}^m c_j h(\mathbf{Y}_j^-|\mathbf{X}^+, U) - c_0 \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}^-|\mathbf{X}^+, U}] \quad (6.27)$$
$$\le \sum_{k=1}^2 F(P_{\mathbf{X}^{(k)}U^{(k)}}), \quad (6.28)$$

where
where
• (6.27) follows from (6.19),(6.20) and (6.25);
• (6.28) follows since $P_{\mathbf{X}^+, U}$ and $P_{\mathbf{X}^-, U\mathbf{X}^+}$ are candidate optimizers of (6.13) subject to the given covariance constraint, whereas the $P_{\mathbf{X}^{(k)}, U^{(k)}}$ are the optimizers by assumption ($k = 1, 2$).

Then, equality must hold throughout (6.26)-(6.28), so both $P_{\mathbf{X}^+, U}$ and $P_{\mathbf{X}^-, U\mathbf{X}^+}$ (and also $P_{\mathbf{X}^+, U\mathbf{X}^-}$, by symmetry of the argument) are maximizers of (6.13) subject to $\Sigma_{\mathbf{X}|U} \preceq \Sigma$.
So far, we have considered fixed coefficients $c_0^m = (c_0, \dots, c_m)$. The same argument applies for coefficients on a line:

$$c_0^m(t) := t a_0^m, \quad t > 0 \quad (6.29)$$

for any fixed $a_0 \in [0, \infty)$, $a_1, \dots, a_m \in (0, \infty)$, and we next show several properties for a dense subset of this line. Applying Lemma 6.2.3 (following this proof) with

$$p(P_{\mathbf{X}U}) \leftarrow h(\mathbf{X}|U); \quad (6.30)$$
$$q(P_{\mathbf{X}U}) \leftarrow -\sum_{j=1}^m a_j h(\mathbf{Y}_j|U) - a_0 \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}|U}], \quad (6.31)$$
$$f(t) \leftarrow \max_{P_{\mathbf{X}U}} [p(P_{\mathbf{X}U}) + t q(P_{\mathbf{X}U})], \quad (6.32)$$

we obtain from the optimality of $P_{\mathbf{X}^+, U}$, $P_{\mathbf{X}^-, U\mathbf{X}^+}$ and $P_{\mathbf{X}^+, U\mathbf{X}^-}$ that

$$h(\mathbf{X}^+|U) = h(\mathbf{X}^-|\mathbf{X}^+, U) = h(\mathbf{X}^+|\mathbf{X}^-, U) \quad (6.33)$$

for almost all $t \in (0, \infty)$, where $P_{\mathbf{X}^+\mathbf{X}^-U}$ depends implicitly on $t$. Note that (6.33) implies that $I(\mathbf{X}^+; \mathbf{X}^-|U) = 0$, hence $\mathbf{X}^+$ and $\mathbf{X}^-$ are independent conditioned on $U$. Recall the following Skitovic-Darmois characterization of Gaussian distributions (with the extension to the vector Gaussian case in [92]):
Lemma 6.2.2 Let $\mathbf{A}_1$ and $\mathbf{A}_2$ be mutually independent $n$-dimensional random vectors. If $\mathbf{A}_1 + \mathbf{A}_2$ is independent of $\mathbf{A}_1 - \mathbf{A}_2$, then $\mathbf{A}_1$ and $\mathbf{A}_2$ are normally distributed with identical covariances.
Using Lemma 6.2.2, we can conclude that for almost all $t \in (0, \infty)$, $\mathbf{X}^{(i)}$ must be Gaussian, with covariance not depending on $U^{(i)}$, so $U^{(i)}$ can be chosen to be a constant ($i = 1, 2$). Thus for all such $t$,

$$f(t) = \max_{P_{\mathbf{X}U} : U = \text{const.}, \ \mathbf{X} \text{ Gaussian}} [p(P_{\mathbf{X}U}) + t q(P_{\mathbf{X}U})]. \quad (6.34)$$

Since both sides of (6.34) are concave in $t$, hence continuous on $(0, \infty)$, we see that (6.34) actually holds for all $t \in (0, \infty)$. The proof is completed since $a_0^m$ can be chosen arbitrarily.
Lemma 6.2.3 Let $p$ and $q$ be real-valued functions on an arbitrary set $D$. If $f(t) := \max_{x \in D}\{p(x) + t q(x)\}$ is always attained, then for almost all $t$, $f'(t)$ exists and

$$f'(t) = q(x^\star), \quad \forall x^\star \in \arg\max_{x \in D}\{p(x) + t q(x)\}. \quad (6.35)$$

In particular, for all such $t$, $q(x^\star) = q(x^{\star\star})$ and $p(x^\star) = p(x^{\star\star})$ for all $x^\star, x^{\star\star} \in \arg\max_{x \in D}\{p(x) + t q(x)\}$.

Geometrically, $f(t)$ is the support function [94] of the set $S := \{(p(x), q(x))\}_{x \in D}$ evaluated at $(1, t)$. Hence $f(\cdot)$ is convex, and the left and right derivatives are determined by the two extreme points of the intersection between $S$ and the supporting hyperplane.
Proof of Lemma 6.2.3 The function $f$ is convex since it is a pointwise supremum of affine functions, and is therefore differentiable almost everywhere. Moreover, $f'(t)$ (which is well-defined in the a.e. sense) is monotone increasing by convexity, and is therefore continuous almost everywhere.

Let $x_t^\star$ denote an arbitrary element of $\arg\max_{x \in D}\{p(x) + t q(x)\}$. By definition, for any $s \in \mathbb{R}$, $f(s) \ge p(x_t^\star) + s q(x_t^\star)$. Thus, for $\delta > 0$,

$$\frac{f(t + \delta) - f(t)}{\delta} \ge q(x_t^\star) \quad \text{and} \quad \frac{f(t) - f(t - \delta)}{\delta} \le q(x_t^\star). \quad (6.36)$$

If $f'(t)$ exists, that is, the left sides of the two inequalities above have the same limit $f'(t)$ as $\delta \downarrow 0$, then $f'(t) = q(x_t^\star)$. The second claim of the lemma follows immediately from the first.
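Lemma 6.2.3 is an envelope-type statement; on a finite domain it can be checked numerically. In the sketch below (random data, purely illustrative), the two-sided difference quotient of $f$ at a generic $t$ matches $q$ evaluated at the maximizer:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.standard_normal(50)  # p(x) on a 50-point domain D
q = rng.standard_normal(50)  # q(x) on the same domain

def f(t):
    # f(t) = max_{x in D} { p(x) + t q(x) }, always attained on a finite D
    return float(np.max(p + t * q))

t = 0.73  # a generic point; the piecewise-linear f is differentiable a.e.
x_star = int(np.argmax(p + t * q))
delta = 1e-6
numeric_derivative = (f(t + delta) - f(t - delta)) / (2 * delta)
assert abs(numeric_derivative - q[x_star]) < 1e-5  # f'(t) = q(x*)
```

Since $f$ is a maximum of finitely many affine functions, its slope between kinks is exactly $q(x^\star)$.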
Remark 6.2.1 Let us remark on an interesting connection between the above proof of the optimality of Gaussian random variables and Lieb's proof that Gaussian functions maximize Gaussian kernels. Recall that [89, Theorem 3.2] shows that for an operator $G$ given by a bivariate Gaussian kernel function and $p, q > 1$, the ratio $\frac{\|Gf\|_q}{\|f\|_p}$ is maximized by Gaussian $f$. First, a tensorization property is proved, implying that $f^\star(x_1) f^\star(x_2)$ is a maximizer for $G \otimes G$ if $f^\star$ is any maximizer of $G$. Then, Lieb made two important observations:

1. By a rotation invariance property of the Lebesgue measure/isotropic Gaussian measure, $f^\star\left(\frac{x_1 + x_2}{\sqrt{2}}\right) f^\star\left(\frac{x_1 - x_2}{\sqrt{2}}\right)$ is also a maximizer of $G \otimes G$.

2. An examination of the equality condition in the proof of the tensorization property reveals that any maximizer for $G \otimes G$ must be of a product form.

Thus Lieb concluded that $f^\star\left(\frac{x_1 + x_2}{\sqrt{2}}\right) f^\star\left(\frac{x_1 - x_2}{\sqrt{2}}\right) = \alpha(x_1) \beta(x_2)$ for some functions $\alpha$ and $\beta$, and $f^\star$ must be a Gaussian function. This is very similar to the above proof, once we think of $f^\star$ as the density function of $P^\star_{\mathbf{X}|U=u}$ in our proof.
Remark 6.2.2 Our proof technique essentially follows ideas of Geng and Nair [92][199], who established the Gaussian optimality for several information-theoretic regions. However, we also added the important ingredient of Lemma 6.2.3.⁶ That is,
⁶A similar idea of differentiating the coefficients has been used in [200].
by differentiating with respect to the linear coefficients, we can conveniently obtain information-theoretic identities which help us to conclude the conditional independence of $\mathbf{X}^+$ and $\mathbf{X}^-$ quickly. For fixed $m$, in principle, this may be avoided by trying various expansions of the two-letter quantities manually (e.g. as done in [199]), but that approach becomes increasingly complicated and unstructured as $m$ increases. Finally, we also note that a simple rotational invariance argument/doubling trick has been used to prove that the capacity-achieving distribution for an additive Gaussian channel is Gaussian (cf. [202, P36]); that argument does not involve Lemma 6.2.2, and its extension to problems involving auxiliary random variables is not clear.
If we drop the non-degeneracy assumption and the regularization $\Sigma_{\mathbf{X}|U} \preceq \Sigma$, it is quite possible that the optimization in Theorem 6.2.1 is nonfinite and/or not attained by any $P_{U\mathbf{X}}$. In this case, we can show that the optimization is exhausted by Gaussian distributions. To state the result conveniently, for any $P_{\mathbf{X}}$, define

$$F_0(P_{\mathbf{X}}) := h(\mathbf{X}) - \sum_{j=1}^m c_j h(\mathbf{Y}_j) - c_0 \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}}], \quad (6.37)$$

where $(\mathbf{X}, \mathbf{Y}_j) \sim P_{\mathbf{X}} Q_{\mathbf{Y}_j|\mathbf{X}}$. Clearly, $F(P_{\mathbf{X}U}) = F_0(P_{\mathbf{X}})$ when $U$ is constant.
Theorem 6.2.4 In the general (possibly degenerate) case,

1. For any given positive semidefinite $\Sigma$,

$$\sup_{P_{\mathbf{X}U} : \Sigma_{\mathbf{X}|U} \preceq \Sigma} F(P_{\mathbf{X}U}) = \sup_{P_{\mathbf{X}} \text{ Gaussian}, \ \Sigma_{\mathbf{X}} \preceq \Sigma} F_0(P_{\mathbf{X}}) \quad (6.38)$$

where the left side of (6.38) follows Convention 1.

2.

$$\sup_{P_{\mathbf{X}U}} F(P_{\mathbf{X}U}) = \sup_{P_{\mathbf{X}} \text{ Gaussian}} F_0(P_{\mathbf{X}}). \quad (6.39)$$
Remark 6.2.3 Theorem 6.2.4 reduces an infinite-dimensional optimization problem to a finite-dimensional one. In particular, in the degenerate case we can verify that the left side of (6.38) is extremisable (Definition 6.2.1) if the right side of (6.38) is extremisable.
Proof of Theorem 6.2.4 Note that the Gaussian random transformation $Q_{\mathbf{Y}_j|\mathbf{X}}$ can be realized as a linear transformation $\mathbf{B}_j$ to $\mathbb{R}^{n_j}$ followed by the addition of an independent Gaussian noise of covariance $\Sigma_j$. We will assume that $\Sigma_j$ is non-degenerate on the orthogonal complement of the image of $\mathbf{B}_j$, since otherwise both sides of (6.38) are $+\infty$. In particular, if $Q_{\mathbf{Y}_j|\mathbf{X}}$ is a deterministic linear transform $\mathbf{B}_j$, then it must be onto $\mathbb{R}^{n_j}$.
1. We can use the argument at the beginning of Appendix 6.5 to show that the left side of (6.38) equals

$$\sup_{P_{\mathbf{X}U} : \ \mathcal{U} = \{1, \dots, D\}, \ \Sigma_{\mathbf{X}|U} \preceq \Sigma} F(P_{\mathbf{X}U}) \quad (6.40)$$

where $D$ is a fixed integer (depending only on the dimension of $\Sigma$).

Next, we argue that it is without loss of generality to add the restriction to (6.40) that $P_{\mathbf{X}|U=u}$ has a smooth density with compact support for each $u \in \{1, \dots, D\}$, in which case each $P_{\mathbf{Y}_j|U=u}$ will have a smooth density by the assumption made at the beginning of the proof. For this, define $F_0(\cdot)$ as in (6.37); we will show that for any $P_{\mathbf{X}}$ absolutely continuous with respect to the Lebesgue measure and any $\varepsilon > 0$, we can choose a distribution $P_{\mathbf{X}''}$ whose density is smooth with compact support such that

$$F_0(P_{\mathbf{X}''}) \ge F_0(P_{\mathbf{X}}) - \varepsilon \quad (6.41)$$

and

$$\Sigma_{\mathbf{X}''} \preceq (1 + \varepsilon) \Sigma_{\mathbf{X}}. \quad (6.42)$$
First, by conditioning $\mathbf{X}$ on a large enough compact set and applying the dominated convergence theorem, there exists $P_{\mathbf{X}'}$ which is supported on a compact set and whose density $f_{\mathbf{X}'}$ is bounded, such that each term in the definition of $F_0(P_{\mathbf{X}})$ is well-approximated when $\mathbf{X}$ is replaced with $\mathbf{X}'$, so that

$$F_0(P_{\mathbf{X}'}) \ge F_0(P_{\mathbf{X}}) - \frac{\varepsilon}{2} \quad (6.43)$$

and

$$\Sigma_{\mathbf{X}'} \preceq \sqrt{1 + \varepsilon}\, \Sigma_{\mathbf{X}}. \quad (6.44)$$

Second, we can pick $P_{\mathbf{X}''}$ with smooth density with compact support such that $\|f_{\mathbf{X}''} - f_{\mathbf{X}'}\|_1$ can be made arbitrarily small, which implies that $\|f_{\mathbf{Y}_j''} - f_{\mathbf{Y}_j'}\|_1$ is small by Jensen's inequality, and that $\operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}''}] - \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}'}]$ is small by the fact that the densities are supported on a compact set. Since $x \mapsto x \log x$ is Lipschitz on a bounded interval, the differential entropy terms can be made close as well (here we used the boundedness of $f_{\mathbf{X}'}$). As a result, we can ensure that

$$F_0(P_{\mathbf{X}''}) \ge F_0(P_{\mathbf{X}'}) - \frac{\varepsilon}{2} \quad (6.45)$$

and

$$\Sigma_{\mathbf{X}''} \preceq \sqrt{1 + \varepsilon}\, \Sigma_{\mathbf{X}'}, \quad (6.46)$$
which, combined with (6.43) and (6.44), yield (6.41)-(6.42). This and the observation in (6.40) imply that

$$\inf_{\varepsilon > 0} \sup_{P_{\mathbf{X}U} \in \mathcal{C}_\varepsilon} F(P_{\mathbf{X}U}) \le \sup_{P_{\mathbf{X}U} : \Sigma_{\mathbf{X}|U} \preceq \Sigma} F(P_{\mathbf{X}U}) \quad (6.47)$$

where $\mathcal{C}_\varepsilon$ denotes the set of $P_{\mathbf{X}U}$ such that $\mathcal{U} = \{1, \dots, D\}$, $\Sigma_{\mathbf{X}|U} \preceq (1 + \varepsilon)\Sigma$, and the density of $P_{\mathbf{X}|U=u}$ is smooth with compact support for each $u$. In the final steps, we write $F$ as $F^Q$ to indicate its dependence on $(Q_{\mathbf{Y}_1|\mathbf{X}}, \dots, Q_{\mathbf{Y}_m|\mathbf{X}})$, and define $\mathcal{Q}$ as the set of non-degenerate $(Q_{\tilde{\mathbf{Y}}_1|\mathbf{X}}, \dots, Q_{\tilde{\mathbf{Y}}_m|\mathbf{X}})$ where each $\tilde{\mathbf{Y}}_j$ is obtained by adding an independent Gaussian noise with covariance $\tilde{\Sigma}_j$ to $\mathbf{Y}_j$ (here the random transformation from $\mathbf{X}$ to $\mathbf{Y}_j$ is fixed and $\tilde{\Sigma}_j$ runs over the set of positive definite matrices of the given dimension, $j = 1, \dots, m$). Then

$$\sup_{P_{\mathbf{X}U} \in \mathcal{C}_\varepsilon} F^Q(P_{\mathbf{X}U}) = \sup_{P_{\mathbf{X}U} \in \mathcal{C}_\varepsilon} \sup_{\tilde{Q} \in \mathcal{Q}} F^{\tilde{Q}}(P_{\mathbf{X}U}) \quad (6.48)$$
$$= \sup_{\tilde{Q} \in \mathcal{Q}} \sup_{P_{\mathbf{X}U} \in \mathcal{C}_\varepsilon} F^{\tilde{Q}}(P_{\mathbf{X}U}) \quad (6.49)$$
$$= \sup_{\tilde{Q} \in \mathcal{Q}} \sup_{P_{\mathbf{X}} \text{ Gaussian}, \ \Sigma_{\mathbf{X}} \preceq (1 + \varepsilon)\Sigma} F_0^{\tilde{Q}}(P_{\mathbf{X}}) \quad (6.50)$$
$$= \sup_{P_{\mathbf{X}} \text{ Gaussian}, \ \Sigma_{\mathbf{X}} \preceq (1 + \varepsilon)\Sigma} \sup_{\tilde{Q} \in \mathcal{Q}} F_0^{\tilde{Q}}(P_{\mathbf{X}}) \quad (6.51)$$
$$= \sup_{P_{\mathbf{X}} \text{ Gaussian}, \ \Sigma_{\mathbf{X}} \preceq (1 + \varepsilon)\Sigma} F_0^Q(P_{\mathbf{X}}) \quad (6.52)$$

where
where
• (6.48) and (6.52) are because, first, hpYjq ě hpYjq by the entropy power
inequality; second, infΣją0 hpYjq “ hpYjq. The proof of the second claim
is standard, since Yj has smooth density with fast (Gaussian like) decay,
242
and we can obtain pointwise convergence of the density function and apply
dominated convergence theorem.7
• (6.50) is from the Gaussian extremality in the non-degenerate case.
Then (6.47) and (6.52) give

$$\sup_{P_{\mathbf{X}U} : \Sigma_{\mathbf{X}|U} \preceq \Sigma} F(P_{\mathbf{X}U}) \le \sup_{\varepsilon > 0} \sup_{P_{\mathbf{X}} \text{ Gaussian}, \ \Sigma_{\mathbf{X}} \preceq (1 + \varepsilon)\Sigma} F_0^Q(P_{\mathbf{X}}) \quad (6.53)$$
$$= \sup_{P_{\mathbf{X}} \text{ Gaussian}, \ \Sigma_{\mathbf{X}} \preceq \Sigma} F_0^Q(P_{\mathbf{X}}) \quad (6.54)$$

where (6.54) is a property that can be verified without much difficulty for Gaussian distributions. Thus the $\le$ part of (6.38) is established. The other direction is immediate from the definition.
2. First, observe that it is without loss of generality to assume that $U$ is constant, that is,

$$\sup_{P_{\mathbf{X}U}} F(P_{\mathbf{X}U}) = \sup_{P_{\mathbf{X}}} F_0(P_{\mathbf{X}}). \quad (6.55)$$

Next, by the same argument as in 1), for any $P_{\mathbf{X}}$ and $\varepsilon > 0$, there exists $P_{\mathbf{X}'}$ such that $\|\mathbf{X}'\| < M$ with probability one for some finite $M > 0$, and such that

$$F_0(P_{\mathbf{X}'}) \ge F_0(P_{\mathbf{X}}) - \varepsilon. \quad (6.56)$$
⁷Such continuity in the variance of the additive noise can fail terribly when $f_{\mathbf{X}}$ does not have the decay properties; in [203, Proposition 4], Bobkov and Chistyakov provided an example of a random vector $\mathbf{X}$ with finite differential entropy such that $h(\mathbf{X} + \mathbf{Z}) = \infty$ for each $\mathbf{Z}$ independent of $\mathbf{X}$ and having finite differential entropy. In their example, $\sup_{x : \|x\| \ge r} f_{\mathbf{X}}(x) = 1$ for every $r > 0$, and we cannot obtain a dominating function for $|f_{\mathbf{X}+\mathbf{Z}} \log f_{\mathbf{X}+\mathbf{Z}}|$ to apply the dominated convergence theorem.
Then

$$F_0(P_{\mathbf{X}'}) \le \sup_{\Sigma \succeq 0} \sup_{P_{\mathbf{X}U} : \Sigma_{\mathbf{X}} \preceq \Sigma} F(P_{\mathbf{X}U}) \quad (6.57)$$
$$\le \sup_{\Sigma \succeq 0} \sup_{P_{\mathbf{X}} \text{ Gaussian}, \ \Sigma_{\mathbf{X}} \preceq \Sigma} F_0(P_{\mathbf{X}}) \quad (6.58)$$
$$= \sup_{P_{\mathbf{X}} \text{ Gaussian}} F_0(P_{\mathbf{X}}) \quad (6.59)$$

where (6.58) was established in (6.54). Finally, (6.55), (6.56), (6.59) and the arbitrariness of $\varepsilon$ give

$$\sup_{P_{\mathbf{X}U}} F(P_{\mathbf{X}U}) \le \sup_{P_{\mathbf{X}} \text{ Gaussian}} F_0(P_{\mathbf{X}}). \quad (6.60)$$

Thus the $\le$ part of (6.39) is established. The other direction is trivial from the definition.
6.2.2 Optimization of Mutual Informations
Let $\mathbf{X}, \mathbf{Y}_1, \dots, \mathbf{Y}_m$ be jointly Gaussian vectors, $a_0, \dots, a_m$ be nonnegative real numbers, and $\mathbf{M}$ be a positive-semidefinite matrix. We are interested in minimizing $I(U; \mathbf{X})$ over $P_{U|\mathbf{X}}$ (that is, the Markov chain $U - \mathbf{X} - \mathbf{Y}^m$ must hold) subject to $I(U; \mathbf{Y}_i) \ge a_i$, $i \in \{1, \dots, m\}$, and $\operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}|U}] \le a_0$. This is relevant to many problems in information theory, including the Gray-Wyner network [6] and some common randomness/key generation problems [101] (to be discussed in Section 6.4.5). We have the following result for the Lagrange dual of such an optimization problem:
Theorem 6.2.5 Fix $\mathbf{M} \succeq 0$, positive constants $c_0, c_1, \dots, c_m$, and jointly Gaussian vectors $(\mathbf{X}, \mathbf{Y}_1, \dots, \mathbf{Y}_m) \sim Q_{\mathbf{X}\mathbf{Y}_1 \dots \mathbf{Y}_m}$. Define the function

$$G(P_{U|\mathbf{X}}) := \sum_{j=1}^m c_j I(\mathbf{Y}_j; U) - I(\mathbf{X}; U) - c_0 \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}|U}]. \quad (6.61)$$

Then

1. If $(Q_{\mathbf{Y}_1|\mathbf{X}}, \dots, Q_{\mathbf{Y}_m|\mathbf{X}})$ is non-degenerate, then $\sup_{P_{U|\mathbf{X}}} G(P_{U|\mathbf{X}})$ is achieved by a $U^\star$ for which $\mathbf{X}|\{U^\star = u\}$ is normal for $P_{U^\star}$-a.e. $u$, with covariance not depending on $u$. This can be realized by a Gaussian vector $U^\star$ of the same dimension as $\mathbf{X}$.

2. In general, $\sup_{P_{U|\mathbf{X}}} G(P_{U|\mathbf{X}}) = \sup_{P_{U|\mathbf{X}} : \ (U, \mathbf{X}) \text{ is Gaussian}} G(P_{U|\mathbf{X}})$.
Proof Set $\Sigma := \Sigma_{\mathbf{X}}$. Note that for any $P_{U|\mathbf{X}}$, we have

$$\sum_{j=1}^m c_j I(\mathbf{Y}_j; U) - I(\mathbf{X}; U) - c_0 \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}|U}] \quad (6.62)$$
$$= \sum_{j=1}^m c_j h(\mathbf{Y}_j) - h(\mathbf{X}) + \left( h(\mathbf{X}|U) - \sum_{j=1}^m c_j h(\mathbf{Y}_j|U) - c_0 \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}|U}] \right) \quad (6.63)$$
$$\le \sum_{j=1}^m c_j h(\mathbf{Y}_j) - h(\mathbf{X}) + \sup_{P_{\mathbf{X}'U'} : \Sigma_{\mathbf{X}'|U'} \preceq \Sigma} F(P_{\mathbf{X}'U'}). \quad (6.64)$$

In the non-degenerate case, by Theorem 6.2.1, the supremum in the last line is attained by a constant $U'$ and $\mathbf{X}' \sim \mathcal{N}(0, \Sigma')$, where $\Sigma' \preceq \Sigma$. Hence, taking $U^\star \sim \mathcal{N}(0, \Sigma - \Sigma')$ and $P_{\mathbf{X}|U^\star = u} \sim \mathcal{N}(u, \Sigma')$, we have $P_{\mathbf{X}} \sim \mathcal{N}(0, \Sigma)$ as required, and (6.64) reads

$$\sum_{j=1}^m c_j I(\mathbf{Y}_j; U) - I(\mathbf{X}; U) - c_0 \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}|U}] \le \sum_{j=1}^m c_j I(\mathbf{Y}_j; U^\star) - I(\mathbf{X}; U^\star) - c_0 \operatorname{Tr}[\mathbf{M}\Sigma_{\mathbf{X}|U^\star}], \quad (6.65)$$

and 1) follows since $P_{U|\mathbf{X}}$ is arbitrary. The claim for the general (possibly degenerate) case follows by invoking Theorem 6.2.4 and using a similar argument.
6.2.3 Optimization of Differential Entropies
The functional $F_0(\cdot)$ defined in Section 6.2.1 is a linear combination of differential entropies and second-order moments, which is closely related to the Brascamp-Lieb inequality. Immediately from Theorem 6.2.1 and Theorem 6.2.4, we have the following result regarding the maximization of $F_0(\cdot)$:

Corollary 6.2.6 For any $\Sigma \succeq 0$,

$$\sup_{P_{\mathbf{X}} : \Sigma_{\mathbf{X}} \preceq \Sigma} F_0(P_{\mathbf{X}}) = \sup_{P_{\mathbf{X}} \text{ Gaussian}, \ \Sigma_{\mathbf{X}} \preceq \Sigma} F_0(P_{\mathbf{X}}); \quad (6.66)$$
$$\sup_{P_{\mathbf{X}}} F_0(P_{\mathbf{X}}) = \sup_{P_{\mathbf{X}} \text{ Gaussian}} F_0(P_{\mathbf{X}}), \quad (6.67)$$

where $F_0(\cdot)$ is defined in (6.37). In addition, in the non-degenerate case (6.66) is always finite and (up to a translation) uniquely achieved by a Gaussian $P_{\mathbf{X}}$; (6.67) is finite and (up to a translation) uniquely achieved by a Gaussian $P_{\mathbf{X}}$ if $\mathbf{M}$ in (6.37) is positive definite.
Remark 6.2.4 Since the left side of (6.38) is obviously concave as a function of $\Sigma$, (6.38) and (6.66) combined show that the left side of (6.66) is concave in $\Sigma$. This is not obvious from the definition; in particular, it is not clear whether the concavity still holds for non-Gaussian $Q_{\mathbf{X}}$ and $Q_{\mathbf{Y}_j|\mathbf{X}}$.
6.3 Gaussian Optimality in the Forward-Reverse Brascamp-Lieb Inequality

In this section, some notation and terminology from Sections 2.4 and 6.2 will be used. Moreover, consider the following parameters/data:

• Fix Lebesgue measures $(\mu_j)_{j=1}^m$ and Gaussian measures $(\nu_i)_{i=1}^l$ on $\mathbb{R}$;

• non-degenerate (Definition 6.2.2) linear Gaussian random transformations $(P_{Y_j|X})_{j=1}^m$ (where $X := (X_1, \dots, X_l)$) associated with conditional expectation operators $(T_j)_{j=1}^m$;

• positive $(c_j)$ and $(b_i)$.

Given Borel measures $P_{X_i}$ on $\mathbb{R}$, $i = 1, \dots, l$, define

$$F_0((P_{X_i})) := \inf_{P_X} \sum_{j=1}^m c_j D(P_{Y_j} \| \mu_j) - \sum_{i=1}^l b_i D(P_{X_i} \| \nu_i) \quad (6.68)$$

where the infimum is over Borel measures $P_X$ that have $(P_{X_i})$ as marginals. The aim of this section is to prove the following:
Theorem 6.3.1 $\sup_{(P_{X_i})} F_0((P_{X_i}))$, where the supremum is over Borel measures $P_{X_i}$ on $\mathbb{R}$, $i = 1, \dots, l$, is achieved by some Gaussian $(P_{X_i})_{i=1}^l$.

Naturally, one would expect that Gaussian optimality can be established when $(\mu_j)_{j=1}^m$ and $(\nu_i)_{i=1}^l$ are either Gaussian or Lebesgue. We made the assumption that the former are Lebesgue and the latter are Gaussian so that certain technical conditions can be justified conveniently, while the crux of the matter, the tensorization steps, can still be demonstrated. More precisely, we have the following observation:
Proposition 6.3.2 $\sup_{(P_{X_i})} F_0((P_{X_i}))$ is finite, and there exist $\sigma_i^2 \in (0, \infty)$, $i = 1, \dots, l$, such that it equals

$$\sup_{(P_{X_i}) : \ \mathbb{E}[X_i^2] \le \sigma_i^2} F_0((P_{X_i})). \quad (6.69)$$

Proof When $\mu_j$ is Lebesgue and $P_{Y_j|X}$ is non-degenerate, $D(P_{Y_j} \| \mu_j) = -h(P_{Y_j}) \le -h(P_{Y_j|X})$ is bounded above (in terms of the variance of the additive noise of $P_{Y_j|X}$). Moreover, $D(P_{X_i} \| \nu_i) \ge 0$ when $\nu_i$ is Gaussian, so $\sup_{(P_{X_i})} F_0((P_{X_i})) < \infty$. Further, choosing $(P_{X_i}) = (\nu_i)$ and applying a covariance argument to lower bound the first term in (6.68) shows that $\sup_{(P_{X_i})} F_0((P_{X_i})) > -\infty$.

To see (6.69), notice that

$$D(P_{X_i} \| \nu_i) = D(P_{X_i} \| \nu_i') + \mathbb{E}[\imath_{\nu_i' \| \nu_i}(X_i)] \quad (6.70)$$
$$= D(P_{X_i} \| \nu_i') + D(\nu_i' \| \nu_i) \quad (6.71)$$
$$\ge D(\nu_i' \| \nu_i) \quad (6.72)$$

where $\nu_i'$ is the Gaussian distribution with the same first and second moments as $X_i \sim P_{X_i}$. Thus $D(P_{X_i} \| \nu_i)$ is bounded below by some function of the second moment of $X_i$ which tends to $\infty$ as the second moment of $X_i$ tends to $\infty$. Moreover, as argued in the preceding paragraph, the first term in (6.68) is bounded above by some constant depending only on $(P_{Y_j|X})$. Thus, we can choose $\sigma_i^2 > 0$, $i = 1, \dots, l$, large enough such that if $\mathbb{E}[X_i^2] > \sigma_i^2$ for some $i$, then $F_0((P_{X_i})) < \sup_{(P_{X_i})} F_0((P_{X_i}))$, irrespective of the choices of $P_{X_1}, \dots, P_{X_{i-1}}, P_{X_{i+1}}, \dots, P_{X_l}$. Then these $\sigma_1, \dots, \sigma_l$ are as desired in the proposition.
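The identity behind (6.70)-(6.71), namely $D(P \| \nu) = D(P \| \nu') + D(\nu' \| \nu)$ when $\nu'$ is Gaussian with the first two moments of $P$ and $\nu$ is Gaussian, holds because $\log \frac{\mathrm{d}\nu'}{\mathrm{d}\nu}$ is a quadratic whose expectation under $P$ depends only on those moments. A numeric sketch (with an arbitrary Gaussian-mixture $P$, not from the text):

```python
import numpy as np

def trapezoid(y, x):
    # simple trapezoidal rule for numerical integration
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def gauss(x, m, s2):
    return np.exp(-(x - m)**2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

xs = np.linspace(-12.0, 12.0, 400001)
# P: an arbitrary two-component Gaussian mixture; nu: standard Gaussian
p = 0.3 * gauss(xs, -1.0, 0.5) + 0.7 * gauss(xs, 2.0, 1.5)
nu = gauss(xs, 0.0, 1.0)

# nu': the Gaussian with the same first and second moments as P
m1 = trapezoid(xs * p, xs)
s2 = trapezoid((xs - m1)**2 * p, xs)
nu_p = gauss(xs, m1, s2)

def D(a, b):
    # relative entropy between densities a and b on the grid
    return trapezoid(a * np.log(a / b), xs)

# D(P||nu) = D(P||nu') + D(nu'||nu), up to truncation/quadrature error
assert abs(D(p, nu) - (D(p, nu_p) + D(nu_p, nu))) < 1e-4
```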
As we saw in Section 6.2, the regularization ensures that the supremum is achieved,
although it might be possible to prove Gaussian exhaustibility results by taking limits.
Proposition 6.3.3 1. For any $(P_{X_i})_{i=1}^l$, the infimum in (6.68) is attained.

2. If $(P_{Y_j|X})_{j=1}^m$ are non-degenerate (Definition 6.2.2), then the supremum in (6.69) is achieved by some $(P_{X_i})_{i=1}^l$.

Proof 1. For any $\varepsilon > 0$, by the continuity of measure there exists $K > 0$ such that

$$P_{X_i}([-K, K]) \ge 1 - \frac{\varepsilon}{l}, \quad i = 1, \dots, l. \quad (6.73)$$

By the union bound,

$$P_X([-K, K]^l) \ge 1 - \varepsilon \quad (6.74)$$

whenever $P_X$ is a coupling of $(P_{X_i})$. Now let $P_X^{(n)}$, $n = 1, 2, \dots$, be such that

$$\lim_{n \to \infty} \sum_{j=1}^m c_j D(P_{Y_j}^{(n)} \| \mu_j) = \inf_{P_X} \sum_{j=1}^m c_j D(P_{Y_j} \| \mu_j) \quad (6.75)$$

where $P_{Y_j} := T_j^* P_X$, $j = 1, \dots, m$. The sequence $(P_X^{(n)})$ is tight by (6.74); thus, invoking Prokhorov's theorem and passing to a subsequence, we may assume that $(P_X^{(n)})$ converges weakly to some $P_X^\star$. Therefore $P_{Y_j}^{(n)}$ converges weakly to $P_{Y_j}^\star$, and by the semicontinuity property in Lemma 6.5.2 we have

$$\sum_{j=1}^m c_j D(P_{Y_j}^\star \| \mu_j) \le \lim_{n \to \infty} \sum_{j=1}^m c_j D(P_{Y_j}^{(n)} \| \mu_j), \quad (6.76)$$

establishing that $P_X^\star$ is an infimizer.
2. Suppose $(P_{X_i}^{(n)})_{1 \le i \le l, n \ge 1}$ is such that $\mathbb{E}[X_i^2] \le \sigma_i^2$, $X_i \sim P_{X_i}^{(n)}$, where $(\sigma_i)$ is as in Proposition 6.3.2, and

$$\lim_{n \to \infty} F_0\left((P_{X_i}^{(n)})_{i=1}^l\right) = \sup_{(P_{X_i}) : \ \mathbb{E}[X_i^2] \le \sigma_i^2} F_0((P_{X_i})_{i=1}^l). \quad (6.77)$$

The regularization on the covariance implies that for each $i$, $(P_{X_i}^{(n)})_{n \ge 1}$ is a tight sequence. Thus, upon the extraction of subsequences, we may assume that for each $i$, $(P_{X_i}^{(n)})_{n \ge 1}$ converges to some $P_{X_i}^\star$, and a simple truncation and min-max inequality argument (see e.g. (6.169)) shows that $\mathbb{E}[X_i^2] \le \sigma_i^2$, $X_i \sim P_{X_i}^\star$. Then by Lemma 6.5.2,

$$\sum_i b_i D(P_{X_i}^\star \| \nu_i) \le \lim_{n \to \infty} \sum_i b_i D(P_{X_i}^{(n)} \| \nu_i). \quad (6.78)$$

Under the covariance regularization and the non-degenerateness assumption, we showed in Proposition 6.3.2 that the value of (6.69) cannot be $+\infty$ or $-\infty$. This implies that we can assume (by passing to a subsequence) that $P_{X_i}^{(n)} \ll \lambda$, $i = 1, \dots, l$, since otherwise $F_0((P_{X_i})) = -\infty$. Moreover, since $\left(\sum_j c_j D(P_{Y_j}^{(n)} \| \mu_j)\right)_{n \ge 1}$ is bounded above under the non-degenerateness assumption, the sequence $\left(\sum_i b_i D(P_{X_i}^{(n)} \| \nu_i)\right)_{n \ge 1}$ must also be bounded from above, which implies, using (6.78), that

$$\sum_i b_i D(P_{X_i}^\star \| \nu_i) < \infty. \quad (6.79)$$

In particular, we have $P_{X_i}^\star \ll \lambda$ for each $i$. Let $P_X^\star$ be an infimizer as in Part 1) for the marginals $(P_{X_i}^\star)$. In view of Lemma 6.3.4, by possibly passing to subsequences we can assume that each $(P_{X_i}^{(n)})_{i=1}^l$ admits a coupling $P_X^{(n)}$ such that $P_X^{(n)} \to P_X^\star$ weakly as $n \to \infty$. In the non-degenerate case the output differential entropy is weakly continuous in the input distribution under the covariance constraint (see for example [92, Proposition 18]), which establishes that

$$\inf_{P_X : \ S_i^* P_X = P_{X_i}^\star} \sum_j c_j D(T_j^* P_X \| \mu_j) = \sum_j c_j D(P_{Y_j}^\star \| \mu_j) \quad (6.80)$$
$$= \lim_{n \to \infty} \sum_j c_j D(P_{Y_j}^{(n)} \| \mu_j) \quad (6.81)$$
$$\ge \lim_{n \to \infty} \inf_{P_X : \ S_i^* P_X = P_{X_i}^{(n)}} \sum_j c_j D(T_j^* P_X \| \mu_j). \quad (6.82)$$

Thus (6.78) and (6.82) show that $(P_{X_i}^\star)$ is in fact a maximizer.
Lemma 6.3.4 Suppose that for each $i = 1, \dots, l$ ($l \ge 2$), $P_{X_i}$ is a Borel measure on $\mathbb{R}$ and $P_{X_i}^{(n)}$ converges weakly to $P_{X_i}$ as $n \to \infty$. If $P_X$ is a coupling of $(P_{X_i})_{1 \le i \le l}$, then, upon extraction of a subsequence, there exist couplings $P_X^{(n)}$ of $(P_{X_i}^{(n)})_{1 \le i \le l}$ which converge weakly to $P_X$ as $n \to \infty$.
Remark 6.3.1 Lemma 6.3.4 will be used to show that the infimum of an upper semicontinuous (w.r.t. the joint distribution) functional over couplings is also upper semicontinuous (w.r.t. the marginal distributions), which is the key to the proof of Proposition 6.3.3. Another application is to prove the weak continuity of the optimal transport cost (the lower semicontinuity part being trivial) in the theory of optimal transportation, which may also be proved using the "stability of optimal transport" in [57, Theorem 5.20]. However, we note that the approach in [57, Theorem 5.20] relies on cyclic monotonicity, and hence cannot be extended to the setting of general upper semicontinuous functionals such as in Corollary 6.3.5.
Corollary 6.3.5 (Weak stability of optimal coupling) In the case of non-degenerate $(P_{Y_j|X})$, suppose for each $n \ge 1$, $P_{X_i}^{(n)}$ is a Borel measure on $\mathbb{R}$, $i = 1, \dots, l$, whose second moment is bounded by $\sigma_i^2 < \infty$. Assume that $P_X^{(n)}$ is a coupling of $(P_{X_i}^{(n)})$ that minimizes $\sum_{j=1}^m c_j D(P_{Y_j}^{(n)} \| \mu_j)$. If $P_X^{(n)}$ converges weakly to some $P_X^\star$, then $P_X^\star$ minimizes $\sum_{j=1}^m c_j D(P_{Y_j}^\star \| \mu_j)$ given the marginals $(P_{X_i}^\star)$.
Proof Lemma 6.3.4 and the fact that the differential entropy of the output of a non-degenerate Gaussian channel is weakly semicontinuous with respect to the input distribution under a moment constraint (see e.g. [92, Proposition 18], [204, Theorem 7], or [205, Theorem 1, Theorem 2]) imply that there exists a strictly increasing sequence of positive integers $(n_k)_{k \ge 1}$ such that

$$\min_{P_X : \ S_i^* P_X = P_{X_i}^\star} \sum_j c_j D(T_j^* P_X \| \mu_j) \ge \limsup_{k \to \infty} \min_{P_X : \ S_i^* P_X = P_{X_i}^{(n_k)}} \sum_j c_j D(T_j^* P_X \| \mu_j). \quad (6.83)$$

On the other hand, the assumption on the convergence of the optimal couplings and Lemma 6.5.2 imply that

$$\sum_j c_j D(T_j^* P_X^\star \| \mu_j) \le \liminf_{k \to \infty} \min_{P_X : \ S_i^* P_X = P_{X_i}^{(n_k)}} \sum_j c_j D(T_j^* P_X \| \mu_j), \quad (6.84)$$

so the left side of (6.84) is no larger than the left side of (6.83).
Proof of Lemma 6.3.4 For each integer $k \ge 1$, define the random variable $W_i^{[k]} := \phi_k(X_i)$, where $\phi_k$ is the following "dyadic quantization function":
$$\phi_k\colon x \in \mathbb{R} \mapsto \begin{cases} \lfloor 2^k x \rfloor & |x| \le k,\ x \notin 2^{-k}\mathbb{Z};\\ e & \text{otherwise}, \end{cases} \tag{6.85}$$
and let $W^{[k]} := (W_i^{[k]})_{i=1}^l$. Denote by $\mathcal{W}^{[k]} := \{-k2^k, \dots, k2^k - 1, e\}$ the alphabet of $W_i^{[k]}$.
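The quantizer (6.85) is elementary but easy to get wrong at the dyadic boundary; a minimal Python sketch is below (the function name `phi_k` and the string `'e'` standing in for the erasure symbol are our choices, not part of the proof):

```python
import math

def phi_k(x: float, k: int):
    """Dyadic quantizer of (6.85): on [-k, k], map x to floor(2^k x),
    except that dyadic points (x in 2^{-k} Z) and points with |x| > k
    are sent to the erasure symbol, encoded here as the string 'e'."""
    y = x * 2 ** k
    if abs(x) <= k and y != math.floor(y):   # i.e., x is not in 2^{-k} Z
        return math.floor(y)
    return 'e'
```

The non-erased cells $\phi_k^{-1}(w)$ are the open dyadic intervals $(w\,2^{-k}, (w+1)\,2^{-k})$ inside $[-k, k]$, consistent with the alphabet $\{-k2^k, \dots, k2^k - 1, e\}$.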
For simplicity of the presentation, we shall assume that the set of "dyadic points" has measure zero:
$$P_{X_i}\Big(\bigcup_{k=1}^{\infty} 2^{-k}\mathbb{Z}\Big) = 0, \quad i = 1, \dots, l. \tag{6.86}$$
This is not an essential restriction, since the only property of $2^{-k}\mathbb{Z}$ used in the proof is that it is a countable dense subset of $\mathbb{R}$. In general, $P_{X_i}$ can have positive mass on at most countably many points, and in particular there exists a point of zero mass in any open interval. Thus, we can always choose a countable dense subset of $\mathbb{R}$ with $P_{X_i}$-measure zero for each $i$, and appropriately modify (6.85) to quantize with points in this set instead.
Since $P^{(n)}_{X_i} \to P_{X_i}$ weakly and the assumption in the preceding paragraph precludes any positive mass on the quantization boundaries under $P_{X_i}$, for each $k \ge 1$ there exists some $n := n_k$ large enough such that
$$P^{(n)}_{W_i^{[k]}}(w) \ge \Big(1 - \frac{1}{k}\Big) P_{W_i^{[k]}}(w) \tag{6.87}$$
for each $i$ and $w \in \mathcal{W}^{[k]}$. Now define a coupling $P^{(n)}_{W^{[k]}}$ compatible with the marginals $\big(P^{(n)}_{W_i^{[k]}}\big)_{i=1}^l$ induced by $\big(P^{(n)}_{X_i}\big)_{i=1}^l$, as follows:
$$P^{(n)}_{W^{[k]}} := \Big(1 - \frac{1}{k}\Big) P_{W^{[k]}} + k^{l-1} \prod_{i=1}^l \Big(P^{(n)}_{W_i^{[k]}} - \Big(1 - \frac{1}{k}\Big) P_{W_i^{[k]}}\Big). \tag{6.88}$$
Observe that this is a well-defined probability measure because of (6.87), and indeed has $\big(P^{(n)}_{W_i^{[k]}}\big)_{i=1}^l$ as the marginals. Moreover, by the triangle inequality we have the following bound on the total variation distance:
$$\Big\| P^{(n)}_{W^{[k]}} - P_{W^{[k]}} \Big\| \le \frac{2}{k}. \tag{6.89}$$
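A quick numerical check of the construction (6.88) for $l = 2$: the alphabet sizes and the particular perturbation used to enforce (6.87) below are arbitrary test choices.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5

# A joint pmf P_W on a 3 x 4 alphabet, with marginals P_{W_1}, P_{W_2}.
P = rng.random((3, 4)); P /= P.sum()
P1, P2 = P.sum(axis=1), P.sum(axis=0)

# Marginals Q_i playing the role of P^{(n)}_{W_i}, satisfying (6.87):
# mix each P_{W_i} with a uniform pmf.
Q1 = (1 - 1 / (2 * k)) * P1 + (1 / (2 * k)) / 3
Q2 = (1 - 1 / (2 * k)) * P2 + (1 / (2 * k)) / 4

# The mixture (6.88) for l = 2: (1 - 1/k) P + k^{l-1} * product of residuals.
R1 = Q1 - (1 - 1 / k) * P1          # residuals, nonnegative by (6.87)
R2 = Q2 - (1 - 1 / k) * P2
Pn = (1 - 1 / k) * P + k * np.outer(R1, R2)

assert (Pn >= 0).all() and np.isclose(Pn.sum(), 1.0)          # a valid pmf
assert np.allclose(Pn.sum(axis=1), Q1)                         # marginal 1
assert np.allclose(Pn.sum(axis=0), Q2)                         # marginal 2
assert np.abs(Pn - P).sum() <= 2 / k + 1e-12                   # the bound (6.89)
```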
Next, construct⁸ $P^{(n)}_X$:
$$P^{(n)}_X := \sum_{w^l \in \mathcal{W}^{[k]} \times \cdots \times \mathcal{W}^{[k]}} \frac{P^{(n)}_{W^{[k]}}(w^l)}{\prod_{i=1}^l P^{(n)}_{W_i^{[k]}}(w_i)} \prod_{i=1}^l P^{(n)}_{X_i}\big|_{\phi_k^{-1}(w_i)}. \tag{6.90}$$
Observe that the $P^{(n)}_X$ so defined is compatible with the $P^{(n)}_{W^{[k]}}$ defined in (6.88), and indeed has $(P^{(n)}_{X_i})_{i=1}^l$ as the marginals. Since $n := n_k$ can be made increasing in $k$, we have constructed the desired sequence $(P^{(n_k)}_X)_{k=1}^\infty$ converging weakly to $P_X$. Indeed, for any bounded open dyadic cube⁹ $A$, using (6.89) and the assumption (6.86), we conclude
$$\liminf_{k\to\infty} P^{(n_k)}_X(A) \ge P_X(A). \tag{6.91}$$
Moreover, since bounded open dyadic cubes form a countable basis of the topology on $\mathbb{R}^l$, (6.91) actually holds for any open set $A$ (by writing $A$ as a countable union of dyadic cubes, using the continuity of measure to pass to a finite disjoint union, and then applying (6.91)), as desired.
Next, we need the following tensorization result:

Lemma 6.3.6 Fix $(P_{X_i^{(1)}})$, $(P_{X_i^{(2)}})$, $(\mu_j)$, $(T_j)$, and $(c_j) \in [0,\infty)^m$, and let the $S_i$ be induced by coordinate projections. Then
$$\inf_{P_{X^{(1,2)}}\colon S_i^{*\otimes 2} P_{X^{(1,2)}} = P_{X_i^{(1)}} \times P_{X_i^{(2)}}} \sum_{j=1}^m c_j D\big(P_{Y_j^{(1,2)}} \,\big\|\, \mu_j^{\otimes 2}\big) = \sum_{t=1,2}\, \inf_{P_{X^{(t)}}\colon S_i^* P_{X^{(t)}} = P_{X_i^{(t)}}} \sum_{j=1}^m c_j D\big(P_{Y_j^{(t)}} \,\big\|\, \mu_j\big) \tag{6.92}$$
⁸We use $P|_A$ to denote the restriction of a probability measure $P$ to a measurable set $A$, that is, $P|_A(B) := P(A \cap B)$ for any measurable $B$.
⁹That is, a cube whose corners have coordinates that are multiples of $2^{-k}$, where $k$ is some integer.
where for each $j$,
$$P_{Y_j^{(1,2)}} := T_j^{*\otimes 2} P_{X^{(1,2)}} \tag{6.93}$$
on the left side and
$$P_{Y_j^{(t)}} := T_j^{*} P_{X^{(t)}} \tag{6.94}$$
on the right side, $t = 1, 2$.
Proof We only need to prove the nontrivial "$\ge$" part. For any $P_{X^{(1,2)}}$ on the left side, choose $P_{X^{(t)}}$ on the right side by marginalization. Then
$$D\big(P_{Y_j^{(1,2)}} \,\big\|\, \mu_j^{\otimes 2}\big) - \sum_t D\big(P_{Y_j^{(t)}} \,\big\|\, \mu_j\big) = I\big(Y_j^{(1)}; Y_j^{(2)}\big) \tag{6.95}$$
$$\ge 0 \tag{6.96}$$
for each $j$.
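The identity (6.95) is just the chain-rule decomposition $D(P_{Y_1Y_2}\|\mu\otimes\mu) = I(Y_1;Y_2) + D(P_{Y_1}\|\mu) + D(P_{Y_2}\|\mu)$; a discrete numerical check (alphabet size and random seed are arbitrary):

```python
import numpy as np

def D(p, q):
    """Relative entropy D(p||q) in nats for discrete pmfs (supp p in supp q)."""
    p, q = np.asarray(p, float).ravel(), np.asarray(q, float).ravel()
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

rng = np.random.default_rng(1)
P = rng.random((4, 4)); P /= P.sum()      # joint pmf of (Y^(1), Y^(2))
P1, P2 = P.sum(axis=1), P.sum(axis=0)     # marginals
mu = rng.random(4); mu /= mu.sum()        # reference measure on the Y alphabet

I_Y = D(P, np.outer(P1, P2))              # mutual information I(Y1; Y2)
lhs = D(P, np.outer(mu, mu)) - D(P1, mu) - D(P2, mu)
assert np.isclose(lhs, I_Y) and I_Y >= 0  # (6.95) and (6.96)
```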
We are now in a position to prove the main result of this section.

Proof of Theorem 6.3.1

1. Assume that $(P_{X_i^{(1)}})$ and $(P_{X_i^{(2)}})$ are maximizers of $F_0$ (possibly equal). Let $P_{X_i^{(1,2)}} := P_{X_i^{(1)}} \times P_{X_i^{(2)}}$. Define
$$X^+ := \tfrac{1}{\sqrt{2}}\big(X^{(1)} + X^{(2)}\big); \tag{6.97}$$
$$X^- := \tfrac{1}{\sqrt{2}}\big(X^{(1)} - X^{(2)}\big). \tag{6.98}$$
Define $(Y_j^+)$ and $(Y_j^-)$ analogously. Then $Y_j^+ \mid \{X^+ = x^+, X^- = x^-\} \sim Q_{Y_j|X=x^+}$ is independent of $x^-$, and $Y_j^- \mid \{X^+ = x^+, X^- = x^-\} \sim Q_{Y_j|X=x^-}$ is independent of $x^+$.
2. Next we perform the same algebraic expansion as in the proof of tensorization:
\begin{align}
\sum_{t=1}^2 F_0\big((P_{X_i^{(t)}})\big)
&= \inf_{P_{X^{(1,2)}}\colon S_j^{*\otimes 2} P_{X^{(1,2)}} = P_{X_j^{(1,2)}}} \sum_j c_j D\big(P_{Y_j^{(1,2)}} \big\| \mu_j^{\otimes 2}\big) - \sum_i b_i D\big(P_{X_i^{(1,2)}} \big\| \nu_i^{\otimes 2}\big) \tag{6.99}\\
&= \inf_{P_{X^+X^-}\colon S_j^{*\otimes 2} P_{X^+X^-} = P_{X_j^+ X_j^-}} \sum_j c_j D\big(P_{Y_j^+ Y_j^-} \big\| \mu_j^{\otimes 2}\big) - \sum_i b_i D\big(P_{X_i^+ X_i^-} \big\| \nu_i^{\otimes 2}\big) \tag{6.100}\\
&\le \inf_{P_{X^+X^-}\colon S_j^{*\otimes 2} P_{X^+X^-} = P_{X_j^+ X_j^-}} \sum_j c_j \Big[ D\big(P_{Y_j^+} \big\| \mu_j\big) + D\big(P_{Y_j^-|X^+} \big\| \mu_j \,\big|\, P_{X^+}\big) \Big] - \sum_i b_i \Big[ D\big(P_{X_i^+} \big\| \nu_i\big) + D\big(P_{X_i^-|X_i^+} \big\| \nu_i \,\big|\, P_{X_i^+}\big) \Big] \tag{6.101}\\
&\le \sum_j c_j \Big[ D\big(P^\star_{Y_j^+} \big\| \mu_j\big) + D\big(P^\star_{Y_j^-|X^+} \big\| \mu_j \,\big|\, P^\star_{X^+}\big) \Big] - \sum_i b_i \Big[ D\big(P^\star_{X_i^+} \big\| \nu_i\big) + D\big(P^\star_{X_i^-|X^+} \big\| \nu_i \,\big|\, P^\star_{X^+}\big) \Big] \tag{6.102}\\
&= F_0\big((P^\star_{X_i^+})\big) + \int F_0\big((P^\star_{X_i^-|X^+})\big)\, \mathrm{d}P^\star_{X^+} \tag{6.103}\\
&\le \sum_{t=1}^2 F_0\big((P_{X_i^{(t)}})\big) \tag{6.104}
\end{align}
where
• (6.99) uses Lemma 6.3.6.
• (6.101) is because of the Markov chain $Y_j^+ - X^+ - Y_j^-$ (for any coupling).
• In (6.102) we selected a particular instance of the coupling $P_{X^+X^-}$, constructed as follows: first we select an optimal coupling $P_{X^+}$ for the given marginals $(P_{X_i^+})$. Then, for any $x^+ = (x_i^+)_{i=1}^l$, let $P_{X^-|X^+=x^+}$ be an optimal coupling of $(P_{X_i^-|X_i^+=x_i^+})$.¹⁰ With this construction, it is apparent that $X_i^+ - X^+ - X_i^-$ and hence
$$D\big(P_{X_i^-|X_i^+} \big\| \nu_i \,\big|\, P_{X_i^+}\big) = D\big(P_{X_i^-|X^+} \big\| \nu_i \,\big|\, P_{X^+}\big). \tag{6.105}$$
¹⁰Here we need to justify that we can select the optimal coupling $P_{X^-|X^+=x^+}$ in a way that $P_{X^-|X^+}$ is indeed a regular conditional probability distribution, or equivalently, that $x^+ \mapsto P_{X^-|X^+=x^+}$ is Borel measurable, where we endow the space of probability measures with the weak topology. This is justified by two observations: (a) $x^+ \mapsto (P_{X_i^-|X^+=x^+})$ is Borel measurable; (b) the marginalization map
• (6.103) is because in the above we have constructed the coupling optimally.
• (6.104) is because $(P_{X_i^{(t)}})$ maximizes $F_0$, $t = 1, 2$.
3. Thus in the expansions above, equalities are attained throughout. Using the differentiation technique as in the case of the forward inequality, for almost all $(b_i)$, $(c_j)$ we have
$$D\big(P_{X_i^-|X_i^+} \big\| \nu_i \,\big|\, P_{X_i^+}\big) = D\big(P_{X_i^+} \big\| \nu_i\big) \tag{6.106}$$
$$= D\big(P_{X_i^-} \big\| \nu_i\big), \quad \forall i, \tag{6.107}$$
where the last equality holds because, by symmetry, we can perform the algebraic expansions in a different way to show that $(P_{X_i^-})$ is also a maximizer of $F_0$. Then $I(X_i^+; X_i^-) = 0$, which, combined with $I(X_i^{(1)}; X_i^{(2)}) = 0$, shows that $X_i^{(1)}$ and $X_i^{(2)}$ are Gaussian with the same covariance. Lastly, using Lemma 6.3.6 and the doubling trick, one can show that the optimal coupling is also Gaussian.
$\phi\colon O \to \prod_i \mathcal{P}(\mathcal{X}_i)$, $P_X \mapsto (P_{X_i})$, admits a Borel right inverse, where $O$ is the set of optimal couplings (w.r.t. the relative entropy functional) whose marginals satisfy the second moment constraint. Part (a) is equivalent to the regularity of $P_{X_i^-|X^+}$, and the existence of such a regular conditional distribution is guaranteed for joint distributions on Polish spaces (whose measurable space structure is isomorphic to a standard Borel space); see e.g. [43]. Part (b) is justified by measurable selection theorems (see e.g. references in [57, Corollary 5.22]). In particular, an argument similar to [57, Corollary 5.22] can also be applied here, since Corollary 6.3.5 implies that $O$ is closed and the pre-images $\phi^{-1}((P_{X_i}))$ are all compact, and Proposition 6.3.3.1 justifies that $\phi$ is onto.
6.4 Consequences of the Gaussian Optimality
In this section, we demonstrate several implications of the entropic Gaussian optimal-
ity results in Sections 6.2-6.3 to functional inequalities, transportation-cost inequali-
ties, and network information theory.
6.4.1 Brascamp-Lieb Inequality with Gaussian Random Transformations: an Information-Theoretic Proof
We give a simple proof of an extension of the Brascamp-Lieb inequality using Corol-
lary 6.2.6. That is, we give an information-theoretic proof of the following result:
Theorem 6.4.1 Suppose $\mu$ is either a Gaussian measure or the Lebesgue measure on $\mathcal{X} = \mathbb{R}^n$. Let $Q_{Y_j|X}$ be Gaussian random transformations where $\mathcal{Y}_j = \mathbb{R}^{n_j}$, and $c_j \in (0,\infty)$ for $j \in \{1, \dots, m\}$. For any non-negative measurable functions $f_j\colon \mathcal{Y}_j \to \mathbb{R}$, $j \in \{1, \dots, m\}$, define
$$H(f_1, \dots, f_m) = \log \int \exp\Big(\sum_{j=1}^m \mathbb{E}[\log f_j(Y_j) \mid X = x]\Big)\, \mathrm{d}\mu(x) - \log \prod_{j=1}^m \|f_j\|_{\frac{1}{c_j}} \tag{6.108}$$
where the norm $\|f_j\|_{\frac{1}{c_j}}$ is with respect to the Lebesgue measure and the expectation is with respect to $Q_{Y_j|X=x}$. Then
$$\sup_{f_1, \dots, f_m} H(f_1, \dots, f_m) = \sup_{\text{Gaussian } f_1, \dots, f_m} H(f_1, \dots, f_m). \tag{6.109}$$
Moreover, if $\mu$ is Gaussian and $(Q_{Y_j|X})$ is non-degenerate in the sense of Definition 6.2.2, then the supremum in (6.109) is finite and uniquely attained by a set of Gaussian functions.
Proof We only prove the case of Gaussian $\mu$, since the proof for the Lebesgue case is similar. By translation and scaling invariance, it suffices to consider the case where $\mu$ is a centered Gaussian with density $\exp(-x^\top M x)$ for some $M \succeq 0$. Then
$$\imath_{\lambda\|\mu}(x) = x^\top M x \tag{6.110}$$
where $\lambda$ denotes the Lebesgue measure on $\mathbb{R}^n$. Therefore,
$$D(P_X \| \mu) + \sum_{j=1}^m c_j h(P_{Y_j}) = -h(P_X) + \sum_{j=1}^m c_j h(P_{Y_j}) + \mathbb{E}[\imath_{\lambda\|\mu}(X)] \tag{6.111}$$
$$= -h(P_X) + \sum_{j=1}^m c_j h(P_{Y_j}) + \mathbb{E}[X^\top M X], \tag{6.112}$$
where $X \sim P_X$. Then (6.109) is established by
\begin{align}
\sup_{f_1, \dots, f_m} H(f_1, \dots, f_m) &= -\inf_{P_X} \Big\{ D(P_X \| \mu) + \sum_{j=1}^m c_j h(P_{Y_j}) \Big\} \tag{6.113}\\
&= -\inf_{\text{Gaussian } P_X} \Big\{ D(P_X \| \mu) + \sum_{j=1}^m c_j h(P_{Y_j}) \Big\} \tag{6.114}\\
&= \sup_{\text{Gaussian } f_1, \dots, f_m} H(f_1, \dots, f_m) \tag{6.115}
\end{align}
where
• (6.113) is from Remark 2.2.3;
• (6.114) is from (6.112) and Corollary 6.2.6;
• (6.115) is essentially a “Gaussian version” of the equivalence of the two inequal-
ities in Theorem 2.2.3, which is easily shown with the same steps in the proof
of Theorem 2.2.3, noting that (2.17) sends Gaussian distributions to Gaussian
functions and (2.24) sends Gaussian functions to Gaussian distributions.
In the case of Gaussian $\mu$ and non-degenerate $(Q_{Y_j|X})$, Corollary 6.2.6 implies that (6.114) is finitely attained by a unique Gaussian $P_X$. By Remark 2.2.3, (6.109) is finitely attained by a unique set of Gaussian functions.
Remark 6.4.1 The proof of the Brascamp-Lieb inequality by Carlen and Cordero-Erausquin [76] also relies on a dual information-theoretic formulation. However, their proof of the differential entropy inequality uses a different approach based on the superadditivity of Fisher information. That approach applies to the case where each $Q_{Y_j|X}$ is a deterministic rank-one linear map, and it requires the problem to be first reduced to a special case called the geometric Brascamp-Lieb inequality (proposed by K. Ball [206]).
6.4.2 Multi-variate Gaussian Hypercontractivity

In this section we show a multivariate extension of Gaussian hypercontractivity. An $m$-tuple of random variables $(X_1, \dots, X_m) \sim Q_{X^m}$ is said to be $(p_1, \dots, p_m)$-hypercontractive, for $p_l \in [1, \infty]$, $l \in \{1, \dots, m\}$, if
$$\mathbb{E}\Big[\prod_{l=1}^m f_l(X_l)\Big] \le \prod_{l=1}^m \|f_l(X_l)\|_{p_l} \tag{6.116}$$
for all bounded real-valued measurable functions $f_l$ defined on $\mathcal{X}_l$, $l \in \{1, \dots, m\}$. Define the hypercontractivity region¹¹
$$\mathcal{R}(Q_{X^m}) := \{(p_1, \dots, p_m) \in \mathbb{R}_+^m \colon (X_1, \dots, X_m) \text{ is } (p_1, \dots, p_m)\text{-hypercontractive}\}. \tag{6.117}$$
By Theorem 2.2.3, the inequality (6.116) holds if and only if
$$D(P_{X^m} \| Q_{X^m}) \ge \sum_{j=1}^m \frac{1}{p_j} D(P_{X_j} \| Q_{X_j}) \tag{6.118}$$
¹¹Note that this definition is similar to the hypercontractivity ribbon defined in [9], but without taking the Hölder conjugate of one of the two exponents, which is more symmetrical and convenient in the multivariate setting.
holds for any $P_{X^m} \ll Q_{X^m}$. In fact, [9] showed that (6.118) is equivalent to the following (which hinges on the fact that the constant term $d = 0$ in (6.118)):
$$I(U; X^m) \ge \sum_{j=1}^m \frac{1}{p_j} I(U; X_j), \quad \forall U. \tag{6.119}$$
In the case of Gaussian $Q_{X^m}$, by Theorem 6.2.5 the inequality (6.119) holds if it holds for all $U$ jointly Gaussian with $X^m$ and having dimension at most $m$. When restricted to such $U$, (6.118) becomes an inequality involving the covariance matrices, and some elementary computations show:

Proposition 6.4.2 Suppose $Q_{X^m} = \mathcal{N}(0, \Sigma)$ where $\Sigma$ is a positive semidefinite matrix whose diagonal values are all 1. Then $p^m \in \mathcal{R}(Q_{X^m})$ if and only if
$$\mathrm{P} \succeq \Sigma \tag{6.120}$$
where $\mathrm{P}$ is the diagonal matrix with $p_j$, $j \in \{1, \dots, m\}$, as its diagonal entries.
Proof Let $A$ be the covariance matrix of $X^m$ conditioned on $U$, and put $C := \mathrm{P}^{-1}$. We will use lowercase letters such as $a_1, \dots, a_m$ to denote the diagonal entries of the corresponding matrices. Then, in view of (6.119), we see that the goal is to show that (6.120) is a necessary and sufficient condition for
$$\log \frac{|\Sigma|}{|A|} \ge \sum_j c_j \log \frac{1}{a_j}, \quad \forall\, 0 \preceq A \preceq \Sigma. \tag{6.121}$$
Let us first assume that $\Sigma$ is invertible. Defining $D := \Sigma - A$, (6.121) rewrites as
$$|\mathrm{I} - \Sigma^{-1} D| \le \prod_j (1 - d_j)^{c_j}, \quad \forall\, 0 \preceq D \preceq \Sigma. \tag{6.122}$$
If (6.120) holds, then $C \preceq \Sigma^{-1}$, and we have
\begin{align}
|\mathrm{I} - \Sigma^{-1} D| &\le |\mathrm{I} - CD| \tag{6.123}\\
&\le \prod_j (1 - c_j d_j) \tag{6.124}\\
&\le \prod_j (1 - d_j)^{c_j} \tag{6.125}
\end{align}
where (6.124) is because $\{1 - c_j d_j\}$ are the diagonal entries of $\mathrm{I} - CD$, which are majorized by the eigenvalues of that matrix. Inequality (6.125) is because $C \preceq \Sigma^{-1}$ and $D \preceq \Sigma$ imply the nonnegativity of $1 - c_j d_j$.

Conversely, if (6.122) holds, we can apply Taylor expansions to both sides of (6.122):
$$1 - \mathrm{Tr}(\Sigma^{-1} D) + o(\|D\|) \le 1 - \sum_j c_j d_j + o(\|D\|), \tag{6.126}$$
where $\|D\|$ can be chosen as, say, the trace norm. Comparing the first-order terms, we see that
$$\mathrm{Tr}\big((\Sigma^{-1} - C) D\big) \ge 0 \tag{6.127}$$
must hold for any $0 \preceq D \preceq \Sigma$. Therefore (6.120) holds.
More generally, if $\Sigma$ is not necessarily invertible, we can consider the inverse of its restriction $\Sigma|_V$, where $V$ is its column space. Then the above arguments still carry through with trivial modifications, and we can show that (6.121) holds if and only if
\begin{align}
&\Sigma|_V^{-1} \succeq C|_V \tag{6.128}\\
\iff\ & \Sigma|_V^{1/2}\, C|_V\, \Sigma|_V^{1/2} \preceq \mathrm{I} \tag{6.129}\\
\iff\ & \Sigma^{1/2} C\, \Sigma^{1/2} = \Sigma^{1/2} \mathrm{P}^{-1} \Sigma^{1/2} \preceq \mathrm{I} \tag{6.130}\\
\iff\ & \mathrm{P}^{-1/2}\, \Sigma\, \mathrm{P}^{-1/2} \preceq \mathrm{I} \tag{6.131}\\
\iff\ & \Sigma \preceq \mathrm{P} \tag{6.132}
\end{align}
where (6.131) is because $\Sigma^{1/2} \mathrm{P}^{-1} \Sigma^{1/2}$ and $\mathrm{P}^{-1/2} \Sigma \mathrm{P}^{-1/2}$ have the same set of eigenvalues.
Remark 6.4.2 When $m = 2$, Proposition 6.4.2 reduces to Nelson's hypercontractivity theorem for a pair of Gaussian scalar random variables $X_1$ and $X_2$: that is, $(p_1, p_2) \in \mathcal{R}(Q_{X^2})$ if and only if
$$(p_1 - 1)(p_2 - 1) \ge \rho^2(X_1; X_2), \tag{6.133}$$
where $\rho^2$ denotes the squared Pearson correlation coefficient.
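For $m = 2$ the equivalence between (6.120) and (6.133) can be verified mechanically: $\mathrm{P} - \Sigma$ has entries $p_1 - 1$, $p_2 - 1$ on the diagonal and $-\rho$ off the diagonal, and (for $p_1, p_2 \ge 1$) it is positive semidefinite exactly when its determinant $(p_1-1)(p_2-1) - \rho^2$ is nonnegative. A randomized check (ranges and seed are arbitrary test choices):

```python
import numpy as np

rng = np.random.default_rng(2)
for _ in range(1000):
    rho = rng.uniform(-0.99, 0.99)
    p1, p2 = rng.uniform(1.0, 4.0, size=2)
    Sigma = np.array([[1.0, rho], [rho, 1.0]])
    P = np.diag([p1, p2])
    psd = np.linalg.eigvalsh(P - Sigma).min() >= 0    # condition (6.120)
    nelson = (p1 - 1.0) * (p2 - 1.0) >= rho ** 2      # condition (6.133)
    assert psd == nelson
```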
6.4.3 $T_2$ Inequality for Gaussian Measures

Consider the Euclidean space endowed with the $\ell_2$-norm, $(\mathbb{R}^n, \|\cdot\|_2)$. Talagrand [207] showed that the standard Gaussian measure $Q = \mathcal{N}(0, \mathrm{I}_n)$ satisfies the $T_2(1)$ inequality (see Definition 2.5.1). Below we give a new proof using the Gaussian optimality in the forward-reverse Brascamp-Lieb inequality. By the tensorization of the $T_2$ inequality [137], it suffices to prove the $n = 1$ case. By continuity, it suffices to prove $T_2(\lambda)$ for any $\lambda > 1$. Moreover, as one can readily check, when $\lambda > 1$ and $Q = \mathcal{N}(0, 1)$, (2.168) is satisfied for all Gaussian $P$ (in which case the optimal coupling in (2.168) is also Gaussian), so it suffices to prove Gaussian extremisability in (2.168).
Let $\lambda \in (1, +\infty)$, $b_2 \in (0, +\infty)$, and
$$F_0(P_{X_1}, P_{X_2}) := \inf_{P_{X_1X_2}} \mathbb{E}[|X_1 - X_2|^2] - 2\lambda D(P_{X_1} \| Q) - b_2 D(P_{X_2} \| Q), \tag{6.134}$$
where the infimum is over couplings $P_{X_1X_2}$ of $P_{X_1}$ and $P_{X_2}$. Using the rotation invariance argument/doubling trick in the proof of Theorem 6.3.1, we can show that if $(P_{X_1}, P_{X_2})$ maximizes $F_0(P_{X_1}, P_{X_2})$, then they must be Gaussian, and the optimal coupling $P_{X_1X_2}$ is also Gaussian¹². Letting $b_2 \to \infty$, we see the Gaussian optimality in the $T_2$ inequality. To make the above argument rigorous, we need to take care of two technical issues:

(a) We approximated the functional (2.175) with $b_2 D(P_{X_2} \| Q)$ and let $b_2 \to \infty$, but we want the Gaussian optimality to continue to hold when the last term in (6.134) is exactly (2.175).

(b) We want to show the existence of a maximizer $(P_{X_1}, P_{X_2})$ for (6.134).

While it might be possible to provide a formal justification of the limit argument in (a), a slicker way is to circumvent it by working directly with (2.175) instead of $b_2 D(P_{X_2} \| Q)$ with $b_2 \to \infty$. From the tensorization of the functional (6.134), it is relatively easy to distill the tensorization of the $T_2$ inequality (see also [137] for a direct proof of the tensorization of the $T_2$ inequality), and then use the rotation invariance argument/doubling trick to conclude Gaussian optimality.
¹²In Theorem 6.3.1 the first term of the objective function is the infimum of the relative entropy, rather than the infimum of the expectation of a quadratic cost function. However, the argument in Theorem 6.3.1 also works in the latter case, since the expectation functional has a similar tensorization property, and the quadratic cost function also has a rotational invariance property.
As for (b), note that if $b_2 D(P_{X_2} \| Q)$ is replaced with (2.175), we want to show the existence of a maximizer $(P_{X_1}, Q)$ for (6.134). If $\sigma^2 := \mathbb{E}[|X_1|^2]$, then
\begin{align}
\inf_{P_{X_1X_2}} \mathbb{E}[|X_1 - X_2|^2] &\le \sigma^2 + \mathbb{E}[|X_2|^2] \tag{6.135}\\
&= \sigma^2 + 1. \tag{6.136}
\end{align}
On the other hand,
$$D(P_{X_1} \| Q) = D(P_{X_1} \| \mathcal{N}(0, \sigma^2)) + \mathbb{E}[\imath_{\mathcal{N}(0,\sigma^2)\|Q}(X_1)] \ge \frac{\sigma^2}{2} + o(\sigma^2). \tag{6.137}$$
Therefore, if $\lambda > 1$ and $(P_{X_1}^{(t)})_{t=1}^\infty$ is a supremizing sequence, then $(P_{X_1}^{(t)})_{t=1}^\infty$ must have bounded second moments, hence must be tight. Thus Prokhorov's theorem implies the existence of a subsequence weakly converging to some $P^\star_{X_1}$, and a semicontinuity argument similar to Proposition 6.3.3 shows that $P^\star_{X_1}$ is in fact a maximizer.
6.4.4 Wyner’s Common Information for m Dependent Ran-
dom Variables
Wyner’s common information for m dependent random variables X1, . . . , Xm is com-
monly defined as [208]
CpXmq :“ inf
UIpU ;Xm
q (6.138)
where the infimum is over PU |Xm such thatX1, . . . , Xm are independent conditioned on
U . Previously, to the best of our knowledge, the common information for m Gaussian
scalars X1, . . . , Xm could only be obtained in the special case where the correlation
coefficient between Xi and Xj are equal for all 1 ď i, j ď m [208, Corollary 1] via a
different approach.
Using Theorem 6.2.5 and setting $X \leftarrow X^m$ and $Y_j \leftarrow X_j$, we immediately obtain the following characterization of the multivariate common information for Gaussian sources:

Theorem 6.4.3 The common information of $m$ Gaussian scalar random variables $X_1, \dots, X_m$ with covariance matrix $\Sigma > 0$ is given by
$$C(X^m) = \frac{1}{2} \inf_{\Lambda} \log \frac{|\Sigma|}{|\Lambda|} \tag{6.139}$$
where the infimum is over all diagonal matrices $\Lambda$ satisfying $\Lambda \preceq \Sigma$. More generally, when $\Sigma$ is not necessarily invertible, $\Sigma$ and $\Lambda$ in (6.139) should be replaced by the restrictions $\Sigma|_V$ and $\Lambda|_V$ to the column space $V$ of $\Sigma$.
Remark 6.4.3 After we completed a draft of the journal version of this work [93], Jun Chen and Chandra Nair independently found and showed us a simple proof of Theorem 6.4.3 that does not invoke Theorem 6.2.5: using the fact that the Gaussian distribution maximizes differential entropy under a covariance constraint, and the concavity of the log-determinant function,
\begin{align}
I(U; X^m) &= h(X^m) - h(X^m \mid U) \tag{6.140}\\
&\ge \frac{1}{2} \log |\Sigma| - \mathbb{E}\Big[\frac{1}{2} \log |\mathrm{Cov}(X|U)|\Big] \tag{6.141}\\
&\ge \frac{1}{2} \log |\Sigma| - \frac{1}{2} \log |\Sigma_{X|U}|. \tag{6.142}
\end{align}
Since $\Sigma_{X|U} \preceq \Sigma$ and $\Sigma_{X|U}$ is diagonal, this establishes the nontrivial "$\ge$" part of (6.139) in the case of invertible $\Sigma$. This argument is also related to the proof of Theorem 6.2.5, in the sense that both convert a mutual information optimization problem (without a covariance constraint) into a conditional differential entropy optimization problem with a covariance constraint.
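For $m = 2$, (6.139) can be evaluated by brute force and compared against the known closed form $\frac{1}{2}\log\frac{1+\rho}{1-\rho}$ for a pair of unit-variance Gaussians with correlation $\rho$; the grid resolution below is an arbitrary test choice:

```python
import numpy as np

def common_information(rho: float, grid: int = 200) -> float:
    """Evaluate (6.139) for Sigma = [[1, rho], [rho, 1]] by brute force
    over diagonal Lambda = diag(a, b) with 0 < Lambda <= Sigma."""
    Sigma = np.array([[1.0, rho], [rho, 1.0]])
    best = np.inf
    for a in np.linspace(1e-3, 1.0, grid):
        for b in np.linspace(1e-3, 1.0, grid):
            # feasibility: Sigma - Lambda must be positive semidefinite
            if np.linalg.eigvalsh(Sigma - np.diag([a, b])).min() >= 0:
                best = min(best, 0.5 * np.log(np.linalg.det(Sigma) / (a * b)))
    return best

rho = 0.5
closed_form = 0.5 * np.log((1 + rho) / (1 - rho))
assert abs(common_information(rho) - closed_form) < 0.02
```

The infimum is attained at $\Lambda = (1-\rho)\,\mathrm{I}$, so $\frac12\log\frac{1-\rho^2}{(1-\rho)^2} = \frac12\log\frac{1+\rho}{1-\rho}$, which the grid search recovers up to discretization error.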
6.4.5 Key Generation with an Omniscient Helper
[Figure 6.1: Key generation with an omniscient helper. The omniscient helper $T_0$ observes $X^m$ and sends message $W_l$ to terminal $T_l$; terminal $T_l$ observes $X_l$ and computes key $K_l$, while $T_0$ computes $K$.]
As an example of applications to network information theory, we give a simple characterization of the achievable rate region for secret key generation with an omniscient helper [101] in the case of stationary memoryless Gaussian sources. This problem can be viewed as the omniscient CR generation problem in Chapter 2 with an additional secrecy constraint. For completeness we formulate the problem below. Let $Q_{X^m}$ be the per-letter joint distribution of the sources $X_1, \dots, X_m$. As in Figure 6.1, the terminals $T_1, \dots, T_m$ observe i.i.d. realizations of $X_1, \dots, X_m$, respectively, whereas the omniscient helper $T_0$ has access to all the sources. Suppose the terminals perform block coding with length $n$. The communicator computes the integers $W_1((X^m)^n), \dots, W_m((X^m)^n)$, possibly stochastically, and sends them to $T_1, \dots, T_m$, respectively. Then, all the terminals calculate integers $K((X^m)^n)$, $K_1((X_1)^n, W_1)$, ..., $K_m((X_m)^n, W_m)$, possibly stochastically. The goal is to make $K = K_1 = \cdots = K_m$ with high probability, and $K$ almost equiprobable and independent of each message $W_l$. In other words, we want to minimize the following quantities:
$$\epsilon_n = \max_{1 \le l \le m} \mathbb{P}[K \ne K_l], \tag{6.143}$$
$$\nu_n = \max_{1 \le l \le m} \{\log |\mathcal{K}| - H(K \mid W_l)\}. \tag{6.144}$$
Thus, the difference from the omniscient helper CR generation problem is that here $K$ needs to be (asymptotically) independent of $W_l$, $l = 1, \dots, m$.

An $(m+1)$-tuple $(R, R_1, \dots, R_m)$ is said to be achievable if a sequence of key generation schemes can be designed to fulfill the following conditions:
\begin{align}
\liminf_{n\to\infty} \frac{1}{n} \log |\mathcal{K}| &\ge R; \tag{6.145}\\
\limsup_{n\to\infty} \frac{1}{n} \log |\mathcal{W}_l| &\le R_l, \quad l \in \{1, \dots, m\}; \tag{6.146}\\
\lim_{n\to\infty} \epsilon_n &= 0; \tag{6.147}\\
\lim_{n\to\infty} \nu_n &= 0. \tag{6.148}
\end{align}
Notice that a small $\nu_n$ does not imply that $K$ is nearly independent of all the messages $W^m$ jointly; the problem appears to be harder to solve if the stronger requirement that
$$\lim_{n\to\infty} \big(\log |\mathcal{K}| - H(K \mid W^m)\big) = 0 \tag{6.149}$$
is imposed in place of (6.148) [101].
Theorem 6.4.4 [101] The set of achievable rates is the closure of
$$\bigcup_{Q_{U|X^m}} \left\{ (R, R_1, \dots, R_m)\ :\ \begin{array}{l} R \le \min\{I(U; X^m), H(X_1), \dots, H(X_m)\};\\ R_l \ge I(U; X^m) - I(U; X_l), \quad 1 \le l \le m \end{array} \right\}. \tag{6.150}$$
A priori, computing the rate region from (6.150) requires solving a possibly infinite-dimensional optimization. However, using Theorem 6.2.5 we easily see that the problem reduces to a matrix optimization in the case of Gaussian sources:
Theorem 6.4.5 For stationary memoryless sources whose per-letter joint distribution is jointly Gaussian with non-degenerate covariance matrix $\Sigma$, the achievable region for omniscient helper key generation can be represented as the closure of
$$\bigcup_{0 \preceq \Sigma' \preceq \Sigma} \left\{ (R, R_1, \dots, R_m)\ :\ \begin{array}{l} R \le \frac{1}{2} \log \frac{|\Sigma|}{|\Sigma'|};\\[2pt] R_l \ge \frac{1}{2} \log \frac{|\Sigma|}{|\Sigma'|} - \frac{1}{2} \log \frac{\Sigma_{ll}}{\Sigma'_{ll}}, \quad 1 \le l \le m \end{array} \right\}. \tag{6.151}$$
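To illustrate (6.151), one can evaluate the corner point produced by a particular feasible $\Sigma'$; the parametrization $\Sigma' = \alpha\Sigma$ with $0 < \alpha \le 1$ below is our arbitrary test choice, not a claim about the optimal $\Sigma'$:

```python
import numpy as np

def rate_point(Sigma, Sigma_p):
    """Corner of (6.151) for a fixed Sigma': the largest key rate R and
    the smallest communication rates R_l (in nats)."""
    R = 0.5 * np.log(np.linalg.det(Sigma) / np.linalg.det(Sigma_p))
    Rl = [R - 0.5 * np.log(Sigma[l, l] / Sigma_p[l, l])
          for l in range(Sigma.shape[0])]
    return R, Rl

Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
alpha = 0.25                          # Sigma' = alpha * Sigma is feasible
R, Rl = rate_point(Sigma, alpha * Sigma)
# Since det(alpha * Sigma) = alpha^2 det(Sigma) for m = 2, here
# R = log(1/alpha) and each R_l = (1/2) log(1/alpha).
```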
As alluded to, omniscient helper CR generation is the related problem in which there is no secrecy (independence) constraint. For CR generation, [16, Theorem 4.2] showed that the achievable region is similar to that of Theorem 6.4.4, except that the first bound in (6.150) is replaced by
$$R \le I(U; X^m). \tag{6.152}$$
Note that such a characterization is equivalent to the supporting hyperplane characterization we gave in (2.40). In particular, for continuous sources, which have infinite Shannon entropy, the achievable region is the same for CR generation and key generation!

We remark that despite the superficial similarity of the achievable regions for CR and key generation, the achievability part of Theorem 6.4.4 requires a more sophisticated coding technique to guarantee secrecy. We observe that Theorem 6.4.5 also applies to CR generation from Gaussian sources, because in that case $H(X_j) = \infty$ and hence does not effectively change the bound on $R$.
As we introduced in Chapter 2, a generalization of the omniscient helper problem is the one-communicator problem in [101], where in Figure 6.1 the terminal $T_0$ does not see all the random variables $X^m$ but instead another random variable $Z$ which can be arbitrarily correlated with $X^m$. The achievable rate region for one-communicator CR generation is known [16, Theorem 4.2] to be the closure of
$$\bigcup_{Q_{U|Z}} \left\{ (R, R_1, \dots, R_m)\ :\ \begin{array}{l} R \le I(U; Z);\\ R_l \ge I(U; Z) - I(U; X_l), \quad 1 \le l \le m \end{array} \right\} \tag{6.153}$$
and obviously we can use Theorem 6.2.5 to reduce (6.153) to a matrix optimization problem in the case of Gaussian $(Z, X_1, \dots, X_m)$. The achievable rate region for key generation with one communicator is also derived in [101], but the expression of that region is more complicated, involving $m+1$ auxiliary random variables, and it is not immediate to conclude that Gaussian auxiliary random variables suffice merely using Theorem 6.2.5.
6.5 Appendix: Existence of Maximizer in Theorem 6.2.1

Proposition 6.5.1 In the non-degenerate case, for any $\Sigma \succeq 0$,
$$\phi(\Sigma) := \sup_{P_{UX}\colon \Sigma_{X|U} \preceq \Sigma} F(P_{UX}) \tag{6.154}$$
is finite and is attained by some $P_{UX}$ with $|\mathcal{U}| < \infty$.

Proof First, observe that if we let $\bar{\phi}(\cdot)$ be the supremum in (6.154) with the additional restriction that $|\mathcal{U}| < \infty$, then $\bar{\phi}(\cdot)$ is a concave function on a convex set of finite dimension. Hence Jensen's inequality¹³ implies that $\phi(\cdot) \le \bar{\phi}(\cdot)$, while $\bar{\phi}(\cdot) \le \phi(\cdot)$ is obvious from the definition. Thus $\phi(\cdot) = \bar{\phi}(\cdot)$.

¹³Luckily, $\bar{\phi}$ is defined on a finite-dimensional set of matrices (rather than a possibly infinite-dimensional set of distributions $P_X$). In the infinite-dimensional case without further continuity assumptions, Jensen's inequality can fail; see the example in [209, equation (1.3)].
The set
$$\mathcal{C} := \bigcup_{P_X} \{(F_0(P_X), \mathrm{Cov}(X))\} \tag{6.155}$$
lies in a linear space of dimension $1 + \frac{\dim(\mathcal{X})(\dim(\mathcal{X})+1)}{2}$. By Carathéodory's theorem [94, Theorem 17.1], each point in the convex hull of $\mathcal{C}$ is a convex combination of at most $D := 2 + \frac{\dim(\mathcal{X})(\dim(\mathcal{X})+1)}{2}$ points in $\mathcal{C}$:
$$\bigcup_{P_{UX}\colon \mathcal{U} \text{ finite}} \{(F(P_{UX}), \Sigma_{X|U})\} = \bigcup_{P_{UX}\colon |\mathcal{U}| \le D} \{(F(P_{UX}), \Sigma_{X|U})\}, \tag{6.156}$$
hence
$$\phi(\Sigma) = \sup_{P_{UX}\colon \mathcal{U} = \{1, \dots, D\},\ \Sigma_{X|U} \preceq \Sigma} F(P_{UX}). \tag{6.157}$$
Now suppose $\{P_{U_n X_n}\}_{n\ge1}$ is a sequence satisfying $\Sigma_{X_n|U_n} \preceq \Sigma$, $\mathcal{U}_n = \{1, \dots, D\}$ for each $n$, and
$$\lim_{n\to\infty} F(P_{U_n X_n}) = \text{(6.157)}. \tag{6.158}$$
We can assume without loss of generality that $P_{U_n}$ converges to some $P_{U^\star}$, since otherwise we can pass to a convergent subsequence instead. Moreover, by translation invariance we can assume without loss of generality that
$$\mathbb{E}[X_n \mid U_n = u] = 0 \tag{6.159}$$
for each $u$ and $n$.
If $u \in \{1, \dots, D\}$ is such that $P_{U^\star}(u) > 0$, then for $n$ sufficiently large we have $P_{U_n}(u) > \frac{P_{U^\star}(u)}{2}$ and
$$\mathrm{Cov}(X_n \mid U_n = u) \preceq \frac{2\Sigma}{P_{U^\star}(u)}. \tag{6.160}$$
Thus $\{P_{X_n|U_n=u}\}_{n\ge1}$ is a tight sequence of measures by Chebyshev's inequality, and Prokhorov's theorem [109] guarantees the existence of a subsequence of $\{P_{X_n|U_n=u}\}_{n\ge1}$ converging weakly to some Borel measure $P_{X_u^\star}$. We might as well assume that $\{P_{X_n|U_n=u}\}_{n\ge1}$ converges to $P_{X_u^\star}$, since otherwise we pass to a convergent subsequence instead. This argument can be applied to each $u \in \{1, \dots, D\}$ satisfying $P_{U^\star}(u) > 0$ iteratively, hence we can assume the existence of the weak limits
$$\lim_{n\to\infty} P_{X_n|U_n=u} = P_{X_u^\star} \tag{6.161}$$
for all such $u$. Next, we show that
$$\limsup_{n\to\infty} F_0(P_{X_n|U_n=u}) \le F_0(P_{X_u^\star}) \tag{6.162}$$
for all such $u$. Using Lemma 6.5.2 below, we obtain
$$\limsup_{n\to\infty} h(X_n \mid U_n = u) \le h(X_u^\star). \tag{6.163}$$
Because of the moment constraint (6.160), the differential entropy of the output distribution, which is smoothed by the Gaussian kernel, enjoys weak continuity in the input distribution (see e.g. [92, Proposition 18], [204, Theorem 7], or [205, Theorems 1 and 2]):
$$\lim_{n\to\infty} h(Y_{jn} \mid U_n = u) = h(Y_j^\star \mid U^\star = u) \quad \text{for each } u \in \{1, \dots, D\} \tag{6.164}$$
where $(U_n, X_n, Y_{jn}) \sim P_{X_n U_n} Q_{Y_j|X}$ and $(U^\star, X^\star, Y_j^\star) \sim P_{X^\star U^\star} Q_{Y_j|X}$. As for the trace term, consider
\begin{align}
\liminf_{n\to\infty} \mathrm{Tr}[M\, \mathrm{Cov}(X_n \mid U_n = u)] &= \liminf_{n\to\infty} \mathbb{E}[\mathrm{Tr}[M X_n X_n^\top] \mid U_n = u] \tag{6.165}\\
&= \liminf_{n\to\infty} \sup_{K > 0} \mathbb{E}[\mathrm{Tr}[M X_n X_n^\top] \wedge K \mid U_n = u] \tag{6.166}\\
&\ge \sup_{K > 0} \liminf_{n\to\infty} \mathbb{E}[\mathrm{Tr}[M X_n X_n^\top] \wedge K \mid U_n = u] \tag{6.167}\\
&\ge \sup_{K > 0} \mathbb{E}[\mathrm{Tr}[M X_u^\star X_u^{\star\top}] \wedge K] \tag{6.168}\\
&= \mathbb{E}[\mathrm{Tr}[M X_u^\star X_u^{\star\top}]] \tag{6.169}
\end{align}
where "$\wedge$" takes the minimum of two numbers, (6.166) and (6.169) follow from the monotone convergence theorem, and (6.168) uses the weak convergence (6.161). The proof of (6.162) is finished by combining (6.163), (6.164) and (6.169).
The final step deals with any $u \in \{1, \dots, D\}$ satisfying $P_{U^\star}(u) = 0$. The variance constraint implies that
$$\mathrm{Cov}(X_n \mid U_n = u) \preceq \frac{1}{P_{U_n}(u)} \Sigma, \tag{6.170}$$
hence, by the fact that the Gaussian distribution maximizes differential entropy under a covariance constraint, we have the bound
\begin{align}
h(X_n \mid U_n = u) &\le \frac{\dim(\mathcal{X})}{2} \log(2\pi e) + \frac{1}{2} \log \Big| \frac{1}{P_{U_n}(u)} \Sigma \Big| \tag{6.171}\\
&= \frac{\dim(\mathcal{X})}{2} \log(2\pi e) + \frac{1}{2} \log |\Sigma| + \frac{\dim(\mathcal{X})}{2} \log \frac{1}{P_{U_n}(u)}. \tag{6.172}
\end{align}
This, combined with the fact that $h(Y_{jn} \mid U_n = u) \ge h(Y_{jn} \mid X_n)$ is bounded below for $j = 1, \dots, m$ in the non-degenerate case, implies that if $P_{U_n}(u)$ converges to zero, then
$$\limsup_{n\to\infty} P_{U_n}(u)\, F_0(P_{X_n|U_n=u}) \le 0. \tag{6.173}$$
Combining (6.162) and (6.173), we see
$$F(P_{U^\star X^\star}) \ge \limsup_{n\to\infty} F(P_{U_n X_n}), \tag{6.174}$$
where $P_{X^\star|U^\star=u} := P_{X_u^\star}$ for each $u = 1, \dots, D$.
Lemma 6.5.2 Suppose $(P_{X_n})$ is a sequence of distributions on $\mathbb{R}^d$ converging weakly to $P_{X^\star}$, and
$$\mathbb{E}[X_n X_n^\top] \preceq \Sigma \tag{6.175}$$
for all $n$. Then
$$\limsup_{n\to\infty} h(X_n) \le h(X^\star). \tag{6.176}$$

The result fails without the condition (6.175). Also, related results in which the weak convergence is replaced with pointwise convergence of density functions, under certain additional constraints, were shown in [205, Theorems 1 and 2] (see also the proof of [92, Theorem 5]). Those results are not applicable here, since the density functions of $X_n$ do not converge pointwise. They are applicable to the problems discussed in [92] because the density functions of the output of the Gaussian random transformation enjoy many nice properties due to the smoothing effect of the "good kernel".
Proof It is well known that in metric spaces and for probability measures, the relative entropy is weakly lower semicontinuous (cf. [43]). This fact and a scaling argument immediately show that, for any $r > 0$,
$$\limsup_{n\to\infty} h(X_n \mid \|X_n\| \le r) \le h(X^\star \mid \|X^\star\| \le r). \tag{6.177}$$
Let $p_n(r) := \mathbb{P}[\|X_n\| > r]$; then (6.175) implies
$$\mathbb{E}[X_n X_n^\top \mid \|X_n\| > r] \preceq \frac{1}{p_n(r)} \Sigma. \tag{6.178}$$
Therefore, since the Gaussian distribution maximizes differential entropy given a second moment upper bound, we have
$$h(X_n \mid \|X_n\| > r) \le \frac{1}{2} \log \frac{(2\pi e)^d\, |\Sigma|}{p_n^d(r)}. \tag{6.179}$$
Since $\lim_{r\to\infty} \sup_n p_n(r) = 0$ by (6.175) and Chebyshev's inequality, the above implies that
$$\lim_{r\to\infty} \sup_n\, p_n(r)\, h(X_n \mid \|X_n\| > r) = 0. \tag{6.180}$$
The desired result follows from (6.177), (6.180) and the fact that
$$h(X_n) = p_n(r)\, h(X_n \mid \|X_n\| > r) + (1 - p_n(r))\, h(X_n \mid \|X_n\| \le r) + h(p_n(r)), \tag{6.181}$$
where $h(\cdot)$ in the last term denotes the binary entropy function.
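The decomposition (6.181) can be checked numerically for a standard normal in $d = 1$ (the truncation at $\pm 12$, the threshold $r = 1$, and the grid size are test choices; $h(p)$ in the last term is the binary entropy):

```python
import numpy as np
from math import erf, log, pi, sqrt, e

def trap(y, x):
    # simple trapezoidal rule (self-contained, no scipy needed)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

f = lambda x: np.exp(-x ** 2 / 2) / sqrt(2 * pi)   # N(0,1) density
h_exact = 0.5 * log(2 * pi * e)                    # h(X) for X ~ N(0,1)

r = 1.0
p = 1 - erf(r / sqrt(2))                           # p(r) = P[|X| > r]

def h_region(intervals):
    """Differential entropy of X conditioned on a union of intervals,
    computed by numerical integration of the conditional density."""
    xs = [np.linspace(a, b, 100001) for a, b in intervals]
    mass = sum(trap(f(x), x) for x in xs)
    h = 0.0
    for x in xs:
        g = f(x) / mass                            # conditional density
        h += -trap(g * np.log(g), x)
    return h

h_in = h_region([(-r, r)])                         # h(X | |X| <= r)
h_out = h_region([(-12.0, -r), (r, 12.0)])         # h(X | |X| >  r)
hb = -p * log(p) - (1 - p) * log(1 - p)            # binary entropy h(p)
assert abs(h_exact - (p * h_out + (1 - p) * h_in + hb)) < 1e-5   # (6.181)
```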
Chapter 7
Conclusion and Future Work
Let us recall the main idea of the thesis: information measures (or their linear combi-
nation) generally have two equivalent characterizations: the functional representation
and the entropic representation. The existence of such equivalent characterizations
can be understood in terms of the duality of convex optimizations, even though tech-
nically speaking the optimization problem is not always convex (consider the case
where the linear combination of convex information measures may contain both pos-
itive and negative coefficients). As far as we have seen, the functional representation
is very useful in proving non-asymptotic bounds for operational problems. In particular, the reverse hypercontractivity argument introduced in Chapter 4 appears to be tailor-made for the functional representation. On the other hand, the entropic formulation appears more convenient in proving certain abstract properties such as data
processing, tensorization and the Gaussian optimality, leveraging tools familiar to
information theorists such as the chain rule.
Below, we mention two scenarios of common randomness generation where the
extension of the proposed functional approach appears challenging. These scenarios
are related to some of the author’s publications which have not been discussed in the
main text. The strong converse counterpart of these results appears to be open.
• Rate-limited interactive communication. In the literature, a very common protocol for common randomness generation allows terminals to communicate with each other [210], in contrast to the one-communicator protocol we discussed
(Figure 1.1) where only one terminal is allowed to send messages to others. We
remark that without communication rate constraints, the maximum secret key
rate has known single-letter expressions [210][211] and second-order rate results
[55]. However when the communication rate is limited, we can only characterize
the tradeoff between the communication rate and the secret key (or common
randomness) rate using the same number of auxiliary random variables as the
number of rounds of communications [102]. Does the strong converse hold in
this setting? The challenge appears to be that no corresponding functional in-
equality is known when two or more auxiliary random variables are involved. A
related open problem is the source-channel network with two or more helpers:
not only the strong converse but even the first-order rate region is generally
unknown [8].
• Demanding secrecy. If we further impose that the common randomness generated in the one-communicator problem (Figure 1.1) is independent of the public communications (i.e. the secret key generation problem), can we still
show a strong converse or even a second-order result? The secret key generation problem is formally related to the wiretap channel (see the remark in [18]). We note that the second-order rate for the wiretap channel has been open for non-degenerate channels [212], in which case the rate region involves an auxiliary random variable and is very similar to the rate region of the one-communicator problem, whose second-order rate we solved in Chapter 4.
Bibliography
[1] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, July 1948.
[2] C. E. Shannon, “The bandwagon,” IRE Trans. Inf. Theory, vol. 2, no. 1, p. 3, 1956.
[3] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2nd ed., 2012.
[4] S. Verdu, “Fifty years of Shannon theory,” IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2057–2078, Oct. 1998.
[5] P. Gacs and J. Korner, “Common information is far less than mutual information,” Problems of Control and Information Theory, vol. 2, no. 2, pp. 149–162, 1973.
[6] A. D. Wyner, “The common information of two dependent random variables,” IEEE Trans. Inf. Theory, vol. 21, no. 2, pp. 163–179, Mar. 1975.
[7] R. Ahlswede and P. Gacs, “Spreading of sets in product spaces and hypercontraction of the Markov operator,” The Annals of Probability, vol. 4, pp. 925–939, 1976.
[8] I. Csiszar and J. Korner, Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, 2nd ed., 2011.
[9] V. Anantharam, A. Gohari, S. Kamath, and C. Nair, “On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover,” arXiv preprint arXiv:1304.6133, 2013.
[10] H. O. Hirschfeld, “A connection between correlation and contingency,” Proc. Cambridge Philosophical Soc., vol. 31, pp. 520–524, 1935.
[11] H. Gebelein, “Das statistische Problem der Korrelation als Variations- und Eigenwertproblem und sein Zusammenhang mit der Ausgleichungsrechnung,” Zeitschrift fur angew. Math. und Mech., vol. 21, pp. 364–379, 1941.
[12] A. Renyi, “On measures of dependence,” Acta Math. Hung., vol. 10, no. 3-4, pp. 441–451, 1959.
[13] C. E. Shannon, “Coding theorems for a discrete source with a fidelity criterion,” IRE Int. Conv. Rec., vol. 7, part 4, pp. 142–163, 1959.
[14] U. M. Maurer, “Secret key agreement by public discussion from common information,” IEEE Trans. Inf. Theory, vol. 39, no. 3, pp. 733–742, Mar. 1993.
[15] R. Ahlswede and I. Csiszar, “Common randomness in information theory and cryptography. I. Secret sharing,” IEEE Trans. Inf. Theory, vol. 39, no. 4, pp. 1121–1132, Apr. 1993.
[16] R. Ahlswede and I. Csiszar, “Common randomness in information theory and cryptography. II. CR capacity,” IEEE Trans. Inf. Theory, vol. 44, no. 1, pp. 225–240, Jan. 1998.
[17] C. E. Shannon, “Two-way communication channels,” in Proc. 4th Berkeley Symp. Math. Statist. Probab., vol. 1, pp. 611–644, 1961.
[18] A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge University Press, 2011.
[19] C. Shannon, “The zero error capacity of a noisy channel,” IRE Trans. Inf. Theory, vol. 2, no. 3, pp. 8–19, 1956.
[20] C. E. Shannon, R. G. Gallager, and E. Berlekamp, “Lower bounds to error probability for coding on discrete memoryless channels, I,” Information and Control, vol. 10, no. 1, pp. 65–103, 1967.
[21] G. Forney Jr., “Exponential error bounds for erasure, list, and decision feedback schemes,” IEEE Trans. Inf. Theory, vol. 14, no. 2, pp. 206–220, Mar. 1968.
[22] R. Ahlswede and G. Dueck, “Identification via channels,” IEEE Trans. Inf. Theory, vol. 35, no. 1, pp. 15–29, 1989.
[23] R. Ahlswede, N. Cai, and Z. Zhang, “Erasure, list, and detection zero error capacities for low noise and a relation to identification,” IEEE Trans. Inf. Theory, vol. 42, no. 1, pp. 55–62, Jan. 1996.
[24] J. Korner and A. Orlitsky, “Zero-error information theory,” IEEE Trans. Inf. Theory, vol. 44, pp. 2207–2229, Oct. 1998.
[25] I. Csiszar and J. Korner, Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, 1st ed., 1981.
[26] J. Wolfowitz, “Notes on a general strong converse,” Information and Control, vol. 12, pp. 1–4, 1968.
[27] R. Ahlswede, P. Gacs, and J. Korner, “Bounds on conditional probabilities with applications in multi-user communication,” Z. Wahrscheinlichkeitstheorie verw. Gebiete, vol. 34, no. 2, pp. 157–177, 1976. Correction (1977), ibid., vol. 39, no. 4, pp. 353–354.
[28] R. M. Fano, “Transmission of information,” unpublished course notes, Massachusetts Institute of Technology, Cambridge, MA, 1952.
[29] M. Braverman and A. Rao, “Information equals amortized communication,” IEEE Trans. Inf. Theory, vol. 60, no. 10, pp. 6058–6069, 2014.
[30] A. Orlitsky, N. Santhanam, and J. Zhang, “Relative redundancy for large alphabets,” in Proc. of IEEE International Symposium on Information Theory, pp. 2672–2676, Seattle, WA, July 2006.
[31] B. Kelly, Information Theory with Large Alphabets and Source Coding Error Exponents. PhD thesis, Cornell University, 2011.
[32] Y. Polyanskiy, “A perspective on massive random-access,” in Proc. IEEE International Symposium on Information Theory, Aachen, Germany, July 2017.
[33] V. Strassen, “Asymptotische Abschatzungen in Shannons Informationstheorie,” in Trans. 3rd Prague Conf. Inf. Theory, pp. 689–723, Prague, 1962.
[34] Y. Polyanskiy, H. V. Poor, and S. Verdu, “Channel coding rate in the finite blocklength regime,” IEEE Trans. Inf. Theory, vol. 56, no. 5, pp. 2307–2359, May 2010.
[35] M. Hayashi, “Information spectrum approach to second-order coding rate in channel coding,” IEEE Trans. Inf. Theory, vol. 55, pp. 4947–4966, Nov. 2009.
[36] V. Kostina and S. Verdu, “Fixed-length lossy compression in the finite blocklength regime,” IEEE Trans. Inf. Theory, vol. 58, no. 6, pp. 3309–3338, June 2012.
[37] M. H. Yassaee, M. R. Aref, and A. Gohari, “A technique for deriving one-shot achievability results in network information theory,” arXiv preprint arXiv:1303.0696, 2013.
[38] S. Watanabe, S. Kuzuoka, and V. Y. F. Tan, “Nonasymptotic and second-order achievability bounds for coding with side-information,” IEEE Trans. Inf. Theory, vol. 61, Apr. 2015.
[39] P. Cuff, Communication in Networks for Coordinating Behavior. PhD thesis, Stanford University, 2009.
[40] V. Y. F. Tan and O. Kosut, “On the dispersions of three network information theory problems,” IEEE Trans. Inf. Theory, vol. 60, pp. 881–903, Feb. 2014.
[41] S. Verdu and T. S. Han, “A general formula for channel capacity,” IEEE Trans. Inf. Theory, vol. 40, pp. 1147–1157, July 1994.
[42] Y. Polyanskiy and S. Verdu, “Arimoto channel coding converse and Renyi divergence,” in 48th Annual Allerton Conference on Communication, Control, and Computing, pp. 1327–1333, Monticello, Illinois, Oct. 2010.
[43] S. Verdu, Information Theory. Princeton University Press (in preparation).
[44] T. S. Han, Information-Spectrum Methods in Information Theory. Springer-Verlag Berlin Heidelberg, 2003.
[45] Y. Oohama, “Strong converse exponent for degraded broadcast channels at rates outside the capacity region,” in Proc. of IEEE International Symposium on Information Theory, pp. 939–943, Hong Kong, China, June 2015.
[46] R. Ahlswede, “Coloring hypergraphs: A new approach to multi-user source coding, 1,” Journal of Combinatorics, Information & System Sciences, vol. 4, no. 1, 1979.
[47] A. Ingber and Y. Kochman, “The dispersion of lossy source coding,” in Proc. of the Data Compression Conference (DCC), pp. 53–62, Snowbird, UT, Mar. 2011.
[48] M. Raginsky and I. Sason, “Concentration of measure inequalities in information theory, communications, and coding,” Foundations and Trends in Communications and Information Theory, vol. 10, no. 1-2, pp. 1–250, 2014.
[49] A. D. Wyner, “On source coding with side information at the decoder,” IEEE Trans. Inf. Theory, vol. 21, pp. 294–300, May 1975.
[50] V. Y. F. Tan, “Asymptotic estimates in information theory with non-vanishing error probabilities,” Foundations and Trends in Communications and Information Theory, vol. 11, no. 1-2, pp. 1–184, 2014.
[51] R. M. Gray and A. D. Wyner, “Source coding for a simple network,” Bell System Technical Journal, vol. 53, no. 9, pp. 1681–1721, Nov. 1974.
[52] S. Watanabe, “Second-order region for Gray-Wyner network,” IEEE Trans. Inf. Theory, vol. 63, pp. 1006–1018, Feb. 2017.
[53] R. Renner and S. Wolf, “Simple and tight bounds for information reconciliation and privacy amplification,” in Advances in Cryptology-ASIACRYPT 2005, pp. 199–216, Springer, 2005.
[54] M. Hayashi, H. Tyagi, and S. Watanabe, “Secret key agreement: General capacity and second-order asymptotics,” in Proc. IEEE International Symposium on Information Theory, pp. 1136–1140, Honolulu, HI, June-July 2014.
[55] H. Tyagi and S. Watanabe, “A bound for multiparty secret key agreement and implications for a problem of secure computing,” in Advances in Cryptology-EUROCRYPT 2014, pp. 369–386, Springer, 2014.
[56] E. Mossel, R. O’Donnell, O. Regev, J. E. Steif, and B. Sudakov, “Non-interactive correlation distillation, inhomogeneous Markov chains, and the reverse Bonami-Beckner inequality,” Israel Journal of Mathematics, vol. 154, no. 1, 2006.
[57] C. Villani, Optimal Transport: Old and New, vol. 338. Springer Science & Business Media, 2008.
[58] M. Talagrand, “A new look at independence,” The Annals of Probability, vol. 24, no. 1, pp. 1–34, 1996.
[59] V. D. Milman, “New proof of the theorem of A. Dvoretzky on intersections of convex bodies,” Functional Analysis and Its Applications, vol. 5, no. 4, pp. 288–295, 1971.
[60] W. T. Gowers, “The two cultures of mathematics,” Mathematics: Frontiers and Perspectives (V. Arnold et al., eds.), pp. 65–78, 2000.
[61] G. A. Margulis, “Veroyatnostniye characteristiki grafov s bolshoy svyaznostyu [in Russian],” Problems of Information Transmission, vol. 10, no. 2, pp. 174–179, 1974.
[62] “Interview with Katalin Marton,” slides available online at http://media.itsoc.org/marton-interview.pdf.
[63] K. Marton, “A simple proof of the blowing-up lemma,” IEEE Trans. Inf. Theory, vol. 24, pp. 857–866, 1986.
[64] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
[65] S. Varadhan, Large Deviations and Applications. Society for Industrial and Applied Mathematics (SIAM), 1984.
[66] M. D. Donsker and S. R. S. Varadhan, “Asymptotic evaluation of certain Markov process expectations for large time, I,” Comm. Pure Appl. Math., vol. 28, pp. 1–47, 1975.
[67] P. D. Lax, Functional Analysis. John Wiley & Sons, Inc., 2002.
[68] N. Bourbaki, Integration (Chaps. I-IV, Actualites Scientifiques et Industrielles, no. 1175). Paris: Hermann, 1952.
[69] E. Nelson, “The free Markov field,” J. Funct. Anal., vol. 12, no. 2, pp. 211–227, 1973.
[70] L. Gross, “Logarithmic Sobolev inequalities,” American Journal of Mathematics, vol. 97, no. 4, pp. 1061–1083, 1975.
[71] J. Kahn, G. Kalai, and N. Linial, “The influence of variables on Boolean functions,” in 29th Annual Symposium on Foundations of Computer Science, pp. 68–80, 1988.
[72] E. Mossel, R. O’Donnell, and K. Oleszkiewicz, “Noise stability of functions with low influences: Invariance and optimality,” Annals of Mathematics, vol. 171, no. 1, pp. 295–341, 2010.
[73] A. Xu and M. Raginsky, “Converses for distributed estimation via strong data processing inequalities,” in Proc. of the 2015 IEEE International Symposium on Information Theory (ISIT), pp. 2376–2380, Hong Kong, China, July 2015.
[74] C. Nair, “Equivalent formulations of hypercontractivity using information measures,” IZS Workshop, Feb. 2014.
[75] E. A. Carlen, E. H. Lieb, and M. Loss, “A sharp analog of Young’s inequality on S^N and related entropy inequalities,” The Journal of Geometric Analysis, vol. 14, no. 3, pp. 487–520, 2004.
[76] E. A. Carlen and D. Cordero-Erausquin, “Subadditivity of the entropy and its relation to Brascamp–Lieb type inequalities,” Geometric and Functional Analysis, vol. 19, no. 2, pp. 373–405, 2009.
[77] E. Mossel, K. Oleszkiewicz, and A. Sen, “On reverse hypercontractivity,” Geometric and Functional Analysis, vol. 23, no. 3, pp. 1062–1097, 2013.
[78] S. Kamath, “Reverse hypercontractivity using information measures,” in Proc. of the 53rd Annual Allerton Conference on Communications, Control and Computing, pp. 627–633, UIUC, Illinois, Sept. 30-Oct. 2, 2015.
[79] S. Beigi and C. Nair, “Equivalent characterization of reverse Brascamp-Lieb type inequalities using information measures,” in Proc. of IEEE International Symposium on Information Theory, pp. 1038–1042, Barcelona, Spain, July 2016.
[80] J. Liu, T. A. Courtade, P. Cuff, and S. Verdu, “Brascamp-Lieb inequality and its reverse: An information theoretic view,” in Proceedings of 2016 IEEE International Symposium on Information Theory, pp. 1048–1052, Barcelona, Spain, July 2016.
[81] M. Ledoux, “Isoperimetry and Gaussian analysis,” in Lectures on Probability Theory and Statistics, pp. 165–294, Springer, 1996.
[82] S. G. Bobkov, I. Gentil, and M. Ledoux, “Hypercontractivity of Hamilton-Jacobi equations,” J. Math. Pures Appl., vol. 80, no. 7, pp. 669–696, 2001.
[83] H. J. Brascamp and E. H. Lieb, “Best constants in Young’s inequality, its converse, and its generalization to more than three functions,” Advances in Mathematics, vol. 20, no. 2, pp. 151–173, 1976.
[84] F. Barthe, “The Brunn-Minkowski theorem and related geometric and functional inequalities,” in International Congress of Mathematicians, vol. 2, pp. 1529–1546, 2006.
[85] F. Avram, N. Leonenko, and L. Sakhno, “Harmonic analysis tools for statistical inference in the spectral domain,” Dependence in Probability and Statistics, pp. 59–70, 2010.
[86] A. Garg, L. Gurvits, R. Oliveira, and A. Wigderson, “Algorithmic aspects of Brascamp-Lieb inequalities,” arXiv preprint arXiv:1607.06711, 2016.
[87] A. Dembo, T. M. Cover, and J. A. Thomas, “Information theoretic inequalities,” IEEE Trans. Inf. Theory, vol. 37, no. 6, pp. 1501–1518, Nov. 1991.
[88] F. Barthe, “On a reverse form of the Brascamp-Lieb inequality,” Inventiones mathematicae, vol. 134, no. 2, pp. 335–361, 1998.
[89] E. H. Lieb, “Gaussian kernels have only Gaussian maximizers,” Inventiones mathematicae, vol. 102, no. 1, pp. 179–208, 1990.
[90] F. Barthe, “Optimal Young’s inequality and its converse: a simple proof,” Geometric & Functional Analysis, vol. 8, no. 2, pp. 234–242, 1998.
[91] F. Barthe, D. Cordero-Erausquin, M. Ledoux, and B. Maurey, “Correlation and Brascamp-Lieb inequalities for Markov semigroups,” International Mathematics Research Notices, pp. 2177–2216, 2011.
[92] Y. Geng and C. Nair, “The capacity region of the two-receiver Gaussian vector broadcast channel with private and common messages,” IEEE Trans. Inf. Theory, vol. 60, no. 4, pp. 2087–2104, 2014.
[93] J. Liu, T. A. Courtade, P. Cuff, and S. Verdu, “Information theoretic perspectives on Brascamp-Lieb inequality and its reverse,” arXiv:1702.06260.
[94] R. T. Rockafellar, Convex Analysis. Princeton University Press, 2015.
[95] H. J. Brascamp and E. H. Lieb, “On extensions of the Brunn-Minkowski and Prekopa-Leindler theorems, including inequalities for log concave functions, and with an application to the diffusion equation,” J. Funct. Anal., vol. 22, no. 4, 1976.
[96] S. G. Bobkov and M. Ledoux, “From Brunn-Minkowski to Brascamp-Lieb and to logarithmic Sobolev inequalities,” Geom. Funct. Anal., vol. 10, no. 5, pp. 1028–1052, 2000.
[97] D. Cordero-Erausquin, “Transport inequalities for log-concave measures, quantitative forms and applications,” arXiv preprint arXiv:1504.06147, 2015.
[98] J. Bennett, A. Carbery, M. Christ, and T. Tao, “The Brascamp–Lieb inequalities: finiteness, structure and extremals,” Geometric and Functional Analysis, vol. 17, no. 5, pp. 1343–1415, 2008.
[99] R. Gardner, “The Brunn-Minkowski inequality,” Bulletin of the American Mathematical Society, vol. 39, no. 3, pp. 355–405, 2002.
[100] W. Beckner, “Inequalities in Fourier analysis on R^n,” Proceedings of the National Academy of Sciences, vol. 72, no. 2, pp. 638–641, 1975.
[101] J. Liu, P. Cuff, and S. Verdu, “Secret key generation with one communicator and a one-shot converse via hypercontractivity,” in Proceedings of 2015 IEEE International Symposium on Information Theory, pp. 710–714, Hong Kong, China, June 2015.
[102] J. Liu, P. Cuff, and S. Verdu, “Secret key generation with limited interaction,” IEEE Trans. Inf. Theory, accepted, 2017.
[103] J. Lehec, “Representation formula for the entropy and functional inequalities,” arXiv preprint arXiv:1006.3028, 2010.
[104] C. Villani, Topics in Optimal Transportation. No. 58, American Mathematical Soc., 2003.
[105] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications, vol. 38. Springer Science & Business Media, 2009.
[106] T. Tao, “245B, Notes 12: Continuous functions on locally compact Hausdorff spaces.” https://terrytao.wordpress.com/2009/03/02/245b-notes-12-continuous-functions-on-locally-compact-hausdorff-spaces/, Mar. 2, 2009.
[107] S. Mac Lane, Categories for the Working Mathematician. Springer Science and Business Media, 1978.
[108] A. Hatcher, Algebraic Topology. Cambridge University Press, 2002.
[109] Y. V. Prokhorov, “Convergence of random processes and limit theorems in probability theory,” Theory of Probability & Its Applications, vol. 1, no. 2, pp. 157–214, 1956.
[110] R. Atar and N. Merhav, “Information-theoretic applications of the logarithmic probability comparison bound,” arXiv preprint arXiv:1412.6769, 2014.
[111] J. Radhakrishnan, “Entropy and counting,” Computational Mathematics, Modelling and Algorithms, vol. 146, 2003.
[112] M. Madiman and P. Tetali, “Information inequalities for joint distributions with interpretations and applications,” IEEE Trans. Inf. Theory, vol. 56, no. 6, pp. 2699–2713, June 2010.
[113] S. G. Bobkov and F. Gotze, “Exponential integrability and transportation cost related to logarithmic Sobolev inequalities,” J. Funct. Anal., vol. 163, no. 1, pp. 1–28, 1999.
[114] E. Erkip and T. M. Cover, “The efficiency of investment information,” IEEE Trans. Inf. Theory, vol. 44, no. 3, pp. 1026–1040, Mar. 1998.
[115] T. A. Courtade, “Outer bounds for multiterminal source coding via a strong data processing inequality,” in Proceedings of 2013 IEEE International Symposium on Information Theory, pp. 559–563, Istanbul, Turkey, July 2013.
[116] Y. Polyanskiy and Y. Wu, “Dissipation of information in channels with input constraints,” IEEE Trans. Inf. Theory, vol. 62, no. 1, pp. 35–55, Jan. 2016.
[117] Y. Polyanskiy and Y. Wu, “A note on the strong data-processing inequalities in Bayesian networks,” http://arxiv.org/pdf/1508.06025v1.pdf.
[118] J. Liu, P. Cuff, and S. Verdu, “Key capacity for product sources with application to stationary Gaussian processes,” IEEE Trans. Inf. Theory, vol. 62, pp. 984–1005, Feb. 2016.
[119] S. Kamath and V. Anantharam, “On non-interactive simulation of joint distributions,” arXiv preprint arXiv:1505.00769, 2015.
[120] A. Ganor, G. Kol, and R. Raz, “Exponential separation of information and communication,” in 2014 IEEE 55th Annual Symposium on Foundations of Computer Science (FOCS), pp. 176–185.
[121] Z. Dvir and G. Hu, “Sylvester-Gallai for arrangements of subspaces,” arXiv:1412.0795, 2014.
[122] M. Braverman, A. Garg, T. Ma, H. L. Nguyen, and D. P. Woodruff, “Communication lower bounds for statistical estimation problems via a distributed data processing inequality,” arXiv preprint arXiv:1506.07216, 2015.
[123] M. Talagrand, “On Russo’s approximate zero-one law,” The Annals of Probability, pp. 1576–1587, 1994.
[124] E. Friedgut, G. Kalai, and A. Naor, “Boolean functions whose Fourier transform is concentrated on the first two levels,” Advances in Applied Mathematics, vol. 29, no. 3, pp. 427–437, 2002.
[125] J. Bourgain, “On the distribution of the Fourier spectrum of Boolean functions,” Israel Journal of Mathematics, vol. 131, no. 1, pp. 269–276, 2002.
[126] C. Garban, G. Pete, and O. Schramm, “The Fourier spectrum of critical percolation,” Acta Mathematica, vol. 205, no. 1, pp. 19–104, 2010.
[127] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, “Local privacy and statistical minimax rates,” in IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pp. 429–438, 2013.
[128] K. Dvijotham and E. Todorov, “A unifying framework for linearly solvable control,” in Proc. 27th Conf. on Uncertainty in Artificial Intelligence, pp. 1–8.
[129] R. Atar, K. Chowdhary, and P. Dupuis, “Robust bounds on risk-sensitive functionals via Renyi divergence,” SIAM J. Uncertainty Quant., vol. 3, pp. 18–33, 2015.
[130] T. van Erven and P. Harremoes, “Renyi divergence and Kullback-Leibler divergence,” IEEE Trans. Inf. Theory, vol. 60, pp. 3797–3820, July 2014.
[131] O. Shayevitz, “A note on a characterization of Renyi measures and its relation to composite hypothesis testing,” arXiv:1012.4401v1, 2010.
[132] I. Sason and S. Verdu, “f-divergence inequalities,” IEEE Trans. Inf. Theory, vol. 62, no. 11, pp. 5973–6006, Nov. 2016.
[133] I. Sason and S. Verdu, “Arimoto-Renyi conditional entropy and Bayesian m-ary hypothesis testing,” IEEE Trans. Inf. Theory, Jan. 2018.
[134] L. H. Loomis and H. Whitney, “An inequality related to the isoperimetric inequality,” Bull. Amer. Math. Soc., vol. 55, pp. 961–962, 1949.
[135] F. R. K. Chung, R. L. Graham, P. Frankl, and J. B. Shearer, “Some intersection theorems for ordered sets and graphs,” J. Combinatorial Theory, Series A, vol. 43, no. 1, pp. 23–37, 1986.
[136] S. T. Rachev, Probability Metrics and the Stability of Stochastic Models. John Wiley & Sons Ltd., Chichester, 1991.
[137] K. Marton, “A simple proof of the blowing-up lemma,” IEEE Trans. Inf. Theory, vol. 32, no. 3, pp. 445–446, 1986.
[138] K. Marton, “Bounding d-distance by informational divergence: a method to prove measure concentration,” The Annals of Probability, vol. 24, no. 2, pp. 857–866, 1996.
[139] R. Impagliazzo and D. Zuckerman, “How to recycle random bits,” in Proc. 1989 Symposium on Foundations of Computer Science (FOCS), pp. 248–253, 1989.
[140] C. H. Bennett, G. Brassard, C. Crepeau, and U. Maurer, “Generalized privacy amplification,” IEEE Trans. Inf. Theory, vol. 41, no. 6, pp. 1915–1923, 1995.
[141] J. Hastad, R. Impagliazzo, L. A. Levin, and M. G. Luby, “A pseudorandom generator from any one-way function,” SIAM Journal on Computing, vol. 28, no. 4, pp. 1364–1396, 1999.
[142] S. Fehr and S. Berens, “On the conditional Renyi entropy,” IEEE Trans. Inf. Theory, vol. 60, pp. 6801–6810, Nov. 2014.
[143] J. Liu, P. Cuff, and S. Verdu, “Eγ-resolvability,” IEEE Trans. Inf. Theory, vol. 63, no. 5, pp. 2629–2658, May 2017.
[144] J. Liu, T. A. Courtade, P. Cuff, and S. Verdu, “Smoothing Brascamp-Lieb inequalities and strong converses for common randomness generation (extended),” http://www.princeton.edu/~jingbo/preprints/ISITsmoothBL2016.pdf.
[145] J. Liu, T. A. Courtade, P. Cuff, and S. Verdu, “Smoothing Brascamp-Lieb inequalities and strong converses for common randomness generation,” in Proc. of IEEE International Symposium on Information Theory, pp. 1043–1047, Barcelona, Spain, July 2016.
[146] Y. Polyanskiy, Channel Coding: Non-asymptotic Fundamental Limits. PhD thesis, Princeton University, Princeton, NJ, 2010.
[147] M. Ledoux and M. Talagrand, Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.
[148] C. Borell, “Positivity improving operators and hypercontractivity,” Mathematische Zeitschrift, vol. 180, no. 3, pp. 225–234, 1982.
[149] Y. Polyanskiy and S. Verdu, “Empirical distribution of good channel codes with nonvanishing error probability,” IEEE Trans. Inf. Theory, vol. 60, no. 1, pp. 5–21, 2014.
[150] J. Liu, P. Cuff, and S. Verdu, “On α-decodability and α-likelihood decoder,” in 55th Annual Allerton Conference on Communication, Control, and Computing, Monticello, Illinois, Oct. 2017.
[151] D. Guo, S. Shamai, and S. Verdu, “Mutual information and minimum mean-square error in Gaussian channels,” IEEE Trans. Inf. Theory, vol. 51, no. 4, pp. 1261–1282, 2005.
[152] S. L. Fong and V. Y. Tan, “A proof of the strong converse theorem for Gaussian broadcast channels via the Gaussian Poincare inequality,” arXiv preprint arXiv:1509.01380, 2015.
[153] P. Caputo, G. Menz, and P. Tetali, “Approximate tensorization of entropy at high temperature,” arXiv:1405.0608v3, 2015.
[154] H. Tyagi and P. Narayan, “The Gelfand-Pinsker channel: Strong converse and upper bound for the reliability function,” in Proc. of IEEE International Symposium on Information Theory, pp. 1954–1958, Seoul, Korea, June-July 2009.
[155] T. Han and S. Verdu, “Approximation theory of output statistics,” IEEE Trans. Inf. Theory, vol. 39, no. 3, pp. 752–772, May 1993.
[156] A. El Gamal and E. C. van der Meulen, “A proof of Marton’s coding theorem for the discrete memoryless broadcast channel,” IEEE Trans. Inf. Theory, vol. 27, no. 1, pp. 120–122, Jan. 1981.
[157] Y. Polyanskiy, private communication, June 2014.
[158] I. Csiszar, “Information-type measures of difference of probability distributions and indirect observation,” Studia Sci. Math. Hungar., vol. 2, pp. 229–318, 1967.
[159] P. Cuff, “Distributed channel synthesis,” IEEE Trans. Inf. Theory, vol. 59, no. 11, pp. 7071–7096, Nov. 2013.
[160] E. C. Song, P. Cuff, and H. V. Poor, “The likelihood encoder for lossy compression,” IEEE Trans. Inf. Theory, vol. 62, no. 4, pp. 1836–1849, Apr. 2016.
[161] S. Verdu, “Non-asymptotic achievability bounds in multiuser information theory,” in 50th Annual Allerton Conference on Communication, Control, and Computing, pp. 1–8, 2012.
[162] S. Verdu, “Total variation distance and the distribution of relative information,” in Workshop on Information Theory and Applications, UCSD, Feb. 10-15, 2014.
[163] A. Renyi, “On measures of entropy and information,” in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 547–561, 1961.
[164] N. A. Warsi, “One-shot bounds for various information theoretic problems using smooth min and max Renyi divergences,” in Proceedings of 2013 Information Theory Workshop (ITW), pp. 1–5, Sept. 2013.
[165] L. Wang, R. Colbeck, and R. Renner, “Simple channel coding bounds,” in Proceedings of 2009 IEEE International Symposium on Information Theory, pp. 1804–1808, Seoul, South Korea, June-July 2009.
[166] M. M. Wilde, Quantum Information Theory. Cambridge University Press, 2013.
[167] M. S. Pinsker, Information and Information Stability of Random Variables and Processes. San Francisco: Holden-Day, 1964; originally published in Russian in 1960.
[168] P. Cuff, “Soft covering with high probability,” in Proceedings of 2016 IEEE International Symposium on Information Theory, pp. 2963–2967, Barcelona, Spain, July 2016.
[169] Y. Steinberg and S. Verdu, “Simulation of random processes and rate-distortion theory,” IEEE Trans. Inf. Theory, vol. 42, no. 1, pp. 63–86, 1996.
[170] I. Csiszar, “Almost independence and secrecy capacity,” Problems Inf. Transmission, vol. 32, no. 1, pp. 40–47, 1996.
[171] M. Hayashi, “General nonasymptotic and asymptotic formulas in channel resolvability and identification capacity and their application to the wiretap channel,” IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1562–1575, Apr. 2006.
[172] M. Bloch and N. Laneman, “Strong secrecy from channel resolvability,” IEEE Trans. Inf. Theory, vol. 59, no. 12, pp. 8077–8098, Dec. 2013.
[173] T. S. Han, H. Endo, and M. Sasaki, “Reliability and security functions of the wiretap channel under cost constraint,” IEEE Trans. Inf. Theory, vol. 60, no. 11, pp. 6819–6843, 2014.
[174] Y. Steinberg and S. Verdu, “Channel simulation and coding with side information,” IEEE Trans. Inf. Theory, vol. 40, no. 3, pp. 634–646, May 1994.
[175] S. Shamai and S. Verdu, “The empirical distribution of good codes,” IEEE Trans. Inf. Theory, vol. 43, no. 3, pp. 836–846, 1997.
[176] S. Watanabe and M. Hayashi, “Strong converse and second-order asymptotics of channel resolvability,” in Proceedings of 2014 IEEE International Symposium on Information Theory, pp. 1882–1886, Honolulu, HI, USA, July 2014.
[177] Z. Goldfeld, P. Cuff, and H. H. Permuter, “Semantic-security capacity for wiretap channels of type II,” in Proceedings of 2016 IEEE International Symposium on Information Theory, pp. 2799–2803, Barcelona, Spain, July 2016.
[178] M. Tahmasbi and M. R. Bloch, “Second-order asymptotics of covert communications over noisy channels,” in Proceedings of 2016 IEEE International Symposium on Information Theory, pp. 2224–2228, Barcelona, Spain, July 2016.
[179] S. Verdu and T. S. Han, “A general formula for channel capacity,” IEEE Trans. Inf. Theory, vol. 40, no. 4, pp. 1147–1157, 1994.
[180] A. Orlitsky and J. R. Roche, “Coding for computing,” IEEE Trans. Inf. Theory, vol. 47, no. 3, pp. 903–917, Mar. 2001.
[181] J. Liu, J. Jin, and Y. Gu, “Robustness of sparse recovery via f-minimization: A topological viewpoint,” IEEE Trans. Inf. Theory, vol. 61, pp. 3996–4014, July 2015.
[182] J. K. Omura, “A lower bounding method for channel and source coding probabilities,” Information and Control, vol. 27, no. 2, pp. 148–177, 1975.
[183] Y. Liang and G. Kramer, “Rate regions for relay broadcast channels,” IEEE Trans. Inf. Theory, vol. 53, pp. 3517–3535, Oct. 2007.
[184] S. I. Gel’fand and M. S. Pinsker, “Capacity of a broadcast channel with one deterministic component,” Problems Inf. Transmission.
[185] Y. Liang, G. Kramer, and H. V. Poor, “On the equivalence of two achievable regions for the broadcast channel,” IEEE Trans. Inf. Theory, vol. 57, pp. 95–100, Jan. 2011.
[186] J. Liu, P. Cuff, and S. Verdu, “One-shot mutual covering lemma and Marton’sinner bound with a common message,” in Proceedings of 2015 IEEE Interna-tional Symposium on Information Theory, pp. 1457–1461, Hong Kong, China,June 2015.
[187] S. Verdu, “Non-asymptotic covering lemmas,” in Proc. 2015 IEEE InformationTheory Workshop, Jerusalem, Israel, April 26-May 1, 2015.
[188] M. H. Yassaee, M. R. Aref, and A. Gohari, “A technique for deriving one-shot achievability results in network information theory,” in Proceedings ofIEEE International Symposium on Information Theory, pp. 1287–1291, Istan-bul, Turkey, Jul. 2013.
[189] M. H. Yassaee, “One-shot achievability via fidelity,” in Proceedings of 2015IEEE International Symposium on Information Theory, pp. 301–305, HongKong, China, June 2015.
[190] M. H. Yassaee, J. Liu, and S. Verdu, “One-shot multivariate covering lemmasvia weighted sum and concentration inequalities,” in Proceedings of the 2017IEEE International Symposium on Information Theory, pp. 669–673, June 25–30, 2017.
[191] J. Liu, P. Cuff, and S. Verdu, “Key capacity with limited one-way communica-tion for product sources,” in Proceedings of 2014 IEEE International Symposiumon Information Theory, pp. 1146–1150, Honolulu, HI, USA, July 2014.
[192] A. D. Wyner, “The wire-tap channel,” The Bell System Technical Journal,vol. 54, no. 8, pp. 1355–1387, 1975.
[193] J. Hou and G. Kramer, “Effective secrecy: Reliability, confusion and stealth,”in Proceedings of 2014 IEEE International Symposium on Information Theory,pp. 601–605, Honolulu, HI, USA, July 2014.
[194] M. Bloch, “A channel resolvability perspective on stealth communications,” inProceedings of 2015 IEEE International Symposium on Information Theory,pp. 601–605, Hong Kong, June 2015.
[195] T. Tao, “Amplification, arbitrage, and the tensor power trick,”https://terrytao.wordpress.com/2007/09/05/amplification-arbitrage-and-the-tensor-power-trick/.
[196] T. Tao, “The Brunn-Minkowski inequality for nilpotent groups,” [On-line]. Available: https: // terrytao. wordpress. com/ 2011/ 09/ 16/
the-brunn-minkowski-inequality-for-nilpotent-groups/ , Sept. 16,2011.
[197] E. A. Carlen, “Superadditivity of Fisher’s information and logarithmic Sobolev inequalities,” Journal of Functional Analysis, vol. 101, no. 1, pp. 194–211, 1991.
[198] H. Cramér, “Über eine Eigenschaft der normalen Verteilungsfunktion,” Mathematische Zeitschrift, vol. 41, no. 1, pp. 405–414, 1936.
[199] C. Nair, “An extremal inequality related to hypercontractivity of Gaussian random variables,” ITA 2014, 2014.
[200] T. Courtade and J. Jiao, “An extremal inequality for long Markov chains,” arXiv preprint arXiv:1404.6984, 2014.
[201] T. A. Courtade, “Strengthening the entropy power inequality,” in Proc. of IEEE International Symposium on Information Theory, pp. 2294–2298, Barcelona, Spain, July 2016.
[202] Y. Polyanskiy and Y. Wu, “Lecture notes on information theory.” MIT (6.441), UIUC (ECE 563), 2012–2014. [Online]. Available: http://people.lids.mit.edu/yp/homepage/data/itlectures_v3.pdf.
[203] S. G. Bobkov and G. P. Chistyakov, “Entropy power inequality for the Rényi entropy,” IEEE Trans. Inf. Theory, vol. 61, no. 2, pp. 708–714, Feb. 2015.
[204] Y. Wu and S. Verdú, “Functional properties of minimum mean-square error and mutual information,” IEEE Trans. Inf. Theory, vol. 58, pp. 1289–1301, Mar. 2012.
[205] M. Godavarti and A. Hero, “Convergence of differential entropies,” IEEE Trans. Inf. Theory, vol. 50, no. 1, pp. 171–176, Jan. 2004.
[206] K. Ball, “Volumes of sections of cubes and related problems,” in Geometric aspects of functional analysis, pp. 251–260, Springer, 1989.
[207] M. Talagrand, “Transportation cost for Gaussian and other product measures,” Geometric & Functional Analysis, vol. 6, no. 3, pp. 587–600, 1996.
[208] G. Xu, W. Liu, and B. Chen, “Wyner’s common information: Generalizations and a new lossy source coding interpretation,” arXiv preprint arXiv:1301.2237, 2013.
[209] M. D. Perlman, “Jensen’s inequality for a convex vector-valued function on an infinite-dimensional space,” Journal of Multivariate Analysis, vol. 4, no. 1, pp. 52–65, 1974.
[210] I. Csiszár and P. Narayan, “Secrecy capacities for multiple terminals,” IEEE Trans. Inf. Theory, vol. 50, no. 12, pp. 3047–3061, Dec. 2004.
[211] C. Chan and L. Zheng, “Mutual dependence for secret key agreement,” in Proc. 44th Annual Conference on Information Sciences and Systems (CISS), pp. 1–6, Mar. 2010.
[212] W. Yang, R. F. Schaefer, and H. V. Poor, “Finite-blocklength bounds for wiretap channels,” in Proceedings of the 2016 IEEE International Symposium on Information Theory, pp. 3087–3091, July 2016.