Entropy, Relative Entropy, and Mutual Information
Some basic notions of Information Theory
Radu Trîmbițaș
October 2012
Contents
1 Entropy and its Properties
  1.1 Entropy
  1.2 Joint Entropy and Conditional Entropy
  1.3 Relative Entropy and Mutual Information
  1.4 Relationship between Entropy and Mutual Information
  1.5 Chain Rules for Entropy, Relative Entropy and Mutual Information
2 Inequalities in Information Theory
  2.1 Jensen inequality and its consequences
  2.2 Log sum inequality and its applications
  2.3 Data-processing inequality
  2.4 Sufficient statistics
  2.5 Fano's inequality
1 Entropy and its Properties
1.1 Entropy
Entropy of a discrete RV
• a measure of uncertainty of a random variable
• X a discrete random variable, X ∼ (x_i; p_i)_{i∈I}
• 𝒳 the alphabet of X, p(x) = P(X = x) the probability mass function of X
Figure 1: Graph of H(p)
Definition 1. The entropy of the discrete random variable X is

H(X) = − ∑_{x∈𝒳} p(x) log p(x).    (1)

An equivalent expression is

H(X) = E_p (log [1/p(X)]).    (2)
• H(X) is measured in bits: log is taken to base 2
• H_b(X) denotes the entropy in base b; for b = e the entropy is measured in nats
• convention: 0 log 0 = 0, since lim_{x↘0} x log x = 0
Entropy - Properties
Lemma 2. H(X) ≥ 0.

Proof: 0 ≤ p(x) ≤ 1 implies log [1/p(x)] ≥ 0.

Lemma 3. H_b(X) = (log_b a) H_a(X).

Proof: log_b p = (log_b a) log_a p.
Example 4. Let the RV

X : (0, 1) with probabilities (1 − p, p).

Then

H(X) = −p log p − (1 − p) log(1 − p) =: H(p).    (3)

H(X) = 1 bit when p = 1/2. The graph of H(p) is shown in Figure 1.
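As a quick numerical illustration, here is a minimal Python sketch (standard library only; the helper name binary_entropy is ours, not part of the lecture) that evaluates H(p):

import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2(1-p), with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# H(p) vanishes at p = 0 and p = 1 and is maximal (1 bit) at p = 1/2
for p in (0.0, 0.1, 0.25, 0.5, 0.75, 1.0):
    print(f"H({p}) = {binary_entropy(p):.4f} bits")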
Entropy - Properties II
Let

X : (a, b, c, d) with probabilities (1/2, 1/4, 1/8, 1/8).

The entropy of X is

H(X) = −(1/2) log_2 (1/2) − (1/4) log_2 (1/4) − (1/8) log_2 (1/8) − (1/8) log_2 (1/8) = 7/4 bits.
Problem: Determine the value of X with the minimum number of binary questions.
Solution: Ask first "Is X = a?"; if not, "Is X = b?"; if not, "Is X = c?". The resulting expected number of questions is 7/4 = 1.75. See the lectures on data compression: the minimum expected number of binary questions required to determine X lies between H(X) and H(X) + 1.
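Both the entropy and the expected number of questions can be cross-checked with a few lines of Python (a sketch; the entropy helper is ours):

import math

def entropy(probs):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [1/2, 1/4, 1/8, 1/8]   # P(a), P(b), P(c), P(d)
questions = [1, 2, 3, 3]       # questions needed to identify a, b, c, d

print(entropy(probs))                                  # 1.75 = H(X)
print(sum(p * q for p, q in zip(probs, questions)))    # 1.75 expected questions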
1.2 Joint Entropy and Conditional Entropy
Joint Entropy and Conditional Entropy
• (X, Y) a pair of discrete RVs over the alphabets 𝒳, 𝒴:

X : (x_i; p_i)_{i∈I},   Y : (y_j; q_j)_{j∈J}
• joint distribution of X and Y: p(x, y) = P(X = x, Y = y), x ∈ 𝒳, y ∈ 𝒴
• (marginal) distribution of X: p_X(x) = p(x) = P(X = x) = ∑_{y∈𝒴} p(x, y)
• (marginal) distribution of Y: p_Y(y) = p(y) = P(Y = y) = ∑_{x∈𝒳} p(x, y)
Definition 5. The joint entropy H(X, Y) of a pair of discrete RVs (X, Y) ∼ p(x, y) is

H(X, Y) = − ∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) log p(x, y),    (4)

also expressed as H(X, Y) = −E (log p(X, Y)).
Definition 6. For (X, Y) ∼ p(x, y), the conditional entropy H(Y|X) is

H(Y|X) = ∑_{x∈𝒳} p(x) H(Y|X = x),    (5)

where

H(Y|X = x) := − ∑_{y∈𝒴} p(y|x) log p(y|x)

and p(y|x) := P(Y = y|X = x) = P(Y = y, X = x)/P(X = x) = p(x, y)/p(x) is the conditional probability.
• By computation,

H(Y|X) = − ∑_{x∈𝒳} p(x) ∑_{y∈𝒴} p(y|x) log p(y|x)    (6)
       = − ∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) log p(y|x)    (7)
       = −E (log p(Y|X)).    (8)
• The naturalness of the last two definitions comes from the fact that the entropy of a pair of RVs is the entropy of one plus the conditional entropy of the other; see the next theorem.
Theorem 7 (Chain Rule).
H(X, Y) = H(X) + H(Y|X). (9)
Proof.

H(X, Y) = − ∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) log p(x, y)
        = − ∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) log [p(x) p(y|x)]
        = − ∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) log p(x) − ∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) log p(y|x)
        = − ∑_{x∈𝒳} p(x) log p(x) − ∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) log p(y|x)
        = H(X) + H(Y|X).
Equivalently (shorter proof): we can write
log p(X, Y) = log p(X) + log p(Y|X)
and apply E to both sides.
Joint Entropy and Conditional Entropy II
Corollary 8. H(X, Y|Z) = H(X|Z) + H(Y|Z, X).
Example 9. Let (X, Y) have the joint distribution

Y\X    1      2      3      4
1      1/8    1/16   1/32   1/32
2      1/16   1/8    1/32   1/32
3      1/16   1/16   1/16   1/16
4      1/4    0      0      0
• marginal distributions: X : (1/2, 1/4, 1/8, 1/8); Y : (1/4, 1/4, 1/4, 1/4)
• H(X) = 7/4 bits, H(Y) = 2 bits
• conditional entropy:

H(X|Y) = ∑_{i=1}^{4} p(Y = i) H(X|Y = i)
       = (1/4) H(1/2, 1/4, 1/8, 1/8) + (1/4) H(1/4, 1/2, 1/8, 1/8) + (1/4) H(1/4, 1/4, 1/4, 1/4) + (1/4) H(1, 0, 0, 0)
       = (1/4) (7/4 + 7/4 + 2 + 0) = 11/8 bits
• Similarly, H(Y|X) = 13/8 bits and H(X, Y) = 27/8 bits.
• Remark. If H(X) ≠ H(Y), then H(Y|X) ≠ H(X|Y). However, H(X) − H(X|Y) = H(Y) − H(Y|X).
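All these values can be checked mechanically; below is a small Python sketch (standard library only; the helpers H and joint are ours), with the rows of joint indexed by Y as in the table above:

import math

H = lambda ps: -sum(p * math.log2(p) for p in ps if p > 0)

joint = [[1/8, 1/16, 1/32, 1/32],   # Y = 1
         [1/16, 1/8, 1/32, 1/32],   # Y = 2
         [1/16, 1/16, 1/16, 1/16],  # Y = 3
         [1/4, 0, 0, 0]]            # Y = 4

p_x = [sum(row[j] for row in joint) for j in range(4)]   # marginal of X
p_y = [sum(row) for row in joint]                        # marginal of Y
H_xy = H([p for row in joint for p in row])

print(H(p_x), H(p_y), H_xy)     # 1.75, 2.0, 3.375 (= 7/4, 2, 27/8)
# H(X|Y) as the average of the entropies of the conditional distributions
print(sum(p_y[i] * H([p / p_y[i] for p in joint[i]]) for i in range(4)))  # 1.375
print(H_xy - H(p_x))            # 1.625 (= 13/8) = H(Y|X)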
1.3 Relative Entropy and Mutual Information
Relative Entropy and Mutual Information
Definition 10. The relative entropy or Kullback-Leibler distance between two pmfs p(x) and q(x) is

D(p ‖ q) = ∑_{x∈𝒳} p(x) log [p(x)/q(x)] = E_p (log [p(X)/q(X)]).

• Conventions: 0 log (0/0) = 0, 0 log (0/q) = 0, p log (p/0) = ∞.
• It is not a true distance, since it is not symmetric and does not satisfy the triangle inequality; for this reason it is sometimes called the Kullback-Leibler divergence.
Definition 11. Let (X, Y) ∼ p(x, y), with marginal mass functions p(x) and p(y). The mutual information I(X; Y) is the relative entropy between p(x, y) and p(x)p(y):

I(X; Y) = D (p(x, y) ‖ p(x)p(y))    (10)
        = ∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) log [p(x, y)/(p(x)p(y))]    (11)
        = E_{p(x,y)} (log [p(X, Y)/(p(X)p(Y))]).    (12)
Remark. In general D(p ‖ q) ≠ D(q ‖ p), as the next example shows.
Interpretation. I(X; Y) measures the average reduction in the uncertainty of X that results from knowing Y.
Example 12. Let 𝒳 = {0, 1}, p(0) = 1 − r, p(1) = r, q(0) = 1 − s, q(1) = s. Then

D(p ‖ q) = (1 − r) log [(1 − r)/(1 − s)] + r log (r/s),
D(q ‖ p) = (1 − s) log [(1 − s)/(1 − r)] + s log (s/r).

If r = s, then D(p ‖ q) = D(q ‖ p), but for r = 1/2, s = 1/4

D(p ‖ q) = (1/2) log [(1/2)/(3/4)] + (1/2) log [(1/2)/(1/4)] = 0.20752 bit,
D(q ‖ p) = (3/4) log [(3/4)/(1/2)] + (1/4) log [(1/4)/(1/2)] = 0.18872 bit.
1.4 Relationship between Entropy and Mutual Information
Relationship between Entropy and Mutual Information
Theorem 13 (Mutual information and entropy).

I(X; Y) = H(X) − H(X|Y)    (13)
I(X; Y) = H(Y) − H(Y|X)    (14)
I(X; Y) = H(X) + H(Y) − H(X, Y)    (15)
I(X; Y) = I(Y; X)    (16)
I(X; X) = H(X)    (17)
Figure 2: Relative entropy (Kullback-Leibler distance) of two Bernoulli RVs
Proof. For (13):

I(X; Y) = ∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) log [p(x, y)/(p(x)p(y))] = ∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) log [p(x|y)/p(x)]
        = − ∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) log p(x) + ∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) log p(x|y)
        = − ∑_{x∈𝒳} p(x) log p(x) + ∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) log p(x|y)   (since ∑_{y∈𝒴} p(x, y) = p(x))
        = H(X) − (− ∑_{x∈𝒳} ∑_{y∈𝒴} p(x, y) log p(x|y))
        = H(X) − H(X|Y).

(14) follows by symmetry. (15) results from (13) and H(X, Y) = H(Y) + H(X|Y); (15) implies (16). Finally, I(X; X) = H(X) − H(X|X) = H(X).
Relationship between entropy and mutual information

Example 14. For the joint distribution of Example 9 the mutual information is

I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = 0.375 bit.

The relationship between H(X), H(Y), H(X, Y), H(X|Y), H(Y|X), and I(X; Y) is depicted in Figure 4. Notice that I(X; Y) corresponds to the intersection of the information in X with the information in Y.
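Continuing the computation of Example 9 numerically (a Python sketch, same table layout as before; the helpers are ours), identity (15) gives the value directly:

import math

H = lambda ps: -sum(p * math.log2(p) for p in ps if p > 0)

joint = [[1/8, 1/16, 1/32, 1/32],
         [1/16, 1/8, 1/32, 1/32],
         [1/16, 1/16, 1/16, 1/16],
         [1/4, 0, 0, 0]]
p_x = [sum(row[j] for row in joint) for j in range(4)]
p_y = [sum(row) for row in joint]

# I(X;Y) = H(X) + H(Y) - H(X,Y)
print(H(p_x) + H(p_y) - H([p for row in joint for p in row]))   # 0.375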
Figure 3: Graphical representation of the relation between entropy and mutualinformation
1.5 Chain Rules for Entropy, Relative Entropy and Mutual Information
Chain rules for entropy, relative entropy and mutual information
Theorem 15 (Chain rule for entropy). Let X1, X2, . . . , Xn ∼ p(x1, x2, . . . , xn). Then

H(X1, X2, . . . , Xn) = ∑_{i=1}^{n} H(Xi|Xi−1, . . . , X1).
Proof. Apply repeatedly the two-variable expansion rule for entropy:

H(X1, X2) = H(X1) + H(X2|X1),
H(X1, X2, X3) = H(X1) + H(X2, X3|X1)
             = H(X1) + H(X2|X1) + H(X3|X2, X1),
...
H(X1, X2, . . . , Xn) = H(X1) + H(X2|X1) + · · · + H(Xn|Xn−1, . . . , X1)
                     = ∑_{i=1}^{n} H(Xi|Xi−1, . . . , X1).
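As a sanity check, the following Python sketch (cond_entropy is our helper) verifies the chain rule H(X1, X2, X3) = H(X1) + H(X2|X1) + H(X3|X1, X2) on a randomly generated joint pmf over {0, 1}^3:

import math, random
from collections import defaultdict

random.seed(1)
keys = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
w = [random.random() for _ in keys]
p = {k: v / sum(w) for k, v in zip(keys, w)}   # random joint pmf

def cond_entropy(target, given):
    """H(X_target | X_given), computed from the conditional pmfs."""
    marg, pair = defaultdict(float), defaultdict(float)
    for k, v in p.items():
        g = tuple(k[i] for i in given)
        marg[g] += v
        pair[(k[target], g)] += v
    return -sum(v * math.log2(v / marg[g]) for (t, g), v in pair.items())

chain = cond_entropy(0, ()) + cond_entropy(1, (0,)) + cond_entropy(2, (0, 1))
joint = -sum(v * math.log2(v) for v in p.values())
print(chain, joint)   # equal, up to floating-point rounding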
Figure 4: Venn diagram for the relationship between entropy and mutual in-formation
Definition 16. The conditional mutual information of random variables X and Y given Z is defined by

I(X; Y|Z) = H(X|Z) − H(X|Y, Z)    (18)
          = E_{p(x,y,z)} log [p(X, Y|Z)/(p(X|Z) p(Y|Z))].    (19)
Theorem 17 (Chain rule for information).

I(X1, X2, . . . , Xn; Y) = ∑_{i=1}^{n} I(Xi; Y|Xi−1, Xi−2, . . . , X1).    (20)
Proof.

I(X1, X2, . . . , Xn; Y) = H(X1, X2, . . . , Xn) − H(X1, X2, . . . , Xn|Y)    (21)
 = ∑_{i=1}^{n} H(Xi|Xi−1, . . . , X1) − ∑_{i=1}^{n} H(Xi|Xi−1, . . . , X1, Y)
 = ∑_{i=1}^{n} I(Xi; Y|X1, X2, . . . , Xi−1).    (22)
Definition 18. For joint probability mass functions p(x, y) and q(x, y), the conditional relative entropy D(p(y|x) ‖ q(y|x)) is

D(p(y|x) ‖ q(y|x)) = ∑_x p(x) ∑_y p(y|x) log [p(y|x)/q(y|x)]    (23)
                   = E_{p(x,y)} log [p(Y|X)/q(Y|X)].    (24)
The notation is not explicit, since it omits mention of the distribution p(x) of the conditioning RV; it is, however, normally understood from the context.
Theorem 19 (Chain rule for relative entropy).
D (p(x, y) ‖ q(x, y)) = D(p(x) ‖ q(x)) + D (p(y|x) ‖ q(y|x)) (25)
Proof.

D(p(x, y) ‖ q(x, y)) = ∑_x ∑_y p(x, y) log [p(x, y)/q(x, y)]
 = ∑_x ∑_y p(x, y) log [p(x)p(y|x)/(q(x)q(y|x))]
 = ∑_x ∑_y p(x, y) log [p(x)/q(x)] + ∑_x ∑_y p(x, y) log [p(y|x)/q(y|x)]
 = D(p(x) ‖ q(x)) + D(p(y|x) ‖ q(y|x)).
2 Inequalities in Information Theory
2.1 Jensen inequality and its consequences
Jensen inequality

Convexity underlies many of the basic properties of information-theoretic quantities such as entropy and mutual information.
Definitions 20.
1. A function f(x) is convex (∪) over an interval (a, b) if for every x1, x2 ∈ (a, b) and 0 ≤ λ ≤ 1,

f(λ x1 + (1 − λ) x2) ≤ λ f(x1) + (1 − λ) f(x2).    (26)

2. f is strictly convex if equality holds only for λ = 0 and λ = 1.
3. f is concave (∩) if − f is convex.
• A convex function always lies below any chord; a concave function always lies above any chord.
• Examples of convex functions: x^2, |x|, e^x, and x log x for x ≥ 0.
• Examples of concave functions: log x and √x for x ≥ 0.
• If f'' is nonnegative (positive), then f is convex (strictly convex).
Theorem 21 (Jensen's inequality). If f is a convex function and X is a RV,

E(f(X)) ≥ f(E(X)).    (27)

If f is strictly convex, equality in (27) implies X = E(X) with probability 1 (i.e., X is a constant).
Proof. For a discrete RV, by induction on the number of mass points. For a two-mass-point distribution the inequality is just the definition of convexity:

f(p_1 x_1 + p_2 x_2) ≤ p_1 f(x_1) + p_2 f(x_2).

Suppose the theorem is true for distributions with k − 1 mass points; setting p′_i = p_i/(1 − p_k), we have

f(∑_{i=1}^{k} p_i x_i) = f(p_k x_k + (1 − p_k) ∑_{i=1}^{k−1} p′_i x_i)
 ≤ p_k f(x_k) + (1 − p_k) f(∑_{i=1}^{k−1} p′_i x_i)
 ≤ p_k f(x_k) + (1 − p_k) ∑_{i=1}^{k−1} p′_i f(x_i) = ∑_{i=1}^{k} p_i f(x_i).

The extension to continuous distributions follows by continuity arguments.
Consequences of Jensen's Inequality

• We will use Jensen's inequality to prove properties of entropy and relative entropy.
Theorem 22 (Information inequality, Gibbs' inequality). Let p(x), q(x), x ∈ 𝒳, be two pmfs. Then

D(p ‖ q) ≥ 0    (28)

with equality iff p(x) = q(x) for all x ∈ 𝒳.
Proof. Let A := {x : p(x) > 0} be the support set of p. Then

D(p ‖ q) = ∑_{x∈A} p(x) log [p(x)/q(x)] = ∑_{x∈A} p(x) (− log [q(x)/p(x)])
 ≥ − log (∑_{x∈A} p(x) [q(x)/p(x)])   (Jensen, since − log is strictly convex)
 = − log ∑_{x∈A} q(x) ≥ − log ∑_{x∈𝒳} q(x) = − log 1 = 0.

Equality in the Jensen step holds iff q(x)/p(x) = c for all x ∈ A; equality in the last step forces ∑_{x∈A} q(x) = ∑_{x∈𝒳} q(x) = 1, so 1 = ∑_{x∈A} q(x) = c ∑_{x∈A} p(x) = c, hence p(x) = q(x) for all x ∈ 𝒳.
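An empirical sanity check of the information inequality (a Python sketch with randomly generated pmfs; kl is our helper):

import math, random

def kl(p, q):
    """D(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

random.seed(0)
for _ in range(5):
    w1 = [random.random() for _ in range(4)]
    w2 = [random.random() for _ in range(4)]
    p = [x / sum(w1) for x in w1]
    q = [x / sum(w2) for x in w2]
    assert kl(p, q) >= 0           # Gibbs' inequality

print(kl([0.3, 0.7], [0.3, 0.7]))  # 0.0: equality iff p = q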
Since I(X; Y) = D(p(x, y) ‖ p(x)p(y)) ≥ 0, with equality iff p(x, y) = p(x)p(y) (i.e., X and Y are independent), we obtain

Corollary 23. I(X; Y) ≥ 0,    (29)

with equality iff X and Y are independent.
Corollary 24. I(X; Y|Z) ≥ 0,    (30)

with equality iff X and Y are conditionally independent given Z.
Any random variable over 𝒳 has an entropy no greater than log |𝒳|.
Theorem 25. H(X) ≤ log |𝒳|, with equality iff X ∼ U(𝒳).

Proof. Let p(x) be the pmf of X and u(x) = 1/|𝒳| the pmf of the uniform distribution over 𝒳. Then

0 ≤ D(p ‖ u) = ∑_{x∈𝒳} p(x) log [p(x)/u(x)] = log |𝒳| − H(X).
The next theorem states that conditioning reduces entropy (or: information cannot hurt).

Theorem 26. H(X|Y) ≤ H(X),

with equality iff X and Y are independent.

Proof. 0 ≤ I(X; Y) = H(X) − H(X|Y).
Corollary 27 (Independence bound on entropy).

H(X1, X2, . . . , Xn) ≤ ∑_{i=1}^{n} H(Xi)

Proof. By the chain rule for entropy (Theorem 15),

H(X1, X2, . . . , Xn) = ∑_{i=1}^{n} H(Xi|Xi−1, . . . , X1) ≤ ∑_{i=1}^{n} H(Xi),

where the inequality follows from Theorem 26.
2.2 Log sum inequality and its applications
Log sum inequality and its applications
Theorem 28 (Log sum inequality). For nonnegative numbers a1, . . . , an and b1, . . . , bn,

∑_{i=1}^{n} ai log (ai/bi) ≥ (∑_{i=1}^{n} ai) log [(∑_{i=1}^{n} ai)/(∑_{i=1}^{n} bi)]    (31)

with equality iff ai/bi = const.
Conventions: 0 log 0 = 0, a log (a/0) = ∞ if a > 0, and 0 log (0/0) = 0 (by continuity).
Proof. Assume w.l.o.g. that ai > 0 and bi > 0. Since f(t) = t log t is strictly convex for t > 0, Jensen's inequality gives

∑ αi f(ti) ≥ f(∑ αi ti),   αi ≥ 0, ∑ αi = 1.

Setting αi = bi/∑_j bj and ti = ai/bi, we obtain

∑ [ai/∑_j bj] log (ai/bi) ≥ [∑ ai/∑_j bj] log [∑ ai/∑_j bj],

which is the desired inequality.
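A numerical illustration of (31), including the equality case ai/bi = const (a Python sketch; lhs and rhs are our helpers):

import math

def lhs(a, b):
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))

def rhs(a, b):
    return sum(a) * math.log2(sum(a) / sum(b))

a, b = [1.0, 2.0, 3.0], [4.0, 1.0, 2.0]
print(lhs(a, b), ">=", rhs(a, b))   # strict inequality here

c = [2 * bi for bi in b]            # now c_i / b_i is constant
print(lhs(c, b), "==", rhs(c, b))   # equality case: both equal 14.0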
Homework. Prove Theorem 22 using the log sum inequality.

Using the log sum inequality it is easy to prove convexity and concavity results for relative entropy, entropy and mutual information; see [1, Section 2.7].
2.3 Data-processing inequality
Data-processing inequality
Definition 29. Random variables X, Y, Z are said to form a Markov chain in that order (denoted by X → Y → Z) if the conditional distribution of Z depends only on Y and is conditionally independent of X. Specifically, X, Y, and Z form a Markov chain X → Y → Z if the joint probability mass function can be written as
p(x, y, z) = p(x)p(y|x)p(z|y). (32)
Consequences:
• X → Y → Z iff X and Z are conditionally independent given Y (i.e., p(x, z|y) = p(x|y)p(z|y)). Markovity implies conditional independence because

p(x, z|y) = p(x, y, z)/p(y) = p(x, y)p(z|y)/p(y) = p(x|y)p(z|y).    (33)

• X → Y → Z implies Z → Y → X; this reversibility is sometimes emphasized by writing X ↔ Y ↔ Z.
• If Z = f(Y), then X → Y → Z.
We will prove that no processing of Y, deterministic or random, can increase the information that Y contains about X.

Theorem 30 (Data-processing inequality). If X → Y → Z, then I(X; Y) ≥ I(X; Z).
Proof. By the chain rule (20), we expand mutual information in two different ways:

I(X; Y, Z) = I(X; Z) + I(X; Y|Z)    (34)
           = I(X; Y) + I(X; Z|Y).    (35)

Since X and Z are conditionally independent given Y, I(X; Z|Y) = 0; since I(X; Y|Z) ≥ 0, we have

I(X; Y) ≥ I(X; Z).

We have equality iff I(X; Y|Z) = 0, that is, iff X → Z → Y forms a Markov chain. Similarly, one can prove that I(Y; Z) ≥ I(X; Z).
Corollary 31. In particular, if Z = g(Y), we have I(X; Y) ≥ I(X; g(Y)).
Proof. X → Y → g(Y) forms a Markov chain.
Functions of the data Y cannot increase the information about X.
Corollary 32. If X → Y → Z, then I(X; Y|Z) ≤ I(X; Y).
Proof. In (34) and (35) we have I(X; Z|Y) = 0 (by Markovity) and I(X; Z) ≥ 0. Thus

I(X; Y|Z) ≤ I(X; Y).    (36)
If X, Y, Z do not form a Markov chain, it is possible that I(X; Y|Z) > I(X; Y). For example, if X and Y are independent fair binary RVs and Z = X + Y, then I(X; Y) = 0, while I(X; Y|Z) = H(X|Z) − H(X|Y, Z) = H(X|Z) = P(Z = 1) H(X|Z = 1) = 1/2 bit.
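This counterexample is easy to verify by enumerating the four equally likely pairs (x, y); a Python sketch (cond_entropy is our helper):

import math
from collections import defaultdict

# joint pmf over (x, y, z) with z = x + y, for independent fair bits x, y
pxyz = {(x, y, x + y): 0.25 for x in (0, 1) for y in (0, 1)}

def cond_entropy(target, given):
    """H(coordinate target | coordinates given) for the pmf above."""
    marg, pair = defaultdict(float), defaultdict(float)
    for k, v in pxyz.items():
        g = tuple(k[i] for i in given)
        marg[g] += v
        pair[(k[target], g)] += v
    return -sum(v * math.log2(v / marg[g]) for (t, g), v in pair.items())

print(cond_entropy(0, (1,)))    # H(X|Y)   = 1.0, so I(X;Y) = H(X) - H(X|Y) = 0
print(cond_entropy(0, (2,)))    # H(X|Z)   = 0.5
print(cond_entropy(0, (1, 2)))  # H(X|Y,Z) = 0.0, so I(X;Y|Z) = 0.5 bit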
2.4 Sufficient statistics
Sufficient statistics
• We apply the data-processing inequality in statistics.
• Let {f_θ(x)} be a family of pmfs indexed by θ, let X ∼ f_θ(x), and let T(X) be a statistic.
• Then θ → X → T(X), and the data-processing inequality (Theorem 30) implies

I(θ; T(X)) ≤ I(θ; X)    (37)

with equality when no information is lost.

• A statistic T(X) is called sufficient for θ if it contains all the information in X about θ.
Definition 33. A function T(X) is said to be a sufficient statistic relative to the family {f_θ(x)} if X is independent of θ given T(X) for any distribution on θ (i.e., θ → T(X) → X forms a Markov chain).
The definition is equivalent to the condition of equality in the data-processing inequality,

I(θ; T(X)) = I(θ; X)    (38)

for all distributions on θ. Hence sufficient statistics preserve mutual information, and conversely.
Examples (sufficient statistics)

1. Let X1, X2, . . . , Xn, Xi ∈ {0, 1}, be a sequence of i.i.d. Bernoulli variables with parameter θ = P(Xi = 1). Given n, the number of 1's is a sufficient statistic for θ:

T(X1, . . . , Xn) = ∑_{i=1}^{n} Xi.

(A numerical check is sketched after this list.)

2. If X ∼ N(θ, 1), that is,

f_θ(x) = (1/√(2π)) e^{−(x−θ)²/2},

and X1, X2, . . . , Xn is a sample of i.i.d. N(θ, 1) RVs, then X̄n = (1/n) ∑_{i=1}^{n} Xi is a sufficient statistic.

3. If f_θ(x) is the pdf of U(θ, θ + 1), a sufficient statistic for θ is

T(X1, . . . , Xn) = (min{Xi}, max{Xi}).
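For example 1 the defining Markov property can be checked by enumeration: the conditional distribution of the sequence given T = k is uniform over all sequences with k ones, whatever the value of θ. A Python sketch (conditional_given_T is our helper; n = 3, k = 2 are illustrative choices):

from itertools import product

def conditional_given_T(theta, n=3, k=2):
    """P(x^n | T = k) for i.i.d. Bernoulli(theta); should not depend on theta."""
    probs = {}
    for xs in product((0, 1), repeat=n):
        if sum(xs) == k:
            p = 1.0
            for x in xs:
                p *= theta if x == 1 else 1 - theta
            probs[xs] = p
    total = sum(probs.values())
    return {xs: p / total for xs, p in probs.items()}

# the same uniform distribution (1/3 each) for every theta
print(conditional_given_T(0.2))
print(conditional_given_T(0.7))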
Definition 34. A statistic T(X) is a minimal sufficient statistic relative to {f_θ(x)} if it is a function of every other sufficient statistic U:

θ → T(X) → U(X) → X.

Hence, a minimal sufficient statistic maximally compresses the information about θ in the sample.
2.5 Fano’s inequality
Fano’s inequality
• Suppose we wish to estimate X ∼ p(x).
• We observe Y, related to X by the conditional distribution p(y|x). From Y we calculate X̂ = g(Y); X̂ is an estimate of X over the alphabet 𝒳̂.
• X → Y → X̂ forms a Markov chain.
• Define the probability of error

Pe = P{X̂ ≠ X}.
Theorem 35. For any estimator X̂ such that X → Y → X̂, with Pe = P{X̂ ≠ X},

H(Pe) + Pe log |𝒳| ≥ H(X|X̂) ≥ H(X|Y).    (39)

This inequality can be weakened to

1 + Pe log |𝒳| ≥ H(X|Y)    (40)

or

Pe ≥ [H(X|Y) − 1]/log |𝒳|.    (41)
Proof. For the first part we define the RV

E = 1 if X̂ ≠ X,  E = 0 if X̂ = X.

We expand H(E, X|X̂) in two ways using the chain rule:

H(E, X|X̂) = H(X|X̂) + H(E|X, X̂)    (42)
           = H(E|X̂) + H(X|E, X̂),    (43)

where H(E|X, X̂) = 0, H(E|X̂) ≤ H(Pe), and H(X|E, X̂) ≤ Pe log |𝒳|, as shown below.
Proof - continuation.

• Since E is a function of X and X̂, H(E|X, X̂) = 0.
• H(E|X̂) ≤ H(E) = H(Pe), because conditioning reduces entropy.
• For E = 0 we have X = X̂, and for E = 1 the entropy is at most the log of the number of possible outcomes:

H(X|E, X̂) = P(E = 0) H(X|X̂, E = 0) + P(E = 1) H(X|X̂, E = 1)
          ≤ (1 − Pe) · 0 + Pe log |𝒳|.    (44)
Proof - continuation. Combining these results, we obtain

H(Pe) + Pe log |𝒳| ≥ H(X|X̂).

Since X → Y → X̂ is a Markov chain, I(X; X̂) ≤ I(X; Y), hence H(X|X̂) ≥ H(X|Y). Finally,

H(Pe) + Pe log |𝒳| ≥ H(X|X̂) ≥ H(X|Y).
If we set X̂ = Y in Fano's inequality, we obtain

Corollary 36. For any two RVs X and Y, let p = P(X ≠ Y). Then

H(p) + p log |𝒳| ≥ H(X|Y).    (45)
If the estimator g(Y) takes values in 𝒳, we can replace log |𝒳| by log (|𝒳| − 1).

Corollary 37. Let Pe = P(X̂ ≠ X), where X̂ : 𝒴 → 𝒳; then

H(Pe) + Pe log (|𝒳| − 1) ≥ H(X|Y).

Proof. As in the proof of Theorem 35, except that in (44) the range of possible X outcomes has cardinality |𝒳| − 1.
Remark. Suppose there is no knowledge of Y; thus X must be guessed without any information. Let X ∈ {1, 2, . . . , m} and p1 ≥ p2 ≥ · · · ≥ pm. Then the best guess of X is X̂ = 1 and the resulting probability of error is Pe = 1 − p1. Fano's inequality becomes

H(Pe) + Pe log (m − 1) ≥ H(X).

The pmf

(p1, p2, . . . , pm) = (1 − Pe, Pe/(m − 1), . . . , Pe/(m − 1))

achieves this bound with equality: Fano's inequality is sharp!
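A numerical confirmation of the equality case (a Python sketch; m = 4 and Pe = 0.4 are illustrative values):

import math

def H(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

m, Pe = 4, 0.4
pmf = [1 - Pe] + [Pe / (m - 1)] * (m - 1)   # the extremal pmf above

fano_bound = H([Pe, 1 - Pe]) + Pe * math.log2(m - 1)
print(fano_bound)   # 1.6049...
print(H(pmf))       # 1.6049... : the bound is attained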
The next results relate the probability of error and entropy. Let X and X′ be i.i.d. RVs with entropy H(X). Then

P(X = X′) = ∑_x p²(x).
Lemma 38. If X and X′ are i.i.d. RVs with entropy H(X), then

P(X = X′) ≥ 2^{−H(X)},    (46)

with equality iff X has a uniform distribution.
Proof. Suppose X ∼ p(x). By Jensen's inequality applied to the convex function 2^t,

2^{E(log p(X))} ≤ E(2^{log p(X)}),

that is,

2^{−H(X)} = 2^{∑ p(x) log p(x)} ≤ ∑_x p(x) 2^{log p(x)} = ∑_x p²(x).
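A quick numerical check of (46) (a Python sketch): the bound is attained for a uniform pmf and is strict otherwise:

import math

def H(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

for pmf in ([1/4] * 4, [1/2, 1/4, 1/8, 1/8]):
    collision = sum(p * p for p in pmf)      # P(X = X')
    print(collision, ">=", 2 ** -H(pmf))     # 0.25 >= 0.25; 0.34375 >= 0.2973...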
Corollary 39. Let X ∼ p(x) and X′ ∼ r(x) be independent RVs over 𝒳. Then

P(X = X′) ≥ 2^{−H(p)−D(p‖r)},    (47)
P(X = X′) ≥ 2^{−H(r)−D(r‖p)}.    (48)
Proof.

2^{−H(p)−D(p‖r)} = 2^{∑ p(x) log p(x) + ∑ p(x) log [r(x)/p(x)]} = 2^{∑ p(x) log r(x)}.

From Jensen's inequality and the convexity of f(y) = 2^y it follows that

2^{−H(p)−D(p‖r)} = 2^{∑ p(x) log r(x)} ≤ ∑_x p(x) 2^{log r(x)} = ∑_x p(x) r(x) = P(X = X′).
References
[1] Thomas M. Cover, Joy A. Thomas, Elements of Information Theory, 2nd edition, Wiley, 2006.

[2] David J.C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003.

[3] Robert M. Gray, Entropy and Information Theory, Springer, 2009.