Lecture Notes on

Information Theory and Coding

Baoming BaiState Key Lab. of ISN, Xidian University

July 10, 2017


Goals: The goal of this class is to establish an understanding of the intrinsic properties of the transmission of information and of the relation between coding and the fundamental limits of information transmission in the context of communication.

Course Outline:

• Entropy and Mutual information (Measure of information)

• Source coding

• Channel capacity

• The Gaussian channel

• Coding for a noisy channel (Block coding principles)

• Rate distortion theory

The basic content can be summarized as follows: information theory (IT) characterizes the fundamental performance limits of communication, and coding provides the methods for approaching those limits – Coding: source coding, channel coding, network coding.

Textbook: T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd edition, New York, Wiley, 2006. (Tsinghua reprint edition, 2003).

References:

1. C. E. Shannon, “A mathematical theory of communication”, Bell Syst. Tech. J.,vol. 27, 1948.

2. R. G. Gallager, Information Theory and Reliable Communication, Wiley, 1968.

3. R. J. McEliece, The Theory of Information and Coding, 1977. (This is a very readable small book.)

4. Raymond W. Yeung, Information Theory and Network Coding, Springer, 2008. (Chinese edition, 2011)

5. J. Massey, Digital Information Theory, Course notes, ETH.

6. M. Medard, Transmission of Information, Course notes, MIT.

7. Wang Yumin, Li Hui, Liang Chuanjia, Information Theory and Coding Theory, Higher Education Press, 2005.

8. Qiu Peiliang, Information Theory and Coding, Higher Education Press, 2003.

9. Fu Zuyun, Information Theory, Publishing House of Electronics Industry.

10. R. G. Gallager, “Claude E. Shannon: A retrospective on his life, work, and impact,” IEEE Trans. Inform. Theory, vol. 47, no. 7, pp. 2681-2695, Nov. 2001.

Chapter 1

Introduction

The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. — Shannon

In 1948, Claude E. Shannon published a landmark paper entitled “A Mathematical Theory of Communication”. This paper laid the groundwork of an entirely new scientific discipline, “Information Theory”.

Information theory studies the transmission, processing, and utilization of information.

1.1 Relationship of information theory and communication theory

Information theory answers two fundamental questions in communication theory:

1. What is the ultimate limit of data compression? (The entropy H.)

2. What is the ultimate transmission rate? (The channel capacity C.)

Information theory also suggests means of achieving these ultimate limits of communication (e.g., random coding, ML decoding).

A communication system transmits signals; signals are the carriers of messages; the uncertain component of a message is the information.

• Information theory in the narrow sense (Shannon theory): Building on earlier work, Shannon studied communication systems with probabilistic and statistical methods. He showed that what a communication system conveys is information, and that the central problem of system design is how to transmit information efficiently and reliably in the presence of noise and interference. He pointed out that this goal can be achieved by coding, and proved in theory the best performance limits attainable by a communication system.

• Information theory in the general sense: besides Shannon theory, it also includes the theory of optimal reception (signal detection, estimation, and modulation theory), noise theory, and so on.

• Information theory in the broad sense.

Information theory is the foundational theory of communication and information systems, and the driving force and wellspring of modern communication development:


Figure 1.1: Relationship between information theory and communication theory. Shannon theory mainly studies the fundamental limits; related areas include network and switching theory, etc.

I have often remarked that the transistor and information theory, two Bell Laboratories breakthroughs within months of each other, have launched and powered the vehicle of modern digital communications. Solid state electronics provided the engine while information theory gave us the steering wheel with which to guide it. — Viterbi, IT Newsletter, 1998.

• Source coding theorem → data compression techniques → the evolution of wireless systems from 1G to 2G

• Channel coding theorem → error control coding (Turbo, LDPC) → 3G

• Data processing theorem → soft-decision decoding

• Gaussian noise is the worst additive noise + multiuser information theory → CDMA, multiuser detection

• MIMO capacity theory → space-time coding, precoding → LTE, 4G

• Multiuser information theory → cooperative communication, network coding → next-generation wireless systems

Recent work on the information-theoretic aspects of communication has concentrated on: 1) network information theory, and 2) MIMO systems.

1.2 What is information? (Measure of information)

For Shannon theory, information is what we receive when uncertainty is reduced.

How to measure:

• The amount of information should satisfy I ≥ 0.

• The amount of information should depend on the probability P(x).

• For independent events, P(X,Y) = P(X)P(Y) should imply I = I(X) + I(Y).

It should therefore have the form I(x) = log(1/P_X(x)), the self-information of the event X = x.
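To make these requirements concrete, here is a small Python sketch (added for illustration only; the probability values are arbitrary examples) that computes self-information in bits and checks the additivity property for independent events.

```python
import math

def self_information(p: float) -> float:
    """Self-information I(x) = log2(1/p), in bits, of an event with probability p."""
    return -math.log2(p)

p_x, p_y = 0.25, 0.5        # probabilities of two independent events (arbitrary examples)
p_joint = p_x * p_y          # independence: P(X, Y) = P(X) P(Y)

print(self_information(p_x))   # 2.0 bits
print(self_information(p_y))   # 1.0 bit
# Additivity: I(x, y) = I(x) + I(y) for independent events
print(math.isclose(self_information(p_joint),
                   self_information(p_x) + self_information(p_y)))  # True
```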


1.3 Applications

• Data compression: voice coder, MPEG, LZ algorithm.

• Modem

• Deep space communication (coding and deep space communication have been called a “marriage made in heaven”)

• CDMA, MIMO, 4G

• Physical layer security (information-theoretic security)

• Quantum communication

• Stock market

1.4 Historical notes

• Sampling theorem: 1928 by Nyquist

• Hartley’s measure of information (1928)

I(X) = log L, where L is the number of possible values of X.

• Information theory: 1948, by Shannon. It investigates how to achieve efficient and reliable communication.

• Why the name “entropy”? When Shannon discussed it with von Neumann, von Neumann suggested using “entropy”:

1. Your uncertainty function is already known in statistical mechanics under the name entropy.

2. Nobody really knows what entropy actually is, so in a debate you will always have the advantage.

• In Shannon's original 1948 paper, he used neither the term “mutual information” nor a special symbol for it; he always worked with differences of uncertainties. The term “mutual information” and the notation I(X;Y) were introduced later by Fano.

• Shannon was born in Michigan in 1916. In 1936, he received B.S. degrees in both electrical engineering and mathematics from the Univ. of Michigan. He received his M.S. and Ph.D. degrees from MIT. In 1941, he joined Bell Labs. In 1958, he accepted a permanent appointment at MIT. He later bought a large house, full of toys and gadgets.

• Shannon's master's thesis was on Boolean algebra and switching; the first paper he published based on this work won the 1940 Alfred Noble Prize for the best paper in engineering published by an author under 30. It is widely recognized today as the foundation of the switching field and as one of the most important master's theses ever written. His Ph.D. dissertation, “An Algebra for Theoretical Genetics,” was completed in 1940. This thesis was never published.

• In 1961, Shannon published a pioneering paper, “Two-way Communication Channels”, which established the foundation for the discipline now known as “multi-user information theory”. Later, N. Abramson published his paper “The Aloha System – Another Alternative for Computer Communications” in 1970, which introduced the concept of multiple access over a shared common channel. The information-theoretic approach to multiaccess communication began in 1972 with a coding theorem developed by Ahlswede and Liao. In 1972, T. Cover published the paper “Broadcast channels”.

• In 1995, Gallager proposed “Combining queueing theory with information theory for multiaccess”. In 1998, Ephremides published the paper “Information theory and communication networks: An unconsummated union”. In 2000, R. Ahlswede, N. Cai, S.-Y. R. Li, and R. W. Yeung published the celebrated paper “Network information flow”, which introduced the idea of network coding; also in 2000, P. Gupta and P. R. Kumar published “The capacity of wireless networks”, which introduced the notion of transport capacity. Since 2003, research has been moving towards an information theory of large networks.

1.5 A model of digital communication systems

Figure 1.2: Block diagram of a digital communication system: source → source encoder → ECC (channel) encoder → digital modulator → channel (with noise n(t) and interference) → digital demodulator → ECC (channel) decoder → source decoder → sink.

• source: discrete/continuous; memoryless/with memory

• encoder: converts the messages into signals suitable for transmission over physical channels.

• channel: wireless/cable, disk.

• interference.

• sink: destination.

1.6 Review of Probability

Bayes rule:

P(A|B) = P(A ∩ B)/P(B) = P(B|A)P(A)/P(B)

For a discrete random variable (r.v.),


• Probability Mass Function (PMF)

PX(x) = P (X = x)

denotes the Prob. of the event that X takes on the value x.

For a continuous r.v.,

• Cumulative Distribution Function (CDF)

FX(x) = P (X ≤ x)

• Probability Density Function (PDF)

p_X(x) = dF_X(x)/dx


Chapter 2

Entropy & Mutual Information (Shannon's Measure of Information)

This chapter introduces the basic concepts and definitions required for the later discussion, mainly entropy, mutual information, and relative entropy.

2.1 Entropy

Let X be a discrete random variable with alphabet X and PMF P_X(x) = Pr(X = x), x ∈ X. For convenience, we will often write simply P(x) for P_X(x).

Definition 2.1.1. The entropy of a discrete r.v. X is defined as

H(X) = Σ_{x∈X} P(x) log_b (1/P(x)) = −Σ_{x∈X} P(x) log_b P(x)   (2.1)

When b = 2, the unit is called the bit (binary digit); when b = e, the unit is called the nat (natural unit). (Conversion is easy: log_b x = (log_b a)(log_a x) ⇒ H_b(X) = (log_b a) H_a(X).) Unless otherwise specified, we will take all logarithms to base 2; hence all entropies will be measured in bits.

In the above definition, we use the convention that 0 log 0 = 0. Note that, equivalently, many books adopt the convention that the summation is taken over the corresponding support set. The support set of P(X), denoted S_X, is the set of all x ∈ X such that P(x) > 0; i.e., S_X = supp(P_X) = {x : P(x) > 0}.

The entropy H(X) is also called the uncertainty of X, meaning that it is a measure of the randomness of X.

Note that the entropy H(X) depends on the probabilities of the different outcomes of X, but not on the names of the outcomes. For example, let X ∈ {Green, Blue, Red} with P(X): 0.2, 0.3, 0.5 and Y ∈ {Sunday, Monday, Friday} with P(Y): 0.2, 0.3, 0.5. Then


H(X) = H(Y )

Remark: The entropy of X can also be interpreted as the expected value of log(1/P(X)) (i.e., the average uncertainty):

H(X) = E[log(1/P(X))]   (2.2)

where we define E[F(X)] = Σ_{x∈X} P_X(x) F(x). Recall that I(x) = log(1/P_X(x)) is the self-information of the event X = x, so H(X) = E[I(X)] is also referred to as the average self-information.

An immediate consequence of the definition is that H(X) ≥ 0.

Example 2.1.1. Let X = 1 with probability p and X = 0 with probability 1 − p. Then

H(X) = −p log p − (1−p) log(1−p)   (2.3)

Equation (2.3) is often called the binary entropy function and is denoted by H(p). Its graph is shown in Fig. 2.1. We can see that H(X) = 1 bit when p = 1/2.

Figure 2.1: The binary entropy function H(p).

Example 2.1.2. Let X take the value a with probability 1/2, b with probability 1/4, c with probability 1/8, and d with probability 1/8. Then

H(X) = −(1/2)log(1/2) − (1/4)log(1/4) − (1/8)log(1/8) − (1/8)log(1/8) = 7/4 bits.

On the other hand, if X takes on values in X = {a, b, c, d} with equal probability, then H(X) = −4 × (1/4)log(1/4) = 2 bits = log |X|.

We can see that the uniform distribution over the range X is the maximum-entropy distribution over this range. (In other words, the entropy of X is maximized when its values are equally likely.)
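The entropies in the two examples above can be checked with a few lines of Python. The sketch below is added for illustration only; the helper name `entropy` is my own.

```python
import math

def entropy(pmf):
    """H(X) = -sum p log2 p, with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))                  # 1.0  (binary entropy at p = 1/2)
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 = 7/4 bits (Example 2.1.2)
print(entropy([0.25] * 4))                  # 2.0  = log2 |X| for the uniform pmf
```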


2.2 Joint entropy and conditional entropy

We now extend the definition of the entropy of a single random variable to a pair of random variables.

(X,Y ) can be considered to be a single vector-valued random variable.

Definition 2.2.1. The joint entropy H(XY) of a pair of discrete random variables (X, Y) with a joint distribution P(x, y) is defined as

H(XY) ≜ E[−log P(X,Y)] = −Σ_{x∈X} Σ_{y∈Y} P(x,y) log P(x,y) = −Σ_{(x,y)∈S_XY} P(x,y) log P(x,y)   (2.4)

Definition 2.2.2. The conditional entropy of the discrete random variable X, given that the event Y = y occurs, is defined as

H(X|Y=y) = −Σ_{x∈X} P(x|y) log P(x|y) = E[−log P(X|Y) | Y = y]   (2.5)

Definition 2.2.3. If (X, Y) ∼ P(x, y), then the conditional entropy of the discrete random variable X, given the discrete random variable Y, is defined as

H(X|Y) = Σ_{y∈Y} P(y) H(X|Y=y)
       = −Σ_{y∈Y} P(y) Σ_{x∈X} P(x|y) log P(x|y)
       = −Σ_{x∈X} Σ_{y∈Y} P(x,y) log P(x|y)
       = E[−log P(X|Y)]   (2.6)

Notice that in general H(Y|X) ≠ H(X|Y).

Theorem 2.2.1. H(XY ) = H(X) +H(Y |X) = H(Y ) +H(X|Y )

Proof.

H(XY) = −Σ_x Σ_y P(x,y) log P(x,y)
      = −Σ_x Σ_y P(x,y) [log P(x) + log P(y|x)]
      = −Σ_x P(x) log P(x) − Σ_x Σ_y P(x,y) log P(y|x)
      = H(X) + H(Y|X)   (2.7)

Corollary 2.2.2. H(XY |Z) = H(X|Z) +H(Y |XZ)


We now extend the above theorem to the general case of N random variables.

Theorem 2.2.3 (Chain rule for entropy). Let X1, X2, ..., XN be discrete random variables drawn according to P(x1, x2, ..., xN). Then

H(X1, X2, ..., XN) = Σ_{n=1}^{N} H(Xn | X1, X2, ..., Xn−1)

Proof.

H(X1, X2, ..., XN) = E[−log P(X1, X2, ..., XN)]
                   = E[−log Π_{n=1}^{N} P(Xn | X1, ..., Xn−1)]   (2.8)
                   = Σ_{n=1}^{N} E[−log P(Xn | X1, ..., Xn−1)]
                   = Σ_{n=1}^{N} H(Xn | X1, ..., Xn−1)

If the Xn are independent of each other, then

H(X1, X2, ..., XN) = Σ_{n=1}^{N} H(Xn).   (2.9)

Similarly, we have

H(X1, X2, ..., XN | Y) = Σ_{n=1}^{N} H(Xn | X1, ..., Xn−1, Y).
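As an added illustration (the joint pmf below is an arbitrary example, not from the notes), the following Python sketch verifies Theorem 2.2.1 numerically: H(XY) = H(X) + H(Y|X).

```python
import math

def H(pmf):
    """Entropy in bits of a pmf given as an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# An arbitrary example joint pmf P(x, y) on {0,1} x {0,1}
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

Px = {x: sum(P[x, y] for y in (0, 1)) for x in (0, 1)}       # marginal P(x)
H_XY = H(P.values())                                          # joint entropy H(XY)
H_X = H(Px.values())                                          # H(X)
# Conditional entropy H(Y|X) = -sum P(x,y) log2 P(y|x)
H_Y_given_X = -sum(p * math.log2(p / Px[x]) for (x, y), p in P.items() if p > 0)

print(H_XY, H_X + H_Y_given_X)                 # the two values agree (chain rule)
print(math.isclose(H_XY, H_X + H_Y_given_X))   # True
```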

2.3 Properties of the entropy function

Let X be a discrete random variable with alphabet X = {x_k, k = 1, 2, ..., K}. Denote the pmf of X by p_k = Pr(X = x_k), x_k ∈ X. Then the entropy H(X) of X can be written as

H(p) = −Σ_{k=1}^{K} p_k log p_k

where p = (p_1, p_2, ..., p_K) is a K-vector of probabilities.

Property 2.3.1. H(p) ≥ 0 (Non-negativity of entropy)

Proof. Since 0 ≤ pk ≤ 1, we have log pk ≤ 0. Hence, H(p) ≥ 0.

Definition 2.3.1. A function f(x) is said to be convex-∪ over an interval (a, b) if for every x1, x2 ∈ (a, b) and 0 ≤ λ ≤ 1,

f(λx1 + (1−λ)x2) ≤ λf(x1) + (1−λ)f(x2).

A function f is said to be strictly convex-∪ if equality holds only when λ = 0 or λ = 1.


Convex region: a region D is called convex if for any two points α ∈ D and β ∈ D,

λα + (1−λ)β ∈ D for all 0 ≤ λ ≤ 1.

Convex-∩ (concave) function: a function f(x) defined on a convex region D is called convex-∩ if

f(λα + (1−λ)β) ≥ λf(α) + (1−λ)f(β), ∀α, β ∈ D, 0 ≤ λ ≤ 1.

Theorem 2.3.1. If the function f has a second derivative that is non-negative (resp. positive) everywhere, i.e., f''(x) ≥ 0 (resp. > 0), then the function is convex-∪ (resp. strictly convex-∪).

Theorem 2.3.2 (Jensen's inequality). If f is a convex-∪ function and X is a random variable, then

E[f(X)] ≥ f(E[X]).

Proof. For a two-mass-point distribution, the inequality becomes

p1 f(x1) + p2 f(x2) ≥ f(p1 x1 + p2 x2),

which follows directly from the definition of a convex function.

Suppose that the theorem is true for distributions with k − 1 mass points. Then, writing p'_i = p_i/(1 − p_k) for i = 1, 2, ..., k − 1, we have

Σ_{i=1}^{k} p_i f(x_i) = p_k f(x_k) + (1−p_k) Σ_{i=1}^{k−1} p'_i f(x_i)
                      ≥ p_k f(x_k) + (1−p_k) f(Σ_{i=1}^{k−1} p'_i x_i)
                      ≥ f(p_k x_k + (1−p_k) Σ_{i=1}^{k−1} p'_i x_i)
                      = f(Σ_{i=1}^{k} p_i x_i)   (2.10)

Property 2.3.2. H(p) is a convex-∩ (concave) function of p.

IT-inequality: For a positive real number r,

log r ≤ (r − 1) log e (2.11)

with equality if and only if r = 1.

Proof. Let f(r) = ln r − (r − 1). Then we have

f'(r) = 1/r − 1 and f''(r) = −1/r² < 0.


Figure 2.2: The IT-inequality: ln r ≤ r − 1 for r > 0, with equality at r = 1.

We can see that f(r) is convex-∩ on the interval r > 0. Moreover, its maximum value is zero, achieved at r = 1. Therefore,

ln r ≤ r − 1

with equality iff r = 1. Equation (2.11) follows immediately.

Theorem 2.3.3. If the discrete r.v. X has K possible values, then H(X) ≤ log K, with equality iff P(x) = 1/K for all x ∈ X.

Proof.

H(X) − log K = −Σ_{x∈S_X} P(x) log P(x) − log K
             = Σ_x P(x) [log(1/P(x)) − log K]
             = Σ_x P(x) log(1/(K P(x)))
(by the IT-inequality) ≤ Σ_x P(x) [1/(K P(x)) − 1] log e   (2.12)
             = [Σ_{x∈S_X} 1/K − Σ_{x∈S_X} P(x)] log e
             ≤ [Σ_{x∈X} 1/K − 1] log e
             = (1 − 1) log e = 0

where equality holds in (2.12) iff K P(x) = 1 for all x ∈ S_X.

Theorem 2.3.4 (Conditioning reduces entropy). For any two discrete r.v.'s X and Y,

H(X|Y) ≤ H(X)

with equality iff X and Y are independent.


(We can also use the relationship I(X;Y) = H(X) − H(X|Y) ≥ 0 to obtain H(X) ≥ H(X|Y).)

Proof.

H(X|Y) − H(X) = −Σ_{(x,y)∈S_XY} P(x,y) log P(x|y) + Σ_{x∈S_X} P(x) log P(x)
              = Σ_{x,y} P(x,y) log [P(x)/P(x|y)]
              = Σ_{x,y} P(x,y) log [P(x)P(y)/P(x,y)]
(by the IT-inequality) ≤ Σ_{x,y} P(x,y) [P(x)P(y)/P(x,y) − 1] log e
              = (Σ_{x,y} P(x)P(y) − Σ_{x,y} P(x,y)) log e
              ≤ (Σ_{x∈X, y∈Y} P(x)P(y) − 1) log e
              = (1 − 1) log e
              = 0

Note, however, that H(X|Y = y) may exceed H(X).

Theorem 2.3.5 (Independence bound on entropy). Let X1, ..., XN be drawn according to P(x1, x2, ..., xN). Then

H(X1, X2, ..., XN) ≤ Σ_{n=1}^{N} H(Xn)

with equality iff the Xn are independent.

Proof. By the chain rule for entropies,

H(X1, X2, ..., XN) = Σ_{n=1}^{N} H(Xn | X1, ..., Xn−1) ≤ Σ_{n=1}^{N} H(Xn)   (2.13)

2.4 Relative entropy and mutual information

2.4.1 Relative entropy

The relative entropy is a measure of the “distance” between two distributions.


Definition 2.4.1. The relative entropy (or Kullback–Leibler distance) between two pmfs P(x) and Q(x) is defined as

D(P||Q) = Σ_{x∈X} P(x) log [P(x)/Q(x)] = E_P[log(P(X)/Q(X))].

It is sometimes called the information divergence, cross-entropy, or Kullback entropy between P(x) and Q(x).

• In general, D(P||Q) ≠ D(Q||P), so relative entropy does not have the symmetry required of a true “distance” measure.

• The divergence inequality: D(P||Q) ≥ 0, with equality iff P(x) = Q(x) for all x ∈ X.

Proof. Let S_X = {x : P(x) > 0} be the support set of P(x). Then

−D(P||Q) = Σ_{x∈S_X} P(x) log [Q(x)/P(x)]   (2.14)
(by the IT-inequality) ≤ Σ_x P(x) [Q(x)/P(x) − 1] log e
         = [Σ_{x∈S_X} Q(x) − Σ_{x∈S_X} P(x)] log e
         ≤ [Σ_{x∈X} Q(x) − Σ_{x∈X} P(x)] log e
         = 0
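A minimal Python sketch of the relative entropy (added for illustration; the pmfs are arbitrary examples) confirms the divergence inequality and the lack of symmetry.

```python
import math

def kl_divergence(P, Q):
    """D(P||Q) = sum_x P(x) log2(P(x)/Q(x)), summed over the support of P.
    Requires Q(x) > 0 wherever P(x) > 0."""
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.25, 0.25]      # arbitrary example pmfs on a 3-letter alphabet
Q = [1/3, 1/3, 1/3]

print(kl_divergence(P, Q))   # >= 0 (divergence inequality)
print(kl_divergence(Q, P))   # generally different: D(P||Q) != D(Q||P)
print(kl_divergence(P, P))   # 0, since the two distributions coincide
```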

2.4.2 Mutual information

Mutual information is a measure of the amount of information that one r.v. contains about another r.v. It is the reduction in the uncertainty of one r.v. due to the knowledge of the other.

Definition 2.4.2. Consider two random variables X and Y with a joint pmf P(x, y) and marginal pmfs P(x) and P(y). The (average) mutual information I(X;Y) between X and Y is the relative entropy between P(x, y) and P(x)P(y), i.e.,

I(X;Y) = D(P(x,y) || P(x)P(y))
       = Σ_{x∈X} Σ_{y∈Y} P(x,y) log [P(x,y)/(P(x)P(y))]
       = E_{P(x,y)}[log(P(x,y)/(P(x)P(y)))]
       = Σ_x Σ_y P(x,y) log [P(x|y)/P(x)]   (2.15)

Properties of mutual information:


1. Non-negativity of mutual information: I(X;Y) ≥ 0, with equality iff X and Y are independent.

Proof. I(X;Y ) = D(P (x, y)||P (x)P (y)) ≥ 0, with equality iff P (x, y) = P (x)P (y).

2. Symmetry of mutual information: I(X;Y ) = I(Y ;X)

3.

I(X;Y) = Σ_x Σ_y P(x,y) log [P(x|y)/P(x)]
       = −Σ_{x,y} P(x,y) log P(x) − (−Σ_{x,y} P(x,y) log P(x|y))
       = H(X) − H(X|Y)   (2.16)

By symmetry, it also follows that I(X;Y) = H(Y) − H(Y|X). Since H(XY) = H(X) + H(Y|X), we have

I(X;Y) = H(X) + H(Y) − H(XY)

Figure 2.3: Venn-diagram relationship among H(X), H(Y), H(XY), H(X|Y), H(Y|X), and I(X;Y).

4. I(X;X) = H(X) − H(X|X) = H(X). Hence, entropy is sometimes referred to as average self-information.
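The following Python sketch (added here; the joint pmf is an arbitrary example) computes I(X;Y) both as D(P(x,y)||P(x)P(y)) and via H(X) + H(Y) − H(XY), confirming property 3.

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Arbitrary example joint pmf P(x, y) on {0,1} x {0,1}
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
Px = {x: P[x, 0] + P[x, 1] for x in (0, 1)}
Py = {y: P[0, y] + P[1, y] for y in (0, 1)}

# I(X;Y) as the relative entropy D(P(x,y) || P(x)P(y))
I = sum(p * math.log2(p / (Px[x] * Py[y])) for (x, y), p in P.items() if p > 0)

# The same quantity via I(X;Y) = H(X) + H(Y) - H(XY)
I_alt = H(Px.values()) + H(Py.values()) - H(P.values())
print(I, I_alt, math.isclose(I, I_alt))
```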


2.4.3 Conditional mutual information

Definition 2.4.3. The conditional mutual information between the random variables X and Y, given the r.v. Z, is defined by

I(X;Y|Z) = H(X|Z) − H(X|YZ)
         = Σ_x Σ_y Σ_z P(x,y,z) log [P(x,y|z)/(P(x|z)P(y|z))]
         = E_{P(x,y,z)}[log(P(x|yz)/P(x|z))]   (2.17)

It follows from the definition that

I(X;Y|Z) = I(Y;X|Z) ≥ 0

with equality iff, conditional on each value of Z, X and Y are statistically independent; i.e., P(x,y|z) = P(x|z)P(y|z) for each element of the joint sample space for which P(z) > 0.

We can visualize the situation in which I(X;Y|Z) = 0 as a pair of channels in cascade. We assume that the output of the second channel depends statistically only on the input to the second channel, i.e., p(y|z) = p(y|z,x) for all x, y, z with p(z,x) > 0. Multiplying both sides by P(x|z), we obtain P(x,y|z) = P(x|z)P(y|z), so that I(X;Y|Z) = 0.

An important property:

I(X;YZ) = I(YZ;X)
        = I(X;Y) + I(X;Z|Y)
        = I(X;Z) + I(X;Y|Z)   (2.18)

Proof.

I(X;YZ) = H(YZ) − H(YZ|X)
        = H(Y) + H(Z|Y) − [H(Y|X) + H(Z|XY)]
        = [H(Y) − H(Y|X)] + [H(Z|Y) − H(Z|XY)]
        = I(X;Y) + I(X;Z|Y)   (2.19)

Theorem 2.4.1 (Chain rule for mutual information).

I(X1, X2, ..., XN; Y) = Σ_{n=1}^{N} I(Xn; Y | X1, ..., Xn−1)

Proof.

I(X1, X2, ..., XN; Y) = H(X1, X2, ..., XN) − H(X1, X2, ..., XN | Y)
 = Σ_{n=1}^{N} H(Xn | X1, ..., Xn−1) − Σ_{n=1}^{N} H(Xn | X1, ..., Xn−1, Y)
 = Σ_n I(Xn; Y | X1, ..., Xn−1)   (2.20)


2.4.4 Data processing inequality

The data processing inequality can be used to show that no clever manipulation of the data can improve the inferences that can be made from the data.

Definition 2.4.4. Random variables X, Z, Y are said to form a Markov chain in that order (denoted X → Z → Y) if the conditional distribution of Y depends only on Z and is conditionally independent of X. Specifically,

X → Z → Y  if  P(x, z, y) = P(x) P(z|x) P(y|z).

Theorem 2.4.2 (Data processing inequality). If X → Z → Y , then

I(X;Z) ≥ I(X;Y )

Proof. By the chain rule, we obtain

I(X;Y Z) = I(X;Y ) + I(X;Z|Y )

= I(X;Z) + I(X;Y |Z)

Since X and Y are independent given Z, we have I(X;Y |Z) = 0. Thus,

I(X;Z) = I(X;Y ) + I(X;Z|Y ) (2.21)

From the non-negativity of mutual information, I(X;Z|Y) ≥ 0. Thus, (2.21) implies that

I(X;Z) ≥ I(X;Y)

By symmetry (applying the same argument to the reversed chain Y → Z → X), it also follows that

I(Y;Z) ≥ I(X;Y)

This theorem demonstrates that no processing of Z, deterministic or random, can increase the information that Z contains about X.

Corollary 2.4.3. In particular, if Y = f(Z), we have I(X;Z) ≥ I(X; f(Z)).

Proof. X → Z → f(Z) forms a Markov Chain.

Corollary 2.4.4. If X → Z → Y , then I(X;Z|Y ) ≤ I(X;Z).

Proof. Using the fact I(X;Y ) ≥ 0, the corollary follows immediately from (2.21).

Expressing the data processing inequality in terms of entropies, we have

H(X) − H(X|Z) ≥ H(X) − H(X|Y)  ⇒  H(X|Z) ≤ H(X|Y)   (2.22)

The average uncertainty H(X|Z) about the input of a channel given the output is called the equivocation of the channel, and the above inequality yields the intuitively satisfying result that this uncertainty, or equivocation, can never decrease as we go further from the input along a sequence of cascaded channels.


Example 2.4.1. Suppose that the random vector [X, Y, Z] is equally likely to take on the values [000], [010], [100], [101]. Then

H(X) = h(1/2) = 1 bit.

Note that P_{Y|X}(0|1) = 1, so that H(Y|X=1) = 0. Similarly, P_{Y|X}(0|0) = 1/2, so that H(Y|X=0) = h(1/2) = 1 bit. Thus, H(Y|X) = (1/2) × 1 = 1/2 bit.

H(Z|XY) = Σ_{x,y} P(x,y) H(Z|X=x, Y=y) = (1/4)(0) + (1/4)(0) + (1/2)(1) = 1/2 bit

and H(XYZ) = H(X) + H(Y|X) + H(Z|XY) = 1 + 1/2 + 1/2 = 2 bits.

Because P_Y(1) = 1/4 and P_Y(0) = 3/4, we have

H(Y) = h(1/4) = 0.811 bits.

We see that H(Y|X) = 1/2 < H(Y). However, H(Y|X=0) = 1 > H(Y).

Furthermore, I(X;Y) = H(Y) − H(Y|X) = 0.811 − 0.5 = 0.311 bits. In words, the first component of the random vector [X, Y, Z] gives 0.311 bits of information about the second component, and vice versa.
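The numbers in Example 2.4.1 can be reproduced mechanically. The Python sketch below (added for illustration only) recomputes them from the joint pmf of [X, Y, Z].

```python
import math
from collections import Counter

# The four equally likely outcomes of [X, Y, Z] in Example 2.4.1
outcomes = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 0, 1)]
P = {o: 1 / 4 for o in outcomes}

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginal(indices):
    """Marginal pmf of the selected coordinates, e.g. indices=(0,) for X."""
    m = Counter()
    for o, p in P.items():
        m[tuple(o[i] for i in indices)] += p
    return m

H_X = H(marginal((0,)).values())
H_XY = H(marginal((0, 1)).values())
H_Y = H(marginal((1,)).values())
H_XYZ = H(P.values())

print(H_X)                  # 1.0 bit
print(H_XY - H_X)           # H(Y|X) = 0.5 bit
print(H_XYZ - H_XY)         # H(Z|XY) = 0.5 bit
print(H_Y)                  # 0.811 bits
print(H_Y - (H_XY - H_X))   # I(X;Y) = 0.311 bits
```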

2.5 Sufficient Statistics

Suppose we have a family of pmfs {f_θ(x)} indexed by θ, and let X be a sample from a distribution in this family. Let T(X) be any statistic (like the sample mean or variance). Then

θ → X → T(X),  and hence  I(θ; T(X)) ≤ I(θ; X).

A statistic T(X) is called sufficient for θ if it contains all the information in X about θ.

Definition 2.5.1. A function T(X) is said to be a sufficient statistic relative to the family {f_θ(x)} if X is independent of θ given T(X); i.e., θ → T(X) → X forms a Markov chain.

This means that I(θ; X) = I(θ; T(X)) for all distributions on θ.

Example 2.5.1. Let X1, ..., Xn, Xi ∈ {0, 1}, be an i.i.d. sequence of coin tosses of a coin with unknown parameter θ = Pr(Xi = 1). Given n, the number of 1's is a sufficient statistic for θ. Here, T(X1, ..., Xn) = Σ_{i=1}^{n} Xi.


Example 2.5.2. If X ∼ N(θ, 1), i.e., if

f_θ(x) = (1/√(2π)) e^{−(x−θ)²/2},

and X1, ..., Xn are drawn independently according to this distribution, then X̄_n = T(X1, ..., Xn) = (1/n) Σ_{i=1}^{n} Xi is a sufficient statistic for θ.

2.6 Fano’s inequality

Suppose that we wish to estimate a r.v. X based on a correlated r.v. Y. Fano's inequality relates the probability of error in estimating X to the conditional entropy H(X|Y).

Theorem 2.6.1. For any r.v.'s X and Y, suppose we try to guess X by X̂ = g(Y). The error probability Pe = P(X̂ ≠ X) satisfies

H(Pe) + Pe log(|X| − 1) ≥ H(X|Y)

Proof. Apply the chain rule to H(E, X|Y), where the error r.v. is

E = 1 if X̂ ≠ X, and E = 0 if X̂ = X.

We have

H(E, X|Y) = H(X|Y) + H(E|XY) = H(X|Y)

H(E, X|Y) = H(E|Y) + H(X|EY)
          ≤ H(E) + H(X|EY)
          = H(Pe) + P(E=1) H(X|Y, E=1) + P(E=0) H(X|Y, E=0)
          = H(Pe) + Pe H(X|Y, E=1) + (1−Pe) H(X|Y, E=0)
          = H(Pe) + Pe H(X|Y, E=1) + 0
          ≤ H(Pe) + Pe log(|X| − 1)    (given an error, X can take only the |X| − 1 values other than X̂)
          ≤ 1 + Pe log |X|   (2.23)

Interpretation: Given the observation Y, the remaining uncertainty about X can be split into two parts:

• deciding whether the guess is correct, whose uncertainty is H(Pe);

• if the guess is wrong (with probability Pe), X can take any of the |X| − 1 values other than X̂, and the amount of information needed to determine which one is at most log(|X| − 1).
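As a numerical sanity check (added here; the joint pmf is an arbitrary example and the estimator is the MAP guess x̂(y) = argmax_x P(x|y)), the following Python sketch verifies that H(X|Y) never exceeds the Fano bound H(Pe) + Pe log(|X| − 1).

```python
import math

def h2(p):
    """Binary entropy H(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Arbitrary example joint pmf on X = {0,1,2}, Y = {0,1}
P = {(0, 0): 0.30, (1, 0): 0.10, (2, 0): 0.10,
     (0, 1): 0.05, (1, 1): 0.35, (2, 1): 0.10}
Xs, Ys = (0, 1, 2), (0, 1)
Py = {y: sum(P[x, y] for x in Xs) for y in Ys}

# MAP guess x_hat(y) = argmax_x P(x|y); Pe = P(X != x_hat(Y))
x_hat = {y: max(Xs, key=lambda x: P[x, y]) for y in Ys}
Pe = sum(p for (x, y), p in P.items() if x != x_hat[y])

# H(X|Y) = -sum P(x,y) log2 P(x|y)
H_X_given_Y = -sum(p * math.log2(p / Py[y]) for (x, y), p in P.items() if p > 0)

bound = h2(Pe) + Pe * math.log2(len(Xs) - 1)
print(H_X_given_Y, bound, H_X_given_Y <= bound)   # Fano: the bound dominates H(X|Y)
```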

2.7 Convex functions (Convexity of mutual information)

2.7.1 Concavity of entropy

Theorem 2.7.1. Entropy H(X) is concave in P (x).


That is, let X1 and X2 be r.v.'s defined on X with distributions P1(x) and P2(x), respectively. For any θ ∈ [0, 1], consider a r.v. X with

P_X(x) = θ P1(x) + (1−θ) P2(x) for all x.

Then

H(X) ≥ θ H(X1) + (1−θ) H(X2).

Proof. Let Z be a binary r.v. with P(Z = 0) = θ, and let X = X1 if Z = 0 and X = X2 if Z = 1. Then

H(X) ≥ H(X|Z)
     = θ H(X|Z=0) + (1−θ) H(X|Z=1)
     = θ H(X1) + (1−θ) H(X2)   (2.24)

or, equivalently,

H(θP1 + (1−θ)P2) ≥ θ H(P1) + (1−θ) H(P2).

2.7.2 Concavity of mutual information

Theorem 2.7.2. For a fixed transition probability P (y|x), I(X;Y ) is a concave (convex-∩) function of P (x).

Proof. Construct X1, X2, X, and Z as above. Consider

I(XZ;Y) = I(X;Y) + I(Z;Y|X)
        = I(Z;Y) + I(X;Y|Z)   (2.25)

Conditioned on X, the r.v.'s Y and Z are independent, i.e., P(y|x,z) = P(y|x). Using I(Y;Z|X) = 0, we have

I(X;Y) ≥ I(X;Y|Z)
       = θ I(X;Y|Z=0) + (1−θ) I(X;Y|Z=1)   (2.26)
       = θ I(X1;Y) + (1−θ) I(X2;Y)   (2.27)

Theorem 2.7.3. For a fixed input distribution P(x), I(X;Y) is convex-∪ in P(y|x).

Proof. Consider a r.v. X and two channels with transition probabilities P1(y|x) and P2(y|x). When fed with X, the outputs of the two channels are denoted Y1 and Y2. Now let one of the channels be chosen randomly according to a binary r.v. Z (with P(Z = 0) = θ) that is independent of X, and denote the output by Y.

I(X;YZ) = I(X;Z) + I(X;Y|Z)
        = I(X;Y) + I(X;Z|Y)   (2.28)

Since Z is independent of X, I(X;Z) = 0, while I(X;Z|Y) ≥ 0. Thus, I(X;Y) ≤ I(X;Y|Z) = θ I(X;Y1) + (1−θ) I(X;Y2).


Figure 2.4: Convexity of mutual information for a given channel: input X ∈ {0, 1, ..., K−1}, channel P(Y|X), output Y ∈ {0, 1, ..., J−1}, with a channel-selection variable Z.

2.7.3 Convexity of mutual information: an alternative proof

Theorem 2.7.4. Let (X, Y) ∼ P(x, y) = P(x)P(y|x). The mutual information I(X;Y) is a convex-∩ function of P(x) for fixed P(y|x) and a convex-∪ function of P(y|x) for fixed P(x).

Proof.

I(X;Y) = H(Y) − H(Y|X) = H(Y) − Σ_x P(x) H(Y|X=x)   (2.29)

If P(y|x) is fixed, then P(y) is a linear function of P(x). Since H(Y) is a convex-∩ function of P(y), it follows that H(Y) is a convex-∩ function of P(x). The second term in (2.29) is a linear function of P(x). Hence the difference is a convex-∩ function of P(x).

To prove the second part, we fix P(x) and consider two different conditional distributions P1(y|x) and P2(y|x). Let

P_λ(y|x) = λ P1(y|x) + (1−λ) P2(y|x).

Then

P_λ(x,y) = λ P1(x,y) + (1−λ) P2(x,y)  and  P_λ(y) = λ P1(y) + (1−λ) P2(y),

where P1(x,y) = P1(y|x)P(x) and P2(x,y) = P2(y|x)P(x). If we let Q_λ(x,y) = P(x)P_λ(y), then

Q_λ(x,y) = P(x)[λ P1(y) + (1−λ) P2(y)] = λ Q1(x,y) + (1−λ) Q2(x,y)   (2.30)

Since I(X;Y) = D(P_λ || Q_λ) and D(P||Q) is a convex-∪ function of the pair (P, Q), it follows that I(X;Y) is a convex-∪ function of P(y|x).


2.8 Differential Entropy

Differential entropy is the entropy of a continuous random variable.

Let X be a continuous random variable with cumulative distribution function F(x) = Pr(X ≤ x). The probability density function of X (if it exists) is given by

p(x) = dF(x)/dx,  x ∈ (−∞, +∞).

The set of x where p(x) > 0 is called the support set of X.

• Definition: The differential entropy h(X) of a continuous random variable X with density p(x) is defined by

h(X) = ∫_S p(x) log(1/p(x)) dx = −∫_S p(x) log p(x) dx   (2.31)

where S is the support set of the random variable.

• Motivation for the definition: Think of quantizing the x-axis into intervals of length Δx; the quantized outcomes then form a discrete ensemble with

Pr(x_i < X ≤ x_i + Δx) ≈ p(x_i)Δx.

The self-information of an outcome is log[1/(p(x_i)Δx)], and the entropy of the quantized r.v. is H_Δx(X) = −Σ_{i=−∞}^{+∞} p(x_i)Δx log[p(x_i)Δx]. As Δx → 0, we have

lim_{Δx→0} H_Δx(X) = −∫_{−∞}^{∞} p(x) log p(x) dx − lim_{Δx→0} log Δx ∫_{−∞}^{∞} p(x) dx
                   = −∫_{−∞}^{∞} p(x) log p(x) dx − lim_{Δx→0} log Δx   (2.32)

The second term on the RHS of (2.32) grows without bound as Δx → 0. For the purpose of calculation, the first term is used to define the entropy of a continuous ensemble.

Example 2.8.1 (Uniform distribution): Consider a random variable X distributed uniformly over (a, b). Its differential entropy is

h(X) = −∫_a^b (1/(b−a)) log(1/(b−a)) dx = log(b−a).

• Notice that differential entropy is not necessarily positive and not necessarily finite.

Example 2.8.2 (Gaussian distribution): Let X ∼ N(0, σ²). The pdf of X is

p(x) = (1/(√(2π)σ)) exp(−x²/(2σ²)).


The differential entropy is then given by

h(X) = −∫_{−∞}^{∞} p(x) [−ln(√(2π)σ) − x²/(2σ²)] dx
     = (1/2) ln(2πσ²) + (1/(2σ²)) ∫_{−∞}^{∞} p(x) x² dx
     = (1/2) ln(2πσ²) + 1/2
     = (1/2) ln(2πσ²) + (1/2) ln e
     = (1/2) ln(2πeσ²) nats   (2.33)
     = (1/2) log₂(2πeσ²) bits

(using log₂ Φ = (ln Φ) · log₂ e).

From (2.33), the variance of X can be expressed as

σ² = (1/(2πe)) e^{2h(X)},

which is sometimes called the entropy power.
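The closed form (2.33) can be checked by simulation. The Python sketch below (added for illustration; σ is an arbitrary choice) estimates h(X) = E[−log₂ p(X)] by Monte Carlo for a Gaussian X and compares it with (1/2) log₂(2πeσ²).

```python
import math
import random

sigma = 2.0                        # arbitrary example standard deviation
closed_form = 0.5 * math.log2(2 * math.pi * math.e * sigma ** 2)

def log2_pdf(x):
    """log2 of the N(0, sigma^2) density at x."""
    return math.log2(1 / (math.sqrt(2 * math.pi) * sigma)) \
        - (x * x) / (2 * sigma ** 2) * math.log2(math.e)

random.seed(0)
n = 100_000
# Monte Carlo estimate of h(X) = E[-log2 p(X)]
estimate = -sum(log2_pdf(random.gauss(0, sigma)) for _ in range(n)) / n
print(closed_form, estimate)       # the two values agree to a few decimal places
```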

2.8.1 Joint and conditional differential entropy

• Definition: The differential entropy of a set X1, X2, ..., Xn of random variables with density p(x1, x2, ..., xn) is defined as

h(X1, X2, ..., Xn) = −∫ p(x1, ..., xn) log p(x1, ..., xn) dx1 dx2 ... dxn.

• Definition: If X and Y have a joint pdf p(x, y), then the conditional differential entropy is

h(X|Y) = −∫∫ p(x, y) log p(x|y) dx dy.

• Since p(x, y) = p(x|y) p(y), we obtain

h(XY) = h(Y) + h(X|Y) = h(X) + h(Y|X)   (2.34)

Theorem 2.8.1. Let X1, X2, ..., Xn have a multivariate normal distribution with mean m and covariance matrix K. Then

h(X1, X2, ..., Xn) = (1/2) log((2πe)^n |K|) bits   (2.35)

Proof. The pdf of X1, X2, ..., Xn is

p(x) = (1/((√(2π))^n |K|^{1/2})) e^{−(1/2)(x−m)^T K^{−1}(x−m)}.


Then

h(X1, X2, ..., Xn) = −∫ p(x) [−(1/2)(x−m)^T K^{−1}(x−m) − ln((√(2π))^n |K|^{1/2})] dx
 = (1/2) E[Σ_i Σ_j (X_i − m_i)(K^{−1})_{ij}(X_j − m_j)] + (1/2) ln((2π)^n |K|)
 = (1/2) Σ_i Σ_j E[(X_i − m_i)(X_j − m_j)] (K^{−1})_{ij} + (1/2) ln((2π)^n |K|)
 = (1/2) Σ_j Σ_i K_{ji}(K^{−1})_{ij} + (1/2) ln((2π)^n |K|)
 = (1/2) Σ_j (K K^{−1})_{jj} + (1/2) ln((2π)^n |K|)
 = (1/2) Σ_j I_{jj} + (1/2) ln((2π)^n |K|)
 = n/2 + (1/2) ln((2π)^n |K|)
 = (1/2) ln((2πe)^n |K|) nats
 = (1/2) log₂((2πe)^n |K|) bits.

(Here we used that K is symmetric: K_{ij} = K_{ji}.)

2.8.2 Relative entropy and mutual information

• Definition: The relative entropy D(p||q) between two densities p and q is defined by

D(p||q) = ∫ p log(p/q).

Note that D(p||q) is finite only if the support set of p is contained in the support set of q.

• Definition: The mutual information I(X;Y) between two random variables with joint density p(x, y) is defined as

I(X;Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} p(x,y) log [p(x,y)/(p(x)p(y))] dx dy   (2.36)

It is clear that

I(X;Y) = h(X) − h(X|Y) = h(Y) − h(Y|X) = h(X) + h(Y) − h(XY)   (2.37)

and I(X;Y) = D(p(x,y) || p(x)p(y)).

• The (average) conditional mutual information is given by

I(X;Y|Z) = ∫∫∫ p(x,y,z) log [p(x,y|z)/(p(x|z)p(y|z))] dx dy dz.


2.8.3 Properties

(1) D(p||q) ≥ 0, with equality iff p = q almost everywhere (a.e.).

Proof. Let S be the support set of p. Then

−D(p||q) = ∫_S p log(q/p) ≤ ∫_S p (q/p − 1) log e = (∫_S q − 1) log e ≤ 0.

Corollary 2.8.2. I(X;Y ) ≥ 0 with equality iff X and Y are independent.

(2) h(X|Y ) ≤ h(X) with equality iff X and Y are independent.

(3) Chain rule for differential entropy:

h(X1, X2, ..., XN) = Σ_{n=1}^{N} h(Xn | X1, X2, ..., Xn−1)   (2.38)

Proof. Follows directly from the definitions.

(4) h(X + c) = h(X) and h(aX) = h(X) + log|a|.

Proof. Let Y = aX. Then p_Y(y) = (1/|a|) p_X(y/a) (in general, p_Y(y) = p_X(x)/|f'(x)| with x = y/a), and

h(aX) = −∫ p_Y(y) log p_Y(y) dy
      = −∫ (1/|a|) p_X(y/a) log[(1/|a|) p_X(y/a)] dy
      = −[∫ (1/|a|) p_X(y/a) log p_X(y/a) dy − log|a| ∫ (1/|a|) p_X(y/a) dy]
      = −∫ p_X(x) log p_X(x) dx + log|a|
      = h(X) + log|a|.

Similarly, h(AX) = h(X) + log|A|, where |A| denotes the absolute value of the determinant of A.

• In general, if Y = f(X), where f is a one-to-one correspondence between X and Y, then

h(Y) = h(X) + E[log|dy/dx|] = h(X) − E[log|dx/dy|].


(5) Data processing theorem: If X is the input to a channel, Z the output, and Y a transformation of Z (i.e., Y = f(Z)), then

I(X;Z) ≥ I(X;Y).

Suppose that Y is a reversible transformation of Z, so that Z = f^{−1}(Y). Then I(X;Y) ≥ I(X;Z). Combining the two inequalities, we have I(X;Y) = I(X;Z).

This implies that the average mutual information between two ensembles is invariant under any reversible transformation of one of the outcomes.

Figure 2.5: The data processing theorem: X → channel → Z → y = f(z) → Y.

(6)

I(X;YZ) = I(X;Y) + I(X;Z|Y)   (2.39)
        = I(X;Z) + I(X;Y|Z)   (2.40)

(7) The Gaussian distribution has the largest differential entropy of any distribution with a specified mean and variance. In other words, if the random variable X has mean m and variance σ² < ∞, then

h(X) ≤ (1/2) log(2πeσ²).

Proof. Suppose that q(x) is the density of an N(m, σ²) variable. Then (taking logarithms to base e here)

log q(x) = log[(1/(√(2π)σ)) e^{−(x−m)²/(2σ²)}] = −log(√(2π)σ) − (x−m)²/(2σ²).

Let p(x) be any density of a random variable with mean m_p and variance σ²_p. We have

−∫_{−∞}^{∞} p(x) log q(x) dx = log(√(2π)σ) + (1/(2σ²)) ∫_{−∞}^{∞} p(x)(x−m)² dx
 = log(√(2π)σ) + (1/(2σ²)) ∫_{−∞}^{∞} p(x)[(x−m_p) + (m_p−m)]² dx
 = log(√(2π)σ) + (1/(2σ²))[σ²_p + (m_p−m)²].

Thus, if p(x) and p′(x) have the same mean and the same variance, then

−∫_{−∞}^{∞} p(x) log q(x) dx = −∫_{−∞}^{∞} p′(x) log q(x) dx.


Now let p(x) have the same mean m and variance σ² as q(x). With this equation (taking p′ = q), we obtain

0 ≤ D(p||q) = ∫_{−∞}^{∞} p(x) log p(x) dx − ∫_{−∞}^{∞} p(x) log q(x) dx
            = −h(p) − ∫_{−∞}^{∞} q(x) log q(x) dx
            = −h(p) + h(q),

so h(p) ≤ h(q) = (1/2) log(2πeσ²).

Summary: for a given variance σ², the Gaussian distribution has the maximum differential entropy; for a given h(X), the Gaussian distribution has the smallest variance σ² (average power).

• The above argument can be extended to a random vector.

Theorem 2.8.3. Let the random vector X ∈ R^n have zero mean and covariance K = E[XX^T], i.e., K_{ij} = E[X_i X_j], 1 ≤ i, j ≤ n. Then

h(X) ≤ (1/2) log((2πe)^n |K|)

with equality iff X ∼ N(0, K).

(8) If the random variable X takes on values in the interval [−M, M], i.e., x ∈ [−M, M], then h(X) ≤ log 2M, with equality iff X has a uniform distribution over [−M, M].

Proof. See [textbook, pp. 30] for Method 1.

Method 2: Let p(x) be the uniform density over [a, b], satisfying ∫_a^b p(x) dx = 1, and let q(x) be an arbitrary density with ∫_a^b q(x) dx = 1. Then

h[X, q(x)] − h[X, p(x)] = −∫_a^b q(x) log q(x) dx + ∫_a^b p(x) log p(x) dx
 = −∫_a^b q(x) log q(x) dx + ∫_a^b p(x) log(1/(b−a)) dx
 = −∫_a^b q(x) log q(x) dx + [log(1/(b−a)) ∫_a^b p(x) dx]
 = −∫_a^b q(x) log q(x) dx + [log(1/(b−a)) ∫_a^b q(x) dx]
 = −∫_a^b q(x) log q(x) dx + ∫_a^b q(x) log p(x) dx
 = ∫_a^b q(x) log [p(x)/q(x)] dx ≤ log[∫_a^b q(x) (p(x)/q(x)) dx] = 0,

where the last “≤” follows from Jensen's inequality.


Chapter 3

The Asymptotic Equipartition Property (AEP)

3.1 Preliminaries: Convergence of random variables

lim_{n→∞} (1/n) Σ_{i=1}^{n} X_i → E[X] for i.i.d. r.v.'s.

3.1.1 Types of convergence

• Almost sure convergence (also called convergence with probability 1):

P(lim_{n→∞} Y_n(ω) = Y(ω)) = 1;  we write Y_n → Y a.s.

• Mean-square convergence:

lim_{n→∞} E[|Y_n − Y|²] = 0.

• Convergence in probability: for all ε > 0,

lim_{n→∞} P(|Y_n(ω) − Y(ω)| > ε) = 0.

• Convergence in distribution: the CDFs F_n(y) = Pr(Y_n ≤ y) satisfy

lim_{n→∞} F_n(y) = F_Y(y)

at all y for which F_Y is continuous.

3.1.2 Weak law of large numbers (WLLN)

Let X1, X2, ..., Xn be i.i.d. r.v.'s with finite mean µ and variance σ², and let

S_n = (X1 + ··· + Xn)/n.

Weak LLN (by Chebyshev's inequality):

P(|(1/n) Σ_{i=1}^{n} X_i − µ| ≥ ε) ≤ σ²/(nε²) ≜ δ.

3.2 Typical sequences

Consider a binary source output sequence of length n = 20 with

P_X(0) = 3/4 and P_X(1) = 1/4.

Three sequences are given below; one of them is the actual source output. Guess which one is the “real” output sequence:

(1) the all-ones sequence: 1, 1, ..., 1, 1

(2) 10101000000000110001

(3) the all-zeros sequence: 0, 0, ..., 0, 0

Intuitively, the answer is (2).

• The law of large numbers says, roughly, that if some event has probability p, then in many independent trials the fraction of trials in which the event occurs will be close to p with high probability. In sequence (2) the fraction of 1's is 6/20 = 0.3, so this sequence agrees best with the WLLN.

• In 1948, Shannon spoke of “typical long sequences”, but gave no precise definition of such sequences. Many definitions have been given since; each is convenient for some purposes and inconvenient for others. Therefore, when discussing arguments based on typical sequences, we should first make clear which definition of typicality is being used.

• ε-letter-typical sequences: for every letter a,

(1 − ε) P_X(a) ≤ N_a(x^n)/n ≤ (1 + ε) P_X(a),

where N_a(x^n) denotes the number of occurrences of a in x^n.

• Entropy-typical (weakly typical) sequences, defined next.

3.3 Weakly typical sequences

Theorem 3.3.1 (AEP). If X1, X2, ..., Xn are i.i.d. ∼ P(x), then

−(1/n) log P(X1, X2, ..., Xn) → H(X) in probability.

Proof. Define the r.v. Y_i = log P(X_i) and apply the WLLN to the Y_i:

−(1/n) log P(X1, X2, ..., Xn) = −(1/n) Σ_{i=1}^{n} log P(X_i)
                              = −(1/n) Σ_{i=1}^{n} Y_i → −E[Y] = H(X) in probability.   (3.1)


That is, for all ε > 0,

lim_{n→∞} Pr(|−(1/n) log P(X1, X2, ..., Xn) − H(X)| ≤ ε) = 1.

Definition 3.3.1. The typical set T_ε with respect to P(x) is defined as

T_ε = {x ≜ (x1, x2, ..., xn) ∈ X^n : |−(1/n) log P(x) − H(X)| ≤ ε}.

Thus, T_ε is the set of sequences (x1, x2, ..., xn) ∈ X^n for which the sample average of the log pmf is within ε of its mean H(X). It can be rewritten as

T_ε = {x ∈ X^n : 2^{−n[H(X)+ε]} ≤ P(x) ≤ 2^{−n[H(X)−ε]}},

which implies that the sequences in T_ε are approximately equiprobable.

• As a consequence of the AEP, we have

P(x ∈ T_ε) > 1 − δ → 1   (T_ε is a high-probability set).

Theorem 3.3.2 (Size of the typical set). The number of elements |T_ε| of the typical set T_ε satisfies

(1 − ε) 2^{n[H(X)−ε]} ≤ |T_ε| ≤ 2^{n[H(X)+ε]}.

Proof. Since P(x) ≥ 2^{−n[H(X)+ε]} for each x ∈ T_ε, we have

1 = Σ_{x∈X^n} P(x) ≥ Σ_{x∈T_ε} P(x) ≥ Σ_{T_ε} 2^{−n[H(X)+ε]} = |T_ε| 2^{−n[H(X)+ε]},

which implies |T_ε| ≤ 2^{n[H(X)+ε]}.

Conversely, since Pr(T_ε) > 1 − σ²/(nε²) = 1 − δ = 1 − ε for n large enough, we have

1 − ε < Σ_{x∈T_ε} P(x) ≤ Σ_{T_ε} 2^{−n[H(X)−ε]} = |T_ε| 2^{−n[H(X)−ε]},

which implies |T_ε| > (1 − ε) 2^{n[H(X)−ε]}.

Compare with |X^n| = 2^{n log|X|}: let

α ≜ |T_ε|/|X^n| ≤ 2^{−n[log|X| − H(X) − ε]} → 0 as n → ∞,

that is, the typical sequences are far fewer than the atypical ones.

Summary:

We conclude that for large n, the typical set T_ε has aggregate probability approximately 1 and contains approximately 2^{nH(X)} elements, each of which has probability approximately 2^{−nH(X)}. That is, asymptotically for very large n, the r.v. X^n resembles an equiprobable source with alphabet size 2^{nH(X)}.


Figure 3.1: Illustration of the typical set: X^n contains |X|^n elements, of which the typical set T_ε contains about 2^{n[H(X)+ε]} elements; the remaining elements form the atypical set T_ε^c.

Example 3.3.1. Consider binary r.v.'s X_i, i.i.d. with P(X = 0) = p and P(X = 1) = 1 − p.

A “typical” sequence of length n has roughly np 0's and n(1−p) 1's; the probability of such a sequence is

p^{np}(1−p)^{n(1−p)} = 2^{n[p log p + (1−p) log(1−p)]} = 2^{−nH(X)}.

How many “typical” sequences are there?

C(n, np) = n! / [(np)! (n(1−p))!] ≈ n^n e^{−n} / [(np)^{np} e^{−np} (n(1−p))^{n(1−p)} e^{−n(1−p)}] = 1/[p^{np}(1−p)^{n(1−p)}] = 2^{nH(X)}   (3.2)

(Note: Stirling's formula: n! ≈ n^n e^{−n} √(2πn).)
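A brute-force illustration (added here; the parameters p, n, ε are arbitrary, and n is kept small so the whole space can be enumerated): the Python sketch below lists the entropy-typical set of a Bernoulli block and checks the cardinality bound of Theorem 3.3.2. For such a small n the typical set does not yet carry probability close to 1.

```python
import math
from itertools import product

p, n, eps = 0.25, 12, 0.1        # Pr(X=1), block length, tolerance (arbitrary choices)
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)   # source entropy, ~0.811 bits

def log2_prob(seq):
    k = sum(seq)                                      # number of 1's in the sequence
    return k * math.log2(p) + (n - k) * math.log2(1 - p)

# Entropy-typical set: sequences with |-(1/n) log2 P(x) - H| <= eps
typical = [x for x in product((0, 1), repeat=n)
           if abs(-log2_prob(x) / n - H) <= eps]

prob_typical = sum(2 ** log2_prob(x) for x in typical)
print(len(typical), 2 ** (n * (H + eps)))   # |T_eps| <= 2^{n(H+eps)}
print(prob_typical)                          # approaches 1 only as n grows large
```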

3.4 Using the typical set for data compression

Motivated by the AEP, we divide all sequences in X^n into two sets: T_ε and its complement T_ε^c.

3.4.1 Source encoding method

• We order all elements in T_ε and in T_ε^c according to some order.

• Then we can represent each sequence in T_ε by giving its index, which requires at most n[H(X) + ε] + 1 bits (the extra bit is needed because the index length must be an integer).

• We prefix all these indices by a 0, so the total length is at most n[H(X) + ε] + 2.

• Similarly, we can index each sequence not in T_ε by using no more than n log|X| + 1 bits, and prefix these indices by a 1.

• Thus, we have a code for all the sequences in X^n.


3.4.2 The average length of codeword

Let l(x) = length of the binary codeword corresponding x. Then the expected lengthof the codeword is

E[l(x)] =∑x∈Tε

P (x)l(x) +∑x∈T c

ε

P (x)l(x)

≤∑Tε

P (x)[n(H(X) + ε) + 2] +∑T cε

P (x)[n log |X |+ 2]

≤ (1− ε)[n(H(X) + ε) + 2] + ε(n log |X |+ 2)

= n(H(x) + ε) + 2− ε[n(H(X) + ε) + 2] + εn log |X |+ 2ε

≤ n(H(X) + ε) + εn log |X |+ 2

= n(H(X) + ε′). (3.3)

where ε′ = ε+ ε log |X |+ 2n .

So E[ 1n l(x)] ≤ H(X) + ε.
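The following Python sketch (added for illustration; the parameters are arbitrary, and the codeword lengths are the upper bounds used in the text rather than a tight encoding) evaluates E[ℓ(X^n)]/n for this two-class scheme by enumerating all 2^n sequences and compares it with H(X) + ε′. The bound is guaranteed only once n is large enough that Pr(T_ε^c) ≤ ε.

```python
import math
from itertools import product

p, n, eps = 0.25, 12, 0.1        # arbitrary example parameters
alphabet_size = 2
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def prob(seq):
    k = sum(seq)
    return (p ** k) * ((1 - p) ** (n - k))

def length(seq):
    """Codeword length of the AEP-based scheme: index bits plus one flag bit."""
    typical = abs(-math.log2(prob(seq)) / n - H) <= eps
    if typical:
        return math.floor(n * (H + eps)) + 1 + 1   # index of a typical sequence + flag
    return math.ceil(n * math.log2(alphabet_size)) + 1 + 1   # raw index + flag

expected_len = sum(prob(x) * length(x) for x in product((0, 1), repeat=n))
eps_prime = eps + eps * math.log2(alphabet_size) + (2 + 2 * eps) / n
print(expected_len / n, H + eps_prime)   # per-symbol length vs. the bound H(X) + eps'
```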


Chapter 4

Entropy Rates of a Stochastic Process

4.1 Stochastic processes

• A stochastic process is an indexed sequence of random variables X1, X2, ..., i.e., a map from Ω to X^∞.

• A stochastic process is characterized by the joint PMF P_{X1 X2 ... Xn}(x1, x2, ..., xn).

• The entropy of a stochastic process:

H(X1, X2, ...) = H(X1) + H(X2|X1) + ··· + H(X_i|X1 ... X_{i−1}) + ···

• Difficulties: the sum goes to infinity, and in general all the terms are different.

4.2 Entropy rate

• The entropy rate of a stochastic process {X_i} is defined by

H(𝒳) = lim_{n→∞} (1/n) H(X1, X2, ..., Xn),

if the limit exists.

4.3 Entropy rate of stationary processes

• Chain rule:

(1/n) H(X1, X2, ..., Xn) = (1/n) Σ_{i=1}^{n} H(X_i | X1, ..., X_{i−1})


• For a stationary process,

H(X_{n+1} | X_1^n) ≤ H(X_{n+1} | X_2^n) = H(X_n | X_1^{n−1}).   (4.1)

Therefore, the sequence H(X_n | X_1^{n−1}) is non-increasing and non-negative, so its limit exists.

Theorem 4.3.1. For a stationary process, the entropy rate satisfies

lim_{n→∞} (1/n) H(X_1^n) = lim_{n→∞} H(X_n | X_1^{n−1}).

• Markov chain: a discrete stochastic process is a Markov chain if

P_{X_n | X_0 ... X_{n−1}}(x_n | x_0, ..., x_{n−1}) = P_{X_n | X_{n−1}}(x_n | x_{n−1})

for n = 1, 2, ..., and all (x_0, ..., x_n) ∈ X^{n+1}.

• Denote the transition probabilities by p_{ij} = P(X_{n+1} = j | X_n = i) and the stationary distribution by π.

• The entropy rate of a (stationary) Markov chain:

lim_{n→∞} (1/n) H(X_1^n) = lim_{n→∞} H(X_n | X_{n−1}) = −Σ_{i,j} π_i p_{ij} log p_{ij}.

• The significance of the entropy rate of a stochastic process arises from the AEP for a stationary ergodic process:

−(1/n) log P(X1, X2, ..., Xn) → H(𝒳) with probability 1.
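As a small numerical sketch (added here; the transition probabilities are arbitrary examples), the following Python snippet computes the stationary distribution and the entropy rate of a two-state Markov chain from the formula above.

```python
import math

# Two-state Markov chain: P[i][j] = Pr(X_{n+1} = j | X_n = i)  (arbitrary example)
P = [[0.9, 0.1],
     [0.4, 0.6]]

# Stationary distribution pi solves pi = pi P; for a 2-state chain,
# pi_0 = p10 / (p01 + p10) and pi_1 = p01 / (p01 + p10).
p01, p10 = P[0][1], P[1][0]
pi = [p10 / (p01 + p10), p01 / (p01 + p10)]

# Entropy rate H = -sum_i pi_i sum_j p_ij log2 p_ij
H_rate = -sum(pi[i] * P[i][j] * math.log2(P[i][j])
              for i in range(2) for j in range(2) if P[i][j] > 0)
print(pi, H_rate)   # e.g. pi = [0.8, 0.2]; entropy rate in bits per symbol
```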


Chapter 5

Coding for Discrete Sources

5.1 Introduction

Three important classes of sources:

• Discrete sources (离散时间、离散信号取值集合)

– The output of a discrete source is a sequence of letters from a given discretealphabet X .

– A discrete alphabet is either a finite set of letters or a countably infinite setof letters.

• Analog sources (also called continuous-time, continuous-amplitude sources)

– The output is an analog waveform.

• Discrete-time continuous-amplitude sourcesThe output is a sequence of values which could be real number or multi-dimensionalreal numbers.

• 其它分类: 无记忆源有记忆源

高斯源Markov源

平稳源各态历经源

• Discrete memoryless source (DMS): DMS is a device whose output is a semi-infinite i.i.d. sequence of random variables X1, X2, . . . , drawn from the finite setX .

• An important parameter of a source is the rate Rs [source letters/s].

5.2 Coding a single random variable

Definition 5.2.1. A source code C for a random variable X is a mapping from X to D*, the set of finite-length strings of symbols from a D-ary alphabet.

• The same definition applies to sequences of r.v.'s, X^n.


• x (or x^n) – source symbol (string) ∈ X; D – set of coded symbols; C(x) – codeword corresponding to x; ℓ(x) – length of C(x).

• For example, X = {red, blue}, C(red) = 00, C(blue) = 11 is a source code with alphabet D = {0, 1}.

• Without loss of generality, we will assume that D = {0, 1, ..., D−1}.

Definition 5.2.2. The expected length L(C) of a code is given by

L(C) = Σ_{x∈X} P_X(x) ℓ(x) = E[ℓ(X)]

• Goal: for a given source, find a code that minimizes the expected length (per source symbol).

5.3 Fixed-length source codes

• Convert each source letter individually into a fixed-length block of ℓ D-ary symbols.

• If the number of letters in the source alphabet, K = |X|, satisfies K ≤ D^ℓ, then a different D-ary sequence of length ℓ may be assigned to each letter x ∈ X. The resulting code is uniquely decodable.

• For example, for X = {a, b, c, d, e, f, g}, K = 7, and D = {0, 1}, there exists an invertible mapping from X to binary 3-tuples:

a → 000, b → 001, ..., g → 110

Figure 5.1: A fixed-length source code with ℓ = 3 (e.g., a → 000, b → 001, ...).

• We can see that this coding method requires ℓ = ⌈log₂|X|⌉ bits to encode each source letter.

• If we want to encode blocks of n source symbols at a time, the resulting source alphabet is the n-fold Cartesian product X^n = X × X × ··· × X, which has size |X^n| = K^n.


• Using fixed-length source coding, we can encode each block of n source symbols into ℓ = ⌈log₂ K^n⌉ bits. The rate R = ℓ/n of coded bits per source symbol then satisfies

R = ⌈log₂ K^n⌉/n ≥ (n log₂ K)/n = log₂ K

R = ⌈log₂ K^n⌉/n < (n log₂ K + 1)/n = log₂ K + 1/n.

If n is sufficiently large, then R → log₂ K.

5.3.1 The fixed-length source coding theorem

Goal: minimize the average rate R = E[ℓ(X^n)]/n.

• Let M be the number of message sequences to be encoded (the codebook size). For fixed-length coding, R = (1/n) log₂ M.

• The encoder “compresses” x^n into an index w ∈ {1, 2, ..., 2^{nR}}. That is, the encoder sends nR bits for every source sequence x^n.

• If a D-ary fixed-length code of codeword length L is used, the number of D-ary codewords is M = D^L and R = (L/n) log₂ D.

Coding method: data compression by the AEP

• Since |T_ε| ≤ 2^{n(H+ε)}, we use n(H + ε) + 1 bits to index all sequences in T_ε.

• Use an extra bit to indicate whether the sequence lies in T_ε.

• We have shown that

E[ℓ(X^n)] = Σ_{x^n} P(x^n) ℓ(x^n) ≤ n[H(X) + ε′]  ⇒  R = (1/n) E[ℓ(X^n)] ≤ H(X) + ε′.

On the other hand (the converse), let X̂^n denote the sequence reconstructed from the nR-bit index. By Fano's inequality,

nR ≥ H(X̂^n)   (since X̂^n takes at most 2^{nR} different values)
   ≥ I(X^n; X̂^n)
   = H(X^n) − H(X^n|X̂^n)
   = nH(X) − H(X^n|X̂^n)
   ≥ n[H(X) − H₂(Pe)/n − Pe log₂|X|]   (5.1)

(Fano's inequality: H₂(Pe) + Pe log(|X|^n − 1) ≥ H(X^n|X̂^n), and log(|X|^n − 1) ≤ n log|X|.)

Since we require Pe → 0 as n → ∞, it follows that R ≥ H(X) (equivalently, L/n ≥ H(X)/log₂ D).

Theorem 5.3.1 (Fixed-to-fixed-length source coding theorem, the noiseless coding theorem). If R > H(X), then rate R is achievable; if R < H(X), then R is not achievable.

• Coding efficiency: η = H(X)/R ≤ 1.

• Example: see Example 3.2.3 in the Chinese textbook.


5.4 Variable-length source codes

• A variable-length source code maps each source letter x ∈ X to a codeword C(x) of length ℓ(x). For example, for X = {a, b, c} and D = {0, 1}:

C(a) = 0, C(b) = 10, C(c) = 11.

• The major property that we usually require of any variable-length code is unique decodability. This means that the input sequence of source letters can be reconstructed unambiguously from the encoded symbol sequence.

• Clearly, unique decodability requires that C(x) ≠ C(x′) for x ≠ x′.

• Definition: The extension of a code C is the code for finite strings over X given by the concatenation of the individual codewords: C(x1 x2 ... xn) = C(x1)C(x2)···C(xn). For example, if C(x1) = 00 and C(x2) = 11, then C(x1 x2) = 0011.

– A code is called non-singular if x_i ≠ x_j ⇒ C(x_i) ≠ C(x_j).

– A code is called uniquely decodable if its extension is non-singular.

For example, the code C(a) = 0, C(b) = 10, C(c) = 11 is prefix-free and uniquely decodable. However, the code C′ defined by

C′(a) = 0, C′(b) = 1, C′(c) = 01

is not uniquely decodable.

5.4.1 Prefix-free codes

Checking whether a code is uniquely decodable can be quite complicated. However, there is a good class of uniquely decodable codes, called prefix-free codes.

Definition 5.4.1. A code is said to be prefix-free if no codeword is a prefix of any othercodeword.

Advantages:

– It is easy to check whether a code is prefix-free and therefore uniquely decodable.

– A prefix-free code can be decoded with no delay (an instantaneous code).

• Any fixed-length code is prefix-free.

• Classes of codes:


Figure 5.2: Two uniquely decodable codes for X = {a, b, c, d}: the code a → 10, b → 00, c → 11, d → 110 is uniquely decodable but not prefix-free, while the code a → 0, b → 10, c → 110, d → 111 is prefix-free.

Figure 5.3: Classes of codes (prefix-free codes ⊂ uniquely decodable codes ⊂ all codes).

Figure 5.4: The binary code tree.

Page 45: Lecture Notes on Information Theory and Codingweb.xidian.edu.cn/bmbai/files/20171126_142523.pdf · 1.1 Relationship of information theory and communica-tion theory Information theory

44 CHAPTER 5. CODING FOR DISCRETE SOURCES

5.4.2 Code tree

• The digits in the codewords are represented as the labels on the branches of a rooted tree.

• For prefix-free codes, each codeword corresponds to a leaf node.

• A prefix-free code is called full if no new codeword can be added without destroying the prefix-free property.

Figure 5.5: A full prefix-free code tree.

• D-ary tree: A D-ary tree is a finite rooted tree such that D branches stem outwardfrom each (intermediate) node. D branches are labeled with the D different D-aryletters.

• Why does the prefix-free condition guarantee unique decodability?

5.5 Kraft Inequality

The Kraft inequality tells us whether it is possible to construct a prefix-free code for a given source alphabet X with a given set of codeword lengths {ℓ(x)}.

Theorem 5.5.1 (Kraft Inequality). There exists a D-ary prefix-free code with codewordlengths ℓ1, ℓ2, . . . , ℓk if and only if

k∑i=1

D−ℓi ≤ 1 (5.2)

Every full prefix-free code satisfies (5.1) with equality.

Proof. First assume that C is a prefix-free code with codeword lengths ℓ1, ℓ2, . . . , ℓk. Let ℓmax = max_i ℓi. Consider constructing a D-ary tree for the code C by pruning the full D-ary tree of depth ℓmax at all nodes corresponding to codewords:

a) A codeword at depth ℓi has D^{ℓmax−ℓi} descendants (leaves) at depth ℓmax. [Each of these descendant sets must be disjoint.]

b) Beginning with i = 1, we find the node xi corresponding to a codeword.

c) We prune the tree to make this node a leaf at depth ℓi.

d) By this process, we delete D^{ℓmax−ℓi} leaves from the tree. None of these leaves could have previously been deleted, because of the prefix-free condition.

e) But there are only D^{ℓmax} leaves that can be deleted. So we have

∑_i D^{ℓmax−ℓi} ≤ D^{ℓmax}

or

∑_i D^{−ℓi} ≤ 1

Next, conversely, suppose that we are given a set of codeword lengths ℓ1, ℓ2, . . . , ℓk for which (5.2) is satisfied.

• Without loss of generality, assume that we have ordered these lengths so that ℓ1 ≤ ℓ2 ≤ · · · ≤ ℓk.

• Consider the following algorithm:

a) Start with the full D-ary tree of depth ℓmax, and set i ← 1.

b) Choose xi as any surviving node at depth ℓi (not yet used as a codeword), and remove its descendants from the tree. Stop if there is no such surviving node.

c) If i = k, stop. Otherwise set i ← i + 1 and go to b).

We now show that we can indeed choose xi in step b) for all i ≤ k. Suppose that x1, x2, . . . , xi−1 have been chosen. The number of surviving leaves at depth ℓmax not stemming from any codeword is

D^{ℓmax} − ∑_{j=1}^{i−1} D^{ℓmax−ℓj} = D^{ℓmax} (1 − ∑_{j=1}^{i−1} D^{−ℓj}) > 0

by condition (5.2). Hence there must be (unused) surviving nodes at depth ℓi ≤ ℓmax. Since ℓ1 ≤ · · · ≤ ℓi−1 ≤ ℓi, no already chosen codeword can stem outward from such a surviving node, and hence this surviving node may be chosen as xi.

Example: Construct a binary prefix-free code with lengths ℓ1 = ℓ2 = ℓ3 = 2, ℓ4 = 3 and ℓ5 = 4. Since ∑_{i=1}^{5} 2^{−ℓi} = 15/16 < 1, such a prefix-free code exists.
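The construction in the converse proof can be carried out mechanically. The following is a minimal Python sketch (our own illustration, not code from the notes; the name kraft_prefix_code is an arbitrary choice) that checks the binary Kraft inequality and, when it holds, assigns codewords greedily in order of non-decreasing length, mirroring the tree argument above.

```python
def kraft_prefix_code(lengths):
    """Check the binary Kraft inequality and, if it holds, build a prefix-free
    code with the given codeword lengths (canonical construction)."""
    if sum(2.0 ** (-l) for l in lengths) > 1.0 + 1e-12:
        return None                      # no prefix-free code with these lengths exists
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    code, next_val, prev_len = [None] * len(lengths), 0, 0
    for i in order:
        l = lengths[i]
        next_val <<= (l - prev_len)      # descend to depth l in the code tree
        code[i] = format(next_val, "0{}b".format(l))
        next_val += 1                    # step to the next unused node at depth l
        prev_len = l
    return code

print(kraft_prefix_code([2, 2, 2, 3, 4]))   # ['00', '01', '10', '110', '1110']
```

Running it on the lengths of the example gives one valid prefix-free code; other assignments are of course possible.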

Figure 5.6: A binary prefix-free code tree realizing the lengths (2, 2, 2, 3, 4) (codewords u1, . . . , u5)


• Note: Just because a code has lengths that satisfy (5.2), it does not follow that the code is prefix-free, or even uniquely decodable.

• The same theorem holds for uniquely decodable codes. (This means that for any achievable set of codeword lengths there exist both a uniquely decodable code and a prefix-free code. So why use any code other than a prefix-free code?)

Theorem 5.5.2 (Extended Kraft Inequality). For any prefix-free code over an alphabet of size D (possibly with countably infinitely many codewords), the codeword lengths satisfy

∑_{i=1}^{∞} D^{−ℓi} ≤ 1

Conversely, for any given set of codeword lengths that satisfy the inequality, we can construct a prefix-free code with these lengths.

Proof. Consider a codeword y1y2 . . . yℓi, where yj ∈ {0, 1, . . . , D − 1}. Let 0.y1y2 . . . yℓi = ∑_{j=1}^{ℓi} yj D^{−j} ∈ [0, 1]. This codeword corresponds to the interval

[0.y1y2 . . . yℓi ,  0.y1y2 . . . yℓi + 1/D^{ℓi})

The prefix-free condition implies that these intervals are disjoint. Hence the sum of their lengths is ≤ 1.

5.6 Optimal codes

5.6.1 Problem formulation and Shannon codes

Let X = {a1, a2, . . . , ak} be the source alphabet, and pi = PX(X = ai) > 0. Suppose that we encode each source symbol into a prefix-free codeword. Denote by C(ai) the codeword for ai and by ℓi the length of C(ai).

• An optimal code is defined as a code with the smallest possible L(C) with respect to PX.

• We now consider the problem of finding the prefix-free code with minimum expected length: L = ∑_{i=1}^{k} pi ℓi = ∑_{x∈X} P(x) ℓ(x).

• Mathematically, this is a standard optimization problem:

Minimize L = ∑_i pi ℓi
subject to ∑_i D^{−ℓi} ≤ 1
and the ℓi = ℓ(ai) being integers.

• We first ignore the integer constraint on ℓi. With real variables, we may assume that ∑_i D^{−ℓi} = 1.

• Using a Lagrange multiplier λ, we want to minimize

J = ∑_i pi ℓi + λ (∑_i D^{−ℓi} − 1)


Setting ∂J/∂ℓi = 0, we obtain

∂J/∂ℓi = pi − λ D^{−ℓi} ln D = 0     (using (a^x)′ = a^x ln a)

equivalently, D^{−ℓi} = pi / (λ ln D).

Since ∑_i pi = 1 and ∑_i D^{−ℓi} = 1, we have λ = 1/ln D and hence

pi = D^{−ℓi}

This yields the optimal lengths ℓ*_i = − logD pi. The expected codeword length is

Lmin(non-integer) = − ∑_i pi logD pi = − ∑_i pi log2 pi / log2 D = H(X)/log2 D

Theorem 5.6.1 (Entropy bounds for prefix-free codes). Let Lmin be the minimum expected codeword length over all D-ary prefix-free codes. Then

H(X)/log2 D ≤ Lmin < H(X)/log2 D + 1

Proof. ① Let ℓ1, . . . , ℓk be the codeword lengths of an arbitrary prefix-free code. Then

H(X)/log D − L = (1/log D) ∑_i pi log(1/pi) − ∑_i pi ℓi
             = (1/log D) [ ∑_i pi log(1/pi) + ∑_i pi log D^{−ℓi} ]
             = (1/log D) ∑_i pi log( D^{−ℓi} / pi )
(by the IT inequality)  ≤ (1/log D) [ log e · ∑_i pi ( D^{−ℓi}/pi − 1 ) ]    (∗)
             = (log e / log D) ( ∑_i D^{−ℓi} − ∑_i pi )
(by the Kraft inequality)  ≤ 0        (5.3)

(∗) is satisfied with equality iff D^{−ℓi}/pi = 1, i.e., pi = D^{−ℓi}.

② We now show that there exists a prefix-free code with L(C) < H(X)/log D + 1.

Let us choose the codeword lengths to be ℓi = ⌈− logD pi⌉. Then

− logD pi ≤ ℓi < − logD pi + 1        (∗∗)

From the left-hand side, D^{−ℓi} ≤ pi, so ∑_i D^{−ℓi} ≤ ∑_i pi = 1 ⇒ the Kraft inequality is satisfied.

Thus, a prefix-free code exists with the above lengths. From the right-hand side of (∗∗),

L = ∑_i pi ℓi < ∑_i pi (− logD pi + 1) = H(X)/log2 D + 1

③ Therefore, H(X)/log2 D ≤ Lmin ≤ L < H(X)/log2 D + 1.

Summary [Shannon code]:

• Ideal codeword length: ℓi = − logD pi. This is optimal when − logD pi is an integer for every i.

• For a general distribution, set ℓi = ⌈− logD pi⌉.

• Bounds on the codeword length:

− logD pi ≤ ℓi < − logD pi + 1

Average codeword length:

HD(X) ≤ L < HD(X) + 1

• Example: PX(x) = {1/3, 1/3, 1/4, 1/12}. Then

H(X) = 1.8554 bits
ℓi = ⌈− log2 pi⌉ = (2, 2, 2, 4)
E[ℓ(X)] = 13/6 = 2.1667

Comparing to the obvious codeword length assignment (2, 2, 2, 2), we lose 0.1667 bit per source symbol.
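As a quick check of these numbers, the Shannon-code lengths and the resulting expected length can be computed directly; the short Python sketch below is our own illustration (the helper name shannon_code_lengths is chosen here) and reproduces the example above.

```python
from math import ceil, log2

def shannon_code_lengths(p):
    """Shannon code: assign length ceil(-log2 p_i) to symbol i."""
    return [ceil(-log2(pi)) for pi in p]

p = [1/3, 1/3, 1/4, 1/12]
lengths = shannon_code_lengths(p)               # [2, 2, 2, 4]
H = -sum(pi * log2(pi) for pi in p)             # entropy, about 1.8554 bits
L = sum(pi * li for pi, li in zip(p, lengths))  # expected length, 13/6
print(lengths, round(H, 4), round(L, 4))
```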

5.6.2 Improvement

Coding over multiple i.i.d. source symbols: view (X1, X2, . . . , Xn) as one super-symbol from X^n. Applying the bounds derived above,

H(X1 . . . Xn) ≤ E[ℓ(X^n_1)] < H(X1 . . . Xn) + 1

Since X1, . . . , Xn are i.i.d., H(X^n_1) = ∑_i H(Xi) = nH(X), and we have

H(X) ≤ (1/n) E[ℓ(X^n_1)] < H(X) + 1/n

Theorem 5.6.2 (Prefix-free source coding theorem). For any DMS with entropy H(X), there exists a D-ary prefix-free coding of source blocks of length n such that the expected codeword length per source symbol Ln satisfies

H(X)/log D ≤ Ln < H(X)/log D + 1/n

From this theorem, H(X) is the minimum expected codeword length per source symbol required to describe the source.

Usually, R = Ln log2 D [bits/source symbol] is called the rate of a prefix-free code.


5.6.3 Unknown distribution

If we assign the codeword lengths as

ℓ(x) = ⌈log(1/q(x))⌉

while the true distribution of X is PX(x) = p(x), then

H(p) + D(p||q) ≤ Ep[ℓ(X)] < H(p) + D(p||q) + 1

Proof.

Ep[ℓ(X)] = ∑_x p(x) ⌈log(1/q(x))⌉
        < ∑_x p(x) [ log(1/q(x)) + 1 ]
        = ∑_x p(x) log( (p(x)/q(x)) · (1/p(x)) ) + 1
        = ∑_x p(x) log(p(x)/q(x)) + ∑_x p(x) log(1/p(x)) + 1
        = D(p||q) + H(p) + 1

• Penalty of D(p||q) bits per source symbol due to the wrong distribution.

• Discussion

– For any n and any code over the i.i.d. sequence X^n_1, (1/n) E[ℓ(X^n_1)] ≥ H(X).

– We can achieve this as n → ∞ (AEP code, Shannon code):

lim_{n→∞} (1/n) E[ℓ(X^n_1)] = H(X)

– True or False: for finite n,

∗ The Shannon code is "optimal"?
∗ A code with codeword lengths ℓi = − log PX(ai), ∀i, is optimal?
∗ Any prefix-free code must satisfy ℓi ≥ − log pi, ∀i?
∗ The optimal code must satisfy ℓi ≤ ⌈− log pi⌉, ∀i?

5.7 Huffman codes

The optimal prefix-free code (in the sense of minimal L) for a given distribution can be constructed by a simple algorithm discovered by Huffman in 1950 (as a term paper in Fano's information theory class at MIT).

• Huffman's trick, in today's jargon, was to "think outside the box". He ignored the Kraft inequality and looked at the binary code tree to establish properties that an optimal prefix-free code should have.


Figure 5.7: A simple optimal code: P = (0.6, 0.3, 0.1) with codewords C(1) = 0, C(2) = 11, C(3) = 10 and lengths (1, 2, 2)

• Example: a simple optimal code (Figure 5.7).

Lemma 5.7.1. Optimal codes have the property that if ℓi > ℓj, then pi ≤ pj.

Proof. Let C be an optimal code. Consider a code C′ with the codewords i and j of C interchanged. Then

L(C′) − L(C) = ∑_k pk ℓ′_k − ∑_k pk ℓk
            = [pi ℓj + pj ℓi] − [pi ℓi + pj ℓj] = (pj − pi)(ℓi − ℓj)

Note that ℓi − ℓj > 0, and since C is optimal, L(C′) − L(C) ≥ 0. Hence we must have pj ≥ pi.

Lemma 5.7.2. Optimal prefix-free codes have the property that the two longest codewords have the same length.

Proof. Otherwise, one can delete the last bit of the longer one, preserving the prefix-free property and achieving a lower expected codeword length.

Lemma 5.7.3. The two longest codewords differ only in the last bit and correspond to the two least likely symbols.

Proof. By Lemma 5.7.1, the longest codewords must belong to the least probable source symbols. If there is a maximal-length codeword without a sibling, then we can delete the last bit of that codeword and still satisfy the prefix-free property. This reduces the average codeword length.

• The Huffman algorithm chooses an optimal code tree by starting at the leaves for the least likely symbols and working inward.

• Example 1:

Optimal (non-integer) lengths: − log pi = (1.32, 2.32, 2.74, 2.74, 3.32) ≈ the lengths (1, 3, 3, 3, 3) of the optimal code.


Figure 5.8: Huffman tree for the symbols 1–5 with probabilities (0.4, 0.2, 0.15, 0.15, 0.1). In step 1 the two least likely symbols (0.15 and 0.1) are merged into 0.25; the reduced set of probabilities is {0.4, 0.2, 0.15, 0.25}, and the merging continues (0.2 + 0.15 = 0.35, 0.25 + 0.35 = 0.6) until the root is reached.

H(X) = 2.15 bits/symbol
L(C) = 2.2 bits/symbol

• Example 2:

Symbol   pi     codeword
  1      0.35   11
  2      0.2    01
  3      0.2    00
  4      0.15   101
  5      0.1    100

• Let X be a r.v. over X with distribution p, and X′ be a r.v. with the reduced set of probabilities p′. Let L′ be the expected codeword length of any code for X′. Then

L = L′ + (pk + pk−1)    (∵ L = ∑_{i=1}^{k−2} pi ℓi + pk−1(ℓ′_{k−1} + 1) + pk(ℓ′_{k−1} + 1))

Lmin = L′min + pk + pk−1 ⇒ finding the optimal reduced code yields the optimal final code.

• Note that there are many optimal codes. The Huffman algorithm produces one such optimal code; a compact sketch of the procedure is given below.
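For concreteness, here is a small Python sketch of the binary Huffman procedure (merge the two least likely subtrees, repeat). It is our own illustration, not code from the notes, and the helper name huffman_code is arbitrary.

```python
import heapq

def huffman_code(probs):
    """Binary Huffman code: repeatedly merge the two least likely entries.
    probs: list of probabilities; returns a list of codeword strings."""
    # Heap entries: (probability, unique tie-breaker, symbol indices in this subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    code = [""] * len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)   # least likely subtree
        p2, _, s2 = heapq.heappop(heap)   # second least likely subtree
        for i in s1:
            code[i] = "0" + code[i]       # prepend branch labels bottom-up
        for i in s2:
            code[i] = "1" + code[i]
        heapq.heappush(heap, (p1 + p2, len(code) + len(heap), s1 + s2))
    return code

p = [0.4, 0.2, 0.15, 0.15, 0.1]
cw = huffman_code(p)
print(cw, sum(pi * len(c) for pi, c in zip(p, cw)))   # expected length 2.2 bits
```

For the probabilities of Example 1 this reproduces the lengths (1, 3, 3, 3, 3) and the expected length 2.2 bits; the particular 0/1 labelling may differ from Figure 5.8, which is fine since there are many optimal codes.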

• D-ary Huffman code: |X| = (D − 1)θ + D, K = |X|.

The number of messages merged in the first step = [(K − 2) mod (D − 1)] + 2 = R_{D−1}(K − 2) + 2.

5.8 Shannon-Fano Coding

• Order the K probabilities in decreasing order, and divide the source symbols into D sets of almost equal probability.

— For binary codes, choose m such that |∑_{i=1}^{m} pi − ∑_{i=m+1}^{K} pi| is minimized.


• Assign 0 for the first bit of the upper set and 1 for the lower set.

• Example: P (X) = 1/3, 1/9, 1/9, 1/9, 1/9, 1/9, 1/27, 1/27, 1/27

5.9 Shannon-Fano-Elias Coding

• Modified CDF:

F̄(x) = ∑_{a<x} P(a) + (1/2) P(x)

It is the midpoint of the step corresponding to x.

Figure 5.9: F(x) and F̄(x) for P(x) = {0.25, 0.5, 0.125, 0.125}

• Use the value of F̄(x) as a code for x: the first ℓ(x) bits of F̄(x) written in binary form the codeword. Then

F̄(x) − ⌊F̄(x)⌋_{ℓ(x)} < 1/2^{ℓ(x)} ≤ P(x)/2     (∵ ℓ(x) = ⌈log(1/P(x))⌉ + 1 ≥ log2(1/P(x)) + 1)
                                             = F̄(x) − F(x − 1)

That is, ⌊F̄(x)⌋_{ℓ(x)} lies within the step corresponding to x. Thus ℓ(x) bits suffice to describe x.

• Each codeword z1z2 . . . zℓ is obtained by truncating F̄(x) to its first ℓ(x) bits, so it does not represent a single point; it is considered to represent the interval [0.z1z2 . . . zℓ, 0.z1z2 . . . zℓ + 1/2^ℓ]. Because the intervals corresponding to different codewords are disjoint, the code is prefix-free.

• The expected length of this code is

L = ∑_x P(x) ℓ(x) = ∑_x P(x) [⌈log(1/P(x))⌉ + 1] < ∑_x P(x) [log(1/P(x)) + 2] = H(X) + 2

• Example:


x   P(x)    F̄(x)     F̄(x) in binary   ℓ(x) = ⌈log(1/P(x))⌉ + 1   C(x)
1   0.25    0.125    0.001            3                          001
2   0.5     0.5      0.10             2                          10
3   0.125   0.8125   0.1101           4                          1101
4   0.125   0.9375   0.1111           4                          1111

L(C) = 2.75 bits
H(X) = 1.75 bits

Code efficiency: η ≜ H(X)/R = H(X)/L ≤ 1
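The table above can be reproduced mechanically. Below is a small Python sketch of Shannon-Fano-Elias encoding (our own illustration; the function name sfe_code is arbitrary): it computes the modified CDF, truncates its binary expansion to ⌈log(1/P(x))⌉ + 1 bits, and returns the codewords.

```python
from math import ceil, log2

def sfe_code(p):
    """Shannon-Fano-Elias code for a pmf p (list of probabilities)."""
    codes, cum = [], 0.0
    for px in p:
        fbar = cum + px / 2                # modified CDF: midpoint of the step
        l = ceil(log2(1 / px)) + 1         # codeword length
        bits = ""
        for _ in range(l):                 # binary expansion of fbar, truncated to l bits
            fbar *= 2
            bit = int(fbar)
            bits += str(bit)
            fbar -= bit
        codes.append(bits)
        cum += px
    return codes

print(sfe_code([0.25, 0.5, 0.125, 0.125]))   # ['001', '10', '1101', '1111']
```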

5.10 Arithmetic Coding

• Calculate the pmf P(x^n) and the CDF F(x^n) for the sequence x^n.

• Using Elias coding, we can use a number in [F(x^n) − P(x^n), F(x^n)) as the code for x^n.

• The values F(ai) divide the interval [0, 1] into K subintervals; the length of each subinterval is P(ai).

• Example: arithmetic encoding of the binary sequence 0100, assuming PX(0) = P0 = 3/4 and PX(1) = P1 = 1/4.

Figure 5.10: Arithmetic coding of the sequence 0100: the interval [0, 1) is successively refined into A(0), A(01), A(010), A(0100); C marks the lower end of the current coding interval

– Start with the interval A = [0, 1].

– After the first input symbol "0", the interval A is split in proportion to the symbol probabilities into two parts: A(0) = A · P0 and A(1) = [P0, 1] (of length A · P1). Take A(0) as the next coding interval, i.e., the new A ← A(0).

– After the second input symbol "1", the interval A = A(0) is split into

A(00) = A · P0,  A(01) = A · P1

and A ← A(01) is taken as the next interval to split.

– Observe that the interval A keeps shrinking, while the lower end C of the coding interval either stays the same or moves up.

– Encoding procedure:

① Initialization: C = 0, A = 1.

② Input the high-probability symbol "0": C = 0, A = A·P0 = 3/4.

③ Input the low-probability symbol "1": C = C + A·P0 = 9/16, A = A·P1 = 3/16.

④ Input the high-probability symbol "0": C = 9/16, A = 3/16 × 3/4 = 9/64.

⑤ Input the high-probability symbol "0": C = 9/16, A = 9/64 × 3/4 = 27/256.

In binary, C = 9/16 = 0.1001 and A = 27/256 = 0.00011011, so the final codeword interval is

0.1001 ≤ C(x) < 0.10101011

Picking any point in this interval and truncating appropriately gives the codeword: 101.

• Codeword length in arithmetic coding: Cover's book uses ⌈log(1/P(x^n))⌉ + 1 bits; the Xidian textbook keeps l = ⌈log(1/P(x^n))⌉ bits after the binary point.

• Example: X = {a1, a2, a3, a4}, P(a1) = 1/2, P(a2) = 1/4, P(a3) = 1/8, P(a4) = 1/8, and x^n = a2a1a1a3a4a1a2a1.
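A minimal sketch of the interval-refinement step (our own illustration, not the notes' code; arith_interval is an arbitrary name) shows how C and A evolve for the binary example 0100 worked out above.

```python
from fractions import Fraction

def arith_interval(seq, p0):
    """Return (C, A): lower end and width of the arithmetic-coding interval
    for a binary sequence, with P(0) = p0 (exact rational arithmetic)."""
    C, A = Fraction(0), Fraction(1)
    for s in seq:
        if s == "0":
            A = A * p0              # keep the lower sub-interval
        else:
            C = C + A * p0          # skip over the "0" sub-interval
            A = A * (1 - p0)
    return C, A

C, A = arith_interval("0100", Fraction(3, 4))
print(C, A, C + A)                  # 9/16, 27/256, 171/256
```

The interval [9/16, 171/256) is exactly [0.1001, 0.10101011) in binary, matching the computation above.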

5.11 Lempel-Ziv Universal Source Coding

When the source distribution is known, we can construct an optimal code with the Huffman algorithm. For a source with unknown probability distribution, it is desirable to have a one-pass (or online) algorithm to compress the data that "learns" the probability distribution of the data and uses it to compress the incoming symbols.

• Universally optimal: the asymptotic compression rate approaches the entropy rate of the source for any stationary ergodic source.

• Lempel-Ziv (LZ) algorithms do not require prior knowledge of the source statistics.

• LZ77: sliding-window LZ algorithm; it uses string matching on a sliding window. (Its practical implementations came somewhat later; a typical one is in MS Windows.)

• LZ78: tree-structured LZ algorithm; it uses an adaptive dictionary. (A typical implementation is UNIX compress.)

The key idea of the LZ algorithms is to parse the string into phrases and to replace phrases by pointers to where the same string has occurred in the past.

5.11.1 The LZ78 algorithm

• Parse the string into phrases, where each phrase is the shortest phrase not seen earlier. This algorithm can be viewed as building a dictionary in the form of a tree, where the nodes correspond to the phrases seen so far. (Parsing rule: take the shortest possible string that has not appeared as a phrase before, so that all phrases are distinct.)

• Example: the string is a b b a b b a b b b a a b a b a a. We parse it as: a, b, ba, bb, ab, bba, aba, baa, which forms the phrase dictionary.

• Code each phrase by giving the location of its prefix (usually its phrase index) and the value of the last symbol. ∴ codeword = index of the prefix phrase + encoding of the last symbol. (A single-symbol phrase uses prefix index 0.)


• Example: X = {a1, a2, a3, a4}. The source sequence is x^n = a1 a2 a1 a3 a2 a4 a2 a4 a3 a1 a1 a4.

– It is parsed into the phrases: a1, a2, a1a3, a2a4, a2a4a3, a1a1, a4

– Coding table:

index   phrase     codeword (prefix index . last-symbol code)
1       a1         000 . 00
2       a2         000 . 01
3       a1a3       001 . 10
4       a2a4       010 . 11
5       a2a4a3     100 . 10     (100 is the index 4 of the phrase a2a4)
6       a1a1       001 . 00
7       a4         000 . 11
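The LZ78 parsing rule itself is easy to mechanize. The sketch below (our own illustration; lz78_parse is an arbitrary name) produces the (prefix index, last symbol) pairs for a string, matching the parsing used in the examples above.

```python
def lz78_parse(s):
    """LZ78: parse s into phrases, each the shortest string not seen before.
    Returns a list of (prefix_index, last_symbol) pairs; index 0 = empty prefix."""
    dictionary = {"": 0}          # phrase -> index
    pairs, phrase = [], ""
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch          # keep extending until the phrase is new
        else:
            pairs.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                    # a trailing phrase that already exists in the dictionary
        pairs.append((dictionary[phrase[:-1]], phrase[-1]))
    return pairs

print(lz78_parse("abbabbabbbaababaa"))
# [(0,'a'), (0,'b'), (2,'a'), (2,'b'), (1,'b'), (4,'a'), (5,'a'), (3,'a')]
```

The output corresponds to the phrases a, b, ba, bb, ab, bba, aba, baa of the earlier example.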

5.11.2 The LZ77 algorithm (Ref: Gallager '08)

• The algorithm keeps the w most recently encoded source symbols in memory. This is called a sliding window of size w.

• Suppose the source symbols x^P_1 have been encoded. The encoder looks for the longest match (of length n) between the not-yet-encoded n-string x^{P+n}_{P+1} and a stored string x^{P+n−u}_{P+1−u} starting in the window of length w.

• With LZ77, this string of n symbols is encoded simply by encoding the integers n and u.

– Initialization: encode the first w symbols without compression, using a fixed-length code of ⌈log |X|⌉ bits per symbol.
– Set P = w.
– Find the largest n ≥ 2 such that x^{P+n}_{P+1} = x^{P+n−u}_{P+1−u} for some u in the range 1 ≤ u ≤ w.
– Encode the integer n into a codeword from the unary-binary code.
– If n > 1, encode u ≤ w using a fixed-length code of length log w bits.
– Set P to P + n.

• If a match is not found in the window, the next character is sent uncompressed. To distinguish between these two cases, a flag bit is needed. Phrase types: (flag f, match position u counted backwards, match length n) or (f, c (uncompressed character)).

• Example:

window:    a c d b c d a c b a b a c d b c
to encode: a b a b d c a · · ·

The longest match for the upcoming symbols is the string a b a, which starts u = 7 positions back inside the window, so the phrase is encoded as the pair (n = 3, u = 7).


• Note that the target of a (u, n) pair could extend beyond the window, so that it overlaps with the new phrase.

• Example: w = 4, the string is abbabbabbbaababa. Phrase representation:


Chapter 6

Channel Capacity

6.1 Categories of channels (channel models)

• discrete channel: discrete-valued inputs and outputs

• continuous channel: continuous-valued inputs and outputs

• semi-continuous channel: discrete inputs and continuous outputs

• waveform channel

• single-user channel (point-to-point)

• multiuser channel: multiple-access channel; broadcast channel

• memoryless channel

• channel with memory

• time-invariant channel (fixed-parameter channel)

• time-varying channel (random-parameter channel)

• physical channel: satellite, optical fiber, cable

6.2 Discrete memoryless channels and capacity

A DMC is specified by 3 quantities:

1) the input alphabet X = 0, 1, . . . ,K − 1 or Ax = a1, a2, . . . , aK.

2) the output alphabet Y = 0, 1, . . . , J − 1 or Ay = b1, b2, . . . , bJ.

3) the conditional probability distribution PY |X(y|x) over Y for each x ∈ X :

P (yN |x1 . . . xN−1, xN , y1 . . . yN−1) = P (yN |xN ) (6.1)

(平稳信道,时不变:PYn|Xn不依赖于n)

That is, the probability distribution of the output depends only on the input at thattime and is conditionally independent of previous channel inputs and outputs.


Example 6.1 BSC

P(y|x) = 1 − p, if y = x;  p, if y ≠ x

Figure 6.1: BSC with crossover probability p

Example 6.2 BEC (binary erasure channel)

Figure 6.2: BEC with inputs {0, 1}, outputs {0, e, 1} and erasure probability a

Theorem 6.2.1. When a DMC is used without feedback,

PN(y^N | x^N) = P(y1 . . . yN | x1 . . . xN) = ∏_{n=1}^{N} P(yn | xn)    (6.2)

Proof.

P(x1 . . . xN, y1 . . . yN) = ∏_{n=1}^{N} P(xn | x1 . . . xn−1, y1 . . . yn−1) P(yn | x1 . . . xn, y1 . . . yn−1)
   (no feedback)                      (definition of the DMC)
 = ∏_{n=1}^{N} P(xn | x1 . . . xn−1) PY|X(yn | xn)
 = ∏_{n=1}^{N} P(xn | x1 . . . xn−1) · ∏_{i=1}^{N} PY|X(yi | xi)
 = P(x1 . . . xN) ∏_{i=1}^{N} PY|X(yi | xi)

Dividing by P(x1 . . . xN), we obtain (6.2).

Definition 6.2.1 (Information channel capacity). The capacity of a DMC is defined as

C ≜ max_{P(x)} I(X;Y) = max_{P(x)} [H(Y) − H(Y|X)]

where the maximum is taken over all possible input distributions P(x).


• Operational interpretation: the capacity is the maximum rate at which information can be transferred over the channel reliably. More formally, we have the following definition.

Definition 6.2.2 (Operational channel capacity). The capacity is the supremum of all achievable rates.

• Achievable: there exists a sequence of (⌈2^{nR}⌉, n) codes such that the maximal (or average) error probability Pe → 0 as n → ∞.

• ε-achievable: Pe ≤ ε.

Theorem 6.2.2 (Noisy channel coding theorem). All rates below C = max I(X;Y) are achievable.

• It establishes that operational channel capacity = information channel capacity for DMCs, i.e., C is indeed the capacity.

• Mathematical significance: it reduces the computation of the capacity to an optimization problem (at least in principle).

• Proof: see the channel coding theorem later in this chapter.

Our goal now is to find the P(x) that maximizes I(X;Y), and later to use this distribution to achieve the maximum communication rate.

Example 6.3 Consider a BSC.

I(X;Y) = H(Y) − H(Y|X)
       = H(Y) − ∑_x P(x) H(Y|X = x)
       = H(Y) − ∑_x P(x) H(p)
       = H(Y) − H(p)
       ≤ 1 − H(p)

∴ C = 1 − H(p) bits/symbol

The optimal input distribution is PX(0) = PX(1) = 1/2.
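As a sanity check of C = 1 − H(p), the binary entropy and the BSC capacity are one-liners; the snippet below is our own illustration (the names h2 and bsc_capacity are arbitrary).

```python
from math import log2

def h2(p):
    """Binary entropy function H(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_capacity(p):
    """Capacity of a BSC with crossover probability p, in bits per channel use."""
    return 1.0 - h2(p)

print(bsc_capacity(0.11))   # roughly 0.5 bits/channel use
```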

Example 6.4 BEC

• Method 1: Let E be an indicator variable that is 1 if there is an erasure and 0 otherwise. Then

C = max_{P(x)} [H(Y) − H(Y|X)]
  = max_{P(x)} [H(Y, E) − H(Y|X)]
  = max_{P(x)} [H(E) + H(Y|E) − ∑_x P(x) H(Y|X = x)]
  = max_{P(x)} [H(E) + H(Y|E) − H(α)]

∵ H(E) = H(α) and

H(Y|E) = P(E = 0) H(Y|E = 0) + P(E = 1) H(Y|E = 1)     (the last term is 0)
       = (1 − α) H(X)

∴ C = max_{P(x)} [(1 − α) H(X)] = 1 − α

where capacity is achieved by PX(0) = PX(1) = 1/2.

Figure 6.3: BSC capacity C = 1 − H(p) as a function of p (C = 1 at p = 0 or 1, and C = 0 at p = 0.5)

• Method 2: Let π = P(X = 0). Then

H(Y) = H(π(1 − α), α, (1 − π)(1 − α))
     = −π(1 − α) log π(1 − α) − α log α − (1 − π)(1 − α) log[(1 − π)(1 − α)]
     = (1 − α)[−π log π − (1 − π) log(1 − π)] − α log α − (1 − α) log(1 − α)
     = H(α) + (1 − α) H(X)

• Method 3:

I(X;Y) = H(X) − H(X|Y)
       = H(X) − P(Y = 0) H(X|Y = 0) − P(Y = 1) H(X|Y = 1) − P(Y = e) H(X|Y = e)
       = H(X) − α H(X|Y = e) = (1 − α) H(X)

Thus, C = 1 − α.

We can also consider the capacity for n uses of the channel:

C^{(n)} = (1/n) max_{P(x^n)} I(X^n; Y^n)

Theorem 6.2.3. Let P(x^n) be an arbitrary joint probability distribution on a sequence of n inputs to a DMC. Then

I(X^n; Y^n) ≤ ∑_{i=1}^{n} I(Xi; Yi)    (6.3)

I(X^n; Y^n) ≤ nC    (6.4)

Proof.

I(X^n; Y^n) = H(Y^n) − H(Y^n | X^n)
 = ∑_{i=1}^{n} H(Yi | Y1 . . . Yi−1) − ∑_{i=1}^{n} H(Yi | Y1 . . . Yi−1, X^n)
 (by the memoryless assumption) = ∑_{i=1}^{n} H(Yi | Y1 . . . Yi−1) − ∑_{i=1}^{n} H(Yi | Xi)
 ≤ ∑_{i=1}^{n} H(Yi) − ∑_{i=1}^{n} H(Yi | Xi)    (6.5)
 = ∑_{i=1}^{n} I(Xi; Yi)
 (from the definition of capacity) ≤ nC

An alternative way to see the memoryless step:

H(Y^n | X^n) = ∑_{x^n, y^n} P(x^n, y^n) log(1 / P(y^n | x^n))
            = E[ ∑_{i=1}^{n} log(1 / P(yi | xi)) ]
            = ∑_{i=1}^{n} E[ log(1 / P(yi | xi)) ] = ∑_{i=1}^{n} H(Yi | Xi)

Equality in (6.5) is achieved when the output letters are statistically independent. If the input letters are independent, i.e., P(x^n) = ∏_{i=1}^{n} P(xi), then

P(y^n) = ∑_{x^n} ∏_{i=1}^{n} P(xi) P(yi | xi)    (∵ P(y^n) = ∑_{x^n} P(x^n) P(y^n | x^n))
       = ∏_{i=1}^{n} ∑_{xi} P(xi) P(yi | xi)
       = ∏_{i=1}^{n} P(yi)   →  the yi are statistically independent

Note: It should not be concluded from this theorem that statistical dependence between the inputs must always be avoided.

Further discussion:

• Does "equality in (6.5)" (independent outputs, achieved with independent inputs) imply that no coding is needed to achieve capacity?
No. This is only the optimization result for the "information capacity", which does not involve Pe. Coding is always needed to achieve capacity.

• Reliable communication cannot be achieved at exactly the rate C with the channel outputs exactly independent.

• For a rate strictly less than C, no matter how close, the coding theorem guarantees that reliable communication is possible.

• Equality in (6.5) is achieved if the inputs are independent over time. Thus, optimizing P(x^n) to maximize I(X^n; Y^n) reduces to optimizing the single-symbol distribution P(xi) to maximize each I(Xi; Yi). That is,

C^{(n)} = (1/n) ∑_i max_{P(xi)} I(Xi; Yi) = max_{P(x1)} I(X1; Y1)

• For a memoryless channel, we can focus on maximizing the mutual information of one channel use. This does not mean we can communicate reliably with just one channel use.

• Picking codewords randomly according to an i.i.d. input distribution (random coding) is totally different from sending uncoded symbols.

6.3 Symmetric channels

Let us consider the channel with transition probability matrix

T = [P(y|x)] =
[ P(y1|x1)  P(y2|x1)  · · ·  P(yJ|x1) ]
[ P(y1|x2)  P(y2|x2)  · · ·  P(yJ|x2) ]
[    ⋮          ⋮       ⋱       ⋮     ]
[ P(y1|xK)  P(y2|xK)  · · ·  P(yJ|xK) ] ,   size = |X| × |Y|

For example,

T =
[ 0.3  0.2  0.5 ]
[ 0.5  0.3  0.2 ]
[ 0.2  0.5  0.3 ]

Each row of T corresponds to an input of the channel, and each column corresponds to an output of the channel.

In this channel, each row is a permutation of every other row (symmetry with respect to the inputs) and so is each column (symmetry with respect to the outputs). Such a channel is said to be symmetric.

• Definition: A DMC is symmetric iff all the rows of T are permutations of each other, and all the columns are permutations of each other.

• Denote a row of T by r = [r1, · · · , r|Y|]. Then

H(r) = − ∑_i ri log ri

• Optimal input distribution:

I(X;Y) = H(Y) − H(Y|X)
       = H(Y) − EX[H(Y|X = x)]
       = H(Y) − H(r)
       ≤ log |Y| − H(r)

The equality holds only if Y is uniformly distributed.


– Let X be uniformly distributed, i.e., P(x) = 1/|X| for all x ∈ X. Then

PY(y) = ∑_{x∈X} P(x) P(y|x) = (1/|X|) ∑_x P(y|x) = c/|X|,   c = sum of the entries in one column.

Therefore, the uniform input distribution is optimal. The capacity of the above channel is C = log 3 − H(0.3, 0.2, 0.5).

• Weakly symmetric channel: every row of T is a permutation of every other row, and all the column sums ∑_x P(y|x) are equal.

• Example:

T =
[ 1/3  1/6  1/2 ]
[ 1/3  1/2  1/6 ]

Figure 6.4: A weakly symmetric channel with inputs {0, 1} and outputs {b1, b2, b3}

• The capacity C = log |Y| −H (r).

• Example: Consider a K-ary symmetric channel.

P(y|x) = 1 − p, if y = x;  p/(K − 1), otherwise

The channel capacity is given by

C = log K + P(0|0) log P(0|0) + (|Y| − 1) · (p/(K − 1)) log(p/(K − 1))
  = log K + (1 − p) log(1 − p) + p log(p/(K − 1))
  = log K − H(p) − p log(K − 1)

When K = 2, C = 1 − H(p), i.e., the BSC.

Figure 6.5: K-ary symmetric channel (each symbol is received correctly with probability 1 − p and flipped to each of the other K − 1 symbols with probability p/(K − 1))

Theorem 6.3.1. Let Qk ≜ PX(X = k). The input probability vector PX = {Q0, Q1, . . . , QK−1} achieves the capacity C of a DMC P(j|k) if and only if

I(X = k; Y) = C, for all k with Qk > 0
I(X = k; Y) ≤ C, for all k with Qk = 0

where I(X = k; Y) = ∑_j P(j|k) log [ P(j|k) / ∑_i Qi P(j|i) ].

Remark: under these conditions,

I(X;Y) = ∑_k P(X = k) I(X = k; Y) = ∑_{k: Qk > 0} Qk · C = C ∑_k Qk = C

Proof. By the Kuhn-Tucker conditions of Theorem 2.6.1, the input distribution {Qk} maximizes I(X;Y) if and only if

∂I(X;Y)/∂Qk = λ,  Qk > 0    (6.6)
∂I(X;Y)/∂Qk ≤ λ,  Qk = 0

where λ is a constant.

Writing I(X;Y) = ∑_j ∑_m Qm P(j|m) log [ P(j|m) / ∑_i Qi P(j|i) ], we have

∂I(X;Y)/∂Qk = ∑_j [ P(j|k) log( P(j|k) / ∑_i Qi P(j|i) ) − log e · ∑_m Qm P(j|m) · P(j|k) / ∑_i Qi P(j|i) ]
            = ∑_j P(j|k) log( P(j|k) / ∑_i Qi P(j|i) ) − log e ∑_j P(j|k)
            = I(X = k; Y) − log e

∴ I(X = k; Y) = ∂I(X;Y)/∂Qk + log e

From (6.6):

I(X = k; Y) = λ + log e, for Qk > 0
I(X = k; Y) ≤ λ + log e, for Qk = 0

Setting C = λ + log e yields the formulas of the theorem.

6.3.1 Quasi-symmetric channels (准对称信道)

The channel output Y can be partitioned into n subsets such that, for each subset, the sub-matrix formed by the corresponding columns of the transition matrix P is symmetric. Example:

P =
[ 0.8  0.1  0.1 ]
[ 0.1  0.1  0.8 ]   →  symmetric with respect to the inputs

Y ⇒ {0, 1, 2}. Y can be partitioned into two subsets, {0, 2} and {1}, with column sub-matrices

[ 0.8  0.1 ]      [ 0.1 ]
[ 0.1  0.8 ]      [ 0.1 ]

[Channel diagram for this example: X = {0, 1}, Y = {0, 1, 2}, with transition probabilities 0.8, 0.1, 0.1.]

Theorem 6.3.2. The optimal input distribution for a quasi-symmetric DMC is the uniform distribution, and

C = I(X = k; Y)

Example: P =
[ 1 − P − Q   Q   P ]
[ P   Q   1 − P − Q ]

Figure 6.6: Binary symmetric erasure channel (inputs 0, 1; outputs 0, e, 1; erasure probability Q, crossover probability P)

C = I(X = 0; Y)
  = ∑_j P(j | X = 0) log [ P(j | X = 0) / ( (1/K) ∑_i P(j|i) ) ]
  = (1 − P − Q) log [ (1 − P − Q) / (½(1 − Q)) ] + Q log [ Q / (½ · 2Q) ] + P log [ P / (½(1 − Q)) ]

Homework: 4.1, 4.3, 4.4, 4.6, 4.8, 4.12, 4.13, 4.14

6.4 Computing the Channel Capacity

• In general, an analytical solution for the optimal input distribution that maximizes the mutual information is not available.

• We then need to compute the capacity numerically.

• The difficulty: a high-dimensional nonlinear optimization.

• An iterative algorithm for computing the channel capacity was suggested by Arimoto and Blahut in 1972.

6.4.1 Alternating optimization

• Consider the double supremum

max_{u1∈A1} max_{u2∈A2} f(u1, u2)

where A1 and A2 are convex subsets of R^{n1} and R^{n2}, respectively, and f is a function defined on A1 × A2. Suppose f is bounded from above and is continuous and has continuous partial derivatives on A1 × A2.

• Suppose now that for all u1 ∈ A1 there exists a unique c2(u1) ∈ A2 such that

f(u1, c2(u1)) = max_{u′2∈A2} f(u1, u′2)

and for any u2 ∈ A2 there exists a unique c1(u2) ∈ A1 such that

f(c1(u2), u2) = max_{u′1∈A1} f(u′1, u2).

• Now we can find the maximum with an alternating algorithm.

– Initialization: pick any u1^{(0)}.
– Find u2^{(0)} = c2(u1^{(0)}).
– For any k ≥ 1, u1^{(k)} and u2^{(k)} are computed as

u1^{(k)} = c1(u2^{(k−1)}) = argmax_{u′1} f(u′1, u2^{(k−1)})
u2^{(k)} = c2(u1^{(k)}) = argmax_{u′2} f(u1^{(k)}, u′2)

Since f^{(k)} = f(u1^{(k)}, u2^{(k)}) ≥ f(u1^{(k)}, u2^{(k−1)}) ≥ f(u1^{(k−1)}, u2^{(k−1)}) = f^{(k−1)}, i.e., the function value is non-decreasing in each step, the sequence must converge because f is bounded from above.

• Does it necessarily converge to the maximum?

6.4.2 The Blahut-Arimoto algorithm

Lemma 6.4.1. Fix an input distribution PX and a channel PY|X. Suppose PX(x) > 0 for every x ∈ X. Then

max_{Q_{X|Y}} ∑_x ∑_y P(x) P(y|x) log( Q(x|y) / P(x) ) = ∑_x ∑_y P(x) P(y|x) log( Q*(x|y) / P(x) )

where the maximization is taken over all Q such that

Q(x|y) = 0 iff P(y|x) = 0    (6.7)

and

Q*(x|y) = P(x) P(y|x) / ∑_{x′} P(x′) P(y|x′)    (6.8)

i.e., the maximizing Q is the true transition distribution from Y to X given the input P(x) and the channel P(y|x).

Proof. First check that Q* satisfies the constraint (6.7). Then

∑_x ∑_y P(x) P(y|x) log( Q*(x|y) / P(x) ) − ∑_x ∑_y P(x) P(y|x) log( Q(x|y) / P(x) )
 = ∑_x ∑_y P(x) P(y|x) log( Q*(x|y) / Q(x|y) )
 = ∑_x ∑_y P(y) Q*(x|y) log( Q*(x|y) / Q(x|y) )     (from (6.8))
 = ∑_y P(y) D( Q*(·|y) || Q(·|y) )
 ≥ 0

• How do we use this?

• How to use this?

Theorem 6.4.2. For a DMC,

C = max_{PX} I(X;Y) = sup_{P(x)>0} max_{Q(x|y)} ∑_{x,y} P(x) P(y|x) log( Q(x|y) / P(x) )

– To obtain the second equality above, we need P(x) > 0 strictly for every x, while the optimal input distribution might have P*(x) = 0 for some x.

– This is handled by the continuity of I(X;Y) with respect to the input distribution.

– Now we have a joint optimization over two variables, P(x) and Q(x|y). To apply the alternating algorithm, we need to check:

(1) The spaces of distributions are convex.
(2) The function is bounded:

f(P, Q) ≤ IP(X;Y) ≤ H(X) ≤ log |X|,

where

f(P, Q) = ∑_x ∑_y P(x) P(y|x) log( Q(x|y) / P(x) ),

and

IP(X;Y) = ∑_x ∑_y P(x) P(y|x) log( Q*(x|y) / P(x) ).

• Updating rules

(1) Arbitrarily choose a strictly positive distribution P(x) > 0; call it P^{(0)}. Then, given P^{(k)},

Q^{(k)}(x|y) = argmax_{Q(x|y)} ∑_{x,y} P^{(k)}(x) P(y|x) log( Q(x|y) / P^{(k)}(x) )
            = P^{(k)}(x) P(y|x) / ∑_{x′} P^{(k)}(x′) P(y|x′)

(2) Given Q^{(k)},

P^{(k+1)} = argsup_{P>0} ∑_{x,y} P(x) P(y|x) log( Q^{(k)}(x|y) / P(x) )

where the maximization is taken over all distributions on X. The constraints are

∑_x P(x) = 1,   P(x) > 0 for all x ∈ X.

• Use the method of Lagrange multipliers to find the best P(x), temporarily ignoring the positivity constraints. Let

J = ∑_{x,y} P(x) P(y|x) log( Q(x|y) / P(x) ) − λ ∑_x P(x)

Setting

0 = ∂J/∂P(x) = ∑_y P(y|x) log Q(x|y) − ∑_y P(y|x) [log P(x) + 1] − λ
  = ∑_y P(y|x) log Q(x|y) − log P(x) − 1 − λ

Therefore,

P(x) = e^{−(λ+1)} ∏_y Q(x|y)^{P(y|x)}

By imposing ∑_x P(x) = 1, we can eliminate λ and obtain

P(x) = ∏_y Q(x|y)^{P(y|x)} / ∑_{x′} ∏_y Q(x′|y)^{P(y|x′)}

The above product is over all y such that P(y|x) > 0, and Q(x|y) > 0 for all such y. This implies that both the numerator and the denominator above are positive, and therefore P(x) > 0. That is, the P(x) obtained happens to satisfy the positivity constraints. So, define

P^{(k)}(x) = ∏_y Q^{(k−1)}(x|y)^{P(y|x)} / ∑_{x′} ∏_y Q^{(k−1)}(x′|y)^{P(y|x′)}

• f^{(k)} = f(P^{(k)}, Q^{(k)}) → C.
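The update rules above translate directly into a few lines of code. The following Python sketch of the Blahut-Arimoto iteration is our own illustration (the name blahut_arimoto, the fixed iteration count, and the numerical guards are choices made here, not part of the notes); it folds the Q-update into the P-update and reports the capacity estimate in bits.

```python
import numpy as np

def blahut_arimoto(W, iters=200):
    """Blahut-Arimoto iteration for a DMC with transition matrix W[x, y] = P(y|x).
    Returns (capacity estimate in bits, input distribution)."""
    K = W.shape[0]
    p = np.full(K, 1.0 / K)                      # start from the uniform input P^(0)
    for _ in range(iters):
        joint = p[:, None] * W                   # P(x) P(y|x)
        col = joint.sum(axis=0, keepdims=True)
        q = joint / np.where(col > 0, col, 1.0)  # Q(x|y): backward channel for current p
        logr = (W * np.log(np.where(q > 0, q, 1.0))).sum(axis=1)
        p = np.exp(logr - logr.max())            # P^(k+1)(x) proportional to prod_y Q(x|y)^P(y|x)
        p /= p.sum()
    py = p @ W                                   # output distribution
    ratio = np.where(W > 0, W / np.where(py > 0, py, 1.0), 1.0)
    C = float(np.sum(p[:, None] * W * np.log2(ratio)))
    return C, p

W = np.array([[0.7, 0.3], [0.3, 0.7]])           # BSC with crossover 0.3
print(blahut_arimoto(W))                         # about 1 - H(0.3) = 0.1187 bits
```

For symmetric channels the iteration converges immediately to the uniform input; its real value is for channels where no closed form is available.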

• Convergence of the alternating optimization algorithm:

Lemma 6.4.3. If f is concave, then f(u1^{(k)}, u2^{(k)}) → f*, where f* is the desired maximum value.

– For a concave function, a local maximum is a global maximum.

• Convergence to the channel capacity: we only need to check that

f(P, Q) = ∑_{x,y} P(x) P(y|x) log( Q(x|y) / P(x) )

is concave in the pair (P, Q).

6.5 Capacity of cascaded channels and parallel channels

6.5.1 Parallel channels (product channel, 积信道)

• The inputs of the two component channels are independent, i.e., X1 and X2 are independent of each other.

• The outputs Y1 and Y2 of the two component channels are independent.

• The two component channels are used simultaneously.

X = X1 × X2 = {(ak, a′k)} = {(k, k′)}
Y = Y1 × Y2 = {(bj, b′j)} = {(j, j′)}
P(jj′ | kk′) = P(j|k) P(j′|k′)

Figure 6.7: Channel models: (a) channels connected to a common input; (b) independent parallel channels; (c) sum channel (union of channels)

Figure 6.8: Parallel (product) channel: X1 → Y1 through P(j|k), and X2 → Y2 through P(j′|k′)

Theorem 6.5.1. The channel capacity of a set of N independent channels used in parallel is given by

C = ∑_{n=1}^{N} Cn

Proof. Since P(y1 · · · yN | x1 · · · xN) = ∏_{n=1}^{N} P(yn | xn), we have

I(X; Y) = H(Y) − H(Y|X)
       = H(Y) − ∑_{n=1}^{N} H(Yn | Xn)
       ≤ ∑_{n=1}^{N} H(Yn) − ∑_{n=1}^{N} H(Yn | Xn)
       = ∑_{n=1}^{N} I(Xn; Yn)

Therefore,

C = max_{P(X)} I(X; Y) = max_{P(X)} ∑_{n=1}^{N} I(Xn; Yn) = ∑_{n=1}^{N} Cn

In general C ≤ ∑_{n=1}^{N} Cn, with equality when the inputs are independent.

6.5.2 Union of channels (sum channel, 和信道)

• At each time interval, we randomly select one of the N channels to use.

• The probability that the i-th channel is selected is Pi, with ∑_i Pi = 1.

• For this type of channel, the input alphabet is X = ∪_{i=1}^{N} Xi and the output alphabet is Y = ∪_{i=1}^{N} Yi.

• Let Xi = {1, 2, · · · , Ki}, Yi = {1, 2, · · · , Ji}, and

P(ji | ki) = P(Yi = ji | Xi = ki).

Then the channel transition matrix is block-diagonal:

P = diag(P1, P2, . . . , PN)

• Capacity of the sum channel

– Let Qi be the (prior) input distribution within channel i. For N = 2,

Q(x) = P1 Q1(x), if x ∈ X1;  P2 Q2(x), if x ∈ X2

– The mutual information of the sum channel is

I(X;Y) = ∑_{x,y} Q(x) P(y|x) log( P(y|x) / P(y) )
       = ∑_{x1,y1} Q(x1) P(y1|x1) log( P(y1|x1) / (P1 P(y1)) )    (for x ∈ X1)
       + ∑_{x2,y2} Q(x2) P(y2|x2) log( P(y2|x2) / (P2 P(y2)) )    (for x ∈ X2)

where P1 P(y1) = ∑_{x1} P1 Q1(x1) P(y1|x1), which equals P(y) for y ∈ Y1. From the above,

I(X;Y) = P1 I(X1;Y1) − P1 log P1 + P2 I(X2;Y2) − P2 log P2

C = max_{Pi, Qi} I(X;Y)
  = max_{P1} { max_{Q1} P1 I(X1;Y1) − P1 log P1 + max_{Q2} P2 I(X2;Y2) − P2 log P2 }
  = max_{P1} { P1 C1 + P2 C2 − P1 log P1 − P2 log P2 }    (6.9)

Let f(P1) = P1 C1 + P2 C2 − P1 log P1 − P2 log P2
         = P1 C1 + (1 − P1) C2 − P1 log P1 − (1 − P1) log(1 − P1)

Then

df(P1)/dP1 = C1 − C2 − log P1 + log(1 − P1)
           = C1 − log P1 − (C2 − log(1 − P1)) = 0

⟹ C1 − log P1 = C2 − log P2 ≜ λ

∴ P1 = 2^{C1−λ},  P2 = 2^{C2−λ}

P1 + P2 = 1 ⟹ 2^λ = 2^{C1} + 2^{C2},  λ = log2(2^{C1} + 2^{C2})

Substituting P1 and P2 into (6.9),

C = P1 C1 + P2 C2 − P1 log 2^{C1−λ} − P2 log 2^{C2−λ}
  = P1 C1 + P2 C2 − P1 (C1 − λ) − P2 (C2 − λ)
  = (P1 + P2) λ = λ

∴ C = log2(2^{C1} + 2^{C2})

In general: C = log( ∑_{i=1}^{N} 2^{Ci} )
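As a quick numerical illustration of C = log2(∑_i 2^{Ci}) (our own example, not from the notes): combining a noiseless binary channel (C1 = 1) with a useless channel (C2 = 0) gives C = log2(2 + 1) ≈ 1.585 bits, i.e., the channel-selection itself carries information.

```python
from math import log2

def sum_channel_capacity(capacities):
    """Capacity of the union (sum) of channels with the given capacities (bits)."""
    return log2(sum(2 ** c for c in capacities))

print(sum_channel_capacity([1.0, 0.0]))   # log2(3), about 1.585 bits
```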

6.5.3 Cascaded channels (级联信道)

Figure 6.9: Cascaded channel: X1 → Y1 = X2 through P1(j1|k1), then X2 → Y2 through P2(j2|k2)

P(j′|k) = ∑_j P(j, j′ | k) = ∑_j P(j|k) P(j′ | j, k) = ∑_j P(j|k) P(j′|j)

Thus P = P1 P2 · · · PN = ∏_{n=1}^{N} Pn.

I(X1; Y2) ≤ I(X1; Y1)
I(X1; Y2) ≤ I(X2; Y2)

6.6 Maximum Likelihood Decoding

• The message W is uniformly distributed in {1, 2, . . . , M}, M = 2^{nR}.

• The encoder transmits the codeword x^n(m) if the incoming message is W = m.

• The receiver receives y^n and finds the most likely transmitted codeword:

Ŵ = argmax_m P(y^n | x^n(m))

• Define Ym = {y^n : y^n decodes to m}. (The space of received vectors y^n is partitioned by the decoding rule into disjoint decision regions.)

• The probability of error conditioned on W = m being transmitted is

Pe,m = ∑_{y^n ∈ Ym^c} P(y^n | x^n(m))

• The overall probability of block decoding error: Pe = ∑_{m=1}^{M} P(W = m) Pe,m.

• Bit error probability: Pb = (1/K) ∑_{i=1}^{K} Pei, where Pei = P(Ûi ≠ Ui) is the probability that the i-th information bit is in error.

• Decoding criteria

(1) The minimum error probability decoding rule:

Pe(y^n) = P(m̂ ≠ m | y^n) = 1 − P(m̂ = m | y^n)

(2) The MAP decoding rule:

m̂ = argmax_m P(x^n(m) | y^n)

(3) The ML decoding rule: m̂ = argmax_m P(y^n | x^n(m))

(∵ P(x^n(m) | y^n) = P(y^n | x^n(m)) P(m) / P(y^n))

In the case of DMCs, this becomes

m̂ = argmax_m ∏_{i=1}^{n} P(yi | xm,i)

(4) Minimum distance decoding:

AWGN: P(y|x) ∝ e^{−(y−x)^2 / 2σ^2},  m̂ = argmin_m ||y^n − x^n(m)||^2

BSC: P(y^n | x^n) = p^{dH(y^n, x^n)} (1 − p)^{n − dH(y^n, x^n)}

For p < 1/2, by monotonicity,

P(y^n | x^n(m)) > P(y^n | x^n(m′)) iff dH(y^n, x^n(m)) < dH(y^n, x^n(m′))
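For the BSC, ML decoding therefore reduces to choosing the codeword at minimum Hamming distance from the received word. A small Python sketch (our own illustration; min_distance_decode is an arbitrary name, and the toy codebook below is hypothetical):

```python
def hamming(a, b):
    """Hamming distance between two equal-length binary strings."""
    return sum(x != y for x, y in zip(a, b))

def min_distance_decode(y, codebook):
    """ML decoding over a BSC with p < 1/2: pick the codeword closest to y."""
    return min(range(len(codebook)), key=lambda m: hamming(y, codebook[m]))

codebook = ["00000", "01011", "10101", "11110"]   # a toy (5, 2) code
print(min_distance_decode("10001", codebook))     # decodes to message index 2
```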

)6.7 Preview of channel coding theorem

• Reliable communication requires disjoint partitioning in the received signal space,corresponding to the different possible transmitted signals.

• For each (typical) input sequence Xn, there are approximately 2nH(Y |X) possible

Y n sequences.

• There are in total 2nH(Y )(typical) Y n sequences.Divide this set into sets of size 2nH(Y |X), each corresponding to one input Xn

sequences.

• The number of disjoint sets is no more than 2nH(Y )−nH(Y |X) = 2nI(X;Y ).

• Therefore we can send at most 2nI(X;Y ) different sequences that are distinguish-able.

• The number of codewords |C| = M < 2nI(X;Y ) ⇒

1

nlog2M︸ ︷︷ ︸

code rateR

< I(X;Y )

code rate R [bits/symbol], R < C.

6.8 Joint AEP

• X^n = (X1 . . . Xn) ∈ X^n, Y^n = (Y1 . . . Yn) ∈ Y^n.

P(x^n, y^n) = ∏_{i=1}^{n} P(xi, yi)

• Definition: The set of jointly typical sequences (x^n, y^n) is

T_ϵ^{(n)} = { (x^n, y^n) ∈ X^n × Y^n :
  | −(1/n) log P(x^n) − H(X) | < ϵ,   → x^n ∈ T_ϵ^{(n)}(X)
  | −(1/n) log P(y^n) − H(Y) | < ϵ,   → y^n ∈ T_ϵ^{(n)}(Y)
  | −(1/n) log P(x^n, y^n) − H(XY) | < ϵ }   → (x^n, y^n) ∈ T_ϵ^{(n)}(XY)

• From the last condition, we have

−ϵ < −(1/n) log P(x^n, y^n) − H(XY) < ϵ  ⟹  2^{−n[H(XY)+ϵ]} < P(x^n, y^n) < 2^{−n[H(XY)−ϵ]}

Theorem 6.8.1. Consider sequences (X^n, Y^n) drawn i.i.d. from P(x, y). Then

(1) P( (X^n, Y^n) ∈ T_ϵ^{(n)} ) → 1 as n → ∞;

(2) |T_ϵ^{(n)}| ≤ 2^{n[H(XY)+ϵ]}.

Proof. (1)

P( (x^n, y^n) ∈ T_ϵ^{(n)} ) = 1 − P( (x^n, y^n) ∉ T_ϵ^{(n)} )
 = 1 − P( {x^n ∉ T_ϵ^{(n)}(X)} ∪ {y^n ∉ T_ϵ^{(n)}(Y)} ∪ {(x^n, y^n) ∉ T_ϵ^{(n)}(XY)} )

Since P(∪ Ai) ≤ ∑_i P(Ai),

P( (x^n, y^n) ∈ T_ϵ^{(n)} ) ≥ 1 − [ P(x^n ∉ T_ϵ^{(n)}(X)) + P(y^n ∉ T_ϵ^{(n)}(Y)) + P( | −(1/n) log P(x^n, y^n) − H(XY) | ≥ ϵ ) ]

By the weak law of large numbers, given ϵ, there exists n1 such that for all n > n1,

P(x^n ∉ T_ϵ^{(n)}(X)) = Pr( | −(1/n) log P(x^n) − H(X) | ≥ ϵ ) ≤ ϵ/3

Similarly, there exist n2 and n3 such that for all n > n2,

P(y^n ∉ T_ϵ^{(n)}(Y)) = Pr( | −(1/n) log P(y^n) − H(Y) | ≥ ϵ ) ≤ ϵ/3

and for all n > n3,

P( | −(1/n) log P(x^n, y^n) − H(XY) | ≥ ϵ ) ≤ ϵ/3

Choosing n > max{n1, n2, n3}, we have

P( (x^n, y^n) ∈ T_ϵ^{(n)} ) ≥ 1 − (ϵ/3 + ϵ/3 + ϵ/3) = 1 − ϵ

(2)

1 = ∑_{(x^n,y^n) ∈ X^n×Y^n} P(x^n, y^n) ≥ ∑_{(x^n,y^n) ∈ T_ϵ^{(n)}} P(x^n, y^n)
  ≥ ∑_{(x^n,y^n) ∈ T_ϵ^{(n)}} 2^{−n[H(XY)+ϵ]} = |T_ϵ^{(n)}| · 2^{−n[H(XY)+ϵ]}

so |T_ϵ^{(n)}| ≤ 2^{n[H(XY)+ϵ]}.

Also, since 1 − ϵ ≤ ∑_{(x^n,y^n) ∈ T_ϵ^{(n)}} P(x^n, y^n) ≤ |T_ϵ^{(n)}| · 2^{−n[H(XY)−ϵ]},

∴ |T_ϵ^{(n)}| ≥ (1 − ϵ) 2^{n[H(XY)−ϵ]}

Question:

If I randomly pick typical sequences X^n ∈ T_ϵ^{(n)}(X) and Y^n ∈ T_ϵ^{(n)}(Y), are they jointly typical?

T_ϵ^{(n)}(X) × T_ϵ^{(n)}(Y) ≠ T_ϵ^{(n)}(XY), since 2^{nH(X)} · 2^{nH(Y)} ≥ 2^{nH(XY)}.

Theorem 6.8.2. If (X̃^n, Ỹ^n) ∼ P(x^n) P(y^n), then

P( (X̃^n, Ỹ^n) ∈ T_ϵ^{(n)}(XY) ) ≤ 2^{−n[I(X;Y)−3ϵ]}

and for n large enough,

P( (X̃^n, Ỹ^n) ∈ T_ϵ^{(n)}(XY) ) ≥ (1 − ϵ) 2^{−n[I(X;Y)+3ϵ]}

Proof.

P( (X̃^n, Ỹ^n) ∈ T_ϵ^{(n)}(XY) ) = ∑_{(x^n,y^n) ∈ T_ϵ^{(n)}(XY)} P(x^n) P(y^n)
  ≤ 2^{n[H(XY)+ϵ]} · 2^{−n[H(X)−ϵ]} · 2^{−n[H(Y)−ϵ]}
  = 2^{−n[I(X;Y)−3ϵ]}

P( (X̃^n, Ỹ^n) ∈ T_ϵ^{(n)}(XY) ) = ∑_{(x^n,y^n) ∈ T_ϵ^{(n)}(XY)} P(x^n) P(y^n)
  ≥ (1 − ϵ) 2^{n[H(XY)−ϵ]} · 2^{−n[H(X)+ϵ]} · 2^{−n[H(Y)+ϵ]}
  = (1 − ϵ) 2^{−n[H(X)+H(Y)−H(XY)+3ϵ]} = (1 − ϵ) 2^{−n[I(X;Y)+3ϵ]}

• Fix a typical y^n. Out of the 2^{nH(X)} typical X sequences, there are approximately 2^{nH(X|Y)} sequences that are jointly typical with y^n, since

2^{nH(XY)} / 2^{nH(Y)} = 2^{nH(X|Y)}

[Diagram: the 2^{nH(X)} typical x^n, the 2^{nH(Y)} typical y^n, and the 2^{nH(XY)} jointly typical pairs (x^n, y^n).]

• Jointly typical sequences

• The theorem above shows that, as n grows, the received vector is essentially always jointly typical with the transmitted codeword, and the probability of any other sequence → 0.

This suggests the decoding method: output the codeword that is jointly typical with the received vector (the conditionally typical set of x given y^n).

Example 6.8.1 (Binary source and BSC).

• Consider a binary sequence x^n passed through a BSC with flipping probability p.

• Fix an arbitrary output sequence y^n; which x^n are jointly typical with it?

• Write w^n = y^n − x^n; there should be approximately np ones in w^n.

• How many such x^n are there? ≈ 2^{nH(p)}.

• Check: this set is much smaller than |X|^n = 2^n.

6.9 The noisy-channel coding theorem

6.9.1 System model

Figure 6.10: System model: message W → encoder → X^n → DMC P(y|x) → Y^n → decoder → estimate Ŵ

• Denote the codebook C = {x^n(w) = (x1(w), x2(w), . . . , xn(w)), 1 ≤ w ≤ M}.

• A message W = w ∈ {1, 2, . . . , M} results in the transmitted codeword x^n(w) ∈ C.

• Let Y^n ∼ p(y^n | x^n) be the received sequence.

• On the basis of the received sequence Y^n, the decoder produces an integer ŵ. A decision error occurs if ŵ ≠ w.

• The probability of block decoding error is

Pe = P(ŵ ≠ w),   Pe(y) = P(ŵ ≠ w | y) = 1 − P(ŵ = w | y)

• Decoding methods: joint typicality decoding, MAP, ML.

• The code rate is R = (1/n) log2 |C| bits/channel use (bits per transmission).

• A rate R is said to be achievable if there exists a sequence of (⌈2^{nR}⌉, n) codes such that the maximal Pe → 0 as n → ∞ (i.e., for which the probability of error goes to 0 as n → ∞).

6.9.2 Channel coding theorem

Theorem 6.9.1. All rates below capacity C are achievable. Specifically, for every rate R < C there exists a sequence of (2^{nR}, n) codes with maximum Pe → 0 as n → ∞.

Proof. (1) Construction of codebooks: random coding.

• Generate M = 2^{nR} codewords, each of length n, i.i.d. from an input distribution P(x). Concretely, we pick letters from X completely independently at random according to P(x), from the first symbol of x^n(1) through the last symbol of x^n(M), obtaining all 2^{nR} codewords, which together form a code(book):

C = [ x^n(1); . . . ; x^n(M) ] = [ x1(1) x2(1) · · · xn(1); . . . ; x1(M) x2(M) · · · xn(M) ]

Each row is one codeword; repeated codewords are possible, but only with small probability. Each entry of this matrix is generated i.i.d. according to P(x). The probability that we generate a particular code is

P(C) = ∏_{w=1}^{M} ∏_{i=1}^{n} P(xi(w))

• The total number of possible codebooks is (|X|^n)^M = |X|^{n·2^{nR}}. (When |X| = 2, n = 16, M = 2^8, this number is about 10^{1233}.)

• Shannon did not compute the performance of one particular good code; he computed the average performance over this ensemble of codes.

• All messages are equiprobable: P(W = w) = 2^{−nR}, for w = 1, . . . , 2^{nR}.

(2) Encoding: given the message W = w, transmit the codeword x^n(w) through n uses of the channel.

(3) Decoding: typical set decoding.

• For a given y^n, if there exists a unique x^n(ŵ) ∈ C such that (x^n(ŵ), y^n) is jointly typical, i.e., (x^n(ŵ), y^n) ∈ T_ϵ^{(n)}, then the decoder outputs Ŵ = ŵ.

Note: Y^n ∼ P(y^n | x^n(w)) = ∏_{i=1}^{n} P(yi | xi(w)).

• If no such ŵ exists, or if there is more than one, then an error is declared.

(4) Calculate the probability of error averaged over all codewords in the codebook and averaged over all codebooks (i.e., over the ensemble of codes). If the average error probability is small, then there exists at least one code with small probability of error.

P̄e = ∑_C P(C) Pe(C)
   = ∑_C P(C) ∑_w P(W = w) P(error | C, W = w)
   = ∑_C P(C) · (1/2^{nR}) ∑_{w=1}^{2^{nR}} Pe|w(C)
   = (1/2^{nR}) ∑_{w=1}^{2^{nR}} ∑_C P(C) Pe|w(C)      (averaging over the codes; Pe|w(C) = P(error | C, W = w))

• Random coding creates symmetry, so the average probability of error averaged over all codes does not depend on the particular index that was sent.

• So we can assume the message W = 1 was sent. Thus

P̄e = ∑_C P(C) Pe|1(C) = P̄e|1 = P(error | w = 1)

(5) Using the joint AEP.

• Define the events

Ei = { (x^n(i), y^n) ∈ T_ϵ^{(n)} },  i ∈ {1, 2, . . . , 2^{nR}}
Ei^c = { (x^n(i), y^n) ∉ T_ϵ^{(n)} }

P(error | w = 1) = P(E1^c ∪ E2 ∪ · · · ∪ EM) ≤ P(E1^c) + ∑_{i=2}^{M} P(Ei)

• In computing the probability of Ei, which distribution of (X^n, Y^n) should we use?

– For E1, (X^n, Y^n) are drawn from the joint distribution P(x^n, y^n).
– For Ei, i ≥ 2, (X^n, Y^n) ∼ P(x^n) P(y^n). (X^n(1) and X^n(i) are independent, so Y^n and X^n(i) are also independent, for i ≠ 1.)

• For large enough n, P(E1^c) ≤ ϵ. For i ≥ 2, P(Ei) ≤ 2^{−n[I(X;Y)−3ϵ]}. Therefore,

P̄e = P(error | w = 1) ≤ ϵ + ∑_{i=2}^{M} 2^{−n[I(X;Y)−3ϵ]}
   = ϵ + (2^{nR} − 1) · 2^{−n[I(X;Y)−3ϵ]}
   ≤ ϵ + 2^{nR} · 2^{−n[I(X;Y)−3ϵ]}
   = ϵ + 2^{−n[I(X;Y)−R−3ϵ]}

• If R < I(X;Y) − 3ϵ, we can choose n large enough such that 2^{−n[I(X;Y)−R−3ϵ]} ≤ ϵ. Then we have P̄e ≤ 2ϵ ≜ δ.

(6) Choose P(x) to be P*(x), the distribution on X that achieves capacity. Then the condition R < I(X;Y) becomes R < C. As long as R < C = max I(X;Y), the rate is achievable.

(7) The average performance of the random code is "good", so there exists at least one "good" code.

Remaining questions:

1. Long codewords are good, but how long? What is the performance for finite codeword length?

2. Random codes seem good, but how do we decode?

• Is there any better structured code that can do as well?

• Is the randomness in coding a computational trick or a fundamental requirement to achieve capacity?

6.10 Converse of coding theorem

• Recall Fano's inequality: for any r.v.'s X and Y, we try to guess X by X̂ = g(Y). The error probability Pe = P(X̂ ≠ X) satisfies

H(Pe) + Pe log(|X| − 1) ≥ H(X|Y)

Proof. Apply the chain rule to H(X, E | Y), where the error r.v. is

E = 1 if X̂ ≠ X;  0 if X̂ = X

H(E, X | Y) = H(X|Y) + H(E | X, Y) = H(X|Y)
H(E, X | Y) = H(E|Y) + H(X | E, Y)
           ≤ H(E) + H(X | E, Y)
           = H(Pe) + P(E = 1) H(X | Y, E = 1) + P(E = 0) H(X | Y, E = 0)
           = H(Pe) + Pe H(X | Y, E = 1)      (given an error, X takes one of the |X| − 1 values other than X̂)
           ≤ H(Pe) + Pe log(|X| − 1)
           ≤ 1 + Pe log |X|

Physical interpretation: after receiving Y and making a decision, the remaining uncertainty about X can be split into two parts:

– the uncertainty about whether the decision is correct, which is H(Pe);
– if the decision is wrong (with probability Pe), X can take any of the other |X| − 1 values, and the information needed to determine which one is ≤ log(|X| − 1).

• Converse of the coding theorem. Now consider trying to guess the message W based on the observation Y^n. Let Pe^{(n)} = P(Ŵ ≠ W) and define

E = 1 if Ŵ ≠ W;  0 if Ŵ = W

Using a similar derivation and noting that log2 |W| = nR, we have

H(W | Y^n) ≤ 1 + Pe^{(n)} nR

nR = H(W) = H(W | Y^n) + I(W; Y^n)
   ≤ H(W | Y^n) + I(X^n(W); Y^n)
   ≤ 1 + Pe^{(n)} nR + I(X^n; Y^n)
   ≤ 1 + Pe^{(n)} nR + nC

Rearranging, Pe^{(n)} ≥ 1 − C/R − 1/(nR).

If R > C, then for n large enough Pe^{(n)} is bounded away from 0.

• Strong (converse to the) coding theorem — the Gallager bound. Encoding: random coding. Decoding: ML.

Pe^{(n)} ≤ 2^{−N E(R)},   E(R) ≜ max_{0≤ρ≤1} max_{Q(x)} [E0(ρ, Q) − ρR]

[Sketch: the error exponent E(R) is positive for R < C and decreases to 0 at R = C; R0 marks the cutoff rate.]

6.11 Joint source-channel coding

6.11.1 Compare source and channel coding

• Notation


– An i.i.d. source sequence V^n.
– Source coder V^n → {0, 1}*, maps the source sequence into a binary sequence U*.
– Channel coder {0, 1}* → X^m, maps the binary sequence U* into the transmitted sequence X^m.
– The receiver receives Y^m.
– The channel decoder recovers Û*.
– The source decoder guesses V̂^n.

Figure 6.11: System block diagram: source → source encoder → channel encoder → channel p(y|x) → channel decoder → source decoder → sink

• Source coding: use fewer bits to represent every possible source sequence — sphere covering in V^n.

• Channel coding: use m transmitted symbols to represent as many bits as possible while keeping Pe small — sphere packing in X^m.

• Source coding removes redundancy: |V|^n → 2^{nH(V)}. Example: a Bern(p) sequence of n bits → nH(p) bits.

• Channel coding adds redundancy: 2^{mI(X;Y)} → 2^{mH(X)}. Example: BSC: m(1 − H(p)) bits → m bits.

6.11.2 Joint source-channel coding (the source-channel separation theorem)

• Overall goal: transmit an i.i.d. source V^n (that satisfies the AEP) over a DMC P(y|x) reliably:

V^n → X^n → Y^n → V̂^n,   Pe^{(n)} = P(V̂^n ≠ V^n) → 0

• Source-channel coding theorem: this can be done whenever H(V) < C.

Proof. We will encode only V^n ∈ T_ϵ^{(n)}; all other sequences will result in an error. Achievability:

Pe^{(n)} ≤ P(V^n ∉ T_ϵ^{(n)})   [V^n is not correctly encoded]
        + P(V̂^n ≠ V^n | V^n ∈ T_ϵ^{(n)})   [U^{nR} is not correctly recovered]
        ≤ ϵ + ϵ = 2ϵ

(as n → ∞ and with H(V) < R < C, e.g. R = H(V) + ϵ < C)

• The two-stage method is as good as a one-stage method.

• So source and channel coding can be done separately.

• Converse of the source-channel coding theorem. Joint source-channel encoder and decoder:

X^n(V^n): V^n → X^n (take m = n)
V̂^n(Y^n): Y^n → V^n

Question: is it possible to reliably transmit a source with H(V) > C?

(We will show that Pe^{(n)} → 0 implies H(V) ≤ C.)

(We will show that P(n)e → 0 implies H(V ) ≤ C)

Proof. By Fano’s inequality, H(V n|V n) ≤ 1 + P(n)e n log |V|︸ ︷︷ ︸

log |V|n

Now

H(V ) =1

nH(V n)

=1

nH(V n|V n) +

1

nI(V n; V n)

≤ 1

n

(1 + P (n)

e n log |V|)+

1

nI(V n; V n) (∵ Markov chain)

≤ 1

n+ P (n)

e log |V|+ C (memoryless)

So P(n)e → 0 =⇒ H(V ) ≤ C. (n→∞)

• Source-channel coding theorem: For an i.i.d source V n and the DMC p(y|x), aslong as H(V ) < C, there exists a source channel code with P

(n)e → 0.

Conversely, if H(V ) > C, the probability of error is bounded away from 0. Itis not possible to send the source over the channel with arbitrarily low prob. oferror.

• Applies for general cases:

– source with memory

– channel with memory

• It is not always true :

(1) Channel is correlated with the source(2) Multiple access with correlated sources(3) Broadcasting channels (non-degraded)

separation theorem does not hold

6.12 Channel with feedback

• Does feedback help us to transmit at a higher data rate?

• Perfect feedback: assume all previously received signals (Y1, . . . , Yi−1) are available at the transmitter at time i. — Why do we ignore possible errors in the feedback link?


6.12.1 Feedback capacity

• Feedback code: a (2^{nR}, n) feedback code is a sequence of mappings

Xi = Xi(W, Y^{i−1}),   i = 1, . . . , n

together with a decoder that recovers W as Ŵ = g(Y^n).

• DMC with feedback:

Figure 6.12: DMC with feedback: Xi(W, Y^{i−1}) → channel p(y|x) → Yi → decoder → Ŵ

• A rate is achievable if there exists a sequence of (2^{nR}, n) codes such that the probability of error Pe^{(n)} = P(Ŵ ≠ W) → 0.

• The capacity with feedback, CFB, of a DMC is the supremum of all rates achievable with feedback codes. Obviously, CFB ≥ C (∵ a non-feedback code is a special case of a feedback code).

Theorem 6.12.1. For a DMC,

CFB = C = max_{P(x)} I(X;Y)

— feedback does not improve the capacity at all.

Proof. Repeating the converse of the coding theorem without feedback:

nR = H(W) = H(W | Y^n) + I(W; Y^n)
   ≤ H(W | Y^n) + I(X^n; Y^n)
   ≤ 1 + Pe^{(n)} nR + I(X^n; Y^n)
   ≤ 1 + Pe^{(n)} nR + nC

What goes wrong? The step

I(X^n; Y^n) = H(Y^n) − H(Y^n | X^n)
           = H(Y^n) − ∑_{i=1}^{n} H(Yi | Y1^{i−1}, X^n)
           ≤ H(Y^n) − ∑_{i=1}^{n} H(Yi | Xi)

does not hold with feedback, since Yi then depends on future X's.

The correct argument:

I(W; Y^n) = H(Y^n) − H(Y^n | W)
         = H(Y^n) − ∑_{i=1}^{n} H(Yi | Y1^{i−1}, W)
         = H(Y^n) − ∑_{i=1}^{n} H(Yi | Y1^{i−1}, W, Xi)    (since Xi is a function of Y1^{i−1} and W)
         = H(Y^n) − ∑_{i=1}^{n} H(Yi | Xi)
         ≤ ∑_{i=1}^{n} H(Yi) − ∑_{i=1}^{n} H(Yi | Xi)
         ≤ nC

Example 6.12.1 (BEC with feedback (ARQ)).

• Capacity of the BEC: C = 1 − α.

• Consider a BEC with perfect feedback. The feedback reveals whether a transmission was successful.

• Design a feedback code as follows:

– transmit a new bit if the last transmission was successful (ACK received);
– retransmit if it failed (NAK).

• The average (total) number of channel uses per bit is

1 + α + α^2 + · · · = 1/(1 − α)

• The average number of bits transmitted per channel use is

1 / (1/(1 − α)) = 1 − α

• The capacity is not improved at all, but the code is greatly simplified.

• Discussion: is this the only reason that feedback is used?

– A DMC does not benefit from feedback.
– General channels with memory, wireless channels, and network protocols can benefit from feedback.


6.13 Summary

(1) Source coding: the source encoder maps V^n to an index W, represented by U^k ∈ {0, 1}^k with |W| = 2^k, at rate R_s; reliable representation requires H ≤ R_s.

Figure 6.13: Source coding

(2) Channel coding: the channel encoder conveys the message W over the channel P(y|x) by mapping W to X^n, which is decoded from Y^n, at rate R_c; reliable transmission requires R_c < C.

Figure 6.14: Channel coding

(3) Separation of source and channel coding: to convey the information source V^n, the source encoder (rate R_s) produces W, and the channel encoder (rate R_c) transmits W over P(y|x) as X^n, received as Y^n.

Figure 6.15: Separation of source and channel coding

This is possible as long as the number of source codewords, 2^{nR_s}, does not exceed the number of channel codewords, 2^{nR_c}. The source V^n can be conveyed reliably if nH ≤ nR_s ≤ nR_c ≤ nC, i.e., H ≤ R_s ≤ R_c ≤ C.

(4) Joint source-channel coding: a single encoder maps V^n directly to X^n for transmission over P(y|x), received as Y^n; this is possible whenever H < C.

Figure 6.16: Joint source-channel coding

Therefore, the source and channel can be encoded separately without loss of optimality.


Chapter 7

The Gaussian Channel

• Discrete-time memoryless channel:

p(y|x) = ∏_{i=1}^n p(y_i|x_i)

where the transition law p(·|·) is now a probability density function (pdf). Example: x_i ∈ X ⊂ R, y_i ∈ Y = R.

• For a memoryless channel:

I(X^n; Y^n) ≤ Σ_{i=1}^n I(X_i; Y_i) ≤ nC

where

C ≜ max_{P(x)} I(X; Y) = max_{P(x)} [h(Y) − h(Y|X)]

7.1 Additive White Gaussian Noise (AWGN) Channel

• Consider the channel

Y = X + Z

with power constraint E[X²] ≤ P and Z ~ N(0, σ²), as shown in Figure 7.1.

Figure 7.1: The Gaussian channel Y = X + Z

Definition 7.1.1. The capacity of the Gaussian channel with power constraint P is defined as

C = max_{P(x): E[X²]≤P} I(X; Y)


• Calculation:

I(X; Y) = h(Y) − h(Y|X)
        = h(Y) − h(X + Z | X)
        = h(Y) − h(Z)                         (since Z is independent of X)
        = h(Y) − (1/2) log 2πeσ²

Since

E[Y²] = E[(X + Z)²] = E[X²] + 2E[X]E[Z] + E[Z²] = σ_x² + σ² ≤ P + σ²

and recalling that the normal distribution maximizes the differential entropy for a given variance, we have

h(Y) ≤ (1/2) log₂ 2πe(P + σ²),   with equality iff Y ~ N(0, P + σ²).

Thus,

I(X; Y) ≤ (1/2) log₂ 2πe(P + σ²) − (1/2) log₂ 2πeσ² = (1/2) log₂ (1 + P/σ²)

Hence,

C = max_{E[X²]≤P} I(X; Y) = (1/2) log₂ (1 + P/σ²) = (1/2) log₂ (1 + SNR)   bits/symbol (bits/channel use)

and the maximum is attained when X ~ N(0, P): to maximize h(Y) we need Y ~ N(0, P + σ²), and since Z ~ N(0, σ²) and Y = X + Z, this requires X ~ N(0, P).
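A small numeric sketch of this formula (the SNR values below are assumed examples):

```python
import math

def awgn_capacity(snr_linear: float) -> float:
    """Capacity of the discrete-time AWGN channel, in bits per channel use."""
    return 0.5 * math.log2(1.0 + snr_linear)

# Illustrative SNRs in dB (assumed example values).
for snr_db in (0, 10, 20, 30):
    snr = 10 ** (snr_db / 10)
    print(f"SNR = {snr_db:2d} dB  ->  C = {awgn_capacity(snr):.3f} bits/channel use")
```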

7.2 Capacity as an Estimation Problem

Consider

I(X; Y) = h(X) − h(X|Y) = h(X) − h(X − g(Y) | Y)

for any function g(·) (subtracting a function of Y does not change the conditional differential entropy).

• Choose X to be N(0, σ_x²) distributed.


• Choose g(·) to be the linear minimum mean-square error (L-MMSE) estimate of X. In the Gaussian case:

g(Y) = [σ_x² / (σ_x² + σ_z²)] Y

Var[X − g(Y)] = σ_x² σ_z² / (σ_x² + σ_z²)

and X − g(Y) is independent of Y (orthogonality principle).

Figure 7.2: Geometric (projection) view of the L-MMSE estimate g(Y)

Now

I(X; Y) = (1/2) log (2πe σ_x²) − (1/2) log ( 2πe · σ_z² σ_x² / (σ_z² + σ_x²) ) = (1/2) log (1 + σ_x²/σ_z²)

• Discussions:

– Denote X̂ = g(Y). We call X̂ a sufficient statistic if both

X → Y → X̂   and   X → X̂ → Y

are Markov chains.

– In Gaussian estimation problems, the L-MMSE estimate X̂ satisfies this, so

I(X; Y) = I(X; X̂)

– L-MMSE in the vector case: for y = Hx + n,

x̂ = μ_x + K_xy K_y^{-1} (y − μ_y)
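A quick Monte-Carlo sketch of the scalar case (the sample size and variances are assumed example values): it forms the L-MMSE estimate of X from Y = X + Z and checks the error variance σ_x²σ_z²/(σ_x² + σ_z²) and the orthogonality principle.

```python
import random
import statistics

random.seed(0)
sigma_x2, sigma_z2 = 4.0, 1.0                      # assumed example variances
n = 200_000

x = [random.gauss(0, sigma_x2 ** 0.5) for _ in range(n)]
z = [random.gauss(0, sigma_z2 ** 0.5) for _ in range(n)]
y = [xi + zi for xi, zi in zip(x, z)]

a = sigma_x2 / (sigma_x2 + sigma_z2)               # L-MMSE coefficient: g(Y) = a * Y
err = [xi - a * yi for xi, yi in zip(x, y)]

print(statistics.pvariance(err))                   # about sigma_x2*sigma_z2/(sigma_x2+sigma_z2) = 0.8
print(sum(e * yi for e, yi in zip(err, y)) / n)    # about 0: the error is uncorrelated with Y
```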

7.3 Non-Gaussian noise channel

• For a general noise distribution Z with E[Z] = 0 and Var[Z] = σ², consider the channel Y = X + Z. Then

I(X; Y) = h(Y) − h(Z)

h(Y) ≤ (1/2) log 2πe(P + σ²)

h(Z) ≤ (1/2) log 2πeσ²      (cf. entropy power)


Therefore,

I(X; Y) = h(Y) − h(Z) ≤ (1/2) log 2πe(P + σ²) − h(Z)

Moreover, since h(Z) ≤ (1/2) log 2πeσ² for any noise of variance σ² (with equality iff Z is Gaussian), one can show that

C ≥ (1/2) log (1 + P/σ²)

with equality only for Gaussian noise: for a given noise variance, AWGN is the worst noise. (A small numeric comparison appears at the end of this section.)

• A mutual information game:

– The transmitter tries to maximize I(X; Y) by choosing P(x), subject to the power constraint E[X²] = σ_x² ≤ P.

– The channel (jammer) tries to minimize I(X; Y) by choosing the noise distribution P(z), subject to E[Z²] = σ_z².

The game has a saddle point:

– the optimal input is Gaussian;

– the worst noise is also Gaussian.
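A tiny numeric illustration of h(Z) ≤ (1/2) log 2πeσ² (a sketch comparing Gaussian and uniform noise of equal variance; the variance is an assumed example value):

```python
import math

def h_gaussian(var: float) -> float:
    """Differential entropy (bits) of a Gaussian with the given variance."""
    return 0.5 * math.log2(2 * math.pi * math.e * var)

def h_uniform(var: float) -> float:
    """Differential entropy (bits) of a uniform r.v. with the given variance.

    Uniform on [-a, a] has variance a^2/3, hence width 2a = 2*sqrt(3*var).
    """
    return math.log2(2 * math.sqrt(3 * var))

var = 1.0
print(h_gaussian(var))   # about 2.047 bits
print(h_uniform(var))    # about 1.792 bits -- strictly smaller, as the bound predicts
```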

7.4 Coding theorem for Gaussian channels

• More realistically, consider the channel

Y_i = X_i + Z_i

where Z_i is i.i.d. N(0, σ_z²), and the input satisfies the power constraint

(1/n) Σ_{i=1}^n x_i² ≤ σ_x²

Theorem 7.4.1. C = (1/2) log (1 + σ_x²/σ_z²) is the maximum achievable rate.

Proof. (1) Generate a random codebook with 2^{nR} codewords, each of length n, with i.i.d. N(0, σ_x² − ε) entries.

(2) Encoding: w → X^n(w).

(3) Decoding: jointly typical decoding.

(4) Compute the error probability. Without loss of generality, assume that the first codeword X^n(1) is transmitted.

– If the chosen codeword violates the power constraint, declare an error:

E_0 = { (1/n) Σ_{i=1}^n X_i²(1) ≥ σ_x² }


– Define E_i = { (X^n(i), Y^n) is jointly typical }. Then

P(E_1) → 1,    P(E_i) ≈ 2^{−nI(X;Y)} for i ≠ 1

P_e^(n) = P(e | w = 1) = P(E_0 ∪ E_1^c ∪ E_2 ∪ · · · ∪ E_{2^{nR}})
        ≤ P(E_0) + P(E_1^c) + Σ_{i=2}^{2^{nR}} P(E_i)
        ≤ ε + ε + Σ_{i=2}^{2^{nR}} 2^{−n[I(X;Y)−ε]}
        = 2ε + (2^{nR} − 1) 2^{−n[I(X;Y)−ε]}
        ≤ 2ε + 2^{nR} · 2^{−n[I(X;Y)−ε]}
        = 2ε + 2^{−n[I(X;Y)−R−ε]}
        ≤ 3ε

for n sufficiently large and R < I(X;Y) − ε.

• Converse to the coding theorem (if P_e^(n) → 0, then R ≤ C):

nR = H(W) = I(W; Y^n) + H(W | Y^n)
           ≤ I(W; Y^n) + 1 + nR P_e^(n)
           ≤ I(X^n; Y^n) + 1 + nR P_e^(n)
           ≤ Σ_{i=1}^n I(X_i; Y_i) + 1 + nR P_e^(n)

To drive P_e^(n) → 0, we need (asymptotically)

R ≤ (1/n) Σ_{i=1}^n I(X_i; Y_i)

– Key point: individual (per-symbol) power constraint vs. average power constraint. Let P_i = (1/2^{nR}) Σ_w x_i²(w) be the average power used in position i, so that (1/n) Σ_i P_i ≤ P = σ_x². Then

R ≤ (1/n) Σ_{i=1}^n I(X_i; Y_i)
  ≤ (1/n) Σ_i (1/2) log (1 + P_i/σ_z²)
  ≤ (1/2) log (1 + (1/n) Σ_i P_i / σ_z²)     (by Jensen's inequality, since f(x) = (1/2) log(1 + x) is concave in x)
  ≤ (1/2) log (1 + σ_x²/σ_z²) = C


• Sphere packing in the Gaussian channel. Sphere packing is a useful geometric approach to channel capacity and to the optimal input distribution.

– For a transmitted codeword x^n, the set of y^n that are jointly typical with x^n is (essentially) a sphere centered at x^n with radius √(nσ_z²).

– The volume of an n-dimensional sphere of radius √r is

Vol(S^(n)(√r)) = π^{n/2} (√r)^n / Γ(n/2 + 1) = A_n r^{n/2},   where A_n = π^{n/2} / Γ(n/2 + 1).

– Packing disjoint noise spheres into the output sphere of radius √(nσ_y²), the maximum data rate is

R = (1/n) log [ A_n (nσ_y²)^{n/2} / ( A_n (nσ_z²)^{n/2} ) ] = (1/2) log (σ_y²/σ_z²)

For this we need Y^n to be i.i.d. Gaussian (with σ_y² = P + σ_z², this recovers (1/2) log(1 + P/σ_z²)).

– For x^n and y^n chosen independently according to their marginal distributions, the probability that they are jointly typical is

A_n (nσ_Z²)^{n/2} / ( A_n (nσ_Y²)^{n/2} ) ≈ 2^{−nI(X;Y)}

– For a codebook x^n(w), w = 1, ..., 2^{nR}, and an independently chosen y^n, the probability that there exists some w such that (x^n(w), y^n) are jointly typical is at most

2^{nR} · A_n (nσ_Z²)^{n/2} / ( A_n (nσ_Y²)^{n/2} ) ≈ 2^{−n[I(X;Y)−R]} → 0   for R < I(X;Y).

• A greedy way to spend power (rate splitting): consider the AWGN channel with power constraint

(1/n) Σ_i x_i² ≤ P

The transmitter:

– splits the power as P = P_1 + P_2 and the data stream as R = R_1 + R_2, assigning them to two virtual users;

– gives each user its own random codebook, generating codewords X_1^n and X_2^n; the superposition X^n = X_1^n + X_2^n is transmitted.

The receiver observes Y = X_1 + X_2 + Z.

– First treat X_2 + Z as noise and decode X_1; this supports

nR_1 ≤ I(X_1^n; Y^n) = (n/2) log (1 + P_1/(P_2 + σ_z²))

– Upon decoding X_1, subtract X_1^n from Y^n and then decode X_2; this supports

nR_2 ≤ I(X_2^n; Y^n | X_1^n) = (n/2) log (1 + P_2/σ_z²)

Page 94: Lecture Notes on Information Theory and Codingweb.xidian.edu.cn/bmbai/files/20171126_142523.pdf · 1.1 Relationship of information theory and communica-tion theory Information theory

7.5. PARALLEL GAUSSIAN CHANNELS 93

– The overall data rate is R_1 + R_2, and the channel capacity is achieved (a numeric check appears after the discussion below).

• Discussions:

– To generalize, we can use many virtual users; this is useful in multiple-access channels.

– The construction depends heavily on the fact that the sum of independent Gaussian random variables is Gaussian.
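A minimal numeric check of the rate-splitting argument (the power split and noise variance are assumed example values): the two successive-decoding rates add up exactly to the single-user AWGN capacity.

```python
import math

def c_awgn(snr: float) -> float:
    """AWGN capacity in bits per channel use."""
    return 0.5 * math.log2(1 + snr)

# Assumed example values for the power split and noise variance.
P, sigma_z2 = 10.0, 1.0
P1, P2 = 6.0, 4.0                       # any split with P1 + P2 = P works

R1 = c_awgn(P1 / (P2 + sigma_z2))       # decode user 1, treating user 2 as noise
R2 = c_awgn(P2 / sigma_z2)              # decode user 2 after subtracting user 1
print(R1 + R2, c_awgn(P / sigma_z2))    # the two values coincide (about 1.730 bits/use)
```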

7.5 Parallel Gaussian Channels

• Assume that we have a set of K Gaussian channels in parallel,

Y_i = X_i + Z_i,   i = 1, 2, ..., K

where the Z_i ~ N(0, σ_i²) are independent of each other, as shown in Figure 7.3.

Figure 7.3: K independent Gaussian noise channels in parallel

• The power constraint:

Σ_{i=1}^K P_i = E[ Σ_{i=1}^K X_i² ] ≤ P

• Consider how to distribute the power among the channels so as to maximize the total capacity:

I(X_1 ··· X_K; Y_1 ··· Y_K) ≤ Σ_{i=1}^K I(X_i; Y_i) ≤ Σ_i (1/2) log (1 + P_i/σ_i²)

Equality is achieved by independent X_i ~ N(0, P_i), so

C = max_{P(x): Σ_i E[X_i²] ≤ P} I(X_1 ··· X_K; Y_1 ··· Y_K) = (1/2) Σ_{i=1}^K log₂ (1 + P_i/σ_i²)

• The problem is now reduced to finding the optimal power allocation:

maximize (1/2) Σ_i log₂ (1 + P_i/σ_i²)   subject to   Σ_i P_i ≤ P


• Using a Lagrange multiplier λ, the objective function is

J(P_1, ..., P_K) = (1/2) Σ_i log (1 + P_i/σ_i²) + λ (Σ_i P_i)

∂J/∂P_i = (1/2) · 1/(P_i + σ_i²) + λ = 0   ⇒   P_i + σ_i² = −1/(2λ) ≜ γ   (a constant)

So P_i = γ − σ_i². Since the P_i must be non-negative, we choose

P_i = (γ − σ_i²)^+ = { γ − σ_i²,  if γ ≥ σ_i²;   0,  otherwise }

where the constant γ is chosen so that

Σ_i (γ − σ_i²)^+ = P

Therefore, the capacity of this channel is given by

C = Σ_{i=1}^K (1/2) log [ 1 + (γ − σ_i²)^+ / σ_i² ]

Figure 7.4: Water-pouring for parallel channels (the power P_i fills the channels with the lowest noise levels σ_i² up to the common water level γ)
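A minimal water-filling sketch (bisection on the water level γ; the noise levels and total power below are assumed example values):

```python
import math

def water_filling(noise_vars, total_power, iters=100):
    """Return the allocation P_i = (gamma - sigma_i^2)^+ that spends total_power."""
    lo, hi = 0.0, max(noise_vars) + total_power     # the water level gamma lies in this range
    for _ in range(iters):                          # bisection on gamma
        gamma = 0.5 * (lo + hi)
        used = sum(max(gamma - v, 0.0) for v in noise_vars)
        if used > total_power:
            hi = gamma
        else:
            lo = gamma
    return [max(gamma - v, 0.0) for v in noise_vars]

noise = [1.0, 2.0, 5.0]     # assumed example noise levels sigma_i^2
P = 4.0
powers = water_filling(noise, P)
C = sum(0.5 * math.log2(1 + p / v) for p, v in zip(powers, noise))
print(powers)               # here the noisiest channel sits above the water level and gets no power
print(C)                    # total capacity in bits per parallel-channel use
```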

7.6 Waveform channel (bandlimited AWGN channel)

• Channel model: y(t) = x(t) + z(t), where z(t) is a white Gaussian random process; after the receiver front end, any set of (Nyquist-rate) noise samples is i.i.d. Gaussian with zero mean and variance N_0/2.

• Front end of the receiver: provides sufficient statistics for bandlimited transmitted signals.

– Assume that x(t) has non-zero frequency components only in [−W, W]. Then x(t) and y(t) can be sampled at interval T_s = 1/(2W).

– The sampler can be thought of as a bank of filters matched to sinc(t − n/(2W)), where sinc(t) = sin(2πWt)/(2πWt).

• Discrete-time channel model:

– 2W samples per second;

– the noise power in each sample is σ² = N_0/2 (= N_0 W / (2W));

– the average signal power per sample is P/(2W).


Figure 7.5: Sampling the waveform channel y(t) = x(t) + z(t) yields the discrete-time channels y_i = x_i + z_i

We obtain the discrete-time model Y_i = X_i + Z_i.

• The overall capacity in bits/second is

C = 2W · (1/2) log₂ (1 + (P/2W)/(N_0/2)) = W log₂ (1 + P/(N_0 W))

(P in watts, N_0 in watts/Hz)

• As W → ∞,

C_∞ = lim_{W→∞} W log₂ (1 + P/(N_0 W))
    = lim_{W→∞} (P/N_0) · (N_0 W/P) log₂ (1 + P/(N_0 W))
    = (P/N_0) log₂ e   [bits/s]

using (1/x) ln(1 + x) → 1 as x → 0 with x = P/(N_0 W). In this regime the capacity grows linearly with the power: the power-efficient (power-limited) regime.

• The energy required to send one bit is E_b = P/C (watts × 1 s = joules). The minimum E_b/N_0 required to send one bit reliably is therefore

E_b/N_0 = 1/log₂ e = ln 2 = −1.59 dB

• More generally, in order to transmit at a rate of R bits/s with an arbitrarily small error probability, we must have R < C, i.e.,

R < W log₂ (1 + P/(N_0 W))

Define the spectral efficiency (bandwidth efficiency) as η = R/W (bits/s/Hz). Then

η < log₂ (1 + P/(N_0 W)) = log₂ (1 + SNR)

This result is called the Shannon limit. An equivalent statement is SNR > 2^η − 1.

• Using the relation E_b = P T_b = P/R, we obtain

E_b/N_0 = P/(η W N_0) = SNR/η


Here, E_b denotes the average (received) energy per information bit. The Shannon limit in terms of E_b/N_0 may be expressed as

E_b/N_0 > (2^η − 1)/η

As η → 0, this approaches the ultimate Shannon limit E_b/N_0 > ln 2 (−1.59 dB). (A small numeric sketch of this trade-off appears at the end of this section.)

• Note that when the SNR is high, the capacity increases only logarithmically with P: C ≈ W log₂(P/(N_0 W)). Increasing W then yields a much more rapid increase in capacity; the system is in the bandwidth-limited regime.

Figure: Capacity vs. Eb/N0.

• Example: wireline telephone channel with W = 3 kHz and SNR = 30 dB: C = W log₂(1 + SNR) ≈ 30 kbps (voiceband modem).

• Discussions:

– Over an interval of length T, the waveform channel is simply 2WT parallel Gaussian channels.

– Because the noise is white and Gaussian, we may analyze the signal space in any chosen orthonormal basis.
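A small numeric sketch of the Shannon limit E_b/N_0 > (2^η − 1)/η (the spectral-efficiency values below are assumed examples):

```python
import math

def ebn0_limit_db(eta: float) -> float:
    """Minimum Eb/N0 in dB for reliable transmission at spectral efficiency eta (bits/s/Hz)."""
    return 10 * math.log10((2 ** eta - 1) / eta)

# Illustrative spectral efficiencies (assumed example values).
for eta in (0.01, 0.5, 1.0, 2.0, 4.0, 8.0):
    print(f"eta = {eta:5.2f} bits/s/Hz  ->  Eb/N0 > {ebn0_limit_db(eta):6.2f} dB")

# As eta -> 0 the limit tends to ln 2, i.e. about -1.59 dB.
print(10 * math.log10(math.log(2)))
```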

7.7 Gaussian channel with colored noise