Entropy, Inference, and Channel Coding
Sean Meyn
Department of Electrical and Computer Engineering, University of Illinois, and the Coordinated Science Laboratory
NSF support: ECS 02-17836, ITR 00-85929 and CCF 00-49089
Overview
Hypothesis testing and channel coding
Structure of optimal codes
Error exponents
Algorithms
[Figure: error exponent E_r(R) vs. rate R for the optimal code and QAM]
References
Large deviations
Dembo and Zeitouni, Large Deviations Techniques and Applications, 1998
Kontoyiannis, Lastras-Montano and Meyn, Relative Entropy and Exponential Deviation Bounds for General Markov Chains, ISIT, 2005
Pandit and Meyn, Extremal Distributions and Worst-Case Large-Deviation Bounds, 2004
Hypothesis testing
D&Z 1998
Zeitouni and Gutman. On universal hypothesis testing via large deviations, IT-37, 1991
Pandit, Meyn and Veeravalli, Asymptotic Robust Neyman-Pearson Testing Based on Moment Classes, ISIT, 2004.
Channel coding
Csiszár and Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, Academic Press, New York, 1997
MacKay, Information Theory, Inference, and Learning Algorithms, CUP, 2003 http://www.inference.phy.cam.ac.uk/mackay/itila/
Blahut, Hypothesis testing and information theory, IT-20, 1974
Outline (today)
Introduction
Relative entropy & Large deviations
Hypothesis testing
Channel capacity
Conclusions
Memoryless Channel Model
Memoryless channel with input sequence X, output sequence Y
Channel kernel
If X is i.i.d. with marginal distribution µ
Then, Y is i.i.d. with marginal distribution π
P(dy | x) = P{Y_t ∈ dy | X_t = x}
π(·) = ∫ P(· | x) µ(dx)
Random codebook
Channel kernel
N-dimensional code words X^i, i = 1, 2, . . . , e^{NR}
N-dimensional output Y received: i.i.d., with marginal distribution π
P(dy | x) = P{Y_t ∈ dy | X_t = x}
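As an illustration (not from the slides), the sketch below simulates a discrete memoryless channel with an i.i.d. input and checks that the output is i.i.d. with marginal π(·) = ∫ P(· | x) µ(dx); the binary alphabets and the particular kernel values are arbitrary choices.

```python
# Illustration (not from the slides): a discrete memoryless channel with i.i.d. input.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.6, 0.4])                    # input marginal mu on {0, 1}
P = np.array([[0.9, 0.1],                    # channel kernel P(y | x): row x, column y
              [0.2, 0.8]])

N = 50_000
X = rng.choice(2, size=N, p=mu)              # i.i.d. input sequence
Y = np.array([rng.choice(2, p=P[x]) for x in X])   # each symbol passed through the kernel

pi = mu @ P                                  # predicted output marginal pi = sum_x P(.|x) mu(x)
print("pi =", pi)
print("empirical output marginal =", np.bincount(Y, minlength=2) / N)
```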
[Figure: BPSK, QPSK, 16-QAM, and 64-QAM constellation mappings, from IEEE Std 802.11a-1999]
Questions & Objectives
1. What is the structure of optimal µ ?
2. Construct algorithms based on this structure
3. Worst-case modeling to simplify code construction
4. Decoding algorithms and evaluation
Methodology & Viewpoint
Hypothesis testing
Large deviations
Convex & linear optimization theory
Example: Rayleigh Channel Y = AX + N
σ²_A = 1, σ²_N = 1, and σ²_P = 26.4 (SNR = 14.2 dB)
A and N are i.i.d. and mutually independent
16-point QAM (standard constellation), I = 0.2 nats/symbol
Three-point constellation: roughly a three-fold improvement over 16-point QAM (7.71 vs. 2.57)
[Figure: error exponent E_r(R) vs. rate R, 0 ≤ R ≤ 0.6, comparing the two constellations]
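A rough sketch of how such comparisons can be reproduced: a Monte Carlo estimate of the mutual information of the Rayleigh channel for a discrete constellation. The complex-Gaussian model for A and N, the 16-QAM input, and the sample size are assumptions made here for illustration; the slide's exact constellations and numbers are not reproduced.

```python
# Rough sketch (not from the slides): Monte Carlo estimate of I(X;Y) for Y = A X + N.
# Assumption for illustration: A and N are zero-mean circular complex Gaussian, so given
# X = x the output Y is complex Gaussian with variance sigma2_A |x|^2 + sigma2_N
# (the noncoherent Rayleigh-fading model).
import numpy as np

rng = np.random.default_rng(0)
sigma2_A, sigma2_N, sigma2_P = 1.0, 1.0, 26.4

# 16-QAM constellation scaled to average power sigma2_P (an illustrative input distribution)
pts = np.array([a + 1j * b for a in (-3, -1, 1, 3) for b in (-3, -1, 1, 3)])
pts = pts * np.sqrt(sigma2_P / np.mean(np.abs(pts) ** 2))
mu = np.full(len(pts), 1.0 / len(pts))

def cond_density(y, x):
    v = sigma2_A * np.abs(x) ** 2 + sigma2_N        # conditional variance of Y given X = x
    return np.exp(-np.abs(y) ** 2 / v) / (np.pi * v)

M = 200_000
X = pts[rng.choice(len(pts), size=M, p=mu)]
A = (rng.standard_normal(M) + 1j * rng.standard_normal(M)) * np.sqrt(sigma2_A / 2)
Nz = (rng.standard_normal(M) + 1j * rng.standard_normal(M)) * np.sqrt(sigma2_N / 2)
Y = A * X + Nz

pY = sum(mu[k] * cond_density(Y, pts[k]) for k in range(len(pts)))    # output density
I_hat = np.mean(np.log(cond_density(Y, X) / pY))                      # I = E[log p(Y|X)/p(Y)]
print("I(X;Y) estimate:", I_hat, "nats/symbol")
```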
Outline
Introduction
Relative entropy & Large deviations
Hypothesis testing
Channel capacity
Conclusions
Large Deviations
X = {X_1, X_2, . . . } a nice Markov chain on X, with marginal distribution µ
Simulate a function g : X → R:
    c_n = n^{-1} Σ_{t=1}^{n} g(X_t) → c_0 = µ(g)
Probability of over-estimate, for c > c_0 = µ(g):
    n^{-1} log P{ n^{-1} Σ_{t=1}^{n} g(X_t) ≥ c } → −Λ*(c)
Rate function & log-moment generating function:
    Λ*(c) = sup_{θ>0} [θc − Λ(θ)],    Λ(θ) = lim_{n→∞} n^{-1} log E[exp(θ Σ_{t=1}^{n} g(X_t))]
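A minimal numerical sketch of this statement in the i.i.d. case (Cramér's theorem): compute Λ(θ) and Λ*(c) for a small finite alphabet and compare with a Monte Carlo estimate of the over-estimation probability. The alphabet, distribution, and threshold below are arbitrary choices.

```python
# Minimal sketch (not from the slides): Cramer's theorem for an i.i.d. sequence.
import numpy as np
from scipy.optimize import minimize_scalar

support = np.array([0.0, 1.0, 2.0, 3.0])          # alphabet, with g(x) = x
mu = np.full(4, 0.25)                             # marginal distribution mu
c0 = support @ mu                                 # mu(g) = 1.5
c = 1.8                                           # over-estimation threshold, c > c0

Lambda = lambda th: np.log(mu @ np.exp(th * support))       # log-MGF Lambda(theta)
res = minimize_scalar(lambda th: -(th * c - Lambda(th)),    # Lambda*(c) = sup_theta [theta*c - Lambda(theta)]
                      bounds=(0.0, 50.0), method="bounded")
rate = -res.fun
print("Lambda*(c) =", rate)

# Rough Monte Carlo sanity check (polynomial prefactors matter at moderate n):
rng = np.random.default_rng(0)
n, trials = 100, 100_000
c_n = rng.choice(support, size=(trials, n), p=mu).mean(axis=1)
p_hat = np.mean(c_n >= c)
print("-(1/n) log P{c_n >= c} is about", -np.log(p_hat) / n if p_hat > 0 else np.inf)
```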
Hoeffding's Bound
X = {X_1, X_2, . . . } is i.i.d. on X = [0, 1], with g(x) = x:
    c_n = n^{-1} Σ_{t=1}^{n} X_t → c_0 = µ(g)
Marginal distribution µ unknown
Worst-case rate function & log-moment generating function:
    inf{Λ*_µ(c) : µ(g) = c_0},    sup{Λ_µ(θ) : µ(g) = c_0}
Solution: µ* is binary on {0, 1}
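A quick check of this extremal property (an illustration, not from the slides): among a few distributions on [0, 1] with the same mean c_0, the binary distribution on {0, 1} has the largest moment generating function; maximizing Λ_µ(θ) for every θ is what minimizes the rate function Λ*_µ(c) and gives the worst-case bound.

```python
# Minimal sketch (not from the slides): among distributions on [0,1] with mean c0,
# the binary distribution on {0,1} maximizes the moment generating function E[exp(theta X)].
import numpy as np

c0, theta = 0.3, 2.0

def mgf_two_point(a, b, mean):
    # MGF of the distribution on {a, b} with the prescribed mean
    p = (mean - a) / (b - a)
    return (1 - p) * np.exp(theta * a) + p * np.exp(theta * b)

candidates = {
    "point mass at c0":              np.exp(theta * c0),
    "uniform on [0, 2*c0]":          (np.exp(theta * 2 * c0) - 1) / (theta * 2 * c0),
    "two-point on {0.1, 0.7}":       mgf_two_point(0.1, 0.7, c0),
    "binary on {0, 1} (worst case)": mgf_two_point(0.0, 1.0, c0),
}
for name, value in candidates.items():
    print(f"{name:32s} E[exp(theta X)] = {value:.4f}")
# The binary distribution gives the largest value for every theta > 0, hence the
# smallest rate function Lambda*_mu(c), i.e. the worst-case deviation bound (Hoeffding).
```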
Bennett's Lemma
X = {X_1, X_2, . . . } is i.i.d. on X = [0, 1], with g(x) = x; mean and variance given
    c_n = n^{-1} Σ_{t=1}^{n} X_t
Marginal distribution µ unknown
Worst-case rate function & log-moment generating function:
    inf{Λ*_µ(c) : µ(g_i) = c_i, i = 1, 2},    sup{Λ_µ(θ) : µ(g_i) = c_i, i = 1, 2}
Solution: µ* is binary on {x_0, 1}
Generalized Bennett's Lemma
X = {X_1, X_2, . . . } is i.i.d. on X = [0, 1]; n moments ⟨µ, g_i⟩ = c_i given
    c_n = n^{-1} Σ_{t=1}^{n} g(X_t)
Marginal distribution µ unknown
Worst-case moment generating function: λ(θ) = E[e^{θ g(X_t)}] = ⟨µ, e^{θg}⟩
Linear program over M:
    max ⟨µ, e^{θg}⟩   s.t.   ⟨µ, g_i⟩ = c_i,  i = 1, . . . , n
µ* is discrete
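A sketch of this linear program on a discretized interval (an illustration, not the talk's algorithm): maximize ⟨µ, e^{θg}⟩ over distributions on a grid subject to moment constraints; the solver returns a basic optimal solution supported on only a few points, matching the claim that µ* is discrete. The grid, θ, and moment values are arbitrary choices.

```python
# Minimal sketch (not from the slides): worst-case MGF under moment constraints as an LP.
import numpy as np
from scipy.optimize import linprog

x = np.linspace(0.0, 1.0, 401)      # grid approximation of X = [0, 1], g(x) = x
theta = 2.0
moments = [(x, 0.3), (x**2, 0.13)]  # prescribed mean and second moment

# maximize <mu, exp(theta g)>  <=>  minimize -<mu, exp(theta g)>
c = -np.exp(theta * x)
A_eq = np.vstack([np.ones_like(x)] + [g for g, _ in moments])   # normalization + moments
b_eq = np.array([1.0] + [ci for _, ci in moments])
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")

support = x[res.x > 1e-8]
print("worst-case MGF:", -res.fun)
print("support of the maximizer:", support)   # typically only a handful of grid points
```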
Sanov's Theorem
State space: X    Probability measures: M
Empirical measures:
    L_n := (1/n) Σ_{t=0}^{n−1} δ_{X_t},    L_n ∈ M for n ≥ 1
Notation: for µ a measure and g a function on X,
    ⟨µ, g⟩ = µ(g) := ∫ g(y) µ(dy),    ⟨L_n, g⟩ = (1/n) Σ_{t=0}^{n−1} g(X_t)
Relative entropy:
    D(ν‖µ) = ⟨ν, log(dν/dµ)⟩ = ∫ log(dν/dµ) ν(dx)
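A small illustration of this notation (not from the slides): the empirical measure L_n of an i.i.d. sample on a finite alphabet and its relative entropy to µ, which tends to zero by the law of large numbers.

```python
# Small illustration (not from the slides): empirical measure L_n and D(L_n || mu)
# for an i.i.d. sample on the finite alphabet {0, 1, 2}.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, 0.3, 0.2])                 # marginal distribution mu
n = 1000
sample = rng.choice(3, size=n, p=mu)
Ln = np.bincount(sample, minlength=3) / n      # empirical measure L_n in M

def kl(nu, mu):
    # relative entropy D(nu || mu) = <nu, log(dnu/dmu)>, with 0 log 0 = 0
    mask = nu > 0
    return float(np.sum(nu[mask] * np.log(nu[mask] / mu[mask])))

print("L_n =", Ln)
print("D(L_n || mu) =", kl(Ln, mu))            # tends to 0 as n grows
```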
Sanov's Theorem
Law of large numbers: L_n → µ as n → ∞
For a convex set of probability measures K ⊂ M with µ ∉ K:
    n^{-1} log P{L_n ∈ K} → −inf_{ν∈K} J(ν) = −η,    Q_η = {ν : J(ν) < η}
Rate function:
    i.i.d. source:  J(ν) = D(ν‖µ)
    Markov:  J(ν) = inf{ D(ν ⊙ P̌ ‖ ν ⊙ P) : P̌ a transition kernel with ν invariant },
    where ν ⊙ P denotes the bivariate distribution ν(dx)P(x, dy)
Example:  K = {ν : ⟨ν, g⟩ ≥ c}
    n^{-1} log P{L_n ∈ K} → −inf_{ν∈K} J(ν) = −Λ*(c)
[Figure: the sub-level set Q_η growing from µ until it touches K along the hyperplane ⟨ν, g⟩ = c]
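A sketch of this example on a finite alphabet (illustrative choices of µ, g, and c): the constrained minimizer of D(ν‖µ) over K = {ν : ⟨ν, g⟩ ≥ c} is an exponentially tilted version of µ, and the minimal value coincides with the Cramér exponent Λ*(c).

```python
# Minimal sketch (not from the slides): Sanov's example on a finite alphabet.
# Solve inf { D(nu||mu) : <nu, g> >= c } by exponential tilting; the value equals Lambda*(c).
import numpy as np
from scipy.optimize import brentq

mu = np.array([0.5, 0.3, 0.2])
g = np.array([0.0, 1.0, 2.0])
c = 1.0                                     # above the mean <mu, g> = 0.7

def tilt(theta):
    w = mu * np.exp(theta * g)              # exponentially tilted distribution
    return w / w.sum()

# choose theta so the tilted mean equals c (the I-projection of mu onto K)
theta_star = brentq(lambda th: tilt(th) @ g - c, 0.0, 50.0)
nu_star = tilt(theta_star)
D = float(np.sum(nu_star * np.log(nu_star / mu)))

Lambda = np.log(mu @ np.exp(theta_star * g))
print("I-projection nu* =", nu_star)
print("D(nu*||mu) =", D, "  theta*c - Lambda(theta*) =", theta_star * c - Lambda)
```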
Outline
Introduction
Relative entropy & Large deviations
Hypothesis testing
Channel capacity
Conclusions
Neyman-Pearson Hypothesis Testing
Observations X = {X_t : t = 1, 2, . . . , N}, i.i.d. with marginal π_j under H_j, j = 0, 1
Hypothesis test: φ(X) = 1 if H_1 is declared true, based on N observations
Error probabilities:
    P_{e,0} = P_0{φ(X) = 1},    P_{e,1} = P_1{φ(X) = 0}
N-P criterion:  inf_φ P_{e,1}  subject to  P_{e,0} ≤ e^{−Nη}
Solution: φ(X) = 0 if L_N ∈ Q_η(π_0), and then
    lim_{N→∞} N^{-1} log P_0{φ_N = 1} = −η
    lim_{N→∞} N^{-1} log P_1{φ_N = 0} = −β*
    β* = inf{J_1(ν) : J_0(ν) ≤ η} = inf{β > 0 : Q_β(π_1) ∩ Q_η(π_0) ≠ ∅}
[Figure: divergence balls Q_η(π_0) around π_0 and Q_{β*}(π_1) around π_1 meeting along a separating hyperplane]
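A numerical sketch of this exponent tradeoff (not from the slides): for two distributions on a finite alphabet, the optimizing ν lies on the exponential family connecting π_0 and π_1, so β* can be found by sweeping that one-parameter family until the constraint J_0(ν) = D(ν‖π_0) = η is met. The distributions and η below are arbitrary choices.

```python
# Minimal sketch (not from the slides): Neyman-Pearson error exponents on a finite alphabet.
import numpy as np
from scipy.optimize import brentq

pi0 = np.array([0.5, 0.3, 0.2])
pi1 = np.array([0.2, 0.3, 0.5])
eta = 0.05                                      # false-alarm exponent constraint

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def nu(lmbda):                                  # geometric mixture pi0^(1-l) pi1^l, normalized
    w = pi0 ** (1 - lmbda) * pi1 ** lmbda
    return w / w.sum()

lam = brentq(lambda l: kl(nu(l), pi0) - eta, 0.0, 1.0)
beta_star = kl(nu(lam), pi1)
print("optimal nu* =", nu(lam))
print("beta* =", beta_star)                     # best miss exponent given P_e,0 <= exp(-N eta)
```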
Robust Neyman-Pearson Hypothesis Testing
Uncertainty classes defined by moment constraints:  π_0 ∈ P_0,  π_1 ∈ P_1
Worst-case exponent:
    β* = inf_{π_1 ∈ P_1} inf_{µ ∈ Q_η(P_0)} D(µ ‖ π_1)
There exist π*_0 ∈ P_0, π*_1 ∈ P_1, and µ* solving this problem
Optimizers are again discrete
[Figure: the enlarged divergence balls Q_η(P_0) and Q_{β*}(P_1) meeting at µ*, with the extremal pair (π*_0, π*_1)]
Outline
Introduction
Relative entropy & Large deviations
Hypothesis testing
Channel capacity
Conclusions
Channel Coding and Sanov's Theorem
Channel kernel: P(dy | x) = P{Y_t ∈ dy | X_t = x}
X is i.i.d. with marginal distribution µ;  Y is i.i.d. with marginal distribution π(·) = ∫ P(· | x) µ(dx)
N-dimensional code words X^i, i = 1, 2, . . . , e^{NR};  N-dimensional output Y received
Bivariate distributions:
    µ ⊙ P(dx, dy) = µ(dx) P(dy | x),    µ ⊗ π(dx, dy) = µ(dx) π(dy)
If i is the true codeword then (X^i, Y) has marginal distribution µ ⊙ P
Otherwise, independence: µ ⊗ π
Two hypotheses based on the empirical distribution of the joint observations (X^i, Y):
    H_0: codeword i not sent, marginal µ ⊗ π;    H_1: codeword i sent, marginal µ ⊙ P
Solution: Reject codeword i (φ = 0) if L_N ∈ Q_η(π_0), giving
    lim_{N→∞} N^{-1} log P_0{φ_N = 1} = −η
The error probability must be multiplied by the number of codewords e^{NR}; for vanishing error,
    e^{NR} × e^{−Nη} < 1,   that is,   R < η
The largest achievable exponent is
    η_max = D(µ ⊙ P ‖ µ ⊗ π) = mutual information
so R < η_max; maximizing over the input distribution µ gives the channel capacity.
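For a discrete memoryless channel these quantities are easy to compute directly. The sketch below (an illustration, not from the slides) evaluates I(µ) = D(µ ⊙ P ‖ µ ⊗ π) and runs the classical Blahut-Arimoto iteration for the capacity; the binary channel is an arbitrary example, and Blahut-Arimoto is the standard algorithm rather than the cutting-plane method mentioned under "What's Next?".

```python
# Minimal sketch (not from the slides): mutual information and Blahut-Arimoto for a DMC.
import numpy as np

P = np.array([[0.9, 0.1],        # channel kernel P(y|x): rows indexed by x, columns by y
              [0.2, 0.8]])

def mutual_info(mu, P):
    pi = mu @ P                                   # output marginal pi = sum_x mu(x) P(.|x)
    joint = mu[:, None] * P                       # mu (.) P
    prod = mu[:, None] * pi[None, :]              # mu (x) pi
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / prod[mask])))

def blahut_arimoto(P, iters=200):
    mu = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        pi = mu @ P
        D = np.sum(P * np.log(P / pi[None, :]), axis=1)   # D(P(.|x) || pi) for each x
        mu = mu * np.exp(D)                               # tilt and renormalize
        mu /= mu.sum()
    return mu, mutual_info(mu, P)

mu_star, C = blahut_arimoto(P)
print("I(uniform) =", mutual_info(np.array([0.5, 0.5]), P))
print("capacity-achieving mu* =", mu_star, "  C =", C, "nats")
```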
Error Exponent
    lim_{N→∞} −N^{-1} log P{error} = E(R, µ)
Formula expressed as the solution to a robust hypothesis testing problem.
For a given input distribution µ, denote the product measures on X × Y with first marginal µ:
    P_0 = {µ ⊗ ν : ν is a probability measure on Y}
Hypothesis H_0: code word i not sent; (X^i_j) and (Y_j) independent
    H_0: {(X^i_j, Y_j) : j = 1, . . . , N} has marginal distribution π_0 ∈ P_0
    H_1: {(X^i_j, Y_j) : j = 1, . . . , N} has marginal distribution π_1 := µ ⊙ P
Test: empirical distributions within an entropy ball around P_0
Entropy neighborhoods of P_0 and of π_1 = µ ⊙ P:
    Q⁺_R(P_0) = {γ : min_ν D(γ ‖ µ ⊗ ν) ≤ R}
    Q⁺_β(µ ⊙ P) = {γ : D(γ ‖ µ ⊙ P) ≤ β}
E(R, µ) = infimum over β such that these entropy neighborhoods meet:
    E(R, µ) = inf{β : Q⁺_β(µ ⊙ P) ∩ Q⁺_R(P_0) ≠ ∅}
E(R) = random coding exponent = supremum over µ of E(R, µ)
[Figure: the neighborhoods Q⁺_β(µ ⊙ P) and Q⁺_R(P_0) in M, meeting at the optimizer]
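For comparison (a classical alternative, not the divergence formulation above), the random coding exponent of a discrete memoryless channel is usually evaluated in Gallager's form E_r(R) = max_µ max_{0≤ρ≤1} [E_0(ρ, µ) − ρR]. The sketch below does this by grid search for an arbitrary binary channel.

```python
# Minimal sketch (not from the slides): random coding exponent in Gallager's classical form.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def E0(rho, mu, P):
    # E_0(rho, mu) = -log sum_y [ sum_x mu(x) P(y|x)^{1/(1+rho)} ]^{1+rho}   (nats)
    inner = mu @ P ** (1.0 / (1.0 + rho))
    return -np.log(np.sum(inner ** (1.0 + rho)))

def random_coding_exponent(R, P, grid=101):
    best = 0.0
    for a in np.linspace(0.0, 1.0, grid):          # input distributions mu = (a, 1-a)
        mu = np.array([a, 1.0 - a])
        for rho in np.linspace(0.0, 1.0, grid):
            best = max(best, E0(rho, mu, P) - rho * R)
    return best

for R in [0.05, 0.1, 0.2, 0.3]:
    print(f"R = {R:.2f} nats:  E_r(R) is about {random_coding_exponent(R, P):.4f}")
```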
Outline
Introduction
Relative entropy & Large deviations
Hypothesis testing
Channel capacity
Conclusions
Summary
Large Deviations is the grand unifying principle of Information Theory
Standard coding is based on AWGN models, which may be unrealistic in wireless channels with fading
Discrete distributions arise in coding and in other applications involving optimization over M
Extremal distributions arise in worst-case models
What's Next?
Part II: Channel models; convex optimization and channel coding; the cutting plane algorithm
Part III: Worst-case models; extremal distributions