
Page 1:

INFORMATION THEORY & CODING

Dr. Qi Wang
Department of Computer Science and Technology
Office: Nanshan i-park A7
Email: [email protected]

Dr. Rui Wang
Department of Electrical and Electronic Engineering
Office: Nanshan i-park A7-1107
Email: [email protected]
Website: eee.sustc.edu.cn/p/wangrui

Page 2:

Differential Entropy - 2

Definitions

AEP for Continuous Random Variables

Relation of Differential Entropy to Entropy

Joint and Conditional Differential Entropy

Relative Entropy and Mutual Information

Estimation Counterpart of Fano’s Inequality

Page 3:

Joint and conditional differential entropy

Definition: The joint differential entropy of X_1, X_2, \ldots, X_n with pdf f(x_1, x_2, \ldots, x_n) is

h(X_1, X_2, \ldots, X_n) = -\int f(x^n) \log f(x^n) \, dx^n.

Definition: If X, Y have a joint pdf f(x, y), the conditional differential entropy h(X|Y) is

h(X|Y) = -\int f(x, y) \log f(x|y) \, dx \, dy = h(X, Y) - h(Y).
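
As a quick numerical illustration (my own sketch, not part of the slides), the relation h(X|Y) = h(X, Y) - h(Y) can be checked by Monte Carlo, estimating each differential entropy as -E[log f] over samples. The example assumes Y ~ N(0, 1) and X = Y + Z with Z ~ N(0, 1) independent, so the pdfs used below are exact Gaussian densities.

# Monte Carlo sketch (illustrative): estimate h(X,Y), h(Y) and
# h(X|Y) = h(X,Y) - h(Y) as -E[log f] over samples, in bits.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 200_000
y = rng.standard_normal(n)           # Y ~ N(0,1)
x = y + rng.standard_normal(n)       # X = Y + Z with Z ~ N(0,1) independent

# joint density f(x,y) = f(y) * f(x|y), where X | Y=y ~ N(y, 1)
log_f_xy = norm.logpdf(y) + norm.logpdf(x, loc=y)
h_xy = -log_f_xy.mean() / np.log(2)
h_y = -norm.logpdf(y).mean() / np.log(2)

print("h(X,Y) ~", h_xy)
print("h(X|Y) = h(X,Y) - h(Y) ~", h_xy - h_y)   # ~ 0.5*log2(2*pi*e), about 2.05 bits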

Page 4:

Entropy of a multivariate Gaussian

Theorem 8.4.1 (Entropy of a multivariate normal distribution): Let X_1, X_2, \ldots, X_n have a multivariate normal distribution with mean \mu and covariance matrix K. Then

h(X_1, X_2, \ldots, X_n) = h(\mathcal{N}_n(\mu, K)) = \frac{1}{2} \log \left[ (2\pi e)^n |K| \right] bits,

where |K| denotes the determinant of K.

Proof: please refer to the textbook.

Definition (Multivariate Gaussian Distribution): If the joint pdf of X_1, X_2, \ldots, X_n satisfies

f(\mathbf{x}) = f(x_1, \ldots, x_n) = \frac{1}{(\sqrt{2\pi})^n |K|^{1/2}} e^{-\frac{1}{2} (\mathbf{x} - \mu)^T K^{-1} (\mathbf{x} - \mu)},

then X_1, X_2, \ldots, X_n are jointly Gaussian (multivariate normal) with mean \mu and covariance matrix K, denoted (X_1, X_2, \ldots, X_n) \sim \mathcal{N}_n(\mu, K).
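
As a sanity check of Theorem 8.4.1 (my own illustration, not from the slides), the closed form (1/2) log2[(2*pi*e)^n |K|] can be compared with the entropy scipy reports for a multivariate normal; scipy returns nats, so the value is divided by log 2.

# Closed-form Gaussian entropy in bits vs. scipy's value (illustrative sketch).
import numpy as np
from scipy.stats import multivariate_normal

K = np.array([[2.0, 0.6],
              [0.6, 1.0]])           # an arbitrary valid covariance matrix
n = K.shape[0]

h_bits = 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(K))
h_scipy_bits = multivariate_normal(mean=np.zeros(n), cov=K).entropy() / np.log(2)
print(h_bits, h_scipy_bits)          # the two values should agree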

Page 5:

Relative entropy and mutual information

Definition: The relative entropy D(f\|g) between two pdfs f and g is

D(f\|g) = \int f \log \frac{f}{g}.

Note: D(f\|g) is finite only if the support set of f is contained in the support set of g.
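
As an illustration (my own example, not in the slides), the relative entropy between two univariate Gaussians has a well-known closed form, and a Monte Carlo average of log(f/g) under f reproduces it. If f placed mass where g is zero, the integrand would be unbounded and D(f||g) would be infinite, matching the note above.

# Monte Carlo estimate of D(f||g) = E_f[log f/g] for two Gaussians,
# compared with the known closed form (illustrative sketch, in nats).
import numpy as np
from scipy.stats import norm

mu1, s1 = 0.0, 1.0                   # f = N(mu1, s1^2)
mu2, s2 = 1.0, 2.0                   # g = N(mu2, s2^2)

rng = np.random.default_rng(0)
x = rng.normal(mu1, s1, size=200_000)                              # samples from f
d_mc = np.mean(norm.logpdf(x, mu1, s1) - norm.logpdf(x, mu2, s2))  # E_f[log f/g]
d_closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
print(d_mc, d_closed)                # divide by log(2) for bits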

Page 6:

Relative entropy and mutual information


Definition: The mutual information I(X; Y) between two random variables with joint pdf f(x, y) is

I(X; Y) = \int f(x, y) \log \frac{f(x, y)}{f(x) f(y)} \, dx \, dy.

Page 7:

Relative entropy and mutual information

By definition, it is clear that

I(X; Y) = h(X) - h(X|Y) = h(Y) - h(Y|X) = h(X) + h(Y) - h(X, Y)

and

I(X; Y) = D(f(x, y) \| f(x) f(y)).

Same as in the discrete case: chain rule, conditioning reduces entropy, union bound, ...
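
For completeness, a short expansion (my own, using only the definitions above) of why the two expressions coincide:

I(X; Y) = \int f(x, y) \log \frac{f(x, y)}{f(x) f(y)} \, dx \, dy = D(f(x, y) \,\|\, f(x) f(y)),

and, writing f(x, y) = f(y) f(x|y),

I(X; Y) = \int f(x, y) \log \frac{f(x|y)}{f(x)} \, dx \, dy = h(X) - h(X|Y),

since \int f(x, y) \log f(x|y) \, dx \, dy = -h(X|Y) and \int f(x, y) \log f(x) \, dx \, dy = \int f(x) \log f(x) \, dx = -h(X).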

Page 8:

Mutual information between correlated Gaussian r.v.s

Let (X, Y) \sim \mathcal{N}(0, K), where

K = \begin{bmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{bmatrix}.

h(X) = h(Y) = \frac{1}{2} \log (2\pi e) \sigma^2

h(X, Y) = \frac{1}{2} \log (2\pi e)^2 |K| = \frac{1}{2} \log (2\pi e)^2 \sigma^4 (1 - \rho^2)

I(X; Y) = h(X) + h(Y) - h(X, Y) = -\frac{1}{2} \log (1 - \rho^2)

Page 9:

Mutual information between correlated Gaussian r.v.s


If \rho = 0, X and Y are independent, and the mutual information is 0.

If \rho = \pm 1, X and Y are perfectly correlated, and the mutual information is infinite.
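
A quick numerical cross-check (my own sketch, not from the slides): estimate I(X; Y) = E[log f(x, y) / (f(x) f(y))] from samples of the correlated Gaussian pair and compare it with -(1/2) log(1 - ρ^2).

# Monte Carlo check of I(X;Y) = -0.5*log(1 - rho^2) for jointly Gaussian X, Y
# (illustrative sketch; values are in nats).
import numpy as np
from scipy.stats import norm, multivariate_normal

sigma2, rho = 1.0, 0.8
K = sigma2 * np.array([[1.0, rho],
                       [rho, 1.0]])

rng = np.random.default_rng(0)
xy = rng.multivariate_normal(mean=[0.0, 0.0], cov=K, size=200_000)
x, y = xy[:, 0], xy[:, 1]

joint = multivariate_normal(mean=[0.0, 0.0], cov=K)
s = np.sqrt(sigma2)
mi_mc = np.mean(joint.logpdf(xy) - norm.logpdf(x, scale=s) - norm.logpdf(y, scale=s))
mi_closed = -0.5 * np.log(1 - rho**2)
print(mi_mc, mi_closed)              # the two values should be close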

Page 10:

Properties of differential entropy

Theorem 8.6.1: D(f\|g) \ge 0, with equality iff f = g almost everywhere.

Page 11:

Properties of differential entropy


Proof: Let S be the support set of f. Then

-D(f\|g) = \int_S f \log \frac{g}{f}
\le \log \int_S f \cdot \frac{g}{f}    (by Jensen's inequality)
= \log \int_S g
\le \log 1 = 0.

Page 12:

Properties of differential entropy

Corollary: I(X; Y) \ge 0, with equality iff X and Y are independent.

Corollary: h(X|Y) \le h(X), with equality iff X and Y are independent.

Theorem 8.6.2 (Chain rule for differential entropy):

h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} h(X_i | X_1, X_2, \ldots, X_{i-1}).

Corollary:

h(X_1, X_2, \ldots, X_n) \le \sum_i h(X_i),

with equality iff X_1, X_2, \ldots, X_n are independent.
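
For jointly Gaussian variables the corollary reduces to Hadamard's inequality |K| \le \prod_i K_{ii}, so it can be checked numerically with the closed-form Gaussian entropies (my own illustration, not from the slides):

# Check h(X1,...,Xn) <= sum_i h(Xi) for a Gaussian vector, which amounts to
# the Hadamard inequality det(K) <= prod(diag(K)) (illustrative sketch, bits).
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
K = A @ A.T + 1e-3 * np.eye(5)       # a random positive-definite covariance

n = K.shape[0]
h_joint = 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(K))
h_marginals = sum(0.5 * np.log2(2 * np.pi * np.e * K[i, i]) for i in range(n))
print(h_joint, h_marginals)          # h_joint <= h_marginals
assert h_joint <= h_marginals + 1e-9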

Page 13:

Properties of differential entropy

Theorem 8.6.3 (Translation does not change the differential entropy):

h(X + c) = h(X).

Theorem 8.6.4:

h(aX) = h(X) + \log |a|.

Page 14:

Properties of differential entropy


Proof: Let Y = aX. Then f_Y(y) = \frac{1}{|a|} f_X(y/a), and we have

h(aX) = -\int f_Y(y) \log f_Y(y) \, dy
= -\int \frac{1}{|a|} f_X(y/a) \log\left(\frac{1}{|a|} f_X(y/a)\right) dy
= -\int f_X(x) \log f_X(x) \, dx + \log |a|
= h(X) + \log |a|.

Page 15:

Properties of differential entropy


Corollary: h(A\mathbf{X}) = h(\mathbf{X}) + \log |\det(A)|.
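
As a check of the corollary (my own sketch, not from the slides): if X ~ N(0, K), then AX ~ N(0, A K A^T), and the closed-form Gaussian entropies should differ by exactly log |det(A)|.

# Verify h(AX) = h(X) + log|det(A)| for Gaussian X using the closed-form
# Gaussian entropy (illustrative sketch, log base 2).
import numpy as np

rng = np.random.default_rng(0)
n = 4
B = rng.standard_normal((n, n))
K = B @ B.T + np.eye(n)              # covariance of X
A = rng.standard_normal((n, n))      # an (almost surely invertible) linear map

def h_gauss(cov):
    return 0.5 * np.log2((2 * np.pi * np.e) ** cov.shape[0] * np.linalg.det(cov))

lhs = h_gauss(A @ K @ A.T)                          # h(AX)
rhs = h_gauss(K) + np.log2(abs(np.linalg.det(A)))   # h(X) + log|det A|
print(lhs, rhs)                                     # should agree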

Page 16:

Multivariate Gaussian maximizes the entropy

Theorem 8.6.5: Let the random vector \mathbf{X} \in \mathbb{R}^n have zero mean and covariance K = E[\mathbf{X}\mathbf{X}^T] (i.e., K_{ij} = E[X_i X_j], 1 \le i, j \le n). Then

h(\mathbf{X}) \le \frac{1}{2} \log (2\pi e)^n |K|,

with equality iff \mathbf{X} \sim \mathcal{N}(0, K).
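
To illustrate the theorem with a non-Gaussian example (my own, not from the slides): a uniform random variable on [-a, a] has variance a^2/3 and differential entropy log(2a), which stays below the Gaussian bound (1/2) log(2*pi*e*a^2/3).

# Entropy of Uniform[-a, a] vs. the Gaussian upper bound for the same
# variance a^2/3 (illustrative sketch, bits).
import numpy as np

a = 1.0
h_uniform = np.log2(2 * a)                             # h(Uniform[-a, a]) = log(2a)
var = a**2 / 3
h_gauss_bound = 0.5 * np.log2(2 * np.pi * np.e * var)  # Theorem 8.6.5 with n = 1
print(h_uniform, h_gauss_bound)                        # h_uniform <= h_gauss_bound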

Page 17:

Estimation counterpart to Fano’s inequality

Random variable X, estimator \hat{X}

The expected prediction error E(X - \hat{X})^2

Theorem 8.6.6 (Estimation error and differential entropy): For any random variable X and estimator \hat{X},

E(X - \hat{X})^2 \ge \frac{1}{2\pi e} e^{2h(X)},

with equality iff X is Gaussian and \hat{X} is the mean of X.

Page 18:

Estimation counterpart to Fano’s inequality


Proof: We have

E(X - \hat{X})^2 \ge \min_{\hat{X}} E(X - \hat{X})^2
= E(X - E(X))^2    (the mean is the best estimator)
= Var(X)
\ge \frac{1}{2\pi e} e^{2h(X)}.    (the Gaussian has maximum entropy)
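
A small numerical illustration of the bound (my own sketch, not from the slides): for X uniform on [0, 1] with the mean as estimator, Var(X) = 1/12, which sits above e^{2h(X)}/(2\pi e) = 1/(2\pi e), since h(X) = 0 nats.

# Illustrate E(X - Xhat)^2 >= exp(2 h(X)) / (2*pi*e) for X ~ Uniform[0,1]
# with Xhat = E[X]; here h(X) = log(1) = 0 nats (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200_000)
mse = np.mean((x - x.mean()) ** 2)               # about Var(X) = 1/12 = 0.083
h_nats = 0.0                                     # h(Uniform[0,1]) = 0 nats
bound = np.exp(2 * h_nats) / (2 * np.pi * np.e)  # = 1/(2*pi*e), about 0.0585
print(mse, bound)                                # mse >= bound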


Page 20:

Summary

Discrete r.v. ⇒ continuous r.v.

entropy ⇒ differential entropy

Page 21:

Summary


Many things are similar: mutual information, relative entropy, AEP, chain rule, ...

Some things are different: h(X) can be negative, and the maximum entropy distribution (for a given covariance) is the Gaussian.