Chapter02 Entropy


TRANSCRIPT

  • Slide 1/97

    Chapter 2: Entropy, Relative Entropy, and Mutual Information

    Xiaojun Hei

    Internet Technology and Engineering R&D Center
    Department of Electronics and Information Engineering

    Email: [email protected]
    Homepage: http://itec.hust.edu.cn/heixj

    Phone: 027-87544704

  • Slide 2/97

    Chapter 2: Entropy, Relative Entropy, and Mutual Information

    Entropy

    Joint and conditional entropy

    Relative entropy and mutual information

    Chain rules

    Jensen's inequality

    Log sum inequality

    Data processing inequality

  • Slide 3/97

    Block diagram of communication systems

    The transmission and processing of information in communication systems

    Data compression limit, provided by Shannon's first theorem

    Commonly-used coding algorithms for zero-error source coding

  • Slide 4/97

    Source encoder side

    The output sequence of an information source is stochastic: how to characterize it?

    "We can think of a discrete source as generating the message, symbol by symbol... a mathematical model of a system... is known as a stochastic process." (C. E. Shannon)

  • Slide 5/97

    Source coding

    A source code is a mapping C : X → D*, x ↦ C(x), where D* is the set of finite-length strings of symbols from a D-ary alphabet¹.

    Let C(x) denote the codeword corresponding to x.

    Let l(x) denote the length of C(x).

    Expected length of a source code:

    L(C) = ∑_{x∈X} p(x) l(x)

    Example: X = {red, blue}, D = {0, 1}, C(red) = 0, C(blue) = 1.

    ¹When D = 2, it is binary, in which case the alphabet is {0, 1}.
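
    As a concrete illustration of the expected length L(C), here is a minimal Python sketch; the code {red → 0, blue → 1} is the example above, while the probabilities are an assumed illustration (the slide does not specify them):

    # Expected length L(C) = sum over x of p(x) * l(x) for a source code C.
    def expected_length(pmf, code):
        # pmf: dict symbol -> probability; code: dict symbol -> codeword string
        return sum(p * len(code[x]) for x, p in pmf.items())

    pmf  = {"red": 0.3, "blue": 0.7}   # assumed probabilities, any valid p.m.f. works
    code = {"red": "0", "blue": "1"}   # C(red) = 0, C(blue) = 1, D-ary alphabet with D = 2

    print(expected_length(pmf, code))  # 1.0, since every codeword has length 1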

  • Slide 6/97

    Outcomes of the source

    Single outcome or outcome sequence

    Continuous or Discrete

  • Slide 7/97

    Modeling single outcome

    Continuous source: X ∈ R with probability density p(x), ∫_R p(x) dx = 1

    Discrete source:

    ( X    )   ( a1     a2     ...  aq    )
    ( P(X) ) = ( P(a1)  P(a2)  ...  P(aq) ),   with ∑_{i=1}^{q} P(ai) = 1

  • Slide 8/97

    Modeling outcome sequence

    Waveform source: continuous in both time and amplitude; modeled as a continuous stochastic process {x(t)}

    Sequence source: sampled from a waveform source; discrete in time or space; modeled as a stochastic sequence {Xi(ti)}

  • Slide 9/97

    Classification of sources

    Stationary: does the distribution change with time?

    Stationary source: "good" source (easy to analyze). Non-stationary source: can sometimes be simplified as a Markov source.

    Memory: are the variables in the sequence related?

    Source without memory: "good" source (easy to analyze). Source with memory: can be modeled as a Markov source.

  • Slide 10/97

    Sources studied in our course

    Motivation: we study ideal sources with "good" properties, then use them to approximate real sources.

    Discrete source

    Single-outcome discrete source; outcome-sequence discrete source (discrete stationary memoryless source, discrete stationary source with memory)

    Continuous source

    Waveform source

  • Slide 11/97

    Diagram of communication systems

    The transmission and processing of information in communication systems

  • Slide 12/97

    Information source model

    Notations: sample space: X; random variable (r.v.): X; outcome of X (realization of X): x; cardinality of the set X (the number of elements): |X|

    Probability mass function (p.m.f.): P(x) = Pr[X = x], x ∈ X; P(x, y) = Pr[X = x, Y = y], x ∈ X, y ∈ Y

  • Slide 13/97

    Information properties

    At first, we investigate the measure of information for a single outcome. It should have the following properties:

    (Property 1) The larger the measure, the more surprising the outcome is.

    (Property 2) It is a function of the probability distribution, proportional to the inverse of the probability: a non-linear mapping from probability to information ([0, 1] → [0, ∞)).

    (Property 3) The information content of two independent r.v.s is the sum of the individual information contents, which points to the logarithm of the probability.

  • Slide 14/97

    Definition

    The self-information of a realization x of r.v. X can be defined as:

    I(x) = −log[p(x)] = log(1/p(x))

    It can be proved that this is the only form satisfying the properties of information.

    The base of the logarithm can be any: base 2 gives information measured in binary units (bits); base e gives information measured in natural units (nats).

  • Slide 15/97

    Self-information: example

    Given a source with M equally likely outcomes, the self-information of each outcome is k = log2 M bits.

    This means the outcome can be described by k bits of information.

    For instance, if a source has M = 4 outcomes {a, b, c, d}, then each outcome can be described by 2 = log2 4 bits of code, such as the set {00, 01, 10, 11}:

    a → 00, b → 01, c → 10, d → 11

  • Slide 16/97

    Entropy definition

    The average information of r.v. X is called the entropy of X:

    H(X) = −∑_{x∈X} p(x) log[p(x)]

    A (convenient) measure of uncertainty of the r.v.

    Entropy is a function of the probability distribution: independent of the outcomes of the r.v. itself; only the distribution matters.

    Logarithm: the base can be any, 2 by default. Base b is sometimes marked as Hb(X). By the continuity argument, if p(x) = 0 then p(x) log(1/p(x)) = 0 · log(1/0) = 0, so zero-probability outcomes have no impact on entropy.
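
    A minimal Python sketch of the entropy definition, using the 0 · log(1/0) = 0 convention noted above:

    from math import log2

    def entropy(pmf):
        # Entropy in bits of a p.m.f. given as a list of probabilities.
        # Zero-probability terms contribute nothing, by the continuity argument.
        return -sum(p * log2(p) for p in pmf if p > 0)

    print(entropy([0.5, 0.5]))   # 1.0 bit
    print(entropy([0.25] * 4))   # 2.0 bits
    print(entropy([1.0, 0.0]))   # 0.0 bits: no uncertainty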

  • Slide 17/97

    Logarithm: properties (http://en.wikipedia.org/wiki/Logarithm)

    Product, quotient, power, and root:

    product:  log_b(xy) = log_b(x) + log_b(y);     example: log_3(243) = log_3(9 · 27) = log_3(9) + log_3(27) = 2 + 3 = 5
    quotient: log_b(x/y) = log_b(x) − log_b(y);    example: log_2(16) = log_2(64/4) = log_2(64) − log_2(4) = 6 − 2 = 4
    power:    log_b(x^p) = p · log_b(x);           example: log_2(64) = log_2(2^6) = 6 · log_2(2) = 6
    root:     log_b(x^(1/p)) = log_b(x) / p;       example: log_10(√1000) = (1/2) · log_10(1000) = 3/2 = 1.5

    Change of base: log_b(x) = log_k(x) / log_k(b).

    Derivative and antiderivative:

    d/dx log_b(x) = 1 / (x ln(b)),   d/dx ln(f(x)) = f'(x) / f(x),   ∫ ln(x) dx = x ln(x) − x + C.

    Integral representation of the natural logarithm: ln(t) = ∫_1^t (1/x) dx.

  • Slide 18/97

    Logarithm: functions (http://en.wikipedia.org/wiki/Logarithm)

    The graph of the logarithm to base 2 crosses the x axis (horizontal axis) at 1 and passes through the points with coordinates (2, 1), (4, 2), and (8, 3).

  • Slide 19/97

    Logarithm: functions (http://en.wikipedia.org/wiki/Logarithm)

    The graph of the logarithm function log_b(x) (blue) is obtained by reflecting the graph of the function b^x (red) at the diagonal line (x = y).

  • Slide 20/97

    Logarithm: functions (http://en.wikipedia.org/wiki/Logarithm)

    The graph of the natural logarithm (green) and its tangent at x = 1.5 (black).

  • Slide 21/97

    Entropy: basic properties

    Entropy is the expected value of self-information:

    H(X) = E{−log[p(x)]} = E{log(1/p(x))}

    Entropy H(X) is non-negative:

    0 ≤ p(x) ≤ 1  ⇒  log(1/p(x)) ≥ 0  ⇒  H(X) = E{log(1/p(x))} ≥ 0

    Change of bases: Hb(X) = [log_b a] · Ha(X), since log_b p = [log_b a] · log_a p.

  • Slide 22/97

    Example #1: entropy of uniform r.v.

    Consider a uniform random variable with M = 32 = 2^5 possible outcomes.

    Then 5 bits are sufficient to describe each outcome.

    The entropy of this r.v. is

    H(X) = −∑_{i=1}^{32} p(x) log2[p(x)] = −∑_{i=1}^{32} (1/32) log2(1/32) = 5 bits

    Entropy agrees with log2 M (H(X) = log2 M).

  • Slide 23/97

    Example #2: entropy of non-uniform r.v.

    Consider a non-uniform random variable with M = 8 = 2^3 possible outcomes and probabilities (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64).

    Then 3 bits are sufficient to describe each outcome.

    The entropy of this r.v. is

    H(X) = −(1/2) log2(1/2) − (1/4) log2(1/4) − (1/8) log2(1/8) − (1/16) log2(1/16) − (4/64) log2(1/64) = 2 bits

    Entropy and log2(M) disagree (H(X) < log2 M).

    The average length of the source code words can be made shorter than 3.

    Example: (0, 10, 110, 1110, 111100, 111101, 111110, 111111)
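
    The numbers in this example can be checked with a short Python sketch: the entropy of the eight-outcome p.m.f. is 2 bits, and the expected length of the listed code (codeword lengths 1, 2, 3, 4, 6, 6, 6, 6) is also 2 bits, below the 3 bits of a fixed-length code:

    from math import log2

    pmf  = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
    code = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

    H = -sum(p * log2(p) for p in pmf)               # entropy of the source
    L = sum(p * len(c) for p, c in zip(pmf, code))   # expected codeword length

    print(H, L)   # 2.0 2.0 -> both equal 2 bits, less than log2(8) = 3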

  • Slide 24/97

    Definition

    Joint entropy

    H(X, Y) = −∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(x, y)] = −E{log[p(x, y)]}

    Conditional entropy

    H(Y|X) = ∑_{x∈X} p(x) H(Y|X = x)
           = ∑_{x∈X} p(x) ∑_{y∈Y} p(y|x) log[1/p(y|x)]
           = −∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(y|x)]
           = −E{log[p(y|x)]}

    Note that in general H(Y|X) ≠ H(X|Y).

  • Slide 25/97

    Venn Diagram

  • Slide 26/97

    Motivation

    |A − B| = √[(xA − xB)² + (yA − yB)²]

    |I_B − I_A| = log(1/q(x)) − log(1/p(x)) = log[p(x)/q(x)]

    Average: ∑_{x∈X} p(x) log[p(x)/q(x)]

  • Slide 27/97

    Relative entropy (or Kullback-Leibler distance)

    Definition: a measure of the information distance or the informational divergence between two p.m.f.s, p(x) and q(x):

    D(p(x)||q(x)) = ∑_{x∈X} p(x) log[p(x)/q(x)] = E_p{log[p(X)/q(X)]}

    When p(x) is the true p.m.f. of X, this measures the inefficiency of assuming that q(x) is the p.m.f. of X.

    It is distance-like in many respects. It is not a true distance, since it:

    Is not symmetric: D(p||q) vs. D(q||p)

    Does not satisfy the triangle inequality: D(p||q) + D(q||r) vs. D(p||r)

  • Slide 28/97

    Geometric space

    Characterize the position of any point with coordinates

    Compute distances using coordinates

    Symmetric distance: |P1, P2| = |P2, P1|

    Triangle inequality: |P1, P2| + |P2, P3| ≥ |P1, P3|

  • Slide 29/97

    Relative entropy: example

    Let x ∈ X = {0, 1}, p(0) = 1 − r, p(1) = r, q(0) = 1 − s, q(1) = s.

    D(p(x)||q(x)) = p(0) log[p(0)/q(0)] + p(1) log[p(1)/q(1)] = (1 − r) log[(1 − r)/(1 − s)] + r log(r/s)

    D(q(x)||p(x)) = q(0) log[q(0)/p(0)] + q(1) log[q(1)/p(1)] = (1 − s) log[(1 − s)/(1 − r)] + s log(s/r)

    If r = s, D(p||q) = D(q||p) = 0.

    If r ≠ s, such as r = 1/2, s = 1/4: D(p||q) = 0.2075 bits, D(q||p) = 0.1887 bits.

    Thus, in general D(p||q) ≠ D(q||p).
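
    A short Python sketch reproduces the asymmetry shown above for r = 1/2, s = 1/4:

    from math import log2

    def kl(p, q):
        # Relative entropy D(p||q) in bits; assumes q(x) > 0 wherever p(x) > 0.
        return sum(px * log2(px / qx) for px, qx in zip(p, q) if px > 0)

    r, s = 1/2, 1/4
    p = [1 - r, r]
    q = [1 - s, s]

    print(kl(p, q))   # 0.2075... bits
    print(kl(q, p))   # 0.1887... bits, so D(p||q) != D(q||p)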

  • Slide 30/97

    Mutual information

    Things are commonly related; two random variables are usually related.

    From an information perspective, how do we characterize the relationship between r.v. X and r.v. Y?

    Observing X alone, the information of X is H(X). Knowing Y, the information of X becomes H(X|Y). Knowing Y, the information of X is therefore reduced by H(X) − H(X|Y). This reduction is the uncertainty about X that is removed by knowing Y.

  • Slide 31/97

    Mutual information

    Mutual information is the relative entropy between the joint distribution and the product distribution of two random variables X, Y:

    I(X; Y) = D[p(x, y)||p(x)p(y)]
            = ∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(x, y)/(p(x)p(y))]
            = E_{(X,Y)}{log[p(X, Y)/(p(X)p(Y))]}

    Measure of the information one random variable (say, X) contains about the other (Y).

    Special cases: if X and Y are independent, I(X; Y) = 0; if Y = X, I(X; X) = H(X).

  • Slide 32/97

    Mutual information

    Conditional relative entropy

    D(p(y|x)||q(y|x)) = ∑_{x∈X} p(x) ∑_{y∈Y} p(y|x) log[p(y|x)/q(y|x)]

    Conditional mutual information

    I(X; Y|Z) = ∑_{x∈X} ∑_{y∈Y} ∑_{z∈Z} p(x, y, z) log[p(x, y|z)/(p(x|z)p(y|z))]
              = E_{p(x,y,z)}{log[p(X, Y|Z)/(p(X|Z)p(Y|Z))]}

  • Slide 33/97

    Mutual information vs. Entropy

    I(X; Y) = H(X) − H(X|Y)

    Proof:

    I(X; Y) = ∑_x ∑_y p(x, y) log[p(x, y)/(p(x)p(y))]
            = ∑_x ∑_y p(x, y) log[p(x|y)/p(x)]
            = −∑_x ∑_y p(x, y) log[p(x)] + ∑_x ∑_y p(x, y) log[p(x|y)]
            = −∑_x p(x) log[p(x)] + ∑_x ∑_y p(x, y) log[p(x|y)]
            = H(X) − H(X|Y)

  • Slide 34/97

    Mutual information vs. Entropy

    Expressions:

    I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = I(Y; X)
    I(X; Y) = H(X) + H(Y) − H(X, Y)
    I(X; X) = H(X)
    I(X; Y|Z) = H(X|Z) − H(X|Y, Z)

    Venn Diagram

  • Slide 35/97

    Example #1

    Joint p.m.f.:

    Y \ X    1      2      3      4      p(y)
    1        1/8    1/16   1/32   1/32   1/4
    2        1/16   1/8    1/32   1/32   1/4
    3        1/16   1/16   1/16   1/16   1/4
    4        1/4    0      0      0      1/4
    p(x)     1/2    1/4    1/8    1/8

    What is H(X), H(Y), H(X|Y), H(Y|X), H(X, Y), I(X; Y)?

  • Slide 36/97

    Solution of example #1

    H(X) = −∑_{x∈X} p(x) log[p(x)] = H(1/2, 1/4, 1/8, 1/8)
         = −[(1/2) log(1/2) + (1/4) log(1/4) + (1/8) log(1/8) + (1/8) log(1/8)]
         = 1.75 bits

    H(Y) = −∑_{y∈Y} p(y) log[p(y)] = H(1/4, 1/4, 1/4, 1/4)
         = −[(1/4) log(1/4) + (1/4) log(1/4) + (1/4) log(1/4) + (1/4) log(1/4)]
         = 2 bits

  • Slide 37/97

    Solution of example #1

    H(X|Y) = ∑_{y∈Y} p(y) H(X|Y = y)
           = ∑_{y∈Y} p(y) ∑_{x∈X} p(x|y) log[1/p(x|y)]
           = −∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(x|y)]
           = −∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(x, y)/p(y)]
           = −[(1/8) log((1/8)/(1/4)) + (1/16) log((1/16)/(1/4)) + (1/32) log((1/32)/(1/4)) + (1/32) log((1/32)/(1/4))
              + (1/16) log((1/16)/(1/4)) + (1/8) log((1/8)/(1/4)) + (1/32) log((1/32)/(1/4)) + (1/32) log((1/32)/(1/4))
              + (1/16) log((1/16)/(1/4)) + (1/16) log((1/16)/(1/4)) + (1/16) log((1/16)/(1/4)) + (1/16) log((1/16)/(1/4))
              + (1/4) log((1/4)/(1/4)) + 0 log(0/(1/4)) + 0 log(0/(1/4)) + 0 log(0/(1/4))]
           = 1.375 bits

  • Slide 38/97

    Solution of example #1

    H(Y|X) = ∑_{x∈X} p(x) H(Y|X = x)
           = ∑_{x∈X} p(x) ∑_{y∈Y} p(y|x) log[1/p(y|x)]
           = −∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(y|x)]
           = −∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(x, y)/p(x)]
           = −[(1/8) log((1/8)/(1/2)) + (1/16) log((1/16)/(1/4)) + (1/32) log((1/32)/(1/8)) + (1/32) log((1/32)/(1/8))
              + (1/16) log((1/16)/(1/2)) + (1/8) log((1/8)/(1/4)) + (1/32) log((1/32)/(1/8)) + (1/32) log((1/32)/(1/8))
              + (1/16) log((1/16)/(1/2)) + (1/16) log((1/16)/(1/4)) + (1/16) log((1/16)/(1/8)) + (1/16) log((1/16)/(1/8))
              + (1/4) log((1/4)/(1/2)) + 0 log(0/(1/4)) + 0 log(0/(1/8)) + 0 log(0/(1/8))]
           = 1.625 bits

  • Slide 39/97

    Solution of example #1

    H(X, Y) = −∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(x, y)]
            = −[(1/8) log(1/8) + (1/16) log(1/16) + (1/32) log(1/32) + (1/32) log(1/32)
               + (1/16) log(1/16) + (1/8) log(1/8) + (1/32) log(1/32) + (1/32) log(1/32)
               + (1/16) log(1/16) + (1/16) log(1/16) + (1/16) log(1/16) + (1/16) log(1/16)
               + (1/4) log(1/4) + 0 log 0 + 0 log 0 + 0 log 0]
            = 3.375 bits

    Note that H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) by observation in this example.

  • Slide 40/97

    Solution of example #1

    Method 1:

    I(X; Y) = H(X) − H(X|Y) = 1.75 − 1.375 = 0.375 bits
    I(X; Y) = H(Y) − H(Y|X) = 2 − 1.625 = 0.375 bits

    Method 2:

    I(X; Y) = D[p(x, y)||p(x)p(y)] = ∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(x, y)/(p(x)p(y))]
            = (1/8) log[(1/8)/((1/2)(1/4))] + (1/16) log[(1/16)/((1/4)(1/4))] + (1/32) log[(1/32)/((1/8)(1/4))] + (1/32) log[(1/32)/((1/8)(1/4))]
            + (1/16) log[(1/16)/((1/2)(1/4))] + (1/8) log[(1/8)/((1/4)(1/4))] + (1/32) log[(1/32)/((1/8)(1/4))] + (1/32) log[(1/32)/((1/8)(1/4))]
            + (1/16) log[(1/16)/((1/2)(1/4))] + (1/16) log[(1/16)/((1/4)(1/4))] + (1/16) log[(1/16)/((1/8)(1/4))] + (1/16) log[(1/16)/((1/8)(1/4))]
            + (1/4) log[(1/4)/((1/2)(1/4))] + 0 log 0 + 0 log 0 + 0 log 0
            = 0.375 bits
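
    All of the quantities in this example can be verified with a short Python sketch that works directly from the joint p.m.f. table (rows indexed by y, columns by x):

    from math import log2

    # Joint p.m.f. from the example: P[j][i] = p(X = i+1, Y = j+1).
    P = [[1/8, 1/16, 1/32, 1/32],
         [1/16, 1/8, 1/32, 1/32],
         [1/16, 1/16, 1/16, 1/16],
         [1/4, 0, 0, 0]]

    def H(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    px = [sum(row[i] for row in P) for i in range(4)]   # marginal p(x) = (1/2, 1/4, 1/8, 1/8)
    py = [sum(row) for row in P]                        # marginal p(y) = (1/4, 1/4, 1/4, 1/4)

    HX, HY = H(px), H(py)
    HXY = H([p for row in P for p in row])              # joint entropy H(X, Y)
    HX_given_Y = HXY - HY                               # by the chain rule
    HY_given_X = HXY - HX
    I = HX + HY - HXY                                   # mutual information

    print(HX, HY, HXY)              # 1.75 2.0 3.375
    print(HX_given_Y, HY_given_X)   # 1.375 1.625
    print(I)                        # 0.375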

  • Slide 41/97

    The chain rule: motivation

    In calculus, the chain rule is a formula for computing the derivative of the composition of two or more functions.

    Let y = f(u) and u = g(x). Then

    [f(g(x))]' = f'(g(x)) g'(x),   i.e.,   dy/dx = (dy/du)(du/dx)

    In information theory, the chain rule is a formula for computing the entropies of the composition of two or more random variables.

  • Slide 42/97

    The chain rule

    H(X, Y) = H(X) + H(Y|X)

    Proof:

    H(X, Y) = −∑_x ∑_y p(x, y) log[p(x, y)]
            = −∑_x ∑_y p(x, y) log[p(x)p(y|x)]
            = −∑_x ∑_y p(x, y) log[p(x)] − ∑_x ∑_y p(x, y) log[p(y|x)]
            = −∑_x p(x) log[p(x)] − ∑_x ∑_y p(x, y) log[p(y|x)]
            = H(X) + H(Y|X)

    Corollary: H(X, Y|Z) = H(X|Z) + H(Y|X, Z)

  • Slide 43/97

    Example #2

    Joint p.m.f.:

    Y \ X    1      2      3      4      p(y)
    1        1/8    1/16   1/32   1/32   1/4
    2        1/16   1/8    1/32   1/32   1/4
    3        1/16   1/16   1/16   1/16   1/4
    4        1/4    0      0      0      1/4
    p(x)     1/2    1/4    1/8    1/8

    What is H(X), H(Y), H(X|Y), H(Y|X), H(X, Y)?

  • Slide 44/97

    Compute H(X), H(Y)

    H(X) = H(1/2, 1/4, 1/8, 1/8)
         = −∑_i p(X = i) log p(X = i)
         = −[(1/2) log(1/2) + (1/4) log(1/4) + (1/8) log(1/8) + (1/8) log(1/8)]
         = 1.75 bits

    H(Y) = H(1/4, 1/4, 1/4, 1/4)
         = −∑_i p(Y = i) log p(Y = i)
         = −[(1/4) log(1/4) + (1/4) log(1/4) + (1/4) log(1/4) + (1/4) log(1/4)]
         = 2 bits

  • Slide 45/97

    Compute H(X|Y)

    H(X|Y) = ∑_j p(Y = j) H(X|Y = j)
           = −∑_j p(Y = j) ∑_i p(X = i|Y = j) log[p(X = i|Y = j)]
           = −∑_i ∑_j p(X = i, Y = j) log[p(X = i, Y = j)/p(Y = j)]
           = −[(1/8) log((1/8)/(1/4)) + (1/16) log((1/16)/(1/4)) + (1/16) log((1/16)/(1/4)) + (1/4) log((1/4)/(1/4))
              + (1/16) log((1/16)/(1/4)) + (1/8) log((1/8)/(1/4)) + (1/16) log((1/16)/(1/4)) + 0 log(0/(1/4))
              + (1/32) log((1/32)/(1/4)) + (1/32) log((1/32)/(1/4)) + (1/16) log((1/16)/(1/4)) + 0 log(0/(1/4))
              + (1/32) log((1/32)/(1/4)) + (1/32) log((1/32)/(1/4)) + (1/16) log((1/16)/(1/4)) + 0 log(0/(1/4))]
           = 1.375 bits

  • Slide 46/97

    Compute H(Y|X)

    H(Y|X) = ∑_i p(X = i) H(Y|X = i)
           = −∑_i p(X = i) ∑_j p(Y = j|X = i) log[p(Y = j|X = i)]
           = −∑_i ∑_j p(X = i, Y = j) log[p(X = i, Y = j)/p(X = i)]
           = −[(1/8) log((1/8)/(1/2)) + (1/16) log((1/16)/(1/2)) + (1/16) log((1/16)/(1/2)) + (1/4) log((1/4)/(1/2))
              + (1/16) log((1/16)/(1/4)) + (1/8) log((1/8)/(1/4)) + (1/16) log((1/16)/(1/4)) + 0 log(0/(1/4))
              + (1/32) log((1/32)/(1/8)) + (1/32) log((1/32)/(1/8)) + (1/16) log((1/16)/(1/8)) + 0 log(0/(1/8))
              + (1/32) log((1/32)/(1/8)) + (1/32) log((1/32)/(1/8)) + (1/16) log((1/16)/(1/8)) + 0 log(0/(1/8))]
           = 1.625 bits

  • Slide 47/97

    Compute H(X, Y)

    H(X) = H(1/2, 1/4, 1/8, 1/8) = 1.75 bits
    H(Y) = H(1/4, 1/4, 1/4, 1/4) = 2 bits
    H(X|Y) = ∑_i Pr(Y = i) H(X|Y = i) = 1.375 bits
    H(Y|X) = ∑_i Pr(X = i) H(Y|X = i) = 1.625 bits

    H(X, Y) = H(X) + H(Y|X) = 1.75 + 1.625 = 3.375 bits (chain rule)

    H(X) − H(X|Y) = 1.75 − 1.375 = 0.375 bits
    H(Y) − H(Y|X) = 2 − 1.625 = 0.375 bits

    I(X; Y) = H(X) − H(X|Y)
    I(X; Y) = H(Y) − H(Y|X)

  • Slide 48/97

    Chain rules

    Chain rules can be derived by repeated applications of the two-variable expansion rule H(X, Y) = H(X) + H(Y|X).

    Entropy: H(X1, X2, ..., Xn) = ∑_{i=1}^{n} H(Xi|Xi−1, Xi−2, ..., X1)

    Mutual information: I(X1, X2, ..., Xn; Y) = ∑_{i=1}^{n} I(Xi; Y|Xi−1, Xi−2, ..., X1)

    Relative entropy: D(p(x, y)||q(x, y)) = D(p(x)||q(x)) + D(p(y|x)||q(y|x))

  • Slide 49/97

    Chain rule examples

    H(X1, X2, X3) = ∑_{i=1}^{3} H(Xi|Xi−1, Xi−2, ..., X1)
                  = H(X1) + H(X2|X1) + H(X3|X2, X1)

    I(X1, X2, X3; Y) = ∑_{i=1}^{3} I(Xi; Y|Xi−1, Xi−2, ..., X1)
                     = I(X1; Y) + I(X2; Y|X1) + I(X3; Y|X2, X1)

  • Slide 50/97

    Conditional entropies in communication systems

    System model: the source sends r.v. X, the destination receives r.v. Y. Realization of X (or Y) is xi (or yi).

    How much information is transferred from the source to the destination?

    Options: H(X), H(Y), H(X, Y), H(X|Y), H(Y|X), I(X; Y)

    I(X; Y) = ∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(x, y)/(p(x)p(y))]
            = ∑_{x∈X} ∑_{y∈Y} p(x)p(y|x) log[p(y|x)/p(y)]

    I(X; Y) is a function of the input p(x) and the channel characteristics p(y|x). Channel capacity: C = max_{p(x)} I(X; Y)

  • Slide 51/97

    H(X|Y) in communication systems

    Ideally, H(X) should be transmitted from the source to the destination.

    H(X) = H(X|Y) + I(X; Y)
    I(X; Y) = H(X) − H(X|Y)

    At the destination, after Y is received, there still exists average uncertainty about the source X due to the transmission distortion in the channel.

    H(X|Y): loss entropy

  • Slide 52/97

    H(Y|X) in communication systems

    Ideally, if there were no noise in the channel, there would exist a deterministic relationship between the sender and the receiver.

    H(Y) = I(Y; X) + H(Y|X)
    I(Y; X) = H(Y) − H(Y|X)

    At the source, after X is sent, there still exists average uncertainty about the destination Y due to the channel noise.

    H(Y|X): noise entropy

  • Slide 53/97

    Mutual information of a realization: at the destination

    Prior probability p(xi): uncertainty about xi without receiving yj

    Posterior probability p(xi|yj): uncertainty about xi after receiving yj

    I(xi; yj): amount of uncertainty reduced by receiving yj

    I(xi; yj) = I(xi) − I(xi|yj) = log[1/p(xi)] − log[1/p(xi|yj)] = log[p(xi|yj)/p(xi)]

  • Slide 54/97

    Mutual information of a realization: at the source

    Prior probability p(yj): uncertainty about yj without sending xi

    Posterior probability p(yj|xi): uncertainty about yj after sending xi

    I(yj; xi): amount of uncertainty reduced by sending xi

    I(yj; xi) = I(yj) − I(yj|xi) = log[1/p(yj)] − log[1/p(yj|xi)] = log[p(yj|xi)/p(yj)]

  • Slide 55/97

    I(xi; yj) vs. I(yj; xi)

    I(xi; yj) = I(yj; xi)

    Proof:

    I(xi; yj) = log[p(xi|yj)/p(xi)]
              = log[p(xi|yj)p(yj)/(p(xi)p(yj))]
              = log[p(xi, yj)/(p(xi)p(yj))]
              = log[p(yj|xi)/p(yj)]
              = I(yj; xi)

  • Slide 56/97

    Mutual information of a realization: system view

    Before communication, X and Y are considered to be statistically independent:

    p(xi, yj) = p(xi)p(yj)
    I_before(xi, yj) = log[1/p(xi, yj)] = log[1/(p(xi)p(yj))] = log[1/p(xi)] + log[1/p(yj)] = I(xi) + I(yj)

    After communication, X and Y are related due to the channel characteristics:

    p(xi, yj) = p(xi)p(yj|xi) = p(yj)p(xi|yj)
    I_after(xi, yj) = log[1/p(xi, yj)] = I(xi, yj)

    I(xi; yj) is the reduction of uncertainty before and after communication:

    I(xi; yj) = I_before(xi, yj) − I_after(xi, yj) = I(xi) + I(yj) − I(xi, yj)

  • Slide 57/97

    Mutual information of a realization: equivalency

    At the destination: I(xi; yj) = I(xi) − I(xi|yj).

    At the source: I(yj; xi) = I(yj) − I(yj|xi).

    From the system view: I(xi; yj) = I(xi) + I(yj) − I(xi, yj).

    I(xi, yj) = log[1/p(xi, yj)] = log[1/(p(xi)p(yj|xi))] = log[1/p(xi)] + log[1/p(yj|xi)] = I(xi) + I(yj|xi)

    I(xi; yj) = I(xi) + I(yj) − I(xi, yj) = I(xi) + I(yj) − [I(xi) + I(yj|xi)] = I(yj) − I(yj|xi)

    I(yj, xi) = log[1/p(yj, xi)] = log[1/(p(yj)p(xi|yj))] = log[1/p(yj)] + log[1/p(xi|yj)] = I(yj) + I(xi|yj)

    I(xi; yj) = I(xi) + I(yj) − I(xi, yj) = I(xi) + I(yj) − [I(yj) + I(xi|yj)] = I(xi) − I(xi|yj)

  • Slide 58/97

    Mutual information in communication systems

    Mutual information of a realization at the micro level:

    I(xi; yj) = log[p(xi|yj)/p(xi)] = log[1/p(xi)] − log[1/p(xi|yj)]

    At the destination: I(xi; yj) = I(xi) − I(xi|yj)
    At the source: I(yj; xi) = I(yj) − I(yj|xi)
    From the system view: I(xi; yj) = I(xi) + I(yj) − I(xi, yj)

    Mutual information at the macro level:

    I(X; Y) = D[p(x, y)||p(x)p(y)] = ∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(x, y)/(p(x)p(y))]

    I(X; Y) = I(Y; X)

  • Slide 59/97

    I(xi; yj) vs. I(X; Y)

    I(xi; yj) = log[p(xi, yj)/(p(xi)p(yj))]

    I(X; Y) = D[p(x, y)||p(x)p(y)]
            = ∑_{xi∈X} ∑_{yj∈Y} p(xi, yj) log[p(xi, yj)/(p(xi)p(yj))]
            = ∑_{xi∈X} ∑_{yj∈Y} p(xi, yj) I(xi; yj)
            = E_{X,Y}[I(x; y)]

  • Slide 60/97

    An example communication system

    Given a discrete source (X; p(X)) = (x1, x2; 0.2, 0.8), the output messages pass through a noisy channel; the received messages are modeled using Y = [y1, y2].

    Self-information of event x1: I(x1) = log[1/p(x1)] = 2.322 bits

    p(y1) = ∑_{xi} p(xi) p(y1|xi) = 0.335

    I(x1; y1) = log2[p(y1|x1)/p(y1)] = log2[(7/8)/0.335] = 1.39 bits

    I(x1; y2) = log2[p(y2|x1)/p(y2)] = log2[(1/8)/0.665] ≈ −2.41 bits
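
    A small Python sketch reproduces these numbers. The transition probability p(y1|x1) = 7/8 is read off the slide; p(y1|x2) = 0.2 is an inferred assumption chosen so that p(y1) = 0.335 as stated, since the full channel matrix is not shown:

    from math import log2

    px = [0.2, 0.8]              # p(x1), p(x2) from the slide
    p_y_given_x = [[7/8, 1/8],   # p(y1|x1), p(y2|x1): given transition row
                   [0.2, 0.8]]   # p(y1|x2), p(y2|x2): assumed, consistent with p(y1) = 0.335

    py = [sum(px[i] * p_y_given_x[i][j] for i in range(2)) for j in range(2)]

    print(log2(1 / px[0]))                    # I(x1)     ~ 2.322 bits
    print(py[0])                              # p(y1)     = 0.335
    print(log2(p_y_given_x[0][0] / py[0]))    # I(x1; y1) ~ 1.39 bits
    print(log2(p_y_given_x[0][1] / py[1]))    # I(x1; y2) ~ -2.41 bits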

  • Slide 61/97

    Motivation

    Jensen's inequality builds on convexity.

  • Slide 62/97

    Jensen's inequality: preview

    It is used very widely in information theory.

    Most basic theorems are proved based on Jensen's inequality.

    Preview: if f is a convex function, then E[f(X)] ≥ f(E[X]).

  • Slide 63/97

    What is convexity? Convex functions lie below any chord.

    (Also called: concave upwards, concave up, convex cup.)

    Function f(x) is convex over (a, b) if

    ∀ x1, x2 ∈ (a, b), 0 ≤ λ ≤ 1:  f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2)

    Function f(x) is strictly convex over (a, b) if it is convex and

    ∀ x1, x2 ∈ (a, b), 0 ≤ λ ≤ 1:  f(λx1 + (1 − λ)x2) = λf(x1) + (1 − λ)f(x2)  ⇒  λ = 0 or λ = 1

  • Slide 64/97

    What is concavity?

    Concave functions lie above any chord.

    Function f(x) is concave over (a, b) if

    ∀ x1, x2 ∈ (a, b), 0 ≤ λ ≤ 1:  f(λx1 + (1 − λ)x2) ≥ λf(x1) + (1 − λ)f(x2)

    Function f(x) is strictly concave over (a, b) if it is concave and

    ∀ x1, x2 ∈ (a, b), 0 ≤ λ ≤ 1:  f(λx1 + (1 − λ)x2) = λf(x1) + (1 − λ)f(x2)  ⇒  λ = 0 or λ = 1

  • Slide 65/97

    Examples

    Test of convexity and concavity: if a function f(x) has a second derivative f''(x) that is non-negative (positive) everywhere, then f(x) is convex (strictly convex).

    Examples of convex and concave functions

  • Slide 66/97

    Jensen's inequality: proof

    If f is convex, then for r.v. X, E[f(X)] ≥ f(E[X]).

    If f is strictly convex, equality implies X = E[X] with probability 1.

    Sketch of the proof: we prove this for discrete distributions by mathematical induction² on the number of mass points.

    For n = 2, the inequality becomes p1 f(x1) + p2 f(x2) ≥ f(p1 x1 + p2 x2). It holds by convexity.

    Suppose the theorem is true for distributions with n − 1 mass points:

    ∑_{i=1}^{n−1} qi f(xi) ≥ f(∑_{i=1}^{n−1} qi xi)

    Then, prove the inequality holds for n.

    ²http://en.wikipedia.org/wiki/Mathematical_induction

  • Slide 67/97

    Jensen's inequality: proof

    If f is convex, then for r.v. X, E[f(X)] ≥ f(E[X]).

    If f is strictly convex, equality implies X = E[X] with probability 1.

    E[f(X)] = ∑_{i=1}^{n} pi f(xi) = pn f(xn) + ∑_{i=1}^{n−1} pi f(xi)
            = pn f(xn) + (1 − pn) ∑_{i=1}^{n−1} [pi/(1 − pn)] f(xi)
            ≥ pn f(xn) + (1 − pn) f(∑_{i=1}^{n−1} [pi/(1 − pn)] xi)
            ≥ f(pn xn + (1 − pn) ∑_{i=1}^{n−1} [pi/(1 − pn)] xi)
            = f(∑_{i=1}^{n} pi xi)
            = f(E[X])
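
    A tiny numerical illustration of Jensen's inequality with the convex function f(t) = t², on an arbitrary (assumed) three-point distribution:

    xs = [1.0, 2.0, 5.0]   # mass points (illustrative)
    ps = [0.2, 0.5, 0.3]   # probabilities (illustrative)

    f = lambda t: t * t    # a convex function

    E_X  = sum(p * x for p, x in zip(ps, xs))
    E_fX = sum(p * f(x) for p, x in zip(ps, xs))

    print(E_fX, f(E_X), E_fX >= f(E_X))   # 9.7 7.29 True: E[f(X)] >= f(E[X])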

  • Slide 68/97

    Relative-entropy properties

    We can use Jensen's inequality to prove some of the properties of relative entropy.

    Theorem (information inequality): let p(x), q(x), x ∈ X, be two p.m.f.s. Then

    D(p(x)||q(x)) ≥ 0, with D(p(x)||q(x)) = 0 ⟺ p(x) = q(x).

    Corollary (non-negativity of mutual information):

    I(X; Y) ≥ 0, with I(X; Y) = 0 ⟺ X and Y are independent.

  • Slide 69/97

    Entropy properties proved by Jensen's inequality

    Theorem (uniform p.m.f. maximizes the entropy):

    H(X) ≤ log(|X|), with H(X) = log(|X|) ⟺ p(x) = 1/|X| for all x.

    Theorem (conditioning reduces entropy):

    H(X|Y) ≤ H(X)

    Theorem (independence bound on entropy):

    H(X1, X2, ..., Xn) ≤ ∑_i H(Xi), with H(X1, X2, ..., Xn) = ∑_i H(Xi) ⟺ the Xi are independent of each other.
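
    The first two theorems can be checked numerically on the joint p.m.f. of Example #1; a minimal sketch:

    from math import log2

    P = [[1/8, 1/16, 1/32, 1/32],
         [1/16, 1/8, 1/32, 1/32],
         [1/16, 1/16, 1/16, 1/16],
         [1/4, 0, 0, 0]]   # p(x, y) from Example #1

    def H(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    px = [sum(row[i] for row in P) for i in range(4)]
    py = [sum(row) for row in P]
    HX = H(px)
    HX_given_Y = H([p for row in P for p in row]) - H(py)   # H(X|Y) = H(X,Y) - H(Y)

    print(HX, "<=", log2(4))       # 1.75 <= 2.0 : uniform p.m.f. maximizes entropy
    print(HX_given_Y, "<=", HX)    # 1.375 <= 1.75: conditioning reduces entropy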

  • Slide 70/97

    Information inequality

    [Theorem] Let p(x), q(x), x ∈ X, be two p.m.f.s. Then

    D(p(x)||q(x)) ≥ 0, with D(p(x)||q(x)) = 0 ⟺ p(x) = q(x).

    Proof: let A = {x : p(x) > 0} be the support set of p(x). Then

    −D(p(x)||q(x)) = −∑_{x∈A} p(x) log[p(x)/q(x)]
                   = ∑_{x∈A} p(x) log[q(x)/p(x)]
                   ≤ log ∑_{x∈A} p(x) q(x)/p(x)    (by Jensen's inequality)
                   = log ∑_{x∈A} q(x)
                   ≤ log ∑_{x∈X} q(x)
                   = log 1 = 0

  • Slide 71/97

    Note on information inequality

    ∑_{x∈A} p(x) log[q(x)/p(x)] ≤ log ∑_{x∈A} p(x) q(x)/p(x)

    Note that y = log(t) is a strictly concave function of t.

    Consider a simple case n = 2. Let λ1 = p(x1), λ2 = p(x2), t1 = q(x1)/p(x1), t2 = q(x2)/p(x2). Then

    λ1 log t1 + λ2 log t2 ≤ log(λ1 t1 + λ2 t2),

    i.e.,

    p(x1) log[q(x1)/p(x1)] + p(x2) log[q(x2)/p(x2)] ≤ log[p(x1) q(x1)/p(x1) + p(x2) q(x2)/p(x2)].

  • Slide 72/97

    Corollary: non-negativity of mutual information

    I(X; Y) ≥ 0, with I(X; Y) = 0 ⟺ X and Y are independent.

    Proof: I(X; Y) = D(p(x, y)||p(x)p(y)) ≥ 0,

    with equality if and only if p(x, y) = p(x)p(y), i.e., X and Y are independent.

  • Slide 73/97

    A Binary Source: Entropy

    Consider a binary source with (X; p(X)) = (x1, x2; p, 1 − p).

    H(X) = −∑_{x∈X} p(x) log[p(x)] = −p log2 p − (1 − p) log2(1 − p)

    When p = 0.5, H(X) = 1 bit.

    H(X) ≤ log2 |X|.
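
    The binary entropy function is easy to tabulate; a minimal Python sketch:

    from math import log2

    def h(p):
        # Binary entropy H(p) = -p*log2(p) - (1-p)*log2(1-p), in bits.
        if p in (0.0, 1.0):
            return 0.0   # 0 * log(1/0) = 0 by continuity
        return -p * log2(p) - (1 - p) * log2(1 - p)

    for p in (0.0, 0.1, 0.25, 0.5, 0.9, 1.0):
        print(p, h(p))   # maximum of 1 bit at p = 0.5, i.e. log2|X|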

  • Slide 74/97

    Theorem: uniform p.m.f. maximizes the entropy

    H(X) ≤ log |X|, with H(X) = log |X| ⟺ p(x) = u(x) = 1/|X|.

    Proof: let u(x) = 1/|X| be the uniform p.m.f. over X, and let p(x) be the p.m.f. for r.v. X. Then

    D(p(x)||u(x)) = ∑_{x∈X} p(x) log[p(x)/u(x)]
                  = ∑_{x∈X} p(x) log[1/u(x)] − ∑_{x∈X} p(x) log[1/p(x)]
                  = log |X| − H(X).

    Hence, by the non-negativity of relative entropy,

    0 ≤ D(p(x)||u(x)) = log |X| − H(X).

  • Slide 75/97

    Theorem: conditioning reduces entropy

    H(X|Y) ≤ H(X)

    Proof: 0 ≤ I(X; Y) = H(X) − H(X|Y).

    Comments: knowing another r.v. Y can only reduce the uncertainty in X. This is true only on the average.

  • Slide 76/97

    Theorem: independence bound on entropy

    H(X1, X2, ..., Xn) ≤ ∑_i H(Xi),

    with H(X1, X2, ..., Xn) = ∑_i H(Xi) ⟺ the Xi are independent of each other.

    Proof: by the chain rule for entropy, applying the theorem that conditioning reduces entropy:

    H(X1, X2, ..., Xn) = ∑_{i=1}^{n} H(Xi|Xi−1, Xi−2, ..., X1) ≤ ∑_{i=1}^{n} H(Xi)

  • Slide 77/97

    Log sum inequality: theorem

    For non-negative numbers ai and bi (i = 1, 2, ..., n),

    ∑_{i=1}^{n} ai log(ai/bi) ≥ (∑_{i=1}^{n} ai) log[(∑_{i=1}^{n} ai)/(∑_{i=1}^{n} bi)],

    with equality if and only if ai/bi = constant.
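
    A quick numerical check of the log sum inequality for arbitrary positive numbers (chosen only for illustration):

    from math import log2

    a = [1.0, 2.0, 3.0]
    b = [0.5, 1.0, 4.0]

    lhs = sum(ai * log2(ai / bi) for ai, bi in zip(a, b))
    rhs = sum(a) * log2(sum(a) / sum(b))

    print(lhs, rhs, lhs >= rhs)   # lhs >= rhs; equality only if a_i/b_i is constant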

  • Slide 78/97

    Log sum inequality: proof

    Proof (a brief sketch):

    Assume ai and bi are positive.

    Construct f(t) = t log t. The function f(t) = t log t is strictly convex for all positive t.

    Construct λi = bi/∑_j bj and ti = ai/bi.

    By Jensen's inequality, ∑_i λi f(ti) ≥ f(∑_i λi ti).

    Then we obtain the log sum inequality.

  • Slide 79/97

    Log sum inequality: elaboration

    Conventions: ai ≥ 0 and bi ≥ 0; 0 log 0 = 0, a log(a/0) = ∞ if a > 0, and 0 log(0/0) = 0.

    Construct f(t) = t log t. The function f(t) = t log t is strictly convex, since f''(t) = 1/t > 0 for all positive t.

    By Jensen's inequality, ∑_i λi f(ti) ≥ f(∑_i λi ti).

    For λi ≥ 0 with ∑_i λi = 1, set λi = bi/∑_{j=1}^{n} bj and ti = ai/bi.

    f(ti) = (ai/bi) log(ai/bi)

    ∑_{i=1}^{n} λi f(ti) = ∑_{i=1}^{n} [bi/∑_{j=1}^{n} bj] (ai/bi) log(ai/bi) = ∑_{i=1}^{n} [ai/∑_{j=1}^{n} bj] log(ai/bi)

  • Slide 80/97

    Log sum inequality: elaboration

    ∑_{i=1}^{n} λi f(ti) = ∑_{i=1}^{n} [ai/∑_{j=1}^{n} bj] log(ai/bi)

    f(∑_{i=1}^{n} λi ti) = [∑_{i=1}^{n} λi ti] log[∑_{i=1}^{n} λi ti]
                         = [∑_{i=1}^{n} (bi/∑_{j} bj)(ai/bi)] log[∑_{i=1}^{n} (bi/∑_{j} bj)(ai/bi)]
                         = [∑_{i=1}^{n} ai/∑_{j=1}^{n} bj] log[∑_{i=1}^{n} ai/∑_{j=1}^{n} bj]

  • Slide 81/97

    Log sum inequality: elaboration

    By Jensen's inequality, ∑_i λi f(ti) ≥ f(∑_i λi ti):

    ∑_{i=1}^{n} [ai/∑_{j=1}^{n} bj] log(ai/bi) ≥ [∑_{i=1}^{n} ai/∑_{j=1}^{n} bj] log[∑_{i=1}^{n} ai/∑_{j=1}^{n} bj]

    ⇒  ∑_{i=1}^{n} ai log(ai/bi) ≥ (∑_{i=1}^{n} ai) log[(∑_{i=1}^{n} ai)/(∑_{j=1}^{n} bj)]

  • Slide 82/97

    Log sum inequality: applications

    Theorem (convexity of relative entropy): D(p||q) is convex in the pair (p, q):

    D[λp1 + (1 − λ)p2 || λq1 + (1 − λ)q2] ≤ λD(p1||q1) + (1 − λ)D(p2||q2).

    Corollary: convexity of mutual information.

    Theorem (concavity of entropy): H(p) is a concave function of p.

  • Slide 83/97

    Data processing inequality

    Markov chain: random variables X, Y, Z form a Markov chain (X → Y → Z) if

    p(x, y, z) = p(x)p(y|x)p(z|y).

    Note that by the chain rule,

    p(x, y, z) = p(x)p(y, z|x) = p(x)p(y|x)p(z|y, x).

    Consequence: Markovity implies conditional independence, because

    p(x, z|y) = p(x, y, z)/p(y) = p(x, y)p(z|y)/p(y) = p(x|y)p(z|y).

  • Slide 84/97

    Data processing inequality: theorem

    If X → Y → Z, then I(X; Y) ≥ I(X; Z).

    Proof:

    I(X; Y, Z) = I(X; Z) + I(X; Y|Z) = I(X; Y) + I(X; Z|Y),

    where I(X; Z|Y) = 0 (by Markovity) and I(X; Y|Z) ≥ 0.

    Thus, we have I(X; Y) ≥ I(X; Z).

    Comments: manipulation of data cannot increase its information.
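
    A small numerical illustration of the data processing inequality, with X → Y → Z built from two cascaded binary symmetric channels (input bias and crossover probabilities are illustrative assumptions):

    from math import log2

    def H(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    def mutual_information(joint):
        # I between the row and column variables of a 2x2 joint p.m.f.
        pa = [sum(row) for row in joint]
        pb = [sum(row[j] for row in joint) for j in range(2)]
        return H(pa) + H(pb) - H([p for row in joint for p in row])

    def bsc_joint(p_in, eps):
        # Joint p.m.f. of (input, output) for a binary symmetric channel.
        return [[p_in[i] * ((1 - eps) if i == j else eps) for j in range(2)]
                for i in range(2)]

    px = [0.3, 0.7]          # assumed input distribution of X
    eps1, eps2 = 0.1, 0.2    # assumed crossover probabilities of the two channels

    joint_xy = bsc_joint(px, eps1)   # p(x, y)

    # Markov chain X -> Y -> Z: p(x, z) = sum over y of p(x, y) * p(z|y)
    joint_xz = [[sum(joint_xy[i][y] * ((1 - eps2) if y == k else eps2)
                     for y in range(2)) for k in range(2)] for i in range(2)]

    print(mutual_information(joint_xy))   # I(X; Y)
    print(mutual_information(joint_xz))   # I(X; Z), never larger than I(X; Y)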

  • Slide 85/97

    Summary: model

    Single outcome or outcome sequence

    Continuous or Discrete

  • Slide 86/97

    Summary: basic system of the simplest discrete source

    Notations: sample space: X; random variable (r.v.): X; outcome of X (realization of X): x; cardinality of the set X (the number of elements): |X|

    Probability mass function (p.m.f.): P(x) = Pr[X = x], x ∈ X; P(x, y) = Pr[X = x, Y = y], x ∈ X, y ∈ Y

  • Slide 87/97

    Summary: concept

    Self-information I(x): I(x) = −log[p(x)] = log[1/p(x)]

    Measure of uncertainty of a single outcome

    Non-negative

  • Slide 88/97

    Summary: concept

    Entropy H(X): H(X) = −E[log p(x)] = E[log(1/p(x))], i.e., H(X) = E_X[I(x)]

    Measure of uncertainty of the information source

    Non-negative

  • Slide 89/97

    Summary: concept

    Joint entropy H(X, Y)

    Conditional entropy H(X|Y) or H(Y|X)

    Chain rule: H(X, Y) = H(X) + H(Y|X)

  • Slide 90/97

    Summary: concept

    Self-information I(x): measure of uncertainty of a single outcome; non-negative.

    Entropy H(X): H(X) = E_X[I(x)]; measure of uncertainty of the information source; non-negative.

    Relative entropy D(p(x)||q(x)): measure of the dissimilarity (divergence) of two distributions; non-negative.

    Mutual information I(X; Y): I(X; Y) = D[p(x, y)||p(x)p(y)] = E_{X,Y}[I(x; y)]; measure of the divergence between the joint and product p.m.f.s; special case of relative entropy (non-negative).

  • Slide 91/97

    Summary: entropy properties

    Non-negativity: H(X) ≥ 0
    Chain rule: H(X, Y) = H(X) + H(Y|X)
    Uniform p.m.f. maximization: H(X) ≤ log(|X|)
    Conditioning reduction: H(X|Y) ≤ H(X)
    Independence bound: H(X1, X2, ..., Xn) ≤ ∑_i H(Xi)
    Concavity: H(λp(x) + (1 − λ)p'(x)) ≥ λH(p(x)) + (1 − λ)H(p'(x)); having a maximum in a given range

  • Slide 92/97

    Summary: mutual information properties

    Name and expression:

    Non-negativity: I(X; Y) ≥ 0
    Maximum: I(X; Y) = H(X) + H(Y) − H(X, Y) ≤ H(X)
    Symmetry: I(X; Y) = I(Y; X)
    Convexity: D[λp1 + (1 − λ)p2 || λq1 + (1 − λ)q2] ≤ λD(p1||q1) + (1 − λ)D(p2||q2); having a minimum in a given range

  • Slide 93/97

    Summary: entropy

    Concept, relationship, Venn diagram:

    H(X): H(X) ≥ H(X|Y); H(X) = H(X|Y) + I(X; Y); H(X) = H(X, Y) − H(Y|X)

    H(Y): H(Y) ≥ H(Y|X); H(Y) = H(Y|X) + I(X; Y); H(Y) = H(X, Y) − H(X|Y)

  • Slide 94/97

    Summary: conditional entropy

    Concept, relationship, Venn diagram:

    H(X|Y): H(X|Y) = H(X, Y) − H(Y); H(X|Y) = H(X) − I(X; Y)

    H(Y|X): H(Y|X) = H(X, Y) − H(X); H(Y|X) = H(Y) − I(X; Y)

  • Slide 95/97

    Summary: joint entropy and mutual information

    Concept, relationship, Venn diagram:

    H(X, Y): H(X, Y) = H(X) + H(Y|X); H(X, Y) = H(Y) + H(X|Y); H(X, Y) = H(X) + H(Y) − I(X; Y); H(X, Y) = H(X|Y) + H(Y|X) + I(X; Y)

    I(X; Y): I(X; Y) = H(X) − H(X|Y); I(X; Y) = H(Y) − H(Y|X); I(X; Y) = H(X, Y) − H(Y|X) − H(X|Y); I(X; Y) = H(X) + H(Y) − H(X, Y)

  • Slide 96/97

    Summary

    Entropy

    Joint and conditional entropy

    Relative entropy and mutual information

    Chain rules

    Jensen's inequality

    Log sum inequality

    Data processing inequality

  • Slide 97/97

    Reference

    T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ: J. Wiley, 2006.