Chapter02 Entropy


TRANSCRIPT

  • Slide 1/97

    Chapter 2: Entropy, Relative Entropy, and Mutual Information

    Xiaojun Hei

    Internet Technology and Engineering R&D Center
    Department of Electronics and Information Engineering

    Email: [email protected]
    Homepage: http://itec.hust.edu.cn/heixj

    Phone: 027-87544704

  • Slide 2/97

    Chapter 2: Entropy, Relative Entropy, and Mutual Information

    Entropy

    Joint and conditional entropy

    Relative entropy and mutual information

    Chain rules

    Jensen's inequality

    Log sum inequality

    Data processing inequality

  • Slide 3/97

    Block diagram of communication systems

    The transmission and processing of information in communication systems

    Data compression limit, provided by Shannon's first theorem

    Commonly-used coding algorithms for zero-error source coding

  • Slide 4/97

    Source encoder side

    The output sequence of an information source is stochastic: how to characterize it?

    "We can think of a discrete source as generating the message, symbol by symbol... a mathematical model of a system... is known as a stochastic process." (C. E. Shannon)

  • Slide 5/97

    Source coding

    A source code is a mapping C : X → D*, x ↦ C(x), where D* is the set of finite-length strings of symbols from a D-ary alphabet¹.

    Let C(x) denote the codeword corresponding to x.

    Let l(x) denote the length of C(x).

    Expected length of a source code:

    L(C) = ∑_{x∈X} p(x) l(x)

    Example: X = {red, blue}, D = {0, 1}, C(red) = 0, C(blue) = 1.

    ¹When D = 2, it is binary, in which case the alphabet is {0, 1}.
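
    As a concrete illustration of the expected length L(C), here is a minimal Python sketch; the code {red → 0, blue → 1} is the example above, while the probabilities are an assumed illustration (the slide does not specify them):

    # Expected length L(C) = sum over x of p(x) * l(x) for a source code C.
    def expected_length(pmf, code):
        # pmf: dict symbol -> probability; code: dict symbol -> codeword string
        return sum(p * len(code[x]) for x, p in pmf.items())

    pmf  = {"red": 0.3, "blue": 0.7}   # assumed probabilities, any valid p.m.f. works
    code = {"red": "0", "blue": "1"}   # C(red) = 0, C(blue) = 1, D-ary alphabet with D = 2

    print(expected_length(pmf, code))  # 1.0, since every codeword has length 1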

  • Slide 6/97

    Outcomes of the source

    Single outcome or outcome sequence

    Continuous or Discrete

  • Slide 7/97

    Modeling single outcome

    Continuous source: X ∈ R with probability density p(x), ∫_R p(x) dx = 1

    Discrete source:

    ( X    )   ( a1     a2     ...  aq    )
    ( P(X) ) = ( P(a1)  P(a2)  ...  P(aq) ),   with ∑_{i=1}^{q} P(ai) = 1

  • Slide 8/97

    Modeling outcome sequence

    Waveform source: continuous in both time and amplitude; modeled as a continuous stochastic process {x(t)}

    Sequence source: sampled from a waveform source; discrete in time or space; modeled as a stochastic sequence {Xi(ti)}

  • Slide 9/97

    Classification of sources

    Stationary: does the distribution change with time?

    Stationary source: "good" source (easy to analyze). Non-stationary source: can sometimes be simplified as a Markov source.

    Memory: are the variables in the sequence related?

    Source without memory: "good" source (easy to analyze). Source with memory: can be modeled as a Markov source.

  • Slide 10/97

    Sources studied in our course

    Motivation: we study ideal sources with "good" properties, then use them to approximate real sources.

    Discrete source

    Single-outcome discrete source; outcome-sequence discrete source (discrete stationary memoryless source, discrete stationary source with memory)

    Continuous source

    Waveform source

  • Slide 11/97

    Diagram of communication systems

    The transmission and processing of information in communication systems

  • Slide 12/97

    Information source model

    Notations: sample space: X; random variable (r.v.): X; outcome of X (realization of X): x; cardinality of the set X (the number of elements): |X|

    Probability mass function (p.m.f.): P(x) = Pr[X = x], x ∈ X; P(x, y) = Pr[X = x, Y = y], x ∈ X, y ∈ Y

  • Slide 13/97

    Information properties

    At first, we investigate the measure of information for a single outcome. It should have the following properties:

    (Property 1) The larger the measure, the more surprising the outcome is.

    (Property 2) It is a function of the probability distribution, proportional to the inverse of the probability: a non-linear mapping from probability to information ([0, 1] → [0, ∞)).

    (Property 3) The information content of two independent r.v.s is the sum of the individual information contents, which points to the logarithm of the probability.

  • Slide 14/97

    Definition

    The self-information of a realization x of r.v. X can be defined as:

    I(x) = −log[p(x)] = log(1/p(x))

    It can be proved that this is the only form satisfying the properties of information.

    The base of the logarithm can be any: base 2 gives information measured in binary units (bits); base e gives information measured in natural units (nats).

  • Slide 15/97

    Self-information: example

    Given a source with M equally likely outcomes, the self-information of each outcome is k = log2 M bits.

    This means the outcome can be described by k bits of information.

    For instance, if a source has M = 4 outcomes {a, b, c, d}, then each outcome can be described by 2 = log2 4 bits of code, such as the set {00, 01, 10, 11}:

    a → 00, b → 01, c → 10, d → 11

  • Slide 16/97

    Entropy definition

    The average information of r.v. X is called the entropy of X:

    H(X) = −∑_{x∈X} p(x) log[p(x)]

    A (convenient) measure of uncertainty of the r.v.

    Entropy is a function of the probability distribution: independent of the outcomes of the r.v. itself; only the distribution matters.

    Logarithm: the base can be any, 2 by default. Base b is sometimes marked as Hb(X). By the continuity argument, if p(x) = 0 then p(x) log(1/p(x)) = 0 · log(1/0) = 0, so zero-probability outcomes have no impact on entropy.
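
    A minimal Python sketch of the entropy definition, using the 0 · log(1/0) = 0 convention noted above:

    from math import log2

    def entropy(pmf):
        # Entropy in bits of a p.m.f. given as a list of probabilities.
        # Zero-probability terms contribute nothing, by the continuity argument.
        return -sum(p * log2(p) for p in pmf if p > 0)

    print(entropy([0.5, 0.5]))   # 1.0 bit
    print(entropy([0.25] * 4))   # 2.0 bits
    print(entropy([1.0, 0.0]))   # 0.0 bits: no uncertainty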

  • Slide 17/97

    Logarithm: properties (http://en.wikipedia.org/wiki/Logarithm)

    Product, quotient, power, and root:

    product:  log_b(xy) = log_b(x) + log_b(y);     example: log_3(243) = log_3(9 · 27) = log_3(9) + log_3(27) = 2 + 3 = 5
    quotient: log_b(x/y) = log_b(x) − log_b(y);    example: log_2(16) = log_2(64/4) = log_2(64) − log_2(4) = 6 − 2 = 4
    power:    log_b(x^p) = p · log_b(x);           example: log_2(64) = log_2(2^6) = 6 · log_2(2) = 6
    root:     log_b(x^(1/p)) = log_b(x) / p;       example: log_10(√1000) = (1/2) · log_10(1000) = 3/2 = 1.5

    Change of base: log_b(x) = log_k(x) / log_k(b).

    Derivative and antiderivative:

    d/dx log_b(x) = 1 / (x ln(b)),   d/dx ln(f(x)) = f'(x) / f(x),   ∫ ln(x) dx = x ln(x) − x + C.

    Integral representation of the natural logarithm: ln(t) = ∫_1^t (1/x) dx.

  • Slide 18/97

    Logarithm: functions (http://en.wikipedia.org/wiki/Logarithm)

    The graph of the logarithm to base 2 crosses the x axis (horizontal axis) at 1 and passes through the points with coordinates (2, 1), (4, 2), and (8, 3).

  • Slide 19/97

    Logarithm: functions (http://en.wikipedia.org/wiki/Logarithm)

    The graph of the logarithm function log_b(x) (blue) is obtained by reflecting the graph of the function b^x (red) at the diagonal line (x = y).

  • Slide 20/97

    Logarithm: functions (http://en.wikipedia.org/wiki/Logarithm)

    The graph of the natural logarithm (green) and its tangent at x = 1.5 (black).

  • Slide 21/97

    Entropy: basic properties

    Entropy is the expected value of self-information:

    H(X) = E{−log[p(x)]} = E{log(1/p(x))}

    Entropy H(X) is non-negative:

    0 ≤ p(x) ≤ 1  ⇒  log(1/p(x)) ≥ 0  ⇒  H(X) = E{log(1/p(x))} ≥ 0

    Change of bases: Hb(X) = [log_b a] · Ha(X), since log_b p = [log_b a] · log_a p.

  • Slide 22/97

    Example #1: entropy of uniform r.v.

    Consider a uniform random variable with M = 32 = 2^5 possible outcomes.

    Then 5 bits are sufficient to describe each outcome.

    The entropy of this r.v. is

    H(X) = −∑_{i=1}^{32} p(x) log2[p(x)] = −∑_{i=1}^{32} (1/32) log2(1/32) = 5 bits

    Entropy agrees with log2 M (H(X) = log2 M).

  • Slide 23/97

    Example #2: entropy of non-uniform r.v.

    Consider a non-uniform random variable with M = 8 = 2^3 possible outcomes and probabilities (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64).

    Then 3 bits are sufficient to describe each outcome.

    The entropy of this r.v. is

    H(X) = −(1/2) log2(1/2) − (1/4) log2(1/4) − (1/8) log2(1/8) − (1/16) log2(1/16) − (4/64) log2(1/64) = 2 bits

    Entropy and log2(M) disagree (H(X) < log2 M).

    The average length of the source code words can be made shorter than 3.

    Example: (0, 10, 110, 1110, 111100, 111101, 111110, 111111)
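
    The numbers in this example can be checked with a short Python sketch: the entropy of the eight-outcome p.m.f. is 2 bits, and the expected length of the listed code (codeword lengths 1, 2, 3, 4, 6, 6, 6, 6) is also 2 bits, below the 3 bits of a fixed-length code:

    from math import log2

    pmf  = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
    code = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

    H = -sum(p * log2(p) for p in pmf)               # entropy of the source
    L = sum(p * len(c) for p, c in zip(pmf, code))   # expected codeword length

    print(H, L)   # 2.0 2.0 -> both equal 2 bits, less than log2(8) = 3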

  • Slide 24/97

    Definition

    Joint entropy

    H(X, Y) = −∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(x, y)] = −E{log[p(x, y)]}

    Conditional entropy

    H(Y|X) = ∑_{x∈X} p(x) H(Y|X = x)
           = ∑_{x∈X} p(x) ∑_{y∈Y} p(y|x) log[1/p(y|x)]
           = −∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(y|x)]
           = −E{log[p(y|x)]}

    Note that in general H(Y|X) ≠ H(X|Y).

  • Slide 25/97

    Venn Diagram

  • Slide 26/97

    Motivation

    |A − B| = √[(xA − xB)² + (yA − yB)²]

    |I_B − I_A| = log(1/q(x)) − log(1/p(x)) = log[p(x)/q(x)]

    Average: ∑_{x∈X} p(x) log[p(x)/q(x)]

  • Slide 27/97

    Relative entropy (or Kullback-Leibler distance)

    Definition: a measure of the information distance or the informational divergence between two p.m.f.s, p(x) and q(x):

    D(p(x)||q(x)) = ∑_{x∈X} p(x) log[p(x)/q(x)] = E_p{log[p(X)/q(X)]}

    When p(x) is the true p.m.f. of X, this measures the inefficiency of assuming that q(x) is the p.m.f. of X.

    It is distance-like in many respects. It is not a true distance, since it:

    Is not symmetric: D(p||q) vs. D(q||p)

    Does not satisfy the triangle inequality: D(p||q) + D(q||r) vs. D(p||r)

  • Slide 28/97

    Geometric space

    Characterize the position of any point with coordinates

    Compute distances using coordinates

    Symmetric distance: |P1, P2| = |P2, P1|

    Triangle inequality: |P1, P2| + |P2, P3| ≥ |P1, P3|

  • Slide 29/97

    Relative entropy: example

    Let x ∈ X = {0, 1}, p(0) = 1 − r, p(1) = r, q(0) = 1 − s, q(1) = s.

    D(p(x)||q(x)) = p(0) log[p(0)/q(0)] + p(1) log[p(1)/q(1)] = (1 − r) log[(1 − r)/(1 − s)] + r log(r/s)

    D(q(x)||p(x)) = q(0) log[q(0)/p(0)] + q(1) log[q(1)/p(1)] = (1 − s) log[(1 − s)/(1 − r)] + s log(s/r)

    If r = s, D(p||q) = D(q||p) = 0.

    If r ≠ s, such as r = 1/2, s = 1/4: D(p||q) = 0.2075 bits, D(q||p) = 0.1887 bits.

    Thus, in general D(p||q) ≠ D(q||p).
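
    A short Python sketch reproduces the asymmetry shown above for r = 1/2, s = 1/4:

    from math import log2

    def kl(p, q):
        # Relative entropy D(p||q) in bits; assumes q(x) > 0 wherever p(x) > 0.
        return sum(px * log2(px / qx) for px, qx in zip(p, q) if px > 0)

    r, s = 1/2, 1/4
    p = [1 - r, r]
    q = [1 - s, s]

    print(kl(p, q))   # 0.2075... bits
    print(kl(q, p))   # 0.1887... bits, so D(p||q) != D(q||p)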

  • Slide 30/97

    Mutual information

    Things are commonly related; two random variables are usually related.

    From an information perspective, how do we characterize the relationship between r.v. X and r.v. Y?

    Observing X alone, the information of X is H(X). Knowing Y, the information of X becomes H(X|Y). Knowing Y, the information of X is therefore reduced by H(X) − H(X|Y). This reduction is the uncertainty about X that is removed by knowing Y.

  • Slide 31/97

    Mutual information

    Mutual information is the relative entropy between the joint distribution and the product distribution of two random variables X, Y:

    I(X; Y) = D[p(x, y)||p(x)p(y)]
            = ∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(x, y)/(p(x)p(y))]
            = E_{(X,Y)}{log[p(X, Y)/(p(X)p(Y))]}

    Measure of the information one random variable (say, X) contains about the other (Y).

    Special cases: if X and Y are independent, I(X; Y) = 0; if Y = X, I(X; X) = H(X).

  • Slide 32/97

    Mutual information

    Conditional relative entropy

    D(p(y|x)||q(y|x)) = ∑_{x∈X} p(x) ∑_{y∈Y} p(y|x) log[p(y|x)/q(y|x)]

    Conditional mutual information

    I(X; Y|Z) = ∑_{x∈X} ∑_{y∈Y} ∑_{z∈Z} p(x, y, z) log[p(x, y|z)/(p(x|z)p(y|z))]
              = E_{p(x,y,z)}{log[p(X, Y|Z)/(p(X|Z)p(Y|Z))]}

  • Slide 33/97

    Mutual information vs. Entropy

    I(X; Y) = H(X) − H(X|Y)

    Proof:

    I(X; Y) = ∑_x ∑_y p(x, y) log[p(x, y)/(p(x)p(y))]
            = ∑_x ∑_y p(x, y) log[p(x|y)/p(x)]
            = −∑_x ∑_y p(x, y) log[p(x)] + ∑_x ∑_y p(x, y) log[p(x|y)]
            = −∑_x p(x) log[p(x)] + ∑_x ∑_y p(x, y) log[p(x|y)]
            = H(X) − H(X|Y)

  • Slide 34/97

    Mutual information vs. Entropy

    Expressions:

    I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = I(Y; X)
    I(X; Y) = H(X) + H(Y) − H(X, Y)
    I(X; X) = H(X)
    I(X; Y|Z) = H(X|Z) − H(X|Y, Z)

    Venn Diagram

  • Slide 35/97

    Example #1

    Joint p.m.f.:

    Y \ X    1      2      3      4      p(y)
    1        1/8    1/16   1/32   1/32   1/4
    2        1/16   1/8    1/32   1/32   1/4
    3        1/16   1/16   1/16   1/16   1/4
    4        1/4    0      0      0      1/4
    p(x)     1/2    1/4    1/8    1/8

    What is H(X), H(Y), H(X|Y), H(Y|X), H(X, Y), I(X; Y)?

  • Slide 36/97

    Solution of example #1

    H(X) = −∑_{x∈X} p(x) log[p(x)] = H(1/2, 1/4, 1/8, 1/8)
         = −[(1/2) log(1/2) + (1/4) log(1/4) + (1/8) log(1/8) + (1/8) log(1/8)]
         = 1.75 bits

    H(Y) = −∑_{y∈Y} p(y) log[p(y)] = H(1/4, 1/4, 1/4, 1/4)
         = −[(1/4) log(1/4) + (1/4) log(1/4) + (1/4) log(1/4) + (1/4) log(1/4)]
         = 2 bits

  • Slide 37/97

    Solution of example #1

    H(X|Y) = ∑_{y∈Y} p(y) H(X|Y = y)
           = ∑_{y∈Y} p(y) ∑_{x∈X} p(x|y) log[1/p(x|y)]
           = −∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(x|y)]
           = −∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(x, y)/p(y)]
           = −[(1/8) log((1/8)/(1/4)) + (1/16) log((1/16)/(1/4)) + (1/32) log((1/32)/(1/4)) + (1/32) log((1/32)/(1/4))
              + (1/16) log((1/16)/(1/4)) + (1/8) log((1/8)/(1/4)) + (1/32) log((1/32)/(1/4)) + (1/32) log((1/32)/(1/4))
              + (1/16) log((1/16)/(1/4)) + (1/16) log((1/16)/(1/4)) + (1/16) log((1/16)/(1/4)) + (1/16) log((1/16)/(1/4))
              + (1/4) log((1/4)/(1/4)) + 0 log(0/(1/4)) + 0 log(0/(1/4)) + 0 log(0/(1/4))]
           = 1.375 bits

  • Slide 38/97

    Solution of example #1

    H(Y|X) = ∑_{x∈X} p(x) H(Y|X = x)
           = ∑_{x∈X} p(x) ∑_{y∈Y} p(y|x) log[1/p(y|x)]
           = −∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(y|x)]
           = −∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(x, y)/p(x)]
           = −[(1/8) log((1/8)/(1/2)) + (1/16) log((1/16)/(1/4)) + (1/32) log((1/32)/(1/8)) + (1/32) log((1/32)/(1/8))
              + (1/16) log((1/16)/(1/2)) + (1/8) log((1/8)/(1/4)) + (1/32) log((1/32)/(1/8)) + (1/32) log((1/32)/(1/8))
              + (1/16) log((1/16)/(1/2)) + (1/16) log((1/16)/(1/4)) + (1/16) log((1/16)/(1/8)) + (1/16) log((1/16)/(1/8))
              + (1/4) log((1/4)/(1/2)) + 0 log(0/(1/4)) + 0 log(0/(1/8)) + 0 log(0/(1/8))]
           = 1.625 bits

  • Slide 39/97

    Solution of example #1

    H(X, Y) = −∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(x, y)]
            = −[(1/8) log(1/8) + (1/16) log(1/16) + (1/32) log(1/32) + (1/32) log(1/32)
               + (1/16) log(1/16) + (1/8) log(1/8) + (1/32) log(1/32) + (1/32) log(1/32)
               + (1/16) log(1/16) + (1/16) log(1/16) + (1/16) log(1/16) + (1/16) log(1/16)
               + (1/4) log(1/4) + 0 log 0 + 0 log 0 + 0 log 0]
            = 3.375 bits

    Note that H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) by observation in this example.

  • Slide 40/97

    Solution of example #1

    Method 1:

    I(X; Y) = H(X) − H(X|Y) = 1.75 − 1.375 = 0.375 bits
    I(X; Y) = H(Y) − H(Y|X) = 2 − 1.625 = 0.375 bits

    Method 2:

    I(X; Y) = D[p(x, y)||p(x)p(y)] = ∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(x, y)/(p(x)p(y))]
            = (1/8) log[(1/8)/((1/2)(1/4))] + (1/16) log[(1/16)/((1/4)(1/4))] + (1/32) log[(1/32)/((1/8)(1/4))] + (1/32) log[(1/32)/((1/8)(1/4))]
            + (1/16) log[(1/16)/((1/2)(1/4))] + (1/8) log[(1/8)/((1/4)(1/4))] + (1/32) log[(1/32)/((1/8)(1/4))] + (1/32) log[(1/32)/((1/8)(1/4))]
            + (1/16) log[(1/16)/((1/2)(1/4))] + (1/16) log[(1/16)/((1/4)(1/4))] + (1/16) log[(1/16)/((1/8)(1/4))] + (1/16) log[(1/16)/((1/8)(1/4))]
            + (1/4) log[(1/4)/((1/2)(1/4))] + 0 log 0 + 0 log 0 + 0 log 0
            = 0.375 bits
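
    All of the quantities in this example can be verified with a short Python sketch that works directly from the joint p.m.f. table (rows indexed by y, columns by x):

    from math import log2

    # Joint p.m.f. from the example: P[j][i] = p(X = i+1, Y = j+1).
    P = [[1/8, 1/16, 1/32, 1/32],
         [1/16, 1/8, 1/32, 1/32],
         [1/16, 1/16, 1/16, 1/16],
         [1/4, 0, 0, 0]]

    def H(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    px = [sum(row[i] for row in P) for i in range(4)]   # marginal p(x) = (1/2, 1/4, 1/8, 1/8)
    py = [sum(row) for row in P]                        # marginal p(y) = (1/4, 1/4, 1/4, 1/4)

    HX, HY = H(px), H(py)
    HXY = H([p for row in P for p in row])              # joint entropy H(X, Y)
    HX_given_Y = HXY - HY                               # by the chain rule
    HY_given_X = HXY - HX
    I = HX + HY - HXY                                   # mutual information

    print(HX, HY, HXY)              # 1.75 2.0 3.375
    print(HX_given_Y, HY_given_X)   # 1.375 1.625
    print(I)                        # 0.375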

  • Slide 41/97

    The chain rule: motivation

    In calculus, the chain rule is a formula for computing the derivative of the composition of two or more functions.

    Let y = f(u) and u = g(x). Then

    [f(g(x))]' = f'(g(x)) g'(x),   i.e.,   dy/dx = (dy/du)(du/dx)

    In information theory, the chain rule is a formula for computing the entropies of the composition of two or more random variables.

  • Slide 42/97

    The chain rule

    H(X, Y) = H(X) + H(Y|X)

    Proof:

    H(X, Y) = −∑_x ∑_y p(x, y) log[p(x, y)]
            = −∑_x ∑_y p(x, y) log[p(x)p(y|x)]
            = −∑_x ∑_y p(x, y) log[p(x)] − ∑_x ∑_y p(x, y) log[p(y|x)]
            = −∑_x p(x) log[p(x)] − ∑_x ∑_y p(x, y) log[p(y|x)]
            = H(X) + H(Y|X)

    Corollary: H(X, Y|Z) = H(X|Z) + H(Y|X, Z)

  • Slide 43/97

    Example #2

    Joint p.m.f.:

    Y \ X    1      2      3      4      p(y)
    1        1/8    1/16   1/32   1/32   1/4
    2        1/16   1/8    1/32   1/32   1/4
    3        1/16   1/16   1/16   1/16   1/4
    4        1/4    0      0      0      1/4
    p(x)     1/2    1/4    1/8    1/8

    What is H(X), H(Y), H(X|Y), H(Y|X), H(X, Y)?

  • Slide 44/97

    Compute H(X), H(Y)

    H(X) = H(1/2, 1/4, 1/8, 1/8)
         = −∑_i p(X = i) log p(X = i)
         = −[(1/2) log(1/2) + (1/4) log(1/4) + (1/8) log(1/8) + (1/8) log(1/8)]
         = 1.75 bits

    H(Y) = H(1/4, 1/4, 1/4, 1/4)
         = −∑_i p(Y = i) log p(Y = i)
         = −[(1/4) log(1/4) + (1/4) log(1/4) + (1/4) log(1/4) + (1/4) log(1/4)]
         = 2 bits

  • Slide 45/97

    Compute H(X|Y)

    H(X|Y) = ∑_j p(Y = j) H(X|Y = j)
           = −∑_j p(Y = j) ∑_i p(X = i|Y = j) log[p(X = i|Y = j)]
           = −∑_i ∑_j p(X = i, Y = j) log[p(X = i, Y = j)/p(Y = j)]
           = −[(1/8) log((1/8)/(1/4)) + (1/16) log((1/16)/(1/4)) + (1/16) log((1/16)/(1/4)) + (1/4) log((1/4)/(1/4))
              + (1/16) log((1/16)/(1/4)) + (1/8) log((1/8)/(1/4)) + (1/16) log((1/16)/(1/4)) + 0 log(0/(1/4))
              + (1/32) log((1/32)/(1/4)) + (1/32) log((1/32)/(1/4)) + (1/16) log((1/16)/(1/4)) + 0 log(0/(1/4))
              + (1/32) log((1/32)/(1/4)) + (1/32) log((1/32)/(1/4)) + (1/16) log((1/16)/(1/4)) + 0 log(0/(1/4))]
           = 1.375 bits

  • Slide 46/97

    Compute H(Y|X)

    H(Y|X) = ∑_i p(X = i) H(Y|X = i)
           = −∑_i p(X = i) ∑_j p(Y = j|X = i) log[p(Y = j|X = i)]
           = −∑_i ∑_j p(X = i, Y = j) log[p(X = i, Y = j)/p(X = i)]
           = −[(1/8) log((1/8)/(1/2)) + (1/16) log((1/16)/(1/2)) + (1/16) log((1/16)/(1/2)) + (1/4) log((1/4)/(1/2))
              + (1/16) log((1/16)/(1/4)) + (1/8) log((1/8)/(1/4)) + (1/16) log((1/16)/(1/4)) + 0 log(0/(1/4))
              + (1/32) log((1/32)/(1/8)) + (1/32) log((1/32)/(1/8)) + (1/16) log((1/16)/(1/8)) + 0 log(0/(1/8))
              + (1/32) log((1/32)/(1/8)) + (1/32) log((1/32)/(1/8)) + (1/16) log((1/16)/(1/8)) + 0 log(0/(1/8))]
           = 1.625 bits

  • Slide 47/97

    Compute H(X, Y)

    H(X) = H(1/2, 1/4, 1/8, 1/8) = 1.75 bits
    H(Y) = H(1/4, 1/4, 1/4, 1/4) = 2 bits
    H(X|Y) = ∑_i Pr(Y = i) H(X|Y = i) = 1.375 bits
    H(Y|X) = ∑_i Pr(X = i) H(Y|X = i) = 1.625 bits

    H(X, Y) = H(X) + H(Y|X) = 1.75 + 1.625 = 3.375 bits (chain rule)

    H(X) − H(X|Y) = 1.75 − 1.375 = 0.375 bits
    H(Y) − H(Y|X) = 2 − 1.625 = 0.375 bits

    I(X; Y) = H(X) − H(X|Y)
    I(X; Y) = H(Y) − H(Y|X)

  • Slide 48/97

    Chain rules

    Chain rules can be derived by repeated applications of the two-variable expansion rule H(X, Y) = H(X) + H(Y|X).

    Entropy: H(X1, X2, ..., Xn) = ∑_{i=1}^{n} H(Xi|Xi−1, Xi−2, ..., X1)

    Mutual information: I(X1, X2, ..., Xn; Y) = ∑_{i=1}^{n} I(Xi; Y|Xi−1, Xi−2, ..., X1)

    Relative entropy: D(p(x, y)||q(x, y)) = D(p(x)||q(x)) + D(p(y|x)||q(y|x))

  • Slide 49/97

    Chain rule examples

    H(X1, X2, X3) = ∑_{i=1}^{3} H(Xi|Xi−1, Xi−2, ..., X1)
                  = H(X1) + H(X2|X1) + H(X3|X2, X1)

    I(X1, X2, X3; Y) = ∑_{i=1}^{3} I(Xi; Y|Xi−1, Xi−2, ..., X1)
                     = I(X1; Y) + I(X2; Y|X1) + I(X3; Y|X2, X1)

  • Slide 50/97

    Conditional entropies in communication systems

    System model: the source sends r.v. X, the destination receives r.v. Y. Realization of X (or Y) is xi (or yi).

    How much information is transferred from the source to the destination?

    Options: H(X), H(Y), H(X, Y), H(X|Y), H(Y|X), I(X; Y)

    I(X; Y) = ∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(x, y)/(p(x)p(y))]
            = ∑_{x∈X} ∑_{y∈Y} p(x)p(y|x) log[p(y|x)/p(y)]

    I(X; Y) is a function of the input p(x) and the channel characteristics p(y|x). Channel capacity: C = max_{p(x)} I(X; Y)

  • Slide 51/97

    H(X|Y) in communication systems

    Ideally, H(X) should be transmitted from the source to the destination.

    H(X) = H(X|Y) + I(X; Y)
    I(X; Y) = H(X) − H(X|Y)

    At the destination, after Y is received, there still exists average uncertainty about the source X due to the transmission distortion in the channel.

    H(X|Y): loss entropy

  • Slide 52/97

    H(Y|X) in communication systems

    Ideally, if there were no noise in the channel, there would exist a deterministic relationship between the sender and the receiver.

    H(Y) = I(Y; X) + H(Y|X)
    I(Y; X) = H(Y) − H(Y|X)

    At the source, after X is sent, there still exists average uncertainty about the destination Y due to the channel noise.

    H(Y|X): noise entropy

  • Slide 53/97

    Mutual information of a realization: at the destination

    Prior probability p(xi): uncertainty about xi without receiving yj

    Posterior probability p(xi|yj): uncertainty about xi after receiving yj

    I(xi; yj): amount of uncertainty reduced by receiving yj

    I(xi; yj) = I(xi) − I(xi|yj) = log[1/p(xi)] − log[1/p(xi|yj)] = log[p(xi|yj)/p(xi)]

  • Slide 54/97

    Mutual information of a realization: at the source

    Prior probability p(yj): uncertainty about yj without sending xi

    Posterior probability p(yj|xi): uncertainty about yj after sending xi

    I(yj; xi): amount of uncertainty reduced by sending xi

    I(yj; xi) = I(yj) − I(yj|xi) = log[1/p(yj)] − log[1/p(yj|xi)] = log[p(yj|xi)/p(yj)]

  • Slide 55/97

    I(xi; yj) vs. I(yj; xi)

    I(xi; yj) = I(yj; xi)

    Proof:

    I(xi; yj) = log[p(xi|yj)/p(xi)]
              = log[p(xi|yj)p(yj)/(p(xi)p(yj))]
              = log[p(xi, yj)/(p(xi)p(yj))]
              = log[p(yj|xi)/p(yj)]
              = I(yj; xi)

  • Slide 56/97

    Mutual information of a realization: system view

    Before communication, X and Y are considered to be statistically independent:

    p(xi, yj) = p(xi)p(yj)
    I_before(xi, yj) = log[1/p(xi, yj)] = log[1/(p(xi)p(yj))] = log[1/p(xi)] + log[1/p(yj)] = I(xi) + I(yj)

    After communication, X and Y are related due to the channel characteristics:

    p(xi, yj) = p(xi)p(yj|xi) = p(yj)p(xi|yj)
    I_after(xi, yj) = log[1/p(xi, yj)] = I(xi, yj)

    I(xi; yj) is the reduction of uncertainty before and after communication:

    I(xi; yj) = I_before(xi, yj) − I_after(xi, yj) = I(xi) + I(yj) − I(xi, yj)

  • Slide 57/97

    Mutual information of a realization: equivalency

    At the destination: I(xi; yj) = I(xi) − I(xi|yj).

    At the source: I(yj; xi) = I(yj) − I(yj|xi).

    From the system view: I(xi; yj) = I(xi) + I(yj) − I(xi, yj).

    I(xi, yj) = log[1/p(xi, yj)] = log[1/(p(xi)p(yj|xi))] = log[1/p(xi)] + log[1/p(yj|xi)] = I(xi) + I(yj|xi)

    I(xi; yj) = I(xi) + I(yj) − I(xi, yj) = I(xi) + I(yj) − [I(xi) + I(yj|xi)] = I(yj) − I(yj|xi)

    I(yj, xi) = log[1/p(yj, xi)] = log[1/(p(yj)p(xi|yj))] = log[1/p(yj)] + log[1/p(xi|yj)] = I(yj) + I(xi|yj)

    I(xi; yj) = I(xi) + I(yj) − I(xi, yj) = I(xi) + I(yj) − [I(yj) + I(xi|yj)] = I(xi) − I(xi|yj)

  • Slide 58/97

    Mutual information in communication systems

    Mutual information of a realization at the micro level:

    I(xi; yj) = log[p(xi|yj)/p(xi)] = log[1/p(xi)] − log[1/p(xi|yj)]

    At the destination: I(xi; yj) = I(xi) − I(xi|yj)
    At the source: I(yj; xi) = I(yj) − I(yj|xi)
    From the system view: I(xi; yj) = I(xi) + I(yj) − I(xi, yj)

    Mutual information at the macro level:

    I(X; Y) = D[p(x, y)||p(x)p(y)] = ∑_{x∈X} ∑_{y∈Y} p(x, y) log[p(x, y)/(p(x)p(y))]

    I(X; Y) = I(Y; X)

  • Slide 59/97

    I(xi; yj) vs. I(X; Y)

    I(xi; yj) = log[p(xi, yj)/(p(xi)p(yj))]

    I(X; Y) = D[p(x, y)||p(x)p(y)]
            = ∑_{xi∈X} ∑_{yj∈Y} p(xi, yj) log[p(xi, yj)/(p(xi)p(yj))]
            = ∑_{xi∈X} ∑_{yj∈Y} p(xi, yj) I(xi; yj)
            = E_{X,Y}[I(x; y)]

  • Slide 60/97

    An example communication system

    Given a discrete source (X; p(X)) = (x1, x2; 0.2, 0.8), the output messages pass through a noisy channel; the received messages are modeled using Y = [y1, y2].

    Self-information of event x1: I(x1) = log[1/p(x1)] = 2.322 bits

    p(y1) = ∑_{xi} p(xi) p(y1|xi) = 0.335

    I(x1; y1) = log2[p(y1|x1)/p(y1)] = log2[(7/8)/0.335] = 1.39 bits

    I(x1; y2) = log2[p(y2|x1)/p(y2)] = log2[(1/8)/0.665] ≈ −2.41 bits
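
    A small Python sketch reproduces these numbers. The transition probability p(y1|x1) = 7/8 is read off the slide; p(y1|x2) = 0.2 is an inferred assumption chosen so that p(y1) = 0.335 as stated, since the full channel matrix is not shown:

    from math import log2

    px = [0.2, 0.8]              # p(x1), p(x2) from the slide
    p_y_given_x = [[7/8, 1/8],   # p(y1|x1), p(y2|x1): given transition row
                   [0.2, 0.8]]   # p(y1|x2), p(y2|x2): assumed, consistent with p(y1) = 0.335

    py = [sum(px[i] * p_y_given_x[i][j] for i in range(2)) for j in range(2)]

    print(log2(1 / px[0]))                    # I(x1)     ~ 2.322 bits
    print(py[0])                              # p(y1)     = 0.335
    print(log2(p_y_given_x[0][0] / py[0]))    # I(x1; y1) ~ 1.39 bits
    print(log2(p_y_given_x[0][1] / py[1]))    # I(x1; y2) ~ -2.41 bits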

  • Slide 61/97

    Motivation

    Jensen's inequality builds on convexity.

  • Slide 62/97

    Jensen's inequality: preview

    It is used very widely in information theory.

    Most basic theorems are proved based on Jensen's inequality.

    Preview: if f is a convex function, then E[f(X)] ≥ f(E[X]).

  • Slide 63/97

    What is convexity? Convex functions lie below any chord.

    (Also called: concave upwards, concave up, convex cup.)

    Function f(x) is convex over (a, b) if

    ∀ x1, x2 ∈ (a, b), 0 ≤ λ ≤ 1:  f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2)

    Function f(x) is strictly convex over (a, b) if it is convex and

    ∀ x1, x2 ∈ (a, b), 0 ≤ λ ≤ 1:  f(λx1 + (1 − λ)x2) = λf(x1) + (1 − λ)f(x2)  ⇒  λ = 0 or λ = 1

  • Slide 64/97

    What is concavity?

    Concave functions lie above any chord.

    Function f(x) is concave over (a, b) if

    ∀ x1, x2 ∈ (a, b), 0 ≤ λ ≤ 1:  f(λx1 + (1 − λ)x2) ≥ λf(x1) + (1 − λ)f(x2)

    Function f(x) is strictly concave over (a, b) if it is concave and

    ∀ x1, x2 ∈ (a, b), 0 ≤ λ ≤ 1:  f(λx1 + (1 − λ)x2) = λf(x1) + (1 − λ)f(x2)  ⇒  λ = 0 or λ = 1

  • Slide 65/97

    Examples

    Test of convexity and concavity: if a function f(x) has a second derivative f''(x) that is non-negative (positive) everywhere, then f(x) is convex (strictly convex).

    Examples of convex and concave functions

  • Slide 66/97

    Jensen's inequality: proof

    If f is convex, then for r.v. X, E[f(X)] ≥ f(E[X]).

    If f is strictly convex, equality implies X = E[X] with probability 1.

    Sketch of the proof: we prove this for discrete distributions by mathematical induction² on the number of mass points.

    For n = 2, the inequality becomes p1 f(x1) + p2 f(x2) ≥ f(p1 x1 + p2 x2). It holds by convexity.

    Suppose the theorem is true for distributions with n − 1 mass points:

    ∑_{i=1}^{n−1} qi f(xi) ≥ f(∑_{i=1}^{n−1} qi xi)

    Then, prove the inequality holds for n.

    ²http://en.wikipedia.org/wiki/Mathematical_induction

  • Slide 67/97

    Jensen's inequality: proof

    If f is convex, then for r.v. X, E[f(X)] ≥ f(E[X]).

    If f is strictly convex, equality implies X = E[X] with probability 1.

    E[f(X)] = ∑_{i=1}^{n} pi f(xi) = pn f(xn) + ∑_{i=1}^{n−1} pi f(xi)
            = pn f(xn) + (1 − pn) ∑_{i=1}^{n−1} [pi/(1 − pn)] f(xi)
            ≥ pn f(xn) + (1 − pn) f(∑_{i=1}^{n−1} [pi/(1 − pn)] xi)
            ≥ f(pn xn + (1 − pn) ∑_{i=1}^{n−1} [pi/(1 − pn)] xi)
            = f(∑_{i=1}^{n} pi xi)
            = f(E[X])
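
    A tiny numerical illustration of Jensen's inequality with the convex function f(t) = t², on an arbitrary (assumed) three-point distribution:

    xs = [1.0, 2.0, 5.0]   # mass points (illustrative)
    ps = [0.2, 0.5, 0.3]   # probabilities (illustrative)

    f = lambda t: t * t    # a convex function

    E_X  = sum(p * x for p, x in zip(ps, xs))
    E_fX = sum(p * f(x) for p, x in zip(ps, xs))

    print(E_fX, f(E_X), E_fX >= f(E_X))   # 9.7 7.29 True: E[f(X)] >= f(E[X])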

  • Slide 68/97

    Relative-entropy properties

    We can use Jensen's inequality to prove some of the properties of relative entropy.

    Theorem (information inequality): let p(x), q(x), x ∈ X, be two p.m.f.s. Then

    D(p(x)||q(x)) ≥ 0, with D(p(x)||q(x)) = 0 ⟺ p(x) = q(x).

    Corollary (non-negativity of mutual information):

    I(X; Y) ≥ 0, with I(X; Y) = 0 ⟺ X and Y are independent.

  • Slide 69/97

    Entropy properties proved by Jensen's inequality

    Theorem (uniform p.m.f. maximizes the entropy):

    H(X) ≤ log(|X|), with H(X) = log(|X|) ⟺ p(x) = 1/|X| for all x.

    Theorem (conditioning reduces entropy):

    H(X|Y) ≤ H(X)

    Theorem (independence bound on entropy):

    H(X1, X2, ..., Xn) ≤ ∑_i H(Xi), with H(X1, X2, ..., Xn) = ∑_i H(Xi) ⟺ the Xi are independent of each other.
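
    The first two theorems can be checked numerically on the joint p.m.f. of Example #1; a minimal sketch:

    from math import log2

    P = [[1/8, 1/16, 1/32, 1/32],
         [1/16, 1/8, 1/32, 1/32],
         [1/16, 1/16, 1/16, 1/16],
         [1/4, 0, 0, 0]]   # p(x, y) from Example #1

    def H(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    px = [sum(row[i] for row in P) for i in range(4)]
    py = [sum(row) for row in P]
    HX = H(px)
    HX_given_Y = H([p for row in P for p in row]) - H(py)   # H(X|Y) = H(X,Y) - H(Y)

    print(HX, "<=", log2(4))       # 1.75 <= 2.0 : uniform p.m.f. maximizes entropy
    print(HX_given_Y, "<=", HX)    # 1.375 <= 1.75: conditioning reduces entropy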

  • Slide 70/97

    Information inequality

    [Theorem] Let p(x), q(x), x ∈ X, be two p.m.f.s. Then

    D(p(x)||q(x)) ≥ 0, with D(p(x)||q(x)) = 0 ⟺ p(x) = q(x).

    Proof: let A = {x : p(x) > 0} be the support set of p(x). Then

    −D(p(x)||q(x)) = −∑_{x∈A} p(x) log[p(x)/q(x)]
                   = ∑_{x∈A} p(x) log[q(x)/p(x)]
                   ≤ log ∑_{x∈A} p(x) q(x)/p(x)    (by Jensen's inequality)
                   = log ∑_{x∈A} q(x)
                   ≤ log ∑_{x∈X} q(x)
                   = log 1 = 0

  • Slide 71/97

    Note on information inequality

    ∑_{x∈A} p(x) log[q(x)/p(x)] ≤ log ∑_{x∈A} p(x) q(x)/p(x)

    Note that y = log(t) is a strictly concave function of t.

    Consider a simple case n = 2. Let λ1 = p(x1), λ2 = p(x2), t1 = q(x1)/p(x1), t2 = q(x2)/p(x2). Then

    λ1 log t1 + λ2 log t2 ≤ log(λ1 t1 + λ2 t2),

    i.e.,

    p(x1) log[q(x1)/p(x1)] + p(x2) log[q(x2)/p(x2)] ≤ log[p(x1) q(x1)/p(x1) + p(x2) q(x2)/p(x2)].

  • Slide 72/97

    Corollary: non-negativity of mutual information

    I(X; Y) ≥ 0, with I(X; Y) = 0 ⟺ X and Y are independent.

    Proof: I(X; Y) = D(p(x, y)||p(x)p(y)) ≥ 0,

    with equality if and only if p(x, y) = p(x)p(y), i.e., X and Y are independent.

  • Slide 73/97

    A Binary Source: Entropy

    Consider a binary source with (X; p(X)) = (x1, x2; p, 1 − p).

    H(X) = −∑_{x∈X} p(x) log[p(x)] = −p log2 p − (1 − p) log2(1 − p)

    When p = 0.5, H(X) = 1 bit.

    H(X) ≤ log2 |X|.
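
    The binary entropy function is easy to tabulate; a minimal Python sketch:

    from math import log2

    def h(p):
        # Binary entropy H(p) = -p*log2(p) - (1-p)*log2(1-p), in bits.
        if p in (0.0, 1.0):
            return 0.0   # 0 * log(1/0) = 0 by continuity
        return -p * log2(p) - (1 - p) * log2(1 - p)

    for p in (0.0, 0.1, 0.25, 0.5, 0.9, 1.0):
        print(p, h(p))   # maximum of 1 bit at p = 0.5, i.e. log2|X|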

  • Slide 74/97

    Theorem: uniform p.m.f. maximizes the entropy

    H(X) ≤ log |X|, with H(X) = log |X| ⟺ p(x) = u(x) = 1/|X|.

    Proof: let u(x) = 1/|X| be the uniform p.m.f. over X, and let p(x) be the p.m.f. for r.v. X. Then

    D(p(x)||u(x)) = ∑_{x∈X} p(x) log[p(x)/u(x)]
                  = ∑_{x∈X} p(x) log[1/u(x)] − ∑_{x∈X} p(x) log[1/p(x)]
                  = log |X| − H(X).

    Hence, by the non-negativity of relative entropy,

    0 ≤ D(p(x)||u(x)) = log |X| − H(X).

  • Slide 75/97

    Theorem: conditioning reduces entropy

    H(X|Y) ≤ H(X)

    Proof: 0 ≤ I(X; Y) = H(X) − H(X|Y).

    Comments: knowing another r.v. Y can only reduce the uncertainty in X. This is true only on the average.

  • Slide 76/97

    Theorem: independence bound on entropy

    H(X1, X2, ..., Xn) ≤ ∑_i H(Xi),

    with H(X1, X2, ..., Xn) = ∑_i H(Xi) ⟺ the Xi are independent of each other.

    Proof: by the chain rule for entropy, applying the theorem that conditioning reduces entropy:

    H(X1, X2, ..., Xn) = ∑_{i=1}^{n} H(Xi|Xi−1, Xi−2, ..., X1) ≤ ∑_{i=1}^{n} H(Xi)

  • Slide 77/97

    Log sum inequality: theorem

    For non-negative numbers ai and bi (i = 1, 2, ..., n),

    ∑_{i=1}^{n} ai log(ai/bi) ≥ (∑_{i=1}^{n} ai) log[(∑_{i=1}^{n} ai)/(∑_{i=1}^{n} bi)],

    with equality if and only if ai/bi = constant.
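
    A quick numerical check of the log sum inequality for arbitrary positive numbers (chosen only for illustration):

    from math import log2

    a = [1.0, 2.0, 3.0]
    b = [0.5, 1.0, 4.0]

    lhs = sum(ai * log2(ai / bi) for ai, bi in zip(a, b))
    rhs = sum(a) * log2(sum(a) / sum(b))

    print(lhs, rhs, lhs >= rhs)   # lhs >= rhs; equality only if a_i/b_i is constant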

  • Slide 78/97

    Log sum inequality: proof

    Proof (a brief sketch):

    Assume ai and bi are positive.

    Construct f(t) = t log t. The function f(t) = t log t is strictly convex for all positive t.

    Construct λi = bi/∑_j bj and ti = ai/bi.

    By Jensen's inequality, ∑_i λi f(ti) ≥ f(∑_i λi ti).

    Then we obtain the log sum inequality.

  • Slide 79/97

    Log sum inequality: elaboration

    Conventions: ai ≥ 0 and bi ≥ 0; 0 log 0 = 0, a log(a/0) = ∞ if a > 0, and 0 log(0/0) = 0.

    Construct f(t) = t log t. The function f(t) = t log t is strictly convex, since f''(t) = 1/t > 0 for all positive t.

    By Jensen's inequality, ∑_i λi f(ti) ≥ f(∑_i λi ti).

    For λi ≥ 0 with ∑_i λi = 1, set λi = bi/∑_{j=1}^{n} bj and ti = ai/bi.

    f(ti) = (ai/bi) log(ai/bi)

    ∑_{i=1}^{n} λi f(ti) = ∑_{i=1}^{n} [bi/∑_{j=1}^{n} bj] (ai/bi) log(ai/bi) = ∑_{i=1}^{n} [ai/∑_{j=1}^{n} bj] log(ai/bi)

  • Slide 80/97

    Log sum inequality: elaboration

    ∑_{i=1}^{n} λi f(ti) = ∑_{i=1}^{n} [ai/∑_{j=1}^{n} bj] log(ai/bi)

    f(∑_{i=1}^{n} λi ti) = [∑_{i=1}^{n} λi ti] log[∑_{i=1}^{n} λi ti]
                         = [∑_{i=1}^{n} (bi/∑_{j} bj)(ai/bi)] log[∑_{i=1}^{n} (bi/∑_{j} bj)(ai/bi)]
                         = [∑_{i=1}^{n} ai/∑_{j=1}^{n} bj] log[∑_{i=1}^{n} ai/∑_{j=1}^{n} bj]

  • Slide 81/97

    Log sum inequality: elaboration

    By Jensen's inequality, ∑_i λi f(ti) ≥ f(∑_i λi ti):

    ∑_{i=1}^{n} [ai/∑_{j=1}^{n} bj] log(ai/bi) ≥ [∑_{i=1}^{n} ai/∑_{j=1}^{n} bj] log[∑_{i=1}^{n} ai/∑_{j=1}^{n} bj]

    ⇒  ∑_{i=1}^{n} ai log(ai/bi) ≥ (∑_{i=1}^{n} ai) log[(∑_{i=1}^{n} ai)/(∑_{j=1}^{n} bj)]

  • Slide 82/97

    Log sum inequality: applications

    Theorem (convexity of relative entropy): D(p||q) is convex in the pair (p, q):

    D[λp1 + (1 − λ)p2 || λq1 + (1 − λ)q2] ≤ λD(p1||q1) + (1 − λ)D(p2||q2).

    Corollary: convexity of mutual information.

    Theorem (concavity of entropy): H(p) is a concave function of p.

  • Slide 83/97

    Data processing inequality

    Markov chain: random variables X, Y, Z form a Markov chain (X → Y → Z) if

    p(x, y, z) = p(x)p(y|x)p(z|y).

    Note that by the chain rule,

    p(x, y, z) = p(x)p(y, z|x) = p(x)p(y|x)p(z|y, x).

    Consequence: Markovity implies conditional independence, because

    p(x, z|y) = p(x, y, z)/p(y) = p(x, y)p(z|y)/p(y) = p(x|y)p(z|y).

  • Slide 84/97

    Data processing inequality: theorem

    If X → Y → Z, then I(X; Y) ≥ I(X; Z).

    Proof:

    I(X; Y, Z) = I(X; Z) + I(X; Y|Z) = I(X; Y) + I(X; Z|Y),

    where I(X; Z|Y) = 0 (by Markovity) and I(X; Y|Z) ≥ 0.

    Thus, we have I(X; Y) ≥ I(X; Z).

    Comments: manipulation of data cannot increase its information.
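
    A small numerical illustration of the data processing inequality, with X → Y → Z built from two cascaded binary symmetric channels (input bias and crossover probabilities are illustrative assumptions):

    from math import log2

    def H(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    def mutual_information(joint):
        # I between the row and column variables of a 2x2 joint p.m.f.
        pa = [sum(row) for row in joint]
        pb = [sum(row[j] for row in joint) for j in range(2)]
        return H(pa) + H(pb) - H([p for row in joint for p in row])

    def bsc_joint(p_in, eps):
        # Joint p.m.f. of (input, output) for a binary symmetric channel.
        return [[p_in[i] * ((1 - eps) if i == j else eps) for j in range(2)]
                for i in range(2)]

    px = [0.3, 0.7]          # assumed input distribution of X
    eps1, eps2 = 0.1, 0.2    # assumed crossover probabilities of the two channels

    joint_xy = bsc_joint(px, eps1)   # p(x, y)

    # Markov chain X -> Y -> Z: p(x, z) = sum over y of p(x, y) * p(z|y)
    joint_xz = [[sum(joint_xy[i][y] * ((1 - eps2) if y == k else eps2)
                     for y in range(2)) for k in range(2)] for i in range(2)]

    print(mutual_information(joint_xy))   # I(X; Y)
    print(mutual_information(joint_xz))   # I(X; Z), never larger than I(X; Y)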

  • Slide 85/97

    Summary: model

    Single outcome or outcome sequence

    Continuous or Discrete

  • Slide 86/97

    Summary: basic system of the simplest discrete source

    Notations: sample space: X; random variable (r.v.): X; outcome of X (realization of X): x; cardinality of the set X (the number of elements): |X|

    Probability mass function (p.m.f.): P(x) = Pr[X = x], x ∈ X; P(x, y) = Pr[X = x, Y = y], x ∈ X, y ∈ Y

  • Slide 87/97

    Summary: concept

    Self-information I(x): I(x) = −log[p(x)] = log[1/p(x)]

    Measure of uncertainty of a single outcome

    Non-negative

  • Slide 88/97

    Summary: concept

    Entropy H(X): H(X) = −E[log p(x)] = E[log(1/p(x))], i.e., H(X) = E_X[I(x)]

    Measure of uncertainty of the information source

    Non-negative

  • Slide 89/97

    Summary: concept

    Joint entropy H(X, Y)

    Conditional entropy H(X|Y) or H(Y|X)

    Chain rule: H(X, Y) = H(X) + H(Y|X)

  • Slide 90/97

    Summary: concept

    Self-information I(x): measure of uncertainty of a single outcome; non-negative.

    Entropy H(X): H(X) = E_X[I(x)]; measure of uncertainty of the information source; non-negative.

    Relative entropy D(p(x)||q(x)): measure of the dissimilarity (divergence) of two distributions; non-negative.

    Mutual information I(X; Y): I(X; Y) = D[p(x, y)||p(x)p(y)] = E_{X,Y}[I(x; y)]; measure of the divergence between the joint and product p.m.f.s; special case of relative entropy (non-negative).

  • Slide 91/97

    Summary: entropy properties

    Non-negativity: H(X) ≥ 0
    Chain rule: H(X, Y) = H(X) + H(Y|X)
    Uniform p.m.f. maximization: H(X) ≤ log(|X|)
    Conditioning reduction: H(X|Y) ≤ H(X)
    Independence bound: H(X1, X2, ..., Xn) ≤ ∑_i H(Xi)
    Concavity: H(λp(x) + (1 − λ)p'(x)) ≥ λH(p(x)) + (1 − λ)H(p'(x)); having a maximum in a given range

  • Slide 92/97

    Summary: mutual information properties

    Name and expression:

    Non-negativity: I(X; Y) ≥ 0
    Maximum: I(X; Y) = H(X) + H(Y) − H(X, Y) ≤ H(X)
    Symmetry: I(X; Y) = I(Y; X)
    Convexity: D[λp1 + (1 − λ)p2 || λq1 + (1 − λ)q2] ≤ λD(p1||q1) + (1 − λ)D(p2||q2); having a minimum in a given range

  • Slide 93/97

    Summary: entropy

    Concept, relationship, Venn diagram:

    H(X): H(X) ≥ H(X|Y); H(X) = H(X|Y) + I(X; Y); H(X) = H(X, Y) − H(Y|X)

    H(Y): H(Y) ≥ H(Y|X); H(Y) = H(Y|X) + I(X; Y); H(Y) = H(X, Y) − H(X|Y)

  • Slide 94/97

    Summary: conditional entropy

    Concept, relationship, Venn diagram:

    H(X|Y): H(X|Y) = H(X, Y) − H(Y); H(X|Y) = H(X) − I(X; Y)

    H(Y|X): H(Y|X) = H(X, Y) − H(X); H(Y|X) = H(Y) − I(X; Y)

  • Slide 95/97

    Summary: joint entropy and mutual information

    Concept, relationship, Venn diagram:

    H(X, Y): H(X, Y) = H(X) + H(Y|X); H(X, Y) = H(Y) + H(X|Y); H(X, Y) = H(X) + H(Y) − I(X; Y); H(X, Y) = H(X|Y) + H(Y|X) + I(X; Y)

    I(X; Y): I(X; Y) = H(X) − H(X|Y); I(X; Y) = H(Y) − H(Y|X); I(X; Y) = H(X, Y) − H(Y|X) − H(X|Y); I(X; Y) = H(X) + H(Y) − H(X, Y)

  • Slide 96/97

    Summary

    Entropy

    Joint and conditional entropy

    Relative entropy and mutual information

    Chain rules

    Jensen's inequality

    Log sum inequality

    Data processing inequality

  • Slide 97/97

    Reference

    T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ: J. Wiley, 2006.