
ECE 534: Elements of Information Theory, Fall 2010
Homework 1 Solutions

Ex. 2.1 (Davide Basilio Bartolini)

Text

Coin Flips. A fair coin is flipped until the first head occurs. Let X denote the number of flips required.

(a) Find the entropy H(X) in bits.

(b) A random variable X is drawn according to this distribution. Find an “efficient” sequence of yes-no questions of the form, “Is X contained in the set S?”. Compare H(X) to the expected number of questions required to determine X.

Solution

(a) The random variable X is on the domain X = {1, 2, 3, . . .} and it denotes the number of flips needed to get the first head, i.e. 1 + the number of consecutive tails appearing before the first head. Since the coin is fair, we have p(“head”) = p(“tail”) = 1/2 and hence (exploiting the independence of the coin flips):

p(X = 1) = p(“head”) = 1/2

p(X = 2) = p(“tail”) ∗ p(“head”) = 1/2 ∗ 1/2 = (1/2)^2

. . .

p(X = n) = p(“tail”) ∗ . . . ∗ p(“tail”) ∗ p(“head”) = (1/2)^(n−1) ∗ 1/2 = (1/2)^n
           (n − 1 tails followed by one head)

From this, it is clear that the probability mass function of X is:

p_X(x) = (1/2)^x

Once the distribution is known, H(X) can be computed from the definition:

H(X) = −∑_{x∈X} p_X(x) log2 p_X(x)

     = −∑_{x=1}^{∞} (1/2)^x log2 (1/2)^x

     = −∑_{x=0}^{∞} (1/2)^x log2 (1/2)^x        (since the summed expression equals 0 for x = 0)

     = −∑_{x=0}^{∞} (1/2)^x x log2 (1/2)        (property of logarithms)

     = −log2 (1/2) ∑_{x=0}^{∞} x (1/2)^x

     = ∑_{x=0}^{∞} x (1/2)^x = (1/2) / (1 − 1/2)^2 = 2 [bits]

(exploiting ∑_{x=0}^{∞} x k^x = k / (1 − k)^2)
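As a quick numerical sanity check (an added sketch, not part of the original solution), the series can be truncated and evaluated directly; the truncation at 60 terms is an arbitrary choice that is more than sufficient for double precision:

# Numerical check of H(X) for p_X(x) = (1/2)^x, x = 1, 2, ...
from math import log2

H = -sum((0.5 ** x) * log2(0.5 ** x) for x in range(1, 61))
print(H)  # ~2.0 bits, matching the closed-form result above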

(b) Since the most likely value for X is 1 (p(X = 1) = 1/2), the most efficient first question is: “Is X = 1?”; the next question will be “Is X = 2?” and so on, until a positive answer is obtained. If this strategy is used, the random variable Y representing the number of questions will have the same distribution as X, and its expectation will be:

E[Y] = ∑_{y=0}^{∞} y (1/2)^y = (1/2) / (1 − 1/2)^2 = 2

which is exactly equal to the entropy of X. An interpretation of this fact is that 2 bits (the entropy of X) is the amount of memory required to store the outcomes of the two binary questions that are, on average, enough to determine the value of X.
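The strategy above can also be checked by simulation (an illustrative sketch added here, not part of the original solution; the trial count of 100,000 is arbitrary):

# Simulate the questioning strategy "Is X = 1?", "Is X = 2?", ...
# X is the number of fair-coin flips up to and including the first head,
# and the strategy asks exactly X questions, so the sample mean should be ~2.
import random

def draw_x() -> int:
    flips = 1
    while random.random() < 0.5:  # tails with probability 1/2
        flips += 1
    return flips

trials = 100_000
avg_questions = sum(draw_x() for _ in range(trials)) / trials
print(avg_questions)  # ~2.0, matching E[Y] = H(X) = 2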

Ex. 2.4 (Matteo Carminati)

Entropy of functions of a random variable. Let X be a discrete random variable. Show that the entropy of a function of X is less than or equal to the entropy of X by justifying the following steps:

H(X, g(X)) = H(X) + H(g(X)|X)      (a)
           = H(X)                  (b)        (1)

H(X, g(X)) = H(g(X)) + H(X|g(X))   (c)
           ≥ H(g(X))               (d)        (2)


Thus, H(g(X)) ≤ H(X).

Solution

(a) This follows from the chain rule for entropy applied to the random variables X and g(X), i.e. H(X, Y) = H(X) + H(Y|X), so H(X, g(X)) = H(X) + H(g(X)|X).

(b) Intuitively, since g(X) depends only on X, once the value of X is known g(X) is completely specified and takes a deterministic value. The entropy of a deterministic quantity is 0, so H(g(X)|X) = 0 and H(X) + H(g(X)|X) = H(X).

(c) Again, this formula comes from the chain rule for entropy, in the form H(X, Y) = H(Y) + H(X|Y).

(d) Proving that H(g(X)) + H(X|g(X)) ≥ H(g(X)) amounts to proving that H(X|g(X)) ≥ 0: non-negativity is one of the properties of entropy and can be proved from its definition by noting that the logarithm of a probability (a quantity always less than or equal to 1) is non-positive. In particular, H(X|g(X)) = 0 if knowledge of the value of g(X) completely specifies the value of X (for example, if g is an injective function); otherwise H(X|g(X)) > 0.
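To make the inequality concrete, the following small numerical illustration (added here; the uniform distribution on {1, 2, 3, 4} and the non-injective map g(x) = x mod 2 are arbitrary choices) compares H(X) and H(g(X)):

# H(g(X)) <= H(X): g merges outcomes, so the pushed-forward distribution
# is coarser and its entropy cannot increase.
from collections import defaultdict
from math import log2

def entropy(dist):
    return -sum(q * log2(q) for q in dist.values() if q > 0)

p_x = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}

p_gx = defaultdict(float)
for x, q in p_x.items():
    p_gx[x % 2] += q  # push the distribution forward through g

print(entropy(p_x))   # 2.0 bits
print(entropy(p_gx))  # 1.0 bit, consistent with H(g(X)) <= H(X)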

Ex. 2.7(a) (Davide Basilio Bartolini)

Text

Coin weighing. Suppose that one has n coins, among which there may or may not be one counterfeit coin. If there is a counterfeit coin, it may be either heavier or lighter than the other coins. The coins are to be weighed by a balance. Find an upper bound on the number of coins n so that k weighings will find the counterfeit coin (if any) and correctly declare it to be heavier or lighter.

Solution

Let X be a string of n characters over the alphabet {−1, 0, 1}, each of which represents one coin. Each character of X may take three different values (say 1 if the coin is heavier than a normal one, 0 if it is regular, −1 if it is lighter). Since at most one of the coins may be counterfeit, X may be a string of all 0s (if all the coins are regular) or may contain either a 1 or a −1 in exactly one position. Thus, the possible configurations for X are 2n + 1. Under the hypothesis of a uniform distribution over which coin (if any) is counterfeit, the entropy of X will be:

H(X) = log2 (2n + 1)


Now let Z = [Z1, Z2, . . . , Zk] be a random variable representing the weighings; each of the Zi will have three possible values, indicating whether the result of the weighing is balanced, left arm heavier, or right arm heavier. The entropy of each Zi is upper-bounded by the three possible values it can assume: H(Zi) ≤ log2 3, ∀i ∈ [1, k], and for Z (under the hypothesis of independence of the weighings):

H(Z) = H(Z1, Z2, . . . , Zk) = ∑_{i=1}^{k} H(Zi | Zi−1, . . . , Z1)        (chain rule)

     = ∑_{i=1}^{k} H(Zi)                                                   (independence)

     ≤ k log2 3

Since we want the weighings to yield at least as much information as is given by the configuration of X (i.e. we want k weighings to suffice to find out which coin, if any, is counterfeit and whether it is heavier or lighter), we need:

H(X) ≤ H(Z) ≤ k log2 3

log2 (2n + 1) ≤ k log2 3

2n + 1 ≤ 3^k

n ≤ (3^k − 1) / 2

which is the desired upper bound.
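For reference, evaluating the bound for the first few values of k (a small added computation, not part of the original solution):

# n <= (3**k - 1) / 2 coins can be handled with k weighings.
for k in range(1, 5):
    print(k, (3 ** k - 1) // 2)  # 1 -> 1, 2 -> 4, 3 -> 13, 4 -> 40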

Ex. 2.12 (Kenneth Palacio)

           Y = 0    Y = 1
  X = 0     1/3      1/3
  X = 1      0       1/3

Table 1: p(x, y) for problem 2.12.

Find:
(a) H(X), H(Y).
(b) H(X|Y), H(Y|X).
(c) H(X, Y).
(d) H(Y) − H(Y|X).
(e) I(X; Y).
(f) Draw a Venn diagram for the quantities in parts (a) through (e).

Solution:


Computation of the marginal distributions:

p(x) = [2/3, 1/3]

p(y) = [1/3, 2/3]

(a) H(X), H(Y).

H(X) = −(2/3) log2 (2/3) − (1/3) log2 (1/3) = 0.918 bits

H(Y) = −(1/3) log2 (1/3) − (2/3) log2 (2/3) = 0.918 bits

Figure 1: H(X), H(Y)

(b) H(X|Y), H(Y|X).

H(X|Y) = ∑_{i=0}^{1} p(Y = i) H(X|Y = i)
       = (1/3) H(X|Y = 0) + (2/3) H(X|Y = 1)
       = (1/3) H(1, 0) + (2/3) H(1/2, 1/2)
       = 2/3 bits

H(Y|X) = ∑_{i=0}^{1} p(X = i) H(Y|X = i)
       = (2/3) H(Y|X = 0) + (1/3) H(Y|X = 1)
       = (2/3) H(1/2, 1/2) + (1/3) H(0, 1)
       = 2/3 bits

Figure 2: H(X|Y ), H(Y |X)

(c) H(X, Y).

H(X, Y) = −∑_{x,y} p(x, y) log2 p(x, y)
        = −3 ∗ (1/3) log2 (1/3)
        = log2 3 ≈ 1.585 bits

Figure 3: H(X,Y)

6

(d) H(Y) − H(Y|X).

H(Y) − H(Y|X) = 0.918 − 2/3 ≈ 0.2516 bits

(which, as expected, equals I(X; Y) computed in part (e)).

Figure 4: H(Y )−H(Y |X)

(e) I(X; Y).

I(X; Y) = ∑_{x,y} p(x, y) log2 [ p(x, y) / (p(x) p(y)) ]

        = (1/3) log2 [ (1/3) / ((2/3)(1/3)) ] + (1/3) log2 [ (1/3) / ((2/3)(2/3)) ] + (1/3) log2 [ (1/3) / ((1/3)(2/3)) ]

        ≈ 0.2516 bits

Figure 5: I(X;Y)

(f) A Venn diagram has already been shown for each of the quantities above (Figures 1 through 5).
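All of the quantities above can be recomputed directly from the joint distribution in Table 1; the following sketch (added here as a convenience check, not part of the original solution) reproduces the values:

# Recompute the quantities of problem 2.12 from p(x, y).
from math import log2

def entropy(dist):
    return -sum(q * log2(q) for q in dist.values() if q > 0)

p = {(0, 0): 1/3, (0, 1): 1/3, (1, 0): 0.0, (1, 1): 1/3}

px = {x: sum(v for (a, b), v in p.items() if a == x) for x in (0, 1)}
py = {y: sum(v for (a, b), v in p.items() if b == y) for y in (0, 1)}

H_X, H_Y, H_XY = entropy(px), entropy(py), entropy(p)

print(H_X, H_Y)                # ~0.918, ~0.918 bits
print(H_XY - H_Y, H_XY - H_X)  # H(X|Y) = H(Y|X) = ~0.667 bits
print(H_XY)                    # ~1.585 bits
print(H_X + H_Y - H_XY)        # I(X;Y) = ~0.2516 bits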

Ex. 2.20 (Kenneth Palacio)

Run-length coding. Let X1, X2, . . . , Xn be (possibly dependent) binary random variables. Suppose that one calculates the run lengths R = (R1, R2, . . .) of this sequence (in order as they occur). For example, the sequence X = 0001100100 yields run lengths R = (3, 2, 2, 1, 2). Compare H(X1, X2, . . . , Xn), H(R), and H(Xn, R). Show all equalities and inequalities, and bound all the differences.

Solution:

Assume that one random variable Xj (0 < j ≤ n) is known; then, if R is also known, H(Xj, R) provides the same information about the uncertainty as H(X1, X2, . . . , Xn), since the whole sequence X can be completely recovered from the knowledge of Xj and R (the run containing position j is fixed by Xj, and consecutive runs alternate between 0s and 1s). For example, with X5 = 1 and the run lengths R = (3, 2, 2, 1, 2) it is possible to recover the original sequence as follows:

X5 = 1, R = (3, 2, 2, 1, 2) leads to recovering the sequence X = 0001100100.

It can be concluded that:

H(Xj, R) = H(X1, X2, . . . , Xn)

Using the chain rule, H(Xj, R) can be written as:

H(Xj, R) = H(R) + H(Xj|R)

and H(Xj|R) ≤ H(Xj), since conditioning reduces entropy. Combining these facts, H(X1, X2, . . . , Xn) = H(R) + H(Xj|R) ≤ H(R) + H(Xj), so it is possible to write:

H(Xj) ≥ H(X1, X2, . . . , Xn) − H(R)

H(Xj) + H(R) ≥ H(X1, X2, . . . , Xn)

The distribution of Xj is unknown; since Xj is binary, assume P(Xj = 0) = p and P(Xj = 1) = 1 − p. Then H(Xj) = −p log2 p − (1 − p) log2 (1 − p), which is maximized at p = 1/2, giving max H(Xj) = 1 bit.

Then:

1 + H(R) ≥ H(X1, X2, . . . , Xn)

Considering the result obtained in problem 2.4, we can also write H(R) ≤ H(X1, X2, . . . , Xn), because R is a function of (X1, X2, . . . , Xn).
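The recoverability argument used above can be illustrated with a short sketch (added here; the function name is hypothetical and the example values are the ones from the text), which rebuilds the whole sequence from one known bit Xj and the run lengths R:

# Rebuild a binary sequence from its run lengths R and one known bit X_j
# (1-indexed position j): the run containing position j is fixed by X_j,
# and consecutive runs alternate between 0s and 1s.
def reconstruct(j, xj, runs):
    boundary, run_idx = 0, 0
    for i, r in enumerate(runs):
        boundary += r
        if j <= boundary:
            run_idx = i
            break
    first_bit = xj if run_idx % 2 == 0 else 1 - xj
    return "".join(str(first_bit if i % 2 == 0 else 1 - first_bit) * r
                   for i, r in enumerate(runs))

print(reconstruct(5, 1, [3, 2, 2, 1, 2]))  # '0001100100', as in the example above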
