INFORMATION THEORY
and Communication
Wayne Lawton
Department of Mathematics
National University of Singapore
S14-04-04, [email protected]
http://math.nus.edu.sg/~matwml
CHOICE CONCEPT
We are confronted by choices every day; for instance, we may choose to purchase an apple, a banana, or a coconut, or to withdraw k < N dollars from our bank account.
Choice has two aspects. We may need to make the decision: “what fruit to purchase” or “how many dollars to withdraw”. We will ignore this aspect, which is the concern of Decision Theory.
We may also need to communicate this choice to our food seller or our bank. This is the concern of Information Theory.
INCREASING CHOICE?
The number of choices increases if more elements are added to the set of existing choices. For example, suppose a shopper is to choose one fruit from a store that carries Apples, Bananas, and Coconuts, and then discovers that the store has added Durians and Elderberries. The number of choices increases from 3 to 5 by the addition of 2 extra choices:
$$\{A,B,C\},\ \{D,E\}\ \longrightarrow\ \{A,B,C,D,E\}$$
The number of choices also increases if two or more sets of choices are combined. A shopper may have the choice of choosing one fruit from {A,B,C} on Monday and one fruit from {D,E} on Thursday, giving $3 \times 2 = 6$ combined choices. Compare with the case above:
$$\{A,B,C\} \times \{D,E\} = \{(A,D),(A,E),(B,D),(B,E),(C,D),(C,E)\}$$
MEASURING CHOICE?
For any set X let #X denote the number of elements in X. Then
$$\#(X \times Y) = \#X \cdot \#Y$$
We require that the information, required to specify the choice of an element from a set, be an additive function, therefore
$$H(X \times Y) = H(X) + H(Y)$$
Theorem 1. The logarithm functions measure information:
$$H(X) = \log_a \#X,$$
the logarithm to the base a, where a > 0; called bits if a = 2, nats if a = e.
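Additivity follows directly from the product rule for logarithms, which is the content of Problem 1 below; the one-line check is
$$H(X \times Y) = \log_a \#(X \times Y) = \log_a(\#X \cdot \#Y) = \log_a \#X + \log_a \#Y = H(X) + H(Y).$$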
FACTORIAL FUNCTION
For any positive integer n define n factorial
$$n! = n(n-1)(n-2)\cdots 2 \cdot 1$$
with the convention $0! = 1$. Often the choices within different sets are mutually constrained. If a shopper is to purchase a different fruit from the set {A,B,C,D,E} on each of 5 days, then there are 5 choices on the first day but only 4 choices on the second day, etc., so the total number of choices equals
$$5 \cdot 4 \cdot 3 \cdot 2 \cdot 1 = 5! = 120$$
STIRLING’S APPROXIMATION
Theorem 2.
$$\log n! = n \log n - n \log e + \tfrac{1}{2} \log n + \text{constant}$$
(the constant is $\tfrac{1}{2}\log 2\pi$)
Proof [K] pages 111-115
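A minimal MATLAB check of Theorem 2, along the lines of Problem 2 below; gammaln(n+1) computes log(n!) without overflow:

n = (10:10:50)';                                % sample values of n
exact = gammaln(n+1);                           % log(n!) via the log-gamma function
stirling = n.*log(n) - n + 0.5*log(2*pi*n);     % Theorem 2 with natural logs
disp([n exact stirling exact-stirling])         % the error shrinks like 1/(12n)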
COMBINATORICS
Theorem 2 (Binomial).
$$(s+t)^n = \sum_{k=0}^{n} \binom{n}{k} s^k t^{n-k}$$
where $\binom{n}{k}$, called n choose k, is the number of ways of choosing k elements from a set with n elements.
Proof. Consider that $(s+t)^n = (s+t)(s+t)\cdots(s+t)$ is the product of n factors, and it equals the sum of $2^n$ terms; each term is obtained by specifying a choice of s or t from each factor, and the number of terms with $s^k t^{n-k}$ is exactly the number of ways of choosing k factors to have s out of n factors. Furthermore, n choose k equals
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
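A quick MATLAB sanity check of the binomial theorem for one choice of n, s, t, using the built-in nchoosek:

n = 7; s = 2; t = 3;                            % arbitrary test values
lhs = (s+t)^n;                                  % 5^7 = 78125
rhs = 0;
for k = 0:n
    rhs = rhs + nchoosek(n,k) * s^k * t^(n-k);  % the 2^n terms, grouped by k
end
fprintf('lhs = %d, rhs = %d\n', lhs, rhs)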
MULTINOMIAL COEFFICIENTS
Theorem 3. The number of sequences with $n_i \geq 0$, $i = 1,\dots,M$, symbols of a given type equals
$$\frac{N!}{n_1!\, n_2! \cdots n_M!}$$
where $N = n_1 + \cdots + n_M$.
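Theorem 3 can be checked in MATLAB by brute-force enumeration; here for the letter counts of the multiset {A,A,B,C} (n1 = 2, n2 = 1, n3 = 1, N = 4):

N = 4; counts = [2 1 1];                                    % the multiset {A,A,B,C}
byFormula = factorial(N) / prod(factorial(counts));         % N!/(n1! n2! n3!) = 12
byEnumeration = size(unique(perms([1 1 2 3]), 'rows'), 1);  % distinct orderings
fprintf('formula: %d, enumeration: %d\n', byFormula, byEnumeration)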
SHANNON’S FORMULA
Theorem 4. For large N, the average information per symbol of a string of length N containing M symbols with probabilities $p_1, \dots, p_M$ is
$$H(p_1,\dots,p_M) = -\sum_{i=1}^{M} p_i \log_2 p_i \quad \text{bits}$$
Proof. The law of large numbers says that the i-th symbol will occur approximately $n_i = N p_i$, $i = 1,\dots,M$, times, so the result follows from Stirling's Approximation.
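Filling in the step (the details are Problem 4 below): by Theorem 3 there are $N!/(n_1! \cdots n_M!)$ such strings, and by Stirling's Approximation
$$\log_2 \frac{N!}{n_1! \cdots n_M!} \approx N \log_2 N - \sum_{i=1}^{M} n_i \log_2 n_i = -\sum_{i=1}^{M} N p_i \log_2 \frac{n_i}{N} = -N \sum_{i=1}^{M} p_i \log_2 p_i,$$
since the $n \log_2 e$ terms cancel ($\sum_i n_i = N$) and the $\tfrac{1}{2}\log_2 n$ terms are negligible; dividing by N gives the information per symbol.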
ENTROPY OF A RANDOM VARIABLE
Let X denote a random variable with values in a set $A = \{a_1, \dots, a_m\}$ with probabilities $p_1, \dots, p_m$. We define the entropy H(X) by
$$H(X) = H(p_1,\dots,p_m) = -\sum_{i=1}^{m} p_i \log_2 p_i \quad \text{bits}$$
Hence the entropy of a random variable equals the average information required to describe the values that it takes: it takes 1000 bits to describe 1000 flips of a fair coin, but we can describe the loaded-coin sequence HHHHHHTHHHHHHHHHTT by its run lengths 6H1T9H2T.
Recall that for a large integer N, NH(X) equals the log of the number of strings of length N from A whose relative frequencies of letters are these probabilities.
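A one-line MATLAB helper for entropy (it assumes every p(i) > 0; Problem 6 below asks for the last value):

H = @(p) -sum(p .* log2(p));   % entropy in bits
H([0.5 0.5])                   % fair coin: 1 bit per flip
H([1/2 1/3 1/6])               % approximately 1.459 bits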
JOINT DISTRIBUTION
Let X denote a random variable with values in a set $A = \{a_1, \dots, a_m\}$ with probabilities $p_1, \dots, p_m$, and let Y be a random variable with values in a set $B = \{b_1, \dots, b_n\}$ with probabilities $q_1, \dots, q_n$. Then $X \times Y$ is a random variable with values in
$$A \times B = \{(a_i, b_j) \mid i = 1,\dots,m;\ j = 1,\dots,n\}$$
whose probabilities $r_{ij}$, $i = 1,\dots,m$; $j = 1,\dots,n$, satisfy the marginal equations (m+n-1 of them independent)
$$\sum_{i=1}^{m} r_{ij} = q_j, \qquad \sum_{j=1}^{n} r_{ij} = p_i$$
MUTUAL INFORMATION
The joint entropy of X and Y
$$H(X \times Y) = -\sum_{i=1}^{m} \sum_{j=1}^{n} r_{ij} \log_2 r_{ij}$$
satisfies
$$H(X \times Y) \leq H(X) + H(Y)$$
Equality holds if and only if X and Y are independent; this means that $r_{ij} = p_i q_j$.
The mutual information of X and Y, defined by
$$I(X,Y) = H(X) + H(Y) - H(X \times Y),$$
satisfies $0 \leq I(X,Y) \leq \min\{H(X), H(Y)\}$ and $I(X,X) = H(X)$.
CHANNELS AND THEIR CAPACITY
A channel is a relationship between the transmitted message X and the received message Y. Typically, this relationship does not determine Y as a function of X but only determines the statistics of Y given the value of X; this determines the joint distribution of X and Y. The channel capacity is defined as
$$C = \max \{ I(X,Y) \}$$
Example: a binary channel with a 10% bit error rate and prob{X = 0} = p has joint probabilities

           Y = 0        Y = 1
X = 0      .9 p         .1 p
X = 1      .1 (1-p)     .9 (1-p)

Max I(X,Y) = .531 bits for p = .5.
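A brute-force MATLAB check of the example (it sweeps p rather than using the closed form C = 1 - H(.1) for a binary symmetric channel):

p = 0.01:0.01:0.99;                                    % input distribution prob{X=0} = p
I = zeros(size(p));
for k = 1:numel(p)
    r = [.9*p(k), .1*p(k); .1*(1-p(k)), .9*(1-p(k))];  % joint distribution
    q = sum(r, 1);                                     % marginal of Y
    HX = -sum([p(k), 1-p(k)] .* log2([p(k), 1-p(k)]));
    HY = -sum(q .* log2(q));
    HXY = -sum(r(:) .* log2(r(:)));
    I(k) = HX + HY - HXY;                              % mutual information
end
[C, kbest] = max(I);
fprintf('C = %.3f bits at p = %.2f\n', C, p(kbest))    % C = 0.531 at p = 0.50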
SHANNON’S THEOREM
If a channel has capacity C then it is possible to send information over that channel at a rate arbitrarily close to, but never more than, C with a probability of error arbitrarily small.
Shannon showed that this was possible by proving that there exists a sequence of codes whose rates approach C and whose probabilities of error approach zero.
His masterpiece, called the Channel Coding Theorem, never actually constructed any specific codes, and thus provided jobs for thousands of engineers, mathematicians, and scientists.
LANGUAGE AS A CODE
During my first visit to Indonesia I ate a curious small fruit. Back in Singapore I went to a store and asked for a small fruit with the skin of a dark brown snake and more bitter than any gourd. Now I ask for Salak – a far more efficient, if less descriptive, code to specify my choice of fruit.
When I specify the number of dollars that I want to withdraw from my bank account I use positional notation (in base 10), a code to specify nonnegative integers that was invented in Babylonia (now Iraq) about 4000 years ago (in base 60).
Digital computers, in contrast to analog computers, represent numbers using positional notation in base 2 (shouldn’t they be called binary computers?). Is this because they can’t count further than 1? These lectures will explore this and other intriguing mysteries.
WHAT IS THE BEST BASE?
A base-B code of length L uses an ordered sequence of symbols from a set of B symbols to represent B x B x ... x B = B^L (read ‘B to the power L’) choices.
Physically, this is represented using L devices each of which can exist in one of B states.
The cost is L times the cost of each device, and the cost of each device is proportional to B, since physical material is required to represent each of the B-1 ‘inactive states’ of each of the L devices, one per position.
The efficiency of base B is therefore the ratio of the information in a base-B sequence of length L divided by its cost BL, therefore
$$\mathrm{Eff}(B) = \frac{L \log_e B}{B L} = \frac{\log_e B}{B}$$
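Answering the slide's title question (a step the slide leaves implicit): the efficiency is maximized where its derivative vanishes,
$$\frac{d}{dB}\left(\frac{\ln B}{B}\right) = \frac{1 - \ln B}{B^2} = 0 \iff B = e \approx 2.718,$$
so B = 3 is the most efficient integer base ($\ln 3/3 \approx 0.366$ beats $\ln 2/2 = \ln 4/4 \approx 0.347$).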
IS THE SKY BLUE?
If I use base 2 positional notation to specify that I want to withdraw d < 8 dollars from my bank account then

d = 0: 000   d = 1: 001   d = 2: 010   d = 3: 011
d = 4: 100   d = 5: 101   d = 6: 110   d = 7: 111

Positional notation is great for computing, but if I decide to withdraw 2 rather than 1 (or 4 rather than 3) dollars I must change my code by 2 (or 3) bits. Consider the Gray code:

d = 0: 000   d = 1: 001   d = 2: 011   d = 3: 010
d = 4: 110   d = 5: 111   d = 6: 101   d = 7: 100

What’s different?
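The second table is the binary-reflected Gray code, which has the closed form g = d XOR floor(d/2); a short MATLAB check reproduces it:

% Binary-reflected Gray code: g = bitxor(d, floor(d/2))
for d = 0:7
    fprintf('d = %d: %s\n', d, dec2bin(bitxor(d, bitshift(d, -1)), 3));
end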
GRAY CODE GEOMETRY
How many binary Gray codes of length 3 are there? And how can we construct them? Cube geometry gives the answers.

[Figure: cube with vertices labeled 000, 001, 010, 011, 100, 101, 110, 111]

d = 0: 000   d = 1: 001   d = 2: 011   d = 3: 010
d = 4: 110   d = 5: 111   d = 6: 101   d = 7: 100

Bits in a code are the Cartesian coordinates of the vertices. The d-th and (d+1)-th vertices share a common edge.
Answer the questions.
PROBLEMS?
1. Derive Theorem 1. Hint: review properties of logarithms.
2. Write and run a simple computer program to demonstrate Stirling’s Approximation.
3. Derive the formula for n choose k by induction and then try to find another derivation. Then use the other derivation to derive the multinomial formula.
4. Complete the details for the second half of the derivation of Shannon’s Formula for Information.
5. How many binary Gray codes are there of length 3?
ERROR CORRECTING CODES
How many bits of information can be sent reliably by sending 3 bits if one of those bits may be corrupted? Consider the 3-dimensional binary hypercube.

[Figure: binary hypercube with vertices 000, 001, 010, 011, 100, 101, 110, 111]

H = {binary sequences of length 3}; H has 8 sequences.
A code C is a subset of H.
The Hamming distance d(x,y) between x and y in H is the number of bits in which they differ. Hence d(010,111) = 2. The minimal distance d(C) of a code C is min {d(x,y) | x ≠ y in C}.
A code C can correct 1 error bit if and only if d(C) ≥ 3, so we can send 1 bit reliably with the code C = {(000),(111)}.
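In MATLAB, with codewords as 0/1 row vectors, the Hamming distance is one line (hamdist is my name for it):

hamdist = @(x, y) sum(x ~= y);   % number of positions where x and y differ
hamdist([0 1 0], [1 1 1])        % d(010,111) = 2
hamdist([0 0 0], [1 1 1])        % d(C) = 3 for C = {(000),(111)}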
PARITY CODES
If we wanted to send 4 bits reliably (to correct up to 1 bit error) then we could send each of these bits three times; this code consists of a set C of 16 sequences of length 12, and the code rate is 33% since 12 bits are used to communicate 4 bits. However, it is possible to send 4 bits reliably using only 8 bits (a 50% rate), arranging the four bits in a 2 x 2 square and assigning 4 parity bits, one for each row and each column.
To send a sequence abcd (subscript means mod 2), transmit the 8 bits
$$\begin{matrix} a & b & (a+b)_2 \\ c & d & (c+d)_2 \\ (a+c)_2 & (b+d)_2 & \end{matrix}$$
For example, a b c d = 1 1 0 1 is transmitted as
$$\begin{matrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & \end{matrix}$$
Note: a single bit error in a, b, c, d results in odd parity in its row and column. Ref: see rectangular and triangle codes in [H].
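A MATLAB sketch of this scheme (variable names are mine): encode the square, flip one data bit, and locate it from the failed row and column checks:

data = [1 1; 0 1];                           % the 2 x 2 square [a b; c d]
rowp = mod(sum(data, 2), 2);                 % row parity bits (a+b)_2, (c+d)_2
colp = mod(sum(data, 1), 2);                 % column parity bits (a+c)_2, (b+d)_2
rx = data; rx(2,1) = 1 - rx(2,1);            % the channel corrupts bit c
badrow = find(mod(sum(rx, 2), 2) ~= rowp);   % row whose parity check fails
badcol = find(mod(sum(rx, 1), 2) ~= colp);   % column whose parity check fails
rx(badrow, badcol) = 1 - rx(badrow, badcol); % flip the located bit back
isequal(rx, data)                            % true: the error is corrected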
HAMMING CODES
The following [7,4,3] Hamming Code can send 4 bits reliably using only 7 bits; it has d(C) = 3.
1111111
1000101
1100010
0110001
1011000
0101100
0010110
0001011
0000000
1101001
0111010
1110100
1001110
1010011
0100111
0011101
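The claim d(C) = 3 can be verified in MATLAB by checking all pairs of the 16 codewords above:

C = ['1111111'; '1000101'; '1100010'; '0110001'; '1011000'; '0101100'; ...
     '0010110'; '0001011'; '0000000'; '1101001'; '0111010'; '1110100'; ...
     '1001110'; '1010011'; '0100111'; '0011101'];
dmin = 7;
for i = 1:15
    for j = i+1:16
        dmin = min(dmin, sum(C(i,:) ~= C(j,:)));   % Hamming distance of the pair
    end
end
dmin   % prints 3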
OTHER CODES
Hamming Codes are examples of cyclic group codes – why?
BCH (Bose-Chaudhuri-Hocquenghem) codes are another class of cyclic group codes, generated by the coefficient sequences of certain irreducible polynomials over a finite field.
Reed-Solomon Codes were the first class of BCH codes to be discovered. They were first used by NASA for space communications and are now used for error correction in CDs.
Other codes include: Convolutional, Goethals, Golay, Goppa, Hadamard, Julin, Justesen, Nordstrom-Robinson, Pless double circulant, Preparata, Quadratic Residue, Rao-Reddy, Reed-Muller, t-designs and Steiner systems, Sugiyama, Trellis, Turbo, and Weldon codes. There are many waiting to be discovered and the number of open problems is huge.
COUNTING STRINGS
Let $m_1, m_2, \dots, m_n$ be positive integers and let $T_1 \leq T_2 \leq \cdots \leq T_n$ be positive real numbers. Let A be an alphabet with $m_1$ symbols of time duration $T_1$, $m_2$ symbols of time duration $T_2$, ..., and $m_n$ symbols of time duration $T_n$, and let
$$CS(t, m_1,\dots,m_n, T_1,\dots,T_n), \quad t \geq 0,$$
be the number of strings, made from the letters of A, whose time duration is $\leq t$.
MORSE CODE MESSAGES
Examples: If A = {dot, dash}, $m_1 = m_2 = 1$, $n = 2$, $T_1 = 1$ (time duration of a dot), and $T_2 = 2$ (time duration of a dash), then $CS(t,1,1,T_1,T_2)$ = # messages whose duration $\leq t$:

CS(1,1,1,1,2) = 1: {.}
CS(2,1,1,1,2) = 3: {., .., -}
CS(3,1,1,1,2) = 6: {., .., ..., -, .-, -.}
CS(4,1,1,1,2) = 11: the six above plus ...., ..-, .-., -.., --
PROTEINS
$m_i = 1$, $i = 1,\dots,n$; $n = 20$.
A = {amino acids} = {Glycine, Alanine, Valine, Phenylalanine, Proline, Methionine, Isoleucine, Leucine, Aspartic Acid, Glutamic Acid, Lysine, Arginine, Serine, Threonine, Tyrosine, Histidine, Cysteine, Asparagine, Glutamine, Tryptophan}
$T_i$ = weight of the corresponding peptide unit of the i-th amino acid, arranged from lightest (i = 1) to heaviest (i = 20).

[Figure: peptide unit with amino acid residue R; a single chain protein with three units and terminal H and OH]

$CS(t, 1,\dots,1, T_1,\dots,T_{20})$ = # single chain proteins with weight $\leq t + \mathrm{weight}(H_2O)$.
RECURSIVE ALGORITHM
$$CS(t, m_1,\dots,m_n, T_1,\dots,T_n) = \begin{cases} 0, & 0 \leq t < T_1 \\ \displaystyle\sum_{i=1}^{k} m_i \left(1 + CS(t - T_i, m_1,\dots,m_n, T_1,\dots,T_n)\right), & T_k \leq t < T_{k+1} \\ \displaystyle\sum_{i=1}^{n} m_i \left(1 + CS(t - T_i, m_1,\dots,m_n, T_1,\dots,T_n)\right), & t \geq T_n \end{cases}$$
MATLAB PROGRAM
function N = cs(t,m,T)
% function N = cs(t,m,T)
% Wayne Lawton 19 / 1 / 2003
% Inputs: t = time,
% m = array of n positive integers
% T = array of n increasing positive numbers
% Outputs: N = number of strings composed out of these
% m(i) symbols of duration T(i), i = 1,...,n and having duration <= t
%
k = sum(T <= t);
N = 0;
if k > 0
    for j = 1:k
        N = N + m(j)*(1 + cs(t-T(j), m, T));
    end
end
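For example, with the Morse parameters from the earlier slide (my own check, not part of the original listing):

cs(4, [1 1], [1 2])   % returns 11 = CS(4,1,1,1,2), matching the count above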
ASYMPTOTIC GROWTH
Theorem 5. For large t,
$$CS(t, m_1,\dots,m_n, T_1,\dots,T_n) \approx c\,X^t$$
where c is some constant and X is the unique real root of the equation
$$\sum_{i=1}^{n} m_i X^{-T_i} = 1$$
Example:
$$CS(t,1,1,1,2) \approx c \left(\frac{1+\sqrt{5}}{2}\right)^t = c\,(1.618\dots)^t$$
Proof. For integer T's a proof based on linear algebra works, and X is the largest eigenvalue of a matrix that represents the recursion or difference equation for CS. Otherwise the Laplace transform is required. We discovered a new proof based on Information Theory.
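The example can be watched converging with the cs program above; the ratio CS(t)/X^t settles toward the constant c:

X = (1 + sqrt(5))/2;            % root of X^(-1) + X^(-2) = 1, the golden ratio
for t = 5:5:25
    fprintf('t = %2d: CS/X^t = %.4f\n', t, cs(t, [1 1], [1 2]) / X^t);
end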
INFORMATION THEORY PROOF
We choose probabilities $p_i / m_i$, $i = 1,\dots,n$, for the symbols having time duration $T_i$, to maximize $H / \bar{T}$, where
$$H = H\left(\frac{p_1}{m_1},\dots,\frac{p_1}{m_1},\ \dots,\ \frac{p_n}{m_n},\dots,\frac{p_n}{m_n}\right) = -\sum_{i=1}^{n} p_i \log_2 p_i + \sum_{i=1}^{n} p_i \log_2 m_i$$
(each $p_i/m_i$ repeated $m_i$ times) is the Shannon information (or entropy) per symbol and
$$\bar{T} = \sum_{i=1}^{n} p_i T_i$$
is the average duration per symbol. Clearly $H / \bar{T}$ is the average information per time.
INFORMATION THEORY PROOF
Since there is the constraint
$$\sum_{i=1}^{n} p_i = 1,$$
at the maximum $\partial H / \partial p_j = \lambda\, \partial \bar{T} / \partial p_j$, $j = 1,\dots,n$, for some Lagrange multiplier $\lambda$, hence
$$p_j = m_j e^{-\lambda T_j} / Z(\lambda), \qquad j = 1,\dots,n,$$
where the denominator, called the partition function, is the sum of the numerators (why?). Substituting these probabilities into $H / \bar{T}$ and maximizing gives $Z(\lambda) = 1$, hence $X = e^{\lambda}$ satisfies the root condition in Theorem 5. The proof is complete since the probabilities that maximize information are the ones that occur in the set of all sequences.
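Making the last step explicit: with $X = e^{\lambda}$, the condition $Z(\lambda) = 1$ reads
$$Z(\lambda) = \sum_{j=1}^{n} m_j e^{-\lambda T_j} = \sum_{j=1}^{n} m_j X^{-T_j} = 1,$$
which is exactly the root condition of Theorem 5, and then the optimal probabilities $p_j = m_j X^{-T_j}$ automatically sum to 1.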
MORE PROBLEMS?
6. Compute H(1/2, 1/3, 1/6).
7. Show that H(X × Y) is maximal when X and Y are independent.
8. Read [H] and explain what a triangular parity code is.
9. Compute all Morse code sequences of duration <= 5 if dots have duration 1 and dashes have duration 2.
10. Compute the smallest molecular weight W so that at least 100 single strand proteins have weight <= W.
REFERENCES[BT] Carl Brandon and John Tooze, Introduction to Protein Structure, Garland Publishing, Inc., New York, 1991.
[H] Sharon Heumann, Coding theory and its application to the study of sphere-packing, Course Notes, October 1998 http://www.mdstud.chalmers.se/~md7sharo/coding/main/main.html
[SW] Claude E. Shannon and Warren Weaver, The Mathematical Theory of Communication, Univ. of Illinois Press, Urbana, 1949.
[K] Donald E. Knuth, The Art of Computer Programming, Volume 1 Fundamental Algorithms, Addison-Wesley, Reading, 1997.
[Ham] R. W. Hamming, Coding and Information Theory, Prentice-Hall, New Jersey, 1980.
[CS] J. H. Conway and N. J. A. Sloane, Sphere Packings, Lattices and Groups, Springer, New York, 1993.
[BC] James Ward Brown and Ruel V. Churchill, Complex Variables and Applications, McGraw-Hill, New York, 1996.
MATHEMATICAL APPENDIX
Let
$$f(t) = CS(t, m_1,\dots,m_n, T_1,\dots,T_n).$$
If $M = \sum_{i=1}^{n} m_i$ and $\lfloor t \rfloor$ denotes the largest integer $\leq t$, then f satisfies $f(t) \leq M^{1 + \lfloor t/T_1 \rfloor}$ (a string of duration $\leq t$ has at most $\lfloor t/T_1 \rfloor$ letters), so its Laplace transform
$$F(s) = \int_0^{\infty} e^{-st} f(t)\, dt$$
exists for complex s if $\mathrm{Re}\, s > T_1^{-1} \log M$. The recursion for f implies that $F(s) = G(s) / P(s)$, where
$$P(s) = 1 - \sum_{i=1}^{n} m_i e^{-sT_i}, \qquad G(s) = \frac{1}{s} \sum_{i=1}^{n} m_i e^{-sT_i}.$$
MATHEMATICAL APPENDIX
$$f(t) = \frac{1}{2\pi i}\ \mathrm{P.V.} \int_{\gamma - i\infty}^{\gamma + i\infty} e^{st} F(s)\, ds, \qquad t > 0.$$
This allows F to be defined as a meromorphic function with singularities to the left of a line $\gamma + i\mathbb{R}$. Therefore, f is given by a Bromwich integral that can be computed by a contour integral using the method of residues, see page 235 of [BC]:
$$f(t) = \sum_{j} \mathrm{Res}_{s = s_j} \left( e^{st} F(s) \right)$$
where $s_j$, $j \geq 1$, are the singularities of F.

[Figure: Bromwich contour $\gamma + i\mathbb{R}$ with the poles $s_1, s_2, s_3, s_4, \dots$ of F to its left]

The unique real singularity is $s_1 = \log_e X$, and for large t, $f(t) \approx c\,X^t$, thus proving Theorem 5.