ECE 566 - INFORMATION THEORY
Prof. Thinh Nguyen
1
LOGISTICS Textbook: Elements of Information Theory, Thomas Cover and Joy
Thomas, Wiley, second edition
Class email list: ece566-w20@engr.orst.edu
2
LOGISTICS
Grading Policy
Quiz 1: 15%, Quiz 2: 15%, Quiz 3: 15%, Quiz 4: 15% (the lowest quiz score will be dropped)
Homework: 15%
Final: 40%
3
WHAT IS INFORMATION?
Information theory deals with the concept of information, its measurement, and its applications.
4
ORIGIN OF INFORMATION THEORY
Two schools of thought:

British
- Semantic information: related to the meaning of messages
- Pragmatic information: related to the usage and effect of messages

American
- Syntactic information: related to the symbols from which messages are composed, and their interrelations
5
ORIGIN OF INFORMATION THEORY
Consider the following sentences:
1. Jacqueline was awarded the gold medal by the judges at the national skating competition.
2. The judges awarded Jacqueline the gold medal at the national skating competition.
3. There is a traffic jam on I-5 between Corvallis and Eugene in Oregon.
4. There is a traffic jam on I-5 in Oregon.

(1) and (2): syntactically different but semantically and pragmatically identical.
(3) and (4): syntactically and semantically different; (3) gives more precise information than (4).
(3) and (4) are irrelevant for Californians.

6
ORIGIN OF INFORMATION THEORY
The British tradition is closely related to philosophy, psychology, and biology. Influential scientists include MacKay, Carnap, Bar-Hillel, … It is very hard, if not impossible, to quantify.

The American tradition deals with the syntactic aspects of information. Influential scientists include Shannon, Renyi, Gallager, and Csiszar, … It can be rigorously quantified by mathematics.
7
ORIGIN OF INFORMATION THEORY
Basic questions of information theory in the American tradition involve the measurement of syntactic information:
- The fundamental limits on the amount of information which can be transmitted
- The fundamental limits on the compression of information which can be achieved
- How to build information processing systems approaching these limits
8
ORIGIN OF INFORMATION THEORY
H. Nyquist (1924) published an article wherein he discussed how messages (characters) could be sent over a telegraph channel with maximum possible speed, but without distortion.

R. Hartley (1928) first defined a measure of information as the logarithm of the number of distinguishable messages one can represent using n characters, each of which can take s possible letters.
9
ORIGIN OF INFORMATION THEORY
For a message of length n, we have Hartley's measure of information:

$H_H(s^n) = \log(s^n) = n \log(s)$

For a message of length 1, we have Hartley's measure of information:

$H_H(s^1) = \log(s^1) = 1 \cdot \log(s)$

This definition corresponds with the intuitive idea that a message consisting of n symbols contains n times as much information as a message consisting of only one symbol.
10
ORIGIN OF INFORMATION THEORY
Note that the logarithm is not the only function that would satisfy our intuition at first glance; what we require is

$f(s^n) = n f(s)$

One can show that the only functions satisfying this equation are of the form

$f(s) = \log_a(s)$

The choice of a is arbitrary, and is more a matter of normalization.
11
AND NOW…, SHANNON'S INFORMATION THEORY (a.k.a. communication theory, mathematical information theory, or, in short, information theory)
12
CLAUDE SHANNON
“The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point.” (Claude Shannon 1948)
Channel Coding Theorem: It is possible to achieve near-perfect communication of information over a noisy channel.

In this course we will:
- Define what we mean by information
- Show how we can compress the information in a source to its theoretical minimum value, and show the tradeoff between data compression and distortion
- Prove the Channel Coding Theorem and derive the information capacity of different channels
1916-2001
13
$H(X) = -E[\log_2 p(X)] = -\sum_{x \in \mathcal{A}} p(x) \log_2 p(x)$
The little formula that starts it all
14
SOME FUNDAMENTAL QUESTIONS
(that can be addressed by information theory)

What is the minimum number of bits that can be used to represent the following text?
"There is a wisdom that is woe; but there is a woe that is madness. And there is a Catskill eagle in some souls that can alike dive down into the blackest gorges, and soar out of them again and become invisible in the sunny spaces. And even if he for ever flies within the gorge, that gorge is in the mountains; so that even in his lowest swoop the mountain eagle is still higher than other birds upon the plain, even though they soar."
- Moby Dick, Herman Melville
15
SOME FUNDAMENTAL QUESTIONS
(that can be addressed by information theory)

How fast can information be sent reliably over an error-prone channel?
"Thera is a wisdem that is woe; but thera is a woe that is madness. And there is a Catskill eagle in sme souls that can alike divi down into the blackst goages, and soar out of thea agein and become invisble in the sunny speces. And even if he for ever flies within the gorge, that gorge is in the mountains; so that even in his lotest swoap the mountin eagle is still highhr than other berds upon the plain, even though they soar."
- Moby Dick, Herman Melville
16
FROM QUESTIONS TO APPLICATIONS
17
SOME NOT SO FUNDAMENTAL QUESTIONS (that can be addressed by information theory)
How to make money? (Kelly’s Portfolio Theory)
What is the optimal dating strategy?
18
FUNDAMENTAL ASSUMPTION OF SHANNON'S INFORMATION THEORY
E.g., Newtonian universe is deterministic, all events can be predicted with 100% accuracy by Newton’s laws, given the initial conditions.
[Diagram: the Universe described by a physical model (laws)]
If there is no irreducible randomness, the description/history of our universe can therefore be compressed into a few equations.
19
FUNDAMENTAL ASSUMPTION OF SHANNON'S INFORMATION THEORY
Some well-defined stochastic process that generates all the observations
[Diagram: the Universe described by a probabilistic model]
Thus, in a certain sense, Shannon information theory is somewhat unsatisfactory, since it is based on an "ignorance model." Nevertheless, it is a good approximation, depending on how accurately the probabilistic model describes the phenomenon of interest.

20
FUNDAMENTAL ASSUMPTION OF SHANNON'S INFORMATION THEORY
"There is a wisdom that is woe; but there is a woe that is madness. And there is a Catskill eagle in some souls that can alike dive down into the blackest gorges, and soar out of them again and become invisible in the sunny spaces. And even if he for ever flies within the gorge, that gorge is in the mountains; so that even in his lowest swoop the mountain eagle is still higher than other birds upon the plain, even though they soar."
- Moby Dick, Herman Melville
We know the text above is not random, but we can build a probabilistic model for it.
For example, a first order approximation could be to build a histogram of different letters in the text, then approximate the probability that a particular letter appears based on the histogram (assuming iid).
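As a concrete sketch of that suggestion (not part of the original deck; the text sample is abbreviated), one might estimate a first-order i.i.d. letter model and its per-letter entropy like this:

```python
from collections import Counter
import math

# Sketch, assuming the simple i.i.d. letter model described above.
text = "There is a wisdom that is woe; but there is a woe that is madness."

counts = Counter(text)                       # histogram of letters
n = len(text)
probs = {ch: c / n for ch, c in counts.items()}

# First-order entropy estimate, in bits per letter.
H = -sum(p * math.log2(p) for p in probs.values())
print(round(H, 3))
```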
21
WHY IS SHANNON INFORMATION SOMEWHAT UNSATISFACTORY?
How much Shannon information does this picture contain?
22
ALGORITHMIC INFORMATION THEORY (KOLMOGOROV COMPLEXITY)
The Kolmogorov complexity K(x) of a sequence x is the minimum size of the program and its inputs needed to generate x.
Example: If x was a sequence of all ones, a highly compressible sequence, the program would simply be a print statement in a loop. On the other extreme, if x were a random sequence with no structure, then the only program that could generate it would contain the sequence itself. 23
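To make the contrast concrete (an illustrative sketch, not the lecture's notation):

```python
# Highly compressible: a constant-size "program" generates arbitrarily long x,
# so K(x) stays tiny no matter how large n gets.
n = 10 ** 6
x = "1" * n   # the print-statement-in-a-loop program is only a few bytes

# Incompressible: for a structureless random x, the shortest program
# essentially has to contain x itself, so K(x) is about len(x).
```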
REVIEW OF BASIC PROBABILITY
A discrete random variable X takes a value x from the alphabet A with probability p(x).
24
EXPECTED VALUE
If g(x) is real-valued and defined on A, then

$E_X[g(X)] = \sum_{x \in \mathcal{A}} p(x)\, g(x)$
25
SHANNON INFORMATION CONTENT
The Shannon Information Content of an outcome with probability p is –log2p
26
MINESWEEPER
Where is the bomb? 16 possibilities - needs 4 bits to specify
27
ENTROPY
$H(X) = -E[\log_2 p(X)] = -\sum_{x \in \mathcal{A}} p(x) \log_2 p(x)$
28
ENTROPY EXAMPLES
Bernoulli Random Variable
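For reference (standard result, not spelled out in the transcript): a Bernoulli(p) variable has entropy $H(p) = -p \log_2 p - (1-p) \log_2 (1-p)$, which is 0 at p = 0 or p = 1 and peaks at 1 bit when p = 1/2.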
29
ENTROPY EXAMPLES
Four Colored Shapes A = [ ; ; ; ]
p(X) = [1/2;1/4;1/8;1/8]
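Worked out (standard computation): H(X) = (1/2)·1 + (1/4)·2 + (1/8)·3 + (1/8)·3 = 1.75 bits.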
30
DERIVATION OF SHANNON ENTROPY
$H(X) = \sum_{i=1}^{n} p_i \log \frac{1}{p_i}$
Intuitive Requirements:
1. We want H to be a continuous function of probabilities pi . That is, a small change in pi should only cause a small change in H
2. If all events are equally likely, that is, pi = 1/n for all i, then H should be a monotonically increasing function of n. The more possible outcomes there are, the more information should be contained in the occurrence of any particular outcome.
3. It does not matter how we divide the group of outcomes, the total of information should be the same.
31
DERIVATION OF SHANNON ENTROPY
32
DERIVATION OF SHANNON ENTROPY
33
DERIVATION OF SHANNON ENTROPY
34
DERIVATION OF SHANNON ENTROPY
35
LITTLE PUZZLE
Given twelve coins, eleven of which are identical, and one which is either lighter or heavier than the rest. You have a scale that can determine whether what you put on the left pan is heavier, lighter, or the same as what you put on the right pan. What is the minimum number of weighings for you to determine the odd coin, and whether it is heavier or lighter than the rest?
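A hint at the information-theoretic bound (not stated on the slide): there are 12 × 2 = 24 possible answers and each weighing has 3 outcomes, so at least ⌈log₃ 24⌉ = 3 weighings are required.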
36
JOINT ENTROPY H(X,Y)
$H(X,Y) = -E[\log p(X,Y)]$

p(X,Y)   Y = 0   Y = 1
X = 0    1/2     1/4
X = 1    0       1/4

p(X,Y)   Y = 0   Y = 1
X = 0    1/4     1/4
X = 1    1/4     1/4
37
JOINT ENTROPY H(X,Y)

Example: $Y = X^2$, $X \in \{-1, 1\}$, $p_X = [0.1,\ 0.9]$

p(X,Y)   Y = 1
X = -1   0.1
X = 1    0.9
38
CONDITIONAL ENTROPY H(Y|X)
$H(Y|X) = -E[\log p(Y|X)]$

p(Y|X)   Y = 0   Y = 1
X = 0    2/3     1/3
X = 1    0       1
39
CONDITIONAL ENTROPY H(Y|X)
Example: $Y = X^2$, $X \in \{-1, 1\}$, $p_X = [0.1,\ 0.9]$

p(Y|X)   Y = 1
X = -1   1
X = 1    1

p(X|Y)   Y = 1
X = -1   0.1
X = 1    0.9

40
INTERPRETATIONS OF CONDITIONAL ENTROPY H(Y|X)

H(Y|X): the average uncertainty/information in Y when you know X.

H(Y|X): the weighted average of row entropies:

p(X,Y)   Y = 0   Y = 1   H(Y|X = x)   p(x)
X = 0    1/2     1/4     H(1/3)       3/4
X = 1    0       1/4     H(1)         1/4
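Worked out from the table: H(Y|X) = (3/4)·H(1/3) + (1/4)·H(1) = 0.75 × 0.918 + 0 ≈ 0.689 bits.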
41
CHAIN RULES
For two RVs,

$H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$

Proof:

In general,

$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_{i-1}, \ldots, X_1)$

Proof:
42
GRAPHICAL VIEW OF ENTROPY, JOINT ENTROPY, AND CONDITIONAL ENTROPY

[Venn-style diagram: H(X,Y) is the union of H(X) and H(Y); H(X|Y) and H(Y|X) are the non-overlapping parts]

$H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$
43
MUTUAL INFORMATION
The mutual information is the average amount of information that you get about X from observing the value of Y
$I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)$

[Venn-style diagram: I(X;Y) is the overlap of H(X) and H(Y) inside H(X,Y)]
44
MUTUAL INFORMATION
The mutual information is symmetrical
$I(X;Y) = I(Y;X)$
Proof:
45
MUTUAL INFORMATION EXAMPLE
If you try to guess Y, you have a 50% chance of being correct. However, what if you know X? Best guess: choose Y = X. What is the overall probability of guessing correctly?
p(X, Y) Y = 0 Y = 1
X = 0 1/2 1/4
X = 1 0 1/4
H(X) = 0.811, H(Y) = 1, H(X,Y) = 1.5
H(X|Y) = 0.5, H(Y|X) = 0.689, I(X;Y) = 0.311
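Worked out: guessing Y = X is correct with probability p(0,0) + p(1,1) = 1/2 + 1/4 = 3/4.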
46
CONDITIONAL MUTUAL INFORMATION
$I(X;Y|Z) = H(X|Z) - H(X|Y,Z) = H(X|Z) + H(Y|Z) - H(X,Y|Z)$
Note: Z conditioning applies to both X and Y
47
CHAIN RULE FOR MUTUAL INFORMATION
$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_{i-1}, \ldots, X_2, X_1)$
Proof:
48
EXAMPLE OF USING CHAIN RULE FOR MUTUAL INFORMATION

[Diagram: X and Z feed an adder; Y = X + Z]
Find I(X,Z;Y).
p(X, Z) Z = 0 Z = 1
X = 0 1/4 1/4
X = 1 1/4 1/4
49
CONCAVE AND CONVEX FUNCTIONS
f(x) is strictly convex over (a,b) if
$f(\lambda u + (1-\lambda) v) < \lambda f(u) + (1-\lambda) f(v), \quad \forall u \neq v \in (a,b),\ 0 < \lambda < 1$
Examples:
Technique to determine the convexity of a function:
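One standard technique (for reference): if f is twice differentiable, f''(x) > 0 on (a,b) implies strict convexity; e.g., f(x) = x² has f'' = 2 > 0 everywhere.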
50
JENSEN’S INEQUALITY
$f(X)$ convex: $E[f(X)] \geq f(E[X])$
$f(X)$ strictly convex: $E[f(X)] > f(E[X])$
Proof:
51
JENSEN’S INEQUALITY EXAMPLE
Mnemonic example: $f(x) = x^2$ is convex, so $E[X^2] \geq (E[X])^2$, i.e., $\mathrm{Var}(X) \geq 0$.
52
RELATIVE ENTROPY
Relative entropy, or Kullback-Leibler divergence, between two probability mass vectors (functions) p and q is defined as:

$D(p\|q) = \sum_{x \in \mathcal{A}} p(x) \log \frac{p(x)}{q(x)} = E_p\left[\log \frac{p(X)}{q(X)}\right] = -E_p[\log q(X)] - H(X)$

Properties of $D(p\|q)$:
53
RELATIVE ENTROPY EXAMPLE
Weather           Rain   Cloudy   Sunny
Seattle, p(x)     1/4    1/2      1/4
Corvallis, q(x)   1/3    1/3      1/3

Compute $D(p\|q)$ and $D(q\|p)$.

54
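Worked out in bits: D(p‖q) = (1/4)log₂(3/4) + (1/2)log₂(3/2) + (1/4)log₂(3/4) ≈ 0.085, while D(q‖p) = (1/3)log₂(4/3) + (1/3)log₂(2/3) + (1/3)log₂(4/3) ≈ 0.082, illustrating that D(p‖q) ≠ D(q‖p) in general.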
INFORMATION INEQUALITY
$D(p\|q) \geq 0$
Proof:
55
INFORMATION INEQUALITY
The uniform distribution has the highest entropy:
$H(X) \leq \log |\mathcal{A}|$
Proof:

Mutual information is non-negative:
$I(X;Y) \geq 0$
Proof:
56
INFORMATION INEQUALITY
Conditioning reduces entropy:
$H(X|Y) \leq H(X)$
Proof:

Independence bound:
$H(X_1, X_2, \ldots, X_n) \leq \sum_{i=1}^{n} H(X_i)$
Proof:
57
INFORMATION INEQUALITY
Conditional independence bound:
$H(X_1, \ldots, X_n \mid Y_1, \ldots, Y_n) \leq \sum_{i=1}^{n} H(X_i \mid Y_i)$
Proof:

Mutual information independence bound:
$I(X_1, \ldots, X_n;\, Y_1, \ldots, Y_n) \geq \sum_{i=1}^{n} I(X_i; Y_i)$
Proof:
58
SUMMARY
59
LECTURE 2: SYMBOL CODE
Symbol codes Non-singular codes Uniquely decodable codes Prefix codes (instantaneous codes)
Kraft Inequality Minimum Code Length
60
SYMBOL CODES
Mapping of letters from one alphabet to another
ASCII codes: computers do not process English language directly. Instead, English letters are first mapped into bits, then processed by the computer.
Morse codes
Decimal to binary conversion
The process of mapping letters in one alphabet to letters in another is called coding.

61
SYMBOL CODES
A symbol code is a simple mapping. Formally, a symbol code C is a mapping C: X → D⁺, where

X = the source alphabet, D = the destination alphabet, D⁺ = the set of all finite strings from D

Example: C: {X,Y,Z} → {0,1}⁺, with C(X) = 0, C(Y) = 10, C(Z) = 1
Codeword: the result of a mapping, e.g., 0, 10, … Code: a set of valid codewords. 62
SYMBOL CODES
Extension: C⁺ is a mapping X⁺ → D⁺ formed by concatenating C(x_i) without punctuation.

Example: C⁺(XXYXZY) = 00100110

Non-singular: x₁ ≠ x₂ → C(x₁) ≠ C(x₂)

Uniquely decodable: C⁺ is non-singular; that is, C⁺(x⁺) is unambiguous.
63
PREFIX CODES
Instantaneous or Prefix code: No codeword is a prefix of another
Prefix → uniquely decodable → non-singular
Examples:
     C1   C2   C3    C4
W    0    00   0     0
X    11   01   10    01
Y    00   10   110   011
Z    01   11   1110  0111
64
CODE TREE
Form a D-ary tree where D = |D|
D branches at each node Each branch denotes a symbol di in D Each leaf denotes a source symbol xi in X C(xi) = concatenation of symbols di along
the path from the root to the leaf that represents xi
Some leaves may be unused
[Code tree diagram: binary branches labeled 0 and 1; leaves correspond to W, X, Y, Z]

C(WWYXZ) =

65
KRAFT INEQUALITY FOR PREFIX CODES
For any prefix code, we have:
$\sum_{i=1}^{|\mathcal{X}|} D^{-l_i} \leq 1$

where $l_i$ is the length of codeword i and D = |D|.

Proof:
66
KRAFT INEQUALITY FOR UNIQUELY DECODABLE CODES

For any uniquely decodable code with codewords of lengths $l_1, l_2, \ldots, l_{|\mathcal{X}|}$:

$\sum_{i=1}^{|\mathcal{X}|} D^{-l_i} \leq 1$

Proof:
67
CONVERSE OF KRAFT INEQUALITY
If $\sum_{i=1}^{|\mathcal{X}|} D^{-l_i} \leq 1$, then there exists a prefix code with codeword lengths $l_1, l_2, \ldots, l_{|\mathcal{X}|}$.

Proof:
68
EXAMPLE OF CONVERSE OF KRAFT INEQUALITY
69
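One way to see the converse in action (a sketch of the usual constructive argument, with illustrative names; D = 2): sort the lengths and read each codeword off the binary expansion of the running Kraft sum.

```python
# Sketch (assumed construction, binary alphabet): given lengths that satisfy
# the Kraft inequality, assign each codeword the first l bits of the binary
# expansion of the cumulative Kraft sum so far; no codeword prefixes another.
def kraft_prefix_code(lengths):
    assert sum(2 ** -l for l in lengths) <= 1, "Kraft inequality violated"
    code, acc = [], 0.0
    for l in sorted(lengths):
        word = ''.join(str(int(acc * 2 ** (k + 1)) % 2) for k in range(l))
        code.append(word)
        acc += 2 ** -l          # advance past the interval this word covers
    return code

print(kraft_prefix_code([1, 2, 3, 3]))  # -> ['0', '10', '110', '111']
```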
OPTIMAL CODE
If l(C(x))= the length of C(x), then C is optimal if LC = E[l(C(X))] is as small as possible for a given p(X).
C is a uniquely decodable code → LC ≥H(X)/log2D
Proof:
70
SUMMARY
Symbol code
Kraft inequality for uniquely decodable code
Lower bound for any uniquely decodable code
71
LECTURE 3: PRACTICAL SYMBOL CODES
Fano code Shannon code Huffman code
72
FANO CODE
- Put the probabilities in decreasing order
- Split as close to 50-50 as possible; repeat with each half

73

H(p) = 2.81 bits, L_F = 2.89 bits. Not optimal! L_opt = 2.85 bits.
CONDITIONS FOR OPTIMAL PREFIX CODE (ASSUMING BINARY PREFIX CODE)

An optimal prefix code must satisfy:
- p(x_i) > p(x_j) → l(x_i) ≤ l(x_j) (else swap them)
- The two longest codewords must have the same length (else chop a bit off the longer codeword)
- In the tree corresponding to the optimum code, there must be two branches stemming from each intermediate node.
74
HUFFMAN CODE CONSTRUCTION
1. Take the two smallest p(xi) and assign each a different last bit. Then merge into a single symbol.
2. Repeat step 1 until only one symbol remains
75
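A minimal sketch of steps 1-2 (assuming binary codewords; the function and variable names are illustrative, not the lecture's notation):

```python
import heapq

# Binary Huffman coding: repeatedly merge the two least-probable symbols,
# prepending a different last bit to each side of the merge.
def huffman(probs):                      # probs: dict symbol -> probability
    heap = [(p, i, {s: ''}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)                 # avoids comparing dicts on ties
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)  # two smallest probabilities
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + w for s, w in c1.items()}
        merged.update({s: '1' + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

print(huffman({'x1': 0.4, 'x2': 0.2, 'x3': 0.2, 'x4': 0.1, 'x5': 0.1}))
```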
HUFFMAN CODE EXAMPLE
76
HUFFMAN CODE EXAMPLE (CONT)

Letter   Probability   Codeword
x1       0.4
x2       0.2
x3       0.2
x4       0.1
x5       0.1
77
HUFFMAN CODE IS OPTIMAL PREFIX CODE

78

We want to show that all of these codes (including c5) are optimal.
HUFFMAN OPTIMALITY PROOF
79
HOW SHORT IS THE OPTIMAL CODE?

If l(x) = length(C(x)), then C is optimal if $L_C = E[l(X)]$ is as small as possible.

We want to minimize

$\sum_i p(x_i)\, l(x_i)$

subject to:

$\sum_i D^{-l(x_i)} \leq 1$

and the $l(x_i)$ are integers.

80
OPTIMAL CODES (NON-INTEGER LENGTHS)

Minimize

$\sum_i p(x_i)\, l(x_i)$

subject to:

$\sum_i D^{-l(x_i)} \leq 1$

Use the Lagrange multiplier method:

81
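For reference, the standard Lagrangian step (consistent with the slide's setup) goes: with $J = \sum_i p_i l_i + \lambda\left(\sum_i D^{-l_i} - 1\right)$, setting $\partial J / \partial l_i = 0$ gives $p_i = \lambda \ln D \cdot D^{-l_i}$; summing over i and applying the constraint forces $D^{-l_i} = p_i$, i.e., $l_i = -\log_D p_i$.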
SHANNON CODE
Round up the optimal code lengths:

$l_i = \lceil -\log_D p(x_i) \rceil$

The $l_i$ satisfy the Kraft inequality → a prefix code exists. The construction for such a code was done earlier.

Average length:

$\frac{H(X)}{\log D} \leq L_S < \frac{H(X)}{\log D} + 1$

Proof:

82
SHANNON CODE EXAMPLES
x1 x2 x3 x4
p(xi) 1/2 1/4 1/8 1/8
83
x1 x2
p(xi) 0.99 0.01
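Worked out: the first example gives lengths ⌈−log₂ p(xᵢ)⌉ = 1, 2, 3, 3, so L_S = 1.75 bits = H(X) (the dyadic case). The second gives lengths ⌈−log₂ 0.99⌉ = 1 and ⌈−log₂ 0.01⌉ = 7, so L_S = 0.99·1 + 0.01·7 = 1.06 bits while H(X) ≈ 0.08 bits.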
HUFFMAN CODE VS. SHANNON CODE
84
           x1     x2     x3     x4
p(xi)      0.34   0.36   0.25   0.05
−log₂ p    1.56   1.47   2.00   4.32
Shannon    2      2      2      5
Huffman    2      1      3      3

L_S = 2.15 bits, L_H = 1.94 bits
H(X) = 1.78 bits
SHANNON COMPETITIVE OPTIMALITY
If l(x) is the length of a uniquely decodable code and l_S(x) is the length of the Shannon code, then

$P\big(l(X) < l_S(X) - c\big) \leq 2^{1-c}$

85
Proof:
DYADIC COMPETITIVE OPTIMALITY
If p(x) is dyadic ↔ log p(x_i) is an integer for all i, then

$P\big(l(X) < l_S(X)\big) \leq P\big(l_S(X) < l(X)\big)$

with equality iff l(X) ≡ l_S(X).

Proof:

86
SHANNON WITH WRONG DISTRIBUTION
If the real distribution of X is p(x) but you assign Shannon lengths using the distribution q(x), what is the penalty? Answer: D(p||q)
Proof:
87
SUMMARY
88
LECTURE 4 Stochastic Processes Entropy Rate Markov Processes Hidden Markov Processes
89
STOCHASTIC PROCESS
Stochastic process {Xi}=X1, X2, X3, … Define entropy of {Xi} as
H({Xi} ) = H(X1, X2, X3, …)= H(X1) + H(X2|X1) + H(X3|X2,X1) + ….
Define entropy rate of {Xi} as
Entropy rate estimates the additional entropy per new sample. Gives a lower bound on number of code bits per sample. If the Xi are not i.i.d. the entropy rate limit may not exist.
90
$\bar{H}(X) = \lim_{n \to \infty} \frac{H(X_1, X_2, \ldots, X_n)}{n}$
EXAMPLES

91
LIMIT OF CESARO MEAN
92
If $a_k \to b$ as $k \to \infty$, then

$\lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} a_k = b$
Proof:
STATIONARY PROCESS

A stochastic process {X_i} is stationary iff

$P(X_1 = a_1, X_2 = a_2, \ldots, X_n = a_n) = P(X_{1+k} = a_1, X_{2+k} = a_2, \ldots, X_{n+k} = a_n) \quad \forall n, k$

If {X_i} is stationary, then $\bar{H}(X)$ exists.

Proof:

93
BLOCK CODING IS MORE EFFICIENT THAN SINGLE-SYMBOL CODING

94
EXAMPLE OF BLOCK CODE
95
MARKOV PROCESS

A discrete-valued stochastic process {X_i} is Markov iff

$P(X_n \mid X_{n-1}, \ldots, X_1, X_0) = P(X_n \mid X_{n-1})$

Time-independent (homogeneous) if

$P(X_n = a \mid X_{n-1} = b) = P_{ab}$
96
STATIONARY MARKOV PROCESS
Finite state Markov process can be described using Markov chain diagram
97
[Markov chain diagram: two states a and b, with transition probabilities 0.1 and 0.9]
Any finite, aperiodic, irreducible Markov chain will converge to a stationary distribution regardless of starting distribution
STATIONARY MARKOV PROCESS
98
As $n \to \infty$, $P^n$ approaches a matrix in which each row is the stationary distribution.
STATIONARY MARKOV PROCESS EXAMPLE
99
$p_S^T = l^T$, where $l^T = l^T P$
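As a quick numerical check (a sketch assuming the 0.9/0.1 two-state chain from the diagram above):

```python
import numpy as np

# Repeated multiplication of the transition matrix: every row of P^n
# converges to the stationary distribution, here [0.5, 0.5].
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])
Pn = np.linalg.matrix_power(P, 100)
print(Pn)   # approximately [[0.5, 0.5], [0.5, 0.5]]
```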
CHESS BOARD
100
EXAMPLE OF CODING FOR MARKOV SOURCE

[Markov chain diagram: two states a and b, with transition probabilities 0.1 and 0.9]

Find $\bar{H}(X)$:
a) Assuming the Markov model above with block code n = 2
b) Assuming i.i.d. and symbol code
c) Assuming i.i.d. and block code with n = 2
d) Assuming i.i.d. and block code with n = 3
Which one of the above is minimum? Do you think when n = 1000, block code assuming i.i.d would be lower or higher than the answer in a)? Provide reason for your answer.
101
ALOHA WIRELESS EXAMPLE

M users share a wireless transmission channel. For each user, independently in each time slot:
- If the queue is non-empty, transmit with probability q
- A new packet arrives for transmission with probability p
- If two packets collide, they stay in the queue

At time t, the queue sizes are X_t = (n_1, …, n_M). {X_t} is Markov since P(X_t | X_{t-1}) depends only on X_{t-1}.

Transmit vector Y_t: {Y_t} is not Markov since P(Y_t) depends on X_{t-1}, not Y_{t-1}. {Y_t} is called a Hidden Markov Process.
102
ALOHA WIRELESS EXAMPLE
103
HIDDEN MARKOV PROCESS

If {X_i} is a Markov process and Y = f(X), then {Y_i} is a stationary Hidden Markov Process.

What is the entropy rate $\bar{H}(Y)$?
104
HIDDEN MARKOV PROCESS
105
HIDDEN MARKOV PROCESS
106
SUMMARY Entropy rate
Stationary
Stationary Markov
Hidden Markov
107
LECTURE 5 Stream codes Arithmetic code Lempel-Ziv code
108
WHY NOT USE HUFFMAN CODE? (SINCE IT IS OPTIMAL)

Good:
- It is optimal, i.e., the shortest possible code

Bad:
- Bad for skewed probabilities. Employing block coding helps, i.e., use all N symbols in a block; the redundancy is now 1/N. However:
- Must re-compute the entire table of block symbols if the probability of a single symbol changes
- For N symbols, a table of size |X|^N must be pre-calculated to build the tree
- Symbol decoding must wait until an entire block is received

109
ARITHMETIC CODE
Developed by IBM (too bad they didn't patent it). The majority of current compression schemes use it.

Good:
- No need for pre-calculated tables of big size
- Easy to adapt to changes in probability
- A single symbol can be decoded immediately, without waiting for the entire block of symbols to be received
110
BASIC IDEA IN ARITHMETIC CODING
Sort the symbols in lexical order.

Represent each string x of length n by a unique interval $[F(x_{i-1}), F(x_i))$ in [0,1), where $F(x_i)$ is the cumulative distribution. $F(x_i) - F(x_{i-1})$ represents the probability of $x_i$ occurring.

The interval $[F(x_{i-1}), F(x_i))$ can itself be represented by any number, called a tag, within the half-open interval.

The significant bits of the tag 0.t₁t₂t₃... are the code of x. That is, 0.t₁t₂t₃...t_k000... is in the interval $[F(x_{i-1}), F(x_i))$.

The code length is

$l(x) = \left\lceil \log \frac{1}{P(x)} \right\rceil + 1$

111
TAG IN ARITHMETIC CODE
Choose the tag as the midpoint of the interval:

$\bar{T}_X(x_i) = \sum_{k=1}^{i-1} P(X = x_k) + \frac{1}{2} P(X = x_i) = F_X(x_{i-1}) + \frac{1}{2} P(X = x_i) = \frac{F_X(x_{i-1}) + F_X(x_i)}{2}$

112
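A small sketch of the interval/tag computation (assuming an i.i.d. model and lexical symbol order; the function name is illustrative):

```python
from fractions import Fraction
import math

# Narrow [0,1) symbol by symbol, then emit the midpoint tag and the code
# length ceil(log2 1/P(x)) + 1, as defined above.
def arithmetic_tag(seq, probs):          # probs: dict symbol -> Fraction
    lo, width = Fraction(0), Fraction(1)
    symbols = sorted(probs)              # lexical order
    for s in seq:
        cum = sum((probs[t] for t in symbols if t < s), Fraction(0))
        lo, width = lo + width * cum, width * probs[s]
    tag = lo + width / 2                 # midpoint of the final interval
    length = math.ceil(math.log2(1 / width)) + 1
    return tag, length

probs = {'1': Fraction(1, 2), '2': Fraction(1, 4),
         '3': Fraction(1, 8), '4': Fraction(1, 8)}
print(arithmetic_tag('21', probs))       # -> (9/16, 4), i.e., tag .1001
```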
ARITHMETIC CODING EXAMPLE
113
ARITHMETIC CODING EXAMPLE
114
EXAMPLE OF ARITHMETIC CODING
115
Source distribution: symbols 1, 2, 3, 4 with P(X) = [0.5, 0.25, 0.125, 0.125]

Symbol   P(x)     F(x)    T_X(x)   In binary   ⌈log 1/P(x)⌉+1   Code
1        0.5      .5      .25      .010        2                01
2        0.25     .75     .625     .101        3                101
3        0.125    .875    .8125    .1101       4                1101
4        0.125    1.0     .9375    .1111       4                1111
ARITHMETIC CODE FOR TWO-SYMBOL SEQUENCES

Message   P(x)      T_X(x)      T_X(x) in binary   ⌈log 1/P(x)⌉+1   Code
11        .25       .125        .001               3                001
12        .125      .3125       .0101              4                0101
13        .0625     .40625      .01101             5                01101
14        .0625     .46875      .01111             5                01111
21        .125      .5625       .1001              4                1001
22        .0625     .65625      .10101             5                10101
23        .03125    .703125     .101101            6                101101
24        .03125    .734375     .101111            6                101111
31        .0625     .78125      .11001             5                11001
32        .03125    .828125     .110101            6                110101
33        .015625   .8515625    .1101101           7                1101101
34        .015625   .8671875    .1101111           7                1101111
41        .0625     .90625      .11101             5                11101
42        .03125    .953125     .111101            6                111101
43        .015625   .9765625    .1111101           7                1111101
44        .015625   .9921875    .1111111           7                1111111
PROGRESSIVE DECODING
117
[Figure: progressive decoding along the cumulative distribution F(x)]
UNIQUENESS OF THE ARITHMETIC CODE
Idea for proving that the arithmetic code is uniquely decodable:
- Show that the truncation of the tag lies entirely within $[F(x_{i-1}), F(x_i))$, hence the code is non-singular
- Show that the truncation of the tag results in a prefix code, hence uniquely decodable

Proof:
118
UNIQUENESS OF THE ARITHMETIC CODE
119
UNIQUENESS OF THE ARITHMETIC CODE
120
EFFICIENCY OF THE ARITHMETIC CODE
Off by at most 2/N:

$\frac{H(X_1, X_2, \ldots, X_N)}{N} \leq l_A < \frac{H(X_1, X_2, \ldots, X_N)}{N} + \frac{2}{N}$

Proof:

121
EFFICIENCY OF THE ARITHMETIC CODE
122
ADAPTIVE ARITHMETIC CODE
123
DICTIONARY CODING
index   pattern
1       a
2       b
3       ab
…
n       abc

The encoder codes the index.

[Diagram: Encoder → indices → Decoder]

Both encoder and decoder are assumed to have the same dictionary (table).
ZIV-LEMPEL CODING (ZL OR LZ)

Named after J. Ziv and A. Lempel (1977).
- Adaptive dictionary technique
- Store previously coded symbols in a buffer
- Search for the current sequence of symbols to code
- If found, transmit buffer offset and length

LZ77

[Example figure: the sequence "a b c a b d a c a b d e e e f" split into a search buffer and a look-ahead buffer; the output triplet <offset, length, next> is transmitted to the decoder]

If the size of the search buffer is N and the size of the alphabet is M, we need … bits to code a triplet.

Variation: use a VLC to code the triplets! (PKZip, Zip, Lharc, PNG, gzip, ARJ)
DRAWBACKS OF LZ77

Repetitive patterns with a period longer than the search buffer size are not found. If the search buffer size is 4, the sequence
a b c d e a b c d e a b c d e a b c d e …
will be expanded, not compressed.
SUMMARY
Stream codes Arithmetic Lempel-Ziv
- Encoder and decoder operate sequentially; no blocking of input symbols is required
- Not forced to send ≥ 1 bit per input symbol; can achieve the entropy rate even when H(X) < 1
- Require a perfect channel: a single transmission error causes multiple wrong output symbols; use finite-length blocks to limit the damage

128
LECTURE 6 Markov chains Data Processing theorem
You can’t create information from nothing (no free lunch) Fano’s inequality
Lower bound for error in estimating X from Y
129
MARKOV CHAIN
Given 3 random variables X, Y, Z, they form a Markov chain, denoted X → Y → Z, if

$p(x,y,z) = p(x)\, p(y|x)\, p(z|y) \iff p(z|x,y) = p(z|y)$

A Markov chain X → Y → Z means:
- The only way that X affects Z is through the value of Y
- $I(X;Z|Y) = 0 \iff H(Z|Y) = H(Z|X,Y)$
- If you already know Y, then observing X gives you no additional information about Z
- If you know Y, then observing Z gives you no additional information about X
- A common special case of a Markov chain is when Z = f(Y)

130
MARKOV CHAIN SYMMETRY
X → Y → Z ↔ X and Z are conditionally independent given Y:

$p(x, z \mid y) = p(x \mid y)\, p(z \mid y)$

X → Y → Z ↔ Z → Y → X

131
DATA PROCESSING THEOREM
If X→Y→Z then I(X ;Y) ≥ I(X ; Z) Processing Y cannot add new information about X
If X→Y→Z then I(X ;Y) ≥ I(X ; Y | Z) Knowing Z can only decrease the amount X tells you
about Y Proof:
132
NON-MARKOV: CONDITIONING INCREASES MUTUAL INFORMATION
Noisy channel Z = X+Y
133
LONG MARKOV CHAIN
X₁ → X₂ → X₃ → … → X_n: the mutual information increases as you get closer:

$I(X_1; X_2) \geq I(X_1; X_3) \geq I(X_1; X_4) \geq \cdots \geq I(X_1; X_n)$

134
SUFFICIENT STATISTICS
If pdf of X depends on a parameter θ and you extract a statistic T(X) from your observation, then θ → X → T(X ) ⇒ I (θ ;T(X )) ≤ I (θ ;X )
T(X) is sufficient for θ if the stronger condition holds:

$\theta \to T(X) \to X \iff p(x \mid T(X), \theta) = p(x \mid T(X)) \iff I(\theta; T(X)) = I(\theta; X)$

135
EXAMPLES SUFFICIENT STATISTICS
136
FANO’S INEQUALITY
If we estimate X from Y, what is $P(\hat{X} \neq X)$?

N is the size of the outcome set of X.

$P_e \equiv P(\hat{X} \neq X) \geq \frac{H(X|Y) - H(P_e)}{\log(N-1)} \geq \frac{H(X|Y) - 1}{\log(N-1)}$

Proof:

137
FANO’S INEQUALITY EXAMPLE
138
SUMMARY
Markov
Data Processing Theorem
Fano’s Inequality
139
LECTURE 7 Weak Law of Large Numbers The Typical Set Asymptotic Equipartition Principle
140
STRONG AND WEAK TYPICALITY
Strongly typical sequences: ABAACABCAABC — correct proportions
H(X) = −(0.5 log(0.5) + 0.5 log(0.25)) = 1.5 bits

Weakly typical sequences: BBBBBBBBBBAA — incorrect proportions

141

       A     B     C
P(X)   0.5   0.25  0.25
CONVERGENCE OF RANDOM NUMBERS
Almost sure convergence:

$\Pr\left(\lim_{n \to \infty} X_n = X\right) = 1$

Example:

142
CONVERGENCE OF RANDOM NUMBERS
Convergence in probability:

$\lim_{n \to \infty} \Pr(|X_n - X| \geq \varepsilon) = 0, \quad \forall \varepsilon > 0$

Example:

143
WEAK LAW OF LARGE NUMBERS
Given i.i.d. $\{X_i\}$, let $s_n = \frac{1}{n} \sum_{i=1}^{n} X_i$. Then:

$E[s_n] = E[X] = \mu, \qquad \mathrm{Var}(s_n) = \frac{1}{n}\mathrm{Var}(X) = \frac{\sigma^2}{n}$

As n increases, the variance of $s_n$ decreases, i.e., the values of $s_n$ become clustered around the mean.

WLLN: $s_n \xrightarrow{\text{prob}} \mu$, i.e., $P(|s_n - \mu| \geq \varepsilon) \to 0$ as $n \to \infty$.

144
PROOF OF WEAK LAW OF LARGENUMBERS
Chebyshev’s Inequality
WLLN
145
TYPICAL SET
$x^n$ is the i.i.d. sequence $\{X_i\}$ for $i = 1$ to $n$.

$T_n^\varepsilon = \{x^n \in \mathcal{X}^n : \ |-n^{-1} \log p(x^n) - H(X)| < \varepsilon\}$

146
EXAMPLE OF TYPICAL SET
Bernoulli with p
147
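A brute-force sketch of $T_n^\varepsilon$ for a Bernoulli(p) source (illustrative only; exhaustive enumeration is feasible just for small n):

```python
import math
from itertools import product

def typical_set(n, p, eps):
    H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    out = []
    for xs in product((0, 1), repeat=n):
        logp = sum(math.log2(p if x else 1 - p) for x in xs)
        if abs(-logp / n - H) < eps:     # within eps of the entropy rate
            out.append(xs)
    return out

T = typical_set(8, 0.3, 0.2)
print(len(T), 2 ** 8)   # the typical set is a small fraction of all 2^n
```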
TYPICAL SETS
148
TYPICAL SET PROPERTIES
149
ASYMPTOTIC EQUIPARTITION PRINCIPLE
150
For any $\varepsilon$ and $n > N_\varepsilon$, almost any event is almost equally surprising.
SOURCE CODING AND DATA COMPRESSION
151
152
What $N_\varepsilon$ ensures that $P(x^n \in T_n^\varepsilon) > 1 - \varepsilon$?
SMALLEST HIGH PROBABILITY SET
153
SUMMARY
154
LECTURE 8 Source and Channel Coding Discrete Memoryless Channels
Binary Symmetric Channel Binary Erasure Channel Asymmetric Channel
Channel Capacity
155
SOURCE AND CHANNEL CODING
Source Coding
Channel Coding156
DISCRETE MEMORYLESS CHANNEL
Discrete input, discrete output Channel matrix Q
Memoryless
157
BINARY CHANNELS
Binary symmetric channel
Binary erasure channel
Z channel
158
WEAKLY SYMMETRIC CHANNELS
1. All columns of Q have the same sum
2. Rows of Q are permutations of each other
159
SYMMETRIC CHANNEL
1. Rows of Q are permutations of each other
2. Columns of Q are permutations of each other
160
CHANNEL CAPACITY
Capacity of a discrete memoryless channel:

$C = \max_{p(x)} I(X;Y)$

Capacity of n uses of the channel:

$C_n = \frac{1}{n} \max_{p(x^n)} I(X_1, X_2, \ldots, X_n;\ Y_1, Y_2, \ldots, Y_n)$

161
MUTUAL INFORMATION
162
MUTUAL INFORMATION IS CONCAVE IN p(x)

Mutual information is concave in p(x) for a fixed p(y|x).

Proof:

163
MUTUAL INFORMATION IS CONVEX IN p(y|x)

Mutual information is convex in p(y|x) for a fixed p(x).

Proof:

164
N USE CHANNEL CAPACITY
165
CAPACITY OF WEAKLY SYMMETRIC CHANNEL
166
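(Standard result, stated here for reference since the slide body was not captured: for a weakly symmetric channel, C = log|Y| − H(row of Q), achieved by the uniform input distribution.)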
BINARY ERASURE CHANNEL
167
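(Standard result, for reference: the binary erasure channel with erasure probability α has capacity C = 1 − α.)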
ASYMMETRIC CHANNEL CAPACITY
168
SUMMARY
169
LECTURE 9 Jointly Typical Sets Joint AEP Channel Coding Theorem
170
SIGNIFICANCE OF MUTUAL INFORMATION
171
JOINTLY TYPICAL SETS
172
$J_n^\varepsilon = \{(x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n : \ |-n^{-1} \log p(x^n) - H(X)| < \varepsilon, \ |-n^{-1} \log p(y^n) - H(Y)| < \varepsilon, \ |-n^{-1} \log p(x^n, y^n) - H(X,Y)| < \varepsilon\}$
JOINTLY TYPICAL EXAMPLE
173
p(x,y)   Y = 0    Y = 1
X = 0    0.5      0.125
X = 1    0.125    0.25
JOINTLY TYPICAL DIAGRAM
174
JOINTLY TYPICAL SET PROPERTIES
175
JOINT AEP
176
If $p(x'^n) = p(x^n)$, $p(y'^n) = p(y^n)$, and $x'^n$, $y'^n$ are independent, then

$(1 - \varepsilon)\, 2^{-n(I(X;Y) + 3\varepsilon)} \leq P\big((x'^n, y'^n) \in J_n^\varepsilon\big) \leq 2^{-n(I(X;Y) - 3\varepsilon)}$
CHANNEL CODE
177
[Diagram: $w \in \{1, 2, \ldots, M\}$ → Encoder → $x^n$ → Noisy channel → $y^n$ → Decoder → $\hat{w} = g(y^n) \in \{1, 2, \ldots, M\}$]
ACHIEVABLE CODE RATES
178
CHANNEL CODING THEOREM
179
CHANNEL CODING THEOREM
180
SUMMARY
181
LECTURE 10 Channel Coding Theorem
182
CHANNEL CODING: MAIN IDEA
183
[Diagram: $x^n$ → Noisy channel → $y^n$]
RANDOM CODE
184
RANDOM CODING IDEA
185
[Diagram: random codebook $x_1, \ldots$; chosen $x^n$ → Noisy channel → $y^n$]
ERROR PROBABILITY OF RANDOM CODE
186
CODE SELECTION AND EXPURGATION
187
PROCEDURE SUMMARY
188
CONVERSE OF CHANNEL CODING THEOREM
189
LECTURE 11 Minimum Bit Error Rate Channel with Feedback Joint Source Channel Coding
190
MINIMUM BIT ERROR RATE
191
CHANNEL WITH FEEDBACK
192
CAPACITY WITH FEEDBACK
193
EXAMPLE: BINARY ERASURE CHANNEL WITH FEEDBACK
194
JOINT SOURCE CHANNEL CODING
195
[Diagram: $v^n$ → Encoder → $x^n$ → Channel → $y^n$ → Decoder → $\hat{v}^n$]
SOURCE CHANNEL PROOF (→)
196
SOURCE CHANNEL PROOF (←)
197
SEPARATION THEOREM
198