SOURCE CODING PROF. A.M.ALLAM
LECTURES 11/13/2018
Lecture6: ARITHMETIC CODES
It has been shown that the Huffman algorithm will generate a code whose rate is within Pmax + 0.086 of the entropy, where Pmax is the probability of the most frequently occurring symbol.
In applications where the alphabet size is large, Pmax is generally quite small, and the deviation of the average code length from the entropy (especially as a percentage of the rate) is quite small.
However, in cases where the alphabet size is small and the probabilities of the different letters are skewed, the value of Pmax can be quite large and the Huffman code can become rather inefficient when compared to the entropy.
Ex: Find the Huffman code for the following source given the corresponding probabilities
Letter Probability Code
a1 0.95 0
a2 0.02 10
a3 0.03 11
$$L = 0.95 \times 1 + 0.02 \times 2 + 0.03 \times 2 = 1.05 \text{ bits/symbol}$$
$$H = 0.95\log_2\frac{1}{0.95} + 0.02\log_2\frac{1}{0.02} + 0.03\log_2\frac{1}{0.03} = 0.335 \text{ bits/symbol}$$
The redundancy is ρ = L − H = 0.715 bits/symbol, which is 0.715/0.335 = 213% of the entropy; i.e., the redundancy alone is more than twice the entropy, so coding this sequence takes about three times the number of bits promised by the entropy.
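The numbers in this example can be checked with a few lines of Python. This is only a sketch: the names `source`, `entropy`, and `average_length` are chosen here for illustration and do not come from the lecture.

```python
import math

# Source from the example: symbol -> (probability, Huffman codeword)
source = {"a1": (0.95, "0"), "a2": (0.02, "10"), "a3": (0.03, "11")}

def entropy(probs):
    """First-order entropy in bits/symbol."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def average_length(table):
    """Average codeword length in bits/symbol."""
    return sum(p * len(code) for p, code in table.values())

H = entropy(p for p, _ in source.values())
L = average_length(source)
redundancy = L - H                      # rho = L - H
p_max = max(p for p, _ in source.values())

print(f"H = {H:.3f} bits/symbol, L = {L:.2f} bits/symbol")
print(f"redundancy = {redundancy:.3f} bits/symbol ({redundancy / H:.0%} of H)")
print(f"bound Pmax + 0.086 = {p_max + 0.086:.3f} bits/symbol")
```

The redundancy stays below Pmax + 0.086, as the bound requires, but with Pmax = 0.95 that bound is too loose to guarantee an efficient code.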
Encoding the source symbols in longer blocks of symbols can get a rate closer to entropy
Letter Probability Code
a1a1 0.9025 0
a1a2 0.0190 111
a1a3 0.0285 100
a2a1 0.0190 1101
a2a2 0.0004 110011
a2a3 0.0006 110001
a3a1 0.0285 101
a3a2 0.0006 110010
a3a3 0.0009 110000
$$L = 1.222 \text{ bits}/2 \text{ symbols} = 0.611 \text{ bits/symbol}$$
$$H = 0.335 \text{ bits/symbol}$$
The redundancy is ρ = L − H = 0.276 bits/symbol, which is 0.276/0.335 = 82% of the entropy.
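As a rough check of the blocked code, the sketch below recomputes the block probabilities and the average length from the table. It assumes the letters are independent, which is what the tabulated products imply; the variable names are illustrative only.

```python
from itertools import product

# Single-letter probabilities and the two-letter block code from the table
p = {"a1": 0.95, "a2": 0.02, "a3": 0.03}
block_code = {
    ("a1", "a1"): "0",      ("a1", "a2"): "111",    ("a1", "a3"): "100",
    ("a2", "a1"): "1101",   ("a2", "a2"): "110011", ("a2", "a3"): "110001",
    ("a3", "a1"): "101",    ("a3", "a2"): "110010", ("a3", "a3"): "110000",
}

# Probability of a block = product of its letter probabilities (independence assumed)
block_prob = {pair: p[pair[0]] * p[pair[1]] for pair in product(p, repeat=2)}

avg_bits_per_block = sum(block_prob[b] * len(code) for b, code in block_code.items())
avg_bits_per_symbol = avg_bits_per_block / 2

print(f"{avg_bits_per_block:.4f} bits/block, {avg_bits_per_symbol:.3f} bits/symbol")
```

The exact average is 1.2215 bits per block, which the slide rounds to 1.222.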
If we group the symbols in blocks of 8, the redundancy drops to acceptable values; the corresponding alphabet size for this level of blocking is 3^8 = 6561.
A code of this size is impractical for a number of reasons:
1. Storage of a code like this requires memory that may not be available for many applications.
2. While it may be possible to design reasonably efficient encoders, decoding a Huffman code of this size would be a highly inefficient and time-consuming procedure.
Huffman's original algorithm is optimal for symbol-by-symbol coding (i.e., a stream of unrelated symbols) with a known input probability distribution.
It is not optimal when the symbol-by-symbol restriction is dropped, or when the probability mass function is unknown.
The extended alphabet grows this fast because there must be a codeword for every possible combination of symbols in a block, so the number of blocks increases exponentially with the block length.
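A quick way to see the exponential growth is to tabulate the number of blocks against the block length. The helper name `extended_alphabet_size` is hypothetical.

```python
def extended_alphabet_size(alphabet_size, block_length):
    """Number of distinct blocks: one codeword is needed for each of them."""
    return alphabet_size ** block_length

# Three-letter source, as in the Huffman example
for n in (1, 2, 4, 8):
    print(f"block length {n}: {extended_alphabet_size(3, n)} codewords")
```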
We need a way of assigning codewords to a particular sequence of length m without having to generate codes for all sequences of that length. The arithmetic coding technique fulfills this requirement.
Arithmetic coding is similar to Huffman coding; they both achieve their compression by reducing the average number of bits required to represent a symbol.
Unlike Huffman coding, arithmetic coding provides the ability to represent symbols with fractional values, since the whole sequence is coded as one fractional number (in floating-point or, rather, fixed-point representation).
Arithmetic coding is especially useful when dealing with:
1. Sources with small alphabets, such as binary sources
2. Alphabets with highly skewed probabilities
3. Streams of input symbols that are to be replaced by a single floating-point number in [0, 1)
4. Applications where the modeling and coding aspects of lossless compression are to be kept separate
In arithmetic coding, a unique identifier or tag is generated for the sequence to be encoded. This tag corresponds to a binary fraction, which becomes the binary code for the sequence.
A unique arithmetic code can be generated for a sequence of length m without the need for generating codewords for all sequences of length m.
(A) Generate a unique tag or identifier
- In order to distinguish a sequence of symbols from another sequence of symbols, we need a unique identifier or tag.
- One possible set of tags for representing sequences of symbols is the set of numbers in the unit interval [0, 1).
Square brackets '[' and ']' mean the adjacent number is included.
Parentheses '(' and ')' mean the adjacent number is excluded.
- Because the number of numbers in the unit interval is infinite, it should be possible to assign a unique tag to each distinct sequence of symbols.
- In order to do that, we need a function that will map sequences of symbols into the unit interval. This function is the Cumulative Distribution Function (CDF) of the random variable associated with the source.
- Consider A = {a1, a2, …, am} as the alphabet for a discrete source and X as a random variable. We will use the mapping
$$X(a_i) = i$$
i.e., we map the symbols or letters to numbers.
This mapping means that, given a probability model P for the source, we also have the probability mass function of the random variable,
$$P(X = i) = P(a_i)$$
and the CDF of X is
$$F_X(i) = \sum_{k=1}^{i} P(X = k)$$
Hence, for each symbol ai with a nonzero probability we have a
distinct value of FX(i) in the unit interval
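The mapping from symbol probabilities to CDF values can be sketched as a running sum. The probabilities 0.7, 0.1, 0.2 are illustrative values (the ones used in the graphical example of this lecture), and `cdf` is a name chosen here.

```python
from itertools import accumulate

def cdf(probs):
    """Return [F_X(0), F_X(1), ..., F_X(m)] for probs = [P(X=1), ..., P(X=m)]."""
    return [0.0] + list(accumulate(probs))

F = cdf([0.7, 0.1, 0.2])
print([round(value, 4) for value in F])

# With nonzero probabilities the CDF values are strictly increasing,
# so every symbol gets its own distinct F_X(i).
assert all(a < b for a, b in zip(F, F[1:]))
```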
Generating Tag Graphically:
Divide the unit interval into subintervals of the form
[𝐹𝑋(i− 1), 𝐹𝑋(i)), i= 1, . . ., m
Ex: For the alphabet source A ={ a1, a2, a3 } with P(a1)=0.7, P(a2)=0.1,
and P(a3)=0.2
Using the mapping equations, FX ( 1) = 0.7, FX ( 2) = 0.8, and FX ( 3) = 1
Basically, the procedure for generating the tag works by
reducing the size of the interval in which the tag resides
as more and more elements of the sequence are received
i=1, [𝐹𝑋(0), 𝐹𝑋(1)) = [0, 0.7)
i=2, [𝐹𝑋(1), 𝐹𝑋(2)) = [0.7, 0.8)
i=3, [𝐹𝑋(2), 𝐹𝑋(3)) = [0.8, 1)
We associate the subinterval [𝐹𝑋(i− 1), 𝐹𝑋(i)), with the
symbol ai ; a1 , a2, a3 respectively
[Figure: the unit interval marked at 𝐹𝑋(0)=0.0, 𝐹𝑋(1)=0.7, 𝐹𝑋(2)=0.8, 𝐹𝑋(3)=1.0, with the three subintervals labeled a1, a2, a3]
For a sequence of symbols of length one, the tag is the midpoint of the corresponding interval: 0.35 for a1, 0.75 for a2, and 0.9 for a3.
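The partition and the length-one midpoint tags can be reproduced with a short sketch. The function `subintervals` and its `low`/`high` parameters are assumptions made for illustration, not part of the lecture.

```python
def subintervals(probs, low=0.0, high=1.0):
    """Split [low, high) in proportion to probs; returns a list of (lower, upper)."""
    width = high - low
    bounds, cumulative = [], 0.0
    for p in probs:
        bounds.append((low + width * cumulative, low + width * (cumulative + p)))
        cumulative += p
    return bounds

probs = [0.7, 0.1, 0.2]                      # P(a1), P(a2), P(a3)
intervals = subintervals(probs)
tags = [(lower + upper) / 2 for lower, upper in intervals]

for name, (lower, upper), tag in zip(("a1", "a2", "a3"), intervals, tags):
    print(f"{name}: [{lower:.2f}, {upper:.2f})  midpoint tag {tag:.2f}")
```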
Suppose we want to encode the sequence of symbols a1 a2 a3.
The appearance of the first symbol in the sequence restricts the interval containing the tag to one of the subintervals for a1, a2, or a3:
If the first symbol in the input stream to be encoded is a1, the tag lies in the interval [0, 0.7).
If the first symbol in the input stream to be encoded is a2, the tag lies in the interval [0.7, 0.8).
If the first symbol in the input stream to be encoded is a3, the tag lies in the interval [0.8, 1).
Once the interval containing the tag has been determined, the rest of the unit interval is discarded, and this restricted interval is again partitioned in exactly the same proportions as the original interval.
Here the first symbol is a1, so the subinterval [0, 0.7) is partitioned in exactly the same proportions as the original interval, yielding the subintervals [0.0, 0.49), [0.49, 0.56), and [0.56, 0.7).
As before, the first partition [0.0, 0.49) corresponds to the symbol a1, the second partition [0.49, 0.56) corresponds to the symbol a2, and the third partition [0.56, 0.7) corresponds to the symbol a3.
[Figure: the unit interval partitioned at 0.7 and 0.8, with the a1 subinterval [0.0, 0.7) expanded and partitioned at 0.49 and 0.56 for a1, a2, a3]
(3) Each succeeding symbol causes the tag to be restricted to a subinterval that is further
partitioned in the same proportions as the original interval and so on
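[Figure: successive partitions of the unit interval while encoding a1 a2 a3]
Symbol received / Interval containing the tag / Partition points for the next symbol
(none) / [0.0, 1.0) / 0.7, 0.8
a1 / [0.0, 0.7) / 0.49, 0.56
a2 / [0.49, 0.56) / 0.539, 0.546
a3 / [0.546, 0.56) / 0.5558, 0.5572
Each limit of a new subinterval is computed as
new limit = initial + (final − initial) × (corresponding limit in the first partition)
where initial and final are the lower and upper limits of the current interval, so (final − initial) is its range.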
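The narrowing procedure can be sketched as a loop that applies the same proportions to the current interval for every symbol. This is a floating-point illustration under the assumption that symbols are given by their 1-based index i; the function name `narrow` is hypothetical, and practical coders use fixed-point arithmetic instead.

```python
def narrow(sequence, probs):
    """Return the list of intervals [low, high) after each symbol of `sequence`.

    `sequence` holds 1-based symbol indices; `probs` holds P(X=1), ..., P(X=m).
    """
    # F[i] = F_X(i), with F[0] = F_X(0) = 0
    F = [0.0]
    for p in probs:
        F.append(F[-1] + p)

    low, high = 0.0, 1.0
    history = [(low, high)]
    for i in sequence:
        width = high - low                     # range of the current interval
        # new limit = initial + range * limit in the first partition
        low, high = low + width * F[i - 1], low + width * F[i]
        history.append((low, high))
    return history

steps = narrow([1, 2, 3], [0.7, 0.1, 0.2])    # encode a1 a2 a3
for low, high in steps:
    print(f"[{low:.4f}, {high:.4f})")
```

Each new interval has a width equal to the probability of the sequence seen so far (0.7, then 0.07, then 0.014), which is why longer or less likely sequences need more digits to pin down.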
Any number in the final interval can serve as the tag. One popular choice is the midpoint of the interval, so let's use the midpoint of the final interval [0.546, 0.56) as the tag.
The Midpoint Tag = (0.546 + 0.56)/2 = 0.553
Generating Tag Mathematically
Sequence of symbols of length one
Mathematically, the tag could be either the lower limit of the interval or the midpoint of the interval. Taking the midpoint, one gets:
$$T_X(a_i) = \sum_{k=1}^{i-1} P(X = k) + \frac{1}{2}P(X = i) = F_X(i-1) + \frac{1}{2}P(X = i)$$
For each ai, TX(ai) will have a unique value. This value can be used as a unique tag for ai.
Ex: For the alphabet source A = {a1, a2, a3} with P(a1)=0.7, P(a2)=0.1, and P(a3)=0.2
Using the mapping equations, FX(1) = 0.7, FX(2) = 0.8, and FX(3) = 1
$$T_X(a_1) = 0 + \frac{1}{2}P(X=1) = 0.5 \times 0.7 = 0.35$$
$$T_X(a_2) = P(X=1) + \frac{1}{2}P(X=2) = 0.7 + 0.5 \times 0.1 = 0.75$$
$$T_X(a_3) = \{P(X=1) + P(X=2)\} + \frac{1}{2}P(X=3) = 0.7 + 0.1 + 0.5 \times 0.2 = 0.9$$
We can get the same result graphically, from the first step of the previous example.
Ex: The outcomes of a roll of a die can be mapped into the numbers {1, 2, …, 6}
For a fair die, P(X = k) = 1/6 for k = 1, 2, …, 6
$$T_X(a_2) = P(X=1) + \frac{1}{2}P(X=2) = \frac{1}{6} + 0.5 \times \frac{1}{6} = 0.25$$
$$T_X(a_5) = \sum_{k=1}^{4} P(X=k) + \frac{1}{2}P(X=5) = \frac{4}{6} + 0.5 \times \frac{1}{6} = 0.75$$
Similarly,
$$T_X(a_1) \approx 0.0833,\quad T_X(a_3) \approx 0.4167,\quad T_X(a_4) \approx 0.5833,\quad T_X(a_6) \approx 0.9167$$
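The die tags can be computed exactly with rational arithmetic. A minimal sketch, assuming the midpoint definition of the tag; the name `midpoint_tags` is chosen here for illustration.

```python
from fractions import Fraction

def midpoint_tags(probs):
    """T_X(a_i) = sum of P(X=k) for k < i, plus P(X=i)/2, for i = 1..m."""
    tags, below = [], Fraction(0)
    for p in probs:
        tags.append(below + p / 2)
        below += p
    return tags

die = [Fraction(1, 6)] * 6                    # fair die
tags = midpoint_tags(die)
for face, tag in enumerate(tags, start=1):
    print(f"T_X(a{face}) = {tag} = {float(tag):.4f}")
```

Using `Fraction` avoids rounding, so the tags come out as 1/12, 1/4, 5/12, 7/12, 3/4, and 11/12.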
Sequence of symbols of long length m
$$T_X^{(m)}(\mathbf{a}_i) = \sum_{\mathbf{y} < \mathbf{a}_i} P(\mathbf{y}) + \frac{1}{2}P(\mathbf{a}_i)$$
where y < ai means that y precedes ai in the ordering, and the superscript denotes the length of the sequence.
Ex: The sequence consists of two rolls of a die; the outcomes in order would be 11, 12, 13, …, 66
Each of the 36 possible sequences has probability 1/36.
The tag for the sequence 13 would be
$$T_X^{(2)}(13) = P(11) + P(12) + \frac{1}{2}P(13) = \frac{1}{36} + \frac{1}{36} + \frac{1}{2}\left(\frac{1}{36}\right) = \frac{5}{72}$$
Note: To generate the tag for the sequence 13, we do not have to generate a tag for every other possible message.
However, this formula requires explicitly calculating the probability of every sequence that precedes the one being tagged, which is as much work as requiring codewords for all sequences of a given length (as in Huffman coding).
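The brute-force use of this formula can be sketched by enumerating the sequences in lexicographic order and summing the probabilities of those that come first. The sketch assumes independent symbols, and `sequence_tag` is a hypothetical helper.

```python
from fractions import Fraction
from itertools import product

def sequence_tag(target, probs):
    """Midpoint tag of `target` among all equal-length sequences in lexicographic order.

    `probs` maps each symbol to its probability; symbols are assumed independent.
    """
    symbols = sorted(probs)
    below = Fraction(0)                        # total probability of preceding sequences
    for seq in product(symbols, repeat=len(target)):
        p_seq = Fraction(1)
        for s in seq:
            p_seq *= probs[s]
        if seq == tuple(target):
            return below + p_seq / 2
        below += p_seq
    raise ValueError("target contains a symbol outside the alphabet")

die = {face: Fraction(1, 6) for face in range(1, 7)}
tag_13 = sequence_tag((1, 3), die)
print(tag_13)
```

The loop visits every sequence that precedes the target, so its cost grows exponentially with the sequence length; this is the lengthy work the note above refers to.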