
CT516 Advanced Digital Communications
Lecture 3: Source Coding and Data Compression

Yash M. Vasavada
Associate Professor, DA-IICT, Gandhinagar

12th January 2017

Overview of Today's Talk

1. Analog to Digital Conversion
2. Compression of Digital Data
3. Information Theoretical Aspects
4. Discrete Memoryless Source
5. Huffman Coding
6. Lempel-Ziv Coding


Digital Representations of Analog Signals

Analog signals (video, voice, image, etc.) are continuous in nature. The sampling theorem says that if an analog signal is band-limited, it can be exactly recovered from its time-domain samples taken sufficiently close together.

→ Sampling analog signals makes them discrete in time
→ For a band-limited signal, sampling introduces no distortion

Quantization of the sampled analog signal makes the samples discrete in amplitude.

→ Quantization introduces some distortion; there is a trade-off between the bandwidth requirement and the distortion

Sampling Theorem

Let x(t) be a band-limited signal with Fourier transform X(f) that is zero for |f| > B.

The time-domain signal x(t) can be perfectly reconstructed from its uniformly spaced samples provided these samples are taken at a rate R > 2B. Here, 2B is called the Nyquist rate.

If the time-domain samples are collected at a rate less than 2B, aliasing occurs and it is not possible to perfectly reconstruct the analog signal from its samples (a small sketch follows).

The proof of the sampling theorem rests on the following two results:

→ The Fourier transform of a time-domain impulse train whose impulses are separated by $1/(2B)$ seconds is a frequency-domain impulse train whose consecutive impulses are spaced 2B Hertz apart
→ Multiplication of two signals in the time domain corresponds to convolution of their Fourier transforms in the frequency domain
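To make the aliasing claim concrete, here is a minimal Python sketch (an illustration added to this transcript, not from the slides; the rates and tone frequencies are made up): a tone above half the sampling rate produces exactly the same samples as a lower-frequency tone, so the original cannot be recovered from the samples.

```python
import numpy as np

fs = 8.0                     # sampling rate in Hz
f_hi = 7.0                   # tone above fs/2 = 4 Hz: violates the Nyquist condition
f_alias = fs - f_hi          # 1.0 Hz: the frequency the tone aliases down to

n = np.arange(32)            # sample indices
t = n / fs                   # sampling instants
s_hi = np.cos(2 * np.pi * f_hi * t)
s_lo = np.cos(2 * np.pi * f_alias * t)

print(np.allclose(s_hi, s_lo))   # True: the two tones are indistinguishable from samples
```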

Quantization
Analog Samples → Digital Samples

This is to be covered in a subsequent lecture.

Source Coding
Data Compression Problem

Many digital data streams contain a lot of redundant information. For example, a digital image file may contain more zeros than ones: 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1.

We wish to squeeze out the redundant information to minimize the amount of data that needs to be stored or transmitted.

The data compression problem asks:

1. What are good algorithms that achieve the maximal data compression?
2. What is the maximum data compression that can be achieved if we want to recover the exact bit sequence after decompression?

Source Coding
Data Compression Problem: Definitions

Source alphabet: the source is modeled as generating symbols, each of which takes one of L values from the set {x1, x2, . . . , xL}. This set is the source alphabet.

Symbol probabilities: some source symbols are more likely than others. The probability P(X = xk) of the kth symbol is denoted pk. Since the source alphabet is the universal set, $\sum_{k=1}^{L} p_k = 1$.

Discrete memoryless source (DMS): in a DMS, each source symbol occurs independently of the other symbols. We will mostly assume that the source is a DMS (a sketch follows).

→ With more complicated notation, the theory for a DMS can be extended to correlated sources
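A minimal sketch of a DMS (an added illustration; the alphabet and probabilities are made up): each emitted symbol is drawn independently of all others according to the pk.

```python
import random

alphabet = ["x1", "x2", "x3", "x4"]
probs = [0.5, 0.25, 0.125, 0.125]      # the p_k; they must sum to 1

def dms(n):
    """Emit n symbols, each drawn independently (the memoryless property)."""
    return random.choices(alphabet, weights=probs, k=n)

print(dms(10))   # e.g. ['x1', 'x1', 'x2', 'x1', 'x4', ...]
```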

A Measure of Information Content of the Source
Entropy

What is the information generated when the source emits the kth symbol xk with probability pk?

→ The information should be inversely related to pk
→ The information, when measured in bits, should be a logarithmic function

The information generated is therefore $\log_2(1/p_k) = -\log_2 p_k$ bits. This is called the self-information of the source.
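A quick numerical check of the definition (an added sketch, not from the slides):

```python
import math

def self_information(p):
    """Self-information I(x_k) = -log2(p_k), in bits."""
    return -math.log2(p)

print(self_information(0.5))    # 1.0 bit:  a fair coin flip
print(self_information(1 / 8))  # 3.0 bits: rarer symbols carry more information
```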

A Measure of Information Content of the Source
Entropy

Entropy H(X) is a measure of the uncertainty in the source output, which is modeled as a random variable X.

H(X) is defined as the average information generated by the source, where the average is taken over the entire source alphabet:

$H(X) = -\sum_{k=1}^{L} p_k \log_2 p_k$, in bits

→ H(X) is maximized if $p_k = 1/L$ for all k, giving $H(X) = \log_2 L$
→ H(X) is minimized if $p_k = 1$ for some k
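A short sketch verifying the two extreme cases (an added illustration, not from the slides):

```python
import math

def entropy(p):
    """H(X) = -sum_k p_k log2 p_k, in bits; terms with p_k = 0 contribute nothing."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

L = 4
print(entropy([1 / L] * L))            # 2.0 bits: uniform p_k = 1/L attains the maximum log2 L
print(entropy([1.0, 0.0, 0.0, 0.0]))   # 0.0 bits: a certain outcome carries no information
```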


A Measure of Information Content of the Source
Entropy for a Binary RV

For a binary RV X with $p_X(x) = \begin{cases} 1-p, & x = 0 \\ p, & x = 1 \end{cases}$, the entropy is given as $H(X) = -p \log_2 p - (1-p) \log_2 (1-p)$.
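A standalone sketch of the binary entropy function (an added illustration):

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), the entropy of a binary source."""
    if p in (0.0, 1.0):
        return 0.0                # lim x log2 x = 0 as x -> 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # 1.0 bit:   maximum uncertainty at p = 1/2
print(binary_entropy(0.1))   # ~0.47 bits: a biased source is more predictable
```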

Source Coding
for a DMS

A source code is a rule that maps symbols from the source alphabet to sequences of bits.

A fixed-length source code maps source symbols into codewords of the same length:

→ Example: x1 ⇒ 00, x2 ⇒ 01, x3 ⇒ 10, x4 ⇒ 11

A variable-length source code maps source symbols into codewords of different lengths:

→ Example: x1 ⇒ 1, x2 ⇒ 01, x3 ⇒ 101, x4 ⇒ 11

A variable-length source code is uniquely decodable if no codeword is a prefix of another codeword (such codes are called prefix codes; the sketch below checks this condition):

→ Uniquely decodable: x1 ⇒ 01, x2 ⇒ 000, x3 ⇒ 10, x4 ⇒ 111
→ Counter-example: x1 ⇒ 0, x2 ⇒ 01, x3 ⇒ 10, x4 ⇒ 101
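A small sketch (added, not from the slides) that tests the prefix condition on the two codes above:

```python
def is_prefix_free(codewords):
    """True if no codeword is a prefix of another (sufficient for unique decodability)."""
    for c in codewords:
        for d in codewords:
            if c != d and d.startswith(c):
                return False
    return True

print(is_prefix_free(["01", "000", "10", "111"]))   # True:  the uniquely decodable example
print(is_prefix_free(["0", "01", "10", "101"]))     # False: "0" is a prefix of "01"
```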

Source Coding
for a DMS

The rate R of a source code is the average number of bits required to represent a source symbol.

Let nk be the number of bits in the codeword that represents source symbol xk.

→ Therefore, $R = \sum_{k=1}^{L} p_k n_k$
→ For fixed-length codes, $R \geq \lceil \log_2 L \rceil$

Goal of source encoding: find a uniquely decodable code that minimizes R.

Strategy: assign short codewords (small nk) to the most likely symbols (large pk). A quick check follows.
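A small sketch (added illustration) computing R for a fixed-length and a variable-length code over the same four-symbol source:

```python
def rate(p, n):
    """R = sum_k p_k n_k: average codeword length in bits per source symbol."""
    return sum(pk * nk for pk, nk in zip(p, n))

p = [0.5, 0.25, 0.125, 0.125]    # H(X) = 1.75 bits for this source
print(rate(p, [2, 2, 2, 2]))     # 2.0:  fixed-length code, ceil(log2 4) bits
print(rate(p, [1, 2, 3, 3]))     # 1.75: variable-length code meeting H(X) exactly
```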


Source Coding Theorem

A fundamental theorem of information theory: for a uniquely decodable source code that provides lossless compression (i.e., perfect reconstruction of the source symbols at the source decoder), R ≥ H(X).

→ The rate can be made arbitrarily close to, but not less than, H(X)

Two widely used source coding methods approach the performance promised by this theorem:

1. Huffman coding
2. Lempel-Ziv coding

Source Coding Theorem
Huffman Coding Algorithm

Huffman coding is based on the idea of allocating short codewords to highly probable symbols. The Huffman algorithm belongs to the class of algorithms based on the greedy approach. A sketch follows the steps below.

1. List all symbols in descending order of probability
2. Group the two lowest-probability symbols into a new combined symbol
3. Repeat steps 1 and 2 until only a single combined symbol remains
4. Assign a 1 to each upper branch and a 0 to each lower branch of the resulting binary tree
5. The codeword for each symbol is the sequence of bits on the path from the root of the tree to that symbol
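A compact Python sketch of the procedure (an added illustration; the function name and the heap-based tie-breaking are implementation choices, not from the slides — any consistent branch labeling yields an optimal code):

```python
import heapq
from itertools import count

def huffman(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> codeword."""
    tie = count()    # unique tie-breaker so the heap never compares the dicts
    heap = [(p, next(tie), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # two lowest-probability subtrees (step 2)
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "1" + w for s, w in c0.items()}    # prepend branch bits (step 4):
        merged.update({s: "0" + w for s, w in c1.items()})  # later merges are nearer the root
        heapq.heappush(heap, (p0 + p1, next(tie), merged))  # combined symbol (step 3)
    return heap[0][2]

code = huffman({"x1": 0.5, "x2": 0.25, "x3": 0.125, "x4": 0.125})
print(code)   # {'x1': '1', 'x2': '01', 'x3': '001', 'x4': '000'} up to relabeling:
              # lengths 1, 2, 3, 3, so R = 1.75 bits = H(X) for this source
```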

Source Coding Theorem
Huffman Coding Algorithm: An Example / Numerical Analysis

[Slides 15–19 work through a Huffman coding example and its numerical analysis as figures; the figures are not reproduced in this transcript.]

Source Coding Theorem
Huffman Coding Algorithm: Summary

How do we get even closer to the entropy?

→ Use the Huffman algorithm on pairs of symbols, or
→ even better, on triplets or quadruplets of symbols, or,
→ in general, on m-tuples of symbols (see the sketch below).

A limitation of the Huffman algorithm is that it requires the source symbol probabilities to be known. These are not known exactly, but they can be estimated from a sufficiently large sample of the source output. Mismatch between the estimated and exact symbol probabilities degrades the performance of the Huffman algorithm.

Another limitation of the Huffman code is that the symbol probabilities are assumed to be fixed. If these probabilities change with time, the Huffman code may provide suboptimal data compression.
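A sketch of the m-tuple idea for m = 2 (an added illustration; it assumes the huffman() function from the earlier sketch is in scope, and the skewed binary source is made up):

```python
from itertools import product

p = {"a": 0.9, "b": 0.1}    # skewed binary source, H(X) ~ 0.469 bits/symbol
pairs = {s + t: p[s] * p[t] for s, t in product(p, p)}   # DMS: pair probabilities multiply

code = huffman(pairs)        # Huffman on the product alphabet {aa, ab, ba, bb}
R_pair = sum(pairs[w] * len(code[w]) for w in pairs)
print(R_pair / 2)            # ~0.645 bits per original symbol: below the 1 bit that
                             # symbol-by-symbol coding needs, and closer to H(X)
```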

Lempel-Ziv Coding
Encoding Algorithm

In Lempel-Ziv coding, the source sequence is sequentially parsed into strings that have not appeared so far.

→ Continually parse the symbol sequence and insert a comma on detecting the shortest string that has not been seen earlier

Example sequence: 1011010100010. . .

Sequential parsing in Lempel-Ziv encoding (a sketch follows):

→ 1,011010100010. . .
→ 1,0,11010100010. . .
→ 1,0,11,010100010. . .
→ 1,0,11,01,0100010. . .
→ 1,0,11,01,010,0010. . .
→ 1,0,11,01,010,00,10. . .
→ 1,0,11,01,010,00,10,. . .
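A minimal sketch of the parsing step (an added illustration; lz_parse is a made-up name), reproducing the phrases found above:

```python
def lz_parse(bits):
    """Parse the input into the shortest phrases not seen before (Lempel-Ziv)."""
    phrases, current = [], ""
    for b in bits:
        current += b
        if current not in phrases:   # shortest string not seen earlier: new phrase
            phrases.append(current)
            current = ""
    return phrases                   # trailing bits that repeat an earlier phrase
                                     # are left unemitted in this sketch

print(lz_parse("1011010100010"))     # ['1', '0', '11', '01', '010', '00', '10']
```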

Lempel-Ziv Coding
Encoding Algorithm

The prefix of each "phrase" (a phrase is the symbol string between two commas), i.e., the string consisting of all but the last symbol of the phrase, must have occurred earlier.

Encode each phrase by giving the location of its prefix and the value of its last bit.

Let C(n) be the number of phrases in the parsing of an input sequence of n symbols. Then $\lceil \log_2 C(n) \rceil$ bits are needed to describe the location of the prefix of a phrase, and one bit to describe its last bit.

In the above example, the code sequence is: (000, 1), (000, 0), (001, 1), (010, 1), (100, 0), (010, 0), (001, 0)

→ The first number of each pair gives the index of the prefix and the second number gives the last bit of the phrase.
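A sketch of the full encoding (an added illustration; it reuses lz_parse from the sketch above and reproduces the pair sequence on this slide):

```python
def lz_encode(bits, index_bits=3):
    """Encode each phrase as (index of its prefix among earlier phrases, last bit).
    Index 0 denotes the empty prefix; index_bits = 3 suffices for the 7 phrases here."""
    phrases = lz_parse(bits)
    pairs = []
    for ph in phrases:
        prefix, last = ph[:-1], ph[-1]
        idx = phrases.index(prefix) + 1 if prefix else 0   # 1-based phrase index
        pairs.append((format(idx, f"0{index_bits}b"), last))
    return pairs

print(lz_encode("1011010100010"))
# [('000','1'), ('000','0'), ('001','1'), ('010','1'), ('100','0'), ('010','0'), ('001','0')]
```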