Data Compression and Huffman’s Algorithm
15-211 Fundamental Data Structures and Algorithms
Klaus Sutner
February 3, 2004

• Homework Number 4 is on its way … It's a bit hard conceptually, so don't


• Read Chapters 7 and 12

Data Compression

• Is one of the fundamental technologies of the Internet.

• Is necessary for faster data transmission.

• Useful even locally to keep smaller files or backup data.

• Types of compression
    Lossless – encodes the original information exactly.
    Lossy – approximates the original information.

• Uses of compression
    Images over the web: JPEG
    Music: MP3
    General-purpose: ZIP, GZIP, JAR, …

Lossy vs. Lossless

• What is the practical impact of lossy compression?

Compare two images

One image is 400K, the other is 1100K. Which is which?

So where is the difference?

Another Example - SVD

[Figure: SVD approximations of an image at rank 1, rank 8, and rank 16, next to the original. File sizes shown: 2231 bytes and 4549 bytes.]

What can we conclude?

• There is definitely a trade-off.

• Lossless may not perform so well, but it retains 100% of the information.

• Lossy can perform extremely well, but is the compression worth the loss of information?

• So how do we decide which one to use?

Some Considerations

• What types of files would you use a lossless algorithm on? Discrete data (e.g., text files).

• What types of files would you use a lossy algorithm on? Analog data (images, music).

Question #1

• Is there a lossless compression algorithm that can compress any file?


• Absolutely not!

• Why not?

Count binary strings of length N.
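The counting argument can be checked directly. There are 2^N binary strings of length N, but only 2^N - 1 binary strings of length strictly less than N, so a lossless compressor (which must map different files to different outputs) cannot shorten every string of length N. A minimal Python sketch; the function names are illustrative and not from the lecture:

```python
def count_strings_of_length(n):
    """Number of binary strings of length exactly n."""
    return 2 ** n

def count_strings_shorter_than(n):
    """Number of binary strings of length 0, 1, ..., n-1 (empty string included)."""
    return sum(2 ** k for k in range(n))  # equals 2**n - 1

for n in (1, 8, 16):
    inputs = count_strings_of_length(n)
    shorter = count_strings_shorter_than(n)
    # Pigeonhole: there are fewer shorter outputs than inputs,
    # so at least one input of length n cannot be compressed.
    print(n, inputs, shorter, shorter < inputs)
```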

Question #2

• Is there a best possible way to compress files?

• Is there an algorithm that always produces the smallest compressed file possible?


No optimal compression

• Suppose you wish to compress the first 10,000 digits of Pi.

• In case they slipped your mind…

[Slide: the first 10,000 digits of Pi]


How about a program?

long a[35014],b,c=35014,d,e,f=1e4,g,h;main(){ for(;b=c-=14;h=printf("%04ld",e+d/f)) for(e=d%=f;g=--b*2;d/=g) d=d*b+f*(h?a[b]:f/5), a[b]=d%--g;}


• This C program is just 143 characters long!

• And it “decompresses” into the first 10,000 digits of Pi.

Program Size Complexity

• There is an interesting idea here: find the shortest program that computes a certain output.
    A very important idea in theoretical computer science.
    Can be used to define incompressible data (no shorter program will produce these data).
    Excellent source of examples/counterexamples.

PSC versus Physics

In fact, PSC pops up naturally when one studies the physical limits of computation.

Crucial problem: How much heat must we dissipate when we perform a computation?

This is a HUGE problem for super-computers: detractors would say that a super-computer is a big refrigerator plus a few chips and disks.

PSC versus Physics

Surprisingly, we only need to dissipate energy when we erase a bit.

Everything else can be done without energy cost (reversible computation), or at least with little cost.

Erasing information cannot be avoided in general.

BUT: before you erase, you can compress garbage bits, thus lowering the thermodynamic cost.

The limit for compression is given by PSC.

PSC and Compression

Unfortunately, there is no algorithm that, given some binary string x, would compute the shortest program p(x) that generates x.

Also note that the shortest program might take a long time to generate x.

So, for data compression PSC is quite useless.

Extra credit

• Come up with the (a) shortest Java program that computes the first 10,000 digits of Pi and writes them to the screen.

• Incidentally, I don’t know how or why pitiny.c works.

Getting close

• In practice, the best we can hope for is a program that does good compression in interesting cases.
    Text files
    Numerical data
    Voice
    Music
    Images
    Video
    …

How does compression work?

• Lossy algorithms are generally mathematically based. They work by applying transforms. E.g., JPEG uses the discrete cosine transform.

• Lossy algorithms attempt to approximate the original data.

• Lossless algorithms cannot do that since they need to maintain the original data.

• So what can they do?

How does compression work?

• They need to analyze the file and take advantage of certain properties it might have.

• Or its structure.

• We’ll look at two important lossless compression methods:
    Huffman compression
    LZW compression

Interlude: Bit-level Representation of Data

Bits and Bytes

• All data is stored on a computer as a sequence of 0’s and 1’s, called bits.

• This is a very natural way to represent data, for the following reason:

• A computer cannot, in general, infer 10 different values from the intensity of a signal.

• It can however infer 2 different values very easily. I.e. whether the signal is high or low.

Bits and Bytes

• The problem: If we use sequences of just 0’s and 1’s instead of 0…9 to represent data, regardless of the convenience, aren’t we using a lot more space?

• To address this issue, let’s consider a specific question…

Quiz Break

Bits and Bytes

• Suppose you had a text file (say, the complete works of Shakespeare) and you know that it has 32 different symbols and a total of 100,000 characters.

• How much space would be needed to represent this in base 10?

• How about base 2?

• In big-Oh terms, how much more space is needed by the base 2 representation?
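One way to work through the quiz numbers, assuming each character is stored in a fixed number of digits in the chosen base (that storage scheme is an assumption of this sketch, not something the slide specifies):

```python
import math

symbols = 32        # alphabet size from the quiz
chars = 100_000     # number of characters in the file

# Fixed number of digits per character in each base.
bits_per_char = math.ceil(math.log2(symbols))      # binary digits per character
digits_per_char = math.ceil(math.log10(symbols))   # decimal digits per character

total_bits = chars * bits_per_char
total_digits = chars * digits_per_char

# Writing the same number in base 2 instead of base 10 costs a factor of
# about log2(10), which is a constant: the blow-up is O(1).
asymptotic_ratio = math.log2(10)

print(total_bits, total_digits, total_bits / total_digits, asymptotic_ratio)
```

Either way the base 2 representation is longer only by a constant factor.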

Bits and Bytes

• Okay, so we’ve established that it’s easiest to store data as a sequence of 0’s and 1’s, but how does that help us?

• In particular, how do I take a text file and store it on the computer?

• To do this we need to invent a code.



Fix some alphabet A. The elements of A are characters or letters.

A (binary) code for A is a map C from A to binary sequences.

Apply C pointwise to define the code of a word over A.

Thus any word over A is transformed by C to a binary sequence.


• Suppose we have A = {a,b,c,d,e}.

• We can use the following 3-bit code:

a=000, b=001, c=010, d=011, e=100

• “badcae” maps to 001 000 011 010 000 100

• Really just 001000011010000100


We need to be able to go from binary sequences back to words over alphabet A.

Note that not every binary sequence may be the code of a word over A.

What properties must C have so we can decode?

Clearly, any two words over A must translate into different binary sequences under C.

Fixed Length Codes

Easy case: All codewords C(a) have the same length.

Important Example: ASCII (7-bit and 8-bit)

Can use specialized hardware to digest whole blocks of bits.

Very simple, but not particularly flexible.
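A minimal sketch of encoding and decoding with a fixed-length code. The table a=000, b=001, c=010, d=011, e=100 is read off the “badcae” example; the function names are illustrative. Decoding simply cuts the bit sequence into blocks of equal width.

```python
# 3-bit code table, reconstructed from the "badcae" example.
CODE = {"a": "000", "b": "001", "c": "010", "d": "011", "e": "100"}

def encode(word, code=CODE):
    """Apply the code pointwise: concatenate the codeword of each letter."""
    return "".join(code[ch] for ch in word)

def decode_fixed(bits, code=CODE, width=3):
    """Cut the bits into blocks of `width` and look each block up."""
    inverse = {w: ch for ch, w in code.items()}
    return "".join(inverse[bits[i:i + width]] for i in range(0, len(bits), width))

bits = encode("badcae")
print(bits)
print(decode_fixed(bits))
```

Not every bit sequence decodes: a block such as 111 is not a codeword, and the lookup fails with a KeyError.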


Is fixed length necessary for decoding?

Clearly not: the following table defines a code and all codewords have different lengths.

In fact, this code is instantaneously decodable: as soon as we have read enough bits for a letter we can determine the right letter.




Is instantaneous decodability necessary?

Or may it happen that we have to read a large part of the coded message before we can determine the first letter?

Note that this would probably cause a number of efficiency problems.




Try to decode


You'll need to do some look-ahead.



Prefix (Free) Codes

There is a nice class of codes that are easily decodable:

No codeword C(a) is allowed to be a prefix of another codeword C(b) where a and b are letters.

How would you construct a decoder for a prefix code?
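One possible decoder, sketched in Python under the assumption that the code is given as a table from letters to bit strings. It uses the code a=1, b=001, c=000, d=01 that appears on the tree slide; the function names are illustrative. Because no codeword is a prefix of another, the decoder can emit a letter as soon as the bits read so far match a codeword.

```python
def is_prefix_free(code):
    """True if no codeword is a prefix of another codeword."""
    words = sorted(code.values())
    # In sorted order, a word that is a prefix of some other word is
    # immediately followed by a word that starts with it.
    return all(not nxt.startswith(cur) for cur, nxt in zip(words, words[1:]))

def decode_prefix(bits, code):
    """Decode a bit string letter by letter, with no look-ahead."""
    inverse = {w: sym for sym, w in code.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:          # a complete codeword has been read
            out.append(inverse[buf])
            buf = ""
    if buf:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

code = {"a": "1", "b": "001", "c": "000", "d": "01"}
print(is_prefix_free(code))
print(decode_prefix("1001101", code))
```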

Good Prefix Codes

If we know nothing about the text to be encoded, we may as well use a fixed length code.

But if we are given the frequency distribution of the letters in A we can do better:

Frequent letters should get short codewords.

And, of course, we are not allowed to violate the prefix condition.


Example: a text of 205 chars over a 5-letter alphabet.

Fixed-length code (codewords 100, 011, 010, 001, 000): 615 bits

Prefix code (optimal; codewords 11, 01, 000, 001, 10): 450 bits



Huffman’s Algorithm

Tree representation

• Represent prefix free codes as full binary trees

• Full: every node is a leaf or has exactly 2 children.

• The codeword of a symbol is then the (unique) path from the root to its leaf.










a=1, b=001, c=000, d=01

Why a full binary tree?

• A node with no sibling can be moved up 1 level, improving the code.

• An optimal code for a string can always be represented by a full binary tree.
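A small sketch of the tree view, assuming a code table is turned into nested dictionaries (this representation and the function names are choices made for the example). It checks fullness for the code a=1, b=001, c=000, d=01 and for a hypothetical code in which one node has no sibling.

```python
def build_tree(code):
    """Build a binary tree (nested dicts) from a table of codewords."""
    root = {}
    for sym, word in code.items():
        node = root
        for bit in word:
            node = node.setdefault(bit, {})   # follow or create the child
        node["sym"] = sym                     # mark the leaf
    return root

def is_full(node):
    """True if every node is a leaf or has exactly 2 children."""
    if "sym" in node:
        return True
    return ("0" in node and "1" in node
            and is_full(node["0"]) and is_full(node["1"]))

full_code = {"a": "1", "b": "001", "c": "000", "d": "01"}
# Hypothetical code: the node reached by "00" has no sibling.
lonely = {"a": "1", "b": "001", "c": "000"}
# Moving that node up one level shortens b and c by one bit each.
improved = {"a": "1", "b": "01", "c": "00"}

print(is_full(build_tree(full_code)), is_full(build_tree(lonely)),
      is_full(build_tree(improved)))
```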











Encoding cost

• Alphabet: A
• Symbol: c
• Symbol frequency: f(c)
• Depth in tree T: d(c) (d(c) is also the number of bits needed to encode c)

• Encoding cost: $K(T) = \sum_{c \in A} f(c)\, d(c)$

• Q: How to construct a full binary tree that minimizes K?

Huffman’s Algorithm

• Huffman’s algorithm gives you an optimal prefix free code by constructing an appropriate tree.

• Data structure used: A Priority Queue.

• insert(element, priority) inserts an element with a given priority into the queue.

• deleteMin() removes and returns the element with least priority.

Huffman’s Algorithm

Compute f(c) for every symbol c ∈ A
insert(c, f(c)) into priority queue Q, for every c ∈ A
for i = 1 to |A| - 1    (i.e., while Q holds more than one tree)
    z = new TreeNode()
    x = z.left = Q.deleteMin()
    y = z.right = Q.deleteMin()
    f(z) = f(x) + f(y)
    Q.insert(z, f(z))
return Q.deleteMin()    (the root of the Huffman tree)
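The steps above can be sketched in Python, with the standard `heapq` module standing in for the priority queue. This is one possible rendering, not the course's reference implementation: symbols are assumed to be strings, internal nodes are represented as (left, right) tuples, and the frequencies in the usage lines are hypothetical.

```python
import heapq
from itertools import count

def huffman_tree(freq):
    """Build a Huffman tree from {symbol: frequency}; returns the root."""
    tie = count()  # tie-breaker so the heap never compares tree nodes
    heap = [(f, next(tie), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:                     # |A| - 1 merges in total
        fx, _, x = heapq.heappop(heap)       # deleteMin
        fy, _, y = heapq.heappop(heap)       # deleteMin
        heapq.heappush(heap, (fx + fy, next(tie), (x, y)))
    return heap[0][2]

def codewords(node, prefix=""):
    """Read the codewords off the tree: left edge = 0, right edge = 1."""
    if isinstance(node, tuple):              # internal node
        left, right = node
        table = codewords(left, prefix + "0")
        table.update(codewords(right, prefix + "1"))
        return table
    return {node: prefix or "0"}             # leaf (lone symbol gets "0")

def cost(freq, table):
    """Encoding cost K: sum of f(c) * d(c) over all symbols."""
    return sum(freq[c] * len(table[c]) for c in freq)

# Hypothetical frequencies, for illustration only.
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
table = codewords(huffman_tree(freq))
print(table, cost(freq, table))
```

Ties between equal frequencies are broken by insertion order here; a different tie-break may give different codewords but the same cost K.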







Huffman’s Algorithm

• Is a greedy algorithm that constructs an optimal prefix free code for a given piece of data

• Does it really generate an optimal prefix free code?

• Yes, but the proof is beyond the scope of today’s lecture. But see it in recitation…

Huffman’s Algorithm

• Why is it greedy?

• Because at each iteration of the loop, it picks the two trees of least frequency in the priority queue (the locally “optimal” choice) and joins them under a new node, without considering the implications from a global standpoint.

Back to Bits and Bytes

• Notice that Huffman’s algorithm, in the setting in which we studied it, can only compress files of characters, since it needs to know what the alphabet is in order to count the frequencies.

• Do we need to modify the algorithm in order to compress arbitrary files?

• Take a minute to think about this.

Bits and Bytes

• No, we don’t!

• Suppose we have a file F to compress. We can treat F as a stream of bits.

• So we read F one byte at a time and interpret each byte as a character of our predefined alphabet (ASCII in this case).

• Implicitly, we then end up treating every file as a text file.

• Is that a good idea? What about images?

Bits and Bytes

• It doesn’t matter!

• So long as we reproduce the original bit sequence after decompression.

• We can treat the file as containing just the characters {a,b,c,d} if we want, it won’t affect the correctness of our algorithm.

• It will, however, affect the performance.

• Why?

Huffman compression

• Huffman trees provide a straightforward method for file compression.
    1. Read the file and compute frequencies
    2. Use frequencies to build Huffman codes
    3. Encode file using the codes
    4. Write the codes (or tree) and encoded file into the output file.
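A compact sketch of the four steps, treating the input as a sequence of bytes. Several simplifications are assumptions of this example: the "output file" is just a Python pair (code table, bit string), the bits are kept as a string of '0' and '1' characters rather than packed into bytes, and the function names are illustrative.

```python
import heapq
from collections import Counter

def build_codes(data):
    """Steps 1 and 2: count byte frequencies, then build a Huffman code table."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # Only one distinct byte: give it a 1-bit codeword.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie-breaker, code table of that subtree).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f0, _, left = heapq.heappop(heap)
        f1, _, right = heapq.heappop(heap)
        # Joining two subtrees prepends one bit to every codeword in them.
        merged = {s: "0" + w for s, w in left.items()}
        merged.update({s: "1" + w for s, w in right.items()})
        heapq.heappush(heap, (f0 + f1, next_id, merged))
        next_id += 1
    return heap[0][2]

def compress(data):
    codes = build_codes(data)
    bits = "".join(codes[b] for b in data)   # step 3: encode
    return codes, bits                       # step 4: table plus encoded data

def decompress(codes, bits):
    inverse = {w: sym for sym, w in codes.items()}
    out, buf = bytearray(), ""
    for bit in bits:
        buf += bit
        if buf in inverse:                   # complete codeword read
            out.append(inverse[buf])
            buf = ""
    return bytes(out)

data = b"abracadabra"
codes, bits = compress(data)
assert decompress(codes, bits) == data       # original bytes are reproduced
print(len(bits), "bits instead of", 8 * len(data))
```

A real file format would also have to store the table (or tree) in the output, which adds overhead that this sketch does not count.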

Sometimes students find this to be tricky…


• Reading the file twice is a pain: once to compute frequencies, and again to do the compression.

• It is possible to build an adaptive Huffman tree that adjusts itself as more data becomes available.

Beating Huffman

• How about doing better than Huffman?

• Impossible! Huffman’s algorithm gives the optimal prefix code!

• Right. But who says we have to use a prefix code?

• Suppose we have a file containing abcdabcdabcdabcdabcdabcd…


• This could be expressed very compactly as abcd1000
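A toy illustration of the "abcd1000" idea: find the shortest unit whose repetition gives the whole string. This is only a sketch of the observation on this slide, with an illustrative function name; it is not LZW and not a practical compressor.

```python
def as_repetition(s):
    """Return (unit, count) with unit * count == s and unit as short as possible."""
    n = len(s)
    for p in range(1, n + 1):
        # p is a candidate unit length; it must divide the total length.
        if n % p == 0 and s[:p] * (n // p) == s:
            return s[:p], n // p
    return s, 1  # only reached for the empty string

unit, times = as_repetition("abcd" * 1000)
print(unit, times)
```

A string with no such structure comes back unchanged with a count of 1, so nothing is gained there.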