15-211 Fundamental Structures of Computer Science March 23, 2006 Ananda Guna Lempel-Ziv Compression


15-211 Fundamental Structures of Computer Science

March 23, 2006

Ananda Guna

Lempel-Ziv Compression

In this lecture

• Recap of Huffman
  - Frequency based
• LZW Compression
  - Dictionary based
  - Compression
  - Decompression
• Lossy Compression
  - Singular value decomposition

Compression so far

Devise an algorithm that encodes characters according to their frequencies.

That is, low-frequency characters get longer codes and high-frequency characters get shorter codes.

This is the idea behind the Huffman algorithm.

Huffman Compression Process

Input file:
abbabbcbdbacddcccddaaacccbbaaddddbbdbbabbccb

Count:

char  F(c)
a     10
b     20
c     35
d     40

Build tree (left branch = 0, right branch = 1):

          105
         /   \
       65     d
      /  \
    30    c
   /  \
  a    b

Code table:

char  code
a     000
b     001
c     01
d     1

Write header:
12a30b31c21d11

Write data:
12a30b31c21d11 000001001000…
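The tree-building step can be sketched in a few lines of Python. This is only an illustration, not the course's reference implementation: the function name `huffman_codes` and the tie-breaking counter are assumptions made for the sketch. Depending on which child is labelled 0 or 1, the bit patterns may differ from the code table on the slide, but the code lengths (3, 3, 2, 1) and the total size are the same.

```python
import heapq
from itertools import count


def huffman_codes(freq):
    """Return {symbol: bit string} for a {symbol: frequency} table."""
    tiebreak = count()  # unique number so heap entries never compare trees
    # Heap entry: (weight, tiebreak, tree); a tree is a symbol or a (left, right) pair.
    heap = [(w, next(tiebreak), sym) for sym, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)  # the two lightest subtrees
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (t1, t2)))

    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):      # internal node: recurse into both children
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                            # leaf: record the path as the code
            codes[tree] = prefix or "0"  # a one-symbol alphabet still needs one bit

    walk(heap[0][2], "")
    return codes


freq = {"a": 10, "b": 20, "c": 35, "d": 40}
codes = huffman_codes(freq)
total_bits = sum(freq[s] * len(codes[s]) for s in freq)
print(codes, total_bits)
```

With these frequencies the data part takes 200 bits, against 840 bits for 105 characters stored at 8 bits each.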

Huffman Decompression Process

Compressed file:
12a30b31c21d11 000001001000…

Read header and rebuild the code table:

char  code
a     000
b     001
c     01
d     1

Read data and decode:
abbabbcbdbacddcccddaaacccbbaaddddbbdbbabbccb

Original file
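The decode step relies on the code being prefix-free: reading bit by bit, the first time the bits collected so far match a code word, that code word is the only possible one. A minimal sketch, assuming the data bits arrive as a string of '0' and '1' characters (the function name `huffman_decode` is made up for this example):

```python
def huffman_decode(bits, code_table):
    """Decode a string of '0'/'1' characters using a prefix-free code table."""
    inverse = {code: sym for sym, code in code_table.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:       # prefix-free: the first match is the symbol
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code word")
    return "".join(out)


table = {"a": "000", "b": "001", "c": "01", "d": "1"}
print(huffman_decode("000001001000", table))  # abba
```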

Questions about Huffman

• Is the Huffman tree unique?
• How do we know we get the optimal compression using a Huffman tree?
• What are the compression ratios of the following files when Huffman compression is applied? (Ignore header info.)
  - 1 MB file with all the same character
  - 1 MB file made up of only two distinct characters
  - 1 MB file with 4 distinct characters, all with the same probability
  - 1 MB file with ASCII characters randomly distributed
• Is Huffman the only way to compress a file?

Dictionary-Based Compression

Dictionary-based methods

The idea is simple:
  - Keep track of “words” that we have seen and assign them a unique code; when we see them again, simply replace them with the code.
  - When we see new “words”, expand the dictionary by adding the new words.

We can maintain dictionary entries (word, code) and make new additions to the dictionary as we read the input file.

Selecting a data structure
  - What data structures are good for dictionaries?
  - What data structure is good if we don’t know the words in the dictionary in advance?


Lempel & Ziv (1977/78)

LZW Compression

Lempel-Ziv-Welch (LZW) Algorithm

Suppose we have n possible characters in the dictionary, each labeled 1, 2, …, n.

We start with a trie that contains a root and n children:
  - one child for each possible character
  - each child labeled 1…n

We read the file, and whenever we reach a string that is not yet in the trie, we emit the code for its longest prefix that is in the trie and add the new string to the trie.

We continue this until the whole file is read and we have a dictionary of “words” and codewords.

LZW example

Suppose our alphabet consists only of the four letters {a, b, c, d}.

We start by assigning a=0, b=1, c=2, d=3.
  - How many bytes are needed to encode a, b, c, d?

Let’s consider the compression of the string baddad.

LZW: Compression example

Input:      baddad
            ^
Dictionary: root with children a:0, b:1, c:2, d:3
Output:     (nothing yet)

LZW: Compression example

Input:      baddad
                  ^
Dictionary (trie, written as symbol:code):
  root
    a:0
      d:5    (word ad)
    b:1
      a:4    (word ba)
    c:2
    d:3
      d:6    (word dd)
      a:7    (word da)
Output:     1 0 3 3 5

LZW output

So, the input
  baddad
compresses to
  10335
which can be given in bit form, … or compressed again using Huffman (cool idea!)
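One simple way to give the codes "in bit form" is to write every code with the same number of bits. The sketch below assumes a fixed width of 3 bits, which is enough for the eight codes 0..7 that exist after compressing baddad; the helper name `codes_to_bits` is made up for the example, and real implementations usually choose the width differently.

```python
def codes_to_bits(codes, width):
    """Write each code as a fixed-width binary number and concatenate."""
    if any(c < 0 or c >= 2 ** width for c in codes):
        raise ValueError("code does not fit in the chosen width")
    return "".join(format(c, "0{}b".format(width)) for c in codes)


codes = [1, 0, 3, 3, 5]          # LZW output for "baddad"
bits = codes_to_bits(codes, 3)   # 3 bits cover codes 0..7
print(bits, len(bits))           # 001000011011101 15
```

That is 15 bits, against 48 bits for six characters stored at 8 bits each.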

Extending the dictionary

So what if we continue to compress more of the string?

Suppose we have baddadbaddadbaddad. The first six characters again produce 10335.

What is the encoded file?

LZW: Uncompress example

Input:      10335
            ^
Dictionary: root with children a:0, b:1, c:2, d:3
Output:     (nothing yet)

LZW: Uncompress example

Input:      10335
                 ^
Dictionary (trie, written as symbol:code):
  root
    a:0
      d:5    (word ad)
    b:1
      a:4    (word ba)
    c:2
    d:3
      d:6    (word dd)
      a:7    (word da)
Output:     baddad

LZW Algorithm

An alternative presentation (w/o tries)


Getting off the ground

Suppose we want to compress a file containing only letters a, b, c and d.

It seems reasonable to start with a dictionary

a:0 b:1 c:2 d:3

At least we can then deal with the first letter.

And the receiver knows how to start.


Growing pains

Now suppose the file starts like so:

a b b a b b …

We scan the a, look it up and output a 0.

After scanning the b, we have seen the word ab. So, we add it to the dictionary

a:0 b:1 c:2 d:3 ab:4

Growing pains

Then we get another b.

a b b a b b …

The word bb is not in the dictionary, so we output 1 for the first b and add bb to the dictionary.

a:0 b:1 c:2 d:3 ab:4 bb:5

So?

Right, so far zero compression. (The next a makes us output 1 for the second b and add ba:6.)

But now we get a followed by b, and ab is in the dictionary

a b b a b b …

so we output 4, and put abb into the dictionary

… d:3 ab:4 bb:5 ba:6 abb:7

And so on

Suppose the input continues

a b b a b b b b a …

We output 5, and put bbb into the dictionary

… ab:4 bb:5 ba:6 abb:7 bbb:8


More Hits

As our dictionary grows, we are able to replace longer and longer blocks by short code numbers.

a b b a b b b b a …

0 1 1 4 5 6

And we increase the dictionary at each step by adding another word.

More importantly

Since we extend our dictionary in such a simple way, it can be easily reconstructed on the other end.
  - Start with the same initialization, then
  - read one code number after the other, look up each one in the dictionary, and extend the dictionary as you go along.

So we don’t need to send the dictionary (or codes) with the compressed file (unlike in Huffman).

LZW Compression: Formally

We scan a sequence of symbols

a1 a2 a3 … ak

where each prefix is in the dictionary.

We stop when we fall out of the dictionary:

a1 a2 a3 … ak b


Again: Extending

We output the code for a1 a2 a3 …. ak and

put a1 a2 a3 …. ak b into the dictionary.

Then we set

a1 = b

And start all over.


Another Example

Let's take a closer look at an example.

Assume alphabet {a,b,c}.

The code for aabbaabb is 0 0 1 1 3 5.

The decoding starts with dictionary

a:0, b:1, c:2


Moving along

The first 4 code words are already in D.

0 0 1 1 3 5

and produce output a a b b.

As we go along, we extend D:

a:0, b:1, c:2, aa:3, ab:4, bb:5

For the rest we get

a a b b a a b b


Done

We have also added to D:

ba:6, aab:7

But these entries are never used.

Everything is easy, since there is already an entry in Dictionary for each code number when we encounter it.


Is this it?

Unfortunately, no.

It may happen that we run into a code word without having an appropriate entry in Dictionary.

But, it can only happen in very special circumstances, and we can manufacture the missing entry.


A Bad Run

Consider input

a a b b b a a ==> 0 0 1 5 3

After reading 0 0 1, dictionary looks like this:

a:0, b:1, c:2, aa:3, ab:4


Disaster

The next code is 5, but it’s not in D.

a:0, b:1, c:2, aa:3, ab:4

How could this have happened?

Can we recover?

… narrowly averted

This problem only arises when

• the input contains a substring … s w s w s …

• s w s was just added to the dictionary.

Here s is a single symbol, but w is a (possibly empty) word.


… narrowly averted

But then the fix is to output

x + first(x)

where x is the last decompressed word, and first(x) the first symbol of x.

And, we also update the dictionary to contain this new entry.


Example

In our example we had

• s = b

• w = empty

The output and new dictionary word is bb.

Another Example

aabbbaabbaaabaababb ==> 0 0 1 5 3 6 7 9 5

Decoding (dictionary size: initial 3, final 11)

output   in D?   code   new entry
a                0
a        +       0      aa
b        +       1      ab
bb       -       5      bb
aa       +       3      bba
bba      +       6      aab
aab      +       7      bbaa
aaba     -       9      aaba
bb       +       5      aabab

(+ : the code was already in D; - : the code was not yet in D)

The problem cases

output   in D?   code   new entry   position in D
a                0
a        +       0      aa          3
b        +       1      ab          4
bb       -       5      bb          5
aa       +       3      bba         6
bba      +       6      aab         7
aab      +       7      bbaa        8
aaba     -       9      aaba        9
bb       +       5      aabab       10


Pseudo Code: Compression

• Initialize dictionary D to all words of length 1. That is the alphabet

• Read all input characters:

•output code words from D,

•extend D whenever a new word appears.

• New code words: just an integer counter.

Compression Algorithm

initialize D;               // all words of length 1
c = nextchar;               // next input character
W = c;                      // a string: the current match
while( c = nextchar ) {
    if( W+c is in D )       // dictionary lookup
        W = W + c;          // keep extending the match
    else {
        output code(W);     // emit the longest match
        add W+c to D;       // the new word gets the next free code
        W = c;              // restart from the current character
    }
}
output code(W);             // flush the final match
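The pseudocode translates almost line for line into Python. The sketch below keeps the dictionary in a plain dict rather than a trie, and the name `lzw_compress` and its `alphabet` parameter are choices made for this example. It assumes every input character belongs to the alphabet.

```python
def lzw_compress(text, alphabet):
    """Return the list of LZW codes for text over the given alphabet."""
    # D starts with all words of length 1, numbered 0, 1, 2, ...
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    if not text:
        return []
    codes = []
    w = text[0]                      # W: the current match
    for c in text[1:]:
        if w + c in dictionary:
            w = w + c                # keep extending the match
        else:
            codes.append(dictionary[w])            # emit the longest match
            dictionary[w + c] = len(dictionary)    # next free code
            w = c
    codes.append(dictionary[w])      # flush the final match
    return codes


print(lzw_compress("baddad", "abcd"))    # [1, 0, 3, 3, 5]
print(lzw_compress("aabbaabb", "abc"))   # [0, 0, 1, 1, 3, 5]
print(lzw_compress("aabbbaa", "abc"))    # [0, 0, 1, 5, 3]
```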


Pseudo Code: Decompression

Initialize dictionary D with all words of length 1.

Read all code words and

- output corresponding words from D,

- extend D at each step.

This time the dictionary is of the form

( integer, word )

Keys are integers, values words.

Decompression Algorithm

initialize D;
pc = nextcode;     // first code word
pw = word(pc);     // corresponding word
output pw;

The first code word is easy: it codes only a single symbol.

Remember it as pc (previous code) and pw (previous word).

Decompression Algorithm (continued)

while( c = nextcode ) {
    if( c is in D ) {
        cw = word(c);
        pw = word(pc);
        ww = pw + first(cw);    // previous word plus first symbol of current word
        insert ww in D;
        output cw;
    }
    else {
        // The hard case: c is not yet in D
        pw = word(pc);
        cw = pw + first(pw);    // manufacture the missing entry
        insert cw in D;
        output cw;
    }
    pc = c;
}
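A runnable sketch of the same decoder in Python follows. Since the keys are consecutive integers, the dictionary is simply a list indexed by code. The name `lzw_decompress` and the explicit error for an impossible code are additions for this example, not part of the slides.

```python
def lzw_decompress(codes, alphabet):
    """Rebuild the text from a list of LZW codes over the given alphabet."""
    words = list(alphabet)           # words[code] is the word for that code
    if not codes:
        return ""
    pw = words[codes[0]]             # first code is always a single symbol
    out = [pw]
    for c in codes[1:]:
        if c < len(words):           # normal case: the code is already in D
            cw = words[c]
        elif c == len(words):        # hard case: the code is the entry being built
            cw = pw + pw[0]
        else:
            raise ValueError("code {} cannot occur at this point".format(c))
        words.append(pw + cw[0])     # in the hard case this equals cw itself
        out.append(cw)
        pw = cw
    return "".join(out)


print(lzw_decompress([0, 0, 1, 1, 3, 5], "abc"))   # aabbaabb
print(lzw_decompress([0, 0, 1, 5, 3], "abc"))      # aabbbaa
```

The second call exercises the hard case: code 5 arrives before the decoder has an entry for it, and the entry bb is manufactured from the previous word b.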


Implementation - Tries

• Tries are the best way to implement LZW

• In the LZW situation, we can add the new word to the trie dictionary in O(1) steps after discovering that the string is no longer a prefix of a dictionary word.

• Just add a new leaf to the last node touched.
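The constant-time insertion can be seen in a small trie-based compressor. This is a sketch under the assumption that each node keeps its children in a Python dict; `TrieNode` and `lzw_compress_trie` are names invented for the example.

```python
class TrieNode:
    def __init__(self, code):
        self.code = code        # code of the word spelled by the path to this node
        self.children = {}      # next symbol -> TrieNode


def lzw_compress_trie(text, alphabet):
    """LZW compression that walks a trie instead of looking up whole strings."""
    root = TrieNode(None)
    for i, ch in enumerate(alphabet):
        root.children[ch] = TrieNode(i)
    next_code = len(alphabet)
    codes = []
    node = root
    for ch in text:
        if ch not in root.children:
            raise ValueError("symbol {!r} is not in the alphabet".format(ch))
        child = node.children.get(ch)
        if child is not None:
            node = child                             # still inside the trie
        else:
            codes.append(node.code)                  # emit the longest match
            node.children[ch] = TrieNode(next_code)  # O(1): new leaf under the last node touched
            next_code += 1
            node = root.children[ch]                 # restart from the one-symbol word ch
    if node is not root:
        codes.append(node.code)                      # flush the final match
    return codes


print(lzw_compress_trie("baddad", "abcd"))   # [1, 0, 3, 3, 5]
```

Each input symbol costs one child lookup, and adding a word never requires re-reading the matched string.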

LZW details

In reality, one usually restricts the code words to be 12- or 16-bit integers.

Hence, one may have to flush the dictionary every so often.

But we won’t bother with this.


LZW details

Lastly, LZW generates as output a stream of integers.

It makes perfect sense to try to compress these further, e.g., by Huffman.

Summary of LZW

• LZW is an adaptive, dictionary-based compression method.

• Encoding is easy in LZW, but uses a special data structure (trie).

• Decoding is slightly more complicated, but requires no special data structure, just a table indexed by code number.

• Further reading at:

http://www.dogma.net/markn/articles/lzw/lzw.htm

Lossy Compression with SVD

Data Compression

We have studied two important lossless data compression algorithms:
  - Huffman code
  - Lempel-Ziv dictionary method

Lossy compression
  - What if we can compress an image by “degrading” the image a bit?
  - Lossy compression techniques are used in image formats such as JPEG. (GIF, by contrast, applies lossless LZW coding to an image reduced to a palette of at most 256 colors.)
  - Next we will discuss a method to do lossy compression using a matrix decomposition known as SVD.

Singular Value Decomposition (SVD)

Suppose A is an m×n matrix.

We can find a decomposition of the matrix A such that $A = U S V^T$, where
  - U and V are orthonormal matrices (i.e. $U U^T = I$ and $V V^T = I$, where I is the identity matrix)
  - S is a diagonal matrix such that $S = \mathrm{diag}(s_1, s_2, s_3, \dots, s_k, 0, 0, \dots, 0)$, where the $s_i$ are called the singular values of A and k is the rank of A. It is possible to choose U and V such that $s_1 \ge s_2 \ge \dots \ge s_k > 0$.

Note: Do not worry about all this math if you have not done linear algebra.

Expressing A as a sum

$A = s_1 U_1 V_1^T + s_2 U_2 V_2^T + \dots + s_k U_k V_k^T$

where $U_i$ and $V_i$ are the ith columns of U and V respectively.

A bit of knowledge about block matrix multiplication will convince you that this sum is indeed equal to A.

The key idea in SVD compression is that we can select any number of terms we need from the above sum to “approximate” A.

Thinking of the image as a matrix A, the more terms we pick, the more clarity we get in the image.

Compression comes from saving as few vectors as possible while still getting a decent image.
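The truncated sum is easy to try out numerically. The sketch below uses NumPy's `numpy.linalg.svd`, which returns the singular values sorted from largest to smallest together with the transpose of V. The small 4x3 matrix and the function name `rank_k_approximation` are made up for illustration; they are not the image data from the lecture.

```python
import numpy as np


def rank_k_approximation(A, k):
    """Sum of the first k terms s_i * U_i * V_i^T of the SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s is sorted largest first
    approx = np.zeros(A.shape, dtype=float)
    for i in range(k):
        # Row i of Vt is column i of V, so this is s_i * U_i * V_i^T.
        approx += s[i] * np.outer(U[:, i], Vt[i, :])
    return approx


A = np.array([[4.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [2.0, 1.0, 5.0],
              [1.0, 2.0, 0.0]])
for k in range(1, 4):
    error = np.linalg.norm(A - rank_k_approximation(A, k))
    print(k, error)   # the error shrinks as more terms are kept
```

With all terms kept (k = 3 here) the sum reproduces A up to rounding error.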


Breaking down an image

Red, Green and Blue Images

[Figure: the red, green and blue channels of the image, each plotted on a 16x16 grid]

The Red matrix representation of the image (16x16 matrix)

173 165 165 165 148 132 123 132 140 156 173 181 181 181 189 173
198 189 189 189 181 165 148 165 165 173 181 198 206 198 181 165
206 206 206 206 198 189 181 181 198 206 206 222 231 214 181 165
231 222 206 198 189 181 181 181 206 222 222 222 231 222 198 181
231 214 189 173 165 165 173 181 181 189 198 222 239 231 206 214
206 189 173 148 148 148 148 165 156 148 165 198 222 231 214 239
181 165 140 123 123 115 115 123 140 148 140 148 165 206 239 247
165  82  66  82  90  82  90 107 123 123 115 132 140 165 198 231
123 198  74  49  57  82  82  99 107 115 115 123 132 132 148 214
239 239 107  82  82  74  90 107 123 115 115 123 115 115 123 198
255  90  74  74  99  74 115 123 132 123 123 115 115 140 165 189
247  99  99  82  90 107 123 123 123 123 123 132 140 156 181 198
247 239 165 132 107 148 140 132 132 123 132 148 140 140 156 214
198 231 165 156 132 156 156 140 140 140 148 148 132 140 156 222
247 239 222 181 181 140 156 140 148 148 148 140 132 156 206 222
214 198 181 181 181 181 173 148 156 148 140 140 165 198 222 239

Apply SVD to this matrix and get a close enough approximation using as few columns of U and V as possible.

Do the same for the Green and Blue parts and reconstruct the matrix.

Implementation (compression)

[Figure: COMPRESSION STEP. The red, green and blue channel images (16x16 each) each go through SVD]

The compressed file stores U, V and S at the selected rank for each of the colors R, G and B, plus header bytes.

Implementation (decompression)

[Figure: DECOMPRESSION. The red, green and blue channel images (16x16 each) are rebuilt from the stored factors]

The compressed file stores U, V and S at the selected rank for each of the colors R, G and B, plus the bmp header.

Some Samples (128x128)

[Figures: original image, 49K; rank 1 approximation, 825 bytes]

Samples ctd…

[Figures: rank 16 approximation, 13K; rank 8 approximation, 7K]
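A rough count shows where the savings come from: a rank-k approximation of an m×n channel needs k columns of U, k columns of V and k singular values, that is k(m + n + 1) numbers instead of m·n. The slides do not describe the file layout, so the sketch below rests on two assumptions: one byte per stored number and a 54-byte BMP header. Under those assumptions the count matches the 825 bytes quoted for the rank 1 sample; the rank 8 and rank 16 figures on the slide are rounded and are not reproduced exactly by this simple count.

```python
def svd_storage_bytes(m, n, k, channels=3, header_bytes=54, bytes_per_value=1):
    """Estimated size of a rank-k SVD file (assumed layout, see lead-in)."""
    values_per_channel = k * (m + n + 1)   # k columns of U, k of V, k singular values
    return channels * values_per_channel * bytes_per_value + header_bytes


print(128 * 128 * 3)                     # 49152 raw pixel bytes (the "49K" original)
print(svd_storage_bytes(128, 128, 1))    # 825
```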

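SVD compression using Matlab

A = imread('image.bmp');            % m-by-n-by-3 uint8 array
imagesc(A);
k = 8;                              % rank of the approximation (example value)
R = double(A(:,:,1)); G = double(A(:,:,2)); B = double(A(:,:,3));  % svd needs floating point
[U,S,V] = svd(R);
Ar = U(:,1:k)*S(1:k,1:k)*V(:,1:k)'; % same as the sum of S(i,i)*U(:,i)*V(:,i)' for i = 1…k
% similarly find Ag and Ab
[U,S,V] = svd(G);  Ag = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';
[U,S,V] = svd(B);  Ab = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';
A(:,:,1) = Ar; A(:,:,2) = Ag; A(:,:,3) = Ab;
imagesc(A);                         % rank k approximation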