1 15-211 Fundamental Data Structures and Algorithms Margaret Reid-Miller 24 February 2005 LZW Compression


15-211 Fundamental Data Structures and Algorithms

Margaret Reid-Miller, 24 February 2005

LZW Compression


Last Time…


Huffman Trees


Huffman Compression

Huffman trees provide a straightforward method for file compression.

1. Read the file and compute frequencies

2. Use frequencies to build Huffman codes

3. Encode file using the codes

4. Write the codes (or tree) and encoded file into the output file.
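As a reminder of what those four steps involve, here is a minimal sketch in Python. It keeps the code table in memory instead of writing a header to a file, and the function names (`huffman_codes`, `encode`, `decode`) are made up for this illustration.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return {symbol: bitstring} for the symbols of a non-empty text."""
    freq = Counter(text)                      # step 1: frequencies
    if len(freq) == 1:                        # degenerate case: one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:                      # step 2: merge the two lightest subtrees
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

def encode(text, codes):
    """Step 3: replace every symbol by its code."""
    return "".join(codes[s] for s in text)

def decode(bits, codes):
    """Invert encode(); works because no code is a prefix of another."""
    inverse = {c: s for s, c in codes.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inverse:
            out.append(inverse[cur])
            cur = ""
    return "".join(out)

text = "abracadabra"
codes = huffman_codes(text)
bits = encode(text, codes)
print(codes, len(bits))
```

A real compressor would also have to serialize `codes` (or the tree) into the output, which is step 4.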


Huffman decompression

Decompression reverses the process.

• Read the header in the compressed file, and build the code tree

• Read the rest of the file, decode using the tree

• Write to output


Beating Huffman

How about doing better than Huffman!

Impossible! Huffman’s algorithm gives the optimal prefix code!

Right. But who says we have to use a prefix code?


Example

Suppose we have a file containing

abcdabcdabcdabcdabcdabcd… abcdabcd

This could be expressed very compactly as

abcd^1000
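A rough back-of-the-envelope comparison, assuming the file is exactly 1000 repetitions of abcd and that the compact description is stored as plain 8-bit characters (both are assumptions made for this illustration, and the Huffman header is ignored):

```python
import math

symbols = "abcd"
repeats = 1000                       # taken from the slide's abcd^1000

# Four equally frequent symbols: a prefix code cannot average fewer than
# log2(4) = 2 bits per symbol, and Huffman achieves exactly that.
bits_per_symbol = math.log2(len(symbols))
huffman_bits = int(bits_per_symbol * len(symbols) * repeats)

# The compact description, stored naively as 8-bit characters.
description = "abcd^1000"
description_bits = 8 * len(description)

print(huffman_bits, description_bits)
```

So even the optimal prefix code leaves a lot on the table when the input has long-range repetition.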


Dictionary-Based Compression


Dictionary-based methods

Here is a simple idea: remember the “words” that we have seen, and replace them with a code number when we see them again.

The code is presumably shorter than the word, so if words repeat this should produce nice compression.

We keep the words in a dictionary and make additions to the dictionary as we read the input file.


Dictionary-based Methods

As we read the input file we keep adding new words to the dictionary to get more and more abbreviations:

( word, code )

Since we will always use the longest applicable abbreviation, the set of current words is prefix-closed: every prefix of a word is itself a word (so it looks like tries might be useful).


Fred Hacker’s Algorithm…

Fred now knows what to do…

( <the-whole-file>, 1 )

Transmit 1, done.


Right?

Fred’s algorithm provides excellent compression, but…

…the receiver does not know what is in the dictionary! And sending the dictionary is the same as sending the entire uncompressed file.

Thus, we can’t decompress the “1”.


Hence…

…we need to build our dictionary in such a way that the receiver can rebuild the dictionary easily.


Lempel & Ziv (1977/78)


LZW Compression: The Binary Version

LZW = variant of Lempel-Ziv compression, by Terry Welch (1984)


Maintaining a Dictionary

We need a way of incrementally building up a dictionary during compression in such a way that…

…someone who wants to uncompress can “rediscover” the very same dictionary

And we already know that a convenient way to build a dictionary incrementally is to use a trie.


Getting off the ground

Suppose we want to compress a file containing only letters a, b, c and d.

It seems reasonable to start with a dictionary

a:0 b:1 c:2 d:3

At least we can then deal with the first letter.

And the receiver knows how to start.


Growing pains

Now suppose the file starts like so:

a b b a b b …

We scan the a, look it up and output a 0.

After scanning the b, we have seen the word ab. So, we add it to the dictionary

a:0 b:1 c:2 d:3 ab:4


Growing pains

We already scanned the first b.

a b b a b b …

Then we get another b. bb is not in the dictionary.

So we output a 1 for the first b, and add bb to the dictionary

a:0 b:1 c:2 d:3 ab:4 bb:5


So?

Right, so far zero compression.

We already scanned the second b.

a b b a b b …

Next we get an a. As ba is not in the dictionary, we output 1 for the b and put ba in the dictionary

… d:3 ab:4 bb:5 ba:6

Still zero compression.


But now…

We already scanned a.

a b b a b b …

We scan the next b, and ab : 4 is in the dictionary.

We scan the next b, and don’t find abb in the dictionary. We output 4, and put abb into the dictionary.

… d:3 ab:4 bb:5 ba:6 abb:7

We got compression, because 4 is shorter than ab.


Suppose the input continues

a b b a b b b b a …

We scan the next b, and bb:5 is in the dictionary

We scan the next b, and don’t find bbb in the dictionary. We output 5, and put bbb into the dictionary

… ab:4 bb:5 ba:6 abb:7 bbb:8

And so on


More Hits

As our dictionary grows, we are able to replace longer and longer blocks by short code numbers.

a b b a b b b b a …

0 1 1 4 5 6

And we increase the dictionary at each step by adding another word.
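The walk-through on the last few slides can be replayed with a short tracing encoder. This is only a sketch: it uses a plain Python dict rather than a trie, assumes a non-empty input over the given alphabet, and the name `lzw_trace` is invented here.

```python
def lzw_trace(text, alphabet):
    """Encode text; return (codes, steps), steps being (code, new_word) pairs."""
    D = {ch: i for i, ch in enumerate(alphabet)}   # start with words of length 1
    codes, steps = [], []
    w = text[0]
    for c in text[1:]:
        if w + c in D:
            w += c                       # still in the dictionary: keep extending
        else:
            codes.append(D[w])           # emit the longest known word
            D[w + c] = len(D)            # new word gets the next free code number
            steps.append((D[w], w + c))
            w = c                        # restart from the symbol that did not fit
    codes.append(D[w])                   # flush the final match
    steps.append((D[w], None))
    return codes, steps

codes, steps = lzw_trace("abbabbbba", "abcd")
for code, new_word in steps:
    print(code, new_word)
```

Each printed line shows one output code together with the word added to the dictionary at that step (`None` for the final flush).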


More importantly

Since we extend our dictionary in such a simple way, it can be easily reconstructed on the other end.

Start with the same initialization, then read one code number after the other, look each one up in the dictionary, and extend the dictionary as you go along.


Again: Extending

We scan a sequence of symbols

a1 a2 a3 … ak

where each prefix is in the dictionary.

We stop when we fall out of the dictionary:

a1 a2 a3 … ak b


Again: Extending

We output the code for a1 a2 a3 …. ak and

put a1 a2 a3 …. ak b into the dictionary.

Then we set

a1 = b

And start all over.


Decoding

Let's take a closer look at an example.

Assume alphabet {a,b,c}.

The code for aabbaabb is 0 0 1 1 3 5.

The decoding starts with dictionary D:

a:0, b:1, c:2


Moving along

The first 4 code words are already in D.

0 0 1 1 3 5

and produce output a a b b.

As we go along, we extend D:

a:0, b:1, c:2, aa:3, ab:4, bb:5

For the rest we get

a a b b a a b b


Done

We have also added to D:

ba:6, aab:7

But these entries are never used.

Everything is easy, since there is already an entry in D for each code number when we encounter it.
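A sketch of this easy case in Python. It deliberately assumes that every code number is already in D when it is read, which holds for this example but not for every input; the name `lzw_decode_simple` is hypothetical.

```python
def lzw_decode_simple(codes, alphabet):
    """Decode a non-empty code list, assuming each code is in D when read."""
    D = {i: ch for i, ch in enumerate(alphabet)}
    x = D[codes[0]]              # the first code always names a single symbol
    out = [x]
    for c in codes[1:]:
        y = D[c]                 # raises KeyError if c is not in D yet
        D[len(D)] = x + y[0]     # previous word plus first symbol of this one
        out.append(y)
        x = y
    return "".join(out), D

text, D = lzw_decode_simple([0, 0, 1, 1, 3, 5], "abc")
print(text)
print(D)
```

The returned dictionary ends with the entries 6:ba and 7:aab, matching the two unused entries mentioned on this slide.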


One more detail…

One detail remains: how to build the dictionary for compression (decompression is easy).

We need to be able to scan through a sequence of symbols and check if they form a prefix of a word already in the dictionary.

Could use a balanced tree, but then each new symbol would launch a new search.


Tries!

The trie

root
    a (0)
        d (5)
    b (1)
        a (4)
    c (2)
    d (3)
        d (6)

corresponds to dictionary

a:0 b:1 c:2 d:3 ba:4 ad:5 dd:6


Tries

Even better: in the LZW situation, we can add the new word to the trie dictionary in O(1) steps after discovering that the string is no longer a prefix of a dictionary word.

Just add a new leaf to the last node touched.
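One way this could look in Python. The `TrieNode` class and `lzw_encode_trie` function are illustrative names, and the sketch assumes every input symbol belongs to the alphabet.

```python
class TrieNode:
    def __init__(self, code):
        self.code = code         # code number of the word spelled by the path
        self.children = {}       # symbol -> TrieNode

def lzw_encode_trie(text, alphabet):
    root = TrieNode(None)
    for i, ch in enumerate(alphabet):
        root.children[ch] = TrieNode(i)
    next_code = len(alphabet)
    codes = []
    node = root
    for c in text:
        child = node.children.get(c)
        if child is not None:
            node = child                             # one step down per symbol
        else:
            codes.append(node.code)                  # fell out: emit longest match
            node.children[c] = TrieNode(next_code)   # O(1): new leaf under last node
            next_code += 1
            node = root.children[c]                  # restart from the symbol c
    if node is not root:
        codes.append(node.code)                      # flush the final match
    return codes

print(lzw_encode_trie("baddad", "abcd"))
```

Each input symbol costs one child lookup, so no search is ever restarted from scratch.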


Pretty Pictures


LZW: 4 Letter Example

Suppose our entire character set consists only of the four letters: {a, b, c, d}

Let’s consider the compression of the string baddad


Byte LZW: Compress example

Input: baddad

Initial dictionary (trie): a:0 b:1 c:2 d:3

The slides step through the input one frame at a time:

Frame  Output so far  Entry added to the trie
1      (none)         (none)
2      1              ba:4
3      1 0            ad:5
4      1 0 3          dd:6
5      1 0 3 3        da:7
6      1 0 3 3 5      (none)


Byte LZW output

So, the input baddad compresses to 1 0 3 3 5, which can be written out as is or compressed again using Huffman.


Byte LZW: Uncompress example

The uncompress step for LZW is the most complicated part of the entire process.


Byte LZW: Uncompress example

Input: 1 0 3 3 5

Initial dictionary (trie): a:0 b:1 c:2 d:3

The slides step through the code numbers one frame at a time:

Frame  Output so far  Entry added to the trie
1      (none)         (none)
2      b              (none)
3      ba             ba:4
4      bad            ad:5
5      badd           dd:6
6      baddad         da:7


Decoding difficulty

When we decode, is the code number always in the dictionary?

Unfortunately, no.

It may happen that we run into a code number without having an appropriate entry in D.

But, it can only happen in very special circumstances, and we can manufacture the missing entry.


A Bad Run

Consider input

a a b b b a a ==> 0 0 1 5 3

After reading 0 0 1, we output

a a b

and extend D with codes for aa and ab

0:a, 1:b, 2:c, 3:aa, 4:ab


Disaster

We have read 0 0 1 from the input

0 0 1 5 3

The dictionary is

0:a, 1:b, 2:c, 3:aa, 4:ab

The next code number to read is 5, but it’s not in D.

How could this have happened?

Can we recover?


… narrowly averted

This problem arises only when, on the compressor end:

• the input contains a substring

… s σ s σ s …

• the compressor read s σ, output the code c for s σ, and added s σ s to the dictionary under the next free code number;

• the compressor then matched s σ s right away and output that new code number.

• Here s is a single symbol, and σ is a (possibly empty) word.

… narrowly averted (pt. 2)

On the decompressor end, D contains

c: s σ

• but does not yet contain the entry for s σ s

• the decompressor has already output

x = s σ

and is now looking at the unknown code number of s σ s.

… narrowly averted (pt. 3)

But then the fix is to output

x + first(x)

where x is the last decompressed word, and first(x) the first symbol of x.

Because x = s σ was already output, we get the required

s σ s σ s

We also update the dictionary to contain the new entry x + first(x) = s σ s.


Example

In our example we have read 0 0 1 from the input

0 0 1 5 3

The last decompressed word is b, and the next code number to read is 5, which is not yet in D. Thus

• s = b

• σ = empty

• The next word to output and add to D is

s σ s = bb


Summary

Let x be the last word output.

Ordinarily, D contains a word y matching the next input code number.

We output y and extend D with

x + first(y)

But sometimes the encoder immediately uses what was last added to the dictionary.

Then it must be that x = s σ, and we output (and add to D)

x + first(x) = s σ s
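The two cases in this summary can be sketched as a decoder that also records which case applied at each step. The function name `lzw_decode` and the trace format are choices made for this illustration; the sketch assumes a non-empty code list.

```python
def lzw_decode(codes, alphabet):
    """Decode a non-empty code list; return (text, trace)."""
    D = {i: ch for i, ch in enumerate(alphabet)}
    x = D[codes[0]]                     # first code: always a single symbol
    out = [x]
    trace = [(codes[0], "+", x, None)]  # (code, in D?, output, entry added)
    for c in codes[1:]:
        n = len(D)                      # code number the new entry will get
        if c in D:                      # ordinary case
            y = D[c]
            D[n] = x + y[0]             # extend D with x + first(y)
            mark = "+"
        elif c == n:                    # encoder used the entry it had just added
            y = x + x[0]                # manufacture x + first(x)
            D[n] = y
            mark = "-"
        else:
            raise ValueError(f"code {c} cannot occur at this point")
        trace.append((c, mark, y, f"{n}:{D[n]}"))
        out.append(y)
        x = y
    return "".join(out), trace

text, trace = lzw_decode([0, 0, 1, 5, 3], "abc")
print(text)
for row in trace:
    print(row)
```

A code can be missing only when it equals the next free code number; anything larger cannot have been produced by the encoder, so the sketch rejects it.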


Example (extended)

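Codes:   0 0 1 5 3 6 7 9 5

Decoded: aabbbaabbaaabaababb

Input  In D?  Output  Add to D
0             a
0      +      a       3:aa
1      +      b       4:ab
5      -      bb      5:bb
3      +      aa      6:bba
6      +      bba     7:aab
7      +      aab     8:bbaa
9      -      aaba    9:aaba
5      +      bb      10:aabab

(+ marks a code that was already in D, - a code that was missing and had to be manufactured.)

For code 9: s = a, σ = ab, so s σ s = aaba.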


Pseudo Code: Compression

Initialize dictionary D to all words of length 1.

Read all input characters:

output code words from D,

extend D whenever a new word appears.

New code words: just an integer counter.


Less Pseudo

initialize D;

c = nextchar; // next input character

W = c; // a string

while( c = nextchar ) {

if( W+c is in D ) // dictionary

W = W + c;

else {

output code(W); add W+c to D; W = c;

}

}

output code(W)
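The pseudocode above translates almost line for line into Python. This sketch works on bytes, so "all words of length 1" becomes the 256 single-byte strings; the name `lzw_compress` is an assumption, and the dictionary is a plain dict rather than a trie.

```python
def lzw_compress(data):
    """LZW-encode a bytes object into a list of integer codes."""
    if not data:
        return []
    D = {bytes([i]): i for i in range(256)}   # initialize D: all words of length 1
    codes = []
    w = data[:1]                              # W = first input character
    for i in range(1, len(data)):
        c = data[i:i + 1]                     # next input byte, kept as bytes
        if w + c in D:
            w = w + c
        else:
            codes.append(D[w])                # output code(W)
            D[w + c] = len(D)                 # new code word: just a counter
            w = c
    codes.append(D[w])                        # output code(W) for the last word
    return codes

print(lzw_compress(b"baddad"))
```

Building `w + c` copies the current word each time; the trie version avoids that by keeping a node pointer instead of a string.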


Pseudo Code: Decompression

Initialize dictionary D with all words of length 1.

Read all code words and

- output corresponding words from D,

- extend D at each step.

This time the dictionary is of the form

( integer, word )

Keys are integers, values words.


Less Pseudo

initialize D;

c = nextcode; // first code number

x = word(c); // corresponding word

output x;

First code number is easy: codes only a single symbol.

Remember x (previous word).


More Less Pseudo

while ( c = nextcode ) {

if ( c is in D ) {

y = word(c);

ww = x + first(y);

insert ww in D;

}


The hard case

else {

y = x + first(x);

insert y in D;

}

output y;

x = y;

}
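Put together, the decompression pseudocode from the last three slides might look like this in Python, again over bytes with a 256-entry initial dictionary. The name `lzw_decompress` is hypothetical, and the explicit range check is an addition that the slides do not show.

```python
def lzw_decompress(codes):
    """Decode a list of LZW codes (byte alphabet) back into bytes."""
    if not codes:
        return b""
    D = {i: bytes([i]) for i in range(256)}   # keys are integers, values words
    x = D[codes[0]]                           # first code: a single symbol
    out = [x]
    for c in codes[1:]:
        if c in D:
            y = D[c]
            D[len(D)] = x + y[:1]             # ww = x + first(y)
        elif c == len(D):                     # the hard case
            y = x + x[:1]                     # y = x + first(x)
            D[c] = y
        else:
            raise ValueError("code number out of range")
        out.append(y)
        x = y
    return b"".join(out)

print(lzw_decompress([98, 97, 100, 100, 257]))
```

The input `[97, 256]` exercises the hard case: code 256 is not in D when it is read, and the decoder manufactures `aa` from the previous word `a`.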


LZW details

In reality, one usually restricts the code words to be 12 or 16 bit integers.

Hence, one may have to flush the dictionary every so often.

Thus it is important to conserve code numbers (see below).
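To make the 12-bit restriction concrete, here is one possible way to store such codes: two codes per three bytes. The byte layout and the names `pack12` and `unpack12` are assumptions for this sketch, not a standard file format.

```python
def pack12(codes):
    """Pack codes (each below 4096) into bytes, two codes per three bytes."""
    out = bytearray()
    for i in range(0, len(codes), 2):
        a = codes[i]
        b = codes[i + 1] if i + 1 < len(codes) else 0   # pad an odd tail with 0
        if a >= 4096 or b >= 4096:
            raise ValueError("code does not fit in 12 bits")
        out += bytes([a >> 4, ((a & 0xF) << 4) | (b >> 8), b & 0xFF])
    return bytes(out)

def unpack12(data, count):
    """Recover `count` codes from the output of pack12."""
    codes = []
    for i in range(0, len(data), 3):
        b0, b1, b2 = data[i], data[i + 1], data[i + 2]
        codes.append((b0 << 4) | (b1 >> 4))
        codes.append(((b1 & 0xF) << 8) | b2)
    return codes[:count]                                # drop the padding code

codes = [98, 97, 100, 100, 257]
packed = pack12(codes)
print(len(packed), unpack12(packed, len(codes)))
```

With 12-bit codes the dictionary can hold at most 4096 entries, which is why a compressor eventually has to flush it and start over.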


LZW details

Lastly, LZW generates as output a stream of integers.

It makes perfect sense to try to compress these further, e.g., by Huffman.


Summary of LZW

LZW is an adaptive, dictionary-based compression method.

Encoding is easy in LZW, but uses a special data structure (trie).

Decoding is slightly complicated, but requires no special data structures.