EEM 480 Lecture 11: Hashing and Dictionaries

Upload: james-powers

Post on 02-Jan-2016



Symbol Table

Symbol tables are used by compilers to keep track of information about variables, functions, class names, type names, temporary variables, etc.

Typical symbol table operations are Insert, Delete and Search. It's a dictionary structure!

Symbol Table

What kind of information is usually stored in a symbol table?
• Type (int, short, long int, float, ...)
• Storage class (label, static symbol, external def, structure tag, ...)
• Size
• Scope
• Stack frame offset
• Register

We also need a way to keep track of reserved words.

Symbol Table

Where is a symbol table stored?
• Array/linked list: simple, but linear lookup time. However, we may use a sorted array for reserved words, since they are generally few and known in advance.
• Balanced tree: O(log n) lookup time.
• Hash table: the most common implementation, with O(1) amortized time for dictionary operations.

Hashing

Hashing depends on mapping keys into positions in a table called a hash table.

Hashing is a technique used for performing insertions, deletions and searches in constant average time.

Hashing

• In this example, "john" maps to 3.
• "Phil" maps to 4 ...
• Problems:
• How will the mapping be done?
• What happens if two items map to the same place?

A Plan For Hashing

Save items in a key-indexed table. The index is a function of the key.

Hash function. Method for computing the table index from the key.

Collision resolution strategy. Algorithm and data structure to handle two keys that hash to the same index.

If there is no space limitation: trivial hash function with the key as the address.

If there is no time limitation: trivial collision resolution = sequential search.

Limitations on both time and space: hashing (the real world).

Hashing

Hash tables
• Use an array of size m to store elements.
• Given a key k (the identifier name), use a function h to compute the index h(k) for that key.
• Collisions are possible: two keys may hash into the same slot.

A good hash function
• is easy to compute;
• avoids collisions (by breaking up patterns in the keys and uniformly distributing the hash values).

Hashing

Nomenclature
• k is a key
• h(k) is the hash function
• m is the size of the hash table
• n is the number of keys in the hash table

What is Hash

(In Wikipedia) Hash is an American dish consisting of a mixture of beef (often corned beef or roast beef), onions, potatoes, and spices that are mashed together into a coarse, chunky paste, and then cooked, either alone or with other ingredients.

Is it related to our definition? To chop up any patterns in the keys so that the results are uniformly distributed.

What is Hashing

  Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string. Hashing is used to index and retrieve items in a database because it is faster to find the item using the shorter hashed key than to find it using the original value. It is also used in many encryption algorithms.

Hashing

When the key is a string, we generally use the ASCII values of its characters in some way. Examples for $k = c_1 c_2 c_3 \dots c_x$:

$h(k) = (c_1 \cdot 128^{x-1} + c_2 \cdot 128^{x-2} + \dots + c_x \cdot 128^{0}) \bmod m$

$h(k) = (c_1 + c_2 + \dots + c_x) \bmod m$

$h(k) = (h_1(c_1) + h_2(c_2) + \dots + h_x(c_x)) \bmod m$, where each $h_i$ is an independent hash function.
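The first two string hashes above can be sketched in a few lines of Python. This is only an illustration: the names `poly_hash` and `sum_hash` are made up here, and the polynomial form is evaluated with Horner's rule so that intermediate values stay below 128·m.

```python
def poly_hash(key: str, m: int) -> int:
    """(c1*128^(x-1) + ... + cx*128^0) mod m, evaluated with Horner's rule."""
    h = 0
    for ch in key:
        h = (h * 128 + ord(ch)) % m
    return h


def sum_hash(key: str, m: int) -> int:
    """(c1 + c2 + ... + cx) mod m. Anagrams always collide under this hash."""
    return sum(ord(ch) for ch in key) % m


print(poly_hash("ab", 101), poly_hash("ba", 101))  # 91 16
print(sum_hash("ab", 101), sum_hash("ba", 101))    # 94 94
```

The additive hash ignores character order, which is one reason the positional (polynomial) form usually spreads identifier names better.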

Finding A Hash Function

Goal: scramble the keys. Each table position should be equally likely for each key.

Ex: citizenship number (Vatandaşlık Numarası) for 10000 people.
• Bad: the whole number, since 10000 will not be used forever.
• Better: the last three digits. But every number is even.
• Best: use digits 2, 3, 4 and 5.

Ex: date of birth.
• Bad: first three digits of birth year.
• Better: birthday.

Ex: phone numbers.
• Bad: first three digits.
• Better: last three digits.

Hash Function

Truncation
• Ignore part of the key and use the remaining part directly as the index.
• Example: if the keys are 8-digit numbers and the hash table has 1000 entries, then the first, fourth and eighth digits could make up the hash value.
• Not a very good method: it does not distribute keys uniformly.
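A minimal sketch of the truncation example, assuming 8-digit integer keys (the function name `truncation_hash` and the sample keys are illustrative only):

```python
def truncation_hash(key: int) -> int:
    """Keep the first, fourth and eighth digit of an 8-digit key."""
    digits = f"{key:08d}"  # pad to 8 digits so positions are well defined
    return int(digits[0] + digits[3] + digits[7])


print(truncation_hash(62538194))  # 634
print(truncation_hash(61537194))  # 634, collides because the other digits are ignored
```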

Hash Function

Folding
• Break up the key into parts and combine them in some way.
• Example: if the keys are 9-digit numbers, break up a key into three 3-digit numbers and add them up.
• Ex: ISBN 0-321-37319-7. Drop the leading 0 and divide the remaining digits into three parts: 321, 373 and 197. Add them: 321 + 373 + 197 = 891. With a table of size 500, use 891 mod 500 = 391.
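The folding example can be checked with a short sketch. The helper name `fold_hash` and its `part_digits` parameter are assumptions made for illustration; the key is the ISBN with its leading 0 and hyphens dropped.

```python
def fold_hash(key: int, m: int, part_digits: int = 3) -> int:
    """Split the key into groups of part_digits digits, add them, reduce mod m."""
    base = 10 ** part_digits
    total = 0
    while key > 0:
        total += key % base  # lowest group of digits
        key //= base
    return total % m


print(fold_hash(321373197, 500))  # (321 + 373 + 197) mod 500 = 391
```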

Hash Function

Middle square
• Compute k*k and pick some digits from the resulting number.
• Example: given a 9-digit key k and a hash table of size 1000, pick three digits from the middle of the number k*k.
• Ex: for the key 175344387, take the middle digits 344. Then 344*344 = 118336, and the middle three digits give 183 or 833.
• Works fairly well in practice if the keys do not have many leading or trailing zeroes.
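As a rough sketch of the middle-square idea (the name `mid_square_hash` is hypothetical, and taking the left-of-centre window is just one of the two choices mentioned above):

```python
def mid_square_hash(k: int, digits: int = 3) -> int:
    """Square k and return `digits` digits taken from the middle of the result."""
    sq = str(k * k)
    start = max(0, (len(sq) - digits) // 2)
    return int(sq[start:start + digits])


print(mid_square_hash(344))  # 344*344 = 118336, middle window "183"
```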

Hash Function

Division
• h(k) = k mod m
• Fast.
• Not all values of m are suitable for this. For example, powers of 2 should be avoided, because then k mod m is just the least significant bits of k.
• Good values for m are prime numbers.
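A small experiment illustrates why a power of 2 is a poor table size for the division method. The keys below (multiples of 8) are an invented example of patterned input.

```python
def division_hash(k: int, m: int) -> int:
    return k % m


keys = [8 * i for i in range(1, 9)]  # 8, 16, ..., 64: a patterned key set

slots_pow2 = {division_hash(k, 16) for k in keys}   # m is a power of 2
slots_prime = {division_hash(k, 17) for k in keys}  # m is prime

print(sorted(slots_pow2))   # [0, 8]: only two slots are ever used
print(len(slots_prime))     # 8: every key gets its own slot
```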

Hash Function

Multiplication
• $h(k) = \lfloor m \, (k c - \lfloor k c \rfloor) \rfloor$, with $0 < c < 1$
• In English:
  Multiply the key k by a constant c, 0 < c < 1.
  Take the fractional part of k * c.
  Multiply that by m.
  Take the floor of the result.
• The value of m does not make a difference.
• Some values of c work better than others.
• A good value for c: $c = \frac{\sqrt{5} - 1}{2}$

Hash Function

Multiplication example: suppose the size of the table, m, is 1301, and $c = (\sqrt{5} - 1)/2$.
For k = 1234, h(k) = 850
For k = 1235, h(k) = 353
For k = 1236, h(k) = 1157
For k = 1237, h(k) = 660
For k = 1238, h(k) = 164
For k = 1239, h(k) = 968
For k = 1240, h(k) = 471
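The multiplication method is easy to try out. The sketch below assumes the constant c = (√5 − 1)/2 and uses ordinary floating-point arithmetic, which is accurate enough for keys of this size; the function name is illustrative.

```python
import math

C = (math.sqrt(5) - 1) / 2  # about 0.618


def multiplication_hash(k: int, m: int, c: float = C) -> int:
    frac = (k * c) % 1.0  # fractional part of k*c
    return int(m * frac)  # floor, since the product is non-negative


print([multiplication_hash(k, 1301) for k in range(1234, 1241)])
# [850, 353, 1157, 660, 164, 968, 471]
```

Consecutive keys land far apart in the table, which is the scattering effect the method is after.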

Hash Function

Universal hashing
• Worst-case scenario: the chosen keys all hash to the same slot.
• This can be avoided if the hash function is not fixed:
  Start with a collection of hash functions with the property that, for any given set of inputs, they will scatter the inputs well among the range of the function.
  Select one at random and use that.
• Good performance on average: the probability that the randomly chosen hash function exhibits the worst-case behavior is very low.
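The slide does not name a particular collection of hash functions. One commonly taught family, used here purely as an assumption, is h(k) = ((a·k + b) mod p) mod m with p prime and a, b chosen at random. A minimal sketch for integer keys smaller than p:

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1; keys are assumed to lie in [0, P)


def make_universal_hash(m: int, rng: random.Random):
    """Pick one function at random from the family ((a*k + b) mod P) mod m."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)

    def h(k: int) -> int:
        return ((a * k + b) % P) % m

    return h


h = make_universal_hash(10, random.Random(0))
print([h(k) for k in (9, 769, 19)])  # some slots in 0..9; differs per random choice
```

Because a and b are drawn when the table is created, no fixed set of keys is bad for every run.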

When Collision Occurs...

A collision occurs when more than one item is mapped to the same location.
Ex: n = 10, m = 10, use mod 10.
• 9 will be mapped to 9.
• 769 will be mapped to 9.

In probability theory, the birthday problem or birthday paradox pertains to the probability that, in a set of randomly chosen people, some pair of them will have the same birthday. In a group of 23 (or more) randomly chosen people, there is more than a 50% probability that some pair of them will have been born on the same day. For 57 or more people, the probability is more than 99%, reaching 100% as the number of people reaches 366. The mathematics behind this problem leads to a well-known cryptographic attack called the birthday attack.

When a collision occurs, an algorithm has to map the second, third, ..., n-th item to definite places in the table.

To read data from the table, the same algorithm has to be used to retrieve it.
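The birthday figures quoted above can be reproduced with a few lines of Python. The sketch assumes 365 equally likely birthdays; the function name is illustrative. Read "days" as table slots and "people" as keys and the same calculation shows why collisions appear long before a hash table is full.

```python
def birthday_collision_probability(n: int, days: int = 365) -> float:
    """Probability that at least two of n people share a birthday."""
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (days - i) / days
    return 1.0 - p_distinct


for n in (22, 23, 57, 366):
    print(n, round(birthday_collision_probability(n), 4))
```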

Resolving Collision

Chaining
• Put all the elements that collide in a chain (list) attached to the slot.
• The hash table is an array of linked lists.
• The load factor indicates the average number of elements stored in a chain. It could be less than, equal to, or larger than 1.

What is Load Factor?

Given a hash table of size m, and n elements stored in it, we define the load factor of the table as $\lambda = n/m$ (lambda).

The load factor gives us an indication of how full the table is.

The possible values of the load factor depend on the method we use for resolving collisions.

Return to Resolving Collision: Chaining ctd.

Chaining puts elements that hash to the same slot in a linked list.

• Separate chaining: array of M linked lists.
• Hash: map key to integer i between 0 and M-1.
• Insert: put at front of ith chain. Constant time.
• Search: only need to search ith chain. Proportional to length of chain.
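A compact sketch of separate chaining follows. It is not the lecture's implementation: Python lists stand in for linked lists, Python's built-in `hash` stands in for h(k), and `insert` first scans the chain so that a repeated key is updated rather than duplicated (which makes it proportional to the chain length instead of constant time).

```python
class ChainedHashTable:
    def __init__(self, m: int = 11):
        self.m = m                            # number of chains
        self.n = 0                            # number of stored keys
        self.chains = [[] for _ in range(m)]

    def _index(self, key) -> int:
        return hash(key) % self.m

    def insert(self, key, value) -> None:
        chain = self.chains[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                      # key already present: update in place
                chain[i] = (key, value)
                return
        chain.insert(0, (key, value))         # new key goes to the front of the chain
        self.n += 1

    def search(self, key):
        for k, v in self.chains[self._index(key)]:
            if k == key:
                return v
        return None

    def delete(self, key) -> bool:
        chain = self.chains[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                self.n -= 1
                return True
        return False

    def load_factor(self) -> float:
        return self.n / self.m


table = ChainedHashTable(10)
table.insert(9, "a")
table.insert(769, "b")      # 769 mod 10 = 9, so it shares a chain with 9
print(table.search(769))    # b
print(len(table.chains[9])) # 2
```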

Chaining

• Insert/Delete/Lookup in expected O(1) time.
• Keep the list doubly-linked to facilitate deletions.
• Worst case of lookup time is linear.
• However, the expected O(1) bound assumes that the chains are kept small.
• If the chains start becoming too long, the table must be enlarged and all the keys rehashed.

Chaining Performance

Search cost is proportional to length of chain.
• Trivial: average length = N / M.
• Worst case: all keys hash to same chain.

Theorem. Let λ = N / M > 1 be the average length of a list, which is called the load factor. Average search cost: 1 + λ/2.

What is the choice of M?
• M too large: too many empty chains.
• M too small: chains too long.
• Typical choice: λ = N / M ~ 10, which gives constant-time search/insert.

Chaining Performance

Analysis of successful search:
• The expected number e of elements examined during a successful search for key k is one more than the expected number of elements examined when k was inserted.
• It makes no difference whether we insert at the beginning or the end of the list.
• Take the average, over the n items in the table, of 1 plus the expected length of the chain to which the ith element was added.
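The formula that accompanied this slide did not survive in the transcript. Under the usual assumption that each key is equally likely to hash to any of the m slots, the chain receiving the ith element has expected length (i − 1)/m, and the averaging step described above works out as follows (a standard textbook computation, written with the same λ = n/m):

```latex
e \;=\; \frac{1}{n}\sum_{i=1}^{n}\left(1 + \frac{i-1}{m}\right)
  \;=\; 1 + \frac{1}{nm}\cdot\frac{n(n-1)}{2}
  \;=\; 1 + \frac{\lambda}{2} - \frac{1}{2m}
```

For large m the last term is negligible, which agrees with the average search cost 1 + λ/2.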

Open Addressing

• Store all elements within the table.
• The space we save from the chain pointers is used instead to make the array larger.
• If there is a collision, probe the table in a systematic way to find an empty slot.
• If the table fills up, we need to enlarge it and rehash all the keys.

Open Addressing

Hash function: (h(k) + i) mod m for i = 0, 1, ..., m-1
• Insert: start with the location where the key hashed and do a sequential search for an empty slot.
• Search: start with the location where the key hashed and do a sequential search until you either find the key (success) or find an empty slot (failure).
• Delete (lazy deletion): follow the same route but mark the slot as DELETED rather than EMPTY, otherwise subsequent searches will fail.
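The three operations above can be sketched as follows. This is an illustrative version only: it stores bare keys, uses Python's built-in `hash` for h(k), and represents EMPTY and DELETED with sentinel objects.

```python
EMPTY = object()    # slot has never been used
DELETED = object()  # slot held a key that was lazily deleted


class LinearProbingTable:
    def __init__(self, m: int = 11):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, key):
        start = hash(key) % self.m
        for i in range(self.m):
            yield (start + i) % self.m  # (h(k) + i) mod m

    def search(self, key):
        """Return the slot index holding key, or None if it is absent."""
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is EMPTY:           # an empty slot ends the search
                return None
            if slot is not DELETED and slot == key:
                return idx
        return None

    def insert(self, key) -> bool:
        if self.search(key) is not None:
            return True                 # already present
        for idx in self._probe(key):
            if self.slots[idx] is EMPTY or self.slots[idx] is DELETED:
                self.slots[idx] = key   # DELETED slots may be reused
                return True
        return False                    # table is full

    def delete(self, key) -> bool:
        idx = self.search(key)
        if idx is None:
            return False
        self.slots[idx] = DELETED       # keep the probe chain intact
        return True


t = LinearProbingTable(10)
for k in (9, 769, 19):    # all three hash to slot 9
    t.insert(k)
t.delete(769)             # slot 0 becomes DELETED, not EMPTY
print(t.search(19))       # 1: still reachable past the DELETED slot
```

Marking the slot EMPTY instead would cut the probe sequence at slot 0, and the search for 19 would wrongly fail.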

Hash Table without Linked-List

• Linear probing: array of size M.
• Hash: map key to integer i between 0 and M-1.
• Insert: put in slot i if free; if not, try i+1, i+2, etc.
• Search: search slot i; if occupied but no match, try i+1, i+2, etc.
• Cluster: contiguous block of items. Search through the cluster using an elementary algorithm for arrays.

Open Address Linear Probing

• Advantage: very easy to implement.
• Disadvantage: primary clustering. Long sequences of used slots build up with gaps between them. Every insertion requires several probes and adds to the cluster.
• The average length of a probe sequence when inserting is

$\frac{1}{2}\left(1 + \frac{1}{(1-\lambda)^2}\right)$

Quadratic ProbesQuadratic Probes

Probe the table at slots (h(k) + i2) mod m   

for i =0, 1,2, 3, ..., m-1 Ease of computation: Not as easy as linear probing.

Do we really have to compute a power? Clustering Primary clustering is avoided, since the

probes are not sequential. 

Search Quadratic Probing

Probe sequence for hash value 3 in a table of size 16 (all sums taken mod 16):

3 + 0^2 = 3
3 + 1^2 = 4
3 + 2^2 = 7
3 + 3^2 = 12
3 + 4^2 = 3
3 + 5^2 = 12
3 + 6^2 = 7
3 + 7^2 = 4
3 + 8^2 = 3
3 + 9^2 = 4
3 + 10^2 = 7
3 + 11^2 = 12
3 + 12^2 = 3
3 + 13^2 = 12
3 + 14^2 = 7
3 + 15^2 = 4

Only four distinct slots (3, 4, 7 and 12) are ever probed.

Quadratic Probing

Probe sequence for hash value 3 in a table of size 19 (all sums taken mod 19):

3 + 0^2 = 3
3 + 1^2 = 4
3 + 2^2 = 7
3 + 3^2 = 12
3 + 4^2 = 0
3 + 5^2 = 9
3 + 6^2 = 1
3 + 7^2 = 14
3 + 8^2 = 10
3 + 9^2 = 8
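The two probe sequences for hash value 3 (table sizes 16 and 19) can be regenerated with a short sketch. It also answers the question "do we really have to compute a power?": since (i+1)² − i² = 2i + 1, the offset can be updated with one addition per probe. The function name is illustrative.

```python
def quadratic_probes(h: int, m: int) -> list[int]:
    """Slots (h + i*i) mod m for i = 0 .. m-1, without computing a power."""
    probes = []
    offset = 0                  # holds i*i
    for i in range(m):
        probes.append((h + offset) % m)
        offset += 2 * i + 1     # (i+1)^2 = i^2 + 2i + 1
    return probes


print(quadratic_probes(3, 16))       # only the slots 3, 4, 7, 12 ever appear
print(quadratic_probes(3, 19)[:10])  # [3, 4, 7, 12, 0, 9, 1, 14, 10, 8]
```

With the prime size 19 the first ten probes are all different, whereas with size 16 the sequence keeps revisiting the same four slots.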

Quadratic Probing

• Disadvantage: secondary clustering. If h(k1) == h(k2), the probing sequences for k1 and k2 are exactly the same.
• Is this really bad? In practice, not so much. It becomes an issue when the load factor is high.

Double Hashing

• The hash function is $(h(k) + i \, h_2(k)) \bmod m$.
• In English: use a second hash function to obtain the next slot.
• The probing sequence is: h(k), h(k) + h2(k), h(k) + 2 h2(k), h(k) + 3 h2(k), ...
• Performance: much better than linear or quadratic probing. Does not suffer from clustering, BUT requires computation of a second function.

Double Hashing

• The choice of h2(k) is important. It must never evaluate to zero: consider h2(k) = k mod 9 for k = 81.
• The choice of m is important. If it is not prime, we may run out of alternate locations very fast.
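Both warnings can be demonstrated with a small sketch. Here h(k) = k mod m is assumed for the first hash, and `safe_h2`, which returns 1 + (k mod (m − 1)), is one common way (not taken from the slides) to guarantee a non-zero step.

```python
from typing import Callable


def double_hash_probes(k: int, m: int, h2: Callable[[int], int]) -> list[int]:
    """Slots (h(k) + i*h2(k)) mod m for i = 0 .. m-1, with h(k) = k mod m."""
    start, step = k % m, h2(k)
    return [(start + i * step) % m for i in range(m)]


def safe_h2(m: int) -> Callable[[int], int]:
    return lambda k: 1 + (k % (m - 1))  # always in 1 .. m-1, never zero


# h2(81) = 81 mod 9 = 0: every probe hits the same slot.
print(set(double_hash_probes(81, 13, lambda k: k % 9)))  # {3}

# Prime m = 13 with a non-zero step: all 13 slots are reached.
print(len(set(double_hash_probes(81, 13, safe_h2(13)))))  # 13

# Non-prime m = 12 with step 4: only 3 slots are reachable.
print(sorted(set(double_hash_probes(3, 12, safe_h2(12)))))  # [3, 7, 11]
```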

Rehashing

After 70% of the table is full, double the size of the hash table.

Don't forget to make the new size a prime number.
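One way to combine the two rules (grow at 70% load, keep the size prime) is to take the first prime at or above twice the old size. The helper names below are hypothetical and the primality test is simple trial division, which is fine for table sizes.

```python
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def next_prime(n: int) -> int:
    """Smallest prime greater than or equal to n."""
    while not is_prime(n):
        n += 1
    return n


def needs_rehash(n: int, m: int, max_load: float = 0.7) -> bool:
    return n / m > max_load


def grown_size(m: int) -> int:
    return next_prime(2 * m)


print(needs_rehash(8, 11))  # True: 8/11 is about 0.73
print(grown_size(11))       # 23
print(grown_size(101))      # 211
```

After choosing the new size, every stored key has to be reinserted, because its slot depends on m.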

Lempel-Ziv-Welch (LZW) Compression Algorithm

• Introduction to the LZW Algorithm
• Example 1: Encoding using LZW
• Example 2: Decoding using LZW
• LZW: Concluding Notes

Introduction to LZW

• As mentioned earlier, static coding schemes require some knowledge about the data before encoding takes place.
• Universal coding schemes, like LZW, do not require advance knowledge and can build such knowledge on the fly.
• LZW is the foremost technique for general purpose data compression due to its simplicity and versatility.
• It is the basis of many PC utilities that claim to "double the capacity of your hard drive".
• LZW compression uses a code table, with 4096 as a common choice for the number of table entries.

Introduction to LZW (cont'd)

• Codes 0-255 in the code table are always assigned to represent single bytes from the input file.
• When encoding begins, the code table contains only the first 256 entries, with the remainder of the table being blank.
• Compression is achieved by using codes 256 through 4095 to represent sequences of bytes.
• As the encoding continues, LZW identifies repeated sequences in the data and adds them to the code table.
• Decoding is achieved by taking each code from the compressed file and translating it through the code table to find what character or characters it represents.

LZW Encoding Algorithm

Initialize table with single character strings
P = first input character
WHILE not end of input stream
    C = next input character
    IF P + C is in the string table
        P = P + C
    ELSE
        output the code for P
        add P + C to the string table
        P = C
END WHILE
output code for P
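The encoding pseudocode translates almost line for line into Python. This sketch uses a dict as the string table (itself a hash table, which is the link to the first half of the lecture) and, as a simplification, lets the table grow without the 4096-entry limit.

```python
def lzw_encode(text: str) -> list[int]:
    """Encode text as a list of LZW codes (table size is not capped here)."""
    if not text:
        return []
    table = {chr(i): i for i in range(256)}  # single-character strings
    next_code = 256
    p = text[0]
    output = []
    for c in text[1:]:
        if p + c in table:
            p = p + c
        else:
            output.append(table[p])
            table[p + c] = next_code         # add P + C to the string table
            next_code += 1
            p = c
    output.append(table[p])
    return output


print(lzw_encode("BABAABAAA"))  # [66, 65, 256, 257, 65, 260]
```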

Example 1: Compression using LZW

Example 1: Use the LZW algorithm to compress the string

BABAABAAA

Example 1: LZW Compression Step 1

BABAABAAA    P = A    C = empty

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA

Example 1: LZW Compression Step 2

BABAABAAA    P = B    C = empty

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA
65            A              257        AB

Example 1: LZW Compression Step 3

BABAABAAA    P = A    C = empty

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA
65            A              257        AB
256           BA             258        BAA

Example 1: LZW Compression Step 4

BABAABAAA    P = A    C = empty

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA
65            A              257        AB
256           BA             258        BAA
257           AB             259        ABA

Example 1: LZW Compression Step 5

BABAABAAA    P = A    C = A

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA
65            A              257        AB
256           BA             258        BAA
257           AB             259        ABA
65            A              260        AA

Example 1: LZW Compression Step 6

BABAABAAA    P = AA    C = empty

ENCODER OUTPUT               STRING TABLE
output code   representing   codeword   string
66            B              256        BA
65            A              257        AB
256           BA             258        BAA
257           AB             259        ABA
65            A              260        AA
260           AA

LZW Decompression

• The LZW decompressor creates the same string table during decompression.
• It starts with the first 256 table entries initialized to single characters.
• The string table is updated for each code in the input stream, except the first one.
• Decoding is achieved by reading codes and translating them through the code table being built.

LZW Decompression Algorithm

Initialize table with single character strings
OLD = first input code
output translation of OLD
C = first character of the translation of OLD
WHILE not end of input stream
    NEW = next input code
    IF NEW is not in the string table
        S = translation of OLD
        S = S + C
    ELSE
        S = translation of NEW
    output S
    C = first character of S
    add translation of OLD + C to the string table
    OLD = NEW
END WHILE
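A Python sketch of the decompression loop is given below. It rebuilds the string table while decoding and handles the one special case, a code that is not yet in the table, by using the translation of OLD plus its own first character. Rejecting any other unknown code with an exception is a choice made here, not something the slides specify.

```python
def lzw_decode(codes: list[int]) -> str:
    """Decode a list of LZW codes produced with an initial 256-entry table."""
    if not codes:
        return ""
    table = {i: chr(i) for i in range(256)}  # single-character strings
    next_code = 256
    old = codes[0]
    output = [table[old]]
    for new in codes[1:]:
        if new in table:
            s = table[new]
        elif new == next_code:               # code not yet in the table
            s = table[old] + table[old][0]
        else:
            raise ValueError(f"invalid LZW code: {new}")
        output.append(s)
        table[next_code] = table[old] + s[0] # translation of OLD + C
        next_code += 1
        old = new
    return "".join(output)


print(lzw_decode([66, 65, 256, 257, 65, 260]))  # BABAABAAA
```

The last code, 260, is exactly the special case: it is used one step before the decoder has added it to its table.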

Example 2: LZW Decompression

Example 2: Use LZW to decompress the output sequence of Example 1:

<66><65><256><257><65><260>

Page 53: EEM 480 Lecture 11 Hashing and Dictionaries. Symbol Table Symbol tables are used by compilers to keep track of information about variables functions class

Example 2: LZW Decompression Step 1Example 2: LZW Decompression Step 1

<66><65><256><257><65><260> Old = 65 S = A<66><65><256><257><65><260> Old = 65 S = A

New = 66 C = ANew = 66 C = A

ENCODER OUTPUTENCODER OUTPUT STRING TABLESTRING TABLE

stringstring codewordcodeword stringstring

BB

AA 256256 BABA

Page 54: EEM 480 Lecture 11 Hashing and Dictionaries. Symbol Table Symbol tables are used by compilers to keep track of information about variables functions class

Example 2: LZW Decompression Step 2

<66><65><256><257><65><260>
Old = 256    S = BA
New = 256    C = B

ENCODER OUTPUT    STRING TABLE
string            codeword    string
B
A                 256         BA
BA                257         AB


Example 2: LZW Decompression Step 3

<66><65><256><257><65><260>
Old = 257    S = AB
New = 257    C = A

ENCODER OUTPUT    STRING TABLE
string            codeword    string
B
A                 256         BA
BA                257         AB
AB                258         BAA


Example 2: LZW Decompression Step 4

<66><65><256><257><65><260>
Old = 65    S = A
New = 65    C = A

ENCODER OUTPUT    STRING TABLE
string            codeword    string
B
A                 256         BA
BA                257         AB
AB                258         BAA
A                 259         ABA


Example 2: LZW Decompression Step 5

<66><65><256><257><65><260>
Old = 260    S = AA
New = 260    C = A

ENCODER OUTPUT    STRING TABLE
string            codeword    string
B
A                 256         BA
BA                257         AB
AB                258         BAA
A                 259         ABA
AA                260         AA
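The rows of the table built up over the five steps can be reproduced with a short trace routine. This is only a sketch under assumptions: `lzw_trace` is a hypothetical name, the input is a non-empty list of integer codes, and the initial table holds the 256 single-byte characters.

```python
def lzw_trace(codes):
    """Return one (output, codeword, new_entry) row per decoded code."""
    table = {i: chr(i) for i in range(256)}
    old = codes[0]
    c = table[old][0]
    # The first code is only output; it adds nothing to the table.
    rows = [(table[old], None, None)]
    for new in codes[1:]:
        # A code that is not in the table yet must be translation(OLD) + C.
        s = table[new] if new in table else table[old] + c
        c = s[0]
        codeword = len(table)            # next free codeword: 256, 257, ...
        table[codeword] = table[old] + c
        rows.append((s, codeword, table[codeword]))
        old = new
    return rows


for output, codeword, entry in lzw_trace([66, 65, 256, 257, 65, 260]):
    print(output, codeword, entry)
```

The last row is the interesting one: code 260 is not in the table when it is read, so its translation is formed from the translation of OLD (A) plus C (A).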


LZW: Some Notes

This algorithm compresses repetitive sequences of data well.

Since the codewords are 12 bits, any single encoded character will expand the data size rather than reduce it.

In this example, 72 bits are represented with 72 bits of data: nine 8-bit characters are encoded as six 12-bit codewords. After a reasonable string table is built, compression improves dramatically.
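The bit counts can be checked with a few lines of arithmetic. The sketch below assumes 8 bits per input character; the 12-bit codeword width is the one stated in these notes, and the helper name `lzw_bit_counts` is hypothetical.

```python
CODE_BITS = 12   # codeword width stated in the notes
CHAR_BITS = 8    # assumption: one 8-bit byte per input character


def lzw_bit_counts(num_chars, num_codes):
    """Return (input_bits, output_bits) for one LZW run."""
    return num_chars * CHAR_BITS, num_codes * CODE_BITS


# The example: "BABAABAAA" (9 characters) is encoded as 6 codewords.
original_bits, encoded_bits = lzw_bit_counts(num_chars=9, num_codes=6)
print(original_bits, encoded_bits)

# A lone character costs 12 bits instead of 8.
single_char = lzw_bit_counts(num_chars=1, num_codes=1)
```

Compression only pays off once codewords stand for strings of two or more characters often enough to offset the 4 extra bits per codeword.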

Advantages of LZW over Huffman:
LZW requires no prior information about the input data stream.
LZW can compress the input stream in one single pass.
Another advantage of LZW is its simplicity, allowing fast execution.


LZW: Limitations

What happens when the dictionary gets too large (i.e., when all the 4096 locations have been used)?

Here are some options usually implemented:

Simply forget about adding any more entries and use the table as is.

Throw the dictionary away when it reaches a certain size.

Throw the dictionary away when it is no longer effective at compression.

Clear entries 256-4095 and start building the dictionary again.

Some clever schemes rebuild a string table from the last N input characters.
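Two of these options, keeping the full table as is and clearing entries 256 and up, can be sketched on the compression side. Everything here is illustrative: the function name `lzw_compress` and the `max_size` and `on_full` parameters are hypothetical, the input is assumed to contain only characters with code points below 256, and a matching decoder would have to apply the same rule at the same point (not shown).

```python
def lzw_compress(text, max_size=4096, on_full="freeze"):
    """Encode text as a list of LZW codes with a bounded string table.

    on_full="freeze": stop adding entries once the table is full.
    on_full="reset":  drop entries 256 and up and start rebuilding.
    """
    if on_full not in ("freeze", "reset"):
        raise ValueError("on_full must be 'freeze' or 'reset'")

    def fresh_table():
        return {chr(i): i for i in range(256)}

    table = fresh_table()
    codes = []
    s = ""
    for ch in text:
        if s + ch in table:
            s += ch                      # keep extending the current match
            continue
        codes.append(table[s])           # emit the longest known string
        if len(table) < max_size:
            table[s + ch] = len(table)   # room left: add the new string
        elif on_full == "reset":
            table = fresh_table()        # table full: clear and rebuild
        # on_full == "freeze": table full, keep using it unchanged
        s = ch
    if s:
        codes.append(table[s])
    return codes
```

With the default 4096-entry table, `lzw_compress("BABAABAAA")` returns `[66, 65, 256, 257, 65, 260]`. A tiny table such as `max_size=258` makes the two policies easy to compare on the same input.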