Dictionaries and Hash Tables

Upload: ferdinand-barker

Post on 31-Dec-2015


Page 1: Dictionaries and Hash Tables. Dictionary A dictionary, in computer science, implies a container that stores key-element pairs called items, and allows

Dictionaries and Hash Tables

Dictionary

A dictionary, in computer science, is a container that stores key-element pairs, called items, and allows for quick retrieval by key.

– Items must be stored in a way that allows them to be located with the key

– It is not necessary to store the items in order, which gives two variants:
   • Unordered dictionary
   • Ordered dictionary

Dictionary ADT

Operations in a Dictionary ADT:

– int size()
– bool isEmpty()
– iter elements()
– iter keys()
– pos find( key )
– iter findAll( key )
– void insertItem( key, elem )
– void removeElement( key )
– void removeAllElements( key )

Dictionary Examples

Natural language dictionary
• word is key
• element contains word, definition, pronunciation, etc.

Web pages
• URL is key
• HTML or other file is element

Any typical database (e.g. student record)
• has one or more search keys
• each key may require its own organizational dictionary

Implementing a Dictionary

There are many ways a dictionary can be implemented. Some of them are:

– Log file or audit trail
– Ordered dictionary and binary search trees
– Hash table

Log File or Audit Trail

This is the simplest way to implement a dictionary. It uses an unordered vector, list or sequence to store the key-element pairs.

void insertItem( key, elem )
   Each new item is appended at the end – O(1)

pos find( key )
   Scan the entire list and examine each key – O(n)

void removeElement( key )
   Scan the entire list to find the item, then remove it – O(n)

This allows for fast insertions. However, searches and removals are slow.

– Good solution for items that are stored frequently but retrieved rarely, such as archived database and operating system transactions

– Storing log files
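
The log-file approach can be sketched in a few lines of C++. This is only an illustration, assuming int keys and string elements; the class name LogFileDictionary and the use of -1 for "not found" are choices made here, not part of the slides.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical log-file dictionary: an unordered vector of (key, element) pairs.
class LogFileDictionary {
public:
    // Append the new item at the end: O(1) (amortized for std::vector).
    void insertItem(int key, const std::string& elem) {
        items_.push_back(std::make_pair(key, elem));
    }

    // Scan the whole sequence: O(n). Returns the index of the first match, or -1.
    int find(int key) const {
        for (std::size_t i = 0; i < items_.size(); ++i) {
            if (items_[i].first == key) return static_cast<int>(i);
        }
        return -1;
    }

    // Scan to find the item, then erase it: O(n).
    void removeElement(int key) {
        int pos = find(key);
        if (pos >= 0) items_.erase(items_.begin() + pos);
    }

    int size() const { return static_cast<int>(items_.size()); }

private:
    std::vector<std::pair<int, std::string> > items_;
};
```

Insertion never looks at the existing items, which is why this layout suits write-heavy logs.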

Ordered Dictionary ADT

All of the Dictionary operations, e.g. find(k), insertItem(k,e), removeElement(k)

Additional operations

– pos closestBefore( key )

– pos closestAfter( key )

Look-Up Tables

A look-up table is an implementation of an ordered dictionary (e.g. a trigonometry table).

Here is an example, where all items are stored in a vector, in ascending order of the keys.

index:  0   1   2   3   4   5   6   7   8   9   10
A:      5   13  15  16  21  26  37

Lookup Table Performance

In a look-up table, inserting or removing may require shifting elements

Example: Insert an item with a key of 2

Before:
index:  0   1   2   3   4   5   6   7   8   9   10
A:      5   13  15  16  21  26  37

After:
index:  0   1   2   3   4   5   6   7   8   9   10
A:      2   5   13  15  16  21  26  37

All n existing elements are shifted to make room.

insertItem(k,e) takes O(n) time in the worst case
removeElement(k) takes O(n) time in the worst case
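
A minimal sketch of the shifting insert, assuming the table is a std::vector<int> of keys only; the helper name insertSorted is made up for this example.

```cpp
#include <algorithm>
#include <vector>

// Insert key into a sorted vector, keeping it sorted.
// Returns the index at which the key was placed.
int insertSorted(std::vector<int>& table, int key) {
    // Find the first element that is not less than key: O(log n).
    std::vector<int>::iterator pos =
        std::lower_bound(table.begin(), table.end(), key);
    int index = static_cast<int>(pos - table.begin());
    // vector::insert shifts every later element one place right: O(n) worst case.
    table.insert(pos, key);
    return index;
}
```

Inserting 2 into {5, 13, 15, 16, 21, 26, 37} places it at index 0, which is the worst case because every existing element moves.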

Lookup Table – find(k)

However, since the items in a lookup table are ordered, we can implement find(k) with a binary search algorithm.

A binary search algorithm (or binary chop) is a technique for finding a particular value in a sorted linear array by ruling out half of the data at each step. A binary search examines the middle element, makes a comparison to determine whether the desired value comes before or after it, and then searches the remaining half in the same manner, so find(k) takes O(log n) time. A binary search is an example of a divide and conquer algorithm.

Binary Search

Example: find(22)

index:  0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
A:      2   4   5   7   8   9   12  14  17  19  22  25  27  28  33  37

Step 1: low = 0, high = 15, mid = 7; A[7] = 14 is less than 22, so search the upper half
Step 2: low = 8, high = 15, mid = 11; A[11] = 25 is greater than 22, so search the lower half
Step 3: low = 8, high = 10, mid = 9; A[9] = 19 is less than 22, so search the upper half
Step 4: low = mid = high = 10; A[10] = 22, found

Binary Search Algorithm

Algorithm BinarySearch( A, k, low, high )
   if low > high then
      return Null
   else
      mid = (low + high) / 2
      if ( k == key(mid) ) then
         return Position(mid)
      else if ( k < key(mid) ) then
         return BinarySearch( A, k, low, mid - 1 )
      else
         return BinarySearch( A, k, mid + 1, high )
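
The pseudocode translates almost line for line into C++. The sketch below assumes a std::vector<int> of keys and returns -1 where the pseudocode returns Null; both are choices made for this example.

```cpp
#include <vector>

// Recursive binary search over a sorted vector.
// Returns the index of k, or -1 (standing in for Null) if k is absent.
int binarySearch(const std::vector<int>& a, int k, int low, int high) {
    if (low > high) return -1;
    // Same midpoint as (low + high) / 2, written so the sum cannot overflow.
    int mid = low + (high - low) / 2;
    if (k == a[mid]) return mid;
    if (k < a[mid]) return binarySearch(a, k, low, mid - 1);
    return binarySearch(a, k, mid + 1, high);
}
```

Called as binarySearch(A, 22, 0, 15) on the 16-element example array, it visits indices 7, 11, 9 and 10.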

Hash Tables

In computer science, a hash table, or a hash map, is a data structure that associates keys with values. The primary operation it supports efficiently is a lookup: given a key (e.g. a person's name), find the corresponding value (e.g. that person's telephone number). It works by transforming the key using a hash function into a hash, a number that the hash table uses to locate the desired value.

With a good hash function, this is generally considered the most efficient way to implement an unordered dictionary, since lookups take O(1) expected time.

Bucket Arrays

A bucket array for a hash table is an array A of size N, where each cell of A is thought of as a ‘bucket’, and N defines the capacity of the array.

Example

– Small company with fewer than 100 employees
– Each employee has an ID number in the range 0–99
– Store employee records in an array, so that the employee ID number matches the array index

index:  0      1              2               3      4             …
A:      EMPTY  01 Turing, A.  02 Babbage, C.  EMPTY  04 Gates, W.  …

Bucket Arrays

If the keys are unique, then searches, insertions and removals in the bucket array take worst-case time of O(1).

However, bucket arrays have 2 drawbacks.

– The array requires a capacity of N (the maximum number of elements possible), however few items are actually stored

– The key has to be an integer in the range [0, N-1]

Hash Functions

A good hash function is essential for good hash table performance. If a hash function tends to produce the same value for many different keys, slow searches will result.

Example

– Small company with fewer than 100 employees
– Already uses a 5-digit ID number

A simple hash function for this example is ( ID % 100 )

index:  0      1                 2                  3      4                …
A:      EMPTY  55301 Turing, A.  81202 Babbage, C.  EMPTY  77404 Gates, W.  …

Hash Functions

A hash function is a way of creating a small digital "fingerprint" from any kind of data. The function chops and mixes the data to create the fingerprint, often called a hash value. A good hash function is one that yields few hash collisions in expected input domains.

To do this, the index into the hash table's array is generally calculated in two steps:

– A generic hash value is calculated to map the key to an integer ( hash code )

– This value is reduced to a valid array index ( compression map )

Hash Code

Take an arbitrary key k and assign it an integer value h. Then h is known as the hash code or hash value of k.

key -> integer

This integer h does not need to be in the range of the array that is being used for hashing and may even be a negative number, but we want the set of hash codes assigned to our keys to avoid collisions as much as possible.

Hash coding can be done in many ways:

– Integer cast

– Summing components

– Polynomial accumulations

Hash code – Integer Cast

int hashCode( int key ) {
    return key;
}

int hashCode( char key ) {
    return hashCode( int(key) );  // cast it to an integer
}

Hash code – Summing Components

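If the long int has twice as many bits as the int datatype (e.g. 32 bits for int, 64 bits for long), treat the high-order bits as one integer and the low-order bits as another, then sum them.

int hashCode( long key ) {
    typedef unsigned long ulong;
    // Assumes a 64-bit long. The halves are summed as unsigned values
    // so that the addition cannot overflow a signed int.
    unsigned high = unsigned( ulong(key) >> 32 );
    unsigned low  = unsigned( ulong(key) );
    return hashCode( int( high + low ) );
}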

Hash code – Summing Components Applied to Strings

One approach is to sum the ASCII values of all the chars in the string

– Problem: too many collisions, because many different words will have the same result

– For example, stop, tops, pots, spot

ASCII

s =   115
t =   116
o =   111
p = + 112

Hashcode = 454
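
A short sketch of the summing hash code for strings; the function name sumHashCode is made up here. Because addition ignores order, every anagram gets the same code.

```cpp
#include <string>

// Hash code obtained by summing the character codes of the string.
int sumHashCode(const std::string& s) {
    int h = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        h += static_cast<unsigned char>(s[i]);
    }
    return h;
}
```

With ASCII character codes, sumHashCode gives 454 for each of "stop", "tops", "pots" and "spot".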

Hash code – Polynomial Accumulation

Better approach for string keys

– Modify each char’s ASCII value by a number based on its position in the string
– Then sum the results
– Where x represents a char, k is the total number of chars, and a is a constant (but not 1), the following formula can be used:

x_0 a^{k-1} + x_1 a^{k-2} + … + x_{k-2} a + x_{k-1}

Example: assume that the string is “stop” and a = 10

s =   115 * 10^3 = 115000
t =   116 * 10^2 =  11600
o =   111 * 10^1 =   1110
p =   112 * 10^0 = +  112

Hashcode = 127822
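
The polynomial can be evaluated left to right with Horner's rule, which avoids computing the powers of a separately. The sketch below is one way to write it; the name polyHashCode and the unsigned long return type are assumptions of this example.

```cpp
#include <string>

// Polynomial hash code evaluated with Horner's rule:
// ((x0 * a + x1) * a + x2) * a + ... + x(k-1)
unsigned long polyHashCode(const std::string& s, unsigned long a) {
    unsigned long h = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        // Unsigned arithmetic wraps around on overflow instead of being undefined.
        h = h * a + static_cast<unsigned char>(s[i]);
    }
    return h;
}
```

With a = 10, "stop" yields 127822 as in the worked example, while the anagram "tops" yields 128335, so the two no longer collide.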

Compression Maps

This is the second part of the hash function. Once we have a hash code, we need to map it to an integer in the range of array index numbers.

This can be accomplished in many ways:

– Truncation
– Truncation and Summation
– Division method
– MAD method

Compression Maps - Truncation

One way would be to simply ignore parts of the key and use the remaining part.

E.g.:
employee number: 15436578
bucket size: 1000
possibility 1: k = last 3 digits = 578
possibility 2: k = digits 4, 6 and 8 = 358

This is a fast scheme, but it often fails to give an even distribution of keys throughout the table, because the ignored digits play no part in the result.
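
Both truncation possibilities can be written with integer division and remainder. This is a sketch that assumes a non-negative 8-digit key with digits counted from the left; the function names are made up for the example.

```cpp
// Possibility 1: keep the last three decimal digits.
int lastThreeDigits(long key) {
    return static_cast<int>(key % 1000);
}

// Possibility 2: pick out the 4th, 6th and 8th digits of an 8-digit key.
int digits468(long key) {
    int d4 = static_cast<int>((key / 10000) % 10);  // 4th digit from the left
    int d6 = static_cast<int>((key / 100) % 10);    // 6th digit from the left
    int d8 = static_cast<int>(key % 10);            // 8th (last) digit
    return d4 * 100 + d6 * 10 + d8;
}
```

For the employee number 15436578 these return 578 and 358 respectively.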

Compression Maps – Truncation and Summation

This method uses a combination of truncating and summing parts of the key.

E.g.:
employee number: 15496578
bucket size: 1000
possibility: partition the key into groups of 3 digits, add them together, and truncate if necessary
k = 154 + 965 + 78 = 1197, truncated to 197

This provides a better spread than simple truncation, but it still does not prevent collisions.
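
One way to sketch the partition-and-sum step is to work on the decimal digits as a string. The function name foldAndTruncate is hypothetical, and the code assumes a non-negative key and a bucket size of 1000.

```cpp
#include <string>

// Split the decimal digits into groups of three from the left,
// sum the groups, then truncate the sum to three digits.
int foldAndTruncate(long key) {
    std::string digits = std::to_string(key);
    int sum = 0;
    for (std::string::size_type i = 0; i < digits.size(); i += 3) {
        sum += std::stoi(digits.substr(i, 3));  // e.g. "154", "965", "78"
    }
    return sum % 1000;  // e.g. 1197 becomes 197
}
```

Every digit of the key now influences the index, which is where the better spread comes from.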

Compression Maps - Division Method

int k = hashCode( key );
int index = abs(k) % ARRAY_SIZE;

It has been found that the size of the array should be a prime number. This reduces the number of collisions and spreads out the distribution of hashed values.

Example
Keys = {200,210,220,230,…,600}

– If array size = 100 (a non-prime number), every hash code collides with others
– If array size = 101 (a prime number), there are fewer collisions
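
The effect of the array size can be checked by counting how many distinct buckets the example keys occupy. The helper distinctBuckets is made up for this sketch and hard-codes the key set 200, 210, …, 600 (41 keys).

```cpp
#include <set>

// Number of distinct buckets used by the keys 200, 210, ..., 600
// under the division method (key % arraySize).
int distinctBuckets(int arraySize) {
    std::set<int> used;
    for (int key = 200; key <= 600; key += 10) {
        used.insert(key % arraySize);
    }
    return static_cast<int>(used.size());
}
```

With an array size of 100 the 41 keys share only 10 buckets; with 101 each key lands in its own bucket.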

Compression Maps - MAD Method

This is another method to convert the hash code into a known range. MAD stands for “Multiply, Add, and Divide” where

– a and b are non-negative integers
– (a % ARRAY_SIZE) must not be 0
– a and b are chosen at random when the program is written

int k = hashCode( key );
int i = abs(a * k + b) % ARRAY_SIZE;

Example:
Keys = {200,210,220,230,…,600}
where a=8, b=7, array size = 100

200 => (8*200+7) % 100 => 7
210 => (8*210+7) % 100 => 87
220 => (8*220+7) % 100 => 67
230 => (8*230+7) % 100 => 47
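
The MAD compression step is a one-liner; the sketch below wraps it in a function (the name madCompress is an assumption of this example) so the worked values can be reproduced.

```cpp
#include <cstdlib>

// MAD compression: multiply by a, add b, then reduce modulo the array size.
int madCompress(int k, int a, int b, int arraySize) {
    return std::abs(a * k + b) % arraySize;
}
```

With a = 8, b = 7 and an array size of 100, the keys 200, 210, 220 and 230 map to 7, 87, 67 and 47. Continuing the sequence, 240 maps to 27 and 250 maps back to 7, so the choice of a and the array size still matters.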

Collisions

Nothing requires the keys to be unique, or the hash function to generate a unique value for each key. This means that more than one element may be mapped to the same position. This creates a collision.

Collisions

Two different keys are mapped to the same location in the array

Best approach – minimize collisions by picking a good hash function

Example

– A bad hash function is ( key % 100 ), because patterns in the keys (such as multiples of 10) concentrate in a few buckets and cause collisions. ( key % 101 ) is better.

Collisions

If two keys hash to the same index, the corresponding records cannot be stored in the same location. So, if the location is already occupied, we must find another location to store the new record, and do it in a way that lets us find the record when we look it up later on.

Example

– Previous hash function of ( ID % 100 ) is too likely to cause collisions

index:  0      1                 2                  3      4                …
A:      EMPTY  55301 Turing, A.  81202 Babbage, C.  EMPTY  77404 Gates, W.  …

38104 McNealy, S. also hashes to index 4, which already holds 77404 Gates, W. (collision!)

Collision Handling

There are a number of collision resolution techniques, but the two most popular approaches are:

– Chaining

– Open addressing

Chaining

Separate chaining is a method for dealing with collisions. The hash table is an array of linked lists. Data elements that hash to the same value are stored in a linked list originating from the index equivalent of their hash value.

– Each location in the hash table holds a pointer to a list

– Each list can hold many items

– As long as the hash function is good, the lists will be small because there will be few collisions

Separate Chaining Example

A has 13 buckets (indices 0 to 12); each key is stored in the list at index key mod 13.

 2:  28 -> 54 -> 41 -> NULL
 5:  18 -> NULL
10:  10 -> 36 -> NULL
12:  12 -> 38 -> 25 -> 90 -> NULL

All other buckets are empty.
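
A compact sketch of separate chaining, assuming non-negative int keys and the hash function key mod capacity; the class name ChainedTable and its member names are invented for this example.

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical separate-chaining table of int keys.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t capacity) : buckets_(capacity) {}

    // Append the key to the list that starts at its bucket.
    void insert(int key) { buckets_[index(key)].push_back(key); }

    // Only the one chain selected by the hash value is searched.
    bool contains(int key) const {
        const std::list<int>& chain = buckets_[index(key)];
        for (std::list<int>::const_iterator it = chain.begin(); it != chain.end(); ++it) {
            if (*it == key) return true;
        }
        return false;
    }

    std::size_t chainLength(std::size_t i) const { return buckets_[i].size(); }

private:
    // Assumes key >= 0.
    std::size_t index(int key) const {
        return static_cast<std::size_t>(key) % buckets_.size();
    }

    std::vector<std::list<int> > buckets_;
};
```

Inserting 12, 38, 25, 90, 10, 36, 28, 54, 41 and 18 into a table of capacity 13 gives chains of length 4, 2, 3 and 1 at indices 12, 10, 2 and 5.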

Open Addressing

This is a method where each bucket stores at most one item. If multiple elements map to the same bucket, some method must be used to find an empty bucket

• Linear probing

h’(k) = ( h(k) + j ) mod N where j = 0, 1, 2, 3, . . .

» Keep adding 1 to the rank (index) to find an empty bucket

• Quadratic probing

h’(k) = ( h(k) + j² ) mod N where j = 0, 1, 2, 3, . . .

• Double hashing

h’(k) = ( h(k) + j * h’’(k) ) mod N where j = 0, 1, 2, 3, . . .

where h’’(k) = q – ( k mod q )

Linear Probing

If a bucket is already occupied, then try the next available bucket

index:  0      1                 2                  3      4                …
A:      EMPTY  55301 Turing, A.  81202 Babbage, C.  EMPTY  77404 Gates, W.  …

38104 McNealy, S. hashes to index 4, which is already occupied by 77404 Gates, W. (collision!), so the record is stored in the next available bucket instead.

Linear Probing – insertItem(k,e)

If a location is already occupied, then try the next available location

Example:

– h(k) = ( (k % cap) + j ) mod cap where j = 0, 1, 2, 3, . . .
– Insert the following keys into hash table A (cap = 11)

{13,26,5,37,16,21,15}

index:  0   1   2   3   4   5   6   7   8   9   10
A:      -   -   13  -   26  5   37  16  15  -   21

(- marks an empty location)
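
The insertion can be traced with a small sketch, assuming non-negative int keys and using -1 as a made-up EMPTY marker; the function name insertLinear is also an assumption.

```cpp
#include <vector>

const int EMPTY = -1;  // hypothetical marker for an unused location

// Insert key with linear probing.
// Returns the index used, or -1 if every location is occupied.
int insertLinear(std::vector<int>& table, int key) {
    int cap = static_cast<int>(table.size());
    for (int j = 0; j < cap; ++j) {
        int i = ((key % cap) + j) % cap;  // h(k) = ((k % cap) + j) mod cap
        if (table[i] == EMPTY) {
            table[i] = key;
            return i;
        }
    }
    return -1;
}
```

Inserting 13, 26, 5, 37, 16, 21, 15 in that order into a table of 11 locations places them at indices 2, 4, 5, 6, 7, 10 and 8. The last key, 15, needs five probes (4, 5, 6, 7, 8), which shows how runs of occupied locations grow.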

Linear Probing – Using Lazy Deletes

Problem:

– If the find() operation is looking for a key, it stops looking when it gets to an empty location and assumes the key isn’t there

– If several items whose probe sequences pass through the same locations are stored in the hash table with linear probing and then one of them is deleted, a “hole” is created, and find() might stop prematurely

• Solution: Implement removeElement so that it never empties a location, it just marks the location “FREE”

Example: after removing 5, location 5 is marked FREE rather than EMPTY, so searches for 37, 16 and 15 still probe past it

index:  0      1      2   3      4   5     6   7   8   9      10
A:      EMPTY  EMPTY  13  EMPTY  26  FREE  37  16  15  EMPTY  21
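
A sketch of find and lazy removal under linear probing, assuming non-negative int keys; the markers -1 (EMPTY) and -2 (FREE) and the function names are choices made for this example.

```cpp
#include <vector>

const int EMPTY = -1;  // location has never been used
const int FREE = -2;   // location held an item that was later removed

// Search with linear probing.
// Returns the index of key, or -1 if it is absent.
int findLinear(const std::vector<int>& table, int key) {
    int cap = static_cast<int>(table.size());
    for (int j = 0; j < cap; ++j) {
        int i = ((key % cap) + j) % cap;
        if (table[i] == EMPTY) return -1;  // a truly empty location ends the search
        if (table[i] == key) return i;     // FREE locations are simply skipped
    }
    return -1;
}

// Lazy delete: mark the location FREE instead of EMPTY.
void removeLazy(std::vector<int>& table, int key) {
    int i = findLinear(table, key);
    if (i >= 0) table[i] = FREE;
}
```

An insert routine may reuse FREE locations, but a search must keep going past them.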

Quadratic Probing

Quadratic Probing is another open addressing strategy to deal with collisions. It uses the following formula:

h(k) = ( (k % cap) + j² ) mod cap where j = 0, 1, 2, 3, . . .

Example: {13,26,5,37,16,21,15} with cap = 11

((37 % 11) + 0²) % 11 = 4 //collision
((37 % 11) + 1²) % 11 = 5 //collision
((37 % 11) + 2²) % 11 = 8 //OK

index:  0   1   2   3   4   5   6   7   8   9   10
A:      -   -   13  -   26  5   16  -   37  15  21

(- marks an empty location)
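
The same insertion loop works for quadratic probing once the offset j is squared. As before, this is a sketch with non-negative int keys, a made-up EMPTY marker of -1, and a hypothetical function name.

```cpp
#include <vector>

const int EMPTY = -1;  // hypothetical marker for an unused location

// Insert key with quadratic probing.
// Returns the index used, or -1 if no empty location was found in cap probes.
int insertQuadratic(std::vector<int>& table, int key) {
    int cap = static_cast<int>(table.size());
    for (int j = 0; j < cap; ++j) {
        int i = ((key % cap) + j * j) % cap;  // h(k) = ((k % cap) + j^2) mod cap
        if (table[i] == EMPTY) {
            table[i] = key;
            return i;
        }
    }
    return -1;
}
```

For the example keys, 37 lands at index 8 after two collisions, and 15 lands at index 9 after probing 4, 5, 8 and 2.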

Quadratic Probing Pros and Cons

Advantages

– Avoids the primary clustering that linear probing produces

Disadvantages

– Creates secondary clustering: a different pattern of filled array locations

– If the load factor is 0.5 or more, an empty location may not be found even if one exists

Double Hashing

Double hashing is another alternative to linear probing where, if there’s a collision, a second, different hash function h’’ is used to compute the step between probes

h’(k) = ( h(k) + j * h’’(k) ) mod N where j = 0, 1, 2, 3, . . .

and where h’’(k) = q – ( k mod q )

With h(k) = key % cap, the probed location is

( (key % cap) + (j * ( q – ( key % q ) ) ) ) % cap where j = 0, 1, 2, 3, . . .

Example: {13,26,5,37}
Let q = 7 and cap = 11

j = 0: ( 37 % 11 ) = 4 //collision
j = 1: ( 4 + ( 7 – ( 37 % 7 ) ) ) % 11 = 9 //OK

index:  0   1   2   3   4   5   6   7   8   9   10
A:      -   -   13  -   26  5   -   -   -   37  -

(- marks an empty location)
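
A sketch of insertion with double hashing, again assuming non-negative int keys and -1 as a made-up EMPTY marker; q is passed in as a parameter here purely for illustration.

```cpp
#include <vector>

const int EMPTY = -1;  // hypothetical marker for an unused location

// Insert key with double hashing.
// Returns the index used, or -1 if no empty location was found in cap probes.
int insertDouble(std::vector<int>& table, int key, int q) {
    int cap = static_cast<int>(table.size());
    int step = q - (key % q);  // h''(k), always between 1 and q
    for (int j = 0; j < cap; ++j) {
        int i = ((key % cap) + j * step) % cap;
        if (table[i] == EMPTY) {
            table[i] = key;
            return i;
        }
    }
    return -1;
}
```

Because the step depends on the key, two keys that collide at the same location usually follow different probe sequences afterwards.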

Load Factor

The load factor of a hash table is the ratio of the number of items in the hash table to the number of buckets, and is denoted by λ (lambda)

– Expresses how “full” the hash table has become
– Should always be kept below 0.75
– Example

capacity = 11

items stored = 7

load factor = 7/11 = 0.64

Rehashing

Maximum load factor, based on experimental data:

– 0.5 for open addressing schemes
– 0.9 for separate chaining

If the load factor is above that threshold, then the table should be resized

– New table should be at least double the size of the old table so that the time cost can be amortized

– Hash function should be modified
– Rehash the data: take each item out of the old array and insert it into the new one using the new hash function
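
The load factor check and the rehash step can be sketched as follows. The example assumes an open-addressing table of non-negative int keys with -1 as a made-up EMPTY marker, linear probing, and key mod capacity as the hash function; the function names are also invented here.

```cpp
#include <cstddef>
#include <vector>

const int EMPTY = -1;  // hypothetical marker for an unused location

// Ratio of stored items to the number of locations.
double loadFactor(const std::vector<int>& table) {
    std::size_t used = 0;
    for (std::size_t i = 0; i < table.size(); ++i) {
        if (table[i] != EMPTY) ++used;
    }
    return static_cast<double>(used) / table.size();
}

// Build a larger table and re-insert every key using the new hash function
// (key mod newCap). Assumes newCap exceeds the number of stored items.
std::vector<int> rehash(const std::vector<int>& old, std::size_t newCap) {
    std::vector<int> bigger(newCap, EMPTY);
    for (std::size_t i = 0; i < old.size(); ++i) {
        if (old[i] == EMPTY) continue;
        std::size_t pos = static_cast<std::size_t>(old[i]) % newCap;
        while (bigger[pos] != EMPTY) pos = (pos + 1) % newCap;  // linear probing
        bigger[pos] = old[i];
    }
    return bigger;
}
```

Rehashing the 11-location example table (load factor 7/11) into 23 locations, a prime just over double the old size, brings the load factor down to 7/23, below the 0.5 threshold for open addressing.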