
Page 1: Hashing - International University of Sarajevo

Hashing

Page 2: Hashing - International University of Sarajevo

Hash Tables

• We’ll discuss the hash table ADT, which supports only a subset of the operations allowed by binary search trees.

• The implementation of hash tables is called hashing.

• Hashing is a technique used for performing insertions, deletions and finds in constant average time (i.e. O(1)).

• This data structure, however, is not efficient in operations that require any ordering information among the elements, such as findMin, findMax and printing the entire table in sorted order.

Page 3: Hashing - International University of Sarajevo

General Idea

• The ideal hash table structure is merely an array of some fixed size, containing the items.

• A stored item needs to have a data member, called key, that will be used in computing the index value for the item.

– Key could be an integer, a string, etc.

– e.g. a name or Id that is a part of a large employee structure

• The size of the array is TableSize.

• The items that are stored in the hash table are indexed by values from 0 to TableSize – 1.

• Each key is mapped into some number in the range 0 to TableSize – 1.

• The mapping is called a hash function.

Page 4: Hashing - International University of Sarajevo

Hashing

• A hash function maps a value to an index in the table. The function provides access to an element much like an index provides access to an array element.

• Like a binary search tree, a hash table provides an implementation of the Set and Map interfaces.

• A binary search tree can access data stored by value with O(log₂ n) average search time.

• We would like to design a storage structure that yields O(1) average retrieval time. In this way, access to an item is independent of the number of other items in the collection.

Page 5: Hashing - International University of Sarajevo

Introduction to Hashing (continued)

• A hash table is an array of references.

– Associated with the table is a hash function that takes a key as an argument and returns an integer value.

– By using the remainder after dividing the hash value by the table size, we have a mapping of the key to an index in the table.

Page 6: Hashing - International University of Sarajevo

Introduction to Hashing (concluded)

Hash Value: hf(key) = hashValue

HashTable index: hashValue % n

[Figure: a key is passed to hf(), producing hashValue; hashValue % n = i selects the (key, entry) slot i in a table indexed 0 to n-1.]

Page 7: Hashing - International University of Sarajevo

Using a Hash Function

• Consider the hash function hf(x) = x, where x is a nonnegative integer (the identity function). Assume the table is the array tableEntry with n = 7 elements.

hf(22) = 22   22 % 7 = 1   (stored in tableEntry[1])

hf(4) = 4     4 % 7 = 4    (stored in tableEntry[4])

[Figure: tableEntry with slots 0–6; key 22 occupies tableEntry[1] and key 4 occupies tableEntry[4].]
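As a minimal sketch of this computation in Java (assuming the identity hash function hf(x) = x and the 7-element table from this slide; only hf, tableEntry and the two keys come from the slide, the rest is illustrative):

public class IdentityHashDemo {
    static final int TABLE_SIZE = 7;
    static final Integer[] tableEntry = new Integer[TABLE_SIZE];

    // Identity hash function: hf(x) = x for a nonnegative integer x.
    static int hf(int x) {
        return x;
    }

    public static void main(String[] args) {
        // Map each key to a table index with hf(key) % TABLE_SIZE.
        for (int key : new int[] {22, 4}) {
            int index = hf(key) % TABLE_SIZE;   // 22 -> 1, 4 -> 4
            tableEntry[index] = key;
            System.out.println("key " + key + " stored in tableEntry[" + index + "]");
        }
    }
}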

Page 8: Hashing - International University of Sarajevo

Using a Hash Function (concluded)

• With hash function hf() and table size n, the table index for a key is i = hf(key) % n. Collisions occur for any two keys that differ by a multiple of n.

hf(22) = 22   22 % 7 = 1
hf(36) = 36   36 % 7 = 1   (both map to tableEntry[1])

hf(5) = 5     5 % 7 = 5
hf(33) = 33   33 % 7 = 5   (both map to tableEntry[5])

[Figure: tableEntry with slots 0–6; keys 22 and 36 collide at index 1, keys 5 and 33 collide at index 5.]

Page 9: Hashing - International University of Sarajevo

Designing Hash Functions

• Some general design principles guide the creation of all hash functions.

– Evaluating a hash function should be efficient.

– A hash function should produce uniformly distributed hash values. This spreads the hash table indices around the table, which helps minimize collisions.

Page 10: Hashing - International University of Sarajevo

Designing Hash Functions (continued)

• The Java programming language provides a general hashing function with the hashCode() method in the Object superclass.

public int hashCode()
{ … }
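As an illustration, one common way to override hashCode() is to combine the hash codes of a class's fields; the Employee class below is hypothetical (only the hashCode() signature comes from the slide):

import java.util.Objects;

// Hypothetical class, used only to show a hashCode() override.
public class Employee {
    private final String name;
    private final int id;

    public Employee(String name, int id) {
        this.name = name;
        this.id = id;
    }

    @Override
    public int hashCode() {
        // Combine the fields' hash codes (Objects.hash is a standard JDK helper).
        return Objects.hash(name, id);
    }

    @Override
    public boolean equals(Object other) {
        // equals must stay consistent with hashCode.
        if (this == other) return true;
        if (!(other instanceof Employee)) return false;
        Employee e = (Employee) other;
        return id == e.id && Objects.equals(name, e.name);
    }
}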

Page 11: Hashing - International University of Sarajevo

Designing Hash Tables

• When two or more data items hash to the same table index, they cannot occupy the same position in the table.

– We are left with the option of locating one of the items at another position in the table (linear probing) or of redesigning the table to store a sequence of colliding keys at each index (chaining with separate lists).

Page 12: Hashing - International University of Sarajevo

Hashing

• Open-bucket Hash tables store only one element per position

– Collision resolution is required

– Single-step collision resolution potentially produces long contiguous sequences of occupied locations; the problem is not so great if a different step size is applied. Make sure the step size is not an exact divisor of the table size.

• Closed-bucket Hash tables may store one or more elements per position

Page 13: Hashing - International University of Sarajevo

Closed Bucket vs Open Bucket

• Use Closed Bucket for unbounded sets and maps

• Closed Bucket Tables are nearly always preferable, as they use dynamic allocation

• Use Open Bucket Tables where memory is scarce or the number of entries is bounded

Page 14: Hashing - International University of Sarajevo

Hash Tables

• Closed bucket hash table

– Collection of data elements associated with each bucket

• Open bucket hash table

– Single data element stored in each bucket

Page 15: Hashing - International University of Sarajevo

Closed Bucket Hash Table

Hfn(keyval1) → 2
Hfn(keyval2) → n-1
Hfn(keyval3) → 0
Hfn(keyval4) → n-2
Hfn(keyval5) → 2
Hfn(keyval6) → n-3
Hfn(keyval7) → n-1

[Figure: buckets Ht[0] … Ht[n-1]; Ht[0] holds keyval3, Ht[2] holds keyval1 and keyval5, Ht[n-3] holds keyval6, Ht[n-2] holds keyval4, and Ht[n-1] holds keyval2 and keyval7. Colliding keys share a bucket.]

Page 16: Hashing - International University of Sarajevo

Open Bucket Hash Tables

• Each bucket holds at most 1 item

• Implementation as array of element type

String[] hashTable = new String[TABLESIZE];

• Needs collision resolution strategy

– This means that the array must be treated as cyclic

Page 17: Hashing - International University of Sarajevo

Open Bucket Hash Table

[Figure: array Hashtable[0] … Hashtable[n-1], one element per bucket; e.g. Hashtable[0] holds keyvalue3, while keyvalue1, keyvalue2 and keyvalue4 each occupy their own single-element buckets.]

Page 18: Hashing - International University of Sarajevo

Collision Resolution

• If, when an element is inserted, it hashes to the same value as an already inserted element, then we have a collision and need to resolve it.

– E.g. for TableSize = 17, the keys 18 and 35 hash to the same value:

– 18 mod 17 = 1 and 35 mod 17 = 1

• There are several methods for dealing with this:

– Separate chaining

– Open addressing

• Linear Probing

• Quadratic Probing

• Double Hashing

Page 19: Hashing - International University of Sarajevo

Separate Chaining

• The idea is to keep a list of all elements that hash to the same value.

– The array elements are pointers to the first nodes of the lists.

– A new item is inserted to the front of the list.

• Advantages:

– Better space utilization for large items.

– Simple collision handling: searching a linked list.

– Overflow: we can store more items than the hash table size.

– Deletion is quick and easy: deletion from the linked list.

Page 20: Hashing - International University of Sarajevo

Separate Chaining

• Each hash table cell holds a pointer to a linked list of records with the same hash value (i, j, k in the figure)

• Collision: insert the item into the linked list

• To find an item: compute the hash value, then do a Find on the linked list

• Can use a linked list for Find/Insert/Delete in the linked list

• Can also use BSTs: O(log N) time instead of O(N). But lists are usually small – not worth the overhead of BSTs

[Figure: table slots point to linked lists, and each slot records the number of keys occupying it; empty slots are nil. Hash(k1) = i, so slot i holds k1; Hash(k2) = Hash(k3) = Hash(k4) = j, so slot j holds the list k2, k3, k4; Hash(k5) = Hash(k6) = k, so slot k holds k5, k6.]
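A minimal separate-chaining sketch in Java, assuming integer keys and the front-of-list insertion described above; the class and method names are illustrative, not from the slides:

import java.util.LinkedList;

// Illustrative separate-chaining table for integer keys.
public class ChainedHashTable {
    private final LinkedList<Integer>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int tableSize) {
        buckets = new LinkedList[tableSize];
        for (int i = 0; i < tableSize; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    private int hash(int key) {
        return Math.floorMod(key, buckets.length);
    }

    public void insert(int key) {
        // New items go to the front of the bucket's list.
        if (!contains(key)) {
            buckets[hash(key)].addFirst(key);
        }
    }

    public boolean contains(int key) {
        // Compute the hash value, then search only that bucket's list.
        return buckets[hash(key)].contains(key);
    }

    public void delete(int key) {
        // Deletion is just removal from the linked list.
        buckets[hash(key)].remove(Integer.valueOf(key));
    }
}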

Page 21: Hashing - International University of Sarajevo

Example

Keys: 0, 1, 4, 9, 16, 25, 36, 49, 64, 81

hash(key) = key % 10

[Figure: chained table with buckets 0–9. Bucket 0: 0; bucket 1: 81, 1; bucket 4: 64, 4; bucket 5: 25; bucket 6: 36, 16; bucket 9: 49, 9; buckets 2, 3, 7 and 8 are empty.]
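A small self-contained check of this example (it simply reproduces the bucket contents shown above, inserting at the front of each list):

import java.util.LinkedList;

public class SquaresExample {
    public static void main(String[] args) {
        int tableSize = 10;
        LinkedList<Integer>[] buckets = new LinkedList[tableSize];
        for (int i = 0; i < tableSize; i++) buckets[i] = new LinkedList<>();

        // Keys hashed with key % 10, each inserted at the front of its bucket.
        for (int key : new int[] {0, 1, 4, 9, 16, 25, 36, 49, 64, 81}) {
            buckets[key % 10].addFirst(key);
        }
        for (int i = 0; i < tableSize; i++) {
            System.out.println(i + ": " + buckets[i]);
        }
        // Expected: 0:[0], 1:[81, 1], 4:[64, 4], 5:[25], 6:[36, 16], 9:[49, 9].
    }
}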

Page 22: Hashing - International University of Sarajevo

Chaining with Separate Lists

• Chaining with separate lists defines the hash table as an indexed sequence of linked lists. Each list, called a bucket, holds a set of items that hash to the same table location.

Page 23: Hashing - International University of Sarajevo

Chaining with Separate Lists (concluded)

• Chaining with separate lists is generally faster than linear probing, since chaining only searches items that hash to the same table location.

• With linear probing, the number of table entries is limited to the table size, whereas the linked lists used in chaining grow as necessary.

• To delete an element, just erase it from the associated list.

Page 24: Hashing - International University of Sarajevo

Collision Resolution by Open Addressing

• Linked lists can take up a lot of space…

• Open addressing (or probing): when a collision occurs, try alternative cells in the array until an empty cell is found

• Given an item X, try cells h_0(X), h_1(X), h_2(X), …, h_i(X)

• h_i(X) = (Hash(X) + F(i)) mod TableSize

• Define F(0) = 0

• F is the collision resolution function. Three possibilities:

– Linear: F(i) = i

– Quadratic: F(i) = i^2

– Double Hashing: F(i) = i*Hash2(X)
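A hedged sketch of these formulas in Java; Hash and Hash2 below are simple placeholder functions for integer keys, not a specific library API:

// Illustrative probe-index computations for open addressing.
public class ProbeFunctions {
    // Placeholder primary and secondary hash functions for integer keys.
    static int hash(int x, int tableSize) { return Math.floorMod(x, tableSize); }
    static int hash2(int x, int r)        { return r - Math.floorMod(x, r); }

    // h_i(X) = (Hash(X) + F(i)) mod TableSize
    static int linearProbe(int x, int i, int tableSize) {
        return (hash(x, tableSize) + i) % tableSize;               // F(i) = i
    }
    static int quadraticProbe(int x, int i, int tableSize) {
        return (hash(x, tableSize) + i * i) % tableSize;           // F(i) = i^2
    }
    static int doubleHashProbe(int x, int i, int tableSize, int r) {
        return (hash(x, tableSize) + i * hash2(x, r)) % tableSize; // F(i) = i*Hash2(X)
    }
}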

Page 25: Hashing - International University of Sarajevo

Open Addressing I: Linear Probing

• Main Idea: when a collision occurs, scan down the array one cell at a time looking for an empty cell

• h_i(X) = (Hash(X) + i) mod TableSize (i = 0, 1, 2, …)

• Compute the hash value and increment until a free cell is found

• In-Class Example: insert {18, 19, 20, 29, 30, 31} into an empty hash table with TableSize = 10 using:

– (a) separate chaining

– (b) linear probing (see the sketch below)
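A minimal linear-probing sketch for part (b) of the in-class example; the class and field names are illustrative, and the expected final layout is noted in comments:

// Illustrative linear-probing table for integer keys.
public class LinearProbingTable {
    private final Integer[] table;

    public LinearProbingTable(int tableSize) {
        table = new Integer[tableSize];
    }

    private int hash(int key) {
        return Math.floorMod(key, table.length);
    }

    public void insert(int key) {
        // h_i(key) = (hash(key) + i) mod TableSize, i = 0, 1, 2, ...
        for (int i = 0; i < table.length; i++) {
            int index = (hash(key) + i) % table.length;
            if (table[index] == null || table[index] == key) {
                table[index] = key;
                return;
            }
        }
        throw new IllegalStateException("table is full");
    }

    public static void main(String[] args) {
        LinearProbingTable t = new LinearProbingTable(10);
        for (int key : new int[] {18, 19, 20, 29, 30, 31}) {
            t.insert(key);
        }
        // 18 -> 8, 19 -> 9, 20 -> 0; 29 probes 9, 0 and lands in 1;
        // 30 probes 0, 1 and lands in 2; 31 probes 1, 2 and lands in 3.
        for (int i = 0; i < t.table.length; i++) {
            System.out.println(i + ": " + t.table[i]);
        }
    }
}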

Page 26: Hashing - International University of Sarajevo

Open Addressing II: Quadratic Probing

• Main Idea: spread out the search for an empty slot by incrementing by i^2 instead of i

• h_i(X) = (Hash(X) + i^2) mod TableSize (i = 0, 1, 2, …)

– No primary clustering, but secondary clustering is possible

• Example 1: insert {18, 19, 20, 29, 30, 31} into an empty hash table with TableSize = 10

• Example 2: insert {1, 2, 5, 10, 17} with TableSize = 16 (see the sketch below)
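A small sketch for Example 2, assuming Hash(X) = X mod TableSize: with the non-prime TableSize of 16, the quadratic probe sequence for key 17 keeps revisiting the same few slots (1, 2, 5, 10), which are exactly the slots already filled by keys 1, 2, 5 and 10:

// Quadratic probe sequence for key 17 with TableSize = 16.
public class QuadraticProbeDemo {
    public static void main(String[] args) {
        int tableSize = 16;
        int key = 17;                        // Hash(17) = 17 % 16 = 1, colliding with key 1
        for (int i = 0; i < 8; i++) {
            int index = (key % tableSize + i * i) % tableSize;
            System.out.print(index + " ");   // prints 1 2 5 10 1 10 5 2
        }
        System.out.println();
    }
}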

Page 27: Hashing - International University of Sarajevo

Open Addressing III: Double Hashing

• Idea: Spread out the search for an empty slot by using a second hash function

– No primary or secondary clustering

• h_i(X) = (Hash(X) + i*Hash2(X)) mod TableSize, for i = 0, 1, 2, …

• E.g. Hash2(X) = R – (X mod R)

– R is a prime smaller than TableSize


• Try this example: insert {18, 19, 20, 29, 30, 31} into an empty hash table with TableSize = 10 and R = 7 (see the sketch below)

• No clustering but slower than quadratic probing due to Hash2
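A small sketch for this example, assuming Hash(X) = X mod TableSize and the Hash2 given above with R = 7; it prints the first few probe indices for each key:

// Double-hashing probe sequences for TableSize = 10 and R = 7.
public class DoubleHashDemo {
    public static void main(String[] args) {
        int tableSize = 10, r = 7;
        for (int key : new int[] {18, 19, 20, 29, 30, 31}) {
            int step = r - (key % r);        // Hash2(X) = R - (X mod R), always in 1..R
            StringBuilder probes = new StringBuilder();
            for (int i = 0; i < 4; i++) {
                // h_i(X) = (Hash(X) + i * Hash2(X)) mod TableSize
                probes.append((key % tableSize + i * step) % tableSize).append(' ');
            }
            System.out.println(key + " (step " + step + "): " + probes);
        }
    }
}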

Page 28: Hashing - International University of Sarajevo

Hash Table Performance

• The worst case for linear probing or chaining with separate lists occurs when all data items hash to the same table location. If the table contains n elements, the search time is O(n), no better than that of sequential search.

Page 29: Hashing - International University of Sarajevo

Hash Table Performance

• A good hash function provides a uniform distribution of hash values.

• Hash table performance is measured by using the load factor λ = n/m, where n is the number of elements in the hash table and m is the number of buckets.

– For linear probing, 0 ≤ λ ≤ 1.

– For chaining with separate lists, it is possible that λ > 1.

Page 30: Hashing - International University of Sarajevo

Load Factor

• The load factor is the number of elements in a table divided by the table size.

• In a closed bucket hash table, too high a load factor means buckets are likely to have more than 2 entries.

• Too low a load factor means high space consumption.

• Choose a prime number of buckets to avoid problems with patterns in keys.

Page 31: Hashing - International University of Sarajevo

Hash Table Performance

(continued)

• Assume that the hash function uniformly distributes indices around the hash table.

– We can expect λ = n/m elements in each bucket.

• On average, an unsuccessful search makes λ comparisons before arriving at the end of a list and returning failure.

• Mathematical analysis shows that the average number of probes for a successful search is approximately 1 + λ/2.
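A tiny numeric illustration of these formulas for chaining with separate lists (the values of n and m are made up for the example):

// Load factor and the average search costs quoted on this slide.
public class LoadFactorDemo {
    public static void main(String[] args) {
        int n = 45;                 // number of stored elements (illustrative)
        int m = 30;                 // number of buckets (illustrative)
        double lambda = (double) n / m;                                      // 1.5
        System.out.println("load factor lambda      = " + lambda);
        System.out.println("avg unsuccessful search ~ " + lambda);           // lambda comparisons
        System.out.println("avg successful search   ~ " + (1 + lambda / 2)); // 1 + lambda/2
    }
}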

Page 32: Hashing - International University of Sarajevo

Hash Tables vs Search Trees

• Hash Tables are good if you only want to perform Insert/Delete/Find

• Hash Tables are not good if you would also like to get a sorted ordering of the keys

– Keys are stored arbitrarily in the Hash Table

• Hash Tables use more space than search trees

– Our rule of thumb: time versus space tradeoff

• Search Trees support sort/predecessor/successor/min/max operations, which cannot be supported by a Hash Table

Page 33: Hashing - International University of Sarajevo

Summary

• A hash table is based on an array.

• The range of key values is usually greater than the size of the array.

• A key value is hashed to an array index by a hash function.

• An English-language dictionary is a typical example of a database that can be efficiently handled with a hash table.

• The hashing of a key to an already filled array cell is called a collision.

• Collisions can be handled in two major ways: open addressing and separate chaining.

• In open addressing, data items that hash to a full array cell are placed in another cell in the array.

• In separate chaining, each array element consists of a linked list. All data items hashing to a given array index are inserted in that list.

• We discussed three kinds of open addressing: linear probing, quadratic probing, and double hashing.

• In linear probing the step size is always 1, so if x is the array index calculated by the hash function, the probe goes to x, x+1, x+2, x+3, and so on.

• The number of such steps required to find a specified item is called the probe length.

• In linear probing, contiguous sequences of filled cells appear. These are called primary clusters, and they reduce performance.

Page 34: Hashing - International University of Sarajevo

• In quadratic probing the offset from x is the square of the step number, so the probe goes to x, x+1, x+4, x+9, x+16, and so on.

• Quadratic probing eliminates primary clustering, but suffers from the less severe secondary clustering.

• Secondary clustering occurs because all the keys that hash to the same value follow the same sequence of steps during a probe.

• All keys that hash to the same value follow the same probe sequence because the step size does not depend on the key, but only on the hash value.

• In double hashing the step size depends on the key, and is obtained from a secondary hash function.

• If the secondary hash function returns a value s in double hashing, the probe goes to x, x+s, x+2s, x+3s, x+4s, and so on, where s depends on the key, but remains constant during the probe.

• The load factor is the ratio of data items in a hash table to the array size.

• The maximum load factor in open addressing should be around 0.5. For double hashing at this load factor, searches will have an average probe length of 2.

Page 35: Hashing - International University of Sarajevo

• Search times go to infinity as load factors approach 1.0 in open addressing.

• It's crucial that an open-addressing hash table does not become too full.

• A load factor of 1.0 is appropriate for separate chaining.

• At this load factor a successful search has an average probe length of 1.5, and an unsuccessful search, 2.0.

• Probe lengths in separate chaining increase linearly with load factor.

• A string can be hashed by multiplying each character by a different power of a constant, adding the products, and using the modulo (%) operator to reduce the result to the size of the hash table (see the sketch after this summary).

• To avoid overflow, the modulo operator can be applied at each step in the process, if the polynomial is expressed using Horner's method.

• Hash table sizes should generally be prime numbers. This is especially important in quadratic probing and separate chaining.

• Hash tables can be used for external storage. One way to do this is to have the elements in the hash table contain disk-file block numbers.
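A minimal sketch of the string-hashing scheme described above, with the modulo applied at each step of Horner's method; the multiplier 37 and the prime table size 101 are illustrative choices, not from the slides:

// Polynomial string hash evaluated with Horner's method.
public class StringHash {
    static int hash(String key, int tableSize) {
        int hashValue = 0;
        for (int i = 0; i < key.length(); i++) {
            // Apply % at each step so the intermediate value never overflows.
            hashValue = (hashValue * 37 + key.charAt(i)) % tableSize;
        }
        return hashValue;
    }

    public static void main(String[] args) {
        System.out.println(hash("example", 101));
    }
}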