hashing. motivating applications large collection of datasets datasets are dynamic (insert, delete)...

Hashing

Motivating Applications

• Large collection of datasets

• Datasets are dynamic (insert, delete)

• Goal: efficient searching/insertion/deletion

• Hashing is ONLY applicable for exact-match searching

Direct Address Tables

• If the keys domain is U Create an array T of size U

• For each key K add the object to T[K]

• Supports insertion/deletion/searching in O(1)

Direct Address Tables

Alg.: DIRECT-ADDRESS-SEARCH(T, k)return T[k]

Alg.: DIRECT-ADDRESS-INSERT(T, x)T[key[x]] ← x

Alg.: DIRECT-ADDRESS-DELETE(T, x)T[key[x]] ← NIL

• Running time for these operations: O(1)

Drawbacks>> If U is large, e.g., the domain of integers, then T is large (sometimes infeasible)>> Limited to integer values and does not support duplication

Solution is to use hashing tables

Direct Access Tables: ExampleExample 1:

Example 2:

U is the domain

K is the actual number of keys

Hashing

• A data structure that maps values from a certain domain or range to another domain or range

Hash function3

15

20

55

Domain: String values

Domain: Integer values

Hashing

• A data structure that maps values from a certain domain or range to another domain or range

Hash function

Domain: numbers [950,000 … 960,000]

Student IDs

950000…..

960000Domain: numbers [0 … 10,000]

Range

0…..

10000

Hash Tables

• When K is much smaller than U, a hash table requires much less space than a direct-address table– Can reduce storage requirements to |K|

– Can still get O(1) search time, but on the average case, not the worst case

Hash Tables: Main Idea

• Use a hash function h to compute the slot for each key k

• Store the element in slot h(k)

• Maintain a hash table of size m T [0…m-1]

• A hash function h transforms a key into an index in a hash table T[0…m-1]:

h : U → {0, 1, . . . , m - 1}

• We say that k hashes to slot h(k)

Hash Tables: Main Idea

U(universe of keys)

K(actualkeys)

0

m - 1

h(k3)

h(k2) = h(k5)

h(k1)h(k4)

k1

k4 k2

k5k3

Hash Table (of size m)

>> m is much smaller that U (m <<U)

>> m can be even smaller than |K|

Example

• Back to the example of 100 students, each with 9-digit SSN

• All what we need is a hash table of size 100

What About Collisions

U(universe of keys)

K(actualkeys)

0

m - 1

h(k3)

h(k2) = h(k5)

h(k1)h(k4)

k1k4 k2

k5k3

Collisions!

• Collision means two or more keys will go to the same slot

Handling Collisions

• Many ways to handle it– Chaining

– Open addressing• Linear probing

• Quadratic probing

• Double hashing

Chaining: Main Idea

• Put all elements that hash to the same slot into a linked list (Chain)

• Slot j contains a pointer to the head of the list of all elements that hash to j

Chaining - Discussion• Choosing the size of the hash table– Small enough not to waste space

– Large enough such that lists remain short

– Typically 10% -20% of the total number of elements

• How should we keep the lists: ordered or not?– Usually each list is unsorted linked list

Insertion in Hash TablesAlg.: CHAINED-HASH-INSERT(T, x)

insert x at the head of list T[h(key[x])]

• Worst-case running time is O(1)

• May or may not allow duplication based on

the application

Deletion in Hash Tables

Alg.: CHAINED-HASH-DELETE(T, x)

delete x from the list T[h(key[x])]

• Need to find the element to be deleted.

• Worst-case running time:– Deletion depends on searching the corresponding

list

Searching in Hash Tables

Alg.: CHAINED-HASH-SEARCH(T, k)

search for an element with key k in list T[h(k)]

• Running time is proportional to the length of

the list of elements in slot h(k)

What is the worst case and average case??

Analysis of Hashing with Chaining:Worst Case

• All keys will go to only one chain

• Chain size is O(n)

• Searching is O(n) + time to apply h(k)

0

m - 1

T

chain

Analysis of Hashing with Chaining:Average Case

• With good hash function and uniform distribution of keys– Any given element is equally likely to hash into any of

the m slots

• All chain will have similar sizes

• Assume n (total # of keys), m is the hash table size– Average chain size O (n/m)

0

m - 1

T

chain

chain

chain

chain

Average Search Time O(n/m): The common case

• If m (# of slots) is proportional to n (# of keys):

– m = O(n)

– n/m = O(1)

Searching takes constant time on average

Analysis of Hashing with Chaining:Average Case

Hash Functions

Hash Functions • A hash function transforms a key (k) into a table address (0…

m-1)

• What makes a good hash function?(1) Easy to compute(2) Approximates a random function: for every input, every output is

equally likely (simple uniform hashing)(3) Reduces the number of collisions

Hash Functions• Goal: Map a key k into one of the m slots in the hash table

• Make table size (m) a prime number– Avoids even and power-of-2 numbers

• Common functionh(k) = F(k) mod m

Some function or operation on K (usually generates an integer)

The output of the “mod” is number [0…m-1]

Examples of Hash FunctionsCollection of images

F(k): Sum of the pixels colors

h(k) = F(k) mod m

Collection of strings

F(k): Sum of the ascii values

h(k) = F(k) mod m

Collection of numbersF(k): just return k

h(k) = F(k) mod m

hashing. motivating applications large collection of datasets datasets are dynamic (insert, delete)...

Documents

hash tables

hash functions slide

slot hk slide

slot slide

o1 slide

hash function h

range hash function

hash table t0m