hashing. motivating applications large collection of datasets datasets are dynamic (insert, delete)...
TRANSCRIPT
Hashing
Motivating Applications
• Large collection of datasets
• Datasets are dynamic (insert, delete)
• Goal: efficient searching/insertion/deletion
• Hashing is ONLY applicable for exact-match searching
Direct Address Tables
• If the keys domain is U Create an array T of size U
• For each key K add the object to T[K]
• Supports insertion/deletion/searching in O(1)
Direct Address Tables
Alg.: DIRECT-ADDRESS-SEARCH(T, k)return T[k]
Alg.: DIRECT-ADDRESS-INSERT(T, x)T[key[x]] ← x
Alg.: DIRECT-ADDRESS-DELETE(T, x)T[key[x]] ← NIL
• Running time for these operations: O(1)
Drawbacks>> If U is large, e.g., the domain of integers, then T is large (sometimes infeasible)>> Limited to integer values and does not support duplication
Solution is to use hashing tables
Direct Access Tables: ExampleExample 1:
Example 2:
U is the domain
K is the actual number of keys
Hashing
• A data structure that maps values from a certain domain or range to another domain or range
Hash function3
15
20
55
Domain: String values
Domain: Integer values
Hashing
• A data structure that maps values from a certain domain or range to another domain or range
Hash function
Domain: numbers [950,000 … 960,000]
Student IDs
950000…..
960000Domain: numbers [0 … 10,000]
Range
0…..
10000
Hash Tables
• When K is much smaller than U, a hash table requires much less space than a direct-address table– Can reduce storage requirements to |K|
– Can still get O(1) search time, but on the average case, not the worst case
Hash Tables: Main Idea
• Use a hash function h to compute the slot for each key k
• Store the element in slot h(k)
• Maintain a hash table of size m T [0…m-1]
• A hash function h transforms a key into an index in a hash table T[0…m-1]:
h : U → {0, 1, . . . , m - 1}
• We say that k hashes to slot h(k)
Hash Tables: Main Idea
U(universe of keys)
K(actualkeys)
0
m - 1
h(k3)
h(k2) = h(k5)
h(k1)h(k4)
k1
k4 k2
k5k3
Hash Table (of size m)
>> m is much smaller that U (m <<U)
>> m can be even smaller than |K|
Example
• Back to the example of 100 students, each with 9-digit SSN
• All what we need is a hash table of size 100
What About Collisions
U(universe of keys)
K(actualkeys)
0
m - 1
h(k3)
h(k2) = h(k5)
h(k1)h(k4)
k1k4 k2
k5k3
Collisions!
• Collision means two or more keys will go to the same slot
Handling Collisions
• Many ways to handle it– Chaining
– Open addressing• Linear probing
• Quadratic probing
• Double hashing
Chaining: Main Idea
• Put all elements that hash to the same slot into a linked list (Chain)
• Slot j contains a pointer to the head of the list of all elements that hash to j
Chaining - Discussion• Choosing the size of the hash table– Small enough not to waste space
– Large enough such that lists remain short
– Typically 10% -20% of the total number of elements
• How should we keep the lists: ordered or not?– Usually each list is unsorted linked list
Insertion in Hash TablesAlg.: CHAINED-HASH-INSERT(T, x)
insert x at the head of list T[h(key[x])]
• Worst-case running time is O(1)
• May or may not allow duplication based on
the application
Deletion in Hash Tables
Alg.: CHAINED-HASH-DELETE(T, x)
delete x from the list T[h(key[x])]
• Need to find the element to be deleted.
• Worst-case running time:– Deletion depends on searching the corresponding
list
Searching in Hash Tables
Alg.: CHAINED-HASH-SEARCH(T, k)
search for an element with key k in list T[h(k)]
• Running time is proportional to the length of
the list of elements in slot h(k)
What is the worst case and average case??
Analysis of Hashing with Chaining:Worst Case
• All keys will go to only one chain
• Chain size is O(n)
• Searching is O(n) + time to apply h(k)
0
m - 1
T
chain
Analysis of Hashing with Chaining:Average Case
• With good hash function and uniform distribution of keys– Any given element is equally likely to hash into any of
the m slots
• All chain will have similar sizes
• Assume n (total # of keys), m is the hash table size– Average chain size O (n/m)
0
m - 1
T
chain
chain
chain
chain
Average Search Time O(n/m): The common case
• If m (# of slots) is proportional to n (# of keys):
– m = O(n)
– n/m = O(1)
Searching takes constant time on average
Analysis of Hashing with Chaining:Average Case
Hash Functions
Hash Functions • A hash function transforms a key (k) into a table address (0…
m-1)
• What makes a good hash function?(1) Easy to compute(2) Approximates a random function: for every input, every output is
equally likely (simple uniform hashing)(3) Reduces the number of collisions
Hash Functions• Goal: Map a key k into one of the m slots in the hash table
• Make table size (m) a prime number– Avoids even and power-of-2 numbers
• Common functionh(k) = F(k) mod m
Some function or operation on K (usually generates an integer)
The output of the “mod” is number [0…m-1]
Examples of Hash FunctionsCollection of images
F(k): Sum of the pixels colors
h(k) = F(k) mod m
Collection of strings
F(k): Sum of the ascii values
h(k) = F(k) mod m
Collection of numbersF(k): just return k
h(k) = F(k) mod m