![Page 1: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/1.jpg)
Data Structures and Algorithms
Course slides: Hashing
www.mif.vu.lt/~algis
Data Structures for Sets
Many applications deal with sets.
Compilers have symbol tables (sets of variables, classes).
A dictionary is a set of words.
Routers have sets of forwarding rules.
Web servers have sets of clients, etc.
A set is a collection of members
No repetition of members
Members themselves can be sets
Data Structures for Sets
Examples
Set of first 5 natural numbers: {1,2,3,4,5}
{x | x is a positive integer and x < 100}
{x | x is a CA driver with > 10 years of driving experience and 0 accidents in the last 3 years}
Set Operations
Unary operations: min, max, sort, makenull, …
Binary operations, by operand types:
Member and Member: order (=, <, >)
Member and Set: find, insert, delete, split, …
Set and Member: find, insert, delete, split, …
Set and Set: union, intersection, difference, equal, …
Observations
Set + Operations define an ADT.
A set + insert, delete, find
A set + ordering
Multiple sets + union, insert, delete
Multiple sets + merge
Etc.
Depending on type of members and choice of operations, different implementations can have different asymptotic complexity.
Dictionary ADTs
Maintain a set of items with distinct keys with:
find (k): find item with key k
insert (x): insert item x into the dictionary
remove (k): delete item with key k
Where do we use them:
Symbol tables for compilers
Customer records (access by name)
Games (positions, configurations)
Spell checkers
Peer to Peer systems (access songs by name), etc.
Naïve Implementations
The simplest possible scheme to implement a dictionary is a “log file” or “audit trail”.
Maintain the elements in a linked list, with insertions occurring at the head.
The search and delete operations require searching the entire list in the worst case.
Insertion is O(1), but find and delete are O(n).
A sorted array does not solve the problem, even with ordered keys: search becomes fast, O(log n) with binary search, but insert and delete still take O(n).
Hash Tables: Intuition
Hashing uses a function that maps each key to a location in memory.
A key’s location does not depend on other elements, and does not change after insertion (unlike in a sorted list).
A good hash function should be easy to compute.
With such a hash function, the dictionary operations can be implemented in O(1) expected time.
Hash Tables: Intuition
Let us denote the set of all possible key values (i.e., the universe of keys) used in a dictionary application by U. Suppose an application requires a dictionary in which elements are assigned keys from the set of small natural numbers. That is, U ⊂ Z+ and |U| is relatively small.
If no two elements have the same key, then this dictionary can be implemented by storing its elements in the array T[0, ..., |U| - 1]. This implementation is referred to as a direct-access table, since each of the requisite DICTIONARY ADT operations (Search, Insert, and Delete) can always be performed in Θ(1) time by using a given key value to index directly into T.
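A direct-access table can be sketched in C as follows. This is only a minimal illustration: the bound `UNIVERSE_SIZE`, the `struct element` layout, and the function names `dat_insert`, `dat_search`, and `dat_delete` are assumptions made for the example, not part of the slides.

```c
#include <stddef.h>

#define UNIVERSE_SIZE 100   /* assumed |U|: keys are 0 .. UNIVERSE_SIZE-1 */

struct element {
    int key;     /* assumed to lie in 0 .. UNIVERSE_SIZE-1 */
    int value;   /* hypothetical payload */
};

/* T[k] points to the element with key k, or is NULL if there is none. */
struct element *T[UNIVERSE_SIZE];

/* Each operation is a single array access, hence Theta(1). */
void dat_insert(struct element *x)  { T[x->key] = x; }
struct element *dat_search(int key) { return T[key]; }
void dat_delete(int key)            { T[key] = NULL; }
```

The key itself is the array index, so no searching is involved; the cost is one slot per possible key whether or not it is used.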
Hash Tables: Intuition
The obvious shortcoming associated with direct-access tables is that the set U rarely has such "nice" properties. In practice, |U| can be quite large. This will lead to wasted memory if the number of elements actually stored in the table is small relative to |U|.
Furthermore, it may be difficult to ensure that all keys are unique. Finally, a specific application may require that the key values be real numbers, or some symbols which cannot be used directly to index into the table.
An effective alternative to direct-access tables is the hash table. A hash table is a sequentially mapped data structure that is similar to a direct-access table in that both attempt to make use of the random-access capability afforded by sequential mapping.
Hash Tables
All search structures so far
Relied on a comparison operation
Performance O(n) or O(log n)
Assume I have a function
f(key) → integer, i.e. one that maps a key to an integer
What performance might I expect now?
Hash Tables - Structure
Simplest case:
Assume items have integer keys in the range 1 .. m
Use the value of the key itself to select a slot in a direct access table in which to store the item
To search for an item with key, k, just look in slot k
If there’s an item there, you’ve found it
If the tag is 0, it’s missing.
Constant time, O(1)
Hashing: the basic idea
Map key values to hash table addresses
keys -> hash table address
This applies to find, insert, and remove
Usually: integers -> {0, 1, 2, …, Hsize-1}
Typical example: f(n) = n mod Hsize
Non-numeric keys converted to numbers
For example, strings converted to numbers as:
Sum of ASCII values
First three characters
Hashing: the basic idea
[Figure: student records stored in a hash table indexed by Perm # (mod 9); example values 9, 10, 20, 39, 4, 14, 8]
Hash Tables - Choosing the Hash Function
Uniform hashing
Ideal hash function
P(k) = probability that a key, k, occurs
If there are m slots in our hash table, a uniform hashing function, h(k), would ensure:
$$\sum_{k \,\mid\, h(k)=0} P(k) \;=\; \sum_{k \,\mid\, h(k)=1} P(k) \;=\; \cdots \;=\; \sum_{k \,\mid\, h(k)=m-1} P(k) \;=\; \frac{1}{m}$$
(Read the first sum as: sum over all k such that h(k) = 0)
or, in plain English, the number of keys that map to each slot is equal
Hash Tables - A Uniform Hash Function
If the keys are integers randomly distributed in [ 0, r ) (read as 0 ≤ k < r), then
$$h(k) = \left\lfloor \frac{m k}{r} \right\rfloor$$
is a uniform hash function
Most hashing functions can be made to map the keys to [ 0, r ) for some r, e.g. adding the ASCII codes for characters mod 256 will give values in [ 0, 256 ) or [ 0, 255 ]
Replace + by xor: same range without the mod operation
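The scaling function h(k) = ⌊mk/r⌋ can be sketched in C as below. The function name `uniform_hash` and the choice of 32-bit keys are assumptions for the example; the 64-bit intermediate is there only so that the product m·k cannot overflow.

```c
#include <stdint.h>

/* h(k) = floor(m * k / r) for a key k in [0, r); the result lies in [0, m).
 * Assumes r > 0 and k < r. */
uint32_t uniform_hash(uint32_t k, uint32_t r, uint32_t m) {
    /* widen to 64 bits before multiplying so m * k cannot overflow */
    return (uint32_t)(((uint64_t)m * k) / r);
}
```

For instance, with r = 100 and m = 10 the keys 0..9 go to slot 0, 10..19 to slot 1, and so on, so uniformly distributed keys give uniformly loaded slots.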
Hash Tables - Reducing the range to [ 0, m )
We’ve mapped the keys to a range of integers 0 ≤ k < r
Now we must reduce this range to [ 0, m )
where m is a reasonable size for the hash table
Strategies
Division - use a mod function
Multiplication
Universal hashing
Hash Tables - Reducing the range to [ 0, m )
Division
Use a mod function
h(k) = k mod m
Choice of m?
Powers of 2 are generally not good!
h(k) = k mod 2^n selects last n bits of k
All combinations are not generally equally likely
Prime numbers close to 2^n seem to be good choices
e.g. want ~4000 entry table, choose m = 4093
Example: for k = 0110010111000011010, k mod 2^8 selects the last 8 bits, 00011010
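The point about powers of 2 can be checked with a short C sketch. The names `hash_div` and `low_bits` are made up for this example; the second function only shows that reducing modulo 2^n is the same as masking off the low n bits.

```c
#include <stdint.h>

/* Division method: h(k) = k mod m. */
uint32_t hash_div(uint32_t k, uint32_t m) {
    return k % m;
}

/* The low n bits of k (assumes 1 <= n <= 31).
 * This is exactly k mod 2^n, so the high bits of k never affect the slot. */
uint32_t low_bits(uint32_t k, unsigned n) {
    return k & ((UINT32_C(1) << n) - 1);
}
```

With m = 256, `hash_div(k, 256)` and `low_bits(k, 8)` always agree, which is why keys that differ only in their high bits all collide; a prime such as 4093 mixes all bits of k into the remainder.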
Hash Tables - Reducing the range to [ 0, m )
Multiplication method
Multiply the key by a constant, A, 0 < A < 1
Extract the fractional part of the product, kA - ⌊kA⌋
Multiply this by m: h(k) = ⌊m (kA - ⌊kA⌋)⌋
Now m is not critical and a power of 2 can be chosen
So this procedure is fast on a typical digital computer
Set m = 2^p
Multiply k (w bits) by ⌊A·2^w⌋ ⇒ 2w-bit product
Extract p most significant bits of lower half
A = ½(√5 - 1) seems to be a good choice (see Knuth)
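A possible C rendering of the fast variant, assuming a word size of w = 32 bits. The name `hash_mul` and the macro `HASH_MULT` are illustrative; the constant is ⌊A·2^32⌋ for A = ½(√5 - 1).

```c
#include <stdint.h>

/* floor(A * 2^32) for A = (sqrt(5) - 1) / 2, i.e. 0x9E3779B9 */
#define HASH_MULT UINT32_C(2654435769)

/* Multiplication method with m = 2^p slots; assumes 1 <= p <= 32.
 * Unsigned 32-bit multiplication wraps around, which keeps exactly the
 * lower half of the 64-bit product; the top p bits of that half are the
 * hash value. */
uint32_t hash_mul(uint32_t k, unsigned p) {
    uint32_t low = k * HASH_MULT;   /* lower 32 bits of the product */
    return low >> (32 - p);         /* p most significant bits of the lower half */
}
```

No division is needed, only one multiply and one shift, which is why the table size can conveniently be a power of 2 here.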
Hash Tables - Reducing the range to [ 0, m )
Universal Hashing
A determined “adversary” can always find a set of data that will defeat any hash function
Hash all keys to same slot ⇒ O(n) search
Select the hash function randomly (at run time) from a set of hash functions
Reduced probability of poor performance
Set of functions, H, which map keys to [ 0, m )
H is universal if, for each pair of distinct keys, x and y, the number of functions, h ∈ H, for which h(x) = h(y) is at most |H|/m
Functions are selected at run time
Each run can give different results
Even with the same data
Good average performance obtainable
Hash Tables - Reducing the range to [ 0, m )
Universal Hashing
Can we design a set of universal hash functions?
Quite easily
Key, x = <x0, x1, x2, ...., xr>, with m prime and each xi < m
Choose a = <a0, a1, a2, ...., ar>
a is a sequence of elements chosen randomly from { 0, 1, ..., m-1 }
$$h_a(x) = \left( \sum_{i=0}^{r} a_i x_i \right) \bmod m$$
There are m^(r+1) sequences a, so there are m^(r+1) functions, h_a(x)
Theorem
The h_a form a set of universal hash functions
Proof: See Cormen
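The construction above can be sketched in C as follows. It is a rough illustration under stated assumptions: the key is taken to be a sequence of bytes, m is assumed to be a prime larger than any byte value (for example 257), and the names `choose_coefficients` and `universal_hash` are hypothetical. The standard `rand()` is used only for brevity; its quality and the slight modulo bias are ignored here.

```c
#include <stddef.h>
#include <stdlib.h>

/* Pick the random sequence a = <a0, ..., a(len-1)>, each in {0, ..., m-1}.
 * Call once at run time, before any key is hashed. */
void choose_coefficients(unsigned *a, size_t len, unsigned m) {
    for (size_t i = 0; i < len; i++)
        a[i] = (unsigned)rand() % m;   /* slight modulo bias ignored in this sketch */
}

/* h_a(x) = (sum of a[i] * x[i]) mod m, for a key x made of len bytes. */
unsigned universal_hash(const unsigned char *x, const unsigned *a,
                        size_t len, unsigned m) {
    unsigned long long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (sum + (unsigned long long)a[i] * x[i]) % m;  /* reduce as we go */
    return (unsigned)sum;
}
```

Because `a` is chosen when the program starts, two runs over the same data may place keys in different slots, which is what prevents an adversary from preparing a bad input in advance.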
Hash Tables - Constraints
Constraints
Keys must be unique
Keys must lie in a small range
For storage efficiency, keys must be dense in the range
If they’re sparse (lots of gaps between values), a lot of space is used to obtain speed
Space for speed trade-off
Hash Tables - Relaxing the constraints
Keys must be unique
Construct a linked list of duplicates “attached” to each slot
If a search can be satisfied by any item with key, k, performance is still O(1)
but
If the item has some other distinguishing feature which must be matched, we get O(n_max)
where n_max is the largest number of duplicates - or length of the longest chain
Hash Tables - Relaxing the constraints
Keys are integers
Need a hash function h( key ) → integer
i.e. one that maps a key to an integer
Applying this function to the key produces an address
If h maps each key to a unique integer in the range 0 .. m-1 then search is O(1)
Hash Tables - Hash functions
Form of the hash function
Example - using an n-character key
    /* Sum the character codes of the n-character key s, reduced to 0 .. 255 */
    int hash( char *s, int n ) {
        int sum = 0;
        while( n-- ) sum = sum + *s++;   /* add each character in turn */
        return sum % 256;
    }
returns a value in 0 .. 255
xor function is also commonly used
    sum = sum ^ *s++;
But any function that generates integers in 0..m-1 for some suitable (not too large) m will do
As long as the hash function itself is O(1)!
Hash Tables - Collisions
Hash function
With this hash function
    int hash( char *s, int n ) {
        int sum = 0;
        while( n-- ) sum = sum + *s++;
        return sum % 256;
    }
hash( “AB”, 2 ) and hash( “BA”, 2 ) return the same value!
This is called a collision
A variety of techniques are used for resolving collisions
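The collision between "AB" and "BA" can be reproduced directly. In the sketch below, `hash_sum` is the additive hash from the slide (with a cast so that character codes are treated as non-negative), while `hash_weighted` is a hypothetical position-sensitive variant added only for contrast; the multiplier 31 is an arbitrary illustrative choice.

```c
/* Additive hash as on the slide: the order of the characters is ignored. */
int hash_sum(const char *s, int n) {
    int sum = 0;
    while (n--) sum += (unsigned char)*s++;
    return sum % 256;
}

/* Hypothetical variant: each step multiplies the running value, so the
 * position of a character influences the result. */
int hash_weighted(const char *s, int n) {
    unsigned h = 0;
    while (n--) h = h * 31u + (unsigned char)*s++;
    return (int)(h % 256u);
}
```

`hash_sum` gives 131 for both "AB" and "BA" since 65 + 66 = 66 + 65, whereas `hash_weighted` separates this particular pair. Collisions remain unavoidable in general, because many keys are mapped onto only 256 values.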
Hash Tables - Collision handling
Collisions
Occur when the hash function maps two different keys to the same address
The table must be able to recognise and resolve this
Recognise
Store the actual key with the item in the hash table
Compute the address: k = h( key )
Check for a hit: if ( table[k].key == key ) then hit, else try next entry
Resolution
Variety of techniques
Hash Tables - Linked lists
Collisions - Resolution
Linked list attached to each primary table slot
h(i) == h(i1)
h(k) == h(k1) == h(k2)
Searching for i1
Calculate h(i1)
Item in table, i, doesn’t match
Follow linked list to i1
If NULL found, key isn’t in table
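A minimal sketch of this linked-list scheme in C, under some simplifying assumptions: keys are non-negative integers, the table has 11 slots, h(k) = k mod 11, and each slot holds only a pointer to the head of its list (rather than storing the first item in the table itself). All names are illustrative.

```c
#include <stdlib.h>

#define NSLOTS 11   /* assumed table size */

struct node {
    int key;
    struct node *next;
};

/* One list head per primary slot; NULL means the slot is empty. */
struct node *chain_table[NSLOTS];

unsigned slot_of(int key) { return (unsigned)key % NSLOTS; }

/* Insert at the head of the slot's list. Returns 1 on success, 0 if out of memory. */
int chain_insert(int key) {
    struct node *n = malloc(sizeof *n);
    if (n == NULL) return 0;
    unsigned s = slot_of(key);
    n->key = key;
    n->next = chain_table[s];
    chain_table[s] = n;
    return 1;
}

/* Follow the list from the slot; reaching NULL means the key is not in the table. */
int chain_find(int key) {
    for (struct node *p = chain_table[slot_of(key)]; p != NULL; p = p->next)
        if (p->key == key) return 1;
    return 0;
}
```

Colliding keys simply share a list, so the table never fills up; the price is one pointer per item and a list walk on each search.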
Hash Tables - Overflow area
Overflow area
Linked list constructed in special area of table called overflow area
h(k) == h(j)
k stored first
Adding j
Calculate h(j)
Find k
Get first slot in overflow area
Put j in it
k’s pointer points to this slot
Searching - same as linked list
Hash Tables - Re-hashing
Use a second hash function, h’(x)
Many variations
General term: re-hashing
h(k) == h(j)
k stored first
Adding j
Calculate h(j)
Find k
Repeat until we find an empty slot: calculate h’(j)
Put j in it
Searching - Use h(x), then h’(x)
Hash Tables - Re-hash functions
The re-hash function
Many variations
Linear probing
h’(x) is +1
Go to the next slot until you find one empty
Can lead to bad clustering
Re-hash keys fill in gaps between other keys and exacerbate the collision problem
Hash Tables - Re-hash functions
The re-hash function
Many variations
Quadratic probing
h’(x) is c·i² on the ith probe
Avoids primary clustering
Secondary clustering occurs
All keys which collide on h(x) follow the same sequence
First a = h(j) = h(k)
Then a + c, a + 4c, a + 9c, ....
Secondary clustering generally less of a problem
Hash Tables - Collision Resolution Summary
Chaining
+ Unlimited number of elements
+ Unlimited number of collisions
- Overhead of multiple linked lists
Re-hashing
+ Fast re-hashing
+ Fast access through use of main table space
- Maximum number of elements must be known
- Multiple collisions become probable
Overflow area
+ Fast access
+ Collisions don't use primary table space
- Two parameters which govern performance need to be estimated
Hash Tables - Summary so far ...
Potential O(1) search time
If a suitable function h(key) → integer can be found
Space for speed trade-off
“Full” hash tables don’t work (more later!)
Collisions
Inevitable
Hash function reduces amount of information in key
Various resolution strategies
Linked lists
Overflow areas
Re-hash functions
Linear probing: h’ is +1
Quadratic probing: h’ is +c·i²
Any other hash function! or even sequence of functions!
Hashing:
Choose a hash function h; it also determines the hash table size.
Given an item x with key k, put x at location h(k).
To find if x is in the set, check location h(k).
What should we do if more than one key hashes to the same value? This is called a collision.
We will discuss two methods to handle collisions:
Separate chaining
Open addressing
Separate chaining
Maintain a list of all elements that hash to the same value
Search -- use the hash function to determine which list to traverse
Insert/delete -- once the “bucket” is found through Hash, insert and delete are list operations
    class HashTable {
        ……
    private:
        unsigned int Hsize;      // number of buckets
        List<E,K> *TheList;      // array of Hsize lists, one per bucket
        ……
    };

    find(k, e)
        HashVal = Hash(k, Hsize);    // pick the bucket
        if (TheList[HashVal].Search(k, e)) then return true;
        else return false;
[Figure: hash table with slots 0-10 and chained keys 14, 42, 29, 20, 1, 36, 56, 23, 16, 24, 31, 17, 7]
Insertion: insert 53
53 = 4 x 11 + 9, so 53 mod 11 = 9
[Figure: the chained hash table (slots 0-10) before and after the insertion; 53 is added to the list in slot 9, which already holds 31]
Analysis of Hashing with Chaining
Worst case
All keys hash into the same bucket
a single linked list.
insert, delete, find take O(n) time.
Average case
Keys are uniformly distributed into buckets
O(1+N/B): N is the number of elements in a hash table, B is the number of buckets.
If N = O(B), then O(1) time per operation.
N/B is called the load factor of the hash table.
Open addressing
If collision happens, alternative cells are tried until an empty cell is found.
Linear probing: try the next available position
[Figure: hash table with slots 0-10 holding the keys 42, 9, 14, 1, 16, 24, 31, 28, 7]
Linear Probing (insert 12)
12 = 1 x 11 + 1, so 12 mod 11 = 1
[Figure: the table (slots 0-10) before and after inserting 12, which is placed in the first empty slot found from slot 1 onward]
Search with linear probing (Search 15)
15 = 1 x 11 + 4, so 15 mod 11 = 4
[Figure: the table (slots 0-10) is probed from slot 4 onward until an empty slot is reached]
NOT FOUND !
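Insertion and search with linear probing can be traced with a small C sketch. It assumes a table of 11 slots, h(x) = x mod 11, non-negative keys, and -1 as the "empty" marker; the names `lp_init`, `lp_insert`, and `lp_search` are made up for the example.

```c
#define LP_SIZE  11
#define LP_EMPTY (-1)    /* marker for an empty slot; assumes keys are >= 0 */

int lp_table[LP_SIZE];

void lp_init(void) {
    for (int i = 0; i < LP_SIZE; i++) lp_table[i] = LP_EMPTY;
}

/* Try h(key), h(key)+1, ... (mod table size) until an empty slot is found.
 * Returns the slot used, or -1 if the table is full. */
int lp_insert(int key) {
    for (int i = 0; i < LP_SIZE; i++) {
        int s = (key % LP_SIZE + i) % LP_SIZE;
        if (lp_table[s] == LP_EMPTY) { lp_table[s] = key; return s; }
    }
    return -1;
}

/* Follow the same probe sequence; an empty slot ends the search.
 * Returns the slot holding key, or -1 if it is not present. */
int lp_search(int key) {
    for (int i = 0; i < LP_SIZE; i++) {
        int s = (key % LP_SIZE + i) % LP_SIZE;
        if (lp_table[s] == LP_EMPTY) return -1;
        if (lp_table[s] == key) return s;
    }
    return -1;
}
```

If the keys 42, 9, 14, 1, 16, 24, 31, 28, 7 are inserted first, slots 4 and 8 are the only free ones, so inserting 12 (home slot 1) ends up in slot 4, and a search for 15 probes slots 4 to 7 before stopping at the empty slot 8.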
Deletion in Hashing with Linear Probing
Since empty buckets are used to terminate search, standard deletion does not work.
One simple idea is to not delete, but mark.
Insert: put item in first empty or marked bucket.
Search: Continue past marked buckets.
Delete: just mark the bucket as deleted.
Advantage: Easy and correct.
Disadvantage: table can become full with dead items.
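The mark-instead-of-delete idea might look like this in C. The sketch assumes 11 slots, h(x) = x mod 11, non-negative keys, and that a key is never inserted twice; the three-state `enum lz_state` and all function names are illustrative.

```c
#define LZ_SIZE 11

enum lz_state { LZ_EMPTY, LZ_DELETED, LZ_OCCUPIED };

struct lz_slot {
    enum lz_state state;
    int key;
};

/* Zero-initialised, so every slot starts as LZ_EMPTY. */
struct lz_slot lz_table[LZ_SIZE];

/* Search continues past marked (deleted) buckets and stops only at an empty one. */
int lz_search(int key) {
    for (int i = 0; i < LZ_SIZE; i++) {
        int s = (key % LZ_SIZE + i) % LZ_SIZE;
        if (lz_table[s].state == LZ_EMPTY) return -1;
        if (lz_table[s].state == LZ_OCCUPIED && lz_table[s].key == key) return s;
    }
    return -1;
}

/* Insert puts the item in the first empty or marked bucket.
 * Assumes the key is not already in the table. Returns the slot, or -1 if full. */
int lz_insert(int key) {
    for (int i = 0; i < LZ_SIZE; i++) {
        int s = (key % LZ_SIZE + i) % LZ_SIZE;
        if (lz_table[s].state != LZ_OCCUPIED) {
            lz_table[s].state = LZ_OCCUPIED;
            lz_table[s].key = key;
            return s;
        }
    }
    return -1;
}

/* Delete just marks the bucket. Returns 1 if the key was found, 0 otherwise. */
int lz_delete(int key) {
    int s = lz_search(key);
    if (s < 0) return 0;
    lz_table[s].state = LZ_DELETED;
    return 1;
}
```

The marker keeps probe chains intact: a key stored beyond the deleted bucket is still reachable, and the bucket itself can be reused by a later insert.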
Deletion with linear probing: LAZY (Delete 9)
9 = 0 x 11 + 9, so 9 mod 11 = 9
[Figure: the table (slots 0-10) is probed from slot 9 until 9 is found (FOUND !); the slot that held 9 is then marked D]
Eager Deletion: fill holes
Remove and find replacement: fill in the hole for later searches
    remove(j) {
        i = j;
        empty[i] = true;
        i = (i + 1) % D;                 // candidate for swapping
        while ((not empty[i]) and i != j) {
            r = Hash(ht[i]);             // where should it go without collision?
            // can we still find it based on the rehashing strategy?
            if not ((j < r <= i) or (i < j < r) or (r <= i < j))
                then break;              // yes, find it from rehashing: swap
            i = (i + 1) % D;             // no, cannot find it from rehashing
        }
        if (i != j and not empty[i]) then {
            ht[j] = ht[i];
            empty[j] = false;            // slot j is occupied again
            remove(i);
        }
    }
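One way to turn the hole-filling idea into C is sketched below, written as a loop rather than a recursive call. It assumes 11 slots, h(x) = x mod 11, non-negative keys, and linear probing; `ed_insert` and `ed_search` are included only so that the example is self-contained, and every name is hypothetical.

```c
#define ED_SIZE 11

int ed_key[ED_SIZE];
int ed_used[ED_SIZE];    /* 0 = empty, 1 = occupied; zero-initialised */

/* Linear-probing insert. Returns the slot used, or -1 if the table is full. */
int ed_insert(int key) {
    for (int n = 0; n < ED_SIZE; n++) {
        int s = (key % ED_SIZE + n) % ED_SIZE;
        if (!ed_used[s]) { ed_used[s] = 1; ed_key[s] = key; return s; }
    }
    return -1;
}

/* Linear-probing search. Returns the slot holding key, or -1. */
int ed_search(int key) {
    for (int n = 0; n < ED_SIZE; n++) {
        int s = (key % ED_SIZE + n) % ED_SIZE;
        if (!ed_used[s]) return -1;
        if (ed_key[s] == key) return s;
    }
    return -1;
}

/* Empty slot j, then repeatedly pull back the next element whose home
 * slot r does NOT lie cyclically in (j, i]; such an element would be
 * cut off from its home slot by the hole at j. */
void ed_remove_slot(int j) {
    for (;;) {
        ed_used[j] = 0;
        int i = j;
        for (;;) {
            i = (i + 1) % ED_SIZE;
            if (!ed_used[i]) return;          /* reached the next hole: done */
            int r = ed_key[i] % ED_SIZE;      /* home slot of the element at i */
            int between = (j < r && r <= i) || (i < j && j < r) || (r <= i && i < j);
            if (!between) break;              /* element at i must move into j */
        }
        ed_key[j] = ed_key[i];
        ed_used[j] = 1;
        j = i;                                /* slot i is the new hole */
    }
}
```

For example, with 42, 9 and 31 stored in slots 9, 10 and 0, removing slot 10 moves 31 back into slot 10 and leaves slot 0 empty, so no "deleted" markers are needed and later searches stay short.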
Eager Deletion Analysis (cont.)
If not full
After deletion, there will be at least two holes
Elements that are affected by the new hole are those whose
initial hashed location is cyclically before the new hole, and whose
location after linear probing is in between the new hole and the next hole in the search order
Elements are movable to fill the hole
[Figure: table showing the new hole, the next hole in the search order, an element's initial hashed location, and its location after linear probing]
Eager Deletion Analysis (cont.)
The important thing is to make sure that if a replacement (i) is swapped into deleted (j), we can still find that element.
How can we not find it?
If the original hashed position (r) is circularly in between deleted and the replacement
[Figure: cyclic arrangements of j, r and i. When r lies between j and i, a search will not find i past the empty slot; in the other arrangements it will find i]
Quadratic Probing
Reduces the primary clustering problem of Linear Probing
Check H(x)
If collision occurs check H(x) + 1
If collision occurs check H(x) + 4
If collision occurs check H(x) + 9
If collision occurs check H(x) + 16
...
H(x) + i²
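The probe sequence can be written as a small C helper. The sketch assumes a table of 11 slots with H(x) = x mod 11 and non-negative keys; `qp_probe` and `qp_insert` are illustrative names, and the probe positions are reduced modulo the table size.

```c
#define QP_SIZE 11

int qp_key[QP_SIZE];
int qp_used[QP_SIZE];    /* 0 = empty, 1 = occupied; zero-initialised */

/* Slot examined on the i-th probe: (H(x) + i*i) mod table size. */
int qp_probe(int key, int i) {
    return (key % QP_SIZE + i * i) % QP_SIZE;
}

/* Insert using the quadratic sequence. Returns the slot used, or -1 if
 * none of the first QP_SIZE probes finds a free slot (a quadratic
 * sequence does not necessarily visit every slot). */
int qp_insert(int key) {
    for (int i = 0; i < QP_SIZE; i++) {
        int s = qp_probe(key, i);
        if (!qp_used[s]) { qp_used[s] = 1; qp_key[s] = key; return s; }
    }
    return -1;
}
```

For x = 12 the first probes are slots 1, 2, 5, 10 and 6, so consecutive probes jump further and further apart instead of walking through one contiguous cluster.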
Quadratic Probing (insert 12)
12 = 1 x 11 + 1, so 12 mod 11 = 1
[Figure: the table (slots 0-10, keys 42, 9, 14, 1, 16, 24, 31, 28, 7) before and after inserting 12 by quadratic probing]
Double Hashing
When collision occurs use a second hash function
Hash2(x) = R – (x mod R)
R: greatest prime number smaller than table-size
Inserting 12
H2(x) = 7 – (x mod 7) = 7 – (12 mod 7) = 2
Check H(x)
If collision occurs check H(x) + 2
If collision occurs check H(x) + 4
If collision occurs check H(x) + 6
If collision occurs check H(x) + 8
...
H(x) + i * H2(x)
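The double-hashing probe sequence for this example can be sketched in C. The table size 11 and R = 7 come from the slide; the function names and the reduction modulo the table size are assumptions of the sketch.

```c
#define DH_SIZE 11   /* table size */
#define DH_R     7   /* greatest prime smaller than the table size */

/* Second hash function: Hash2(x) = R - (x mod R), always in 1 .. R. */
int dh_step(int key) {
    return DH_R - (key % DH_R);
}

/* Slot examined on the i-th probe: (H(x) + i * Hash2(x)) mod table size. */
int dh_probe(int key, int i) {
    return (key % DH_SIZE + i * dh_step(key)) % DH_SIZE;
}
```

Since Hash2 never returns 0, every probe moves to a different slot, and two keys with the same home slot usually get different step sizes, which is what removes secondary clustering.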
Double Hashing (insert 12)
12 = 1 x 11 + 1, so 12 mod 11 = 1
7 – (12 mod 7) = 2
[Figure: the table (slots 0-10, keys 42, 9, 14, 1, 16, 24, 31, 28, 7) before and after inserting 12 by double hashing]
Rehashing
If table gets too full, operations will take too long.
Build another table, twice as big (and prime).
Next prime number after 11 x 2 is 23
Insert every element again to this table
Rehash after a percentage of the table becomes full (70% for example)
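A rough C sketch of the resizing step, assuming an open-addressed table of non-negative integer keys with -1 marking empty slots and linear probing in the new table. The helper names and the 70% threshold test are illustrative choices, not a prescribed implementation.

```c
#define RH_EMPTY (-1)    /* marker for an empty slot; assumes keys are >= 0 */

/* Trial-division primality test; adequate for table sizes. */
int is_prime(unsigned n) {
    if (n < 2) return 0;
    for (unsigned d = 2; d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

/* Smallest prime that is at least twice the old size. */
unsigned next_table_size(unsigned old_size) {
    unsigned n = 2 * old_size;
    while (!is_prime(n)) n++;
    return n;
}

/* True once the table is at least 70% full. */
int needs_rehash(unsigned items, unsigned size) {
    return items * 10 >= size * 7;
}

/* Insert every element of the old table again into the new, larger one.
 * new_tab must have room for new_size entries. */
void rehash_all(const int *old_tab, unsigned old_size,
                int *new_tab, unsigned new_size) {
    for (unsigned s = 0; s < new_size; s++) new_tab[s] = RH_EMPTY;
    for (unsigned s = 0; s < old_size; s++) {
        if (old_tab[s] == RH_EMPTY) continue;
        unsigned t = (unsigned)old_tab[s] % new_size;
        while (new_tab[t] != RH_EMPTY) t = (t + 1) % new_size;  /* linear probing */
        new_tab[t] = old_tab[s];
    }
}
```

Every key has to be hashed again because its slot depends on the table size: 42 sits in slot 9 of an 11-slot table but belongs in slot 19 of a 23-slot one.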
Good and Bad Hashing Functions
Hash using the wrong key
Age of a student
Hash using limited information
First letter of last names (a lot of A’s, few Z’s)
Hash function choices:
keys evenly distributed in the hash table
Even distribution guaranteed by “randomness”
No expectation of outcomes
Cannot design input patterns to defeat randomness
Examples of Hashing Function
B=100, N=100, keys = A0, A1, …, A99
Hashing(A12) = (Ascii(A)+Ascii(1)+Ascii(2)) mod B
H(A18)=H(A27)=H(A36)=H(A45) …
Theoretically, N(1+N/B)= 200
In reality, 395 steps are needed because of collision
How to fix it?
Hashing(A12) = (Ascii(A)*2²+Ascii(1)*2+Ascii(2)) mod B
H(A12)!=H(A21)
Examples: numerical keys
Use X² and take the middle digits
Collision Frequency
Birthdays or the von Mises paradox
There are 365 days in a normal year
Birthdays on the same day unlikely?
How many people do I need before “it’s an even bet” (i.e. the probability is > 50%) that two have the same birthday?
View the days of the year as the slots in a hash table and the “birthday function” as mapping people to slots
Answering von Mises’ question answers the question about the probability of collisions in a hash table
Distinct Birthdays
Let Q(n) = probability that n people have distinct birthdays
Q(1) = 1
With two people, the 2nd has only 364 “free” birthdays
$$Q(2) = Q(1) \cdot \frac{364}{365}$$
The 3rd has only 363, and so on:
$$Q(n) = Q(1) \cdot \frac{364}{365} \cdot \frac{363}{365} \cdots \frac{365-n+1}{365}$$
Coincident Birthdays
Probability of having two identical birthdays
P(n) = 1 - Q(n)
P(23) = 0.507
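The value of P(23) can be checked numerically with a few lines of C. The function names are illustrative; the loop multiplies the factors 364/365, 363/365, and so on, down to (365-n+1)/365.

```c
/* Q(n): probability that n people all have distinct birthdays. */
double distinct_birthdays(int n) {
    double q = 1.0;                       /* Q(1) = 1 */
    for (int i = 1; i < n; i++)
        q *= (365.0 - i) / 365.0;         /* person i+1 has 365 - i free days */
    return q;
}

/* P(n) = 1 - Q(n): probability that at least two birthdays coincide. */
double shared_birthday(int n) {
    return 1.0 - distinct_birthdays(n);
}
```

`shared_birthday(22)` is still below 0.5, while `shared_birthday(23)` is just above it, so 23 is the smallest group size for an even bet.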
With 23 entries, table is only 23/365 = 6.3% full!
[Figure: plot of P(n) against n for n from 0 to 80; the vertical axis runs from 0.000 to 1.000]
Hash Tables - Load factor
Collisions are very probable!
Table load factor must be kept low
Detailed analyses of the average chain length (or number of comparisons/search) are available
Separate chaining
linked lists attached to each slot gives best performance
but uses more space!
Load factor = n / m
n = number of items
m = number of slots
Hash Tables - General Design
Choose the table size
Large tables reduce the probability of collisions!
Table size, m
n items
Collision probability grows with the load factor, α = n / m
Choose a table organisation
Does the collection keep growing?
Linked lists (....... but consider a tree!)
Size relatively static?
Overflow area or Re-hash
Choose a hash function
....
Hash Tables - General Design
Choose a hash function
A simple (and fast) one may well be fine ...
Read your text for some ideas!
Check the hash function against your data
Fixed data
Try various h, m until the maximum collision chain is acceptable
Known performance
Changing data
Choose some representative data
Try various h, m until collision chain is OK
Usually predictable performance
Hash Tables - Review
If you can meet the constraints
Hash Tables will generally give good performance
O(1) search
Like radix sort, they rely on calculating an address from a key
But, unlike radix sort, relatively easy to get good performance with a little experimentation
∴ not advisable for unknown data
collection size relatively static
memory management is actually simpler
All memory is pre-allocated!
Collision Functions
H_i(x) = (H(x) + i) mod B
Linear probing
H_i(x) = (H(x) + c*i) mod B (c > 1)
Linear probing with step-size = c
H_i(x) = (H(x) + i^2) mod B
Quadratic probing
H_i(x) = (H(x) + i * H2(x)) mod B
Double hashing
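The four collision functions can be compared by printing their first few probe positions. A minimal sketch, assuming example values B = 11, H(x) = 3, c = 3 and H2(x) = 5; the class and method names are made up for this illustration.

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

final class ProbeSequences {
    // i-th probe position for a key whose home slot is h = H(x) in a table of size b = B.
    static int linear(int h, int i, int b)             { return (h + i) % b; }
    static int linearStep(int h, int c, int i, int b)  { return (h + c * i) % b; }
    static int quadratic(int h, int i, int b)          { return (h + i * i) % b; }
    static int doubleHash(int h, int h2, int i, int b) { return (h + i * h2) % b; }

    // First `count` positions of a probe sequence given as a function of i.
    static int[] first(int count, IntUnaryOperator probe) {
        int[] positions = new int[count];
        for (int i = 0; i < count; i++) positions[i] = probe.applyAsInt(i);
        return positions;
    }

    public static void main(String[] args) {
        int b = 11, h = 3;   // assumed example values
        System.out.println(Arrays.toString(first(5, i -> linear(h, i, b))));
        System.out.println(Arrays.toString(first(5, i -> linearStep(h, 3, i, b))));
        System.out.println(Arrays.toString(first(5, i -> quadratic(h, i, b))));
        System.out.println(Arrays.toString(first(5, i -> doubleHash(h, 5, i, b))));
    }
}
```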
![Page 65: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/65.jpg)
Analysis of Open Hashing
Effort of one Insert?
Intuitively, that depends on how full the hash table is
Effort of an average Insert?
Effort to fill the Bucket to a certain capacity?
Intuitively, the accumulated efforts of the inserts
Effort to search an item (both successful and unsuccessful)?
Effort to delete an item (both successful and unsuccessful)?
Same effort for successful search and delete?
Same effort for unsuccessful search and delete?
![Page 66: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/66.jpg)
Picking a Hash Function
In practice, hash functions often use the bit representation of values.
e.g. compute the binary representation of the characters in the search key and return the sum modulo the number of buckets.
e.g. distribute first names over 59 buckets. Use ASCII values of the letters: John (74 + 111 + 104 + 110 = 399), 399 modulo 59 = bucket 45.
In practice schemes are more complicated than the above. See the big white algorithm book for more details (Cormen, Leiserson, Rivest).
For a hash table the idea is that there will be n entries.
n = the number of actual values.
The hope is that each of the n values maps to a different index (no collisions).
For DB hash indexing the index is divided into buckets.
Each bucket maps to a disk page (which maps to a disk block).
The hash function maps index entries to buckets with the intent that no bucket overflows.
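The first-name example (sum of ASCII values modulo the number of buckets) can be written out as a few lines of code. This is a sketch of that simple scheme only; the class name `NameHash` is an assumption for the example.

```java
final class NameHash {
    // Sum of the character codes of the key, modulo the number of buckets.
    static int bucket(String key, int buckets) {
        int sum = 0;
        for (int i = 0; i < key.length(); i++) sum += key.charAt(i);
        return sum % buckets;
    }

    public static void main(String[] args) {
        System.out.println(bucket("John", 59));   // 399 mod 59
        System.out.println(bucket("nhoJ", 59));   // same letters, same bucket
    }
}
```

Because addition ignores letter order, any two keys with the same letters collide, which is one reason practical schemes are more complicated.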
![Page 67: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/67.jpg)
Hash Function Problems
Poor distribution of values to buckets.
This can happen because the hash function is either not random or not uniform.
Solution: Change the hash function.
Skewed data distribution.
The distribution of search key values is skewed. That is, there is not a uniform distribution of values: many incidences of some values and few of others.
Solution: There is no way this can be addressed by changing the hash function. If this is a problem then a hash index may not be a good choice.
Collisions
Overflow may be caused by inserts of data entries with the same hash value (but different search key values).
Solution: Static hashing does not address this.
Extendible hashing and linear hashing deal with collisions to some extent.
![Page 68: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/68.jpg)
Static Hashing
Hash function
A uniform and random hash function h evenly maps search key values to a primary page.
With N buckets the hash function maps a search key value to buckets 0 to N-1.
Primary pages
The number of buckets (pages) is pre-determined. Each bucket is a disk page.
Overflow pages
As the DB grows primary pages become full. Additional data is placed in overflow pages chained to the primary pages. Finding a value involves searching these pages.
Performance
Generally a search requires one disk access. Insert or delete require two disk accesses. The number of primary pages is fixed.
The structure becomes inefficient as the file grows (because of the overflow pages).
It wastes space if the file shrinks.
Re-hashing removes overflow until the DB grows again. This process is time-consuming and the index is inaccessible during it.
![Page 69: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/69.jpg)
Extendible Hashing
The efficiency of static hashing declines as the file grows.
Overflow pages increase search time.
One solution would be to use a range of hash functions based on a bit value and double the number of buckets (and the function range) whenever an overflow page is needed.
Such a reorganization is expensive.
Is it possible to make local changes?
Use a directory of pointers to buckets.
Double the directory size when required.
Only split the pages that have overflowed.
The directory need only consist of an array of pointers to pages so is relatively compact.
The array index represents the value computed by the hash function.
At any time the array size is determined by how many bits of the hash result are being used.
Usually the last d (least significant) bits are used.
![Page 70: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/70.jpg)
Basic Structure
The array index is the last two bits of the hash value.
Note that the values in the cells represent the hash value (not the search key value).
Assume that three records fit on a page (so each bucket is a disk page).
There are only four pages of data and none of the pages has overflowed, so only two bits are required as an index.
Directory   Bucket
00          64 16
01          1 17 5
10          6
11          31 15
![Page 71: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/71.jpg)
Inserting Values
Insert 14 and 9 into the index shown in the previous example.
14 fits but inserting 9 causes the page to overflow.
Double the directory and split the overflowing bucket (01) into two.
Distribute the entries based on the last three bits of the hash value.
Directory pointers to the existing buckets are added for the other three new three-bit hash values.
Directory   Bucket
00          64 16
01          1 17 5     (insert 9: overflow)
10          6 14
11          31 15
![Page 72: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/72.jpg)
Inserting Values Example
Directory   Bucket
000         64 16
001         1 17 9
010         6 14
011         31 15
100         64 16     (same bucket as 000)
101         5         (new bucket)
110         6 14      (same bucket as 010)
111         31 15     (same bucket as 011)
Note how the directory has doubled in size but only one new bucket has been created.
The directory is small (each entry consists of a value and a pointer).
New index pages (buckets) are kept to a minimum.
![Page 73: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/73.jpg)
Keeping Track of Bits
After the directory is doubled some pages will be referenced by two directory entries.
If the referenced pages become full, subsequent insertions will require a new page to be allocated but will not require doubling the directory.
If pages that are referenced by only one directory entry overflow the directory will have to be doubled again.
Contrast inserting 4 and 12 (x00) into the example with inserting 25 (x01).
How do we keep track of whether or not an insert requires that the directory is doubled?
Record the global depth of the hashed file: the number of bits required for the directory.
Also record the local depth of each page: the number of bits needed to reference a particular page.
![Page 74: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/74.jpg)
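Local and Global Depth
Global depth: 3
Directory   Bucket     Local depth
000         64 16      2
001         1 17 9     3
010         6 14       2
011         31 15      2
100         64 16      2 (same bucket as 000)
101         5          3
110         6 14       2 (same bucket as 010)
111         31 15      2 (same bucket as 011)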
![Page 75: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/75.jpg)
Summary
Create initial hash file and directory.
Each directory entry points to a different bucket.
All local depths are equal to the global depth.
Insert values
If no overflow occurs insert entry and finish.
If the bucket overflows compare the local depth of the bucket with the global depth.
If the local depth is less than the global depth create a new bucket and distribute the entries without doubling the directory. Increment the local depth of the split buckets.
If the local depth is the same as the global depth double the directory, and split the bucket. Increment the local depth of the split buckets and the global depth of the directory.
Delete values
If the deletion empties the bucket it can be merged and the local depth decremented.
If the local depth of all buckets is less than the global depth the directory can be halved.
In practice this is often not done.
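The insertion rules above can be sketched in code. This is a simplified in-memory illustration, not a disk-based implementation: the class `ExtendibleHashSketch` and its members are hypothetical names, the stored integers are treated as their own hash values, deletion is left out, and overflow pages for many identical hash values are not modelled (so distinct values are assumed).

```java
import java.util.ArrayList;
import java.util.List;

final class ExtendibleHashSketch {
    static final class Bucket {
        int localDepth;
        final List<Integer> values = new ArrayList<>();
        Bucket(int localDepth) { this.localDepth = localDepth; }
    }

    final int capacity;                         // records per bucket (page)
    int globalDepth;
    final List<Bucket> directory = new ArrayList<>();

    ExtendibleHashSketch(int capacity, int globalDepth) {
        this.capacity = capacity;
        this.globalDepth = globalDepth;
        for (int i = 0; i < (1 << globalDepth); i++) directory.add(new Bucket(globalDepth));
    }

    // Directory index = last globalDepth bits of the hash value.
    int index(int hash) { return hash & ((1 << globalDepth) - 1); }

    void insert(int hash) {
        Bucket b = directory.get(index(hash));
        if (b.values.size() < capacity) { b.values.add(hash); return; }
        if (b.localDepth == globalDepth) {
            // Double the directory: the new half points at the same buckets as the old half.
            directory.addAll(new ArrayList<>(directory));
            globalDepth++;
        }
        // Split the full bucket on the next bit; entries with that bit set move out.
        int bit = 1 << b.localDepth;
        b.localDepth++;
        Bucket fresh = new Bucket(b.localDepth);
        List<Integer> old = new ArrayList<>(b.values);
        b.values.clear();
        for (int v : old) ((v & bit) != 0 ? fresh : b).values.add(v);
        for (int i = 0; i < directory.size(); i++) {
            if (directory.get(i) == b && (i & bit) != 0) directory.set(i, fresh);
        }
        insert(hash);                           // retry; a further split may be needed
    }
}
```

Starting with capacity 3 and global depth 2, inserting 64, 16, 1, 17, 5, 6, 31, 15, 14 and then 9 doubles the directory once and creates one new bucket; inserting 4 and 12 afterwards splits a bucket without doubling.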
![Page 76: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/76.jpg)
Performance
Performance is identical to static hashing if the directory fits in memory.
If the directory does not fit in memory an additional disk access is required.
It is likely that the directory will fit in memory.
Collisions and Performance
Collisions at low global depth are dealt with by doubling the directory, and the range of the hash function, and making local index changes.
Many collisions will result in producing a large directory (that might not fit in memory).
Overflow pages
If many entries have the same hash value across the entire range of bits overflow pages have to be allocated, reducing efficiency.
This is because splitting a bucket (and doubling the directory) would cause another split if all the entries map to the same bucket after splitting.
This can occur if the hash function is poor.
Collisions will occur if the distribution of values is skewed.
![Page 77: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/77.jpg)
Linear Hashing
Another dynamic system.
Like extendible hashing, insertions and deletions are efficiently accommodated.
Unlike extendible hashing, a directory is not required.
Collisions may result in chains of overflow pages.
Linear hashing, like extendible hashing, uses a family of hash functions.
Each function's range is twice that of its predecessor.
Pages are split when overflows occur, but not necessarily the page with the overflow.
Splitting occurs in turn, in a round robin fashion.
When all the pages at one level (the current hash function) have been split a new level is applied.
Splitting occurs gradually.
Primary pages are allocated consecutively.
![Page 78: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/78.jpg)
Levels of Linear Hashing
Initial Stage.
The initial level distributes entries into N0 buckets.
Call the hash function to perform this h0.
Splitting buckets.
If a bucket overflows its primary page is chained to an overflow page (increasing the bucket's size).
Also when a bucket overflows some bucket is split.
The first bucket to be split is the first bucket in the file (not necessarily the bucket that overflows).
The next bucket to be split is the second bucket in the file, and so on until the N0-th has been split.
When buckets are split their entries (including those in overflow pages) are distributed using h1.
To access split buckets the next level hash function (h1) is applied.
h1 maps entries to 2N0 (or N1) buckets.
Level progression.
Once all N_i buckets of the current level (i) are split the hash function h_i is replaced by h_(i+1).
The splitting process starts again at the first bucket and h_(i+2) is applied to find entries in split buckets.
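The level and split rules can be sketched as follows. This is a simplified in-memory model under stated assumptions: the class `LinearHashSketch` is a hypothetical name, keys are non-negative integers with h_i(key) = key mod (N0 * 2^i), each bucket is one list (entries beyond the page capacity stand for overflow pages), and every insert into an already full bucket triggers one split.

```java
import java.util.ArrayList;
import java.util.List;

final class LinearHashSketch {
    final int n0;          // N0: number of buckets at level 0
    final int capacity;    // entries per primary page
    int level = 0;         // current base level i
    int next = 0;          // bucket to be split next
    final List<List<Integer>> buckets = new ArrayList<>();

    LinearHashSketch(int n0, int capacity) {
        this.n0 = n0;
        this.capacity = capacity;
        for (int i = 0; i < n0; i++) buckets.add(new ArrayList<>());
    }

    // h_lvl(key) = key mod (N0 * 2^lvl); keys are assumed to be non-negative.
    int h(int lvl, int key) { return key % (n0 << lvl); }

    int bucketFor(int key) {
        int b = h(level, key);
        return b < next ? h(level + 1, key) : b;   // split buckets use the next level
    }

    void insert(int key) {
        List<Integer> target = buckets.get(bucketFor(key));
        boolean overflow = target.size() >= capacity;
        target.add(key);                           // past capacity: overflow page
        if (overflow) splitNext();
    }

    private void splitNext() {
        List<Integer> old = new ArrayList<>(buckets.get(next));
        buckets.get(next).clear();
        buckets.add(new ArrayList<>());            // new bucket number next + N0 * 2^level
        for (int key : old) buckets.get(h(level + 1, key)).add(key);
        next++;
        if (next == (n0 << level)) { level++; next = 0; }   // level complete
    }
}
```

Note that the bucket that is split is the one indicated by next, not necessarily the one that overflowed, so an overflow chain can persist until its bucket's turn comes.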
![Page 79: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/79.jpg)
Linear Hashing Example
The example below shows the index level equal to 0 where N0 equals 4 (three entries fit on a page).
h0 maps index entries to one of four buckets.
Given the initial page of the file the appropriate primary page can be determined by using an offset, i.e. initial page + h0(search key value).
In this example only h0 is used and no buckets have been split.
Now consider what happens when 9 is inserted (which will not fit in the second bucket).
Note that next indicates which bucket is to split next.
Bucket   Entries
0        64 36      (next)
1        1 17 5
2        6
3        31 15
![Page 80: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/80.jpg)
Linear Hashing Example 2
The page indicated by next is split (the first one).
Next is incremented.
An overflow page is chained to the primary page to contain the inserted value.
If h0 maps a value to a bucket from zero to next - 1 (just the first page in this case) h1 must be used to determine where to insert the new entry.
Note how the new page falls naturally into the sequence as the fifth page.
Bucket   Hash   Entries
0        h1     64
1        h0     1 17 5, overflow: 9     (next)
2        h0     6
3        h0     31 15
4        h1     36
![Page 81: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/81.jpg)
Linear Hashing Example 3
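Assume inserts of 8, 7, 18, 14, 11, 32, 16, 10, 13, 23 (inserting 11, 16 and 23 triggers splits 1, 2 and 3).
After the 3rd split every original bucket has been split, so the base level is 1 (N1 = 8), use h1.
Subsequent splits will use h2 for inserts between the first bucket and next-1.
Bucket   After split 1   After split 2   Entries after all inserts
0        h1              h1              64 8 32, overflow: 16     (next after split 3)
1        h1              h1              1 17 9
2        h0              h1              18 10                     (next after split 1)
3        h0              h0              11                        (next after split 2)
4        h1              h1              36
5        h1              h1              5 13
6        -               h1              6 14
7        -               -               31 15 7, overflow: 23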
![Page 82: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/82.jpg)
Comparing Linear and Extendible Hashing
Differences
Because buckets are split in turn linear hashing does not need a directory.
Extendible hashing may lead to better use of space because the overflowing bucket is always the one that is split.
In particular, linear hashing does not deal elegantly with collisions in that long overflow chains may develop.
Collisions with extendible hashing lead to a large directory.
Similarities
Doubling the directory and moving to the next level of hash function have the same effect: the range of buckets is doubled.
The hash functions used by the two schemes may be the same. i.e. they can both use a bit translation of the search key value modulus the number of buckets required.
![Page 83: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/83.jpg)
Issues:
What do we lose?
Operations that require ordering are inefficient
Operation      Hash table     Balanced binary tree
FindMax        O(n)           O(log n)
FindMin        O(n)           O(log n)
PrintSorted    O(n log n)     O(n)
What do we gain?
Operation      Hash table     Balanced binary tree
Insert         O(1)           O(log n)
Delete         O(1)           O(log n)
Find           O(1)           O(log n)
How to handle Collision?
Separate chaining
Open addressing
![Page 84: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/84.jpg)
Interface
Main methods:
void Put(Object)
Object Get(Object) … returns null if not in the table
… Remove(Object)
Goal: methods are O(1)! (usually)
Implementation details
HashTable: the storage bin
hashfunction(object): tells where object should go
collision resolution strategy: what to do when two objects “hash” to same location.
In Java, all objects have a default int hashCode(), but it is better to define your own, except for strings.
String hashing in Java is good.
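Defining your own hash code in Java means overriding `hashCode()` together with `equals()`, so that equal objects hash to the same location. A minimal sketch, assuming a made-up key class `CustomerId` with two fields; `Objects.hash` is the standard library helper used to combine them.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class CustomerId {
    final String region;
    final int number;

    CustomerId(String region, int number) {
        this.region = region;
        this.number = number;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CustomerId)) return false;
        CustomerId other = (CustomerId) o;
        return number == other.number && region.equals(other.region);
    }

    // Equal objects must return equal hash codes, so use the same fields as equals().
    @Override
    public int hashCode() { return Objects.hash(region, number); }

    public static void main(String[] args) {
        Map<CustomerId, String> names = new HashMap<>();
        names.put(new CustomerId("LT", 42), "Alice");
        // A different but equal key object finds the entry.
        System.out.println(names.get(new CustomerId("LT", 42)));
    }
}
```

Without the two overrides, the second `CustomerId("LT", 42)` would use the default identity-based hash code and the lookup would return null.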
![Page 85: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/85.jpg)
HashFunctions
Goal: map objects into table so distribution is uniform
Tricky to do.
Examples for string s
product of ASCII codes, then mod tablesize: nearly always even, so bad
sum of ASCII codes, then mod tablesize: may be too small
shift bits in ASCII code: Java allows this with << and >>
Java does a good job with Strings
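The product and shift variants can be tried out directly. This is a sketch with made-up names (`StringHashes`, `productHash`, `shiftHash`); the shift version computes 31*h + c for each character, which is the same polynomial that Java's `String.hashCode()` is documented to use.

```java
final class StringHashes {
    // Product of the character codes, reduced modulo the table size.
    static int productHash(String s, int tableSize) {
        long product = 1;
        for (int i = 0; i < s.length(); i++) product = (product * s.charAt(i)) % tableSize;
        return (int) product;
    }

    // Shift-based hash: (h << 5) - h equals 31 * h in int arithmetic.
    static int shiftHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) h = (h << 5) - h + s.charAt(i);
        return h;
    }

    // Table slot for a possibly negative hash value.
    static int slot(int hash, int tableSize) { return Math.floorMod(hash, tableSize); }

    public static void main(String[] args) {
        System.out.println(productHash("John", 1000));           // even: 'J' has an even code
        System.out.println(slot(shiftHash("hashing"), 1000));
        System.out.println(shiftHash("hashing") == "hashing".hashCode());
    }
}
```

With an even table size, a single even character code makes the product hash even, so the odd slots are rarely used.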
![Page 86: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/86.jpg)
Example Problem
Suppose we are storing numeric IDs of customers, maybe 100,000 of them.
We want to check if a person is delinquent; there are usually fewer than 400 delinquents.
Use an array of size 1000 to hold the delinquents.
Put id in at id mod tableSize.
Clearly fast for getting, removing
But what happens if entries collide?
![Page 87: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/87.jpg)
Separate Chaining
Array of linked lists
The hash function determines which list to search
May or may not keep individual lists in sorted order
Problems:
needs a very good hash function, which may not exist
worst case: O(n)
extra-space for links
Another approach: Open Addressing
everything goes into the array, somehow
several approaches: linear, quadratic, double, rehashing
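Separate chaining for the customer-ID problem can be sketched as an array of linked lists indexed by id mod tableSize. The class `ChainedIdSet` and its method names are assumptions for this illustration; the lists are kept unsorted.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

final class ChainedIdSet {
    private final List<LinkedList<Integer>> table = new ArrayList<>();

    ChainedIdSet(int tableSize) {
        for (int i = 0; i < tableSize; i++) table.add(new LinkedList<>());
    }

    // The hash function picks the list to search: id mod tableSize.
    private LinkedList<Integer> chain(int id) {
        return table.get(Math.floorMod(id, table.size()));
    }

    boolean add(int id) {
        LinkedList<Integer> c = chain(id);
        if (c.contains(id)) return false;   // no repetition of members
        c.add(id);
        return true;
    }

    boolean contains(int id) { return chain(id).contains(id); }

    boolean remove(int id) { return chain(id).remove(Integer.valueOf(id)); }

    int chainLength(int id) { return chain(id).size(); }
}
```

Two IDs such as 12345 and 98345 collide in a table of size 1000 and simply share one list; the cost of a lookup is the length of that list.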
![Page 88: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/88.jpg)
Linear Probing
Store information (or pointers to objects) in array
Linear Probing
When inserting an object, if the location is filled, find the first unfilled position, i.e. probe (h(x) + f(i)) mod tableSize for i = 0, 1, 2, ... where f(i) = i.
When getting an object, start at the hash address, and do a linear search until the object or a hole is found.
Primary clustering: blocks of filled cells occur
Harder to insert than to find an existing element
Load factor lf = fraction of array filled
Expected probes for insertion: (1/2)(1 + 1/(1-lf)^2)
Expected probes for successful search: (1/2)(1 + 1/(1-lf))
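Insertion and lookup with linear probing can be sketched as below. The class `LinearProbingSet` is a hypothetical name; keys are integers hashed by key mod tableSize, and deletion is left out because it needs extra care (removing an entry would create a hole that cuts off later lookups).

```java
final class LinearProbingSet {
    private final Integer[] slots;   // null marks a hole

    LinearProbingSet(int tableSize) { slots = new Integer[tableSize]; }

    // Inserts key and returns the number of probes used, or -1 if the table is full.
    int insert(int key) {
        int home = Math.floorMod(key, slots.length);
        for (int i = 0; i < slots.length; i++) {
            int pos = (home + i) % slots.length;          // f(i) = i
            if (slots[pos] == null || slots[pos] == key) {
                slots[pos] = key;
                return i + 1;
            }
        }
        return -1;
    }

    // Linear search from the home address until the key or a hole is found.
    boolean contains(int key) {
        int home = Math.floorMod(key, slots.length);
        for (int i = 0; i < slots.length; i++) {
            Integer found = slots[(home + i) % slots.length];
            if (found == null) return false;
            if (found == key) return true;
        }
        return false;
    }
}
```

In a table of size 10, inserting 15, 25 and 35 builds a cluster at slots 5 to 7, and a later key whose home is slot 6 has to walk past it: this is primary clustering.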
![Page 89: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/89.jpg)
Expected number of probes
Load factor failure success
.1 1.11 1.06
.2 1.28 1.13
.3 1.52 1.21
.4 1.89 1.33
.5 2.5 1.50
.6 3.6 1.75
.7 6.0 2.17
.8 13.0 3.0
.9 50.5 5.5
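The table can be regenerated from the two linear probing formulas, (1/2)(1 + 1/(1-lf)^2) for insertion or unsuccessful search and (1/2)(1 + 1/(1-lf)) for successful search. A small sketch; the class and method names are made up for the example.

```java
final class ProbeEstimates {
    // Expected probes for an insertion / unsuccessful search at load factor lf.
    static double unsuccessful(double lf) {
        return 0.5 * (1 + 1 / ((1 - lf) * (1 - lf)));
    }

    // Expected probes for a successful search at load factor lf.
    static double successful(double lf) {
        return 0.5 * (1 + 1 / (1 - lf));
    }

    public static void main(String[] args) {
        for (int tenths = 1; tenths <= 9; tenths++) {
            double lf = tenths / 10.0;
            System.out.printf("%.1f  %6.2f  %5.2f%n", lf, unsuccessful(lf), successful(lf));
        }
    }
}
```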
![Page 90: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/90.jpg)
Quadratic Probing
Idea: f(i) = i^2 (or some other quadratic function)
Problem: If the table is more than 1/2 full, no guarantee of finding any space!
Theorem: if table is less than 1/2 full, and table size is prime, then an element can be inserted.
Good: Quadratic probing eliminates primary clustering
Quadratic probing has secondary clustering (minor)
if hash to same addresses, then probe sequence will be the same
![Page 91: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/91.jpg)
Proof of theorem
Theorem: The first P/2 probes are distinct (P is the prime table size).
Suppose not.
Then there are i and j with 0 <= j < i < P/2 whose probes land in the same place
So h(x)+i^2 = h(x)+j^2 mod P
So i^2 = j^2 mod P
(i+j)*(i-j) = 0 mod P
Since P is prime, P must divide i+j or i-j
But i and j are distinct and less than P/2, so 0 < i-j <= i+j < P and P divides neither
Contradiction
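The theorem can be checked numerically by counting how many distinct slots the first probes reach. A sketch with a hypothetical helper `distinctSlots`; the table sizes 11 (prime) and 16 (not prime) are example values chosen here.

```java
final class QuadraticProbeCheck {
    // Number of distinct slots visited by probes i = 0 .. count-1 of (home + i^2) mod size.
    static int distinctSlots(int home, int size, int count) {
        boolean[] seen = new boolean[size];
        int distinct = 0;
        for (int i = 0; i < count; i++) {
            int pos = (int) ((home + (long) i * i) % size);
            if (!seen[pos]) {
                seen[pos] = true;
                distinct++;
            }
        }
        return distinct;
    }

    public static void main(String[] args) {
        System.out.println(distinctSlots(3, 11, 6));    // probes with i < 11/2
        System.out.println(distinctSlots(3, 11, 11));   // all 11 probes
        System.out.println(distinctSlots(0, 16, 8));    // non-prime table size
    }
}
```

For the prime size all probes with i < P/2 hit different slots, but probing further never reaches more than about half the table, which is why the table must stay less than half full.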
![Page 92: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/92.jpg)
Double Hashing
Goal: spreading out the probe sequence
f(i) = i*hash2(x), where hash2 is another hash function
Dangerous: can be very bad.
Also may not eliminate any problems
In best case, it’s great
![Page 93: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/93.jpg)
Rehashing
All methods degrade when table becomes too full
Simplest solution:
create new table, twice as large
rehash everything
O(N), so costly if done often
With quadratic probing, rehash when table 1/2 full
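Rehashing can be sketched on top of a small open-addressing table. Assumptions for this example: the class `RehashingSet` is a made-up name, it uses linear probing for brevity (not quadratic), starts with 8 slots, and doubles the table before the load factor would exceed 1/2.

```java
final class RehashingSet {
    private Integer[] slots = new Integer[8];   // null marks a hole
    private int size = 0;

    int capacity() { return slots.length; }

    void add(int key) {
        if (contains(key)) return;
        if (2 * (size + 1) > slots.length) rehash();   // keep load factor <= 1/2
        place(slots, key);
        size++;
    }

    boolean contains(int key) {
        int pos = Math.floorMod(key, slots.length);
        while (slots[pos] != null) {                   // a hole always exists at load <= 1/2
            if (slots[pos] == key) return true;
            pos = (pos + 1) % slots.length;
        }
        return false;
    }

    private static void place(Integer[] table, int key) {
        int pos = Math.floorMod(key, table.length);
        while (table[pos] != null) pos = (pos + 1) % table.length;
        table[pos] = key;
    }

    // Create a table twice as large and re-insert everything: O(N).
    private void rehash() {
        Integer[] bigger = new Integer[2 * slots.length];
        for (Integer k : slots) {
            if (k != null) place(bigger, k);
        }
        slots = bigger;
    }
}
```

Every key has to be placed again because its slot depends on the table size; the old positions are meaningless in the new table.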
![Page 94: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/94.jpg)
Extendible Hashing: Uses secondary storage
Suppose data does not fit in main memory
Goal: Reduce number of disk accesses.
Suppose N records to store and M records fit in a disk block
Result: 2 disk accesses for find (~4 for insert)
Let D be max number of bits so 2^D < M.
This is for root or directory (a disk block)
Algo: hash on first D bits, yields ptr to disk block
Expected number of leaves: (N/M) log_2 e
Expected directory size: O(N^(1+1/M) / M)
Theoretically difficult, more details for implementation
![Page 95: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/95.jpg)
Applications
Compilers: keep track of variables and scope
Graph Theory: associate id with name (general)
Game Playing: E.G. in chess, keep track of positions already considered and evaluated (which may be expensive)
Spelling Checker: At least to check that word is right.
But how to suggest correct word
Lexicon/book indices
![Page 96: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/96.jpg)
HashSets vs HashMaps
HashSets store objects: support adding and removing in constant time
HashMaps store a pair (key, object): this is an implementation of a Map
HashMaps are more useful and standard
HashMaps main methods are:
put(Object key, Object value)
get(Object key)
remove(Object key)
All done in expected O(1) time.
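The difference between the two containers shows up in a short usage sketch with `java.util.HashSet` and `java.util.HashMap`. The class `SetVersusMap` and the word-counting task are assumptions chosen for the illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class SetVersusMap {
    // A HashSet only remembers which objects are present.
    static Set<String> distinct(String... words) {
        Set<String> seen = new HashSet<>();
        for (String w : words) seen.add(w);
        return seen;
    }

    // A HashMap associates a value (here a count) with each key.
    static Map<String, Integer> counts(String... words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            Integer old = counts.get(w);              // null if the key is absent
            counts.put(w, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(distinct("hash", "map", "hash"));
        System.out.println(counts("hash", "map", "hash"));
    }
}
```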
![Page 97: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/97.jpg)
Lexicon Example
Inputs: text file (N) + content word file (the keys) (M)
Output: content words in order, with page numbers
Algo:
Define entry = (content word, linked list of integers)
Initially, list is empty for each word.
Step 1: Read content word file and make HashMap of (content word, empty list)
Step 2: Read text file and check if each word is in the HashMap;
if in, add the page number to its list, else continue.
Step 3: Use the iterator method to now walk thru the HashMap and put it into a sortable container.
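The three steps can be sketched as one method. Assumptions made for this example: the input is already split into pages and words (`pages.get(p)` holds the words of page p + 1), an `ArrayList` stands in for the linked list of integers, and a `TreeMap` plays the role of the sortable container. The class name `LexiconSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class LexiconSketch {
    static SortedMap<String, List<Integer>> index(List<String> contentWords,
                                                  List<List<String>> pages) {
        // Step 1: HashMap of (content word, empty list).
        Map<String, List<Integer>> entries = new HashMap<>();
        for (String word : contentWords) entries.put(word, new ArrayList<>());

        // Step 2: for each word of the text, record the page if it is a content word.
        for (int p = 0; p < pages.size(); p++) {
            for (String word : pages.get(p)) {
                List<Integer> list = entries.get(word);
                if (list == null) continue;                      // not a content word
                if (list.isEmpty() || list.get(list.size() - 1) != p + 1) list.add(p + 1);
            }
        }

        // Step 3: move the entries into a sorted container.
        return new TreeMap<>(entries);
    }
}
```

Steps 1 and 2 cost expected O(1) per word; only step 3 pays the O(M log M) sorting cost.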
![Page 98: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/98.jpg)
Lexicon Example
Complexity:
step 1: O(M), M number of content words
step 2: O(N), N number of words in the text file
step 3: O(M log M) max.
So O(max(N, M log M))
Dumb Algorithm
Sort content words O(Mlog M) (balanced tree)
Look up each word in Content Word tree and update
O(N*logM)
Total complexity: O(N log M)
N = 500*2000 =1,000,000 and M = 1000
Smart algo: 1,000,000; dumb algo: 1,000,000*10.
![Page 99: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/99.jpg)
Memoization
Recursive Fibonacci:
fib(n) = if (n<2) return 1
else return fib(n-1)+fib(n-2)
Use hashing to store intermediate results
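Hashtable<Integer, Long> ht = new Hashtable<>();
long fib(int n) {
  Long cached = ht.get(n);        // look up a stored intermediate result
  if (cached != null) return cached;
  if (n < 2) return 1;
  long ans = fib(n-1) + fib(n-2);
  ht.put(n, ans);                 // remember the answer for later calls
  return ans;
}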
![Page 100: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/100.jpg)
Appendix: Hashing for Databases
![Page 101: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/101.jpg)
Contents
Static Hashing: File Organization, Properties of the Hash Function, Bucket Overflow, Indices
Dynamic Hashing: Underlying Data Structure, Querying and Updating
Comparisons: Other types of hashing, Ordered Indexing vs. Hashing
![Page 102: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/102.jpg)
Static Hashing
Hashing provides a means for accessing data without the use of an index structure.
Data is addressed on disk by computing a function on a search key instead.
![Page 103: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/103.jpg)
File organization
A bucket in a hash file is a unit of storage (typically a disk block) that can hold one or more records.
The hash function, h, is a function from the set of all search-keys, K, to the set of all bucket addresses, B.
Insertion, deletion, and lookup are done in constant time.
![Page 104: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/104.jpg)
Querying and Updates
To insert a record into the structure compute the hash value h(Ki), and place the record in the bucket address returned.
For lookup operations, compute the hash value as above and search each record in the bucket for the specific record.
To delete, simply look up the record and remove it.
![Page 105: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/105.jpg)
Properties of the Hash Function
The distribution should be uniform.
An ideal hash function should assign the same number of records to each bucket.
The distribution should be random.
Regardless of the actual search-keys, each bucket has the same number of records on average
Hash values should not depend on any ordering of the search-keys
![Page 106: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/106.jpg)
Bucket Overflow
How does bucket overflow occur?
Not enough buckets to handle data
A few buckets have considerably more records than others. This is referred to as skew.
Multiple records have the same hash value
Non-uniform hash function distribution.
![Page 107: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/107.jpg)
Solutions
Provide more buckets than are needed.
Overflow chaining
If a bucket is full, link another bucket to it. Repeat as necessary.
The system must then check overflow buckets for querying and updates. This is known as closed hashing.
![Page 108: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/108.jpg)
Alternatives
Open hashing
The number of buckets is fixed
Overflow is handled by using the next bucket in cyclic order that has space. This is known as linear probing.
Compute more hash functions.
Note: Closed hashing is preferred in database systems.
![Page 109: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/109.jpg)
Indices
A hash index organizes the search keys, with their pointers, into a hash file.
Hash indices are never primary indices, even though they provide direct access.
![Page 110: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/110.jpg)
Example of Hash Index
![Page 111: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/111.jpg)
Dynamic Hashing
More effective than static hashing when the database grows or shrinks
Extendable hashing splits and coalesces buckets appropriately with the database size.
i.e. buckets are added and deleted on demand.
![Page 112: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/112.jpg)
The Hash Function
Typically produces a large number of values, uniformly and randomly.
Only part of the value is used depending on the size of the database.
![Page 113: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/113.jpg)
Data Structure
Hash indices are typically a prefix of the entire hash value.
More than one consecutive index can point to the same bucket.
The indices have the same hash prefix which can be shorter than the length of the index.
![Page 114: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/114.jpg)
General Extendable Hash Structure
In this structure, i_2 = i_3 = i, whereas i_1 = i - 1
![Page 115: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/115.jpg)
Queries and Updates
Lookup
Take the first i bits of the hash value.
Follow the corresponding entry in the bucket address table.
Look in the bucket.
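Taking the first i bits of a hash value is a one-line shift. A sketch assuming 32-bit hash values and 0 <= i <= 31; the class `PrefixIndex` and method `firstBits` are made-up names.

```java
final class PrefixIndex {
    // First i bits of a 32-bit hash value, used as the index into the bucket address table.
    static int firstBits(int hash, int i) {
        return i == 0 ? 0 : hash >>> (32 - i);   // unsigned shift keeps the result non-negative
    }

    public static void main(String[] args) {
        int hash = 0xB0000000;                   // bit pattern 1011 0000 ... 0000
        System.out.println(firstBits(hash, 1));
        System.out.println(firstBits(hash, 2));
        System.out.println(firstBits(hash, 4));
    }
}
```

The case i = 0 is handled separately because Java shifts an int by the distance modulo 32, so a shift by 32 would leave the value unchanged.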
![Page 116: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/116.jpg)
Queries and Updates (Cont’d)
Insertion
Follow lookup procedure
If the bucket has space, add the record.
If not…
![Page 117: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/117.jpg)
Insertion (Cont’d)
Case 1: i = i_j
Use an additional bit in the hash value
This doubles the size of the bucket address table.
Makes two entries in the table point to the full bucket.
Allocate a new bucket, z.
Set i_j and i_z to i
Point the second entry to the new bucket
Rehash the old bucket
Repeat insertion attempt
![Page 118: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/118.jpg)
Insertion (cont’d)
Case 2: i > i_j
Allocate a new bucket, z
Add 1 to i_j, set i_j and i_z to this new value
Leave the first half of the table entries that pointed to bucket j as they are and point the second half at bucket z
Rehash records in bucket j
Reattempt insertion
![Page 119: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/119.jpg)
Insertion (Finally)
If all the records in the bucket have the same search value, simply use overflow buckets as seen in static hashing.
![Page 120: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/120.jpg)
Use of Extendable Hash Structure: Example
Initial Hash structure, bucket size = 2
![Page 121: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/121.jpg)
Example (Cont.)
Hash structure after insertion of one Brighton and two Downtown records
![Page 122: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/122.jpg)
Example (Cont.)
Hash structure after insertion of Mianus record
![Page 123: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/123.jpg)
Example (Cont.)
Hash structure after insertion of three Perryridge records
![Page 124: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/124.jpg)
Example (Cont.)
Hash structure after insertion of Redwood and Round Hill records
![Page 125: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/125.jpg)
Comparison to Other Hashing Methods
Advantage: performance does not decrease as the database size increases
Space is conserved by adding and removing as necessary
Disadvantage: additional level of indirection for operations
Complex implementation
![Page 126: Data Structures and Algorithms Course slides: Hashing algis](https://reader035.vdocument.in/reader035/viewer/2022062303/5519b37655034660578b46bd/html5/thumbnails/126.jpg)
Ordered Indexing vs. Hashing
Hashing is less efficient if queries to the database include ranges as opposed to specific values.
In cases where ranges are infrequent, hashing provides faster insertion, deletion, and lookup than ordered indexing.