Hash Tables
2
Exercise 2
/* Exercise 1 */
void mystery(int n) {
    int i, j, k;
    for (i = 1; i <= n - 1; i++) {
        for (j = i + 1; j <= n; j++) {
            for (k = 1; k <= j; k++) {
                /* Some statement taking O(1) time */
            }
        }
    }
}
3
Exercise 3
/* Exercise 2 */
void veryodd(int n) {
    int i, j, x, y;
    x = 0;
    y = 0;
    for (i = 1; i <= n; i++) {
        if (i % 2 == 1) {
            for (j = i; j <= n; j++) {
                x = x + 1;
            }
            for (j = 1; j <= i; j++) {
                y = y + 1;
            }
        }
    }
}
4
Consider www.google.com
Efficient searches: lookup “laptop” in all web pages
How many web pages ? How fast is response ?
5
Consider www.google.com
4 billion pages
Consider data structures: linked list, sorted linked list,
array, sorted array, BST
6
Unsorted Linked List of n elem
List *searchList(List *a, int key) {
    if (a == NULL) return NULL;        // not found
    if (a->data == key) return a;      // found: return the node
    return searchList(a->next, key);
}
Best, Average, Worst T(n) ?
7
Sorted Linked List of n elem
List *searchList(List *a, int key) {
    if (a == NULL) return NULL;        // not found
    if (a->data == key) return a;      // found: return the node
    return searchList(a->next, key);
}
Best, Average, Worst T(n) ?
8
Unsorted Array of n elem
int seq_search(int n, int *a, int key) {
    int i = 0;
    while (i < n && a[i] != key) {
        i++;
    }
    return i;   // index of key, or n if key is not in the array
}
Best, Average, Worst T(n) ?
9
Sorted Array of n elem
int binary_search(int n, int *a, int key) {
    int lo = -1;
    int hi = n;
    while (hi - lo != 1) {
        int mid = (hi + lo) / 2;
        if (a[mid] <= key) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return lo;  // index of the last element <= key, or -1 if there is none
}
Best, Average, Worst T(n) ?
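Because binary_search returns the position of the last element that is <= key, the caller still has to check whether that element equals the key. A minimal sketch of such a check follows; the wrapper name contains is hypothetical and not part of the slides.

```c
#include <assert.h>
#include <stdbool.h>

/* Binary search as on the slide: returns the index of the last element
   that is <= key, or -1 if every element is greater than key. */
int binary_search(int n, const int *a, int key) {
    int lo = -1;
    int hi = n;
    assert(n >= 0);
    while (hi - lo != 1) {
        int mid = lo + (hi - lo) / 2;   /* same midpoint, avoids overflow */
        if (a[mid] <= key) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return lo;
}

/* Hypothetical wrapper: true only if key is actually present in a. */
bool contains(int n, const int *a, int key) {
    int pos = binary_search(n, a, key);
    return pos >= 0 && a[pos] == key;
}
```

For the sorted array {1, 3, 5, 7}, binary_search with key 4 returns index 1 (the element 3), so contains reports that 4 is absent.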
10
How about BST ?
Best O(1)
Average O(logn)
Worst O(n) – very imbalanced (tree degenerates to list)
11
Answer: Hash Tables
Search complexity is O(1) with “good” hash function
Hash Table: A generalization of an array that under
some assumptions allows O(1) for Insert/Delete/Search
12
Intuition
How can you store all Student Numbers in an array?
• Use an array with range 0 - 999,999,999
• This will give you O(1) access time but … considering there are approx. 5000 students you waste lots of array entries!
Problem: The range of key values is too large
(0-999,999,999) when compared to the # of keys (students)
13
Formal Definition
Hash Tables solve this problem by using a smaller array and mapping keys with a hash function.
Set of keys K and an array of size m. A hash function h is a function from K to {0, …, m-1}, that is:
h : K → {0, …, m-1}
15
Example Hash Function
For example, if we hash the student number keys into a hash
table with 8 entries we could use h (key) = key mod 8
[Figure: an 8-slot table (indices 0 to 7) with the keys 888999222 and 123456789 mapped into it by h]
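The mapping h(key) = key mod 8 can be sketched in a few lines of C. The function name hash_mod and its parameter types are assumptions made for this illustration.

```c
#include <assert.h>

/* Division-method hash from the slide: maps a key to a slot in 0..m-1. */
unsigned hash_mod(unsigned long key, unsigned m) {
    assert(m > 0);
    return (unsigned)(key % m);
}
```

With m = 8, hash_mod(888999222UL, 8) gives 6 and hash_mod(123456789UL, 8) gives 5, so these two student numbers land in different slots.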
16
Problem ?
Collisions: Two keys hash into the same array entry
h (888888888) = h (000000000) = key % 8 = 0
[Figure: an 8-slot table (indices 0 to 7) with the keys 888999222 and 123456789]
17
Solution
• Hashing with Chaining (Open Hashing): every hash table entry contains a pointer to a linked list of the keys that hash to that entry
• Closed Hashing: every hash table entry contains only one key. If a new key hashes to a table entry that is already filled, systematically examine other table entries until you find an empty one in which to place the new key
18
Hashing with Chaining (Open Hashing)
h (54) = 54 % 5 = 4 = h (34) – solved by CHAIN-ing
[Figure: table with 5 slots; each chain node has a key and a next pointer]
0: (empty)
1: 21
2: 2
3: (empty)
4: 54 → 34 (CHAIN)
19
Hashing with Chaining
[Figure: table with 5 slots; each chain node has a key and a next pointer]
0: (empty)
1: 21
2: 2
3: (empty)
4: 54 → 34 (CHAIN)
Insert 101 – where does it hash to ?
20
Hashing with Chaining
h (101) = 101 % 5 = 1
Insert 101
[Figure: the 5-slot table before and after the insert; 101 joins the chain in slot 1]
0: (empty)
1: 21 and 101 (chained)
2: 2
3: (empty)
4: 54 → 34 (CHAIN)
21
Complexity Analysis
What is the running time to insert/search/delete?
• Insert: It takes O(1) time to compute the hash function and insert at the head of the linked list
• Search: It is proportional to the maximum linked list length
• Delete: Same as search
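The three operations can be sketched in C as follows. This is a minimal illustration rather than the lecture's own implementation: the names chain_hash, chain_table, chain_insert, chain_search and chain_delete are hypothetical, keys are assumed to be non-negative ints, and the table size of 5 is taken from the chaining example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define M 5                        /* table size used in the chaining example */

typedef struct Node {
    int key;
    struct Node *next;
} Node;

Node *chain_table[M];              /* all chains start out empty (NULL) */

int chain_hash(int key) {
    assert(key >= 0);              /* assumption: non-negative keys only */
    return key % M;
}

/* Insert at the head of the chain: O(1). Returns false if malloc fails. */
bool chain_insert(int key) {
    Node *node = malloc(sizeof *node);
    if (node == NULL) return false;
    int slot = chain_hash(key);
    node->key = key;
    node->next = chain_table[slot];
    chain_table[slot] = node;
    return true;
}

/* Search walks one chain: time proportional to that chain's length. */
bool chain_search(int key) {
    for (Node *p = chain_table[chain_hash(key)]; p != NULL; p = p->next) {
        if (p->key == key) return true;
    }
    return false;
}

/* Delete is a search followed by unlinking the node that was found. */
bool chain_delete(int key) {
    Node **link = &chain_table[chain_hash(key)];
    while (*link != NULL && (*link)->key != key) {
        link = &(*link)->next;
    }
    if (*link == NULL) return false;   /* key not present */
    Node *victim = *link;
    *link = victim->next;
    free(victim);
    return true;
}
```

Inserting at the head keeps insertion at O(1) regardless of chain length, at the cost of not detecting duplicate keys; a version that rejects duplicates would have to search the chain first.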
22
What is a “good” hash ?
uniform hashing: each key is equally likely to hash in any of the m slots
• Creating a “good” hash function is black magic !
How about when keys are student names ?
Interpret characters as numbers:
• (int)'a', (int)'b', (int)'c' means 97 98 99
• Ex. Hash for names: Name "abc" hashes to ('a' + 'b' + 'c') % m
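The name hash above can be written as a short C function. The name hash_name is an assumption for this sketch, and character codes are assumed to be ASCII.

```c
#include <assert.h>

/* Sum-of-characters hash from the slide: ('a' + 'b' + 'c') % m for "abc". */
unsigned hash_name(const char *name, unsigned m) {
    unsigned sum = 0;
    assert(m > 0);
    for (const char *p = name; *p != '\0'; p++) {
        sum += (unsigned char)*p;      /* 'a' is 97 in ASCII */
    }
    return sum % m;
}
```

For "abc" the sum is 97 + 98 + 99 = 294, so with m = 8 the name hashes to slot 6. Any rearrangement of the same letters, such as "cab", produces the same sum and therefore collides, which is one reason this simple function is far from uniform hashing.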
25
Closed Hashing
The key is first mapped to a slot:
index = h(k)
If there is a collision, the following slots are probed one after another, so collision resolution is done as a linear search. This is known as linear probing.
index = (index + 1) % m
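The probe rule index = (index + 1) % m can be sketched as a small closed hash table in C. The table size of 5, the EMPTY marker and the function names are assumptions chosen for illustration, and the sketch does not check for duplicate keys.

```c
#include <assert.h>
#include <stdbool.h>

#define M 5             /* hypothetical table size for this sketch */
#define EMPTY (-1)      /* marker for an unused slot; keys must be >= 0 */

int slots[M] = { EMPTY, EMPTY, EMPTY, EMPTY, EMPTY };

/* Returns the slot where key was stored, or -1 if the table is full. */
int probe_insert(int key) {
    assert(key >= 0);
    int index = key % M;                 /* index = h(k), the home slot */
    for (int i = 0; i < M; i++) {
        if (slots[index] == EMPTY) {
            slots[index] = key;
            return index;
        }
        index = (index + 1) % M;         /* linear probing step */
    }
    return -1;                           /* every slot was occupied */
}

/* Follows the same probe sequence and stops at the first empty slot. */
bool probe_search(int key) {
    assert(key >= 0);
    int index = key % M;
    for (int i = 0; i < M && slots[index] != EMPTY; i++) {
        if (slots[index] == key) return true;
        index = (index + 1) % M;
    }
    return false;
}
```

Inserting 2, 21, 54, 34 and 101 in that order places them in slots 2, 1, 4, 0 and 3: key 34 finds its home slot 4 taken and wraps around to slot 0, while 101 probes slots 1 and 2 before settling in slot 3.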
26
Closed Hashing with Linear Probing
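[Figure: hash table with 11 slots, indices 0 to 10]
Slots 0-2 (occupied): 1001, 9537, 3016
Slots 3-6: empty
Slots 7-9 (occupied): 9874, 9875, 2009
Slot 10: empty
H(k) = k % 11
Insert(1100) ?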
29
Closed Hashing with Linear Probing
[Figure: 11-slot table, indices 0 to 10; slots 0-2 and 7-9 are occupied]
H(k) = k % 11
Insert(1100) → 3
The same holds for any key that hashes to 0, 1, 2 or 3
Prob(insert_into_3) = ?
30
Closed Hashing with Linear Probing
[Figure: 11-slot table, indices 0 to 10; slots 0-2 and 7-9 are occupied]
H(k) = k % 11
Insert(1100) → 3
The same holds for any key that hashes to 0, 1, 2 or 3
Prob(insert_into_3) = 4/11
31
Closed Hashing with Linear Probing
[Figure: 11-slot table, indices 0 to 10; slots 0-2 and 7-9 are occupied]
H(k) = k % 11
Insert(1100) → 3
The same holds for any key that hashes to 0, 1, 2 or 3
Prob(insert_into_4) = 1/11
Prob(insert_into_3) = 4/11
32
Closed Hashing with Linear Probing
[Figure: 11-slot table, indices 0 to 10; slots 0-2 and 7-9 are occupied, and 1052 is placed in slot 10]
H(k) = k % 11
Assume: Insert(1052) → 10
Prob(insert_into_4) = ?
Prob(insert_into_3) = ?
33
Closed Hashing with Linear Probing
[Figure: 11-slot table, indices 0 to 10; slots 0-2 and 7-9 are occupied, and 1052 is placed in slot 10]
H(k) = k % 11
Assume: Insert(1052) → 10
Prob(insert_into_4) = 1/11
Prob(insert_into_3) = 8/11
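These probabilities can be checked by counting, for an empty slot, how many home positions lead to it under linear probing: the slot itself plus every occupied slot in the run immediately before it. The sketch below assumes uniform hashing (each home position has probability 1/m); the function name homes_leading_to is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Number of home positions from which linear probing ends in `target`.
   Under uniform hashing, the probability that the next insert fills
   `target` is this count divided by m. */
int homes_leading_to(const bool *occupied, int m, int target) {
    assert(m > 0 && target >= 0 && target < m && !occupied[target]);
    int count = 1;                            /* target is its own home */
    int i = (target - 1 + m) % m;
    while (i != target && occupied[i]) {      /* walk back over the cluster */
        count++;
        i = (i - 1 + m) % m;                  /* wraps from slot 0 to m-1 */
    }
    return count;
}
```

With slots 0-2 and 7-9 occupied, the count for slot 3 is 4 and for slot 4 is 1, matching 4/11 and 1/11. Once slot 10 is also occupied, the cluster 7-10 wraps around and joins 0-2, so the count for slot 3 rises to 8, matching 8/11.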
34
Problem: Clustering
Even with a good hash function, linear probing has its problems:
• The position i_0 of the initial mapping of key k is called the home of k.
• When several insertions map to the same home position, they end up placed contiguously in the table. This collection of keys with the same home position is called a cluster.
• As clusters grow, the probability that a key will map to the middle of a cluster increases, increasing the rate of the cluster’s growth. As these clusters grow, they merge with other clusters forming even bigger clusters which grow even faster.
• This tendency of linear probing to place items together is known as primary clustering.
35
Complexity Analysis – Worst Case
What is the running time to insert/search/delete?
• Insert: Same as search
• Search: It is proportional to the maximum number of probes
• Delete: Same as search
• Worst O(n)
36
Complexity Analysis
When the hash table is empty, an insert takes 1 step (the record goes into its home position)
As the table fills up, the probability that a record can be inserted in 1 step decreases
More and more records are likely to be inserted far from their home position
37
Complexity Analysis - Intuition
The expected (avg.) cost of hash (insert/search/delete)
• is a function of how full the table is
38
The Load Factor
$\alpha = \frac{n}{m}$
n is the number of entries in the hash table that are occupied; m is the size of the hash table
$\alpha = 1$ means the table is full, and $\alpha = 0$ means the table is empty.
39
Complexity Analysis - Average Case
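The load factor is $\alpha = \frac{n}{m}$, where n is the current number of records
On average, the probability of finding the home position occupied is $\frac{n}{m} = \alpha$
The probability of finding both the home position and the next position occupied is $\frac{n}{m} \cdot \frac{n-1}{m-1}$
The probability of i collisions is:
• $\frac{n}{m} \cdot \frac{n-1}{m-1} \cdots \frac{n-i+1}{m-i+1} \approx \left(\frac{n}{m}\right)^{i}$
• probes $= 1 + \sum_{i=1}^{N} \left(\frac{n}{m}\right)^{i}$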
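The sum for the number of probes is a geometric series in the load factor. As a sketch, assuming the approximation $(n/m)^i = \alpha^i$ holds for every term and letting the upper limit go to infinity (which gives an upper bound), it has a simple closed form:

```latex
\text{probes} \approx 1 + \sum_{i=1}^{\infty} \alpha^{i}
  = \sum_{i=0}^{\infty} \alpha^{i}
  = \frac{1}{1-\alpha}, \qquad 0 \le \alpha < 1 .
```

For a half-full table ($\alpha = 0.5$) this gives about 2 probes, and for $\alpha = 0.9$ about 10, which shows how quickly the cost rises as the table fills up.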
40
Complexity Analysis Average Case
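It can be shown that the average number of probes in a successful search, C, and in an unsuccessful search, C', is approximately:
Separate chaining: $C \approx 1 + \frac{\alpha}{2}$, $C' \approx 1 + \alpha$ (the slot access plus an expected chain length of $\alpha$)
Linear probing: $C \approx \frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right)$, $C' \approx \frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^{2}}\right)$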
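For linear probing, the commonly quoted approximations (due to Knuth) are C ≈ (1/2)(1 + 1/(1 - α)) for a successful search and C' ≈ (1/2)(1 + 1/(1 - α)^2) for an unsuccessful one. The sketch below evaluates them, assuming these are the formulas the slide intends; the function names are hypothetical.

```c
#include <assert.h>

/* Approximate average probes for a successful search under linear
   probing at load factor alpha, valid for 0 <= alpha < 1. */
double lp_successful(double alpha) {
    assert(alpha >= 0.0 && alpha < 1.0);
    return 0.5 * (1.0 + 1.0 / (1.0 - alpha));
}

/* Approximate average probes for an unsuccessful search (or an insert). */
double lp_unsuccessful(double alpha) {
    assert(alpha >= 0.0 && alpha < 1.0);
    double d = 1.0 - alpha;
    return 0.5 * (1.0 + 1.0 / (d * d));
}
```

At α = 0.5 the two values are 1.5 and 2.5 probes; at α = 0.9 they grow to roughly 5.5 and 50.5, which is why closed hash tables are usually kept well below full.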
41
[Figure: Successful search. Average number of probes (y-axis, 2 to 20) versus load factor (x-axis, 0 to 1), with curves for linear probing, double hashing and separate chaining]
42
[Figure: Unsuccessful search. Average number of probes (y-axis, 2 to 20) versus load factor (x-axis, 0 to 1), with curves for linear probing, double hashing and separate chaining]
43
Insert Implementation
bool HashTable::hashInsert(const Elem &e) {
  int home = h(getkey(e));                      // home position of e
  int index = home;
  for (int i = 1; !is_empty(HT[index]); i++) {
    if (is_equal(e, HT[index])) return false;   // duplicate
    if (i >= m) return false;                   // all m slots probed: table is full
    index = (home + i) % m;                     // follow probes
  }
  HT[index] = e;                                // first empty slot on the probe path
  return true;
}
44
Search Implementation
bool HashTable::hashSearch(const Key &k, Elem &e) {
  int home = h(k);                              // home position of k
  int index = home;
  // Stop at the first empty slot, or after all m slots have been probed
  for (int i = 1; i <= m && !is_empty(HT[index]); i++) {
    if (is_equal(k, HT[index])) {               // found it
      e = HT[index];
      return true;
    }
    index = (home + i) % m;                     // follow probes
  }
  return false;                                 // k is not in the table
}