1
Searching and Hashing
2
Concepts This Lecture
Searching an array Linear search Binary search Comparing algorithm performance
3
Searching
Searching = looking for something Searching an array is particularly common
Goal: determine if a particular value is in the array
We'll see that more than one algorithm will work
4
Searching Algorithms
The algorithm a human uses to find a number in a phone book is practical and efficient for people, but not so good for computers: it is not precise, and it is not consistent
Let's imagine another scenario. Suppose that you have a pile of cards containing names of customers. They are not organized in any particular way. You want to find the card with the name Sarah (your key)
The procedure you will likely use is: look at each card's key (one by one) until one matches your target. This is an algorithm, and it is called Linear Search
5
Searching as a Function
Specification: Let b be the array to be searched, n be the size of the array, and x be the value being searched for. If x appears in b[0..n-1], return its index, i.e., return k such that b[k]==x. If x is not found, return -1
None of the parameters are changed by the function
Function outline:
void Lookup (const int vec[ ], int vSize, int key, Boolean& found, int& loc) {
...}
6
Linear Search Algorithm: start at the beginning of the array and examine each
element until x is found, or all elements have been examined
void Lookup (const int vec[ ], int vSize, int key, Boolean& found, int& loc) {
loc = 0;
while (loc < vSize && vec[loc] != key)
loc++;
found = (loc < vSize);
}
7
Linear Search
Test: search(v, 8, 6)
b: 3 12 -5 6 142 21 -17 45
Found It!
8
Linear Search
Test: search(v, 8, 15)
b: 3 12 -5 6 142 21 -17 45
Ran off the end! Not found.
9
Linear Search
Note: The loop condition is written so vec[loc] is not accessed if loc >= vSize.
while ( loc < vSize && vec[loc] != key )
(Why is this true? Why does it matter?)
b: 3 12 -5 6 142 21 -17 45
10
Write a Recursive Linear Search
NodeType* linearSearch(NodeType *start, int target) {
  if (start == NULL) return NULL;          // must test for end of list before dereferencing
  if (start->key == target) return start;  // found
  return linearSearch(start->next, target);
}
11
Linear Search-Linked List
for each item in the list
  if the item's key matches the target
    stop and report "success"
report failure
12
Linear Search (target = 9)
[diagram: head -> 5 -> 12 -> 9 -> / ; examining node 5]
13
Linear Search (target = 9)
[diagram: head -> 5 -> 12 -> 9 -> / ; examining node 12]
14
Linear Search (target = 9)
[diagram: head -> 5 -> 12 -> 9 -> / ; examining node 9: found]
15
Linear Search (target = n)
NodeType* linearSearch(NodeType *start, int target) {
  NodeType *temp = start;
  while (temp != NULL) {
    if (temp->key == target) return temp;
    temp = temp->next;
  }
  return NULL;
}
16
Analyzing Linear Search
Best case analysis: the element is found in the first position of the list, which means that we do one comparison: O(1)
Worst case analysis: the element is not present in the list. This means we do n comparisons, where n is the size of the list; we have to go through the whole list to be sure whether the element is present: O(N)
Average case analysis: the search key can be found anywhere in the list. If we "run" the algorithm for each possibility where the key may appear, we get: (1+2+...+vSize)/vSize = (vSize*(vSize+1)/2)/vSize = (vSize+1)/2 = O(N)
17
Can we do better?
Time needed for linear search is proportional to the size of the array.
An alternate algorithm, "binary search," works if the array is sorted:
1. Look for the target in the middle.
2. If you don't find it, you can ignore half of the array, and repeat the process with the other half.
Example: Find first page of pizza listings in the yellow pages
19
Binary Search
In some cases, you get a list which is already ordered. In this case we can use algorithms that take this into consideration
The idea of binary search is:
Split the list in two halves and compare the target with the key in the middle of the list
Based on this comparison we can tell which half of the list may contain the target
Binary search eliminates half of the list at each iteration
It requires direct access to the list elements
20
Binary Search Strategy
What we want: find the split between values larger and smaller than x:
b: [0 .. L] <= x | [L+1 .. n-1] > x
Situation while searching:
b: [0 .. L] <= x | [L+1 .. R-1] ? | [R .. n-1] > x
Step: Look at b[(L+R)/2]. Move L or R to the middle depending on the test.
21
Binary Search Strategy
More precisely
Values in b[0..L] <= x
Values in b[R..n-1] > x
Values in b[L+1..R-1] are unknown
22
Binary Search: Iterative Approach
/* If x appears in b[0..n-1], return its location, i.e., return k so that b[k]==x. If x not found, return -1 */
NodeType binarySearch(NodeType list[], int size, int target) {
  int front, back, mid;
  ___________________ ;
  while ( _______________ ) {
  }
  _________________ ;
}
23
/* If x appears in b[0..n-1], return its location, i.e., return k so that b[k]==x. If x not found, return -1 */
NodeType binarySearch(NodeType list[], int size, int target) {
  int front, back, mid;
  ___________________ ;
  while ( _______________ ) {
    mid = (front+back)/2;
    if (target == list[mid].key) return list[mid];
    else if (target < list[mid].key) back = mid-1;
  }
  _________________ ;
}
Binary Search: Iterative Approach
24
Loop Termination
/* If x appears in b[0..n-1], return its location, i.e., return k so that b[k]==x. If x not found, return -1 */
NodeType binarySearch(NodeType list[], int size, int target) {
  int front, back, mid;
  ___________________ ;
  while (front <= back) {
    mid = (front+back)/2;
    if (target == list[mid].key) return list[mid];
    else if (target < list[mid].key) back = mid-1;
    else front = mid+1;
  }
  _________________ ;
}
25
/* If x appears in b[0..n-1], return its location, i.e., return k so that b[k]==x. If x not found, return -1 */
NodeType binarySearch(NodeType list[], int size, int target) {
  int front(0);
  int back(size-1);
  int mid;
  while (front <= back) {
    mid = (front+back)/2;
    if (target == list[mid].key) return list[mid];
    else if (target < list[mid].key) back = mid-1;
    else front = mid+1;
  }
  _________________ ;
}
Initialization
26
NodeType binarySearch(NodeType list[], int size, int target) {
  int front(0);
  int back(size-1);
  int mid;
  while (front <= back) {
    mid = (front+back)/2;
    if (target == list[mid].key) return list[mid];
    else if (target < list[mid].key) back = mid-1;
    else front = mid+1;
  }
  return NULL;  // indicates target was not found
}
Return Result
27
Binary Search
Test: bsearch(v,8,3);
b: -17 -5 3 6 12 21 45 142   (indices 0..7)
while (front <= back) {
  mid = (front+back)/2;
  if (target == list[mid].key) return list[mid];
  else if (target < list[mid].key) back = mid-1;
  else front = mid+1;
}
[trace: mid=3 (6 > 3, move R), mid=1 (-5 < 3, move L), mid=2 (3 == 3): found]
28
Binary Search
Test: bsearch(v,8,17);
b: -17 -5 3 6 12 21 45 142   (indices 0..7)
while (front <= back) {
  mid = (front+back)/2;
  if (target == list[mid].key) return list[mid];
  else if (target < list[mid].key) back = mid-1;
  else front = mid+1;
}
[trace: mid=3 (6 < 17), mid=5 (21 > 17), mid=4 (12 < 17), front > back: not found]
29
Binary Search
Test: bsearch(v,8,143);
b: -17 -5 3 6 12 21 45 142   (indices 0..7)
while (front <= back) {
  mid = (front+back)/2;
  if (target == list[mid].key) return list[mid];
  else if (target < list[mid].key) back = mid-1;
  else front = mid+1;
}
[trace: L keeps moving right past 142; front > back: not found]
30
Binary Search
Test: bsearch(v,8,-143);
b: -17 -5 3 6 12 21 45 142   (indices 0..7)
while (front <= back) {
  mid = (front+back)/2;
  if (target == list[mid].key) return list[mid];
  else if (target < list[mid].key) back = mid-1;
  else front = mid+1;
}
[trace: R keeps moving left past -17; back < front: not found]
31
Binary Search (target = n)
NodeType binarySearch(NodeType list[], int size, int target) {
  int front(0);
  int back(size-1);
  int mid;
  while (front <= back) {
    mid = (front+back)/2;
    if (target == list[mid].key) return list[mid];
    else if (target < list[mid].key) back = mid-1;
    else front = mid+1;
  }
  return NULL;  // indicates target was not found
}
32
Binary Search (target = 7)
4 6 7 12 18 22 23 28 30
front(0), back(8)
33
Binary Search (target = 7)
4 6 7 12 18 22 23 28 30
front(0), back(8), mid(4)
34
Binary Search (target = 7)
4 6 7 12 18 22 23 28 30
front(0), back(3), mid(4)
35
Binary Search (target = 7)
4 6 7 12 18 22 23 28 30
front(0), back(3), mid(1)
36
Binary Search (target = 7)
4 6 7 12 18 22 23 28 30
front(2), back(3), mid(1)
37
Binary Search (target = 7)
4 6 7 12 18 22 23 28 30
front(2), back(3), mid(1)
38
Is it worth the trouble?
Suppose you had 1000 elements
Ordinary (linear) search would require maybe 500 comparisons on average
Binary search:
after the 1st compare, throw away half, leaving 500 elements to be searched.
after the 2nd compare, throw away half, leaving 250. Then 125, 63, 32, 16, 8, 4, 2, 1 are left.
After at most 10 steps, you're done!
What if you had 1,000,000 elements??
39
How Fast Is It?
Another way to look at it: How big an array can you search if you examine a given number of array elements?
# comps Array size
1 1
2 2
3 4
4 8
5 16
6 32
7 64
8 128
… …
11 1,024
… …
21 1,048,576
40
List size   Loop iterations
1           1
3           2
7           3
15          4
31          5
63          6
127         7
Analyzing Binary Search
We only need to concentrate on the main loop
The loop is different from the linear search because its number of executions is not a multiple of n (the list size)
We can easily see that the size of the input is halved in each iteration.
This should already give a "hint" of which function describes this algorithm, but let's use a table
The table shows that the number of iterations grows proportionally to the logarithm base 2 of the size of the list: O(log n)
41
Time for Binary Search
Key observation for binary search: the size of the array n that can be searched with k comparisons: n ~ 2^k
Number of comparisons k as a function of array size n: k ~ log2 n
This is fundamentally faster than linear search (where k ~ n)
42
Write a Recursive Binary Search Function BinarySearch( )
BinarySearch takes a sorted array vec, two subscripts, fromLoc and toLoc, and a key as arguments. It returns false if key is not found in the elements vec[fromLoc…toLoc]. Otherwise, it returns true.
BinarySearch is O(log2N).
43
found = BinarySearch(vec, 25, 0, 14 );
(key = 25, fromLoc = 0, toLoc = 14)
indexes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
vec:     0 2 4 6 8 10 12 14 16 18 20 22 24 26 28
remaining range after 1st probe: 16 18 20 22 24 26 28
remaining range after 2nd probe: 24 26 28
remaining range after 3rd probe: 24
NOTE: the rows above denote the elements still being examined
44
Recursive Binary Search -- basic idea
• This is an example of a recursive function where arguments are halved.
Given: a sorted array a of values (integers, strings, ..) from range [s,t]
Task: search if a value x is in the array. If yes, return position, otherwise -1.
45
Recursive Binary Search -- basic idea
• Consider how you search for a name in a phone book: you don't use linear search (otherwise it would take ages to find a name starting with Z).
• instead, you open the book somewhere, and then continue searching in the half that contains the name then open up somewhere in that half, and continue searching in the portion that contains the name, etc.
46
Now let's do this for a sorted array of integers, but let's always check the middle of the remaining range. Example: search for 7 in the following array
indices: 0 1 2 3 4 5 6 7 8 9
array:   2 5 7 11 17 24 31 38 40 41   (low = 0, high = 9)
mid = (0+9)/2 == 4; 7 < a[4], so look in lower half:
2 5 7 11   (low = 0, high = 3)
mid = (0+3)/2 == 1; 7 > a[1], so look in upper half:
7 11   (low = 2, high = 3)
mid = (2+3)/2 == 2; 7 == a[2], found!
Recursive Binary Search -- basic idea
47
Recursive Binary Search -- basic idea
• Example: the array contains 3,5. Search for 4. (0+1)/2 is 0 (integer division), so if we don't exclude mid, the subarray starts again at index 0 and ends at 1 => an infinite number of recursive calls. In the code on the next page, mid is excluded from the subarray to prevent this.
48
• Let's think about the design of the recursive function before coding it:
1. recursive calls: call the function with the half of the current subrange that contains x
   Define the subrange with a start and an end index
2. base case: when should the recursive calls stop? When we find x
• What if x is not in the array? -- stop at a single cell that does not contain x
  Check: does the (start + end)/2 procedure always end in an array of length 1? A: it depends on how you implement it. You must ensure that the array gets smaller by at least 1.
Recursive Binary Search -- basic idea
Boolean BinarySearch ( int vec[ ] , int key , int fromLoc , int toLoc )
// PRE:  vec [ fromLoc . . toLoc ] sorted in ascending order
// POST: FCTVAL == ( key in vec [ fromLoc . . toLoc ] )
{
  int mid ;
  if ( fromLoc > toLoc )            // base case -- not found
    return false ;
  else {
    mid = ( fromLoc + toLoc ) / 2 ;
    if ( vec [ mid ] == key )       // base case -- found at mid
      return true ;
    else if ( key < vec [ mid ] )   // search lower half
      return BinarySearch ( vec, key, fromLoc, mid-1 ) ;
    else                            // search upper half
      return BinarySearch ( vec, key, mid + 1, toLoc ) ;
  }
}
49
Recursive Binary Search
#include <stdio.h>
/* prototype */
int binSearch(int array[], int first, int last, int N);
int main(void){
  int index;
  int value;
  int list[] = {1,2,3,5,6};
  printf("Enter a search value:");
  scanf("%i", &value);
  /* the function binSearch returns the index of the array */
  /* where the match is found, otherwise a -1 */
  index = binSearch(list, 0, 4, value);
  if (index == -1)
    printf("Value not found!\n");
  else
    printf("Value matches the %i element in the array!\n", ++index);
  return 0;
}
/* code continued on next slide */
/* array is the name of the array (or sub-array) to be searched */
/* first is the left-most index of the array being searched */
/* last is the right-most index of the array being searched */
/* N is the value being searched for */
int binSearch(int array[], int first, int last, int N) {
  int midpt;
  if (N < array[first] || N > array[last])
    return -1;
  /* didn't meet our error condition */
  midpt = (first+last)/2;
  if (array[midpt] == N)
    return midpt;
  /* recursive calls */
  else if (array[midpt] > N)
    return binSearch( array, first, midpt - 1, N);
  else
    return binSearch( array, midpt + 1, last, N);
}
52
Note the contents of the "stack" when we execute a call to binSearch from main (some of the details are simplified):
(push) return binSearch( array, 0, 1, 2 );   (first = 0, last = 4)
(push) return binSearch( array, 1, 1, 2 );   (first = 0, last = 1)
(pop)  return 1;   (first = 1, last = 1)
(pop)  return 1;   (first = 0, last = 1)
(pop)  return 1;   (first = 0, last = 4)
Recursive Binary Search
53
Note the contents of the "stack" when we execute a call to binSearch from main (some of the details are simplified):
(push) return binSearch( array, 2+1, 4, 4 );   (first = 0, last = 4)
(pop)  return -1;   (first = 3, last = 4)
(pop)  return -1;   (first = 0, last = 4)
Recursive Binary Search
54
Iteration vs. Recursion
It turns out any iterative algorithm can be reworked to use recursion instead (and vice versa).
There are programming languages where recursion is the only choice(!)
Some algorithms are more naturally written with recursion
But naïve applications of recursion can be inefficient
55
Binary Search
Several comments on binary search:
• Binary search assumes that the elements are sorted. If they are not sorted, you won't know in which half to continue searching.
• Binary search is not a great idea for linked lists, since you can't just jump to the middle element. You'd have to iterate through the list to get there, so you could just as well check for x while you are doing that.
56
Summary
Linear search and binary search are two different algorithms for searching an array
Binary search is vastly more efficient
But binary search only works if the array elements are in order
57
Hashing
58
Tables: rows & columns of information
A table has several fields (types of information)
  A telephone book may have fields: name, address, phone number
  A user account table may have fields: user id, password, home folder
To find an entry in the table, you only need to know the contents of one of the fields (not all of them). This field is the key
  In a telephone book, the key is usually name
  In a user account table, the key is usually user id
Ideally, a key uniquely identifies an entry
  If the key is name and no two entries in the telephone book have the same name, the key uniquely identifies the entries
59
The Table ADT: operations
insert: given a key and an entry, inserts the entry into the table
find: given a key, finds the entry associated with the key
remove: given a key, finds the entry associated with the key, and removes it
Also:
getIterator: returns an iterator, which visits each of the entries one by one (the order may or may not be defined), etc.
60
Table ADT’s
We are familiar with direct access structures and linear access structures.
Both have their advantages and disadvantages
The crucial point against direct access structures is the fact that we need to allocate the size of this structure in advance
  In all likelihood, we tend to overestimate its size and we end up with a very sparse structure
  We tend to think that the actual number of keys to be stored is equivalent to the universe of possible existing keys
In some problems the number of keys to be stored is smaller than the number in the universe of keys. In this case a hash table may save us a lot of space.
61
How should we implement a table?
How often are entries inserted and removed?
How many of the possible key values are likely to be used?
What is the likely pattern of searching for keys? e.g. will most of the accesses be to just one or two key values?
Is the table small enough to fit into memory?
How long will the table exist?
Our choice of representation for the Table ADT depends on the answers to these questions
62
TableNode: a key and its entry
For searching purposes, it is best to store the key and the entry separately (even though the key's value may be inside the entry)
Examples (key / entry):
"Smith" / "Smith", "124 Hawkers Lane", "9675846"
"Yeo" / "Yeo", "1 Apple Crescent", "0044 1970 622455"
63
Implementation 1: unsorted sequential array
An array in which TableNodes are stored consecutively in any order
insert: add to back of array; O(1)
find: search through the keys one at a time, potentially all of the keys; O(n)
remove: find + replace removed node with last node; O(n)
[diagram: array of (key, entry) pairs at indices 0, 1, 2, 3, and so on]
64
Implementation 2: sorted sequential array
An array in which TableNodes are stored consecutively, sorted by key
insert: add in sorted order; O(n)
find: binary search; O(log n)
remove: find, remove node and shuffle down; O(n)
We can use binary search because the array elements are sorted
[diagram: array of (key, entry) pairs at indices 0, 1, 2, 3, and so on]
65
Implementation 3: linked list (unsorted or sorted)
TableNodes are again stored consecutively
insert: add to front; O(1) (or O(n) for a sorted list)
find: search through potentially all the keys, one at a time; O(n) (still O(n) for a sorted list)
remove: find, remove using pointer alterations; O(n)
[diagram: linked list of (key, entry) nodes, and so on]
66
Implementation 5: hashing
An array in which TableNodes are not stored consecutively - their place of storage is calculated using the key and a hash function
Hashed key: the result of applying a hash function to a key
Keys and entries are scattered throughout the array
[diagram: key -> hash function -> array index; entries at indices 4, 10, 123]
67
Implementation 5: hashing
An array in which TableNodes are not stored consecutively - their place of storage is calculated using the key and a hash function
insert: calculate place of storage, insert TableNode; O(1)
find: calculate place of storage, retrieve entry; O(1)
remove: calculate place of storage, set it to null; O(1)
All are O(1)!
[diagram: entries at indices 4, 10, 123]
68
Hash Functions
Hash tables normally maintain the invariant of a direct access structure, which provides O(1) time (constant time) to access an element
With a direct access structure, a key k is normally stored in slot k. In a hash table this element is stored in slot h(k).
h(k) is a hash function. It maps the universe U of keys into the slots of a hash table (smaller than the universe):
h : U --> {0,1,...,m-1}   where m is the size of the table
69
Hashing example: a fruit shop
10 stock details, 10 table positions
Stock numbers are between 0 and 1000
Use hash function: stock no. / 100
table (position: key, entry):
0: 85, apples
3: 323, guava
4: 462, pears
9: 912, papaya
What if we now insert stock no. 350 (oranges)?
Position 3 is occupied: there is a collision
Collision resolution strategy: insert in the next free position (linear probing)
Given a stock number, we find stock by using the hash function again, and use the collision resolution strategy if necessary
70
Pictorial view of Hash Tables
[diagram: keys k1..k4 from the universe U mapped by h into table slots]
71
Pictorial view of Hash Tables
[diagram: a fifth key k5 added to the same table]
72
Three factors affecting the performance of hashing
The hash function
  Ideally, it should distribute keys and entries evenly throughout the table
  It should minimise collisions, where the position given by the hash function is already occupied
The collision resolution strategy
  Separate chaining: chain together several keys/entries in each position
  Open addressing: store the key/entry in a different position
The size of the table
  Too big will waste memory; too small will increase collisions and may eventually force rehashing (copying into a larger table)
  Should be appropriate for the hash function used, and a prime number is best
73
Choosing a hash function:turning a key into a table position
Truncation
  Ignore part of the key and use the rest as the array index (converting non-numeric parts)
  A fast technique, but check for an even distribution throughout the table
Folding
  Partition the key into several parts and then combine them in any convenient way
  Unlike truncation, uses information from the whole key
Modular arithmetic (used by truncation & folding, and on its own)
  To keep the calculated table position within the table, divide the position by the size of the table, and take the remainder as the new position
74
Examples of hash functions (1)
Truncation: If students have a 9-digit identification number, take the last 3 digits as the table position e.g. 925371622 becomes 622
Folding: Split a 9-digit number into three 3-digit numbers, and add them e.g. 925371622 becomes 925 + 371 + 622 = 1918
Modular arithmetic: If the table size is 1000, the first example always keeps within the table range, but the second example does not (it should be mod 1000) e.g. 1918 mod 1000 = 918 (in Java: 1918 % 1000)
75
Examples of hash functions (2)
Using a telephone number as a key
  The area code is not random, so it will not spread the keys/entries evenly through the table (many collisions)
  The last 3 digits are more random
Using a name as a key
  Use the full name rather than the surname (a surname is not particularly random)
  Assign numbers to the characters (e.g. a = 1, b = 2; or use Unicode values)
  Strategy 1: Add the resulting numbers. Bad for large table sizes.
  Strategy 2: Call the number of possible characters c (e.g. c = 54 for the alphabet in upper and lower case, plus space and hyphen). Then multiply each character in the name by increasing powers of c, and add together.
76
What is a Hash Table ?
The simplest kind of hash table is an array of records.
This example has 701 records.
[diagram: an array of records, slots [0] through [700]]
77
What is a Hash Table ?
Each record has a special field, called its key.
In this example, the key is a long integer field called Number.
[diagram: slot [4] holds a record whose key field Number is 506643548]
78
What is a Hash Table ?
The number might be a person's identification number, and the rest of the record has information about the person.
[diagram: the record at [4] with Number 506643548]
79
What is a Hash Table ?
When a hash table is in use, some spots contain valid records, and other spots are "empty".
[diagram: slots [0]..[700]; four valid records (Numbers 506643548, 233667136, 281942902, 155778322), other slots empty]
80
Inserting a New Record
In order to insert a new record, the key must somehow be converted to an array index.
The index is called the hash value of the key.
[diagram: the table as before; a new record with Number 580625685 is to be inserted]
81
Inserting a New Record
Typical way to create a hash value: (Number mod 701)
What is (580625685 mod 701) ?
[diagram: the table; new record 580625685 awaiting insertion]
82
Inserting a New Record
Typical way to create a hash value: (Number mod 701)
What is (580625685 mod 701) ?  3
[diagram: the table; record 580625685 awaiting insertion]
83
Inserting a New Record
The hash value is used for the location of the new record.
[diagram: record 580625685 placed at index [3]]
84
Inserting a New Record
The hash value is used for the location of the new record.
[diagram: the table now contains 580625685 at index [3]]
85
Collisions
Here is another new record to insert, with a hash value of 2.
[diagram: new record 701466868 says: "My hash value is [2]"]
86
Collisions
This is called a collision, because there is already another valid record at [2].
When a collision occurs, move forward until you find an empty spot.
[diagram: record 701466868 probing forward from index [2]]
89
Collisions
This is called a collision, because there is already another valid record at [2].
The new record goes in the empty spot.
[diagram: record 701466868 placed in the first empty slot after [2]]
90
A Quiz
Where would you be placed in this table, if there is no collision? Use your social security number or some other favorite number.
[diagram: the table with six records inserted]
91
Searching for a Key
The data that's attached to a key can be found fairly quickly.
[diagram: searching for key 701466868]
92
Searching for a Key
Calculate the hash value. Check that location of the array for the key.
[diagram: hash value is [2]; the record at [2] says "Not me"]
93
Searching for a Key
Keep moving forward until you find the key, or you reach an empty spot.
[diagram: probing forward; still "Not me"]
95
Searching for a Key
Keep moving forward until you find the key, or you reach an empty spot.
[diagram: probing forward; the record matches: "Yes!"]
96
Searching for a Key
When the item is found, the information can be copied to the necessary location.
[diagram: the matching record found at its probed location]
97
Deleting a Record
Records may also be deleted from a hash table.
[diagram: a record asks: "Please delete me"]
98
Deleting a Record
Records may also be deleted from a hash table. But the location must not be left as an ordinary "empty spot" since that could interfere with searches.
[diagram: the deleted record's slot]
99
Deleting a Record
[diagram: the deleted slot marked with a special value]
Records may also be deleted from a hash table. But the location must not be left as an ordinary "empty spot" since that could interfere with searches. The location must be marked in some special way so that a search can tell that the spot used to have something in it.
100
Using a hash function
[table `values`, indices [0]..[99]: some slots hold four-digit part numbers (4501, 8903, 7803, 2298, 3699, ...), the rest are Empty]
HandyParts company makes no more than 100 different parts. But the parts all have four-digit numbers.
This hash function can be used to store and retrieve parts in an array:
Hash(key) = partNum % 100
101
Placing elements in the array
Use the hash function
Hash(key) = partNum % 100
to place the element with part number 5502 in the array.
[table: slot [2] is empty, so 5502 goes to index [2]]
102
Placing elements in the array
Next place part number 6702 in the array.
Hash(key) = partNum % 100
6702 % 100 = 2
But values[2] is already occupied. COLLISION OCCURS
[table: [1] 4501, [2] 5502, [3] 8903, ...]
103
How to resolve the collision?
One way is by linear probing. This uses the rehash function
(HashValue + 1) % 100
repeatedly until an empty location is found for part number 6702.
(Diagram: the same array, with values[2] occupied by 5502.)
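The placement of 6702 by linear probing can be sketched as follows. The function and array names (place, values) and the use of -1 to mark an empty slot are illustrative assumptions; only the hash function and the (HashValue + 1) % 100 rehash step come from the slides.

```cpp
#include <cassert>

const int SIZE = 100;
const int EMPTY_SLOT = -1;         // assumption: -1 marks an empty slot

int hashPart(int partNum) { return partNum % 100; }

// Insert with linear probing: on collision, apply (hashValue + 1) % 100
// repeatedly until an empty slot is found. Returns the index used,
// or -1 if the table is full.
int place(int values[], int partNum) {
    int h = hashPart(partNum);
    for (int probes = 0; probes < SIZE; ++probes) {
        if (values[h] == EMPTY_SLOT) { values[h] = partNum; return h; }
        h = (h + 1) % SIZE;        // the rehash step from the slide
    }
    return -1;
}
```

With slots [1], [2] and [3] occupied as in the diagrams, part 6702 hashes to 2, probes 3, and lands at index 4, matching the slides.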
104
Resolving the collision
Still looking for a place for 6702, using the function
(HashValue + 1) % 100
(Diagram: the probe moves on from values[3], which is also occupied.)
105
Collision resolved
Part 6702 can be placed at the location with index 4.
(Diagram: values[4] is Empty.)
106
Collision resolved
Part 6702 is placed at the location with index 4.
Where would the part with number 4598 be placed using linear probing?
(Diagram: values[4] now holds 6702.)
107
Choosing the table size to minimise collisions
As the number of elements in the table increases, the likelihood of a collision increases, so make the table as large as practical.
If the table size is 100 and all the hashed keys are divisible by 10, there will be many collisions! This is particularly bad if the table size is a power of a small integer such as 2 or 10.
More generally, collisions may be more frequent if gcd(hashed keys, table size) > 1. Therefore, make the table size a prime number (gcd = 1).
Collisions may still happen, so we need a collision resolution strategy
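A quick experiment illustrates why a prime table size helps when the hashed keys share a factor with the table size. The helper name distinctSlots and the choice of keys are illustrative:

```cpp
#include <cassert>
#include <set>

// Count how many distinct slots the keys 10, 20, ..., 1000 occupy
// for a given table size.
int distinctSlots(int tableSize) {
    std::set<int> slots;
    for (int key = 10; key <= 1000; key += 10)
        slots.insert(key % tableSize);
    return static_cast<int>(slots.size());
}
```

With table size 100, the hundred keys pile into only ten slots; with the prime 101, the same keys spread over one hundred distinct slots.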
108
Collision resolution techniques
We will review a simple technique called chaining. However, some argue against this approach and point to other techniques, such as:
Linear probing: very simple. If position h(key) is occupied, do a linear search in the table until you find an empty slot. The slots are searched in this order: h(key), h(key)+1, h(key)+2, ..., h(key)+c
Quadratic probing: a variant of the above where the term being added to the hash result is squared: h(key) + c²
Random probing: another variant where the term being added to the hash result is a random number: h(key) + random()
Rehashing: a technique where a sequence of hash functions (h1, h2, ..., hk) is defined. If a collision occurs, the functions are applied in this order.
109
Collision resolution: open addressing (1)
Probing: if the table position given by the hashed key is already occupied, increase the position by some amount until an empty position is found.
Linear probing: increase by 1 each time [mod table size!]
Quadratic probing: to the original position, add 1, 4, 9, 16, ...
Use the collision resolution strategy both when inserting and when finding (ensure that the search key and the found key match).
May also double hash: instead of linear probing's fixed step, the probe step is the result of another hash function.
With open addressing, the table size should be double the expected number of elements.
110
Clustering
One problem with linear probing is that it results in clustering.
Clustering is the tendency of elements to become unevenly distributed in the hash table, with many elements bunching around a single hash location.
111
Collision resolution: open addressing (2)
Even if the table is fairly empty, collisions resolved by linear probing may cluster (group) keys/entries together. This increases the time to insert and to find.
(Diagram: a row of slots 1 2 3 4 5 6 7 8, with some slots already filled.)
For a table of size n: if the table is empty, the probability of the next entry going to any particular place is 1/n.
In the diagram, the probability of position 2 being filled next is 2/n (either a hash to 1 or to 2 fills it).
Once 2 is full, the probability of 4 being filled next is 4/n, and then of 7 being filled next is 7/n, i.e. the probability of getting long strings steadily increases.
112
Collision resolution: open addressing (3)
An empty key/entry marks the end of a cluster, and so can be used to terminate a find operation
So, if we remove an entry within a cluster, we should not empty it!
To allow probing to continue, the removed entry must be marked as ‘removed but cluster continues’
113
Collision resolution: open addressing (4)
Quadratic probing is a solution to the clustering problem.
Linear probing adds 1, 2, 3, etc. to the original hashed key.
Quadratic probing adds 1², 2², 3², etc. to the original hashed key.
However, whereas linear probing guarantees that all empty positions will be examined if necessary, quadratic probing does not.
e.g. table size 16 and original hashed key 3 gives the sequence: 3, 4, 7, 12, 3, 12, 7, 4, ...
More generally, with quadratic probing, insertion may be impossible if the table is more than half full! Then we need to rehash (see later).
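The coverage problem can be checked directly by enumerating the positions probed from a hashed key h: h, h+1², h+2², ... (mod table size). The helper below is an illustrative sketch:

```cpp
#include <cassert>
#include <set>

// Collect the distinct positions examined by quadratic probing
// starting from hashed key h in a table of size n.
std::set<int> probedPositions(int h, int n, int tries) {
    std::set<int> seen;
    seen.insert(h % n);                 // the original position
    for (int i = 1; i <= tries; ++i)
        seen.insert((h + i * i) % n);   // add 1^2, 2^2, 3^2, ... (mod n)
    return seen;
}
```

For the slide's example (table size 16, hashed key 3), only the four positions 3, 4, 7 and 12 are ever reached, no matter how long we probe; a prime table size reaches roughly half the table, which is why quadratic probing works while the table is at most half full.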
114
Collision resolution: chaining
Each slot of the hash table is a pointer to a linked list. Add keys and entries anywhere in the list (the front is easiest).
Advantages over open addressing:
Simpler insertion and removal (no need to change positions!)
Array size is not a limitation (but should still minimise collisions: make the table size roughly equal to the expected number of keys and entries)
Disadvantage:
Memory overhead is large if entries are small
(Diagram: an array of slots, each pointing to a chain of key/entry nodes.)
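A minimal sketch of such a chained table, assuming integer keys, front insertion, and a fixed 10-slot array (names like ChainedTable and Node are mine; error handling and node deallocation are omitted):

```cpp
#include <cassert>

const int NUM_SLOTS = 10;

struct Node {
    int key;
    Node* next;
};

// One linked list ("chain") per slot; insertion at the front is O(1).
struct ChainedTable {
    Node* slots[NUM_SLOTS] = { nullptr };

    int hash(int key) const { return key % NUM_SLOTS; }

    void insert(int key) {
        int h = hash(key);
        slots[h] = new Node{ key, slots[h] };   // new node becomes the head
    }

    bool contains(int key) const {
        for (Node* p = slots[hash(key)]; p != nullptr; p = p->next)
            if (p->key == key) return true;     // linear search of one chain
        return false;
    }
};
```

Note that removal from a chain needs no tombstones: unlinking a node cannot break any other key's search path, which is the "simpler removal" advantage listed above.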
115
Chaining
Chaining is another means (besides linear probing) of handling collisions that arise from the use of a hash function.
Chaining uses the hash value not as the actual location of the element, but as an index into an array of pointers. A chain is a linked list of elements that share the same hash location.
FOR EXAMPLE . . .
116
Using hashing and chaining
(Diagram: an array pointers[0..99]; each occupied slot points to a chain holding a four-digit part number.)
HandyParts company makes no more than 100 different parts, but the parts all have four-digit numbers. Use this hash function to store and retrieve parts in the chains:
Hash(key) = partNum % 100
117
Using chaining
Use the hash function
Hash(key) = partNum % 100
to place the element with part number 5502 in a chain.
(Diagram: 5502 % 100 = 2, so a node holding 5502 is linked into the chain at pointers[2].)
118
Using chaining
Next place part number 6702 in a chain.
Hash(key) = partNum % 100
6702 % 100 = 2
(Diagram: with chaining, the collision causes no problem: 6702 is simply linked into the chain at pointers[2], alongside 5502.)
119
Using chaining
Where would the part with number 4598 be placed using chaining?
(Diagram: the chain at pointers[2] now holds 5502 and 6702.)
120
More Chaining…….
121
Hashing(103)
h(103) = 103 mod 10 = 3
122
Hashing(103)
h(103) = 103 mod 10 = 3
(Diagram: table[3] now holds a chain containing 103.)
123
Hashing(69)
h(69) = 69 mod 10 = 9
(Diagram: table[3]: 103; table[9]: 69.)
124
Hashing(20)
h(20) = 20 mod 10 = 0
(Diagram: table[0]: 20; table[3]: 103; table[9]: 69.)
125
Hashing(13)
h(13) = 13 mod 10 = 3
(Diagram: 13 joins the chain at table[3], which now contains 103 and 13; table[0]: 20; table[9]: 69.)
126
Hashing(110)
h(110) = 110 mod 10 = 0
(Diagram: 110 joins the chain at table[0], which now contains 20 and 110.)
127
Hashing(53)
h(53) = 53 mod 10 = 3
(Diagram: 53 joins the chain at table[3], which now contains 103, 13 and 53.)
128
Final Hash Table
(Diagram: table[0]: 20, 110; table[3]: 103, 13, 53; table[9]: 69; all other slots empty.)
129
Searching in a Hash Table
Like any other structure, searching is a common task with hash tables.
Searching works as follows:
Given a target, hash the target.
Take the hash value of the target and go to that slot. If the target exists, it must be in this slot.
Search the list in that slot using a linear search.
130
Searching for 53
h(53) = 53 mod 10 = 3, so search the chain at table[3], which contains 103, 13 and 53.
(Slides 130–135 animate the traversal: a temp pointer starts at the head of the chain and advances node by node until its key matches 53. Found!)
136
hashSearch(n)
NodeType* hashSearch(NodeType* table[], int target)
{
    int index = hash(target);           // which chain the target would be in
    NodeType* temp = table[index];      // head of that chain
    return linearSearch(temp, target);  // walk the chain
}
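The hashSearch above relies on a linearSearch over one chain, which is not shown on the slides. Here is a plausible sketch, assuming NodeType holds an int info field and a next pointer (the real definition may differ):

```cpp
#include <cassert>
#include <cstddef>

// Assumed node layout; the slides do not show the real NodeType.
struct NodeType {
    int info;
    NodeType* next;
};

// Walk one chain looking for the target; return the matching node,
// or nullptr if the target is not in this chain.
NodeType* linearSearch(NodeType* head, int target) {
    for (NodeType* temp = head; temp != nullptr; temp = temp->next)
        if (temp->info == target) return temp;
    return nullptr;
}
```

On the chain 103 → 13 → 53 from the earlier slides, a search for 53 visits all three nodes and returns the last one; a search for an absent key falls off the end and returns nullptr.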
137
Rehashing: enlarging the table
To rehash:
Create a new table of double the size (adjusting until it is again prime).
Transfer the entries in the old table to the new table, recomputing their positions using the hash function.
When should we rehash?
When the table is completely full.
With quadratic probing, when the table is half full or insertion fails.
Why double the size?
If n is the number of elements in the table, there must have been n/2 insertions since the previous rehash (if rehashing is done when the table is full).
So by making the table size 2n, only a constant cost is added to each insertion.
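The "double, then adjust until prime" step can be sketched as below; the names isPrime and nextTableSize are illustrative, and trial division is used for simplicity:

```cpp
#include <cassert>

// Trial-division primality test: adequate for table-size magnitudes.
bool isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Next table size: double the old size, then adjust upward until prime.
int nextTableSize(int oldSize) {
    int n = 2 * oldSize;
    while (!isPrime(n)) ++n;
    return n;
}
```

Rehashing proper would then allocate the new table and re-insert every entry with the hash function taken modulo the new size.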
138
Comparison of collision techniques
(Graph: expected number of probes against load factor (n/size), comparing linear probing, random probing and chaining.)
139
Applications of Hashing Compilers use hash tables to keep track of declared variables A hash table can be used for on-line spelling checkers — if
misspelling detection (rather than correction) is important, an entire dictionary can be hashed and words checked in constant time
Game playing programs use hash tables to store seen positions, thereby saving computation time if the position is encountered again
Hash functions can be used to quickly check for inequality — if two elements hash to different values they must be different
Storing sparse data
140
When are other representations more suitable than hashing?
Hash tables are very good if there is a need for many searches in a reasonably stable table
Hash tables are not so good if there are many insertions and deletions, or if table traversals are needed — in this case, AVL trees are better
If there are more data than available memory then use a B-tree
Also, hashing is very slow for any operations which require the entries to be sorted e.g. Find the minimum key
141
Performance of Hashing
The number of probes depends on the load factor (usually denoted by λ), which represents the ratio of entries present in the table to the number of positions in the array.
We also need to consider successful and unsuccessful searches separately.
For a chained hash table, the average number of probes for an unsuccessful search is λ, and for a successful search it is 1 + λ/2.
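As a quick worked instance of these formulas, take a chained table with a load factor of one half:

```latex
\lambda = \frac{\text{entries}}{\text{positions}} = 0.5
\;\Rightarrow\;
\text{unsuccessful search: } \lambda = 0.5 \text{ probes on average},\qquad
\text{successful search: } 1 + \frac{\lambda}{2} = 1.25 \text{ probes on average}
```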
142
Performance of Hashing (2)
For open addressing, the formulae are more complicated, but typical values are:

Load factor            0.1    0.5    0.8    0.9    0.99
Successful search
  Linear probes        1.05   1.6    3.4    6.2    21.3
  Quadratic probes     1.04   1.5    2.1    2.7    5.2
Unsuccessful search
  Linear probes        1.13   2.7    15.4   59.8   430
  Quadratic probes     1.13   2.2    5.2    11.9   126

Note that these values do not depend on the size of the array or on the number of entries present, but only on their ratio (the load factor).
143
Summary
Hash tables store a collection of records with keys.
The location of a record depends on the hash value of the record's key.
When a collision occurs, the next available location is used.
Searching for a particular key is generally quick.
When an item is deleted, the location must be marked in a special way, so that searches know the spot used to be occupied.