1
Searching and Hashing
2
Concepts This Lecture
Searching an array Linear search Binary search Comparing algorithm performance
3
Searching
Searching = looking for something Searching an array is particularly common
Goal: determine if a particular value is in the array
We'll see that more than one algorithm will work
4
Searching Algorithms
The algorithm a human uses to find a number in a phone book is practical and efficient for people, but not so good for computers: it is not precise, and it is not consistent
Let's imagine another scenario. Suppose that you have a pile of cards containing names of customers. They are not organized in any particular way. You want to find the card with the name Sarah (your key)
The procedure you will likely use is: look at each card's key (one by one) until one matches your target. This is an algorithm, and it is called Linear Search
5
Searching as a Function
Specification: Let b be the array to be searched, n be the size of the array, and x be the value being searched for. If x appears in b[0..n-1], return its index, i.e., return k such that b[k]==x. If x is not found, return -1
None of the parameters are changed by the function
Function outline:
void Lookup (const int vec[ ], int vSize, int key, Boolean& found, int& loc) {
...}
6
Linear Search Algorithm: start at the beginning of the array and examine each
element until x is found, or all elements have been examined
void Lookup (const int vec[ ], int vSize, int key, Boolean& found, int& loc) {
loc = 0;
while (loc < vSize && vec[loc] != key)
loc++;
found = (loc < vSize);
}
7
Linear Search
Test: search(v, 8, 6)
b: 3 12 -5 6 142 21 -17 45
Found It!
8
Linear Search
Test: search(v, 8, 15)
b: 3 12 -5 6 142 21 -17 45
Ran off the end! Not found.
9
Linear Search
Note: The loop condition is written so vec[loc] is not accessed if loc >= vSize.
while ( loc < vSize && vec[loc] != key )
(Why is this true? Why does it matter?)
b: 3 12 -5 6 142 21 -17 45
10
Write a Recursive Linear Search
NodeType* linearSearch(NodeType *start, int target) {
  if (start == NULL) return NULL;          // must test for end of list before dereferencing
  if (start->key == target) return start;  // found
  return linearSearch(start->next, target);
}
11
Linear Search-Linked List
for each item in the list
  if the item's key matches the target
    stop and report "success"
report failure
12
Linear Search (target = 9)
[diagram: head -> 5 -> 12 -> 9 -> / ; examining node 5]
13
Linear Search (target = 9)
[diagram: head -> 5 -> 12 -> 9 -> / ; examining node 12]
14
Linear Search (target = 9)
[diagram: head -> 5 -> 12 -> 9 -> / ; examining node 9: found]
15
Linear Search (target = n)
NodeType* linearSearch(NodeType *start, int target) {
  NodeType *temp = start;
  while (temp != NULL) {
    if (temp->key == target) return temp;
    temp = temp->next;
  }
  return NULL;
}
16
Analyzing Linear Search
Best case analysis: the element is found in the first position of the list, which means that we do one comparison: O(1)
Worst case analysis: the element is not present in the list. This means we do n comparisons, where n is the size of the list; we have to go through the whole list to be sure whether the element is present: O(N)
Average case analysis: the search key can be found anywhere in the list. If we "run" the algorithm for each possibility where the key may appear, we get: (1+2+...+vSize)/vSize = (vSize*(vSize+1)/2)/vSize = (vSize+1)/2 = O(N)
17
Can we do better?
Time needed for linear search is proportional to the size of the array.
An alternate algorithm, "binary search," works if the array is sorted:
1. Look for the target in the middle.
2. If you don't find it, you can ignore half of the array, and repeat the process with the other half.
Example: Find first page of pizza listings in the yellow pages
19
Binary Search
In some cases, you get a list which is already ordered. In this case we can use algorithms that take this into consideration
The idea of binary search is:
Split the list in two halves and compare the target with the key in the middle of the list
Based on this comparison we can tell which half of the list may contain the target
Binary search eliminates half of the list at each iteration
It requires direct access to the list elements
20
Binary Search Strategy
What we want: find the split between values larger and smaller than x:
b: [0 .. L] <= x | [L+1 .. n-1] > x
Situation while searching:
b: [0 .. L] <= x | [L+1 .. R-1] ? | [R .. n-1] > x
Step: Look at b[(L+R)/2]. Move L or R to the middle depending on the test.
21
Binary Search Strategy
More precisely
Values in b[0..L] <= x
Values in b[R..n-1] > x
Values in b[L+1..R-1] are unknown
22
Binary Search: Iterative Approach
/* If x appears in b[0..n-1], return its location, i.e., return k so that b[k]==x. If x not found, return -1 */
NodeType binarySearch(NodeType list[], int size, int target) {
  int front, back, mid;
  ___________________ ;
  while ( _______________ ) {
  }
  _________________ ;
}
23
/* If x appears in b[0..n-1], return its location, i.e., return k so that b[k]==x. If x not found, return -1 */
NodeType binarySearch(NodeType list[], int size, int target) {
  int front, back, mid;
  ___________________ ;
  while ( _______________ ) {
    mid = (front+back)/2;
    if (target == list[mid].key) return list[mid];
    else if (target < list[mid].key) back = mid-1;
  }
  _________________ ;
}
Binary Search: Iterative Approach
24
Loop Termination
/* If x appears in b[0..n-1], return its location, i.e., return k so that b[k]==x. If x not found, return -1 */
NodeType binarySearch(NodeType list[], int size, int target) {
  int front, back, mid;
  ___________________ ;
  while (front <= back) {
    mid = (front+back)/2;
    if (target == list[mid].key) return list[mid];
    else if (target < list[mid].key) back = mid-1;
    else front = mid+1;
  }
  _________________ ;
}
25
/* If x appears in b[0..n-1], return its location, i.e., return k so that b[k]==x. If x not found, return -1 */
NodeType binarySearch(NodeType list[], int size, int target) {
  int front(0);
  int back(size-1);
  int mid;
  while (front <= back) {
    mid = (front+back)/2;
    if (target == list[mid].key) return list[mid];
    else if (target < list[mid].key) back = mid-1;
    else front = mid+1;
  }
  _________________ ;
}
Initialization
26
NodeType binarySearch(NodeType list[], int size, int target) {
  int front(0);
  int back(size-1);
  int mid;
  while (front <= back) {
    mid = (front+back)/2;
    if (target == list[mid].key) return list[mid];
    else if (target < list[mid].key) back = mid-1;
    else front = mid+1;
  }
  return NULL;  // indicates target was not found
}
Return Result
27
Binary Search
Test: bsearch(v,8,3);
b: -17 -5 3 6 12 21 45 142   (indices 0..7)
while (front <= back) {
  mid = (front+back)/2;
  if (target == list[mid].key) return list[mid];
  else if (target < list[mid].key) back = mid-1;
  else front = mid+1;
}
[trace: mid=3 (6 > 3, move R), mid=1 (-5 < 3, move L), mid=2 (3 == 3): found]
28
Binary Search
Test: bsearch(v,8,17);
b: -17 -5 3 6 12 21 45 142   (indices 0..7)
while (front <= back) {
  mid = (front+back)/2;
  if (target == list[mid].key) return list[mid];
  else if (target < list[mid].key) back = mid-1;
  else front = mid+1;
}
[trace: mid=3 (6 < 17), mid=5 (21 > 17), mid=4 (12 < 17), front > back: not found]
29
Binary Search
Test: bsearch(v,8,143);
b: -17 -5 3 6 12 21 45 142   (indices 0..7)
while (front <= back) {
  mid = (front+back)/2;
  if (target == list[mid].key) return list[mid];
  else if (target < list[mid].key) back = mid-1;
  else front = mid+1;
}
[trace: L keeps moving right past 142; front > back: not found]
30
Binary Search
Test: bsearch(v,8,-143);
b: -17 -5 3 6 12 21 45 142   (indices 0..7)
while (front <= back) {
  mid = (front+back)/2;
  if (target == list[mid].key) return list[mid];
  else if (target < list[mid].key) back = mid-1;
  else front = mid+1;
}
[trace: R keeps moving left past -17; back < front: not found]
31
Binary Search (target = n)
NodeType binarySearch(NodeType list[], int size, int target) {
  int front(0);
  int back(size-1);
  int mid;
  while (front <= back) {
    mid = (front+back)/2;
    if (target == list[mid].key) return list[mid];
    else if (target < list[mid].key) back = mid-1;
    else front = mid+1;
  }
  return NULL;  // indicates target was not found
}
32
Binary Search (target = 7)
4 6 7 12 18 22 23 28 30
front(0), back(8)
33
Binary Search (target = 7)
4 6 7 12 18 22 23 28 30
front(0), back(8), mid(4)
34
Binary Search (target = 7)
4 6 7 12 18 22 23 28 30
front(0), back(3), mid(4)
35
Binary Search (target = 7)
4 6 7 12 18 22 23 28 30
front(0), back(3), mid(1)
36
Binary Search (target = 7)
4 6 7 12 18 22 23 28 30
front(2), back(3), mid(1)
37
Binary Search (target = 7)
4 6 7 12 18 22 23 28 30
front(2), back(3), mid(1)
38
Is it worth the trouble?
Suppose you had 1000 elements
Ordinary (linear) search would require maybe 500 comparisons on average
Binary search:
after the 1st compare, throw away half, leaving 500 elements to be searched.
after the 2nd compare, throw away half, leaving 250. Then 125, 63, 32, 16, 8, 4, 2, 1 are left.
After at most 10 steps, you're done!
What if you had 1,000,000 elements??
39
How Fast Is It?
Another way to look at it: How big an array can you search if you examine a given number of array elements?
# comps Array size
1 1
2 2
3 4
4 8
5 16
6 32
7 64
8 128
… …
11 1,024
… …
21 1,048,576
40
List size   Loop iterations
1           1
3           2
7           3
15          4
31          5
63          6
127         7
Analyzing Binary Search
We only need to concentrate on the main loop
The loop is different from the linear search because its number of executions is not a multiple of n (the list size)
We can easily see that the size of the input is halved in each iteration.
This should already give a "hint" of which function describes this algorithm, but let's use a table
The table shows that the number of iterations grows proportionally to the logarithm base 2 of the size of the list: O(log n)
41
Time for Binary Search
Key observation for binary search: the size of the array n that can be searched with k comparisons: n ~ 2^k
Number of comparisons k as a function of array size n: k ~ log2 n
This is fundamentally faster than linear search (where k ~ n)
42
Write a Recursive Binary Search Function BinarySearch( )
BinarySearch takes a sorted array vec, two subscripts, fromLoc and toLoc, and a key as arguments. It returns false if key is not found in the elements vec[fromLoc…toLoc]. Otherwise, it returns true.
BinarySearch is O(log2N).
43
found = BinarySearch(vec, 25, 0, 14 );
(key = 25, fromLoc = 0, toLoc = 14)
indexes: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
vec:     0 2 4 6 8 10 12 14 16 18 20 22 24 26 28
remaining range after 1st probe: 16 18 20 22 24 26 28
remaining range after 2nd probe: 24 26 28
remaining range after 3rd probe: 24
NOTE: the rows above denote the elements still being examined
44
Recursive Binary Search -- basic idea
• This is an example of a recursive function where arguments are halved.
Given: a sorted array a of values (integers, strings, ..) from range [s,t]
Task: search if a value x is in the array. If yes, return position, otherwise -1.
45
Recursive Binary Search -- basic idea
• Consider how you search for a name in a phone book: you don't use linear search (otherwise it would take ages to find a name starting with Z).
• instead, you open the book somewhere, and then continue searching in the half that contains the name then open up somewhere in that half, and continue searching in the portion that contains the name, etc.
46
Now let's do this for a sorted array of integers, but let's always check the middle of the remaining range. Example: search for 7 in the following array
indices: 0 1 2 3 4 5 6 7 8 9
array:   2 5 7 11 17 24 31 38 40 41   (low = 0, high = 9)
mid = (0+9)/2 == 4; 7 < a[4], so look in lower half:
2 5 7 11   (low = 0, high = 3)
mid = (0+3)/2 == 1; 7 > a[1], so look in upper half:
7 11   (low = 2, high = 3)
mid = (2+3)/2 == 2; 7 == a[2], found!
Recursive Binary Search -- basic idea
47
Recursive Binary Search -- basic idea
• Example: the array contains 3,5. Search for 4. (0+1)/2 is 0 (integer division), so if we don't exclude mid, the subarray starts again at index 0 and ends at 1 => an infinite number of recursive calls. In the code on the next page, mid is excluded from the subarray to prevent this.
48
• Let's think about the design of the recursive function before coding it:
1. recursive calls: call the function with the half of the current subrange that contains x
   Define the subrange with a start and an end index
2. base case: when should the recursive calls stop? When we find x
• What if x is not in the array? -- stop at a single cell that does not contain x
  Check: does the (start + end)/2 procedure always end in an array of length 1? A: it depends on how you implement it. You must ensure that the array gets smaller by at least 1.
Recursive Binary Search -- basic idea
Boolean BinarySearch ( int vec[ ] , int key , int fromLoc , int toLoc )
// PRE:  vec [ fromLoc . . toLoc ] sorted in ascending order
// POST: FCTVAL == ( key in vec [ fromLoc . . toLoc ] )
{
  int mid ;
  if ( fromLoc > toLoc )            // base case -- not found
    return false ;
  else {
    mid = ( fromLoc + toLoc ) / 2 ;
    if ( vec [ mid ] == key )       // base case -- found at mid
      return true ;
    else if ( key < vec [ mid ] )   // search lower half
      return BinarySearch ( vec, key, fromLoc, mid-1 ) ;
    else                            // search upper half
      return BinarySearch ( vec, key, mid + 1, toLoc ) ;
  }
}
49
Recursive Binary Search
#include <stdio.h>
/* prototype */
int binSearch(int array[], int first, int last, int N);
int main(void){
  int index;
  int value;
  int list[] = {1,2,3,5,6};
  printf("Enter a search value:");
  scanf("%i", &value);
  /* the function binSearch returns the index of the array */
  /* where the match is found, otherwise a -1 */
  index = binSearch(list, 0, 4, value);
  if (index == -1)
    printf("Value not found!\n");
  else
    printf("Value matches the %i element in the array!\n", ++index);
  return 0;
}
/* code continued on next slide */
/* array is the name of the array (or sub-array) to be searched */
/* first is the left-most index of the array being searched */
/* last is the right-most index of the array being searched */
/* N is the value being searched for */
int binSearch(int array[], int first, int last, int N) {
  int midpt;
  if (N < array[first] || N > array[last])
    return -1;
  /* didn't meet our error condition */
  midpt = (first+last)/2;
  if (array[midpt] == N)
    return midpt;
  /* recursive calls */
  else if (array[midpt] > N)
    return binSearch( array, first, midpt - 1, N);
  else
    return binSearch( array, midpt + 1, last, N);
}
52
Note the contents of the "stack" when we execute a call to binSearch from main (some of the details are simplified):
(push) return binSearch( array, 0, 1, 2 );   (first = 0, last = 4)
(push) return binSearch( array, 1, 1, 2 );   (first = 0, last = 1)
(pop)  return 1;   (first = 1, last = 1)
(pop)  return 1;   (first = 0, last = 1)
(pop)  return 1;   (first = 0, last = 4)
Recursive Binary Search
53
Note the contents of the "stack" when we execute a call to binSearch from main (some of the details are simplified):
(push) return binSearch( array, 2+1, 4, 4 );   (first = 0, last = 4)
(pop)  return -1;   (first = 3, last = 4)
(pop)  return -1;   (first = 0, last = 4)
Recursive Binary Search
54
Iteration vs. Recursion
It turns out any iterative algorithm can be reworked to use recursion instead (and vice versa).
There are programming languages where recursion is the only choice(!)
Some algorithms are more naturally written with recursion
But naïve applications of recursion can be inefficient
55
Binary Search
Several comments on binary search:
• Binary search assumes that the elements are sorted. If they are not sorted, you won't know in which half to continue searching.
• Binary search is not a great idea for linked lists, since you can't just jump to the middle element. You'd have to iterate through the list to get there, so you could just as well check for x while you are doing that.
56
Summary
Linear search and binary search are two different algorithms for searching an array
Binary search is vastly more efficient
But binary search only works if the array elements are in order
57
Hashing
58
Tables: rows & columns of information
A table has several fields (types of information)
  A telephone book may have fields: name, address, phone number
  A user account table may have fields: user id, password, home folder
To find an entry in the table, you only need to know the contents of one of the fields (not all of them). This field is the key
  In a telephone book, the key is usually name
  In a user account table, the key is usually user id
Ideally, a key uniquely identifies an entry
  If the key is name and no two entries in the telephone book have the same name, the key uniquely identifies the entries
59
The Table ADT: operations
insert: given a key and an entry, inserts the entry into the table
find: given a key, finds the entry associated with the key
remove: given a key, finds the entry associated with the key, and removes it
Also:
getIterator: returns an iterator, which visits each of the entries one by one (the order may or may not be defined), etc.
60
Table ADT’s
We are familiar with direct access structures and linear access structures.
Both have their advantages and disadvantages
The crucial point against direct access structures is the fact that we need to allocate the size of this structure in advance
  In all likelihood, we tend to overestimate its size and we end up with a very sparse structure
  We tend to think that the actual number of keys to be stored is equivalent to the universe of possible existing keys
In some problems the number of keys to be stored is smaller than the number in the universe of keys. In this case a hash table may save us a lot of space.
61
How should we implement a table?
How often are entries inserted and removed?
How many of the possible key values are likely to be used?
What is the likely pattern of searching for keys? e.g. will most of the accesses be to just one or two key values?
Is the table small enough to fit into memory?
How long will the table exist?
Our choice of representation for the Table ADT depends on the answers to these questions
62
TableNode: a key and its entry
For searching purposes, it is best to store the key and the entry separately (even though the key's value may be inside the entry)
Examples (key / entry):
"Smith" / "Smith", "124 Hawkers Lane", "9675846"
"Yeo" / "Yeo", "1 Apple Crescent", "0044 1970 622455"
63
Implementation 1: unsorted sequential array
An array in which TableNodes are stored consecutively in any order
insert: add to back of array; O(1)
find: search through the keys one at a time, potentially all of the keys; O(n)
remove: find + replace removed node with last node; O(n)
[diagram: array of (key, entry) pairs at indices 0, 1, 2, 3, and so on]
64
Implementation 2: sorted sequential array
An array in which TableNodes are stored consecutively, sorted by key
insert: add in sorted order; O(n)
find: binary search; O(log n)
remove: find, remove node and shuffle down; O(n)
We can use binary search because the array elements are sorted
[diagram: array of (key, entry) pairs at indices 0, 1, 2, 3, and so on]
65
Implementation 3: linked list (unsorted or sorted)
TableNodes are again stored consecutively
insert: add to front; O(1) (or O(n) for a sorted list)
find: search through potentially all the keys, one at a time; O(n) (still O(n) for a sorted list)
remove: find, remove using pointer alterations; O(n)
[diagram: linked list of (key, entry) nodes, and so on]
66
Implementation 5: hashing
An array in which TableNodes are not stored consecutively - their place of storage is calculated using the key and a hash function
Hashed key: the result of applying a hash function to a key
Keys and entries are scattered throughout the array
[diagram: key -> hash function -> array index; entries at indices 4, 10, 123]
67
Implementation 5: hashing
An array in which TableNodes are not stored consecutively - their place of storage is calculated using the key and a hash function
insert: calculate place of storage, insert TableNode; O(1)
find: calculate place of storage, retrieve entry; O(1)
remove: calculate place of storage, set it to null; O(1)
All are O(1)!
[diagram: entries at indices 4, 10, 123]
68
Hash Functions
Hash tables normally maintain the invariant of a direct access structure, which provides O(1) time (constant time) to access an element
With a direct access structure, a key k is normally stored in slot k. In a hash table this element is stored in slot h(k).
h(k) is a hash function. It maps the universe U of keys into the slots of a hash table (smaller than the universe):
h : U --> {0,1,...,m-1}   where m is the size of the table
69
Hashing example: a fruit shop
10 stock details, 10 table positions
Stock numbers are between 0 and 1000
Use hash function: stock no. / 100
table (position: key, entry):
0: 85, apples
3: 323, guava
4: 462, pears
9: 912, papaya
What if we now insert stock no. 350 (oranges)?
Position 3 is occupied: there is a collision
Collision resolution strategy: insert in the next free position (linear probing)
Given a stock number, we find stock by using the hash function again, and use the collision resolution strategy if necessary
70
Pictorial view of Hash Tables
[diagram: keys k1..k4 from the universe U mapped by h into table slots]
71
Pictorial view of Hash Tables
[diagram: a fifth key k5 added to the same table]
72
Three factors affecting the performance of hashing
The hash function
  Ideally, it should distribute keys and entries evenly throughout the table
  It should minimise collisions, where the position given by the hash function is already occupied
The collision resolution strategy
  Separate chaining: chain together several keys/entries in each position
  Open addressing: store the key/entry in a different position
The size of the table
  Too big will waste memory; too small will increase collisions and may eventually force rehashing (copying into a larger table)
  Should be appropriate for the hash function used, and a prime number is best
73
Choosing a hash function:turning a key into a table position
Truncation
  Ignore part of the key and use the rest as the array index (converting non-numeric parts)
  A fast technique, but check for an even distribution throughout the table
Folding
  Partition the key into several parts and then combine them in any convenient way
  Unlike truncation, uses information from the whole key
Modular arithmetic (used by truncation & folding, and on its own)
  To keep the calculated table position within the table, divide the position by the size of the table, and take the remainder as the new position
74
Examples of hash functions (1)
Truncation: If students have a 9-digit identification number, take the last 3 digits as the table position e.g. 925371622 becomes 622
Folding: Split a 9-digit number into three 3-digit numbers, and add them e.g. 925371622 becomes 925 + 371 + 622 = 1918
Modular arithmetic: If the table size is 1000, the first example always keeps within the table range, but the second example does not (it should be mod 1000) e.g. 1918 mod 1000 = 918 (in Java: 1918 % 1000)
75
Examples of hash functions (2)
Using a telephone number as a key
  The area code is not random, so it will not spread the keys/entries evenly through the table (many collisions)
  The last 3 digits are more random
Using a name as a key
  Use the full name rather than the surname (a surname is not particularly random)
  Assign numbers to the characters (e.g. a = 1, b = 2; or use Unicode values)
  Strategy 1: Add the resulting numbers. Bad for large table sizes.
  Strategy 2: Call the number of possible characters c (e.g. c = 54 for the alphabet in upper and lower case, plus space and hyphen). Then multiply each character in the name by increasing powers of c, and add together.
76
What is a Hash Table ?
The simplest kind of hash table is an array of records.
This example has 701 records.
[diagram: an array of records, slots [0] through [700]]
77
What is a Hash Table ?
Each record has a special field, called its key.
In this example, the key is a long integer field called Number.
[diagram: slot [4] holds a record whose key field Number is 506643548]
78
What is a Hash Table ?
The number might be a person's identification number, and the rest of the record has information about the person.
[diagram: the record at [4] with Number 506643548]
79
What is a Hash Table ?
When a hash table is in use, some spots contain valid records, and other spots are "empty".
[diagram: slots [0]..[700]; four valid records (Numbers 506643548, 233667136, 281942902, 155778322), other slots empty]
80
Inserting a New Record
In order to insert a new record, the key must somehow be converted to an array index.
The index is called the hash value of the key.
[diagram: the table as before; a new record with Number 580625685 is to be inserted]
81
Inserting a New Record
Typical way to create a hash value: (Number mod 701)
What is (580625685 mod 701) ?
[diagram: the table; new record 580625685 awaiting insertion]
82
Inserting a New Record
Typical way to create a hash value: (Number mod 701)
What is (580625685 mod 701) ?  3
[diagram: the table; record 580625685 awaiting insertion]
83
Inserting a New Record
The hash value is used for the location of the new record.
[diagram: record 580625685 placed at index [3]]
84
Inserting a New Record
The hash value is used for the location of the new record.
[diagram: the table now contains 580625685 at index [3]]
85
Collisions
Here is another new record to insert, with a hash value of 2.
[diagram: new record 701466868 says: "My hash value is [2]"]
86
Collisions
This is called a collision, because there is already another valid record at [2].
When a collision occurs, move forward until you find an empty spot.
[diagram: record 701466868 probing forward from index [2]]
89
Collisions
This is called a collision, because there is already another valid record at [2].
The new record goes in the empty spot.
[diagram: record 701466868 placed in the first empty slot after [2]]
90
A Quiz
Where would you be placed in this table, if there is no collision? Use your social security number or some other favorite number.
[diagram: the table with six records inserted]
91
Searching for a Key
The data that's attached to a key can be found fairly quickly.
[diagram: searching for key 701466868]
92
Searching for a Key
Calculate the hash value. Check that location of the array for the key.
[diagram: hash value is [2]; the record at [2] says "Not me"]
93
Searching for a Key
Keep moving forward until you find the key, or you reach an empty spot.
[diagram: probing forward; still "Not me"]
95
Searching for a Key
Keep moving forward until you find the key, or you reach an empty spot.
[diagram: probing forward; the record matches: "Yes!"]
96
Searching for a Key
When the item is found, the information can be copied to the necessary location.
[diagram: the matching record found at its probed location]
97
Deleting a Record
Records may also be deleted from a hash table.
[diagram: a record asks: "Please delete me"]
98
Deleting a Record
Records may also be deleted from a hash table. But the location must not be left as an ordinary "empty spot" since that could interfere with searches.
[diagram: the deleted record's slot]
99
Deleting a Record
[diagram: the deleted slot marked with a special value]
Records may also be deleted from a hash table. But the location must not be left as an ordinary "empty spot" since that could interfere with searches. The location must be marked in some special way so that a search can tell that the spot used to have something in it.
100
Using a hash function
[table `values`, indices [0]..[99]: some slots hold four-digit part numbers (4501, 8903, 7803, 2298, 3699, ...), the rest are Empty]
HandyParts company makes no more than 100 different parts. But the parts all have four-digit numbers.
This hash function can be used to store and retrieve parts in an array:
Hash(key) = partNum % 100
101
Placing elements in the array
Use the hash function
Hash(key) = partNum % 100
to place the element with part number 5502 in the array.
[table: slot [2] is empty, so 5502 goes to index [2]]
102
Placing elements in the array
Next place part number 6702 in the array.
Hash(key) = partNum % 100
6702 % 100 = 2
But values[2] is already occupied. COLLISION OCCURS
[table: [1] 4501, [2] 5502, [3] 8903, ...]
103
How to resolve the collision?
One way is by linear probing. This uses the rehash function
(HashValue + 1) % 100
repeatedly until an empty location is found for part number 6702.
(Diagram: the same array, with values[2] occupied by 5502.)
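The placement of 6702 by linear probing can be sketched as follows. The function and array names (place, values) and the use of -1 to mark an empty slot are illustrative assumptions; only the hash function and the (HashValue + 1) % 100 rehash step come from the slides.

```cpp
#include <cassert>

const int SIZE = 100;
const int EMPTY_SLOT = -1;         // assumption: -1 marks an empty slot

int hashPart(int partNum) { return partNum % 100; }

// Insert with linear probing: on collision, apply (hashValue + 1) % 100
// repeatedly until an empty slot is found. Returns the index used,
// or -1 if the table is full.
int place(int values[], int partNum) {
    int h = hashPart(partNum);
    for (int probes = 0; probes < SIZE; ++probes) {
        if (values[h] == EMPTY_SLOT) { values[h] = partNum; return h; }
        h = (h + 1) % SIZE;        // the rehash step from the slide
    }
    return -1;
}
```

With slots [1], [2] and [3] occupied as in the diagrams, part 6702 hashes to 2, probes 3, and lands at index 4, matching the slides.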
104
Resolving the collision
Still looking for a place for 6702, using the function
(HashValue + 1) % 100
(Diagram: the probe moves on from values[3], which is also occupied.)
105
Collision resolved
Part 6702 can be placed at the location with index 4.
(Diagram: values[4] is Empty.)
106
Collision resolved
Part 6702 is placed at the location with index 4.
Where would the part with number 4598 be placed using linear probing?
(Diagram: values[4] now holds 6702.)
107
Choosing the table size to minimise collisions
As the number of elements in the table increases, the likelihood of a collision increases, so make the table as large as practical.
If the table size is 100 and all the hashed keys are divisible by 10, there will be many collisions! This is particularly bad if the table size is a power of a small integer such as 2 or 10.
More generally, collisions may be more frequent if gcd(hashed keys, table size) > 1. Therefore, make the table size a prime number (gcd = 1).
Collisions may still happen, so we need a collision resolution strategy
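A quick experiment illustrates why a prime table size helps when the hashed keys share a factor with the table size. The helper name distinctSlots and the choice of keys are illustrative:

```cpp
#include <cassert>
#include <set>

// Count how many distinct slots the keys 10, 20, ..., 1000 occupy
// for a given table size.
int distinctSlots(int tableSize) {
    std::set<int> slots;
    for (int key = 10; key <= 1000; key += 10)
        slots.insert(key % tableSize);
    return static_cast<int>(slots.size());
}
```

With table size 100, the hundred keys pile into only ten slots; with the prime 101, the same keys spread over one hundred distinct slots.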
108
Collision resolution techniques
We will review a simple technique called chaining. However, some argue against this approach and point to other techniques, such as:
Linear probing: very simple. If position h(key) is occupied, do a linear search in the table until you find an empty slot. The slots are searched in this order: h(key), h(key)+1, h(key)+2, ..., h(key)+c
Quadratic probing: a variant of the above where the term being added to the hash result is squared: h(key) + c²
Random probing: another variant where the term being added to the hash result is a random number: h(key) + random()
Rehashing: a technique where a sequence of hash functions (h1, h2, ..., hk) is defined. If a collision occurs, the functions are applied in this order.
109
Collision resolution: open addressing (1)
Probing: if the table position given by the hashed key is already occupied, increase the position by some amount until an empty position is found.
Linear probing: increase by 1 each time [mod table size!]
Quadratic probing: to the original position, add 1, 4, 9, 16, ...
Use the collision resolution strategy both when inserting and when finding (ensure that the search key and the found key match).
May also double hash: instead of linear probing's fixed step, the probe step is the result of another hash function.
With open addressing, the table size should be double the expected number of elements.
110
Clustering
One problem with linear probing is that it results in clustering.
Clustering is the tendency of elements to become unevenly distributed in the hash table, with many elements bunching around a single hash location.
111
Collision resolution: open addressing (2)
Even if the table is fairly empty, collisions resolved by linear probing may cluster (group) keys/entries together. This increases the time to insert and to find.
(Diagram: a row of slots 1 2 3 4 5 6 7 8, with some slots already filled.)
For a table of size n: if the table is empty, the probability of the next entry going to any particular place is 1/n.
In the diagram, the probability of position 2 being filled next is 2/n (either a hash to 1 or to 2 fills it).
Once 2 is full, the probability of 4 being filled next is 4/n, and then of 7 being filled next is 7/n, i.e. the probability of getting long strings steadily increases.
112
Collision resolution: open addressing (3)
An empty key/entry marks the end of a cluster, and so can be used to terminate a find operation
So, if we remove an entry within a cluster, we should not empty it!
To allow probing to continue, the removed entry must be marked as ‘removed but cluster continues’
113
Collision resolution: open addressing (4)
Quadratic probing is a solution to the clustering problem.
Linear probing adds 1, 2, 3, etc. to the original hashed key.
Quadratic probing adds 1², 2², 3², etc. to the original hashed key.
However, whereas linear probing guarantees that all empty positions will be examined if necessary, quadratic probing does not.
e.g. table size 16 and original hashed key 3 gives the sequence: 3, 4, 7, 12, 3, 12, 7, 4, ...
More generally, with quadratic probing, insertion may be impossible if the table is more than half full! Then we need to rehash (see later).
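The coverage problem can be checked directly by enumerating the positions probed from a hashed key h: h, h+1², h+2², ... (mod table size). The helper below is an illustrative sketch:

```cpp
#include <cassert>
#include <set>

// Collect the distinct positions examined by quadratic probing
// starting from hashed key h in a table of size n.
std::set<int> probedPositions(int h, int n, int tries) {
    std::set<int> seen;
    seen.insert(h % n);                 // the original position
    for (int i = 1; i <= tries; ++i)
        seen.insert((h + i * i) % n);   // add 1^2, 2^2, 3^2, ... (mod n)
    return seen;
}
```

For the slide's example (table size 16, hashed key 3), only the four positions 3, 4, 7 and 12 are ever reached, no matter how long we probe; a prime table size reaches roughly half the table, which is why quadratic probing works while the table is at most half full.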
114
Collision resolution: chaining
Each slot of the hash table is a pointer to a linked list. Add keys and entries anywhere in the list (the front is easiest).
Advantages over open addressing:
Simpler insertion and removal (no need to change positions!)
Array size is not a limitation (but should still minimise collisions: make the table size roughly equal to the expected number of keys and entries)
Disadvantage:
Memory overhead is large if entries are small
(Diagram: an array of slots, each pointing to a chain of key/entry nodes.)
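A minimal sketch of such a chained table, assuming integer keys, front insertion, and a fixed 10-slot array (names like ChainedTable and Node are mine; error handling and node deallocation are omitted):

```cpp
#include <cassert>

const int NUM_SLOTS = 10;

struct Node {
    int key;
    Node* next;
};

// One linked list ("chain") per slot; insertion at the front is O(1).
struct ChainedTable {
    Node* slots[NUM_SLOTS] = { nullptr };

    int hash(int key) const { return key % NUM_SLOTS; }

    void insert(int key) {
        int h = hash(key);
        slots[h] = new Node{ key, slots[h] };   // new node becomes the head
    }

    bool contains(int key) const {
        for (Node* p = slots[hash(key)]; p != nullptr; p = p->next)
            if (p->key == key) return true;     // linear search of one chain
        return false;
    }
};
```

Note that removal from a chain needs no tombstones: unlinking a node cannot break any other key's search path, which is the "simpler removal" advantage listed above.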
115
Chaining
Chaining is another means (besides linear probing) of handling collisions that arise from the use of a hash function.
Chaining uses the hash value not as the actual location of the element, but as an index into an array of pointers. A chain is a linked list of elements that share the same hash location.
FOR EXAMPLE . . .
116
Using hashing and chaining
(Diagram: an array pointers[0..99]; each occupied slot points to a chain holding a four-digit part number.)
HandyParts company makes no more than 100 different parts, but the parts all have four-digit numbers. Use this hash function to store and retrieve parts in the chains:
Hash(key) = partNum % 100
117
Using chaining
Use the hash function
Hash(key) = partNum % 100
to place the element with part number 5502 in a chain.
(Diagram: 5502 % 100 = 2, so a node holding 5502 is linked into the chain at pointers[2].)
118
Using chaining
Next place part number 6702 in a chain.
Hash(key) = partNum % 100
6702 % 100 = 2
(Diagram: with chaining, the collision causes no problem: 6702 is simply linked into the chain at pointers[2], alongside 5502.)
119
Using chaining
Where would the part with number 4598 be placed using chaining?
(Diagram: the chain at pointers[2] now holds 5502 and 6702.)
120
More Chaining…….
121
Hashing(103)
h(103) = 103 mod 10 = 3
122
Hashing(103)
h(103) = 103 mod 10 = 3
(Diagram: table[3] now holds a chain containing 103.)
123
Hashing(69)
h(69) = 69 mod 10 = 9
(Diagram: table[3]: 103; table[9]: 69.)
124
Hashing(20)
h(20) = 20 mod 10 = 0
(Diagram: table[0]: 20; table[3]: 103; table[9]: 69.)
125
Hashing(13)
h(13) = 13 mod 10 = 3
(Diagram: 13 joins the chain at table[3], which now contains 103 and 13; table[0]: 20; table[9]: 69.)
126
Hashing(110)
h(110) = 110 mod 10 = 0
(Diagram: 110 joins the chain at table[0], which now contains 20 and 110.)
127
Hashing(53)
h(53) = 53 mod 10 = 3
(Diagram: 53 joins the chain at table[3], which now contains 103, 13 and 53.)
128
Final Hash Table
(Diagram: table[0]: 20, 110; table[3]: 103, 13, 53; table[9]: 69; all other slots empty.)
129
Searching in a Hash Table
Like any other structure, searching is a common task with hash tables.
Searching works as follows:
Given a target, hash the target.
Take the hash value of the target and go to that slot. If the target exists, it must be in this slot.
Search the list in that slot using a linear search.
130
Searching for 53
h(53) = 53 mod 10 = 3, so search the chain at table[3], which contains 103, 13 and 53.
(Slides 130–135 animate the traversal: a temp pointer starts at the head of the chain and advances node by node until its key matches 53. Found!)
136
hashSearch(n)
NodeType* hashSearch(NodeType* table[], int target)
{
    int index = hash(target);           // which chain the target would be in
    NodeType* temp = table[index];      // head of that chain
    return linearSearch(temp, target);  // walk the chain
}
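The hashSearch above relies on a linearSearch over one chain, which is not shown on the slides. Here is a plausible sketch, assuming NodeType holds an int info field and a next pointer (the real definition may differ):

```cpp
#include <cassert>
#include <cstddef>

// Assumed node layout; the slides do not show the real NodeType.
struct NodeType {
    int info;
    NodeType* next;
};

// Walk one chain looking for the target; return the matching node,
// or nullptr if the target is not in this chain.
NodeType* linearSearch(NodeType* head, int target) {
    for (NodeType* temp = head; temp != nullptr; temp = temp->next)
        if (temp->info == target) return temp;
    return nullptr;
}
```

On the chain 103 → 13 → 53 from the earlier slides, a search for 53 visits all three nodes and returns the last one; a search for an absent key falls off the end and returns nullptr.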
137
Rehashing: enlarging the table
To rehash:
Create a new table of double the size (adjusting until it is again prime).
Transfer the entries in the old table to the new table, recomputing their positions using the hash function.
When should we rehash?
When the table is completely full.
With quadratic probing, when the table is half full or insertion fails.
Why double the size?
If n is the number of elements in the table, there must have been n/2 insertions since the previous rehash (if rehashing is done when the table is full).
So by making the table size 2n, only a constant cost is added to each insertion.
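The "double, then adjust until prime" step can be sketched as below; the names isPrime and nextTableSize are illustrative, and trial division is used for simplicity:

```cpp
#include <cassert>

// Trial-division primality test: adequate for table-size magnitudes.
bool isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Next table size: double the old size, then adjust upward until prime.
int nextTableSize(int oldSize) {
    int n = 2 * oldSize;
    while (!isPrime(n)) ++n;
    return n;
}
```

Rehashing proper would then allocate the new table and re-insert every entry with the hash function taken modulo the new size.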
138
Comparison of collision techniques
(Graph: expected number of probes against load factor (n/size), comparing linear probing, random probing and chaining.)
139
Applications of Hashing Compilers use hash tables to keep track of declared variables A hash table can be used for on-line spelling checkers — if
misspelling detection (rather than correction) is important, an entire dictionary can be hashed and words checked in constant time
Game playing programs use hash tables to store seen positions, thereby saving computation time if the position is encountered again
Hash functions can be used to quickly check for inequality — if two elements hash to different values they must be different
Storing sparse data
140
When are other representations more suitable than hashing?
Hash tables are very good if there is a need for many searches in a reasonably stable table
Hash tables are not so good if there are many insertions and deletions, or if table traversals are needed — in this case, AVL trees are better
If there are more data than available memory then use a B-tree
Also, hashing is very slow for any operations which require the entries to be sorted e.g. Find the minimum key
141
Performance of Hashing
The number of probes depends on the load factor (usually denoted by λ), which represents the ratio of entries present in the table to the number of positions in the array.
We also need to consider successful and unsuccessful searches separately.
For a chained hash table, the average number of probes for an unsuccessful search is λ, and for a successful search it is 1 + λ/2.
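As a quick worked instance of these formulas, take a chained table with a load factor of one half:

```latex
\lambda = \frac{\text{entries}}{\text{positions}} = 0.5
\;\Rightarrow\;
\text{unsuccessful search: } \lambda = 0.5 \text{ probes on average},\qquad
\text{successful search: } 1 + \frac{\lambda}{2} = 1.25 \text{ probes on average}
```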
142
Performance of Hashing (2)
For open addressing, the formulae are more complicated, but typical values are:

Load factor            0.1    0.5    0.8    0.9    0.99
Successful search
  Linear probes        1.05   1.6    3.4    6.2    21.3
  Quadratic probes     1.04   1.5    2.1    2.7    5.2
Unsuccessful search
  Linear probes        1.13   2.7    15.4   59.8   430
  Quadratic probes     1.13   2.2    5.2    11.9   126

Note that these values do not depend on the size of the array or on the number of entries present, but only on their ratio (the load factor).
143
Summary
Hash tables store a collection of records with keys.
The location of a record depends on the hash value of the record's key.
When a collision occurs, the next available location is used.
Searching for a particular key is generally quick.
When an item is deleted, the location must be marked in a special way, so that searches know the spot used to be occupied.