Introduction to Hash Tables
Consider this array of Student records:
[0] Bing
[1] David
[2] Ina
[3] Abhinav
[4] Erik
[5] Hyun
[6] Jim
[7] Fiona
[8] Gheeta
[9] Chelsea
I can easily loop through all the student records by using a for loop. But if I want to access Jim’s record only, I have to start at 0 and loop through the array until I find it. With a big array this could be rather inefficient. Is there a better way?
Arrays: sequential access good, direct access bad.
Remember! The array elements just hold references to the objects, not the objects themselves!
Hash tables: sequential access bad, direct access good.
[0] Bing
[1] David
[2] Ina
[3] Abhinav
[4] Erik
[5] Hyun
[6] Jim
[7] Fiona
[8] Gheeta
[9] Chelsea
Hashing function: Jim’s student ID no. → “6”
The student records are stored in an array. The position in the array at which a particular student is held is determined by the hashing function.
The hashing function takes some value, e.g. a name, or, as here, a student id number, and translates it into an array index. So if we want to find Jim’s record we just give his id number to the hashing function and it tells us where his record is located. We don’t need to search through the records. This is direct access.
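To make the contrast between sequential and direct access concrete, here is a minimal sketch in Java. All names (Student, StudentLookup, findSequential, findDirect) are hypothetical, and the hashing function shown, the remainder of the ID divided by the array size, is just one simple stand-in that assumes non-negative IDs. Collisions are not handled here.

```java
// Hypothetical names throughout; the student IDs used with this sketch are made up.
class Student {
    final int id;
    final String name;

    Student(int id, String name) {
        this.id = id;
        this.name = name;
    }
}

class StudentLookup {
    // Sequential access: start at index 0 and check each element in turn.
    static Student findSequential(Student[] records, int studentId) {
        for (Student s : records) {
            if (s != null && s.id == studentId) {
                return s;
            }
        }
        return null;
    }

    // Stand-in hashing function (an assumption): translates a
    // non-negative student ID into an array index.
    static int hash(int studentId, int arraySize) {
        return studentId % arraySize;
    }

    // Store the record at the index chosen by the hashing function.
    // Collisions are ignored in this sketch: a later record overwrites.
    static void put(Student[] table, Student s) {
        table[hash(s.id, table.length)] = s;
    }

    // Direct access: one hash computation and one array read, no loop.
    static Student findDirect(Student[] table, int studentId) {
        Student s = table[hash(studentId, table.length)];
        return (s != null && s.id == studentId) ? s : null;
    }
}
```

With a 10-element table and a made-up ID of 1236 for Jim, put stores his record at index 6, and findDirect goes straight to that index without looking at any other element.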
Collisions
What happens if the hashing function gives the same array index for two different students?
This does happen, and it is called a collision. There are a number of ways of dealing with collisions, the details of which you don’t need to know. What you do need to know is that the performance of hash tables degrades over time, as more records are added and collisions become more frequent.
Hashing function: Hiro’s student ID no. → “6”, the index that already holds Jim. Collision!
Collisions
Hashing function: Erik’s student ID no. → “4”. Erik goes into index [4].
Hashing function: David’s student ID no. → “1”. David goes into index [1].
Hashing function: Hyun’s student ID no. → “4”. Collision! Hyun goes into the next available index.
If there had already been a lot of records in the array when the collision happened, Hyun may have been pushed a long way down the array.
Later, when we try to access Hyun’s record, the hashing function still gives us 4 as the place to find him. But he’s not there! So we have to do a sequential search from index number 4, through the array, to find him. This is the reason that hash table performance degrades over time.
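The "next available index" strategy described here can be sketched as follows. This is an illustration under stated assumptions, not the only way to handle collisions: the class name ProbingTable, the made-up student IDs, the remainder-based hashing function, and the choice to wrap around to index 0 at the end of the array are all assumptions not taken from the slides.

```java
// Hypothetical sketch of "put it in the next available index".
class ProbingTable {
    private final int[] ids;       // student ID stored in each slot
    private final String[] names;  // null means the slot is empty

    ProbingTable(int size) {
        ids = new int[size];
        names = new String[size];
    }

    // Assumed hashing function: remainder of a non-negative ID.
    int hash(int studentId) {
        return studentId % names.length;
    }

    // Returns the index where the record ended up, or -1 if the table is full.
    int put(int studentId, String name) {
        int start = hash(studentId);
        for (int step = 0; step < names.length; step++) {
            int i = (start + step) % names.length; // wrap-around is an assumption
            if (names[i] == null || ids[i] == studentId) {
                ids[i] = studentId;
                names[i] = name;
                return i;
            }
        }
        return -1;
    }

    // Starts at the hashed index and searches sequentially from there.
    String get(int studentId) {
        int start = hash(studentId);
        for (int step = 0; step < names.length; step++) {
            int i = (start + step) % names.length;
            if (names[i] == null) {
                return null; // reached an empty slot: the record is not in the table
            }
            if (ids[i] == studentId) {
                return names[i];
            }
        }
        return null;
    }
}
```

With a 10-element table and made-up IDs 14 for Erik and 24 for Hyun, both hash to 4. Erik takes index 4, so Hyun is stored at index 5, and a later get(24) has to step from index 4 to index 5 to find him. The more occupied slots there are after the hashed index, the longer that sequential search becomes.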
The Hashing Algorithm
The simplest way to translate the Student ID into an array index is to use the modulo operator (% in Java). The modulo operator returns the remainder of a division operation, for example 11 % 4 = 3.
Question: If we have an array of 10 elements, what do we need to mod our Student IDs by to be sure of getting some value from 0 to 9?
Answer: 10. An array of 10 elements has indices 0 to 9, and the remainder after dividing by 10 is always in that range.
Question: Let’s say we have an array of size N. Now what do we need to mod our Student IDs by?
Answer: N
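As a quick check of the arithmetic, here is a small sketch. The class and method names (ModuloIndex, indexFor) are hypothetical, and non-negative IDs are assumed. For such IDs, studentId % arraySize always falls between 0 and arraySize - 1, which are exactly the valid indices of an array of that size.

```java
// Hypothetical helper: maps a non-negative student ID to an array index.
class ModuloIndex {
    static int indexFor(int studentId, int arraySize) {
        // The remainder of division by arraySize is in the range 0 .. arraySize - 1.
        return studentId % arraySize;
    }
}
```

For example, indexFor(11, 4) gives 3, matching 11 % 4 = 3, and a made-up ID of 1236 in a 10-element array gives index 6.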
What happens if we don’t have a numerical Student ID to use? Say we only have their name? Well, we just convert the string into some numerical value using one of several methods. MD5 is a common method; you give it text, it gives you a 128-bit number. The important thing is that we get an even distribution of entries into the array to minimize collisions.
MD5 is also used to verify copies of documents because even if one character has changed during the copying, the number that MD5 returns will be totally different.
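Here is a minimal sketch of turning a name into an array index with MD5, using Java's standard MessageDigest class. The class and method names (NameHash, md5, indexFor) are hypothetical, and reducing the 128-bit number with a remainder by the array size is an assumption carried over from the numeric-ID case.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: hashes a name with MD5, then maps it to an array index.
class NameHash {
    // Returns the MD5 digest of the text as a non-negative 128-bit number.
    static BigInteger md5(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest); // signum 1: treat the bytes as unsigned
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Assumption: reduce the large number to an index with a remainder.
    static int indexFor(String name, int arraySize) {
        return md5(name).mod(BigInteger.valueOf(arraySize)).intValue();
    }
}
```

Calling md5("Jim") and md5("Jin") gives two different numbers even though the inputs differ by a single character, which is the property that makes MD5 useful for checking copies of documents.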