© 2010 Pearson Addison-Wesley. All rights reserved.

Addison Wesley

is an imprint of

CHAPTER 14:

Hashing

Java Software Structures:

Designing and Using Data Structures

Third Edition

John Lewis & Joseph Chase

Chapter Objectives

• Define hashing

• Examine various hashing functions

• Examine the problem of collisions in hash tables

• Explore the Java Collections API implementations of hashing

Hashing

• In the collections we have discussed thus far, we have made one of three assumptions regarding order:

– Order is unimportant

– Order is determined by the way elements are added

– Order is determined by comparing the values of the elements

Hashing

• Hashing is the concept that order is determined by some function of the value of the element to be stored

– More precisely, the location within the collection is determined by some function of the value of the element to be stored

Hashing

• In hashing, elements are stored in a hash table with their location in the table determined by a hashing function

• Each location in the hash table is called a cell or a bucket

Hashing

• Consider a simple example where we create an array that will hold 26 elements

• Wishing to store names in our array, we create a simple hashing function that uses the first letter of each name
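
The first-letter idea can be sketched in a few lines of Java. This is only an illustration: the class name `FirstLetterHash` and the sample names are assumptions, and the sketch assumes every name starts with an ASCII letter.

```java
// Hypothetical sketch: hash a name to one of 26 cells by its first letter.
class FirstLetterHash {
    static final int TABLE_SIZE = 26;

    // Assumes the name is non-empty and starts with a letter A-Z or a-z.
    static int hash(String name) {
        return Character.toUpperCase(name.charAt(0)) - 'A';
    }

    public static void main(String[] args) {
        String[] table = new String[TABLE_SIZE];
        for (String name : new String[] {"Ann", "Doug", "Zack"}) {
            table[hash(name)] = name;   // no collision handling yet
        }
        System.out.println(hash("Doug"));  // prints 3
    }
}
```

Two names with the same first letter would land in the same cell, which is the collision problem taken up later in the chapter.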

Figure: A simple hashing example

Hashing

• Notice that using a hashing approach results in the access time to a particular element being independent of the number of elements in the table

• This means that all of the operations on an element of a hash table should be O(1)

• However, this efficiency is only realized if each element maps to a unique position in the table

Hashing

• A collision occurs when two or more elements map to the same location

• A hashing function that maps each element to a unique location is called a perfect hashing function

• While it is possible in some cases to develop a perfect hashing function, it is not required: a function that does a good job of distributing the elements in the table will still result in O(1) operations

Hashing

• Another issue surrounding hashing is how large the table should be

• If we are assured of a dataset of size n and a perfect hashing function, then the table should be of size n

• If a perfect hashing function is not available but we know the size of the dataset (n), then a good rule of thumb is to make the table 150% of the size of the dataset

Hashing

• The third case is where we do not know the size of the dataset

• In this case, we depend upon dynamic resizing

• Dynamic resizing of a hash table involves creating a new, larger, hash table and then inserting all of the elements of the old table into the new one

Hashing

• Deciding when to resize is the key to dynamic resizing

• One possibility is simply to resize when the table is full

– However, it is the nature of hash tables that their performance seriously degrades as they become full

• A better approach is to use a load factor

• The load factor of a hash table is the percentage occupancy of the table at which the table will be resized
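
A minimal sketch of load-factor-driven resizing follows. The class name `ResizingTable`, the 0.75 threshold, the initial size of 4, and the doubling policy are all assumptions chosen for illustration; each cell holds a small list purely to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: resize when occupancy would exceed the load factor.
class ResizingTable {
    private static final double LOAD_FACTOR = 0.75;  // assumed threshold
    private List<List<String>> cells = newCells(4);   // assumed initial size
    private int count = 0;

    private static List<List<String>> newCells(int n) {
        List<List<String>> fresh = new ArrayList<>();
        for (int i = 0; i < n; i++) fresh.add(new ArrayList<>());
        return fresh;
    }

    private static int indexFor(String s, int size) {
        return Math.floorMod(s.hashCode(), size);
    }

    void add(String s) {
        if ((double) (count + 1) / cells.size() > LOAD_FACTOR) resize();
        cells.get(indexFor(s, cells.size())).add(s);
        count++;
    }

    // Create a new, larger table and insert every old element into it.
    private void resize() {
        List<List<String>> old = cells;
        cells = newCells(old.size() * 2);
        for (List<String> cell : old) {
            for (String s : cell) {
                cells.get(indexFor(s, cells.size())).add(s);
            }
        }
    }

    boolean contains(String s) {
        return cells.get(indexFor(s, cells.size())).contains(s);
    }

    int capacity() { return cells.size(); }
    int size() { return count; }
}
```

Every element has to be rehashed during a resize because its index depends on the table size.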

Hashing Functions - Extraction

• There is a wide variety of hashing functions that provide good distribution of various types of data

• The method we used in our earlier example is called extraction

• Extraction involves using only a part of an element’s value or key to compute the location at which to store the element

• In our previous example, we simply extracted the first letter of a string and computed its value relative to the letter A

Hashing Functions - Division

• Creating a hashing function by division simply means we will use the remainder of the key divided by some positive integer p

Hashcode(key) = Math.abs(key) % p

• This function will yield a result in the range 0 to p-1

• If we use our table size as p, we then have an index that maps directly to the table
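
The division method translates almost directly into Java. In the sketch below, `hash` follows the formula above, while `safeHash` is a swapped-in variant (not from the slides) that uses `Math.floorMod`, because `Math.abs(Integer.MIN_VALUE)` is still negative in Java. The class and method names are assumptions.

```java
// Hypothetical sketch of the division method.
class DivisionHash {
    // Direct transcription of Hashcode(key) = Math.abs(key) % p.
    static int hash(int key, int p) {
        return Math.abs(key) % p;
    }

    // Variant: floorMod always yields a value in 0..p-1 for positive p,
    // including for Integer.MIN_VALUE. It may pick a different (but still
    // valid) index than hash() for negative keys.
    static int safeHash(int key, int p) {
        return Math.floorMod(key, p);
    }

    public static void main(String[] args) {
        System.out.println(hash(1234, 101));  // prints 22
    }
}
```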

Hashing Functions - Folding

• In the folding method, the key is divided into parts that are then combined or folded together to create an index into the table

• This is done by first dividing the key into parts where each of the parts of the key will be the same length as the desired index

• In the shift folding method, these parts are then added together to create the index

– Using the SSN 987-65-4321 we divide into three parts 987, 654, 321, and then add these together to get 1962

– We would then use either division or extraction to get a three digit index

Hashing Functions - Folding

• In the boundary folding method, some parts of the key are reversed before adding

• Using the same example of a SSN 987-65-4321

– divide the SSN into parts 987, 654, 321

– reverse the middle element 987, 456, 321

– add yielding 1764

– then use either division or extraction to get a three digit index

• Folding may also be used on strings
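
Both folding variants can be sketched over a string of digits. The names `FoldingHash`, `shiftFold`, and `boundaryFold` are assumptions, and the sketch assumes that boundary folding reverses every second part, which matches the SSN example but is only one possible rule.

```java
// Hypothetical sketch of shift folding and boundary folding on a digit string.
class FoldingHash {
    // Shift folding: split into parts of partLength digits and add them.
    static int shiftFold(String digits, int partLength) {
        int sum = 0;
        for (int i = 0; i < digits.length(); i += partLength) {
            int end = Math.min(i + partLength, digits.length());
            sum += Integer.parseInt(digits.substring(i, end));
        }
        return sum;
    }

    // Boundary folding: as above, but reverse every second part first
    // (assumed rule; it reproduces 987 + 456 + 321).
    static int boundaryFold(String digits, int partLength) {
        int sum = 0;
        boolean reverse = false;
        for (int i = 0; i < digits.length(); i += partLength) {
            int end = Math.min(i + partLength, digits.length());
            String part = digits.substring(i, end);
            if (reverse) {
                part = new StringBuilder(part).reverse().toString();
            }
            sum += Integer.parseInt(part);
            reverse = !reverse;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(shiftFold("987654321", 3));     // prints 1962
        System.out.println(boundaryFold("987654321", 3));  // prints 1764
    }
}
```

Applying the division method with p = 1000 to either sum would then give a three digit index, for example 1962 % 1000 = 962.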

Hashing Functions - Mid-square

• In the mid-square method, the key is multiplied by itself and then the extraction method is used to extract the needed number of digits from the middle of the result

• For example, if our key is 4321

– Multiply the key by itself yielding 18671041

– Extract the needed three digits

• It is critical that the same three digits be extracted each time

• We may also extract bits and then reconstruct an index from the bits
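
As a rough sketch of the mid-square method, the code below squares the key and keeps three digits from the middle of the result. Which three digits to keep is a design choice the slides leave open; dropping the last two digits and keeping the next three is an assumption, as is the class name.

```java
// Hypothetical sketch of the mid-square method.
class MidSquareHash {
    static int hash(int key) {
        long square = (long) key * key;         // long avoids int overflow
        return (int) ((square / 100) % 1000);   // same three digits every time
    }

    public static void main(String[] args) {
        // 4321 * 4321 = 18671041; dropping "41" leaves 186710, whose last
        // three digits are 710.
        System.out.println(hash(4321));  // prints 710
    }
}
```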

Hashing Functions - Radix Transformation

• In the radix transformation method, the key is transformed into another numeric base

• For example, if our key is 23 in base 10, we might convert it to 32 in base 7

• We then use the division method and divide the converted key by the table size and use the remainder as our index
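
One way to sketch radix transformation in Java is to print the key in the new base and read the resulting digits back as a decimal number. The names below are assumptions, and the sketch assumes small non-negative keys and a radix between 2 and 10 so that the converted digits still form a valid decimal number that fits in an int.

```java
// Hypothetical sketch of radix transformation followed by division.
class RadixHash {
    // Rewrites key in the given radix and reads the digits as a decimal
    // number, for example 23 in base 7 becomes 32.
    static int convert(int key, int radix) {
        return Integer.parseInt(Integer.toString(key, radix));
    }

    static int hash(int key, int radix, int tableSize) {
        return convert(key, radix) % tableSize;
    }

    public static void main(String[] args) {
        System.out.println(convert(23, 7));   // prints 32
        System.out.println(hash(23, 7, 11));  // prints 10
    }
}
```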

Hashing Functions - Digit Analysis

• In the digit analysis method, the index is formed by extracting and then manipulating specific digits from the key

• For example, if our key is 1234567, we might select the digits in positions 2 through 4 yielding 234

• The manipulation can then take many forms

– Reversing the digits (432)

– Performing a circular shift to the right (423)

– Performing a circular shift to the left (342)

– Swapping each pair of digits (324)
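
The four manipulations listed above can be sketched as small string operations. The class and method names are assumptions, and positions are treated as 1-based and inclusive to match the example.

```java
// Hypothetical sketch of digit analysis on the decimal digits of a key.
class DigitAnalysis {
    // Selects digits from position 'from' through 'to' (1-based, inclusive).
    static String select(int key, int from, int to) {
        return Integer.toString(key).substring(from - 1, to);
    }

    static String reverse(String d) {
        return new StringBuilder(d).reverse().toString();
    }

    // Circular shift right: the last digit moves to the front.
    static String shiftRight(String d) {
        return d.charAt(d.length() - 1) + d.substring(0, d.length() - 1);
    }

    // Circular shift left: the first digit moves to the end.
    static String shiftLeft(String d) {
        return d.substring(1) + d.charAt(0);
    }

    // Swaps digits pairwise; a trailing unpaired digit stays in place.
    static String swapPairs(String d) {
        char[] c = d.toCharArray();
        for (int i = 0; i + 1 < c.length; i += 2) {
            char tmp = c[i];
            c[i] = c[i + 1];
            c[i + 1] = tmp;
        }
        return new String(c);
    }

    public static void main(String[] args) {
        String digits = select(1234567, 2, 4);
        System.out.println(digits);             // prints 234
        System.out.println(swapPairs(digits));  // prints 324
    }
}
```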

Hashing Functions - Length-Dependent

• In the length-dependent method, the key and the length of the key are combined in some way to form either the index itself or an intermediate version

• For example, if our key is 8765, we might multiply the first two digits by the length and then divide by the last digit yielding 69

• If our table size is 43, we would then use the division method resulting in an index of 26
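
The arithmetic in this example can be checked with a short sketch. The names are assumptions, integer division is assumed (348 / 5 gives 69), and the sketch assumes a positive key with at least two digits and a non-zero last digit.

```java
// Hypothetical sketch of the length-dependent example.
class LengthDependentHash {
    // (first two digits * number of digits) / last digit, integer division.
    static int intermediate(int key) {
        String digits = Integer.toString(key);
        int firstTwo = Integer.parseInt(digits.substring(0, 2));
        int lastDigit = digits.charAt(digits.length() - 1) - '0';
        return firstTwo * digits.length() / lastDigit;
    }

    // Division method applied to the intermediate value.
    static int hash(int key, int tableSize) {
        return intermediate(key) % tableSize;
    }

    public static void main(String[] args) {
        System.out.println(intermediate(8765));  // prints 69
        System.out.println(hash(8765, 43));      // prints 26
    }
}
```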

Hashing Functions in the Java Language

• The java.lang.Object class defines a method called hashCode that returns an integer, typically derived from the memory location of the object

• This default is generally not very useful for hashing, because two distinct objects with equal contents will usually receive different hash codes

• Classes that are derived from Object often override the inherited definition of hashCode to provide their own version

• For example, the String and Integer classes define their own hashCode methods

• These more specific hashCode functions are more effective
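
The sketch below shows a class overriding `hashCode` so that equal objects hash alike. The `Point` class is a made-up example, and `java.util.Objects.hash` is just one convenient way to combine fields; `String.hashCode` and `Integer.hashCode` are the standard library methods mentioned above.

```java
import java.util.Objects;

// Hypothetical value class that overrides equals and hashCode together.
class Point {
    private final int x;
    private final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Point)) return false;
        Point p = (Point) other;
        return x == p.x && y == p.y;
    }

    // Equal points must produce equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }
}

class HashCodeDemo {
    public static void main(String[] args) {
        System.out.println("abc".hashCode());               // prints 96354
        System.out.println(Integer.valueOf(42).hashCode()); // prints 42
        System.out.println(new Point(1, 2).hashCode() == new Point(1, 2).hashCode()); // prints true
    }
}
```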

Hashing - Resolving Collisions

• If we are able to develop a perfect hashing function, then we do not need to be concerned about collisions or table size

• However, often we do not know the size of the dataset and are not able to develop a perfect hashing function

• In these cases, we must choose a method for handling collisions

Resolving Collisions - Chaining

• The chaining method simply treats the hash table conceptually as a table of collections rather than a table of individual elements

• Thus each cell is a pointer to the collection associated with that location in the table

• Usually, this internal collection is either an unordered list or an ordered list

• Chaining can be accomplished using a linked structure or an array-based structure with an overflow area
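
A minimal sketch of the linked-structure form of chaining is shown below. The class name `ChainedTable` is an assumption, each cell refers to an unordered `java.util.LinkedList`, and the index is taken from `String.hashCode` with the division method.

```java
import java.util.LinkedList;

// Hypothetical sketch: each cell refers to a list of the elements hashed there.
class ChainedTable {
    private final LinkedList<String>[] cells;

    @SuppressWarnings("unchecked")
    ChainedTable(int size) {
        cells = new LinkedList[size];   // all cells start out null (empty)
    }

    private int indexFor(String s) {
        return Math.floorMod(s.hashCode(), cells.length);
    }

    void add(String s) {
        int i = indexFor(s);
        if (cells[i] == null) {
            cells[i] = new LinkedList<>();
        }
        cells[i].add(s);                // colliding elements share the list
    }

    boolean contains(String s) {
        LinkedList<String> chain = cells[indexFor(s)];
        return chain != null && chain.contains(s);
    }
}
```

With this layout a collision only lengthens one list, so the table never becomes full, although long lists make lookups slower.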

Figure: The chaining method of collision handling

Figure: Chaining using an overflow area

Resolving Collisions - Open Addressing

• The open addressing method looks for another open position in the table other than the one to which the element is originally hashed

• The simplest of these methods is linear probing

• In linear probing, if an element hashes to position p and that position is occupied, we simply try position (p+1)%s where s is the size of the table

• One problem with linear probing is the development of clusters of occupied cells called primary clusters
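
Linear probing can be sketched as a loop that steps through (p+1)%s until it finds an open cell. The class name and the first-letter home position are assumptions made so the collisions are easy to follow; names are assumed to start with an ASCII letter.

```java
// Hypothetical sketch of open addressing with linear probing.
class LinearProbingTable {
    private final String[] cells;

    LinearProbingTable(int size) {
        cells = new String[size];
    }

    // Assumed hashing function: first letter, reduced to the table size.
    private int home(String s) {
        return (Character.toUpperCase(s.charAt(0)) - 'A') % cells.length;
    }

    // Returns the position where the element was finally stored.
    int add(String s) {
        int p = home(s);
        for (int tries = 0; tries < cells.length; tries++) {
            if (cells[p] == null) {
                cells[p] = s;
                return p;
            }
            p = (p + 1) % cells.length;   // try the next position, wrapping
        }
        throw new IllegalStateException("table is full");
    }

    boolean contains(String s) {
        int p = home(s);
        for (int tries = 0; tries < cells.length && cells[p] != null; tries++) {
            if (cells[p].equals(s)) return true;
            p = (p + 1) % cells.length;
        }
        return false;
    }
}
```

In a table of size 5, adding "Ann", "Al", and "Amy" fills positions 0, 1, and 2, so "Bob" (home position 1) is pushed to position 3. That run of occupied cells is a small primary cluster.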

Figure: Open addressing using linear probing

Resolving Collisions - Open Addressing

• A second form of open addressing is quadratic probing

• In this method, once we have a collision, the next index to try is computed by

$\mathrm{newhashcode}(x) = \mathrm{hashcode}(x) + (-1)^{i-1}\left((i+1)/2\right)^{2}$

• Quadratic probing helps to eliminate the problem of primary clusters
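
The probe offsets produced by this formula can be sketched as follows. Several readings are assumptions here: i is taken to be the attempt number starting at 1, (i+1)/2 is taken as integer division, and the result is wrapped to the table size with `Math.floorMod`, which the formula itself does not show.

```java
// Hypothetical sketch of the quadratic probing sequence.
class QuadraticProbing {
    // Offset for the i-th attempt: +1, -1, +4, -4, +9, -9, ...
    static int offset(int i) {
        int half = (i + 1) / 2;              // integer division (assumed)
        int sign = (i % 2 == 1) ? 1 : -1;    // (-1)^(i-1)
        return sign * half * half;
    }

    // Position to try on the i-th attempt, wrapped into 0..tableSize-1.
    static int probe(int home, int i, int tableSize) {
        return Math.floorMod(home + offset(i), tableSize);
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 4; i++) {
            System.out.println(offset(i));   // prints 1, -1, 4, -4
        }
    }
}
```

Because consecutive probes jump away from the home position instead of walking cell by cell, colliding elements are less likely to pile up in one contiguous run.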

Figure: Open addressing using quadratic probing

Resolving Collisions - Open Addressing

• A third form of the open addressing method is double hashing

• In this method, collisions are resolved by supplying a secondary hashing function

• For example, if a key x hashes to a position p that is already occupied, then the next position p’ will be

p’ = p + secondaryhashcode(x)

• Given the same example, if we use a secondary hashing function that uses the length of the string, we get the following result

Figure: Open addressing using double hashing
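
Double hashing with a string-length secondary function can be sketched as below. The class name, the 26-cell table, the first-letter primary function, and the sample names are assumptions, as is wrapping p’ to the table size with the remainder operator.

```java
// Hypothetical sketch of open addressing with double hashing.
class DoubleHashingTable {
    private final String[] cells = new String[26];

    // Assumed primary function: first letter (names start with A-Z or a-z).
    static int primary(String s) {
        return Character.toUpperCase(s.charAt(0)) - 'A';
    }

    // Secondary function from the text: the length of the string.
    static int secondary(String s) {
        return s.length();
    }

    // Returns the position where the element was finally stored.
    int add(String s) {
        int p = primary(s);
        for (int tries = 0; tries < cells.length; tries++) {
            if (cells[p] == null) {
                cells[p] = s;
                return p;
            }
            p = (p + secondary(s)) % cells.length;   // p' = p + secondaryhashcode(x)
        }
        throw new IllegalStateException("no open cell found for " + s);
    }
}
```

With "Ann" already at position 0, "Andrew" moves 6 cells on to position 6 and "Al" moves 2 cells on to position 2, so elements that collide at the same home position still follow different probe paths. A step size that shares a factor with the table size will not visit every cell, which is why the loop gives up after a bounded number of tries.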

Deleting Elements from a Hash Table

• Removing an element from a chained implementation falls into one of five cases

• If the element we are attempting to remove is the only one mapped to the location, we simply remove the element by setting the table position to null

Deleting Elements from a Hash Table

• If the element we are attempting to remove is stored in the table but has an index into the overflow area

– Replace the element and the next index value in the table with the element and next index value of the array position pointed to by the element to be removed

– We then must also set the position in the overflow area to null and add it back to our list of free overflow cells

Deleting Elements from a Hash Table

• If the element we are attempting to remove is at the end of the list of elements stored at that location in the table

– Set its position in the table to null

– Set the next pointer of the previous element to null

– Add that position to the list of free overflow cells

Deleting Elements from a Hash Table

• If the element we are attempting to remove is in the middle of the list of elements stored at that location

– Set its position in the overflow area to null

– Set the next pointer of the previous element to point to the next element of the element being removed

– Add the position back to the list of free overflow cells


• If the element we are attempting to remove is not in the table, then we must throw an exception
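
The five removal cases can be sketched against an array-based table with an overflow area. Everything about the layout here is an assumption made for illustration: main cells and overflow cells share one array, a parallel `next` array holds the chain links, free overflow cells are kept on a stack, the home position comes from the first letter, and a missing element raises `NoSuchElementException`.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.NoSuchElementException;

// Hypothetical sketch: chaining with an overflow area, including removal.
class OverflowChainedTable {
    private static final int NONE = -1;
    private final int tableSize;
    private final String[] elem;   // main cells, then overflow cells
    private final int[] next;      // next cell in the chain, or NONE
    private final Deque<Integer> freeOverflow = new ArrayDeque<>();

    OverflowChainedTable(int tableSize, int overflowSize) {
        this.tableSize = tableSize;
        elem = new String[tableSize + overflowSize];
        next = new int[tableSize + overflowSize];
        Arrays.fill(next, NONE);
        for (int i = tableSize + overflowSize - 1; i >= tableSize; i--) {
            freeOverflow.push(i);
        }
    }

    private int home(String s) {
        return (Character.toUpperCase(s.charAt(0)) - 'A') % tableSize;
    }

    void add(String s) {
        int h = home(s);
        if (elem[h] == null) { elem[h] = s; return; }
        if (freeOverflow.isEmpty()) throw new IllegalStateException("overflow area is full");
        int cell = freeOverflow.pop();
        elem[cell] = s;
        int last = h;
        while (next[last] != NONE) last = next[last];
        next[last] = cell;                       // append to the end of the chain
    }

    boolean contains(String s) {
        int h = home(s);
        if (elem[h] == null) return false;
        for (int cur = h; cur != NONE; cur = next[cur]) {
            if (elem[cur].equals(s)) return true;
        }
        return false;
    }

    void remove(String s) {
        int h = home(s);
        if (elem[h] == null) throw new NoSuchElementException(s);   // not in the table
        if (elem[h].equals(s)) {
            if (next[h] == NONE) {
                elem[h] = null;                  // only element at this location
            } else {
                int n = next[h];                 // pull the next element into the table
                elem[h] = elem[n];
                next[h] = next[n];
                release(n);
            }
            return;
        }
        int prev = h;
        int cur = next[h];
        while (cur != NONE && !elem[cur].equals(s)) {
            prev = cur;
            cur = next[cur];
        }
        if (cur == NONE) throw new NoSuchElementException(s);       // not in the table
        next[prev] = next[cur];                  // end of list: NONE; middle: skips cur
        release(cur);
    }

    // Clears an overflow cell and returns it to the free list.
    private void release(int cell) {
        elem[cell] = null;
        next[cell] = NONE;
        freeOverflow.push(cell);
    }
}
```

In this sketch the end-of-list and middle-of-list cases collapse into the same two statements, because copying the removed cell's next index into the previous cell yields either NONE or the following cell.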

Deleting Elements from a Hash Table

• If we have chosen to implement our table using open addressing, the deletion creates more of a challenge

• This is because we cannot remove a deleted item without possibly affecting our ability to find items that collided with it on entry

• The solution to this problem is to mark items as deleted but not actually remove them from the table until some future point when either the deleted item is overwritten or the entire table is rehashed
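
Marking items as deleted can be sketched with a sentinel value in a linear probing table. The class name, the `DELETED` marker object, and the first-letter home position are assumptions; rehashing the whole table is not shown.

```java
// Hypothetical sketch of deletion under open addressing using a marker.
class LazyDeleteTable {
    // Unique marker object, compared by identity rather than by content.
    private static final String DELETED = new String("<deleted>");
    private final String[] cells;

    LazyDeleteTable(int size) {
        cells = new String[size];
    }

    private int home(String s) {
        return (Character.toUpperCase(s.charAt(0)) - 'A') % cells.length;
    }

    // Returns the position used; a cell marked deleted may be overwritten.
    int add(String s) {
        int p = home(s);
        for (int tries = 0; tries < cells.length; tries++) {
            if (cells[p] == null || cells[p] == DELETED) {
                cells[p] = s;
                return p;
            }
            p = (p + 1) % cells.length;
        }
        throw new IllegalStateException("table is full");
    }

    // Probing continues past deleted cells and stops only at a null cell.
    private int find(String s) {
        int p = home(s);
        for (int tries = 0; tries < cells.length && cells[p] != null; tries++) {
            if (cells[p] != DELETED && cells[p].equals(s)) return p;
            p = (p + 1) % cells.length;
        }
        return -1;
    }

    boolean contains(String s) {
        return find(s) >= 0;
    }

    boolean remove(String s) {
        int p = find(s);
        if (p < 0) return false;
        cells[p] = DELETED;   // mark instead of setting the cell to null
        return true;
    }
}
```

If "Al" were simply set to null after "Ann", "Al", and "Amy" were added, a search for "Amy" would stop at the empty cell and miss it; the marker keeps the probe path intact.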

Figure: Open addressing and deletion