
Hash Tables

2

Exercise 2

/* Exercise 1 */
void mystery(int n) {
    int i, j, k;
    for (i = 1; i <= n - 1; i++) {
        for (j = i + 1; j <= n; j++) {
            for (k = 1; k <= j; k++) {
                /* Some statement taking O(1) time */
            }
        }
    }
}

3

Exercise 3

/* Exercise 2 */
void veryodd(int n) {
    int i, j, x, y;
    x = 0;
    y = 0;
    for (i = 1; i <= n; i++) {
        if (i % 2 == 1) {
            for (j = i; j <= n; j++) {
                x = x + 1;
            }
            for (j = 1; j <= i; j++) {
                y = y + 1;
            }
        }
    }
}

4

Consider www.google.com

Efficient searches: look up “laptop” in all web pages

How many web pages? How fast is the response?

5

Consider www.google.com

4 billion pages

Consider data structures: linked list, sorted linked list, array, sorted array, BST

6

Unsorted Linked List of n elem

List *searchList(List *a, int key) {
    if (a == NULL) return NULL;        // not found
    if (a->data == key) return a;      // found: return the node
    return searchList(a->next, key);   // keep looking in the rest of the list
}

Best, Average, Worst T(n) ?

7

Sorted Linked List of n elem

List *searchList(List *a, int key) {
    if (a == NULL) return NULL;        // not found
    if (a->data == key) return a;      // found: return the node
    return searchList(a->next, key);   // keep looking in the rest of the list
}


8

Unsorted Array of n elem

int seq_search(int n, int *a, int key) {
    int i = 0;
    while (i < n && a[i] != key) {
        i++;
    }
    return i;   /* index of key, or n if key is not in a */
}


9

Sorted Array of n elem

int binary_search(int n, int *a, int key) {
    int lo = -1;
    int hi = n;
    while (hi - lo != 1) {
        int mid = (hi + lo) / 2;
        if (a[mid] <= key) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    return lo;   /* index of the last element <= key, or -1 */
}


10

How about BST ?

Best O(1)

Average O(logn)

Worst O(n) – very imbalanced (tree degenerates to list)

11

Answer: Hash Tables

Search complexity is O(1) with a “good” hash function

Hash Table: A generalization of an array that, under some assumptions, allows O(1) Insert/Delete/Search

12

Intuition

How can you store all Student Numbers in an array?

• Use an array with range 0 - 999,999,999

• This gives O(1) access time, but with approx. 5000 students almost all array entries are wasted!

Problem: The range of key values (0-999,999,999) is too large compared to the number of keys (students)

13

Formal Definition

Hash Tables solve this problem by using a smaller array and mapping keys with a hash function.

Set of keys K and an array of size m. A hash function h is a function from K to 0…m-1, that is:

h : K → 0…m-1

14

Example Hash Function

[Figure: the keys 888999222 and 123456789 mapped by h into a table with slots 0-7]

15


For example, if we hash the student number keys into a hash table with 8 entries, we could use h(key) = key mod 8


16

Problem ?

Collisions: Two keys hash into the same array entry

h(888888888) = h(000000000) = 0, because key % 8 = 0 for both keys


17

Solution

• Hashing with Chaining (Open Hashing): every hash table entry contains a pointer to a linked list of the keys that hash to that entry

• Closed Hashing: every hash table entry contains only one key. If a new key hashes to an entry that is already filled, systematically examine other table entries until an empty one is found, and place the new key there

18

Hashing with Chaining (Open Hashing)

h (54) = 54 % 5 = 4 = h (34) – solved by CHAIN-ing

[Figure: table with slots 0-4, each pointing to a chain of (key, next) nodes; the chains hold the keys 2, 21, and 54 followed by 34]

19

Hashing with Chaining


Insert 101 – where does it hash to ?

20


h (101) = 101 % 5 = 1

[Figure: the table before and after Insert 101; 101 joins the chain in slot 1, which already holds 21]

21

Complexity Analysis

What is the running time to insert/search/delete?

• Insert: It takes O(1) time to compute the hash function and insert at the head of the linked list

• Search: It is proportional to the max linked list length

• Delete: Same as search

22

What is a “good” hash ?

uniform hashing: each key is equally likely to hash in any of the m slots

• Creating a “good” hash function is black magic !

How about when keys are student names ?

Interpret characters as numbers:

• (int)‘a’, (int)‘b’, (int)‘c’ means 97 98 99

• Ex. Hash for names: Name “abc” hashes to (‘a’+‘b’+‘c’) % m


25

Closed Hashing

The key is first mapped to a slot:

index = h(k)

If there is a collision, subsequent probes are performed.

When collision resolution is done as a linear search through the following slots, it is known as linear probing:

index = (index + 1) % m

26

Closed Hashing with Linear Probing

[Figure: hash table with slots 0-10 containing the keys 1001, 9537, 3016, 9874, 2009 and 9875]

H(k) = k % 11

Insert(1100) ?


29



H(k) = k % 11

Insert(1100) → 3

Same for keys that hash into 0, 1 or 2

Prob(insert_into_3) = ?

30



H(k) = k % 11

Insert(1100) → 3

Prob(insert_into_3) = 4/11, because keys with home position 0, 1, 2 or 3 all end up in slot 3





32



H(k) = k % 11

Assume: Insert(1052) → 10 (home position 1052 % 11 = 7; slots 7, 8 and 9 are occupied)


34

Problem: Clustering

Even with a good hash function, linear probing has its problems:

• The position of the initial mapping i_0 of key k is called the home of k.

• When several insertions map to the same home position, they end up placed contiguously in the table. This collection of keys with the same home position is called a cluster.

• As clusters grow, the probability that a key will map to the middle of a cluster increases, increasing the rate of the cluster’s growth. As these clusters grow, they merge with other clusters forming even bigger clusters which grow even faster.

• This tendency of linear probing to place items together is known as primary clustering.

35

Complexity Analysis – Worst Case

What is the running time to insert/search/delete?

• Insert: Same as search

• Search: It is proportional to the max number of probes

• Delete: Same as search

• Worst O(n)

36

Complexity Analysis

When hash table is empty – insert is in 1 step (in home position)

As the table fills up, the probability that a record can be inserted in 1 step decreases

More and more records are likely to be inserted far from their home position

37

Complexity Analysis - Intuition

The expected (avg.) cost of hash (insert/search/delete)

• is a function of how full the table is

38

The Load Factor

α = n/m

n is the number of entries in the hash table that are occupied; m is the size of the hash table.

α = 1 means the table is full, and α = 0 means the table is empty.

39

Complexity Analysis - Average Case

The load factor is α = n/m, where n is the current number of records

On average, the probability of finding the home position occupied is n/m

The probability of finding both the home position and the next position occupied is n/m * (n-1)/(m-1)

The probability of i collisions is:

• n/m * (n-1)/(m-1) * … * (n-i+1)/(m-i+1) ~ (n/m)^i

• probes = 1 + Σ_{i=1..N} (n/m)^i

40

Complexity Analysis Average Case

It can be shown that the number of probes in a successful search, C, and the number of probes in an unsuccessful search, C’ is given by:

[Formulas: expected number of probes C and C’ as functions of the load factor, given separately for separate chaining and for linear probing]

41

[Figure: average number of probes (y-axis, 2 to 20) versus load factor (x-axis, 0 to 1) for a successful search, with curves for linear probing, double hashing and separate chaining]

42

[Figure: average number of probes (y-axis, 2 to 20) versus load factor (x-axis, 0 to 1) for an unsuccessful search, with curves for linear probing, double hashing and separate chaining]

43

Insert Implementation

bool HashTable::hashInsert(const Elem &e) {
    int home = h(getkey(e));                        // home position of e
    int index = home;
    for (int i = 1; !is_empty(HT[index]); i++) {
        if (is_equal(e, HT[index])) return false;   // duplicate
        if (i == m) return false;                   // all m slots probed: table is full
        index = (home + i) % m;                     // follow probes
    }
    HT[index] = e;                                  // first empty slot on the probe sequence
    return true;
}

44

Search Implementation

bool HashTable::hashSearch(const Key &k, Elem &e) {
    int home = h(k);                                // home position of k
    int index = home;
    // Stop at an empty slot, at a match, or once all m slots have been probed
    for (int i = 1;
         i < m && !is_empty(HT[index]) && !is_equal(k, HT[index]);
         i++)
        index = (home + i) % m;                     // follow probes
    if (!is_empty(HT[index]) && is_equal(k, HT[index])) {   // found it
        e = HT[index];
        return true;
    }
    return false;                                   // k is not in the table
}
