
1

Yet More on Indexes

Hash Tables

Source: our textbook, slides by Hector Garcia-Molina

2

Main Memory Hash Tables

A hash function h maps search keys to integers in the range 0 to B-1, where B is the number of buckets

There is a B-element array; each entry holds a pointer to a linked list

A record with key k is put in the linked list that starts at entry h(k) of the array

3

Example of Hash Table

B = 5

h(k) = k mod 5

Bucket 0: 15, 10
Bucket 1: (empty)
Bucket 2: 22
Bucket 3: (empty)
Bucket 4: 104, 29, 34
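A minimal sketch of the main-memory table, assuming integer keys and using Python lists in place of linked lists. The class name `ChainedHashTable` and its methods are illustrative, not from the slides. It is loaded with the keys of the example (B = 5, h(k) = k mod 5).

```python
class ChainedHashTable:
    """Main-memory hash table: a B-element array of chains."""

    def __init__(self, num_buckets):
        self.B = num_buckets
        # Python lists stand in for the linked lists of the slides.
        self.buckets = [[] for _ in range(num_buckets)]

    def h(self, key):
        # Hash function of the example: h(k) = k mod B.
        return key % self.B

    def insert(self, key):
        self.buckets[self.h(key)].append(key)

    def lookup(self, key):
        # Only the chain at entry h(key) has to be searched.
        return key in self.buckets[self.h(key)]


table = ChainedHashTable(5)
for k in (15, 10, 22, 104, 29, 34):
    table.insert(k)
```

After the loop, `table.buckets` holds one chain per bucket, so a lookup touches a single chain regardless of how many keys the table stores.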

4

Changes for Secondary Storage

Bucket array contains blocks, not pointers to linked lists

Records that hash to a certain bucket are put in the corresponding block

If a bucket overflows then start a chain of overflow blocks

5

Insertion into Static Hash Table

To insert a record with key K:

compute h(K)

insert the record into one of the blocks in the chain of blocks for bucket number h(K), adding a new block to the chain if necessary

6

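EXAMPLE: 2 records/bucket

INSERT: h(a) = 1, h(b) = 2, h(c) = 1, h(d) = 0

Bucket 0: d
Bucket 1: a, c
Bucket 2: b
Bucket 3: (empty)

Then insert e with h(e) = 1: bucket 1 is full, so e goes into an overflow block chained to bucket 1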
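The insertion procedure can be sketched as follows, assuming each bucket is modelled as a list of blocks (the first being the primary block) and each block as a list of at most two records. `StaticHashTable` and the `hashes` dictionary are hypothetical names; the hash values are the ones given in the example.

```python
class StaticHashTable:
    """Static hash table whose buckets are chains of fixed-size blocks."""

    def __init__(self, num_buckets, records_per_block, h):
        self.capacity = records_per_block
        self.h = h
        # Every bucket starts with one empty primary block.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def insert(self, key):
        chain = self.buckets[self.h(key)]
        for block in chain:
            if len(block) < self.capacity:
                block.append(key)
                return
        # Every block in the chain is full: add an overflow block.
        chain.append([key])


hashes = {"a": 1, "b": 2, "c": 1, "d": 0, "e": 1}  # hash values from the example
t = StaticHashTable(4, 2, hashes.get)
for key in "abcde":
    t.insert(key)
```

Inserting `e` finds the primary block of bucket 1 full, so the chain for that bucket grows to two blocks.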

7

Deletion from a Static Hash Table

To delete records with key K:

Go to the bucket numbered h(K)

Search for records with key K, deleting any that are found

Possibly condense the chain of overflow blocks for that bucket
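A rough sketch of deletion with condensing, assuming a bucket's chain is a list of blocks and records are compared by key only. The function name `delete_from_chain` is illustrative. Repacking the whole chain is a simplification: on disk one would rewrite only the blocks that change.

```python
def delete_from_chain(chain, key, capacity):
    """Delete every record equal to `key` from a bucket's chain of blocks,
    then repack the survivors so the chain uses as few blocks as possible."""
    survivors = [rec for block in chain for rec in block if rec != key]
    condensed = [survivors[i:i + capacity]
                 for i in range(0, len(survivors), capacity)]
    return condensed or [[]]  # always keep the (possibly empty) primary block


chain = [["a", "c"], ["e"]]          # primary block plus one overflow block
chain = delete_from_chain(chain, "c", capacity=2)
```

Here deleting `c` frees a slot in the primary block, so `e` moves up and the overflow block disappears.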

8

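EXAMPLE: deletion

[Figure: static hash table with buckets 0 to 3 holding records a through g, including one overflow block]

Delete: e, f

After f is deleted, maybe move "g" up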

9

Rule of thumb: Try to keep space utilization between 50% and 80%

Utilization = (# records used) / (total # records that fit)

If < 50%, wasting space

If > 80%, overflows significant; depends on how good the hash function is & on # records/bucket
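As a small worked example of the rule of thumb, the sketch below computes utilization and classifies it against the 50% and 80% marks. The function names and the sample table (4 buckets of 2 records each, holding 5 records, counting primary blocks only) are assumptions made for illustration.

```python
def utilization(records_used, records_that_fit):
    """Fraction of the available record slots that are occupied."""
    return records_used / records_that_fit


def assess(u, low=0.5, high=0.8):
    """Apply the 50%-80% rule of thumb to a utilization value."""
    if u < low:
        return "wasting space"
    if u > high:
        return "overflows significant"
    return "ok"


# Hypothetical table: 4 buckets of 2 records each, holding 5 records.
u = utilization(5, 4 * 2)
verdict = assess(u)
```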

10

Efficiency of Static Hash Tables

If the hash table size is large enough and the distribution of keys by the hash function is sufficiently "even", then most buckets have no overflow blocks

In this case lookup typically takes one disk I/O and insertion/deletion take two

Significantly better than sequential indexes and B-trees

(But: hash tables do not support efficient range queries as B-trees do)

What if there are long chains of overflow blocks?

11

How do we cope with growth?

Overflows and reorganizations

Dynamic hashing:
Extensible
Linear

12

Extensible Hash Tables

Each bucket in the bucket array contains a pointer to a block, instead of a block itself

Bucket array can grow by doubling in size

Certain buckets can share a block if small enough

Hash function computes a sequence of k bits, but only the first i bits are used at any time to index into the bucket array

Value of i can increase (corresponds to bucket array doubling in size)

14

(b) Use directory

[Figure: h(K)[i] indexes a directory whose entries point to buckets]

15

Inserting into Extensible Hash Table

To insert record with key K:

compute h(K)

go to bucket indexed by first i bits of h(K)

follow the pointer to get to block B

if room in B, insert record

else let j be the number of bits of the hash value used to determine membership in B

16

Insertion cont'd

Case 1: j < i.

split block B in two

distribute records in B to the 2 new blocks based on value of their (j+1)-st bit

update header of each new block to j+1

adjust pointers in bucket array so that entries that used to point to B now point to correct block

if still no room in appropriate block for new record then repeat this process

17

Insertion cont'd

Case 2: j = i.

increment i by 1

double length of bucket array

entries for w0 and w1 both point to the same block that old entry w pointed to (block is shared)

apply case 1 to split block B
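The two insertion cases can be put together in a short sketch. It assumes hash values are already k-bit integers (no separate hash function), that the table starts with i = 1 and two blocks, and that records are represented by their hash values. The names `Block` and `ExtensibleHashTable` are illustrative. The sketch raises an error instead of creating overflow blocks when all k bits are exhausted.

```python
class Block:
    def __init__(self, j):
        self.j = j          # number of hash bits that determine membership
        self.records = []


class ExtensibleHashTable:
    def __init__(self, k=4, capacity=2):
        self.k = k                  # hash values are k-bit integers
        self.capacity = capacity    # records per block
        self.i = 1                  # bits currently used to index the bucket array
        self.directory = [Block(1), Block(1)]

    def _index(self, hv):
        return hv >> (self.k - self.i)          # first i bits of the hash value

    def insert(self, hv):
        while True:
            block = self.directory[self._index(hv)]
            if len(block.records) < self.capacity:
                block.records.append(hv)
                return
            if block.j == self.k:
                raise OverflowError("all k bits used; an overflow block would be needed")
            if block.j == self.i:               # Case 2: double the bucket array
                # Old entry w becomes entries w0 and w1, sharing w's block.
                self.directory = [b for b in self.directory for _ in (0, 1)]
                self.i += 1
            self._split(block)                  # Case 1 (j < i holds here)

    def _split(self, block):
        j = block.j + 1
        halves = (Block(j), Block(j))
        for r in block.records:                 # distribute on the (j+1)-st bit
            halves[(r >> (self.k - j)) & 1].records.append(r)
        for idx, b in enumerate(self.directory):
            if b is block:                      # repoint entries that shared the block
                self.directory[idx] = halves[(idx >> (self.i - j)) & 1]


t = ExtensibleHashTable(k=4, capacity=2)
for hv in (0b0001, 0b1001, 0b1100, 0b1010, 0b0111, 0b0000, 0b1001):
    t.insert(hv)
```

With 4-bit hash values and 2 records per block, this insertion sequence doubles the directory twice, ending with i = 3 and five distinct blocks, two of which are shared by pairs of directory entries.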

18

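Example: h(k) is 4 bits; 2 keys/bucket

Directory (i = 1):
0 -> block (j = 1): 0001
1 -> block (j = 1): 1001, 1100

Insert 1010

The block for 1 is full and j = i, so the directory doubles and the block splits

New directory (i = 2):
00, 01 -> block (j = 1): 0001
10 -> block (j = 2): 1001, 1010
11 -> block (j = 2): 1100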

19

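Example continued

Directory (i = 2):
00, 01 -> block (j = 1): 0001
10 -> block (j = 2): 1001, 1010
11 -> block (j = 2): 1100

Insert: 0111, 0000

0111 fits in the block shared by 00 and 01; 0000 then finds that block full, and since j < i the block splits without doubling the directory:
00 -> block (j = 2): 0000, 0001
01 -> block (j = 2): 0111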

20

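Example continued

Directory (i = 2):
00 -> block (j = 2): 0000, 0001
01 -> block (j = 2): 0111
10 -> block (j = 2): 1001, 1010
11 -> block (j = 2): 1100

Insert: 1001

The block for 10 is full and j = i, so the directory doubles (i = 3) and the block splits:
000, 001 -> block (j = 2): 0000, 0001
010, 011 -> block (j = 2): 0111
100 -> block (j = 3): 1001, 1001
101 -> block (j = 3): 1010
110, 111 -> block (j = 2): 1100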

21

Extensible hashing: deletion

Either: no merging of blocks

Or: merge blocks and cut directory if possible (reverse insert procedure)

22

Extensible hashing: Summary

+ Can handle growing files, with less wasted space and with no full reorganizations

- Indirection (not bad if directory in memory)

- Directory doubles in size (now it fits, now it does not)

23

Linear Hash Tables

Number of buckets increases more slowly than with extensible hashing

Number of buckets is such that on average each block is x% full (say 80%) -- threshold

Overflow blocks can occur but average number per bucket << 1

Use the i low-order bits from the result of the hash function to index into the bucket array

24

Linear hashing

Another dynamic hashing scheme

Two ideas:

(a) Use i low-order bits of hash

[Figure: a b-bit hash value, e.g. 01110101, of which the i low-order bits are used; i grows]

(b) Bucket array grows linearly

25

Inserting into Linear Hash Table

To insert record with key K, with last i bits of h(K) being a1a2…ai :

Let m be the integer represented by a1a2…ai in binary

If m < n (number of buckets), then bucket m exists -- put record in that bucket

If m ≥ n, then bucket m does not (yet) exist, so put record in bucket whose index corresponds to 0a2…ai

26

Inserting cont'd

If no room in indicated bucket, then create an overflow bucket

Compare # records / # buckets to threshold

If exceeds threshold then add a new bucket and rearrange records

If number of buckets exceeds 2^i, then increment i by 1
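A minimal sketch of linear-hash insertion, assuming records are represented by their integer hash values and that each bucket is a single Python list standing for a primary block plus any overflow blocks. `LinearHashTable`, `fill`, and the starting size of two buckets are assumptions of this sketch. The threshold is expressed as a maximum average number of records per bucket, and i grows when the bucket count would pass 2^i.

```python
class LinearHashTable:
    def __init__(self, capacity=2, fill=0.8):
        self.threshold = fill * capacity   # max average records per bucket
        self.i = 1                         # low-order hash bits in use
        self.n = 2                         # current number of buckets
        self.r = 0                         # number of records stored
        self.buckets = [[], []]

    def _bucket_index(self, hv):
        m = hv & ((1 << self.i) - 1)       # integer given by the last i bits
        if m >= self.n:                    # bucket m does not exist yet
            m -= 1 << (self.i - 1)         # clear the leading bit: 0a2...ai
        return m

    def insert(self, hv):
        self.buckets[self._bucket_index(hv)].append(hv)
        self.r += 1
        if self.r / self.n > self.threshold:
            self._add_bucket()

    def _add_bucket(self):
        if self.n == 1 << self.i:          # bucket count would exceed 2^i
            self.i += 1
        new = self.n
        old = new - (1 << (self.i - 1))    # bucket that shares its low bits with `new`
        self.buckets.append([])
        self.n += 1
        records, self.buckets[old] = self.buckets[old], []
        for hv in records:                 # rearrange records between old and new
            self.buckets[self._bucket_index(hv)].append(hv)

    def lookup(self, hv):
        return hv in self.buckets[self._bucket_index(hv)]


t = LinearHashTable(capacity=2, fill=1.0)
for hv in (0b0000, 0b1010, 0b0101, 0b1111, 0b0101):
    t.insert(hv)
```

With `fill=1.0` the fifth insertion pushes the average above 2 records per bucket, so one bucket (index 10) is added and 1010 moves into it, while bucket 01 keeps three records, one more than a block holds.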

27

Example: b = 4 bits, i = 2, 2 keys/bucket

Bucket 00: 0000, 1010
Bucket 01: 0101, 1111
Buckets 10, 11: future growth buckets

m = 01 (max used block)

Rule:
If h(k)[i] ≤ m, then look at bucket h(k)[i]
else, look at bucket h(k)[i] - 2^(i-1)

Insert 0101: bucket 01 is full, so 0101 goes into an overflow block

Can have overflow chains!
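The lookup rule (use bucket h(k)[i] if it is at most m, otherwise subtract 2^(i-1)) fits in one small function. This is a sketch that assumes the hash value is passed in as an integer; `bucket_for` is a hypothetical name.

```python
def bucket_for(hash_value, i, m):
    """Return the bucket to inspect, where m is the largest bucket number in use."""
    low = hash_value & ((1 << i) - 1)       # h(k)[i]: the i low-order bits
    return low if low <= m else low - (1 << (i - 1))


# State of the example: i = 2, m = 01.
placements = {hv: bucket_for(hv, 2, 0b01)
              for hv in (0b0000, 0b1010, 0b0101, 0b1111)}
```

With i = 2 and m = 01, hash values ending in 10 or 11 are redirected to buckets 00 and 01, which is why 1010 sits in bucket 00 and 1111 in bucket 01.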

28

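Example: b = 4 bits, i = 2, 2 keys/bucket

After inserting 0101, the table grows one bucket at a time:

Add bucket 10: 1010 moves from bucket 00 to bucket 10

Add bucket 11: 1111 moves from bucket 01 to bucket 11, and the overflow record 0101 now fits in bucket 01

Bucket 00: 0000
Bucket 01: 0101, 0101
Bucket 10: 1010
Bucket 11: 1111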

29

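Example Continued: How to grow beyond this?

Bucket 00: 0000
Bucket 01: 0101, 0101
Bucket 10: 1010
Bucket 11: 1111

Increase i from 2 to 3: buckets 00, 01, 10, 11 become 000, 001, 010, 011, and buckets 100, 101, 110, 111 become available for growth

Add bucket 100: no record ends in 100, so nothing moves

Add bucket 101: both 0101 records move from bucket 001 to bucket 101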

30

Linear Hashing: Summary

+ Can handle growing files, with less wasted space and with no full reorganizations

+ No indirection like extensible hashing

- Can still have overflow chains

31

Comparing Index Approaches

Hashing good for probes given key, e.g.,

SELECT …
FROM R
WHERE R.A = 5

32

Indexing vs Hashing

Sequential Indexes and B-trees good for Range Searches, e.g.,

SELECT …
FROM R
WHERE R.A > 5
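The contrast can be illustrated in memory, assuming a Python dict as the hash index and a sorted list as a stand-in for a sequential index or the leaf level of a B-tree. The relation contents and function names below are made up for the illustration.

```python
import bisect

rows = [(3, "c"), (5, "e"), (7, "g"), (9, "i")]   # (A, payload), hypothetical data

# Hash index on A: good for WHERE R.A = 5.
hash_index = {}
for a, payload in rows:
    hash_index.setdefault(a, []).append(payload)

# Ordered index on A: good for WHERE R.A > 5.
sorted_keys = sorted(a for a, _ in rows)


def probe(a):
    """Equality probe: one hash lookup."""
    return hash_index.get(a, [])


def range_greater_than(a):
    """Range search: binary search for the start, then a sequential scan."""
    start = bisect.bisect_right(sorted_keys, a)
    return sorted_keys[start:]


equal_5 = probe(5)
greater_5 = range_greater_than(5)
```

The hash index has no useful ordering, so answering `R.A > 5` with it alone would mean examining every bucket, whereas the ordered index finds the starting point and reads onward.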

33

Index definition in SQL

CREATE INDEX name ON rel (attr)

CREATE UNIQUE INDEX name ON rel (attr)
(defines candidate key)

DROP INDEX name
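These statements can be tried out with Python's built-in `sqlite3` module. This is only a sketch in the SQLite dialect, with a made-up relation `R(A, B)` and hypothetical index names `idx_a` and `idx_b`; other systems may differ in details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER, B TEXT)")

conn.execute("CREATE INDEX idx_a ON R (A)")           # ordinary index
conn.execute("CREATE UNIQUE INDEX idx_b ON R (B)")    # B becomes a candidate key

conn.execute("INSERT INTO R VALUES (5, 'x')")
try:
    conn.execute("INSERT INTO R VALUES (6, 'x')")     # duplicate value for B
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("DROP INDEX idx_a")
remaining = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
```

The unique index rejects the second row because it repeats a value of `B`, and after the `DROP INDEX` only `idx_b` is left in the catalog.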

34

Note

CANNOT SPECIFY TYPE OF INDEX (e.g. B-tree, Hashing, …) OR PARAMETERS (e.g. Load Factor, Size of Hash, ...)

... at least in SQL...
