1
Yet More on Indexes
Hash Tables
Source: our textbook, slides by Hector Garcia-Molina
2
Main Memory Hash Tables
A hash function h maps search keys to integers in the range 0 to B-1, where B is the number of buckets
There is a B-element array; each entry holds a pointer to a linked list
A record with key k is put in the linked list that starts at entry h(k) of the array
3
Example of Hash Table
B = 5
h(k) = k mod 5
Bucket 0: 15, 10
Bucket 1: (empty)
Bucket 2: 22
Bucket 3: (empty)
Bucket 4: 104, 29, 34
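A minimal sketch of the main-memory structure described above, assuming integer keys and using Python lists in place of linked lists. The class name ChainedHashTable and its methods are illustrative, not taken from the textbook or slides.

```python
class ChainedHashTable:
    """B-element bucket array; each entry holds a chain of keys."""

    def __init__(self, num_buckets=5):
        self.B = num_buckets
        # One chain per bucket (a Python list stands in for a linked list).
        self.buckets = [[] for _ in range(self.B)]

    def h(self, key):
        # Hash function from the example: h(k) = k mod B.
        return key % self.B

    def insert(self, key):
        # Put the key in the chain that starts at entry h(key).
        self.buckets[self.h(key)].append(key)

    def lookup(self, key):
        # Only the chain for bucket h(key) has to be searched.
        return key in self.buckets[self.h(key)]


table = ChainedHashTable(5)
for k in (15, 10, 22, 104, 29, 34):
    table.insert(k)
```

With B = 5, keys 104, 29 and 34 all land in bucket 4, so a lookup for any of them scans a chain of three entries.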
4
Changes for Secondary Storage
Bucket array contains blocks, not pointers to linked lists
Records that hash to a certain bucket are put in the corresponding block
If a bucket overflows then start a chain of overflow blocks
5
Insertion into Static Hash Table
To insert a record with key K:
  compute h(K)
  insert record into one of the blocks in the chain of blocks for bucket number h(K), adding a new block to the chain if necessary
6
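EXAMPLE: 2 records/bucket
INSERT: h(a) = 1, h(b) = 2, h(c) = 1, h(d) = 0
Bucket 0: d
Bucket 1: a, c
Bucket 2: b
Bucket 3: (empty)
Then insert e with h(e) = 1: bucket 1 is full, so e goes into an overflow block chained to bucket 1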
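The insertion procedure can be sketched as follows, assuming each bucket is modeled as a list of fixed-capacity blocks, with the first block the primary block and the rest overflow blocks. The class name StaticHashTable and the dictionary of hash values are illustrative stand-ins; the hash values are the ones given in the example.

```python
class StaticHashTable:
    """Static hash table whose buckets are chains of fixed-size blocks."""

    def __init__(self, h, num_buckets, block_capacity=2):
        self.h = h
        self.capacity = block_capacity
        # Every bucket starts with one empty primary block.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def insert(self, key):
        chain = self.buckets[self.h(key)]
        for block in chain:
            if len(block) < self.capacity:
                block.append(key)  # room in an existing block
                return
        chain.append([key])  # all blocks full: add an overflow block

    def lookup(self, key):
        # Search every block in the chain for bucket h(key).
        return any(key in block for block in self.buckets[self.h(key)])


# Hash values taken from the example: h(a)=1, h(b)=2, h(c)=1, h(d)=0, h(e)=1.
hash_values = {"a": 1, "b": 2, "c": 1, "d": 0, "e": 1}
table = StaticHashTable(hash_values.__getitem__, num_buckets=4)
for key in "abcde":
    table.insert(key)
```

After these inserts, bucket 1 holds a primary block with a and c plus one overflow block with e.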
7
Deletion from a Static Hash Table
To delete records with key K:
  Go to the bucket numbered h(K)
  Search for records with key K, deleting any that are found
  Possibly condense the chain of overflow blocks for that bucket
8
EXAMPLE: deletion
[Figure: buckets 0 to 3 holding records a through g, 2 records per block, with an overflow block]
Delete: e, f
After f is deleted, maybe move "g" up
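Deletion with chain condensing can be sketched like this, assuming a bucket is represented as a list of blocks and each block as a list of records. The function name delete_from_chain and the sample records are hypothetical. Repacking all survivors is only one possible condensing policy; the slides leave the policy open.

```python
def delete_from_chain(chain, key, capacity):
    """Remove every record equal to key, then condense the chain.

    chain is a list of blocks (primary block first, then overflow blocks).
    Returns a new chain that uses as few blocks as possible.
    """
    survivors = [rec for block in chain for rec in block if rec != key]
    # Repack the surviving records, filling each block before starting the next.
    condensed = [survivors[i:i + capacity]
                 for i in range(0, len(survivors), capacity)]
    # A bucket always keeps its primary block, even when it is empty.
    return condensed or [[]]


# Hypothetical bucket: primary block [b, c] with one overflow block [d].
chain_after = delete_from_chain([["b", "c"], ["d"]], "c", capacity=2)
```

Here deleting c frees a slot in the primary block, so d moves up and the overflow block disappears.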
9
Rule of thumb: Try to keep space utilization between 50% and 80%
Utilization = (# records stored) / (total # records that fit)
If < 50%, wasting space
If > 80%, overflows become significant; how significant depends on how good the hash function is and on # records/bucket
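As a small worked example of the rule of thumb, the sketch below computes utilization under the assumption that "records that fit" counts only the primary blocks (number of buckets times records per block). The function names and the sample numbers are illustrative.

```python
def utilization(records_stored, num_buckets, records_per_block):
    """Fraction of primary-block slots in use (assumed definition)."""
    return records_stored / (num_buckets * records_per_block)


def assess(u, low=0.5, high=0.8):
    """Apply the 50%-80% rule of thumb to a utilization value."""
    if u < low:
        return "wasting space"
    if u > high:
        return "overflows likely significant"
    return "ok"


# 5 records in 4 buckets of 2 records each: 5 / 8 = 0.625.
u = utilization(5, num_buckets=4, records_per_block=2)
verdict = assess(u)
```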
10
Efficiency of Static Hash Tables
If the hash table size is large enough and the distribution of keys by the hash function is sufficiently "even", then most buckets have no overflow blocks
In this case lookup typically takes one disk I/O and insertion/deletion take two
Significantly better than sequential indexes and B-trees
(But: hash tables do not support efficient range queries as B-trees do)
What if there are long overflow chains?
11
How do we cope with growth?
Overflows and reorganizations
Dynamic hashing:
  Extensible
  Linear
12
Extensible Hash Tables
Each bucket in the bucket array contains a pointer to a block, instead of a block itself
Bucket array can grow by doubling in size
Certain buckets can share a block if small enough
Hash function computes a sequence of k bits, but only the first i bits are used at any time to index into the bucket array
Value of i can increase (corresponds to bucket array doubling in size)
14
(b) Use directory: h(K)[i] (the first i bits of h(K)) selects a directory entry, which points to the bucket
15
Inserting into Extensible Hash Table
To insert record with key K:
  compute h(K)
  go to bucket indexed by first i bits of h(K)
  follow the pointer to get to block B
  if room in B, insert record
  else let j be number of bits of hash value used to determine membership in B
16
Insertion cont'd
Case 1: j < i.
  split block B in two
  distribute records in B to the 2 new blocks based on value of their (j+1)-st bit
  update header of each new block to j+1
  adjust pointers in bucket array so that entries that used to point to B now point to correct block
  if still no room in appropriate block for new record then repeat this process
17
Insertion cont'd
Case 2: j = i.
  increment i by 1
  double length of bucket array
  entry for w0 and w1 both point to same block that old entry w pointed to (block is shared)
  apply case 1 to split block B
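The two insertion cases can be sketched as below, assuming hash values are given directly as k-bit integers and that the table starts with i = 1 and two empty blocks. The names Block, ExtensibleHashTable, _index and _split are illustrative. The OverflowError for a block that cannot be split further is a simplification; a real implementation would add an overflow block instead.

```python
class Block:
    def __init__(self, j):
        self.j = j        # header: number of hash bits that determine membership
        self.keys = []


class ExtensibleHashTable:
    def __init__(self, k=4, capacity=2):
        self.k = k                # hash values are k-bit integers
        self.capacity = capacity  # records per block
        self.i = 1                # bits currently used to index the directory
        self.directory = [Block(1), Block(1)]

    def _index(self, hashval):
        # First (high-order) i bits of the k-bit hash value.
        return hashval >> (self.k - self.i)

    def insert(self, hashval):
        while True:
            block = self.directory[self._index(hashval)]
            if len(block.keys) < self.capacity:
                block.keys.append(hashval)
                return
            if block.j == self.k:
                raise OverflowError("block cannot be split further")
            if block.j == self.i:
                # Case 2: double the directory; entries w0 and w1 share w's block.
                self.directory = [b for b in self.directory for _ in range(2)]
                self.i += 1
            # Case 1 (also the last step of case 2): split the full block.
            self._split(block)

    def _split(self, block):
        j = block.j + 1
        zero, one = Block(j), Block(j)
        for key in block.keys:
            bit = (key >> (self.k - j)) & 1  # (j+1)-st bit of the old header
            (one if bit else zero).keys.append(key)
        # Redirect every directory entry that pointed to the old block.
        for idx, b in enumerate(self.directory):
            if b is block:
                bit = (idx >> (self.i - j)) & 1
                self.directory[idx] = one if bit else zero


table = ExtensibleHashTable(k=4, capacity=2)
for value in (0b0001, 0b1001, 0b1100, 0b1010, 0b0111, 0b0000, 0b1001):
    table.insert(value)
```

The insert loop retries after each split, which covers the "repeat this process" step when all records of a full block fall on the same side of the split.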
18
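Example: h(k) is 4 bits; 2 keys/bucket
Start: i = 1; directory entries 0 and 1
  entry 0 -> block (j = 1): 0001
  entry 1 -> block (j = 1): 1001, 1100
Insert 1010: the block for entry 1 is full and j = i, so the directory doubles
New directory: i = 2; entries 00, 01, 10, 11
  entries 00, 01 -> block (j = 1): 0001
  entry 10 -> block (j = 2): 1001, 1010
  entry 11 -> block (j = 2): 1100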
19
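Example continued
i = 2; directory entries 00, 01, 10, 11
  entries 00, 01 -> block (j = 1): 0001
  entry 10 -> block (j = 2): 1001, 1010
  entry 11 -> block (j = 2): 1100
Insert: 0111, 0000
0111 fits in the block with 0001; 0000 then finds that block full with j < i, so the block splits
  entry 00 -> block (j = 2): 0000, 0001
  entry 01 -> block (j = 2): 0111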
20
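Example continued
i = 2; directory entries 00, 01, 10, 11
  entry 00 -> block (j = 2): 0000, 0001
  entry 01 -> block (j = 2): 0111
  entry 10 -> block (j = 2): 1001, 1010
  entry 11 -> block (j = 2): 1100
Insert: 1001
The block for entry 10 is full and j = i, so the directory doubles and the block splits
New directory: i = 3; entries 000, 001, 010, 011, 100, 101, 110, 111
  entry 100 -> block (j = 3): 1001, 1001
  entry 101 -> block (j = 3): 1010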
21
Extensible hashing: deletion
No merging of blocks, or
Merge blocks and cut directory if possible (reverse insert procedure)
22
Extensible hashing: Summary
+ Can handle growing files, with less wasted space and with no full reorganizations
- Indirection (not bad if directory in memory)
- Directory doubles in size (now it fits, now it does not)
23
Linear Hash Tables
Number of buckets increases more slowly than with extensible hashing
Number of buckets is such that on average each block is x% full (say 80%) -- threshold
Overflow blocks can occur but average number per bucket << 1
Use the i low-order bits from the result of the hash function to index into the bucket array
24
Linear hashing
Another dynamic hashing scheme
Two ideas:
(a) Use i low-order bits of hash [figure: a b-bit hash value 01110101 whose i low-order bits are used; i grows]
(b) Bucket array grows linearly
25
Inserting into Linear Hash Table
To insert record with key K, with last i bits of h(K) being a1a2…ai :
Let m be the integer represented by a1a2…ai in binary
If m < n (number of buckets), then bucket m exists -- put record in that bucket
If m ≥ n, then bucket m does not (yet) exist, so put record in bucket whose index corresponds to 0a2…ai
26
Inserting cont'd
If no room in indicated bucket, then create an overflow bucket
Compare # records / # buckets to threshold
If exceeds threshold then add a new bucket and rearrange records
If number of buckets exceeds 2^i, then increment i by 1
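A minimal sketch of linear-hash insertion, assuming hash values are passed in directly as integers and that each bucket is a single Python list whose entries beyond the block capacity stand for records in overflow blocks. The class name LinearHashTable, the fill parameter (threshold as a fraction of block capacity) and the starting state of two buckets with i = 1 are assumptions made for illustration.

```python
class LinearHashTable:
    def __init__(self, capacity=2, fill=0.8):
        self.capacity = capacity          # records per block
        self.max_ratio = fill * capacity  # threshold on # records / # buckets
        self.i = 1                        # low-order hash bits in use
        self.buckets = [[], []]           # n = 2 buckets to start
        self.r = 0                        # number of records

    def _bucket_index(self, hashval):
        m = hashval & ((1 << self.i) - 1)  # integer given by the last i bits
        if m >= len(self.buckets):
            m -= 1 << (self.i - 1)         # bucket m not there yet: use 0a2...ai
        return m

    def insert(self, hashval):
        # A list longer than capacity models a chain of overflow blocks.
        self.buckets[self._bucket_index(hashval)].append(hashval)
        self.r += 1
        if self.r / len(self.buckets) > self.max_ratio:
            self._add_bucket()

    def _add_bucket(self):
        n = len(self.buckets)
        if n == 1 << self.i:
            self.i += 1  # the new bucket count n + 1 exceeds 2^i
        # New bucket n (binary 1a2...ai) takes records from bucket 0a2...ai.
        old = n - (1 << (self.i - 1))
        self.buckets.append([])
        records, self.buckets[old] = self.buckets[old], []
        for rec in records:
            self.buckets[self._bucket_index(rec)].append(rec)


table = LinearHashTable(capacity=2, fill=0.8)
for value in (0b0000, 0b1010, 0b0101, 0b1111, 0b0101):
    table.insert(value)
```

Only one bucket is split per growth step, and it is chosen by position rather than by which bucket overflowed, which is why overflow chains can persist for a while.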
27
Example: b = 4 bits, i = 2, 2 keys/bucket
Bucket 00: 0000, 1010
Bucket 01: 0101, 1111
Buckets 10, 11: future growth buckets
m = 01 (max used block)
Rule: If h(k)[i] ≤ m, then look at bucket h(k)[i]; else, look at bucket h(k)[i] - 2^(i-1)
• insert 0101 (bucket 01 is full, so it goes into an overflow block)
• can have overflow chains!
28
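Example continued: b = 4 bits, i = 2, 2 keys/bucket
After insert 0101, the bucket array grows one bucket at a time
Add bucket 10 (m = 10): 1010 moves from bucket 00 to bucket 10
Add bucket 11 (m = 11): 1111 moves from bucket 01 to bucket 11, and 0101 moves out of the overflow block
Bucket 00: 0000
Bucket 01: 0101, 0101
Bucket 10: 1010
Bucket 11: 1111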
29
Example Continued: How to grow beyond this?
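Bucket 00: 0000
Bucket 01: 0101, 0101
Bucket 10: 1010
Bucket 11: 1111
m = 11 (max used block)
i = 2 grows to i = 3: buckets 00, 01, 10, 11 become 000, 001, 010, 011, and buckets 100, 101, 110, 111 become available
m = 100: add bucket 100
m = 101: add bucket 101; 0101 and 0101 move from bucket 001 to bucket 101
. . .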
30
Linear Hashing: Summary
+ Can handle growing files, with less wasted space and with no full reorganizations
+ No indirection like extensible hashing
- Can still have overflow chains
31
Comparing Index Approaches
Hashing good for probes given key
e.g., SELECT …
      FROM R
      WHERE R.A = 5
32
Indexing vs Hashing
Sequential Indexes and B-trees good for Range Searches:
e.g., SELECT
      FROM R
      WHERE R.A > 5
33
Index definition in SQL
CREATE INDEX name ON rel (attr)
CREATE UNIQUE INDEX name ON rel (attr) -- defines candidate key
DROP INDEX name
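To see the three statements in action, the sketch below runs them against an in-memory SQLite database through Python's standard sqlite3 module. The choice of SQLite, the table R(A, B), the sample rows and the index names r_a_idx and r_b_idx are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (A INTEGER, B TEXT)")
conn.executemany("INSERT INTO R VALUES (?, ?)", [(5, "x"), (7, "y")])

# Ordinary index on A; unique index on B makes B a candidate key.
conn.execute("CREATE INDEX r_a_idx ON R (A)")
conn.execute("CREATE UNIQUE INDEX r_b_idx ON R (B)")

# A second row with B = 'x' violates the unique index.
try:
    conn.execute("INSERT INTO R VALUES (9, 'x')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("DROP INDEX r_a_idx")
remaining = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
```

Neither CREATE statement says which index structure to build; that choice is left to the database system.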
34
Note: CANNOT SPECIFY TYPE OF INDEX (e.g. B-tree, Hashing, …) OR PARAMETERS (e.g. Load Factor, Size of Hash, ...)
... at least in SQL...