linear hashing - dsl.cds.iisc.ac.incourse/dbms/class/lh/lh.pdf · jan 2014 linear-hashing slide 4...
TRANSCRIPT
JAN 2014 Slide 1LINEAR-HASHING
LINEAR HASHINGE0 261
Jayant Haritsa
Computer Science and Automation
Indian Institute of Science
JAN 2014 Slide 2LINEAR-HASHING
Indexing Techniques
• Comparison-based indexing (B+-trees)• Computation-based indexing (Hashing)• Main difference is that search cost in former
is function of database size, in latter it is constant.
• But the former can support range searches, whereas the latter cannot.
JAN 2014 Slide 4LINEAR-HASHING
Static Hashing
• # primary pages N fixed, allocated sequentially, never de-allocated; overflow pages if needed.
• h(k) mod N = bucket to which data entry withkey k belongs.
h(key) mod N
hkey
Primary bucket pages Overflow pages
10
N-1
JAN 2014 Slide 5LINEAR-HASHING
Extendible Hashing
• Situation: Bucket (primary page) becomes full. Why not re-organize file by doubling # of buckets?– Reading and writing all pages is expensive!– Idea: Use directory of pointers to buckets, double # of
buckets by doubling the directory, splitting just the bucket that overflowed.
– Directory much smaller than file, so doubling it is much cheaper. Only one page of data entries is split. Nooverflow page.
– Trick lies in how hash function is adjusted.
JAN 2014 Slide 6LINEAR-HASHING
Example
• Directory is array of size 4.• To find bucket for r, take
last `global depth’ # bits of h(k)– If h(k) = 5 = binary 101,
it is in bucket pointed to by 01.
Insert: If bucket is full, split it (allocate new page, re-distribute).
If necessary, double the directory. (Splitting abucket does not always require doubling; we can tell by comparing global depth with local depth for the split bucket.)
13*00
01
10
11
2
2
2
2
2
LOCAL DEPTH
GLOBAL DEPTH
DIRECTORY
Bucket A
Bucket B
Bucket C
Bucket D
DATA PAGES
10*
1* 21*
4* 12* 32* 16*
15* 7* 19*
5*
JAN 2014 Slide 7LINEAR-HASHING
Insert h(k)=20 (Causes Doubling)
20*
00011011
2 2
2
2
LOCAL DEPTH 2
2
DIRECTORY
GLOBAL DEPTHBucket A
Bucket B
Bucket C
Bucket D
Bucket A2(‘split image’of Bucket A)
1* 5* 21* 13*
32* 16*
10*
15* 7* 19*
4* 12*
19*
2
2
2
000
001010
011100
101
110111
3
3
3DIRECTORY
Bucket A
Bucket B
Bucket C
Bucket D
Bucket A2(‘split image’of Bucket A)
32*
1* 5* 21* 13*
16*
10*
15* 7*
4* 20*12*
LOCAL DEPTH
GLOBAL DEPTH
JAN 2014 Slide 8LINEAR-HASHING
Directory Doubling
00
01
10
11
2
Why use least significant bits in directory?Allows for doubling via copying!
000
001
010
011
3
100
101
110
111
vs.
0
1
1
2*2*
2*
2* = 010
00
01
10
11
2
3
0
1
1
2* 2*
2*
2* = 010000
001
010
011
100
101
110
111
Least Significant Most Significant
JAN 2014 Slide 9LINEAR-HASHING
Comments on Extendible Hashing
• If directory fits in memory, equality search answered with one disk access; else two.
• Directory grows in spurts, and, if the distribution of hash values is skewed, directory can grow large.
• Multiple entries with same hash value cause problems.
JAN 2014 Slide 11LINEAR-HASHING
Linear Hashing
• This is another dynamic hashing scheme, an alternative to Extendible Hashing.
• LH handles the problem of long overflow chains without using a directory, and handles duplicates.
• Idea: Use a family of hash functions h0, h1, h2, ...– hi(key) = h(key) mod(2iN); N = initial # buckets– h is some hash function (range is 0 to 2|MachineBitLength|)– If N = 2d0, for some d0, hi consists of applying h and looking
at the last di bits, where di = d0 + i.– hi+1 doubles the range of hi (similar to directory doubling)
JAN 2014 Slide 12LINEAR-HASHING
Linear Hashing (Contd.)• Directory avoided in LH by using overflow pages,
and choosing bucket to split round-robin.– Splitting proceeds in ‘rounds’. Round ends when all NR
initial (for round R) buckets are split. – Buckets 0 to Next-1 have been split; Next to NR yet to
be split.– Current round number is Level.
JAN 2014 Slide 13LINEAR-HASHING
Overview of LH File
• In the middle of a round.
Level-1h
Buckets that existed at thebeginning of this round:
this is the range of
NextBucket to be split
of other buckets) in this round
Level-1h search key value )(
search key value )(
Buckets split in this round:If is in this range, must useh Level
'split image' bucket.to decide if entry is in
created (through splitting'split image' buckets:NR
M
JAN 2014 Slide 14LINEAR-HASHING
LH Search
To find bucket for data entry k, find hLevel-1(k):• If hLevel-1(k) in range `Next to NR’ , k belongs
here.• Else, r could belong to bucket hLevel-1(k) or
bucket hLevel-1(k) + NR; must apply hLevel(k) to find out.
JAN 2014 Slide 15LINEAR-HASHING
Simpler Formulation
• m = hlevel (k)if m >= M, m = m - NR
NR M 2NR0 Nexthlevel hlevel-1 hlevel
JAN 2014 Slide 16LINEAR-HASHING
LH Insert
• Find bucket by applying hLevel-1 / hLevel:– If bucket to insert into is full:
• Add overflow page and insert data entry.• (Maybe) Split Next bucket and increment Next.
• Can choose any criterion to `trigger’ split.• Since buckets are split round-robin, long overflow
chains don’t develop!• Doubling of directory in Extendible Hashing is
similar; switching of hash functions is implicit in how the # of bits examined is increased.
JAN 2014 Slide 17LINEAR-HASHING
Example of Linear Hashing
• On split, hLevel+1 is used to re-distribute entries.
0hh
1
Level=0, N=4
00
01
10
11
000
001
010
011
(Actual contents of the linear hashed file)
Next=0PRIMARY
PAGES
Data entry rwith h(r)=5
Primary bucket page
44* 36*32*
25*9* 5*
14* 18*10*30*
31*35* 11*7*
0hh
1
Level=0
00
01
10
11
000
001
010
011
Next=1
PRIMARYPAGES
44* 36*
32*
25*9* 5*
14* 18*10*30*
31*35* 11*7*
OVERFLOWPAGES
43*
00100
Add 43*
JAN 2014 Slide 18LINEAR-HASHING
Example: End of a Round
0hh1
22*
00
01
10
11
000
001
010
011
00100
Next=3
01
10
101
110
Level=0PRIMARYPAGES
OVERFLOWPAGES
32*
9*
5*
14*
25*
66* 10*18* 34*
35*31* 7* 11* 43*
44*36*
37* 29*
30*
0hh1
37*
00
01
10
11
000
001
010
011
00100
10
101
110
Next=0
Level=1
111
11
PRIMARYPAGES
OVERFLOWPAGES
11
32*
9* 25*
66* 18* 10* 34*
35* 11*
44* 36*
5* 29*
43*
14* 30* 22*
31*7*
50*
Add 50*
JAN 2014 Slide 19LINEAR-HASHING
Physical Address Computing
• Logical bucket address given by hashing must be converted into the physical address of the bucket on disk.
• Simple solution: contiguous allocation on disk, but not feasible in general
• In practice, keep giving larger and larger contiguous chunks as the file size grows (Equation 6 doubles chunk size with K=1)
JAN 2014 Slide 20LINEAR-HASHING
Points to Note
• Hash Function:– hi (k) = [(A k) mod w] mod 2i N
where A = 6125423371 w = 232
(A is prime w.r.t. w)• In LH, split is local impact, whereas in the
B-tree, split is “global”• In LH, duplicates don’t cause a problem
because of presence of overflow buckets
JAN 2014 Slide 21LINEAR-HASHING
Problem
• For small cardinality domains (e.g. gender), neither hashing nor indexing work
• Solution: Bitmap Indexes (next class)