Secondary Storage and Indexing
CSCI 4380 Database Systems
Friday, April 20, 12
Disk access
• Databases are generally large data stores, much larger than available memory.
• Data is stored on disk and brought into memory on demand.
• A disk page or block is the smallest unit of access for both reads and writes.
• A disk page is typically 1K to 8K bytes.
Disk organization
• A disk contains
• multiple platters (usually 2 surfaces per platter)
• usually, the disk contains read/write heads that allow us to read/write from all surfaces simultaneously
Disk organization
• A disk surface contains
• multiple concentric tracks
• the same track on different surfaces can be read by different heads at the same time; this set of tracks is called a cylinder
Disk organization
• A track is broken down into sectors; sectors are separated from each other by blank spaces
• A sector is the smallest unit of operation (read/write) possible on a disk
• A disk block is usually composed of a number of consecutive sectors (determined by the operating system)
• Data are read/written in units of a disk block/page
• A disk block is the same size as a memory block or page.
Reading a disk page
• Reading a page from disk requires the disk to start spinning
• Disk arm has to be moved to the correct track of the disk -> seek operation
• The disk head must wait until the right location on the track is found -> rotational latency
• Then, the disk page can be read from disk and copied to memory -> transfer time.
Reading a disk page
• The cost of reading a disk page:
• seek time + rotational latency time + transfer time
• Multiple pages on the same track/cylinder can be read with a single seek and a single rotational latency. Reading M pages on the same track/cylinder:
• seek time + rotational latency time + transfer time for a full track * (fraction of the track circumference to be scanned)
A high end disk example
• Consider a disk with 16 surfaces, 2^16 tracks per surface (approx. 65K), 2^8 = 256 sectors per track and 2^12 bytes per sector.
• Each track has 2^12 * 2^8 = 2^20 bytes (1 MB)
• Each surface has 2^20 * 2^16 = 2^36 bytes
• The disk has 2^4 * 2^36 = 2^40 bytes = 1 TB
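The capacity arithmetic above can be checked with a few lines of Python. This is only a sketch; the constant names are made up for illustration and the geometry is the one from this example.

```python
# Disk geometry from the example (all values are powers of two).
SURFACES = 2 ** 4             # 16 surfaces
TRACKS_PER_SURFACE = 2 ** 16  # about 65K tracks
SECTORS_PER_TRACK = 2 ** 8    # 256 sectors
BYTES_PER_SECTOR = 2 ** 12    # 4K bytes

track_bytes = BYTES_PER_SECTOR * SECTORS_PER_TRACK
surface_bytes = track_bytes * TRACKS_PER_SURFACE
disk_bytes = surface_bytes * SURFACES

print(track_bytes, surface_bytes, disk_bytes)
```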
Reading a page
• Typical times:
• 7200 rpm means one rotation takes 8.33 ms (on average, half a rotation is needed before the correct location is found: 4.17 ms)
• seek time between 0 and 17.38 ms (on average, 1/3 of the disk surface is traversed: 6.46 ms)
• transfer time for one sector: 8.33/256 = 0.03 ms
Reading a page
• Reading a page of 8K (2 sectors):
• 1 seek + 1 rotational latency + 2 sector transfer times
• 6.46 + 4.17 + 0.03 * 2 = 10.69 ms
• Reading 100 consecutive pages (200 sectors) on the same track:
• 6.46 + 4.17 + 0.03 * 200 = 16.63 ms
• The lesson: Put blocks that are accessed together on the same track/cylinder as much as possible
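The lesson can be made concrete with a small cost model. The sketch below uses the rounded timings from these slides; the function name and the "every page pays its own seek" worst case are assumptions added for illustration. Note that 100 pages of 2 sectors each mean 200 sector transfers.

```python
SEEK_MS = 6.46        # average seek time
LATENCY_MS = 4.17     # average rotational latency (half a rotation)
SECTOR_MS = 0.03      # transfer time per sector, rounded as in the slides
SECTORS_PER_PAGE = 2  # an 8K page spans two 4K sectors

def read_time_ms(pages, same_track=True):
    """Estimated time to read `pages` pages of 8K."""
    transfer = pages * SECTORS_PER_PAGE * SECTOR_MS
    if same_track:
        # One seek and one rotational latency cover the whole run of pages.
        return SEEK_MS + LATENCY_MS + transfer
    # Assumed worst case: each page needs its own seek and latency.
    return pages * (SEEK_MS + LATENCY_MS) + transfer

print(round(read_time_ms(1), 2))
print(round(read_time_ms(100), 2))
print(round(read_time_ms(100, same_track=False), 2))
```

Under these assumptions, scattered pages cost roughly 60 times more than the same pages laid out on one track.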
Disk scheduling
• The disk controller can order the requests to minimize seeks
• When the controller is moving from low tracks to high tracks, serve the next track request in the direction of the movement, queue the rest
• The method is called the elevator algorithm
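A minimal sketch of the elevator ordering, assuming a fixed queue of pending track requests (no new arrivals during the sweep). The function name and the sample request list are hypothetical.

```python
def elevator_order(requests, head, moving_up=True):
    """Order pending track requests: keep moving in the current direction,
    then reverse and serve the requests that were queued."""
    here = [t for t in requests if t == head]
    upper = sorted(t for t in requests if t > head)                # ascending
    lower = sorted((t for t in requests if t < head), reverse=True)  # descending
    if moving_up:
        return here + upper + lower
    return here + lower + upper

pending = [98, 183, 37, 122, 14, 124, 65, 67]
print(elevator_order(pending, head=53))
```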
Checksums
• For each sector, store a number of error checking bits called checksums.
• With a single parity bit, the checksum is 1 if the number of 1’s in the given sector is odd, and 0 if the number of 1’s is even.
• When reading a sector, check that the checksum is correct.
• This detects all 1-bit errors.
• If more than 1 bit is in error, the checksum catches it about 50% of the time.
• For better error detection, use multiple bits (8 bits, where bit i stores the parity of the ith bit of each byte).
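Both checksum variants are easy to sketch. In the code below a sector is just a Python `bytes` value and the function names are made up for illustration; the 8-bit variant is the XOR of all bytes, since bit i of the XOR is the parity of bit i across the bytes.

```python
from functools import reduce

def parity_bit(sector: bytes) -> int:
    """1 if the number of 1 bits in the sector is odd, else 0."""
    return sum(bin(b).count("1") for b in sector) % 2

def parity_byte(sector: bytes) -> int:
    """8 parity bits: bit i is the parity of bit i over all bytes."""
    return reduce(lambda a, b: a ^ b, sector, 0)

sector = bytes([0b10110000, 0b00000001, 0b11111111])
one_flip = bytes([sector[0] ^ 0b00000100]) + sector[1:]
two_flips = bytes([sector[0] ^ 0b00000110]) + sector[1:]

# A single flipped bit changes the parity bit; two flipped bits do not,
# but the 8-bit checksum still notices them because they hit different positions.
print(parity_bit(sector), parity_bit(one_flip), parity_bit(two_flips))
print(parity_byte(sector), parity_byte(two_flips))
```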
Stable storage
• When we are writing a sector, if the write fails, then we lose the data on that sector.
• Use two sectors for each sector, XL and XR.
• First write XL, check the checksum. If XL is written correctly, then write XR.
• If XL is written incorrectly, then the old version of X is still stored in XR.
• If XR is written incorrectly, then the new version of X is stored in XL.
Multiple disks
• RAID (redundant array of inexpensive disks) is a family of methods for improving access time and reducing the possibility of data loss by using multiple disks.
RAID-0
• RAID-0, striping
• Distribute the data into multiple disks
• Example with 4 disks:
• Disk 1 has pages 1,5,9
• Disk 2 has pages 2,6,10
• Disk 3 has pages 3,7,11
• Disk 4 has pages 4,8,12
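The page placement in this example is plain round-robin. A short sketch, with hypothetical names and 1-based page and disk numbers as on the slide:

```python
NUM_DISKS = 4

def disk_for_page(page: int) -> int:
    """1-based disk number that stores a 1-based page under striping."""
    return (page - 1) % NUM_DISKS + 1

layout = {
    disk: [p for p in range(1, 13) if disk_for_page(p) == disk]
    for disk in range(1, NUM_DISKS + 1)
}
print(layout)
```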
RAID-0
• RAID-0, striping
• Reads are faster (read from all disks simultaneously)
• Writes are the same
• No redundancy in case a disk fails
RAID-1
• RAID-1, mirroring
• Mirror each disk onto another disk
• Reads can be up to twice as fast: read from whichever disk is available
• Writes are slower: each write requires writing to two disks
• If one of the disks fails, the other one contains all the data (no data loss)
RAID-4
• One disk contains the parity of the remaining disks
• Block i in the parity disk contains the parity of the ith block in all the remaining disks
• Reads are unchanged
• Writes are slower, each write requires a write to the parity disk as well
• If a disk fails, the lost data can be reconstructed from the remaining disks
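Parity here is a bitwise XOR, which is why a lost block can be rebuilt. The sketch below uses three tiny hypothetical data blocks; `xor_blocks` is an illustrative helper, not part of any RAID tool.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR (parity) of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Block i on three data disks, plus the parity block on the parity disk.
data_disks = [b"\x0f\xf0", b"\x33\x55", b"\xaa\x01"]
parity = xor_blocks(data_disks)

# Suppose the second data disk fails: XOR the survivors with the parity block.
rebuilt = xor_blocks([data_disks[0], data_disks[2], parity])
print(rebuilt == data_disks[1])
```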
RAID-5
• Similar to RAID-4, but the parity blocks are distributed across all the disks
• Example: Given 5 disks (4 regular and 1 parity):
• Use disk 1 for parity of block 1
• Use disk 2 for parity of block 2
• etc.
• Reads are the same
• Writes are faster than in RAID-4 as a single parity disk is no longer a bottleneck
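One possible reading of the "etc." in this example is that the parity location simply rotates through the disks. The sketch below assumes exactly that (block 6 wraps around to disk 1); the function names are hypothetical and real RAID-5 layouts may rotate differently.

```python
NUM_DISKS = 5

def parity_disk(block: int) -> int:
    """1-based disk that holds the parity of 1-based block number `block`."""
    return (block - 1) % NUM_DISKS + 1

def data_disks(block: int):
    """Disks that hold the data pieces of this block."""
    return [d for d in range(1, NUM_DISKS + 1) if d != parity_disk(block)]

for block in (1, 2, 6):
    print(block, parity_disk(block), data_disks(block))
```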
Tuple organization
• A disk page typically stores multiple tuples. Many different organizations exist.
• The number of tuples that can fit in a page is determined by the number of attributes and the types of attributes the relation has.
[Figure: page layout. Header info, then a row directory (entries 1, 2, ..., N), free space, and the data rows (Row N, Row N-1, ..., Row 1).]
Tuple addressing
• Tuples have a physical address which contains the relevant subset of:
• Host name/Disk number/Surface No/Track No/Sector No
• Physical addresses tend to be long
• Tuples are also given a logical address in the relation
• A map table stored on disk contains the mapping from the logical address to the physical address
Tuple addressing
• When a tuple is brought from disk to memory, its current address becomes a memory address
• Pointer swizzling is the act of changing the physical address to the memory address in the map table for pages in memory
Indexing
• An index is a lookup structure built on a search key
• the search key can consist of multiple attributes
• the index contains pointers to tuples (logical addresses)
• The index itself is also packed into pages and stored on disk.
Dense vs. sparse
• The index is called dense if it contains an entry for each tuple in the relation.
• An index is called sparse if it does not contain an entry for each tuple.
• A sparse index is possible if the addressed relation is sorted with respect to the index key.
Dense Index Example
Index (key, pointer): (1, t1), (2, t3), (4, t5), (5, t6), (8, t7), (9, t10), (10, t12)
Indexed relation (in the order shown on the slide): t1, t7, t12, t5, t6, t3, t10
Sparse Index Example
Index (key, pointer): (1, t1), (8, t7)
Indexed relation (sorted on the index key): t1, t3, t5, t6, t7, t10, t12
• The entry 1,t1 covers all key values between 1 and 5; the entry 8,t7 covers all key values greater than 5
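A lookup through a sparse index first finds the last index entry whose key is not greater than the search key, then scans only the part of the sorted relation that entry covers. The sketch below is an in-memory illustration; the key values are taken from the dense index example, and storing list positions instead of disk addresses is an assumption.

```python
from bisect import bisect_right

# Relation sorted on the key: (key, tuple id).
relation = [(1, "t1"), (2, "t3"), (4, "t5"), (5, "t6"),
            (8, "t7"), (9, "t10"), (10, "t12")]
# Sparse index: (key, position in `relation` where that entry's range starts).
sparse_index = [(1, 0), (8, 4)]

def lookup(key):
    """Return the tuple id stored under `key`, or None if it is absent."""
    index_keys = [k for k, _ in sparse_index]
    slot = bisect_right(index_keys, key) - 1  # last entry with key <= search key
    if slot < 0:
        return None
    start = sparse_index[slot][1]
    end = sparse_index[slot + 1][1] if slot + 1 < len(sparse_index) else len(relation)
    for k, tuple_id in relation[start:end]:   # scan only the covered range
        if k == key:
            return tuple_id
    return None

print(lookup(5), lookup(9), lookup(3))
```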
Index types
• An index can be
• primary, i.e. determines where the tuples are stored
• secondary, i.e. points to the tuples
• There can be many secondary indices.
• An index can be multi-level, i.e. a tree index, where each level is an index on the level below.
B-trees
• B-trees (called B+ trees in some books) are constructed on a list of attributes (also called the index key)
• Each node of a B-tree is mapped to a disk page
• Leaf nodes:
• A leaf node can contain at most n tuples (key values and pointers) and 1 additional pointer to the sibling node.
• A leaf node must contain at least floor((n+1)/2) tuples (plus one additional pointer to the next sibling node).
B-trees
• Internal nodes:
• An internal node can contain at most n + 1 pointers and n key values.
• An internal node must contain at least floor((n+1)/2) pointers (and one less key value), except the root which can contain a single key value and 2 pointers.
B-tree example
• Suppose n = 3
• Each leaf node will have at least 2 and at most 3 tuples.
• Each internal node will point to at least 2 and at most 4 nodes below (and hence will have between 1 and 3 key values).
• Suppose n = 99
• Each leaf node will have at least 50 and at most 99 tuples.
• Each internal node will point to at least 50 and at most 100 nodes below (and hence will have between 49 and 99 key values).
• The root needs at least 2 pointers and 1 key value.
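The occupancy bounds follow directly from the floor((n+1)/2) rule stated for leaf and internal nodes. A small sketch with hypothetical function names, each returning a (minimum, maximum) pair:

```python
def leaf_tuple_bounds(n):
    """(min, max) number of tuples in a leaf node."""
    return (n + 1) // 2, n

def internal_pointer_bounds(n):
    """(min, max) number of child pointers in a non-root internal node."""
    return (n + 1) // 2, n + 1

def internal_key_bounds(n):
    """A node with p pointers has p - 1 key values."""
    low, high = internal_pointer_bounds(n)
    return low - 1, high - 1

for n in (3, 99):
    print(n, leaf_tuple_bounds(n), internal_pointer_bounds(n), internal_key_bounds(n))
```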
Sibling nodes
• Each leaf node points to the next node at the leaf level, called its sibling node.
B-trees
• Leaf nodes contain pairs of
• key values
• pointers to the tuple
• If the B-tree is a secondary index, then there is an entry in the leaf level for each tuple in the relation.
• The leaf nodes also contain a pointer to the next (sibling) leaf node.
B-trees
• Internal nodes contain at most n key values and one more pointer than key values (at most n+1)
• The pointers point to the nodes at the level below
[Figure: an internal node with key values 10, 25, 32 and four pointers, leading to: values < 10; values >= 10 and < 25; values >= 25 and < 32; values >= 32.]
Example B-tree
Assume at most 4 key values per node
Root: [53]
Internal nodes: [11 30] [66 78]
Leaves: [2 7] [11 15 22] [30 41] [53 54 63] [66 69 71 76] [78 84 93]
(each leaf entry also holds a pointer to its tuple)
B-trees with duplicate values
• If the B-tree is built on a key value that may contain duplicates, build the index in an identical way, except:
• The entry in the non-leaf node pointing to a leaf node contains the first key value in that leaf that is not repeated from the previous sibling
• If there is no such key, then a null value is stored at this location.
Example B-tree with duplicates
Assume at most 4 key values per node
Root: [22]
Internal nodes: [11 18] [- 55] (- is a null entry)
Leaf key values in order: 2 7 11 15 15 15 18 18 22 41 41 41 41 41 55 63
B-tree equality search
• Given select * from R where A = x and an index on R.A (assume no duplicate values for R.A):
• Start from the root. While not at leaf level:
• Find the address of the node below that may contain this value (the pointer to the left of the first key value that is greater than x, or the last pointer if no such value exists)
• Read that node from disk
• If the leaf node contains an entry with the searched value, read the matching tuple from disk and return it
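The descent can be sketched on an in-memory version of the earlier example tree (at most 4 key values per node). The dictionary-based node layout and the leaf boundaries are assumptions made for illustration; a real index would read each node from a disk page.

```python
from bisect import bisect_right

def leaf(keys):
    return {"keys": keys, "children": None}

def internal(keys, children):
    return {"keys": keys, "children": children}

tree = internal([53], [
    internal([11, 30], [leaf([2, 7]), leaf([11, 15, 22]), leaf([30, 41])]),
    internal([66, 78], [leaf([53, 54, 63]), leaf([66, 69, 71, 76]), leaf([78, 84, 93])]),
])

def find_leaf(root, x):
    """Descend to the leaf that may contain x; return (leaf, nodes read)."""
    node, reads = root, 1
    while node["children"] is not None:
        # Index of the first key greater than x; the pointer with the same
        # index sits to the left of that key (or is the last pointer).
        i = bisect_right(node["keys"], x)
        node = node["children"][i]
        reads += 1
    return node, reads

def contains(root, x):
    found, _ = find_leaf(root, x)
    return x in found["keys"]

print(find_leaf(tree, 69), contains(tree, 70))
```

Each search reads one node per level, so the cost is the height of the tree plus the read of the matching tuple.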
B-tree equality search
• Given select * from R where A = x and an index on R.A (assume R.A may contain duplicate values):
• Start from the root. While not at leaf level:
• Find the address of the node below that may contain this value (the pointer to the left of the first key value that is greater than x, or the last pointer if no such value exists)
• Read that node from disk
• If the leaf node contains an entry with the searched value, follow the sibling pointers until a value different from x is found. Read the matching tuples from disk and return them
B-tree range search
• Given select * from R where A < y and A > x and an index on R.A:
• Using the same algorithm from before, find the first leaf node containing a value > x
• Traverse the sibling pointers from left to right until all tuples in the range are read
• Read all the matching tuples from the disk
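The leaf-level part of a range search is a scan along the sibling pointers. The sketch below models only the leaf level, with hypothetical leaf contents and list positions standing in for sibling pointers; the descent that finds the starting leaf is assumed to have happened already.

```python
# Leaf level only: (sorted key values, position of the next sibling or None).
leaves = [
    ([2, 7], 1), ([11, 15, 22], 2), ([30, 41], 3),
    ([53, 54, 63], 4), ([66, 69, 71, 76], 5), ([78, 84, 93], None),
]

def range_scan(start_leaf, x, y):
    """Return all key values k with x < k < y, starting at leaf `start_leaf`."""
    result = []
    position = start_leaf
    while position is not None:
        keys, next_sibling = leaves[position]
        for k in keys:
            if k >= y:
                return result      # leaves are sorted, nothing further matches
            if k > x:
                result.append(k)
        position = next_sibling    # follow the sibling pointer
    return result

# Leaf 1 is the first leaf holding a value greater than 15.
print(range_scan(1, 15, 60))
```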
Index only search
• Given select A from R where A < 120 and A > 10 and an index on R.A:
• Scan the index for matching tuples as before and return the found A values (no need to read the tuples from disk)
Index partial match
• Given an index on R.A, R.B (index is sorted on A first and then on B)
• Select * from R where A > 10 and A < 100 and B=2
• Scan index for the range A > 10 and A < 100, and for each matching tuple check the B value, read matched tuples from disk
• Select * from R where B > 10 and B < 100
• Scan the leaf level of the index completely to find the matching B tuples, read matched tuples from disk
Insertion
1. Given a new entry A to be inserted
1.1. Search the tree for the leaf node X where the new entry belongs
1.2. If the leaf node X has space for the new entry, insert.
1.3. Otherwise
1.3.1. Create a new leaf node Y and distribute the entries in X and the entry A to X and the new node
1.3.2. Create a new entry B with the address of Y and the lowest entry in Y
1.3.3. Insert B into the parent of X recursively (go to step 1.2)
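Steps 1.2 through 1.3.2 for a single leaf can be sketched as follows. The function name is hypothetical, entries are plain key values, and giving the left node the larger half on a split is an assumption.

```python
from bisect import insort

def insert_into_leaf(entries, key, n):
    """Insert `key` into a sorted leaf that holds at most n entries.

    Returns (left, right, separator). When the leaf has space, right and
    separator are None. Otherwise the entries are split between X (left)
    and a new node Y (right), and the separator is the lowest entry in Y,
    which step 1.3.3 would insert into the parent.
    """
    entries = list(entries)
    insort(entries, key)
    if len(entries) <= n:
        return entries, None, None
    middle = (len(entries) + 1) // 2   # left node keeps the larger half
    left, right = entries[:middle], entries[middle:]
    return left, right, right[0]

print(insert_into_leaf([53, 54, 63], 57, n=4))
print(insert_into_leaf([53, 54, 57, 63], 65, n=4))
```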
Insert Example
Insert record with key 57 (at most 4 key values)
Root: [53]
Internal nodes: [11 30] [66 78]
Leaves: [2 7] [11 15 22] [30 41] [53 54 63] [66 69 71 76] [78 84 93]
Insert Example
Insert record with key 57 (at most 4 key values)
Root: [53]
Internal nodes: [11 30] [66 78]
Leaves: [2 7] [11 15 22] [30 41] [53 54 57 63] [66 69 71 76] [78 84 93]
We are done! No rebalancing necessary
Another Insert Example
Insert 65
Root: [53]
Internal nodes: [11 30] [66 78]
Leaves: [2 7] [11 15 22] [30 41] [53 54 57 63] [66 69 71 76] [78 84 93]
Another Insert Example
Overflown node is split
Root: [53]
Internal nodes: [11 30] [63 66 78]
Leaves: [2 7] [11 15 22] [30 41] [53 54 57] [63 65] [66 69 71 76] [78 84 93]
Another Insert Example
Insert 70 and 94, one more node split
Root: [53]
Internal nodes: [11 30] [63 66 71 78]
Leaves: [2 7] [11 15 22] [30 41] [53 54 57] [63 65] [66 69 70] [71 76] [78 84 93 94]
Another Insert Example
Finally, insert 90 (which will cause the parent to split)
Root: [53 71]
Internal nodes: [11 30] [63 66] [78 93]
Leaves: [2 7] [11 15 22] [30 41] [53 54 57] [63 65] [66 69 70] [71 76] [78 84 90] [93 94]
Deletion
• Suppose we would like to delete entry A
• Locate leaf node X containing entry A and delete A
• If X still has n/2 or more pointers, it is full enough; if we deleted the smallest entry in X, adjust the parent entry pointing to X (recursively up the tree if necessary)
Deletion
• Otherwise, the node X has too few pointers.
• If a sibling node with the same parent has more than the minimum number of pointers, then redistribute entries with the sibling and adjust the parent pointers
• Else, merge X with a sibling:
• insert all the entries in X into a sibling B
• delete node X
• adjust the parent entry for B
• delete the entry Y in the parent for X recursively (go to the first step of this algorithm)
Deletion Example
Delete key 30
Root: [53]
Internal nodes: [11 22] [78]
Leaves: [2 7] [11 15 17] [22 30] [53 54] [78 84 93]
Deletion Example
Delete key 30
Borrow from neighbor: redistribute between the second and third leaf nodes.
Adjust the internal node
Root: [53]
Internal nodes: [11 17] [78]
Leaves: [2 7] [11 15] [17 22] [53 54] [78 84 93]
Another Deletion Example
Delete key 7
Cannot borrow from neighbor, merge with neighbor
Root: [53]
Internal nodes: [11 17] [78]
Leaves: [2 7] [11 15] [17 22] [53 54] [78 84 93]
Another Deletion Example
Delete the corresponding pointer
Root: [53]
Internal nodes: [17] [78]
Leaves: [2 11 15] [17 22] [53 54] [78 84 93]
Another Deletion Example
Delete 53, must merge with a sibling
Root: [53]
Internal nodes: [17] [78]
Leaves: [2 11 15] [17 22] [53 54] [78 84]
Another Deletion Example
Node too empty, cannot borrow from sibling, must merge with sibling
Root: [53]
Internal nodes: [17] [78]
Leaves: [2 11 15] [17 22] [54 78 84]
Another Deletion Example
The final tree.
Root: [17 54]
Leaves: [2 11 15] [17 22] [54 78 84]
A B-Tree Example
Given:
disk page has capacity of 4K bytes
each tuple address takes 6 bytes and each key value takes 2 bytes
each node is 70% full
need to store 1 million tuples
A B+-Tree Example
Leaf node capacity
• each (key value, tuple address) pair takes 8 bytes
• disk page capacity is 4K, so (4*1024)/8 = 512 (key value, tuple address) pairs per leaf page
in reality there are extra headers and pointers that we will ignore
• Hence, the maximum number of pointers in a node is about 512 (and 511 key values)
Example Continued
• If all pages are 70% full, each page has about 512*0.7 = 359 pointers
• To store 1 million tuples requires
1,000,000 / 359 = 2786 pages at the leaf level
2786 / 359 = 8 pages at the next level up
1 root page pointing to those 8 pages
Hence, we have a B-tree with 3 levels
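The level count can be reproduced with a short calculation. This sketch rounds up at every step, as the example does, and ignores page headers; the variable names are made up for illustration.

```python
import math

PAGE_BYTES = 4 * 1024
ENTRY_BYTES = 6 + 2      # tuple address + key value
FILL_FACTOR = 0.7
TUPLES = 1_000_000

capacity = PAGE_BYTES // ENTRY_BYTES           # entries per completely full page
per_page = math.ceil(capacity * FILL_FACTOR)   # entries per 70% full page

pages_per_level = []     # leaf level first, root last
count = TUPLES
while True:
    count = math.ceil(count / per_page)
    pages_per_level.append(count)
    if count == 1:
        break

print(capacity, per_page, pages_per_level)
```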
Hashing
• Given a hash table of K buckets
• Allocate a number of disk blocks M to each bucket
• For each tuple t, apply the hash function. Suppose we hash on attribute A: if h(t.A) = x, then store t in the blocks allocated for bucket x.
• Search on attribute A (select * from r where r.a=c)
• Cost: M/2 (search half the pages for that bucket on average)
Hashing
• Search on another attribute
• Cost: N, where N is the total number of pages (the hash does not help, so every page is scanned)
• Insertion cost: 1 read and 1 write (find the last page in the appropriate bucket and store)
• Deletion/Update cost: M/2 (search cost) + 1 to update
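A toy version of this scheme, counting page reads instead of doing real I/O. The bucket count, the page capacity, and the use of `key % K` as the hash function are all assumptions chosen to keep the example small.

```python
K = 4              # number of buckets (hypothetical)
PAGE_CAPACITY = 2  # tuples per page (hypothetical)

buckets = [[[]] for _ in range(K)]   # each bucket is a list of pages

def h(key):
    return key % K

def insert(tuple_id, key):
    pages = buckets[h(key)]
    if len(pages[-1]) == PAGE_CAPACITY:
        pages.append([])             # last page is full, use another page
    pages[-1].append((key, tuple_id))

def search(key):
    """Return (tuple id or None, pages read); only one bucket is touched."""
    reads = 0
    for page in buckets[h(key)]:
        reads += 1
        for k, tuple_id in page:
            if k == key:
                return tuple_id, reads
    return None, reads

for key in range(12):
    insert(f"t{key}", key)

print(search(4), search(8), search(12))
```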
Hashing - collisions
• If a bucket has too many tuples, then the allocated M pages may not be sufficient
• Allocate additional overflow area
• If the overflow area is large, the benefit of the hash is lost
Extensible hashing
• The address space of the hash (K) can be adjusted to the number of tuples in the relation
• Use a hash function h
• But, use only first z bits of the hashed value to address the tuples
• If a bucket overflows, split the hash directory and use z+1 bits to address
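A compact sketch of the idea, under several assumptions: keys are integers, Python's built-in `hash` stands in for h, the z low-order bits of the hash are used for addressing, and each page remembers how many bits it uses so that a page can be split without doubling the directory when it uses fewer bits than the directory. The class and field names are hypothetical.

```python
class ExtensibleHash:
    """Toy extensible hash; each page holds at most `capacity` keys."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.z = 1   # number of bits the directory uses
        self.directory = [{"bits": 1, "keys": []}, {"bits": 1, "keys": []}]

    def _slot(self, key):
        return hash(key) & ((1 << self.z) - 1)   # low-order z bits

    def contains(self, key):
        return key in self.directory[self._slot(key)]["keys"]

    def insert(self, key):
        page = self.directory[self._slot(key)]
        if len(page["keys"]) < self.capacity:
            page["keys"].append(key)
            return
        if page["bits"] == self.z:
            # The page already uses every directory bit: double the directory.
            # The copy points at the same pages as the original half.
            self.directory = self.directory + self.directory
            self.z += 1
        # Split the page on its next bit and relink half of its slots.
        split_bit = 1 << page["bits"]
        page["bits"] += 1
        new_page = {"bits": page["bits"], "keys": []}
        for i, p in enumerate(self.directory):
            if p is page and i & split_bit:
                self.directory[i] = new_page
        old_keys, page["keys"] = page["keys"], []
        for k in old_keys:
            self.directory[self._slot(k)]["keys"].append(k)
        self.insert(key)   # retry; may trigger another split

table = ExtensibleHash(capacity=2)
for k in (0, 2, 4):
    table.insert(k)
print(table.z, [p["bits"] for p in table.directory])
```

After these three inserts the directory has four slots, and slots 01 and 11 still share one page that uses a single bit, so that page can later be split without touching the directory.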
Extensible hashing
• Using a single bit to address tuples (z=1)
[Figure: a directory with entries 0 and 1 pointing to Page 0 and Page 1. A new point is inserted and Page 0 overflows.]
Extensible hashing
• Double the directory
[Figure: the entries of the overflowing page are distributed to 00 and 10.]
• Make a copy of the directory
[Figure: directory entries 00, 01, 10, 11; pages labeled Page 0, Page 1, Page 2, Page 3.]
• Update the link for the new node
Extensible hashing
• How do we know which nodes can be split without splitting the directory?
[Figure: the same directory, with the numbers 2, 1, 2 shown next to the pages.]
Linear hashing
• The addressing is the same, but we allow overflows
• We decide to split based on a global rule
• For example, split if the number of tuples per page exceeds a threshold k
• Split one bucket at a time
[Figure: two buckets, 0 and 1.]
Linear hashing
[Figure: a new point is inserted into buckets 0 and 1 and we decide to split. Bucket 0 is split and its contents are distributed into 00 and 10. Bucket 1 is not split and still contains all entries for 01 and 11.]
Linear hashing
• The bucket to split is the next one in sequence
• it may not be the one that has overflow pages
• eventually all buckets will be split
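The split order can be sketched as below. The assumptions: integer keys, Python's `hash` as the hash function, low-order bits for addressing, unbounded in-memory lists standing in for buckets with overflow pages, and a global rule of "average number of keys per bucket exceeds `max_load`". All names are hypothetical.

```python
class LinearHash:
    def __init__(self, max_load=2.0):
        self.max_load = max_load  # global split rule (assumed threshold)
        self.level = 1            # buckets are addressed with `level` bits...
        self.next = 0             # ...except those before `next`, already split
        self.buckets = [[], []]
        self.count = 0

    def _address(self, key):
        a = hash(key) & ((1 << self.level) - 1)
        if a < self.next:         # bucket already split in this round
            a = hash(key) & ((1 << (self.level + 1)) - 1)
        return a

    def contains(self, key):
        return key in self.buckets[self._address(key)]

    def insert(self, key):
        self.buckets[self._address(key)].append(key)   # overflow is allowed
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._split()

    def _split(self):
        """Split the next bucket in sequence, whether or not it overflowed."""
        old = self.buckets[self.next]
        self.buckets[self.next] = []
        self.buckets.append([])   # new bucket at index next + 2**level
        self.next += 1
        if self.next == 1 << self.level:   # the whole round has been split
            self.level += 1
            self.next = 0
        for k in old:
            self.buckets[self._address(k)].append(k)

table = LinearHash(max_load=2.0)
for k in (1, 3, 5, 7, 9):
    table.insert(k)
print(table.buckets, table.next)
```

In this run all keys land in bucket 1, yet the split triggered by the fifth insert is applied to bucket 0, because bucket 0 is the next one in sequence.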