
Secondary Storage and Indexing

CSCI 4380 Database Systems


Disk access

• Databases are generally large data stores, much larger than available memory.

• Data is stored on disk and brought into memory on demand.

• A disk page or block is the smallest unit of access for reads and writes.

• A disk page is typically 1 KB - 8 KB.

Disk organization

• A disk contains

• multiple platters (usually 2 surfaces per platter)

• read/write heads that usually allow us to read/write from all surfaces simultaneously

Disk organization

• A disk surface contains

• multiple concentric tracks

• The same track on different surfaces can be read by different heads at the same time; this unit is called a cylinder.

Disk organization

• A track is broken down into sectors; sectors are separated from each other by blank gaps.

• A sector is the smallest unit of operation (read/write) possible on a disk.

• A disk block is usually composed of a number of consecutive sectors (determined by the operating system).

• Data are read/written in units of a disk block/page.

• A disk block is the same size as a memory block or page.

Reading a disk page

• Reading a page from disk requires the disk to be spinning.

• The disk arm has to be moved to the correct track of the disk -> seek operation.

• The disk head must wait until the right location on the track is found -> rotational latency.

• Then, the disk page can be read from disk and copied to memory -> transfer time.

Reading a disk page

• The cost of reading a disk page:

• seek time + rotational latency + transfer time

• Multiple pages on the same track/cylinder can be read with a single seek and rotational latency. Reading M pages on the same track/cylinder costs:

• seek time + rotational latency + transfer time * (fraction of the track circumference to be scanned)

A high-end disk example

• Consider a disk with 2^4 = 16 surfaces, 2^16 tracks per surface (approx. 65K), 2^8 = 256 sectors per track and 2^12 bytes per sector.

• Each track has 2^12 * 2^8 = 2^20 bytes (1 MB)

• Each surface has 2^20 * 2^16 = 2^36 bytes

• The disk has 2^4 * 2^36 = 2^40 bytes = 1 TB

Reading a page

• Typical times:

• 7200 rpm means one rotation takes 8.33 ms (on average, half a rotation is needed before the correct location is found: 4.17 ms)

• seek time between 0 and 17.38 ms (on average, 1/3 of the disk surface is traversed: 6.46 ms)

• transfer time for one sector: 8.33/256 = 0.03 ms

Reading a page

• Reading a page of 8K (2 sectors):

• 1 seek + 1 rotational latency + 2 sector transfer times

• 6.46 + 4.17 + 0.03 * 2 = 10.69 ms

• Reading 100 consecutive pages (200 sectors) on the same track:

• 6.46 + 4.17 + 0.03 * 200 = 16.63 ms

• The lesson: put blocks that are accessed together on the same track/cylinder as much as possible (see the cost-model sketch below).
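A minimal sketch of this cost model (not from the slides); the timing constants are the example values above and are parameters of the illustration, not properties of any real disk.

```python
# Rough cost model for reading pages from disk (values from the example above).
AVG_SEEK_MS = 6.46         # average seek time
AVG_ROT_LATENCY_MS = 4.17  # half a rotation at 7200 rpm
SECTOR_XFER_MS = 0.03      # 8.33 ms per rotation / 256 sectors per track
SECTORS_PER_PAGE = 2       # an 8K page made of two 4K sectors

def read_cost_ms(num_pages: int, consecutive: bool) -> float:
    """Estimated time to read num_pages pages.

    Consecutive pages on the same track pay the seek and rotational
    latency only once; random pages pay them for every page.
    """
    transfer = num_pages * SECTORS_PER_PAGE * SECTOR_XFER_MS
    if consecutive:
        return AVG_SEEK_MS + AVG_ROT_LATENCY_MS + transfer
    return num_pages * (AVG_SEEK_MS + AVG_ROT_LATENCY_MS) + transfer

print(read_cost_ms(1, consecutive=True))     # ~10.69 ms for a single page
print(read_cost_ms(100, consecutive=True))   # one seek/latency for 100 pages
print(read_cost_ms(100, consecutive=False))  # 100 random reads are far slower
```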


Disk scheduling

• The disk controller can order the requests to minimize seeks.

• When the controller is moving from low tracks to high tracks, serve the next request in the direction of the movement and queue the rest.

• This method is called the elevator algorithm; a small sketch follows.
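A small sketch of the elevator (SCAN) idea described above: serve pending requests in the current direction of head movement, then sweep back. The track numbers and request queue are made up for illustration.

```python
def elevator_schedule(requests, head, moving_up=True):
    """Order track requests elevator-style: serve everything in the current
    direction of head movement first, then sweep back the other way."""
    above = sorted(t for t in requests if t >= head)
    below = sorted((t for t in requests if t < head), reverse=True)
    return above + below if moving_up else below + above

# Head at track 50, moving toward higher tracks.
print(elevator_schedule([12, 95, 53, 7, 60, 48], head=50))
# -> [53, 60, 95, 48, 12, 7]
```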


Checksums

• For each sector, store a number of error-checking bits called checksums.

• The checksum bit is 1 if the number of 1's in the given sector is odd, and 0 if the number of 1's is even.

• When reading a sector, check that the checksum is correct.

• This catches all 1-bit errors.

• For errors of more than 1 bit, the checksum catches them only about 50% of the time.

• For better error detection, use multiple bits (e.g. 8 bits, where bit i stores the parity of the ith bit of each byte); see the sketch below.
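A sketch of the parity scheme above (not from the slides): a single parity bit over the whole sector, and the 8-bit variant where bit i holds the parity of bit i of every byte.

```python
def parity_bit(sector: bytes) -> int:
    """1 if the total number of 1-bits in the sector is odd, else 0."""
    ones = sum(bin(b).count("1") for b in sector)
    return ones % 2

def parity_byte(sector: bytes) -> int:
    """8-bit checksum: bit i is the parity of the i-th bit of each byte."""
    check = 0
    for b in sector:
        check ^= b          # XOR accumulates per-bit parity
    return check

data = bytes([0b10110010, 0b00001111])
assert parity_bit(data) == 0          # 8 one-bits in total -> even -> 0
print(parity_bit(data), bin(parity_byte(data)))
```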


Stable storage

• When we are writing a sector, if the write fails, we lose the data on that sector.

• Use two sectors, XL and XR, for each logical sector X.

• First write XL and check the checksum. If XL is written correctly, then write XR.

• If XL is written incorrectly, the old version of X is still stored in XR.

• If XR is written incorrectly, the new version of X is stored in XL. A sketch of this write protocol follows.
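A sketch of the stable-storage write order described above. The helpers write_sector and checksum_ok are hypothetical stand-ins; a real implementation would operate below the file system.

```python
def stable_write(new_data, write_sector, checksum_ok):
    """Write new_data using two physical copies, XL and XR.

    write_sector(which, data) and checksum_ok(which) are assumed helpers.
    At every point in time at least one of XL/XR holds a good copy:
    XR keeps the old value until XL is known to be written correctly.
    """
    write_sector("XL", new_data)
    if not checksum_ok("XL"):
        # XL is bad, but XR still holds the old version of X.
        raise IOError("write to XL failed; old value preserved in XR")
    write_sector("XR", new_data)
    if not checksum_ok("XR"):
        # XR is bad, but the new version is already safe in XL.
        raise IOError("write to XR failed; new value preserved in XL")
```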


Multiple disks

• RAID (redundant array of inexpensive disks) is a family of methods for improving access time and reducing the possibility of data loss by using multiple disks.

RAID-0

• RAID-0, striping

• Distribute the data across multiple disks.

• Example with 4 disks (see the mapping sketch below):

• Disk 1 has pages 1, 5, 9

• Disk 2 has pages 2, 6, 10

• Disk 3 has pages 3, 7, 11

• Disk 4 has pages 4, 8, 12
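A sketch of the striping rule in this example (not from the slides): with 4 disks, page p lands on disk ((p - 1) mod 4) + 1.

```python
NUM_DISKS = 4

def disk_for_page(page_no: int) -> int:
    """RAID-0 striping: pages are dealt out to disks round-robin.
    Pages are numbered from 1, disks from 1, as in the example."""
    return (page_no - 1) % NUM_DISKS + 1

assert [disk_for_page(p) for p in (1, 5, 9)] == [1, 1, 1]
assert [disk_for_page(p) for p in (2, 6, 10)] == [2, 2, 2]
assert disk_for_page(12) == 4
```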


RAID-0

• RAID-0, striping

• Reads are faster (read from all disks simultaneously)

• Writes are the same

• No redundancy in case a disk fails


RAID-1

• RAID-1, mirroring

• Mirror each disk onto another disk.

• Reads are twice as fast: read from whichever disk is available.

• Writes are slower, since each write requires writing to two disks.

• If one of the disks fails, the other one contains all the data (no data loss).

RAID-4

• One disk contains the parity of the remaining disks.

• Block i on the parity disk contains the parity of the ith block of all the remaining disks.

• Reads are unchanged.

• Writes are slower: each write requires a write to the parity disk as well.

• If a disk fails, the lost data can be reconstructed from the remaining disks (see the XOR sketch below).
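A sketch of block-level parity (not from the slides): the parity block is the bitwise XOR of the corresponding blocks on the data disks, and a lost block can be rebuilt by XOR-ing the surviving blocks with the parity.

```python
def xor_blocks(blocks):
    """Bitwise XOR of equally sized blocks (bytes objects)."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

data_disks = [b"\x0f\xf0", b"\xaa\x55", b"\x12\x34"]
parity = xor_blocks(data_disks)           # stored on the parity disk

# Disk 2 fails: rebuild its block from the other data disks plus parity.
rebuilt = xor_blocks([data_disks[0], data_disks[2], parity])
assert rebuilt == data_disks[1]
```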


RAID-5

• Similar to RAID-4, but the parity blocks are distributed over all the disks.

• Example: given 5 disks (4 regular and 1 parity):

• Use disk 1 for the parity of block 1

• Use disk 2 for the parity of block 2

• etc.

• Reads are the same.

• Writes are faster than RAID-4, as the parity disk is no longer a bottleneck.

Tuple organization

• A disk page typically stores multiple tuples. Many different organizations exist.

• The number of tuples that can fit in a page is determined by the number and types of attributes the relation has.

(Figure: slotted page layout - header info, a row directory with slots 1, 2, ..., N, free space, and data rows stored from the end of the page as Row N, Row N-1, ..., Row 1.)

Tuple addressing

• Tuples have a physical address which contains the relevant subset of:

• Host name / Disk number / Surface No / Track No / Sector No

• Physical addresses tend to be long.

• Tuples are also given a logical address in the relation.

• A map table stored on disk contains the mapping from logical addresses to physical addresses.

Tuple addressing

• When tuples are brought from disk to memory, their current address becomes a memory address.

• Pointer swizzling is the act of changing the physical address to the memory address in the map table for pages that are in memory.

Indexing

• An index is a lookup structure built on a search key.

• The search key can consist of multiple attributes.

• The index contains pointers to tuples (logical addresses).

• The index itself is also packed into pages and stored on disk.

Dense vs. sparse

• The index is called dense if it contains an entry for each tuple in the relation.

• An index is called sparse if it does not contain an entry for each tuple.

• A sparse index is possible if the addressed relation is sorted with respect to the index key.


Dense Index Example

• Index entries: (1, t1), (2, t3), (4, t5), (5, t6), (8, t7), (9, t10), (10, t12).

• Each index entry points to the corresponding tuple in the indexed relation; there is one entry per tuple, and the relation itself need not be sorted on the key (in the figure the tuples appear in the order t1, t7, t12, t5, t6, t3, t10).

Sparse Index Example

• Index entries: (1, t1) and (8, t7).

• The indexed relation stores tuples t1, t3, t5, t6, t7, t10, t12 sorted on the key.

• Entry (1, t1) covers all key values between 1 and 5; entry (8, t7) covers all key values greater than 5. A lookup sketch follows.
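A sketch of a sparse-index lookup (not from the slides), assuming the relation is sorted on the key and the index holds one (first key on page, page) entry per data page. bisect finds the last index entry whose key is <= the search key; that page is then scanned.

```python
import bisect

# One (first key on page, page id) entry per data page; relation sorted on key.
sparse_index = [(1, "page-t1"), (8, "page-t7")]
index_keys = [k for k, _ in sparse_index]

def lookup(search_key):
    """Return the page that may contain search_key, or None if the key is
    smaller than every key in the index."""
    pos = bisect.bisect_right(index_keys, search_key) - 1
    return sparse_index[pos][1] if pos >= 0 else None

print(lookup(5))   # page-t1: entry (1, t1) covers keys 1..5
print(lookup(9))   # page-t7: entry (8, t7) covers keys > 5
```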


Index types

• An index can be

• primary, i.e. determines where the tuples are stored

• secondary, i.e. points to the tuples

• There can be many secondary indices.

• An index can be multi-level, i.e. a tree index, where each level is an index on the level below.


B-trees

• B-trees (called B+ trees in some books) are constructed on a list of attributes (also called the index key).

• Each node of a B-tree is mapped to a disk page.

• Leaf nodes:

• A leaf node can contain at most n tuples (key values and pointers) plus 1 additional pointer to the sibling node.

• A leaf node must contain at least floor((n+1)/2) tuples (plus the additional pointer to the next sibling node).

B-trees

• Internal nodes:

• An internal node can contain at most n + 1 pointers and n key values.

• An internal node must contain at least floor((n+1)/2) pointers (and one fewer key value), except the root, which may contain as few as a single key value and 2 pointers.

B-tree example

• Suppose n = 3:

• Each leaf node will have at least 2 and at most 3 tuples.

• Each internal node will point to at least 2 and at most 4 nodes below (and hence will have between 1 and 3 key values).

• Suppose n = 99:

• Each leaf node will have at least 50 and at most 99 tuples.

• Each internal node will point to at least 50 and at most 100 nodes below (and hence will have between 49 and 99 key values).

• The root can have as few as 2 pointers and 1 key value.

Sibling nodes

• Each leaf node points to the next node at the leaf level, called its sibling node.

B-trees

• Leaf nodes contain pairs of

• key values

• pointers to the tuples

• If the B-tree is a secondary index, then there is an entry in the leaf level for each tuple in the relation.

• The leaf nodes also contain a pointer to the next (sibling) leaf node.

B-trees

• Internal nodes contain n key values and n+1 pointers.

• The pointers point to the nodes at the level below.

• Example: an internal node with key values 10, 25, 32 has four pointers: the first leads to values < 10, the second to values >= 10 and < 25, the third to values >= 25 and < 32, and the last to values >= 32.

Example B-tree

Assume at most 4 key values per node.

• Root: [53]; internal nodes: [11 30] and [66 78].

• Leaves: [2 7], [11 15 22], [30 41], [53 54 63], [66 69 71 76], [78 84 93].

• The leaf entries hold pointers to the tuples.

B-trees with duplicate values

• If the B-tree is built on a key that may contain duplicates, build the index in an identical way, except:

• The key stored in the non-leaf node for a leaf is the first key value in that leaf that does not also appear in the previous sibling leaf.

• If there is no such key, then a null value is stored at this position.

Example B-tree with duplicates

Assume at most 4 key values per node.

• Leaf level (in key order): 2 7 | 11 15 15 15 | 18 18 22 41 | 41 41 41 41 | 55 63.

• The parent's key values are 11, 18, - (null, since every key in that leaf repeats from the previous sibling), and 55.

B-tree equality search

• Given select * from R where A = x and an index on R.A (assume no duplicate values for R.A):

• While not at the leaf level:

• Starting from the root, find the address of the node below that may contain this value (the pointer to the left of the first key value that is greater than x, or the last pointer if no such key value exists)

• Read that node from disk

• If the leaf level contains an entry with the searched value, read the matching tuple from disk and return. A sketch of this descent follows.
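A sketch of the descent just described (not from the slides), using a toy in-memory node format: internal nodes hold sorted keys and child pointers, leaves hold (key, tuple address) pairs. Reading a node "from disk" is just following a Python reference here.

```python
class Node:
    def __init__(self, keys, children=None, entries=None):
        self.keys = keys            # sorted key values
        self.children = children    # child nodes (internal nodes only)
        self.entries = entries      # {key: tuple_address} (leaf nodes only)

    @property
    def is_leaf(self):
        return self.children is None

def btree_search(root, x):
    """Follow the pointer to the left of the first key greater than x
    (or the last pointer) until a leaf is reached."""
    node = root
    while not node.is_leaf:
        i = 0
        while i < len(node.keys) and node.keys[i] <= x:
            i += 1
        node = node.children[i]
    return node.entries.get(x)      # tuple address, or None if absent

leaf1 = Node([2, 7], entries={2: "t2", 7: "t7"})
leaf2 = Node([11, 15], entries={11: "t11", 15: "t15"})
root = Node([11], children=[leaf1, leaf2])
print(btree_search(root, 7), btree_search(root, 11), btree_search(root, 9))
```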


B-tree equality search

• Given select * from R where A = x and an index on R.A (assume R.A may contain duplicate values):

• While not at the leaf level:

• Starting from the root, find the address of the node below that may contain this value (the pointer to the left of the first key value that is greater than x, or the last pointer if no such key value exists)

• Read that node from disk

• If the leaf level contains entries with the searched value, follow sibling pointers until a value different from x is found; read the matching tuples from disk and return.

B-tree range search

• Given select * from R where A < y and A > x and an index on R.A:

• Using the same algorithm as before, find the first leaf node containing a value > x

• Traverse the sibling pointers from left to right until all entries in the range have been read

• Read all the matching tuples from disk (a sketch of the sibling walk follows)
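A sketch of the range scan at the leaf level (not from the slides): given the first leaf that can contain a key > x, follow sibling pointers rightward until a key >= y is seen.

```python
class Leaf:
    def __init__(self, entries, sibling=None):
        self.entries = entries      # sorted list of (key, tuple_address)
        self.sibling = sibling      # next leaf to the right, or None

def range_scan(first_leaf, x, y):
    """Collect tuple addresses with x < key < y, starting from the first
    leaf that can contain a key > x and following sibling pointers."""
    results, leaf = [], first_leaf
    while leaf is not None:
        for key, addr in leaf.entries:
            if key >= y:
                return results      # past the range: stop
            if key > x:
                results.append(addr)
        leaf = leaf.sibling
    return results

l3 = Leaf([(30, "t30"), (41, "t41")])
l2 = Leaf([(11, "t11"), (15, "t15"), (22, "t22")], sibling=l3)
l1 = Leaf([(2, "t2"), (7, "t7")], sibling=l2)
print(range_scan(l1, x=7, y=41))    # ['t11', 't15', 't22', 't30']
```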


Index only search

• Given select A from R where A < 120 and A > 10 and an index on R.A:

• Scan the index for matching tuples as before and return the found A values (no need to read the tuples from disk)


Index partial match

• Given an index on R.A, R.B (the index is sorted on A first and then on B):

• select * from R where A > 10 and A < 100 and B = 2

• Scan the index for the range A > 10 and A < 100; for each matching entry check the B value, then read the matched tuples from disk.

• select * from R where B > 10 and B < 100

• Scan the leaf level of the index completely to find the matching B values, then read the matched tuples from disk.

Insertion

1. Given a new entry A to be inserted:

1.1. Search the tree for the new entry.

1.2. If the leaf node X has space for the new entry, insert it.

1.3. Otherwise:

1.3.1. Create a new leaf node Y and distribute the entries of X plus the entry A between X and the new node.

1.3.2. Create a new entry B with the address of Y and the lowest key in Y.

1.3.3. Insert B into the parent of X recursively (go to step 1.2).

A leaf-level sketch of steps 1.2-1.3 follows.
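A sketch of the leaf-level case (not from the slides), using the toy (key, tuple address) leaf format from the earlier sketches: if the leaf overflows, its entries are split between the old and a new leaf, and the pair (lowest key of the new leaf, pointer to the new leaf) is what would be inserted into the parent.

```python
MAX_ENTRIES = 4   # at most 4 (key, address) pairs per leaf, as in the examples

def leaf_insert(entries, key, addr):
    """Insert into a sorted leaf. Returns (left_entries, None) if the leaf
    had room, or (left_entries, (sep_key, right_entries)) after a split;
    the separator (sep_key, pointer to the new leaf) goes to the parent."""
    entries = sorted(entries + [(key, addr)])
    if len(entries) <= MAX_ENTRIES:
        return entries, None
    mid = (len(entries) + 1) // 2           # keep the larger half on the left
    left, right = entries[:mid], entries[mid:]
    return left, (right[0][0], right)

left, split = leaf_insert([(53, "t53"), (54, "t54"), (57, "t57"), (63, "t63")],
                          65, "t65")
print(left)    # [(53, 't53'), (54, 't54'), (57, 't57')]
print(split)   # (63, [(63, 't63'), (65, 't65')])  -- matches the 'insert 65' example
```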


Insert Example

Insert a record with key 57 (at most 4 key values per node).

• Root: [53]; internal nodes: [11 30] and [66 78].

• Leaves: [2 7], [11 15 22], [30 41], [53 54 63], [66 69 71 76], [78 84 93].

Insert Example

After inserting 57, the leaf [53 54 63] becomes [53 54 57 63]; the rest of the tree is unchanged.

We are done! No rebalancing necessary.

Another Insert Example

Insert 65 into the resulting tree (leaves [2 7], [11 15 22], [30 41], [53 54 57 63], [66 69 71 76], [78 84 93]).

Another Insert Example

The overflowing node is split: leaf [53 54 57 63] cannot also hold 65, so it splits into [53 54 57] and [63 65], and the parent internal node becomes [63 66 78].

• Root: [53]; internal nodes: [11 30] and [63 66 78].

• Leaves: [2 7], [11 15 22], [30 41], [53 54 57], [63 65], [66 69 71 76], [78 84 93].

Another Insert Example

Insert 70 and 94; one more node split. Inserting 70 splits leaf [66 69 71 76] into [66 69 70] and [71 76]; 94 fits into the last leaf.

• Root: [53]; internal nodes: [11 30] and [63 66 71 76].

• Leaves: [2 7], [11 15 22], [30 41], [53 54 57], [63 65], [66 69 70], [71 76], [78 84 93 94].

Another Insert Example

Finally, insert 90 (which will cause the parent to split): 90 belongs in leaf [78 84 93 94], which is already full.

Another Insert Example

The full leaf splits into [78 84 90] and [93 94], which overflows its parent; the internal node splits into [63 66] and [78 93], and key 71 moves up into the root.

• Root: [53 71]; internal nodes: [11 30], [63 66] and [78 93].

• Leaves: [2 7], [11 15 22], [30 41], [53 54 57], [63 65], [66 69 70], [71 76], [78 84 90], [93 94].

Deletion

• Suppose we would like to delete entry A:

• Locate the leaf node X containing entry A and delete A.

• If X still has n/2 or more pointers, adjust the parent entry pointing to this node if necessary, recursively (needed if we deleted the smallest entry in the node).

Deletion

• Otherwise, the node has too few pointers:

• If a sibling node with the same parent has more than n/2 pointers, then redistribute entries with the sibling and adjust the parent keys.

• Else:

• delete A,

• move all the remaining entries of X into a sibling B,

• adjust the parent entry for B,

• delete the parent's entry Y for X recursively (go to the first step of this algorithm).

A simplified leaf-level sketch follows.
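A simplified leaf-level sketch of the borrow-or-merge decision (not from the slides); it only looks at one (right) sibling, while the slides' examples may borrow from either neighbor. Parent-key adjustment is left as a returned action label.

```python
MIN_ENTRIES = 2   # floor((n+1)/2) with n = 3, as in the small examples

def leaf_delete(entries, sibling_entries, key):
    """Delete key from a leaf; if the leaf gets too small, either borrow
    an entry from the sibling or merge with it.
    Returns (new_entries, new_sibling_entries, action)."""
    entries = [e for e in entries if e[0] != key]
    if len(entries) >= MIN_ENTRIES:
        return entries, sibling_entries, "ok"
    if len(sibling_entries) > MIN_ENTRIES:
        # Borrow the sibling's smallest entry; the parent key must be adjusted.
        borrowed, *rest = sibling_entries
        return entries + [borrowed], rest, "borrowed"
    # Merge: the sibling absorbs the remaining entries; the parent loses a pointer.
    return [], sorted(entries + sibling_entries), "merged"

print(leaf_delete([(22, "t22"), (30, "t30")], [(53, "t53"), (54, "t54")], 30))
# -> ([], [(22, 't22'), (53, 't53'), (54, 't54')], 'merged')
```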


Deletion Example

Delete key 30.

• Root: [53]; internal nodes: [11 22] and [78].

• Leaves: [2 7], [11 15 17], [22 30], [53 54], [78 84 93].

Deletion Example

Delete key 30: the leaf [22 30] would be left with a single entry, so borrow from a neighbor. Redistribute between the second and third leaf nodes and adjust the internal node.

• Root: [53]; internal nodes: [11 17] and [78].

• Leaves: [2 7], [11 15], [17 22], [53 54], [78 84 93].

Another Deletion Example

Delete key 7: the leaf [2 7] would be left with a single entry and cannot borrow from its neighbor, so it must merge with the neighbor.

• Root: [53]; internal nodes: [11 17] and [78].

• Leaves: [2 7], [11 15], [17 22], [53 54], [78 84 93].

Another Deletion Example

After merging, delete the corresponding pointer in the parent.

• Root: [53]; internal nodes: [17] and [78].

• Leaves: [2 11 15], [17 22], [53 54], [78 84 93].

Another Deletion Example

Delete 53; the leaf must merge with a sibling.

• Root: [53]; internal nodes: [17] and [78].

• Leaves: [2 11 15], [17 22], [53 54], [78 84].

Another Deletion Example

After merging the leaves, the internal node [78] is left with a single pointer: it is too empty and cannot borrow from its sibling, so it must merge with its sibling.

• Root: [53]; internal nodes: [17] and the underfull [78].

• Leaves: [2 11 15], [17 22], [54 78 84].

Another Deletion Example

The two internal nodes merge into [17 54], leaving the root with a single pointer.

• Root: [53]; internal node: [17 54].

• Leaves: [2 11 15], [17 22], [54 78 84].

Another Deletion Example

The root with a single pointer is removed and [17 54] becomes the new root.

• Root: [17 54]; leaves: [2 11 15], [17 22], [54 78 84].

The final tree.

A B-Tree Example

Given:

• a disk page has a capacity of 4K bytes

• each tuple address takes 6 bytes and each key value takes 2 bytes

• each node is 70% full

• we need to store 1 million tuples

A B+-Tree Example

Leaf node capacity:

• each (key value, tuple address) pair takes 8 bytes

• disk page capacity is 4K, so (4*1024)/8 = 512 (key value, rowid) pairs per leaf page

• in reality there are extra headers and pointers, which we ignore here

• Hence, the maximum number of pointers for the tree is about 256 (and 255 key values)

Example Continued

• If all pages are 70% full, each page has about 512 * 0.7 = 359 pointers.

• To store 1 million tuples requires:

• 1,000,000 / 359 = 2786 pages at the leaf level

• 2786 / 359 = 8 pages at the next level up

• 1 root page pointing to those 8 pages

• Hence, we have a B-tree with 3 levels (see the sketch below).
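The same arithmetic as above, as a small sketch; the page size, entry sizes and 70% fill factor are the example's assumptions, and the rounding differs slightly from the slide's 359.

```python
import math

PAGE_BYTES = 4 * 1024
ENTRY_BYTES = 6 + 2          # tuple address + key value
FILL = 0.7                   # pages are about 70% full
N_TUPLES = 1_000_000

entries_per_page = int(PAGE_BYTES / ENTRY_BYTES * FILL)   # about 358

# Count levels: each level up needs one entry per page of the level below.
level_pages, levels = N_TUPLES, 0
while level_pages > 1:
    level_pages = math.ceil(level_pages / entries_per_page)
    levels += 1

print(entries_per_page, levels)   # roughly 358 entries per page, 3 levels
```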


Hashing

• Given a hash table of K buckets:

• Allocate a number of disk blocks M to each bucket.

• For each tuple t, apply the hash function. Suppose we hash on attribute A: if h(t.A) = x, then store t in the blocks allocated for bucket x.

• Search on attribute A (select * from r where r.A = c):

• Cost: M/2 (on average, half the pages of that bucket are searched).

Hashing

• Search on another attribute:

• Cost: N (every page of the relation must be scanned).

• Insertion cost: 1 read and 1 write (find the last page in the appropriate bucket and store the tuple).

• Deletion/Update cost: M/2 (search cost) + 1 to write the update.

A sketch of the bucket addressing follows.
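A sketch of static hashing with K buckets of M pages each (not from the slides): hash the attribute value, take it modulo K, and store/search within that bucket's pages. Page capacity and the disk I/O itself are ignored; the hash function here is just Python's built-in hash.

```python
K = 8          # number of buckets
M = 4          # disk pages allocated per bucket

buckets = {b: [[] for _ in range(M)] for b in range(K)}   # bucket -> pages

def insert(tuple_, a_value):
    """Store the tuple in a page of its bucket (a real system would read
    the last non-full page, append the tuple, and write the page back)."""
    pages = buckets[hash(a_value) % K]
    pages[-1].append(tuple_)          # page capacity ignored in this sketch

def search_on_a(a_value):
    """Equality search on the hashed attribute: only one bucket is scanned
    (on average about M/2 of its pages)."""
    pages = buckets[hash(a_value) % K]
    return [t for page in pages for t in page if t["a"] == a_value]

insert({"a": 42, "name": "x"}, 42)
print(search_on_a(42))
```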


Hashing - collisions

• If a bucket has too many tuples, then the allocated M pages may not be sufficient.

• Allocate an additional overflow area.

• If the overflow area is large, the benefit of the hash is lost.

Extensible hashing

• The address space of the hash (K) can be adjusted to the number of tuples in the relation.

• Use a hash function h.

• But use only the first z bits of the hashed value to address the tuples.

• If a bucket overflows, split the hash directory and use z+1 bits to address.

Extensible hashing

• Using a single bit (z = 1) to address tuples: directory entries 0 and 1 point to Page 0 and Page 1.

• A newly inserted tuple causes Page 0 to overflow.

Extensible hashing

• Double the directory: distribute the contents of the overflowing Page 0 between the buckets addressed 00 and 10.

Extensible hashing

• Double the directory by making a copy of it. (Figure: directory entries 00, 01, 10, 11 pointing to Pages 0-3.)

Extensible hashing

• Update the directory link for the new page. (Figure: directory entries 00, 01, 10, 11 pointing to Pages 0-3.)

Extensible hashing

• How do we know which pages can be split without splitting the directory? Each page is tagged with the number of bits actually used to address it (2, 1 and 2 in the figure); a page whose tag is smaller than the directory's bit count can be split without doubling the directory.

Linear hashing

• The addressing is the same, but we allow overflow pages.

• We decide to split based on a global rule:

• if the ratio of tuples to pages exceeds a threshold k%, split.

• Split one bucket at a time.

• Example: start with two buckets, 0 and 1.

Linear hashing

• Starting with buckets 0 and 1, a new insertion triggers the decision to split.

• The next bucket in sequence, bucket 0, is split: its contents are distributed between buckets 00 and 10.

• Bucket 1 is not split yet; it still contains all entries whose addresses end in 01 or 11.

Linear hashing

• The bucket split is the next one in sequence:

• it may not be the one that has overflow pages

• eventually all buckets will be split

A sketch of the addressing rule follows.
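A sketch of the usual linear-hashing address computation (not from the slides), assuming a split pointer `next_to_split` and level `z`: buckets below the pointer have already been split and are addressed with z+1 bits, the rest still use z bits.

```python
def bucket_for(hash_value, z, next_to_split):
    """Linear hashing address computation.

    Buckets are split in sequence; buckets whose z-bit address is below
    next_to_split have already been split and are addressed with z+1 bits.
    """
    b = hash_value % (2 ** z)
    if b < next_to_split:
        b = hash_value % (2 ** (z + 1))
    return b

# z = 1, bucket 0 already split into 00 and 10 (0 and 2), bucket 1 not yet.
for h in (0, 1, 2, 3, 4, 5):
    print(h, "->", bucket_for(h, z=1, next_to_split=1))
# 0 -> 0, 1 -> 1, 2 -> 2, 3 -> 1, 4 -> 0, 5 -> 1
```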

