
Secondary Storage and Indexing

CSCI 4380 Database Systems


Disk access

• Databases are generally large data stores, much larger than available memory.

• Data is stored on disk and brought into memory on demand.

• A disk page or block is the smallest unit of access for reads and writes.

• A disk page is typically 1 KB - 8 KB.

Disk organization

• A disk contains

• multiple platters (usually 2 surfaces per platter)

• read/write heads that usually allow us to read/write from all surfaces simultaneously

Disk organization

• A disk surface contains

• multiple concentric tracks

• The same track on different surfaces can be read by different heads at the same time; this unit is called a cylinder.

Disk organization

• A track is broken down into sectors; sectors are separated from each other by blank gaps.

• A sector is the smallest unit of operation (read/write) possible on a disk.

• A disk block is usually composed of a number of consecutive sectors (determined by the operating system).

• Data are read/written in units of a disk block/page.

• A disk block is the same size as a memory block or page.

Reading a disk page

• Reading a page from disk requires the disk to be spinning.

• The disk arm has to be moved to the correct track of the disk -> seek operation.

• The disk head must wait until the right location on the track is found -> rotational latency.

• Then, the disk page can be read from disk and copied to memory -> transfer time.

Reading a disk page

• The cost of reading a disk page:

• seek time + rotational latency + transfer time

• Multiple pages on the same track/cylinder can be read with a single seek and rotational latency. Reading M pages on the same track/cylinder costs:

• seek time + rotational latency + transfer time * (fraction of the track circumference to be scanned)

A high-end disk example

• Consider a disk with 2^4 = 16 surfaces, 2^16 tracks per surface (approx. 65K), 2^8 = 256 sectors per track and 2^12 bytes per sector.

• Each track has 2^12 * 2^8 = 2^20 bytes (1 MB)

• Each surface has 2^20 * 2^16 = 2^36 bytes

• The disk has 2^4 * 2^36 = 2^40 bytes = 1 TB

Reading a page

• Typical times:

• 7200 rpm means one rotation takes 8.33 ms (on average, half a rotation is needed before the correct location is found: 4.17 ms)

• seek time between 0 and 17.38 ms (on average, 1/3 of the disk surface is traversed: 6.46 ms)

• transfer time for one sector: 8.33/256 = 0.03 ms

Reading a page

• Reading a page of 8K (2 sectors):

• 1 seek + 1 rotational latency + 2 sector transfer times

• 6.46 + 4.17 + 0.03 * 2 = 10.69 ms

• Reading 100 consecutive pages (200 sectors) on the same track:

• 6.46 + 4.17 + 0.03 * 200 = 16.63 ms

• The lesson: put blocks that are accessed together on the same track/cylinder as much as possible (see the cost-model sketch below).
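A minimal sketch of this cost model (not from the slides); the timing constants are the example values above and are parameters of the illustration, not properties of any real disk.

```python
# Rough cost model for reading pages from disk (values from the example above).
AVG_SEEK_MS = 6.46         # average seek time
AVG_ROT_LATENCY_MS = 4.17  # half a rotation at 7200 rpm
SECTOR_XFER_MS = 0.03      # 8.33 ms per rotation / 256 sectors per track
SECTORS_PER_PAGE = 2       # an 8K page made of two 4K sectors

def read_cost_ms(num_pages: int, consecutive: bool) -> float:
    """Estimated time to read num_pages pages.

    Consecutive pages on the same track pay the seek and rotational
    latency only once; random pages pay them for every page.
    """
    transfer = num_pages * SECTORS_PER_PAGE * SECTOR_XFER_MS
    if consecutive:
        return AVG_SEEK_MS + AVG_ROT_LATENCY_MS + transfer
    return num_pages * (AVG_SEEK_MS + AVG_ROT_LATENCY_MS) + transfer

print(read_cost_ms(1, consecutive=True))     # ~10.69 ms for a single page
print(read_cost_ms(100, consecutive=True))   # one seek/latency for 100 pages
print(read_cost_ms(100, consecutive=False))  # 100 random reads are far slower
```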


Disk scheduling

• The disk controller can order the requests to minimize seeks.

• When the controller is moving from low tracks to high tracks, serve the next request in the direction of the movement and queue the rest.

• This method is called the elevator algorithm; a small sketch follows.
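A small sketch of the elevator (SCAN) idea described above: serve pending requests in the current direction of head movement, then sweep back. The track numbers and request queue are made up for illustration.

```python
def elevator_schedule(requests, head, moving_up=True):
    """Order track requests elevator-style: serve everything in the current
    direction of head movement first, then sweep back the other way."""
    above = sorted(t for t in requests if t >= head)
    below = sorted((t for t in requests if t < head), reverse=True)
    return above + below if moving_up else below + above

# Head at track 50, moving toward higher tracks.
print(elevator_schedule([12, 95, 53, 7, 60, 48], head=50))
# -> [53, 60, 95, 48, 12, 7]
```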


Checksums

• For each sector, store a number of error-checking bits called checksums.

• The checksum bit is 1 if the number of 1's in the given sector is odd, and 0 if the number of 1's is even.

• When reading a sector, check that the checksum is correct.

• This catches all 1-bit errors.

• For errors of more than 1 bit, the checksum catches them only about 50% of the time.

• For better error detection, use multiple bits (e.g. 8 bits, where bit i stores the parity of the ith bit of each byte); see the sketch below.
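A sketch of the parity scheme above (not from the slides): a single parity bit over the whole sector, and the 8-bit variant where bit i holds the parity of bit i of every byte.

```python
def parity_bit(sector: bytes) -> int:
    """1 if the total number of 1-bits in the sector is odd, else 0."""
    ones = sum(bin(b).count("1") for b in sector)
    return ones % 2

def parity_byte(sector: bytes) -> int:
    """8-bit checksum: bit i is the parity of the i-th bit of each byte."""
    check = 0
    for b in sector:
        check ^= b          # XOR accumulates per-bit parity
    return check

data = bytes([0b10110010, 0b00001111])
assert parity_bit(data) == 0          # 8 one-bits in total -> even -> 0
print(parity_bit(data), bin(parity_byte(data)))
```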


Stable storage

• When we are writing a sector, if the write fails, we lose the data on that sector.

• Use two sectors, XL and XR, for each logical sector X.

• First write XL and check the checksum. If XL is written correctly, then write XR.

• If XL is written incorrectly, the old version of X is still stored in XR.

• If XR is written incorrectly, the new version of X is stored in XL. A sketch of this write protocol follows.
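A sketch of the stable-storage write order described above. The helpers write_sector and checksum_ok are hypothetical stand-ins; a real implementation would operate below the file system.

```python
def stable_write(new_data, write_sector, checksum_ok):
    """Write new_data using two physical copies, XL and XR.

    write_sector(which, data) and checksum_ok(which) are assumed helpers.
    At every point in time at least one of XL/XR holds a good copy:
    XR keeps the old value until XL is known to be written correctly.
    """
    write_sector("XL", new_data)
    if not checksum_ok("XL"):
        # XL is bad, but XR still holds the old version of X.
        raise IOError("write to XL failed; old value preserved in XR")
    write_sector("XR", new_data)
    if not checksum_ok("XR"):
        # XR is bad, but the new version is already safe in XL.
        raise IOError("write to XR failed; new value preserved in XL")
```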


Multiple disks

• RAID (redundant array of inexpensive disks) is a family of methods for improving access time and reducing the possibility of data loss by using multiple disks.

RAID-0

• RAID-0, striping

• Distribute the data across multiple disks.

• Example with 4 disks (see the mapping sketch below):

• Disk 1 has pages 1, 5, 9

• Disk 2 has pages 2, 6, 10

• Disk 3 has pages 3, 7, 11

• Disk 4 has pages 4, 8, 12
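A sketch of the striping rule in this example (not from the slides): with 4 disks, page p lands on disk ((p - 1) mod 4) + 1.

```python
NUM_DISKS = 4

def disk_for_page(page_no: int) -> int:
    """RAID-0 striping: pages are dealt out to disks round-robin.
    Pages are numbered from 1, disks from 1, as in the example."""
    return (page_no - 1) % NUM_DISKS + 1

assert [disk_for_page(p) for p in (1, 5, 9)] == [1, 1, 1]
assert [disk_for_page(p) for p in (2, 6, 10)] == [2, 2, 2]
assert disk_for_page(12) == 4
```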


RAID-0

• RAID-0, striping

• Reads are faster (read from all disks simultaneously)

• Writes are the same

• No redundancy in case a disk fails


RAID-1

• RAID-1, mirroring

• Mirror each disk onto another disk.

• Reads are twice as fast: read from whichever disk is available.

• Writes are slower, since each write requires writing to two disks.

• If one of the disks fails, the other one contains all the data (no data loss).

RAID-4

• One disk contains the parity of the remaining disks.

• Block i on the parity disk contains the parity of the ith block of all the remaining disks.

• Reads are unchanged.

• Writes are slower: each write requires a write to the parity disk as well.

• If a disk fails, the lost data can be reconstructed from the remaining disks (see the XOR sketch below).
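A sketch of block-level parity (not from the slides): the parity block is the bitwise XOR of the corresponding blocks on the data disks, and a lost block can be rebuilt by XOR-ing the surviving blocks with the parity.

```python
def xor_blocks(blocks):
    """Bitwise XOR of equally sized blocks (bytes objects)."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

data_disks = [b"\x0f\xf0", b"\xaa\x55", b"\x12\x34"]
parity = xor_blocks(data_disks)           # stored on the parity disk

# Disk 2 fails: rebuild its block from the other data disks plus parity.
rebuilt = xor_blocks([data_disks[0], data_disks[2], parity])
assert rebuilt == data_disks[1]
```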


RAID-5

• Similar to RAID-4, but the parity blocks are distributed over all the disks.

• Example: given 5 disks (4 regular and 1 parity):

• Use disk 1 for the parity of block 1

• Use disk 2 for the parity of block 2

• etc.

• Reads are the same.

• Writes are faster than RAID-4, as the parity disk is no longer a bottleneck.

Tuple organization

• A disk page typically stores multiple tuples. Many different organizations exist.

• The number of tuples that can fit in a page is determined by the number and types of attributes the relation has.

(Figure: slotted page layout - header info, a row directory with slots 1, 2, ..., N, free space, and data rows stored from the end of the page as Row N, Row N-1, ..., Row 1.)

Tuple addressing

• Tuples have a physical address which contains the relevant subset of:

• Host name / Disk number / Surface No / Track No / Sector No

• Physical addresses tend to be long.

• Tuples are also given a logical address in the relation.

• A map table stored on disk contains the mapping from logical addresses to physical addresses.

Tuple addressing

• When tuples are brought from disk to memory, their current address becomes a memory address.

• Pointer swizzling is the act of changing the physical address to the memory address in the map table for pages that are in memory.

Indexing

• An index is a lookup structure built on a search key.

• The search key can consist of multiple attributes.

• The index contains pointers to tuples (logical addresses).

• The index itself is also packed into pages and stored on disk.

Dense vs. sparse

• The index is called dense if it contains an entry for each tuple in the relation.

• An index is called sparse if it does not contain an entry for each tuple.

• A sparse index is possible if the addressed relation is sorted with respect to the index key.


Dense Index Example

• Index entries: (1, t1), (2, t3), (4, t5), (5, t6), (8, t7), (9, t10), (10, t12).

• Each index entry points to the corresponding tuple in the indexed relation; there is one entry per tuple, and the relation itself need not be sorted on the key (in the figure the tuples appear in the order t1, t7, t12, t5, t6, t3, t10).

Sparse Index Example

• Index entries: (1, t1) and (8, t7).

• The indexed relation stores tuples t1, t3, t5, t6, t7, t10, t12 sorted on the key.

• Entry (1, t1) covers all key values between 1 and 5; entry (8, t7) covers all key values greater than 5. A lookup sketch follows.
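A sketch of a sparse-index lookup (not from the slides), assuming the relation is sorted on the key and the index holds one (first key on page, page) entry per data page. bisect finds the last index entry whose key is <= the search key; that page is then scanned.

```python
import bisect

# One (first key on page, page id) entry per data page; relation sorted on key.
sparse_index = [(1, "page-t1"), (8, "page-t7")]
index_keys = [k for k, _ in sparse_index]

def lookup(search_key):
    """Return the page that may contain search_key, or None if the key is
    smaller than every key in the index."""
    pos = bisect.bisect_right(index_keys, search_key) - 1
    return sparse_index[pos][1] if pos >= 0 else None

print(lookup(5))   # page-t1: entry (1, t1) covers keys 1..5
print(lookup(9))   # page-t7: entry (8, t7) covers keys > 5
```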


Index types

• An index can be

• primary, i.e. determines where the tuples are stored

• secondary, i.e. points to the tuples

• There can be many secondary indices.

• An index can be multi-level, i.e. a tree index, where each level is an index on the level below.


B-trees

• B-trees (called B+ trees in some books) are constructed on a list of attributes (also called the index key).

• Each node of a B-tree is mapped to a disk page.

• Leaf nodes:

• A leaf node can contain at most n tuples (key values and pointers) plus 1 additional pointer to the sibling node.

• A leaf node must contain at least floor((n+1)/2) tuples (plus the additional pointer to the next sibling node).

B-trees

• Internal nodes:

• An internal node can contain at most n + 1 pointers and n key values.

• An internal node must contain at least floor((n+1)/2) pointers (and one fewer key value), except the root, which may contain as few as a single key value and 2 pointers.

B-tree example

• Suppose n = 3:

• Each leaf node will have at least 2 and at most 3 tuples.

• Each internal node will point to at least 2 and at most 4 nodes below (and hence will have between 1 and 3 key values).

• Suppose n = 99:

• Each leaf node will have at least 50 and at most 99 tuples.

• Each internal node will point to at least 50 and at most 100 nodes below (and hence will have between 49 and 99 key values).

• The root can have as few as 2 pointers and 1 key value.

Sibling nodes

• Each leaf node points to the next node at the leaf level, called its sibling node.

B-trees

• Leaf nodes contain pairs of

• key values

• pointers to the tuples

• If the B-tree is a secondary index, then there is an entry in the leaf level for each tuple in the relation.

• The leaf nodes also contain a pointer to the next (sibling) leaf node.

B-trees

• Internal nodes contain n key values and n+1 pointers.

• The pointers point to the nodes at the level below.

• Example: an internal node with key values 10, 25, 32 has four pointers: the first leads to values < 10, the second to values >= 10 and < 25, the third to values >= 25 and < 32, and the last to values >= 32.

Example B-tree

Assume at most 4 key values per node.

• Root: [53]; internal nodes: [11 30] and [66 78].

• Leaves: [2 7], [11 15 22], [30 41], [53 54 63], [66 69 71 76], [78 84 93].

• The leaf entries hold pointers to the tuples.

B-trees with duplicate values

• If the B-tree is built on a key that may contain duplicates, build the index in an identical way, except:

• The key stored in the non-leaf node for a leaf is the first key value in that leaf that does not also appear in the previous sibling leaf.

• If there is no such key, then a null value is stored at this position.

Example B-tree with duplicates

Assume at most 4 key values per node.

• Leaf level (in key order): 2 7 | 11 15 15 15 | 18 18 22 41 | 41 41 41 41 | 55 63.

• The parent's key values are 11, 18, - (null, since every key in that leaf repeats from the previous sibling), and 55.

B-tree equality search

• Given select * from R where A = x and an index on R.A (assume no duplicate values for R.A):

• While not at the leaf level:

• Starting from the root, find the address of the node below that may contain this value (the pointer to the left of the first key value that is greater than x, or the last pointer if no such key value exists)

• Read that node from disk

• If the leaf level contains an entry with the searched value, read the matching tuple from disk and return. A sketch of this descent follows.
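A sketch of the descent just described (not from the slides), using a toy in-memory node format: internal nodes hold sorted keys and child pointers, leaves hold (key, tuple address) pairs. Reading a node "from disk" is just following a Python reference here.

```python
class Node:
    def __init__(self, keys, children=None, entries=None):
        self.keys = keys            # sorted key values
        self.children = children    # child nodes (internal nodes only)
        self.entries = entries      # {key: tuple_address} (leaf nodes only)

    @property
    def is_leaf(self):
        return self.children is None

def btree_search(root, x):
    """Follow the pointer to the left of the first key greater than x
    (or the last pointer) until a leaf is reached."""
    node = root
    while not node.is_leaf:
        i = 0
        while i < len(node.keys) and node.keys[i] <= x:
            i += 1
        node = node.children[i]
    return node.entries.get(x)      # tuple address, or None if absent

leaf1 = Node([2, 7], entries={2: "t2", 7: "t7"})
leaf2 = Node([11, 15], entries={11: "t11", 15: "t15"})
root = Node([11], children=[leaf1, leaf2])
print(btree_search(root, 7), btree_search(root, 11), btree_search(root, 9))
```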


B-tree equality search

• Given select * from R where A = x and an index on R.A (assume R.A may contain duplicate values):

• While not at the leaf level:

• Starting from the root, find the address of the node below that may contain this value (the pointer to the left of the first key value that is greater than x, or the last pointer if no such key value exists)

• Read that node from disk

• If the leaf level contains entries with the searched value, follow sibling pointers until a value different from x is found; read the matching tuples from disk and return.

B-tree range search

• Given select * from R where A < y and A > x and an index on R.A:

• Using the same algorithm as before, find the first leaf node containing a value > x

• Traverse the sibling pointers from left to right until all entries in the range have been read

• Read all the matching tuples from disk (a sketch of the sibling walk follows)
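A sketch of the range scan at the leaf level (not from the slides): given the first leaf that can contain a key > x, follow sibling pointers rightward until a key >= y is seen.

```python
class Leaf:
    def __init__(self, entries, sibling=None):
        self.entries = entries      # sorted list of (key, tuple_address)
        self.sibling = sibling      # next leaf to the right, or None

def range_scan(first_leaf, x, y):
    """Collect tuple addresses with x < key < y, starting from the first
    leaf that can contain a key > x and following sibling pointers."""
    results, leaf = [], first_leaf
    while leaf is not None:
        for key, addr in leaf.entries:
            if key >= y:
                return results      # past the range: stop
            if key > x:
                results.append(addr)
        leaf = leaf.sibling
    return results

l3 = Leaf([(30, "t30"), (41, "t41")])
l2 = Leaf([(11, "t11"), (15, "t15"), (22, "t22")], sibling=l3)
l1 = Leaf([(2, "t2"), (7, "t7")], sibling=l2)
print(range_scan(l1, x=7, y=41))    # ['t11', 't15', 't22', 't30']
```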


Index only search

• Given select A from R where A < 120 and A > 10 and an index on R.A:

• Scan the index for matching tuples as before and return the found A values (no need to read the tuples from disk)


Index partial match

• Given an index on R.A, R.B (the index is sorted on A first and then on B):

• select * from R where A > 10 and A < 100 and B = 2

• Scan the index for the range A > 10 and A < 100; for each matching entry check the B value, then read the matched tuples from disk.

• select * from R where B > 10 and B < 100

• Scan the leaf level of the index completely to find the matching B values, then read the matched tuples from disk.

Insertion

1. Given a new entry A to be inserted:

1.1. Search the tree for the new entry.

1.2. If the leaf node X has space for the new entry, insert it.

1.3. Otherwise:

1.3.1. Create a new leaf node Y and distribute the entries of X plus the entry A between X and the new node.

1.3.2. Create a new entry B with the address of Y and the lowest key in Y.

1.3.3. Insert B into the parent of X recursively (go to step 1.2).

A leaf-level sketch of steps 1.2-1.3 follows.
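A sketch of the leaf-level case (not from the slides), using the toy (key, tuple address) leaf format from the earlier sketches: if the leaf overflows, its entries are split between the old and a new leaf, and the pair (lowest key of the new leaf, pointer to the new leaf) is what would be inserted into the parent.

```python
MAX_ENTRIES = 4   # at most 4 (key, address) pairs per leaf, as in the examples

def leaf_insert(entries, key, addr):
    """Insert into a sorted leaf. Returns (left_entries, None) if the leaf
    had room, or (left_entries, (sep_key, right_entries)) after a split;
    the separator (sep_key, pointer to the new leaf) goes to the parent."""
    entries = sorted(entries + [(key, addr)])
    if len(entries) <= MAX_ENTRIES:
        return entries, None
    mid = (len(entries) + 1) // 2           # keep the larger half on the left
    left, right = entries[:mid], entries[mid:]
    return left, (right[0][0], right)

left, split = leaf_insert([(53, "t53"), (54, "t54"), (57, "t57"), (63, "t63")],
                          65, "t65")
print(left)    # [(53, 't53'), (54, 't54'), (57, 't57')]
print(split)   # (63, [(63, 't63'), (65, 't65')])  -- matches the 'insert 65' example
```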


Insert Example

Insert a record with key 57 (at most 4 key values per node).

• Root: [53]; internal nodes: [11 30] and [66 78].

• Leaves: [2 7], [11 15 22], [30 41], [53 54 63], [66 69 71 76], [78 84 93].

Insert Example

After inserting 57, the leaf [53 54 63] becomes [53 54 57 63]; the rest of the tree is unchanged.

We are done! No rebalancing necessary.

Another Insert Example

Insert 65 into the resulting tree (leaves [2 7], [11 15 22], [30 41], [53 54 57 63], [66 69 71 76], [78 84 93]).

Another Insert Example

The overflowing node is split: leaf [53 54 57 63] cannot also hold 65, so it splits into [53 54 57] and [63 65], and the parent internal node becomes [63 66 78].

• Root: [53]; internal nodes: [11 30] and [63 66 78].

• Leaves: [2 7], [11 15 22], [30 41], [53 54 57], [63 65], [66 69 71 76], [78 84 93].

Another Insert Example

Insert 70 and 94; one more node split. Inserting 70 splits leaf [66 69 71 76] into [66 69 70] and [71 76]; 94 fits into the last leaf.

• Root: [53]; internal nodes: [11 30] and [63 66 71 76].

• Leaves: [2 7], [11 15 22], [30 41], [53 54 57], [63 65], [66 69 70], [71 76], [78 84 93 94].

Another Insert Example

Finally, insert 90 (which will cause the parent to split): 90 belongs in leaf [78 84 93 94], which is already full.

Another Insert Example

The full leaf splits into [78 84 90] and [93 94], which overflows its parent; the internal node splits into [63 66] and [78 93], and key 71 moves up into the root.

• Root: [53 71]; internal nodes: [11 30], [63 66] and [78 93].

• Leaves: [2 7], [11 15 22], [30 41], [53 54 57], [63 65], [66 69 70], [71 76], [78 84 90], [93 94].

Deletion

• Suppose we would like to delete entry A:

• Locate the leaf node X containing entry A and delete A.

• If X still has n/2 or more pointers, adjust the parent entry pointing to this node if necessary, recursively (needed if we deleted the smallest entry in the node).

Deletion

• Otherwise, the node has too few pointers:

• If a sibling node with the same parent has more than n/2 pointers, then redistribute entries with the sibling and adjust the parent keys.

• Else:

• delete A,

• move all the remaining entries of X into a sibling B,

• adjust the parent entry for B,

• delete the parent's entry Y for X recursively (go to the first step of this algorithm).

A simplified leaf-level sketch follows.
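A simplified leaf-level sketch of the borrow-or-merge decision (not from the slides); it only looks at one (right) sibling, while the slides' examples may borrow from either neighbor. Parent-key adjustment is left as a returned action label.

```python
MIN_ENTRIES = 2   # floor((n+1)/2) with n = 3, as in the small examples

def leaf_delete(entries, sibling_entries, key):
    """Delete key from a leaf; if the leaf gets too small, either borrow
    an entry from the sibling or merge with it.
    Returns (new_entries, new_sibling_entries, action)."""
    entries = [e for e in entries if e[0] != key]
    if len(entries) >= MIN_ENTRIES:
        return entries, sibling_entries, "ok"
    if len(sibling_entries) > MIN_ENTRIES:
        # Borrow the sibling's smallest entry; the parent key must be adjusted.
        borrowed, *rest = sibling_entries
        return entries + [borrowed], rest, "borrowed"
    # Merge: the sibling absorbs the remaining entries; the parent loses a pointer.
    return [], sorted(entries + sibling_entries), "merged"

print(leaf_delete([(22, "t22"), (30, "t30")], [(53, "t53"), (54, "t54")], 30))
# -> ([], [(22, 't22'), (53, 't53'), (54, 't54')], 'merged')
```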


Deletion Example

Delete key 30.

• Root: [53]; internal nodes: [11 22] and [78].

• Leaves: [2 7], [11 15 17], [22 30], [53 54], [78 84 93].

Deletion Example

Delete key 30: the leaf [22 30] would be left with a single entry, so borrow from a neighbor. Redistribute between the second and third leaf nodes and adjust the internal node.

• Root: [53]; internal nodes: [11 17] and [78].

• Leaves: [2 7], [11 15], [17 22], [53 54], [78 84 93].

Another Deletion Example

Delete key 7: the leaf [2 7] would be left with a single entry and cannot borrow from its neighbor, so it must merge with the neighbor.

• Root: [53]; internal nodes: [11 17] and [78].

• Leaves: [2 7], [11 15], [17 22], [53 54], [78 84 93].

Another Deletion Example

After merging, delete the corresponding pointer in the parent.

• Root: [53]; internal nodes: [17] and [78].

• Leaves: [2 11 15], [17 22], [53 54], [78 84 93].

Another Deletion Example

Delete 53; the leaf must merge with a sibling.

• Root: [53]; internal nodes: [17] and [78].

• Leaves: [2 11 15], [17 22], [53 54], [78 84].

Another Deletion Example

After merging the leaves, the internal node [78] is left with a single pointer: it is too empty and cannot borrow from its sibling, so it must merge with its sibling.

• Root: [53]; internal nodes: [17] and the underfull [78].

• Leaves: [2 11 15], [17 22], [54 78 84].

Another Deletion Example

The two internal nodes merge into [17 54], leaving the root with a single pointer.

• Root: [53]; internal node: [17 54].

• Leaves: [2 11 15], [17 22], [54 78 84].

Another Deletion Example

The root with a single pointer is removed and [17 54] becomes the new root.

• Root: [17 54]; leaves: [2 11 15], [17 22], [54 78 84].

The final tree.

A B-Tree Example

Given:

• a disk page has a capacity of 4K bytes

• each tuple address takes 6 bytes and each key value takes 2 bytes

• each node is 70% full

• we need to store 1 million tuples

A B+-Tree Example

Leaf node capacity:

• each (key value, tuple address) pair takes 8 bytes

• disk page capacity is 4K, so (4*1024)/8 = 512 (key value, rowid) pairs per leaf page

• in reality there are extra headers and pointers, which we ignore here

• Hence, the maximum number of pointers for the tree is about 256 (and 255 key values)

Example Continued

• If all pages are 70% full, each page has about 512 * 0.7 = 359 pointers.

• To store 1 million tuples requires:

• 1,000,000 / 359 = 2786 pages at the leaf level

• 2786 / 359 = 8 pages at the next level up

• 1 root page pointing to those 8 pages

• Hence, we have a B-tree with 3 levels (see the sketch below).
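The same arithmetic as above, as a small sketch; the page size, entry sizes and 70% fill factor are the example's assumptions, and the rounding differs slightly from the slide's 359.

```python
import math

PAGE_BYTES = 4 * 1024
ENTRY_BYTES = 6 + 2          # tuple address + key value
FILL = 0.7                   # pages are about 70% full
N_TUPLES = 1_000_000

entries_per_page = int(PAGE_BYTES / ENTRY_BYTES * FILL)   # about 358

# Count levels: each level up needs one entry per page of the level below.
level_pages, levels = N_TUPLES, 0
while level_pages > 1:
    level_pages = math.ceil(level_pages / entries_per_page)
    levels += 1

print(entries_per_page, levels)   # roughly 358 entries per page, 3 levels
```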


Hashing

• Given a hash table of K buckets:

• Allocate a number of disk blocks M to each bucket.

• For each tuple t, apply the hash function. Suppose we hash on attribute A: if h(t.A) = x, then store t in the blocks allocated for bucket x.

• Search on attribute A (select * from r where r.A = c):

• Cost: M/2 (on average, half the pages of that bucket are searched).

Hashing

• Search on another attribute:

• Cost: N (every page of the relation must be scanned).

• Insertion cost: 1 read and 1 write (find the last page in the appropriate bucket and store the tuple).

• Deletion/Update cost: M/2 (search cost) + 1 to write the update.

A sketch of the bucket addressing follows.
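A sketch of static hashing with K buckets of M pages each (not from the slides): hash the attribute value, take it modulo K, and store/search within that bucket's pages. Page capacity and the disk I/O itself are ignored; the hash function here is just Python's built-in hash.

```python
K = 8          # number of buckets
M = 4          # disk pages allocated per bucket

buckets = {b: [[] for _ in range(M)] for b in range(K)}   # bucket -> pages

def insert(tuple_, a_value):
    """Store the tuple in a page of its bucket (a real system would read
    the last non-full page, append the tuple, and write the page back)."""
    pages = buckets[hash(a_value) % K]
    pages[-1].append(tuple_)          # page capacity ignored in this sketch

def search_on_a(a_value):
    """Equality search on the hashed attribute: only one bucket is scanned
    (on average about M/2 of its pages)."""
    pages = buckets[hash(a_value) % K]
    return [t for page in pages for t in page if t["a"] == a_value]

insert({"a": 42, "name": "x"}, 42)
print(search_on_a(42))
```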


Hashing - collisions

• If a bucket has too many tuples, then the allocated M pages may not be sufficient.

• Allocate an additional overflow area.

• If the overflow area is large, the benefit of the hash is lost.

Extensible hashing

• The address space of the hash (K) can be adjusted to the number of tuples in the relation.

• Use a hash function h.

• But use only the first z bits of the hashed value to address the tuples.

• If a bucket overflows, split the hash directory and use z+1 bits to address.

Extensible hashing

• Using a single bit (z = 1) to address tuples: directory entries 0 and 1 point to Page 0 and Page 1.

• A newly inserted tuple causes Page 0 to overflow.

Extensible hashing

• Double the directory: distribute the contents of the overflowing Page 0 between the buckets addressed 00 and 10.

Extensible hashing

• Double the directory by making a copy of it. (Figure: directory entries 00, 01, 10, 11 pointing to Pages 0-3.)

Extensible hashing

• Update the directory link for the new page. (Figure: directory entries 00, 01, 10, 11 pointing to Pages 0-3.)

Extensible hashing

• How do we know which pages can be split without splitting the directory? Each page is tagged with the number of bits actually used to address it (2, 1 and 2 in the figure); a page whose tag is smaller than the directory's bit count can be split without doubling the directory.

Linear hashing

• The addressing is the same, but we allow overflow pages.

• We decide to split based on a global rule:

• if the ratio of tuples to pages exceeds a threshold k%, split.

• Split one bucket at a time.

• Example: start with two buckets, 0 and 1.

Linear hashing

• Starting with buckets 0 and 1, a new insertion triggers the decision to split.

• The next bucket in sequence, bucket 0, is split: its contents are distributed between buckets 00 and 10.

• Bucket 1 is not split yet; it still contains all entries whose addresses end in 01 or 11.

Linear hashing

• The bucket split is the next one in sequence:

• it may not be the one that has overflow pages

• eventually all buckets will be split

A sketch of the addressing rule follows.
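A sketch of the usual linear-hashing address computation (not from the slides), assuming a split pointer `next_to_split` and level `z`: buckets below the pointer have already been split and are addressed with z+1 bits, the rest still use z bits.

```python
def bucket_for(hash_value, z, next_to_split):
    """Linear hashing address computation.

    Buckets are split in sequence; buckets whose z-bit address is below
    next_to_split have already been split and are addressed with z+1 bits.
    """
    b = hash_value % (2 ** z)
    if b < next_to_split:
        b = hash_value % (2 ** (z + 1))
    return b

# z = 1, bucket 0 already split into 00 and 10 (0 and 2), bucket 1 not yet.
for h in (0, 1, 2, 3, 4, 5):
    print(h, "->", bucket_for(h, z=1, next_to_split=1))
# 0 -> 0, 1 -> 1, 2 -> 2, 3 -> 1, 4 -> 0, 5 -> 1
```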

