homework #3 due thursday, april 17 problems: –chapter 11: 11.6, 11.10 –chapter 12: 12.1, 12.2,...

Homework #3

• Due Thursday, April 17

• Problems:

– Chapter 11: 11.6, 11.10

– Chapter 12: 12.1, 12.2, 12.3, 12.4, 12.5, 12.7

Quick Review of material covered Apr 3

• Indexing methods are used to speed up access to desired data

• Definitions: search key, ordered indices, hash indices

• Ordered Indices

– An ordered index stores the values of the search keys in sorted order

– primary index: the search key also determines the sort order of the original file. Also called a clustering index

– secondary indices: search key specifies an order different from the sequential order of the file.

– an index-sequential file is an ordered sequential file with a primary index.

– dense and sparse Indices

– Multi-level Index

– Issues connected with index update operations

B+- Tree Index Files

• Main disadvantage of ISAM files is that performance degrades as the file grows, creating many overflow blocks and the need for periodic reorganization of the entire file

• B+- trees are an alternative to indexed-sequential files

– used for both primary and secondary indexing

– B+- trees are a multi-level index

• B+- tree index files automatically reorganize themselves with small local changes on insertion and deletion.

– No reorganization of the entire file is required to maintain performance

– disadvantages: extra insertion, deletion, and space overhead

– advantages outweigh disadvantages. B+-trees are used extensively

B+- Tree Index Files (2)

Definition: A B+-tree of order n has:

• All leaves at the same level

• balanced tree (the “B” in the name is commonly read as “balanced”)

• logarithmic performance

• root has between 1 and n-1 keys

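• all other nodes are at least half full (>= 50% space utilization): non-leaf nodes have between ⌈n/2⌉ and n pointers (so at most n-1 keys), and leaf nodes have between ⌈(n-1)/2⌉ and n-1 keys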

• we construct the tree with order n such that one node corresponds to one disk block I/O (in other words, each disk page read brings up one full tree node).

B+- Tree Index Files (3)

A B+-tree is a rooted tree satisfying the following properties:

• All paths from the root to a leaf are the same length

• Search for an index value takes time according to the height of the tree (whether successful or unsuccessful)

B+- Tree Node Structure

• The B+-tree is constructed so that each node (when full) fits on a single disk page

– parameters:

B: size of a block in bytes (e.g., 4096)

K: size of the key in bytes (e.g., 8)

P: size of a pointer in bytes (e.g., 4)

– an internal node holds n-1 keys and n pointers, so n must satisfy:

(n-1)*K + n*P <= B

n <= (B+K)/(K+P)

– with the example values above, this becomes

n <= (4096+8)/(8+4) = 4104/12

n <= 342
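The capacity calculation above can be checked with a few lines of Python. This is only a sketch: the function name `max_fanout` and its parameter names are made up for illustration, and it assumes a node holds nothing besides n-1 keys and n pointers (no header bytes).

```python
def max_fanout(block_size: int, key_size: int, pointer_size: int) -> int:
    """Largest n such that (n-1)*key_size + n*pointer_size <= block_size."""
    # Rearranging the inequality gives n <= (B + K) / (K + P); n must be an
    # integer, so take the floor with integer division.
    return (block_size + key_size) // (key_size + pointer_size)


n = max_fanout(4096, 8, 4)
print(n)  # 342

# n fits in one block, and n + 1 would not.
assert (n - 1) * 8 + n * 4 <= 4096
assert n * 8 + (n + 1) * 4 > 4096
```

With B = 4096, K = 8 and P = 4, a node of order 342 uses exactly 341*8 + 342*4 = 4096 bytes.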

B+- Tree Node Structure (2)

• Typical B+-tree Node

Ki are the search-key values

Pi are the pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes)

• the search keys in a node are ordered:

K1 < K2 < K3 < … < Kn-1

Non-Leaf Nodes in B+-Trees

• Non-leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with n pointers:

– all the search keys in the subtree to which P1 points are less than K1

– for 2 <= i <= n-1, all the search keys in the subtree to which Pi points have values greater than or equal to Ki-1 and less than Ki

– all the search keys in the subtree to which Pn points are greater than or equal to Kn-1

Leaf Nodes in B+-Trees

• As mentioned last class, primary indices may be sparse indices. So B+-trees constructed as a primary index (that is, where the search key order corresponds to the sort order of the original file) can have the pointers of their leaf nodes point to an appropriate position in the original file that represents the first occurrence of that key value.

• Secondary indices must be dense indices. B+-trees constructed as a secondary index must have the pointers of their leaf nodes point to a bucket storing all locations where a given search key value occurs; this set of buckets is often called an occurrence file

Example of a B+-tree

• B+-tree for the account file (n=3)

Another Example of a B+-tree

• B+-tree for the account file (n=5)

• Leaf nodes must have between 2 and 4 values (⌈(n-1)/2⌉ and n-1, with n=5)

• Non-leaf nodes other than the root must have between 3 and 5 children (⌈n/2⌉ and n, with n=5)

• Root must have at least 2 children
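The occupancy limits for a given order n can be tabulated with a small helper. This is a sketch under the assumption that the fractional bounds are rounded up, which is what produces the figures 2 and 3 for n=5; the name `occupancy_bounds` is hypothetical.

```python
import math


def occupancy_bounds(n: int) -> dict:
    """Min/max occupancy for a B+-tree of order n (bounds rounded up)."""
    return {
        # leaf nodes: between ceil((n-1)/2) and n-1 values
        "leaf_values": (math.ceil((n - 1) / 2), n - 1),
        # non-leaf, non-root nodes: between ceil(n/2) and n children
        "internal_children": (math.ceil(n / 2), n),
        # a root that is not itself a leaf has at least 2 children
        "root_min_children": 2,
    }


print(occupancy_bounds(5))
print(occupancy_bounds(3))
```

For n=5 this gives 2 to 4 values per leaf and 3 to 5 children per non-leaf node; for n=3 it gives 1 to 2 values per leaf and 2 to 3 children.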

Observations about B+-trees

• Since the inter-node connections are done by pointers, “logically” close blocks need not be “physically” close

• The non-leaf levels of the B+-tree form a hierarchy of sparse indices

• The B+-tree contains a relatively small number of levels (logarithmic in the size of the main file), thus searches can be conducted efficiently

• Insertions and deletions to the main file can be handled efficiently, as the index can be restructured in logarithmic time (as we shall examine later in class)

Queries on B+-trees

• Find all records with a search-key value of k

– start with the root node (assume it has m pointers)

• examine the node for the smallest search-key value > k

• if we find such a value, say at Kj, follow the pointer Pj to its child node

• if no such value exists, then k >= Km-1, so follow Pm

– if the node reached is not a leaf node, repeat the procedure above and follow the corresponding pointer

– eventually we reach a leaf node. If we find a matching key value (our search value k = Ki for some i) then we follow Pi to the desired record or bucket. If we find no matching value, the search is unsuccessful and we are done.
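The lookup procedure above can be sketched in Python as follows. The `Node` class and the tiny three-leaf tree are assumptions made for illustration: leaves here store one pointer per key and omit any next-leaf pointer, and the record references are just strings.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass
class Node:
    keys: list       # sorted search-key values K1..K(m-1)
    pointers: list   # children (non-leaf) or record/bucket references (leaf)
    is_leaf: bool = False


def search(root, k):
    """Return the record/bucket reference for key k, or None if absent."""
    node = root
    while not node.is_leaf:
        # Index of the smallest key > k; if there is none this equals
        # len(keys), which selects the last pointer Pm.
        j = bisect_right(node.keys, k)
        node = node.pointers[j]
    # In the leaf, look for an exact match k == Ki.
    i = bisect_left(node.keys, k)
    if i < len(node.keys) and node.keys[i] == k:
        return node.pointers[i]
    return None


leaves = [
    Node([1, 3], ["rec1", "rec3"], is_leaf=True),
    Node([5, 7], ["rec5", "rec7"], is_leaf=True),
    Node([9, 11], ["rec9", "rec11"], is_leaf=True),
]
root = Node([5, 9], leaves)

print(search(root, 7))  # rec7
print(search(root, 4))  # None
```

The same loop runs once per level, so successful and unsuccessful searches both visit exactly one node per level of the tree.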

Queries on B+-trees (2)

• Processing a query traces a path from the root node to a leaf node

• If there are K search-key values in the file, the path is no longer than ⌈log_⌈n/2⌉(K)⌉ (logarithm to base ⌈n/2⌉)

• A node is generally the same size as a disk block, typically 4 kilobytes, and n is typically around 100 (40 bytes per index entry)

• With 1 million search key values and n=100, at most ⌈log_50(1,000,000)⌉ = 4 nodes are accessed in a lookup

• In a balanced binary tree with 1 million search key values, around 20 nodes are accessed in a lookup

– the difference is significant since every node access might need a disk I/O, costing around 20 milliseconds
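The path-length figures above can be reproduced with integer arithmetic. This sketch assumes the bound is the base-⌈n/2⌉ logarithm of the number of keys, rounded up; `max_path_length` is a hypothetical helper name, and a loop is used instead of a floating-point logarithm to avoid rounding surprises.

```python
def max_path_length(num_keys: int, n: int) -> int:
    """Smallest h with ceil(n/2)**h >= num_keys, i.e. the log bound rounded up."""
    base = -(-n // 2)  # ceil(n / 2) using integer arithmetic
    if base < 2:
        raise ValueError("order n must be at least 3")
    levels, reach = 0, 1
    while reach < num_keys:
        reach *= base
        levels += 1
    return levels


print(max_path_length(1_000_000, 100))  # 4  (base 50)
print(max_path_length(1_000_000, 4))    # 20 (base 2, comparable to a binary tree)
```

At roughly 20 milliseconds per disk access, that is about 80 ms for the B+-tree lookup against about 400 ms for the binary tree.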

Insertion on B+-trees

• Find the leaf node in which the search-key value would appear

• If the search key value is already present, add the record to the main file and (if necessary) add a pointer to the record to the appropriate occurrence file bucket

• If the search-key value is not there, add the record to the main file as above (including creating a new occurrence file bucket if necessary). Then:

– if there is room in the leaf node, insert (key-value, pointer) in the leaf node

– otherwise, overflow. Split the leaf node (along with the new entry)

Insertion on B+-trees (2)

• Splitting a node:

– take the n (search-key-value, pointer) pairs, including the one being inserted, in sorted order. Place half in the original node, and the other half in a new node.

– Let the new node be p, and let k be the least key value in p. Insert (k, p) in the parent of the node being split.

– If the parent overflows as a result of this insertion, split it as described above, and propagate the split as far up as necessary

• The splitting of nodes proceeds upwards until a node that is not full is found. In the worst case the root node may be split, increasing the height of the tree by 1.
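The leaf-level part of this procedure can be sketched as a single function. The name `insert_into_leaf` and the representation of a leaf as a sorted list of (key, pointer) tuples are assumptions for illustration; the sketch also assumes that when the count is odd the extra entry stays in the original node, and it does not handle the parent update or the propagation of splits.

```python
def insert_into_leaf(entries, new_entry, n):
    """Insert new_entry into a leaf of order n.

    entries: sorted list of (key, pointer) pairs, at most n-1 of them.
    Returns (left, right, separator). If no split was needed, right and
    separator are None. Otherwise separator is the least key of the new
    node, which the caller must insert into the parent.
    """
    merged = sorted(entries + [new_entry], key=lambda e: e[0])
    if len(merged) <= n - 1:
        return merged, None, None          # room in the leaf: no split
    half = (len(merged) + 1) // 2          # first half stays in the original
    left, right = merged[:half], merged[half:]
    return left, right, right[0][0]


# Order n=3: a leaf holds at most 2 entries, so a third forces a split.
left, right, sep = insert_into_leaf([(1, "a"), (3, "b")], (2, "c"), n=3)
print(left, right, sep)  # [(1, 'a'), (2, 'c')] [(3, 'b')] 3
```

The returned separator plays the role of k in the pair (k, p) that is inserted into the parent.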

Insertion on B+-trees Example

Deletion on B+-trees

• Find the record to be deleted, and remove it from the main file and the bucket (if necessary)

• If there is no occurrence-file bucket, or if the deletion caused the bucket to become empty, then delete (key-value, pointer) from the B+-tree leaf-node

• If the leaf-node now has too few entries, underflow has occurred. If the active leaf-node has a sibling with few enough entries that the combined entries can fit in a single node, then

– combine all the entries of both nodes in a single one

– delete the (K,P) pair pointing to the deleted node from the parent. Follow this procedure recursively if the parent node underflows.

Deletion on B+-trees (2)

• Otherwise, if no sibling node is small enough to combine with the active node without causing overflow, then:

– Redistribute the pointers between the active node and the sibling so that both of them have sufficient pointers to avoid underflow

– Update the corresponding search key value in the parent node

– No deletion occurs in the parent node, so no further recursion is necessary in this case.

• Deletions may cascade upwards until a node with ⌈n/2⌉ or more pointers is found. If the root node has only one pointer after deletion, it is removed and the sole child becomes the root (reducing the height of the tree by 1)
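The choice between combining and redistributing can be sketched for two adjacent leaves. The helper name `rebalance_leaves` and the list-of-tuples representation are hypothetical, and splitting the entries evenly is just one way to give both nodes sufficient pointers (moving a single entry would also satisfy the rule). Updating or deleting the entry in the parent is left to the caller.

```python
import math


def rebalance_leaves(left, right, n):
    """Resolve underflow between two adjacent leaves of a tree of order n.

    left, right: sorted lists of (key, pointer); every key in left is
    smaller than every key in right, and one of them is underfull.
    Returns (action, new_left, new_right, new_separator).
    """
    combined = left + right
    if len(combined) <= n - 1:
        # Everything fits in one node: merge. The caller deletes the
        # parent's (K, P) pair for the removed node, which may recurse.
        return "merge", combined, None, None
    # Otherwise share the entries; the parent only needs its separator
    # key updated, so nothing propagates further up.
    half = math.ceil(len(combined) / 2)
    new_left, new_right = combined[:half], combined[half:]
    return "redistribute", new_left, new_right, new_right[0][0]


# Order n=5: leaves hold 2 to 4 entries, so a 1-entry leaf is underfull.
print(rebalance_leaves([(1, "a")], [(5, "b"), (6, "c")], n=5)[0])
print(rebalance_leaves([(1, "a")], [(5, "b"), (6, "c"), (7, "d"), (8, "e")], n=5)[0])
```

The first call merges (3 entries fit in one leaf); the second redistributes, because 5 entries exceed the 4-entry capacity of a single leaf.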

Deletion on B+-trees Example 1
