fs lecture

Upload: tanvir1987

Post on 30-May-2018


  • 8/9/2019 FS Lecture


    File Structures

    Indexed Sequential File Access and Prefix B+ Trees


    March 16 & 21, 2000

    Indexed Sequential Access

    Up to this point, we have had to choose between viewing a file from an indexed point of view or from a sequential point of view.

    Here, we are looking for a single organizational method that provides both of these views simultaneously.

    Why care about obtaining both views simultaneously? If an application requires both interactive random access and cosequential batch processing, both sets of actions have to be carried out efficiently (e.g., a student record system at a university).


    Maintaining a Sequence Set: The Use of Blocks I

    A sequence set is a set of records in physical key order that stays ordered as records are added and deleted.

    Since sorting and re-sorting the entire sequence set as records are added and deleted is expensive, we look at other strategies. In particular, we look at a way to localize the changes.

    The idea is to use blocks that can be read into memory and rearranged there quickly. As in B-Trees, blocks can be split, merged, or have their records redistributed as necessary.
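A minimal Python sketch of the block idea, under stated assumptions: each block is an in-memory sorted list, the outer list models the logical (not physical) order of the blocks, and the names `BLOCK_CAPACITY` and `insert_record` are hypothetical, chosen only for illustration. It covers insertion with a split; merging and redistribution on deletion are left out.

```python
import bisect

BLOCK_CAPACITY = 4  # hypothetical limit: maximum number of records per block


def insert_record(blocks, key):
    """Insert key into a list of sorted blocks, splitting a block on overflow."""
    if not blocks:
        blocks.append([key])
        return
    # Pick the last block whose first key is <= key (block 0 if there is none).
    firsts = [block[0] for block in blocks]
    i = max(bisect.bisect_right(firsts, key) - 1, 0)
    bisect.insort(blocks[i], key)
    if len(blocks[i]) > BLOCK_CAPACITY:
        mid = len(blocks[i]) // 2
        # Only the overflowing block is rearranged; the others are untouched.
        blocks[i:i + 1] = [blocks[i][:mid], blocks[i][mid:]]


blocks = []
for name in ["BOONE", "ADAMS", "BYNUM", "BAIRD", "BIXBY"]:
    insert_record(blocks, name)
print(blocks)
```

The fifth insertion overflows the single block, which is then split in two; no global sort of the records is ever performed.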


    Maintaining a Sequence Set: The Use of Blocks II

    Using blocks, we can thus keep a sequence set in order by key without ever having to sort the entire set of records.

    However, there are certain costs associated with this approach:

    A blocked file takes up more space than an unblocked file because of internal fragmentation.

    The order of the records is not necessarily physically sequential throughout the file. The maximum guaranteed extent of physical sequentiality is within a block.
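The second cost can be illustrated with a small sketch. It assumes, hypothetically, that each block records the number of its logical successor, so a sequential pass follows links rather than physical position. The dictionary layout and the name `scan` are illustrative only.

```python
def scan(blocks, head):
    """Yield every record in key order by following successor links."""
    rbn = head
    while rbn is not None:
        records, rbn = blocks[rbn]  # (records in the block, next block number)
        yield from records


# Physical positions 0, 1, 2; the logical order of the blocks is 0 -> 2 -> 1.
blocks = {
    0: (["ADAMS", "BAIRD"], 2),
    1: (["CAGE", "DUTTON"], None),
    2: (["BIXBY", "BOONE"], 1),
}
print(list(scan(blocks, 0)))
```

Records are contiguous only inside a block; moving from one block to the next may require a jump to a different physical position.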


    Maintaining a Sequence Set: The Use of Blocks III

    An important aspect of using blocks is the choice of a block size. There are two considerations to keep in mind when choosing a block size:

    The block size should be such that we can hold several blocks in memory at once.

    The block size should be such that we can access a block without having to bear the cost of a disk seek within the block read or block write operation.


    Adding a Simple Index to the Sequence Set

    Each of the blocks we created for our sequence set contains a range of records that might contain the record we are seeking.

    We can construct a simple single-level index for these blocks. The combination of this kind of index with the sequence set of blocks provides complete indexed sequential access. This method works well as long as the entire index can be held in memory.

    If the entire index cannot be held in memory, then we can use a B+ Tree, which is a B-Tree index plus a sequence set that holds the records.
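As a rough illustration of the single-level index, the sketch below assumes that the index stores the highest key of each block, in block order, and that it fits in memory. That layout and the name `find_block` are assumptions made for this example, not details given in the slides.

```python
import bisect


def find_block(index_keys, key):
    """Return the position of the only block that could hold key, or None."""
    # The first block whose highest key is >= key is the only candidate.
    i = bisect.bisect_left(index_keys, key)
    return i if i < len(index_keys) else None


index_keys = ["BERNE", "CAGE", "DUTTON"]  # highest key in blocks 0, 1, 2
print(find_block(index_keys, "BOLEN"))
print(find_block(index_keys, "EMBRY"))
```

A binary search over the in-memory index selects one block, so a random access costs a single block read; sequential access simply walks the blocks in order.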


    The Content of the Index: Separators Instead of Keys

    The index serves as a kind of road map for the sequence set ==> We do not need to have keys in the index set.

    What we really need are separators capable of distinguishing between two blocks.

    We can save space by using variable-length separators and placing the shortest separator in the index structure. The rules are:

    Key < separator ==> Go left.

    Key = separator ==> Go right.

    Key > separator ==> Go right.
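One way to realize these rules in Python is sketched below. It assumes keys are plain strings compared lexicographically and takes the separator to be the shortest prefix of the right block's first key that sorts after the left block's last key. The names `shortest_separator` and `choose_child` are hypothetical.

```python
import bisect


def shortest_separator(left_last, right_first):
    """Shortest prefix of right_first that is greater than left_last."""
    if not left_last < right_first:
        raise ValueError("blocks are not in strictly increasing key order")
    for n in range(1, len(right_first) + 1):
        prefix = right_first[:n]
        if prefix > left_last:
            return prefix
    return right_first  # not reached: right_first itself exceeds left_last


def choose_child(separators, key):
    """Key < separator goes left; key >= separator goes right."""
    return bisect.bisect_right(separators, key)


print(shortest_separator("BOLEN", "CAGE"))
print(shortest_separator("CAGE", "CAMP"))
print(choose_child(["C", "E"], "C"))
```

Because every prefix of the right block's first key is less than or equal to that key, the chosen separator always satisfies left_last < separator <= right_first, which is exactly what the three rules require.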


    The Simple Prefix B+ Tree

    The separators we just identified can be formed into a B-Tree index of the sequence set blocks; this B-Tree index is called the index set.

    Taken together with the sequence set, the index set forms a file structure called a simple prefix B+ Tree.

    "Simple prefix" indicates that the index set contains shortest separators, or prefixes of the keys, rather than copies of the actual keys.


    Simple Prefix B+ Tree Maintenance

    Changes localized to single blocks in the sequence set: Make the changes to the sequence set. The index set normally needs no change, because the existing separators still distinguish the blocks.

    Changes involving multiple blocks in the sequence set:

    If blocks are split in the sequence set, a new separator must be inserted into the index set.

    If blocks are merged in the sequence set, a separator must be removed from the index set.

    If records are redistributed between blocks in the sequence set, the value of a separator in the index set must be changed.
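The three multi-block cases can be sketched as follows. This is a simplification under stated assumptions: the index set is modeled as a flat list in which `seps[i]` separates `blocks[i]` from `blocks[i + 1]`, whereas in a real tree inserting or removing a separator may itself split or merge index blocks. All function names are hypothetical.

```python
def separator(left_last, right_first):
    """Shortest prefix of right_first that sorts after left_last."""
    for n in range(1, len(right_first) + 1):
        if right_first[:n] > left_last:
            return right_first[:n]
    raise ValueError("blocks are not in key order")


def split_block(blocks, seps, i):
    half = len(blocks[i]) // 2
    left, right = blocks[i][:half], blocks[i][half:]
    blocks[i:i + 1] = [left, right]
    seps.insert(i, separator(left[-1], right[0]))  # new separator inserted


def merge_blocks(blocks, seps, i):
    blocks[i:i + 2] = [blocks[i] + blocks[i + 1]]
    del seps[i]  # separator removed


def redistribute(blocks, seps, i):
    both = blocks[i] + blocks[i + 1]
    half = len(both) // 2
    blocks[i], blocks[i + 1] = both[:half], both[half:]
    seps[i] = separator(both[half - 1], both[half])  # separator value changed


blocks = [["ADAMS", "BAIRD", "BIXBY", "BOONE"], ["CAGE", "DUTTON"]]
seps = ["C"]
split_block(blocks, seps, 0)
print(seps)
```

Each operation touches exactly one separator, which mirrors the insert, remove, and change cases listed above.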


    Index Set Block Size

    The physical size of a node for the index set is usually the same as the physical size of a block in the sequence set. We then speak of index set blocks, rather than nodes.

    There are a number of reasons for using a common block size for the index and sequence sets:

    The block size for the sequence set is usually chosen because there is a good fit among this block size, the characteristics of the disk drive, and the amount of memory available.

    A common block size makes it easier to implement a buffering scheme to create a virtual simple prefix B+ Tree.

    The index set blocks and sequence set blocks are often mingled within the same file to avoid seeking between two separate files while accessing the simple prefix B+ Tree.


    Internal Structure of Index Set Blocks: A Variable-Order B-Tree

    Given a large, fixed-size block for the index set, how do we store the separators within it?

    There are many ways to combine the list of separators, the index to separators, and the list of Relative Block Numbers (RBNs) into a single index set block.

    One possible approach stores, in each block, a count of the separators and the total length of the separators.
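One possible byte layout is sketched below with Python's `struct` module. The field order and widths (two 16-bit counts, the concatenated separators, 16-bit offsets into them, then 32-bit RBNs), the 512-byte block size, and the function names are all assumptions made for this example; the slides do not prescribe a specific format.

```python
import struct


def pack_index_block(separators, rbns, block_size=512):
    """Pack separators and child RBNs into one fixed-size index block."""
    if len(rbns) != len(separators) + 1:
        raise ValueError("n separators need n + 1 relative block numbers")
    sep_bytes = [s.encode("ascii") for s in separators]
    joined = b"".join(sep_bytes)
    offsets, pos = [], 0
    for b in sep_bytes:  # the index to separators: start offset of each one
        offsets.append(pos)
        pos += len(b)
    n = len(separators)
    body = (struct.pack("<HH", n, len(joined)) + joined
            + struct.pack(f"<{n}H", *offsets)
            + struct.pack(f"<{n + 1}I", *rbns))
    if len(body) > block_size:
        raise ValueError("separators do not fit in one index block")
    return body.ljust(block_size, b"\x00")


def unpack_index_block(block):
    """Recover the separators and RBNs from a packed block."""
    n, total = struct.unpack_from("<HH", block, 0)
    joined = block[4:4 + total]
    offsets = struct.unpack_from(f"<{n}H", block, 4 + total)
    rbns = struct.unpack_from(f"<{n + 1}I", block, 4 + total + 2 * n)
    ends = list(offsets[1:]) + [total]
    seps = [joined[a:b].decode("ascii") for a, b in zip(offsets, ends)]
    return seps, list(rbns)


block = pack_index_block(["BO", "CAM", "E"], [0, 1, 2, 3])
print(unpack_index_block(block))
```

Since the separators vary in length, the number that fits in a block varies as well, which is why the index set behaves as a B-Tree of variable order.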


    Loading a Simple Prefix B+ Tree I

    Successive insertion is not a good method because splitting and redistribution are relatively expensive and are best reserved for tree maintenance.

    Starting from a sorted file, however, we can place the records into sequence set blocks one by one, starting a new block when the one we are working with fills up. As we make the transition between two sequence set blocks, we can determine the shortest separator for the blocks. We can collect these separators into an index set block that we build and hold in memory until it is full.
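The loading procedure can be sketched in a few lines. The sketch assumes distinct string keys that arrive already sorted, keeps everything in memory, and builds only a single level of separators; writing blocks to disk and starting new index blocks when one fills up are omitted. The names `load` and `shortest_separator` are hypothetical.

```python
def shortest_separator(left_last, right_first):
    """Shortest prefix of right_first that sorts after left_last."""
    for n in range(1, len(right_first) + 1):
        if right_first[:n] > left_last:
            return right_first[:n]
    raise ValueError("input is not sorted or contains duplicate keys")


def load(sorted_keys, block_capacity):
    """Fill sequence set blocks in one pass, collecting separators on the way."""
    blocks, separators, current = [], [], []
    for key in sorted_keys:
        if len(current) == block_capacity:
            # Transition between two blocks: close one, record the separator.
            blocks.append(current)
            separators.append(shortest_separator(current[-1], key))
            current = []
        current.append(key)
    if current:
        blocks.append(current)
    return blocks, separators


keys = ["ADAMS", "BAIRD", "BIXBY", "BOONE", "BYNUM", "CARSON", "CARTER"]
blocks, separators = load(keys, block_capacity=3)
print(blocks)
print(separators)
```

The data is read once, blocks are produced in order, and no block is revisited after it is closed; note that the final block may end up only partly filled.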


    Loading a Simple Prefix B+ Tree II: Advantages

    The advantages of loading a simple prefix B+ Tree sequentially from a sorted file almost always outweigh the disadvantages associated with the possibility of creating blocks that contain too few records or too few separators.

    A particular advantage is that the loading process goes more quickly because:

    The output can be written sequentially.

    We make only one pass over the data.

    No blocks need to be reorganized as we proceed.

    Advantages after the tree is loaded:

    The blocks are 100% full.

    Sequential loading creates a degree of spatial locality within our file ==> Seeking can be minimized.


    B+ Trees

    The difference between a simple prefix B+ Tree and a plain B+ Tree is that the plain B+ Tree does not involve the use of prefixes as separators. Instead, the separators in the index set are simply copies of the actual keys.

    Simple prefix B+ Trees are often more desirable than plain B+ Trees because the prefix separators take up less space than the full keys.

    B+ Trees, however, are sometimes more desirable since 1) they do not need variable-length separator fields and 2) some key sets are not easy to compress effectively.


    B-Trees, B+ Trees and Simple Prefix B+ Trees in Perspective I

    B and B+ Trees are not the only tools useful for file structure design. Simple indexes are useful when they can be held fully in memory, and hashing can provide much faster access than B and B+ Trees.

    Common characteristics of B, B+, and prefix B+ Trees:

    Paged index structures ==> broad and shallow trees.

    Height-balanced trees.

    The trees are grown bottom up, and the operations used are block splitting, merging, and redistribution.

    Two-to-three splitting and redistribution can be used to obtain greater storage efficiency.

    Can be implemented as virtual tree structures.

    Can be adapted for use with variable-length records.


    B-Trees, B+ Trees and Simple Prefix B+ Trees in Perspective II

    Differences between the various structures:

    B-Trees: multi-level indexes to data files that are entry-sequenced. Strengths: simplicity of implementation. Weaknesses: excessive seeking necessary for sequential access.

    B-Trees with Associated Information: These are B-Trees that contain record contents at every level of the B-Tree. Strengths: can save space. Weaknesses: works only when the record information is located within the B-Tree. Otherwise, too much seeking is involved in retrieving the record information.


    B-Trees, B+ Trees and Simple Prefix B+ Trees in Perspective III

    Differences between the various structures (cont'd):

    B+ Trees: In a B+ Tree, all the key and record information is contained in a linked set of blocks known as the sequence set. Indexed access is provided through the index set. Advantages over B-Trees: 1) The sequence set can be processed in a truly linear, sequential way; 2) The index is built with a single key or separator per block of data records rather than with one key per data record ==> the index is smaller and hence shallower.

    Simple Prefix B+ Trees: The separators in the index set are smaller than the keys in the sequence set ==> the tree is even smaller.