csc 556 – dbms ii, spring 2013

CSC 556 – DBMS II, Spring 2013

April 10 & 17, 2013Storage media hierarchies, external

sorts, B-trees

Storage medium abstraction

• Storage medium as a subclass of an interface allows you to prototype storage medium semi-independently from DBMS structures atop it.

Sequential File I/O

• Queue abstraction supports sequential file I/O or core I/O via a series of enqueue, peek & dequeue calls. Supplies one-record lookahead.

• External sorts, i.e., sorts where the data do not fit into memory, rely on this approach.

• A record size entry can precede any variable length record in the file, OR

• A sentinel value can mark a record’s end.

Direct Access File I/O

• Direct access (a.k.a. random access) file I/O uses a seek system call to locate a position with a direct-access (binary) data file.

• man lseek, fseek, ftell• Offsets are from 0, or current seek position, or end.• Unix 3C library builds atop level 2 system calls. DBMS

may access level 2 system calls directly.• It treats the file as an array where a seek address is an offset

into the array.• Applications attempt to keep contiguous records in

contiguous blocks.

Merge sort & Sequential File I/O

• The split phase partitions alternate runs from a queue into two helper queues, where the initial queue is the file to be sorted, and the helpers are temporary files.

• A run is a sorted subsequence.• The merge phase doubles the run length by merging peer

runs from the helper queues back into the initial queue.• Natural-length merge sort inspects the data to locate run

boundaries. It may stumble onto big runs.• Another variants uses an internal sort such as quicksort to

build larger initial runs in memory.• A file-based sort is an external sort.

April 10 example of merge sort• End of split phase with run length = 1.• End-of-run record is underlined.• Destination of this phase is in bold.

main -17 -101 5 0 18 29 62 666 1 -1

tmp0 -17 5 18 62 1

tmp1 -101 0 29 666 -1

April 10 example of merge sort• End of split phase with run length = 1.

• End of merge phase grows run length to 2.

main -17 -101 5 0 18 29 62 666 1 -1

tmp0 -17 5 18 62 1

tmp1 -101 0 29 666 -1

main -101 -17 0 5 18 29 62 666 -1 1

tmp0 -17 5 18 62 1

tmp1 -101 0 29 666 -1



main -101 -17 0 5 18 29 62 666 -1 1

tmp0 -101 -17 18 29 -1 1

tmp1 0 5 62 666

main -101 -17 0 5 18 29 62 666 -1 1

tmp0 -101 -17 18 29 -1 1

tmp1 0 5 62 666



main -101 -17 0 5 18 29 62 666 -1 1

tmp0 -101 -17 0 5 -1 1

tmp1 18 29 62 666

main -101 -17 0 5 18 29 62 666 -1 1

tmp0 -101 -17 0 5 -1 1

tmp1 18 29 62 666



main -101 -17 0 5 18 29 62 666 -1 1

tmp0 -101 -17 0 5 18 29 62 666

tmp1 -1 1

main -101 -17 -1 0 1 5 18 29 62 666

tmp0 -101 -17 0 5 18 29 62 666

tmp1 -1 1

Merge sort is O(n log(n))

• Picture a merge sort as a tree growing up from N runs of length 1 to 1 run of length N.

April 10 radix10 sort (bucket sort)

• This sort also uses sequential file I.O.• Initial sequence of integers.

• Sequence normalized to non-negative values with digits to accommodate the largest.

main -17 -101 5 0 18 29 62 666 1 -1

main 084 000 106 101 119 130 163 707 102 100

Number temp queues = radixmain 084 000 106 101 119 130 163 767 102 100

tmp0 000 130 100

tmp1 101

tmp2 102

tmp3 163

tmp4 084

tmp5

tmp6 106

tmp7 767

tmp8

tmp9 119


tmp0 000 100 101 102 106

tmp1 119

tmp2

tmp3 130

tmp4

tmp5

tmp6 163 767

tmp7

tmp8 084

tmp9


tmp0 000 084

tmp1 100 101 102 106 119 130 163

tmp2

tmp3

tmp4

tmp5

tmp6

tmp7 767

tmp8

tmp9

Number temp queues = radix

• Sort is O(N x D) for N items and D digits, but• D is a constant and does not affect growth rate• Radix sort is therefore O(N) on data size• It requires a fixed-bit-width key field.

• Fixed-width fields required for a relational DBMS.

• It has a lot of copy overhead.• Use a radix with many bits, many (smaller) files.

main 000 084 100 101 102 106 119 130 163 767

final -101 -17 -1 0 1 5 18 29 62 666

Base-2numbits code.

void radixsort(interface_queueOfInts *queueToSort, int numbits,

interface_queueOfInts * temporaryQueues[]) { const int numqueues = (1 << numbits); int mask = ~(~0 << numbits); const int allbits = sizeof(int) * 8 ; // bits per sorted value for (int shifter = 0 ; shifter < allbits ; shifter += numbits) { splitphase(queueToSort, temporaryQueues, shifter, mask); mergephase(queueToSort, temporaryQueues,

numqueues); } }

Base-2numbits code.

static void splitphase(interface_queueOfInts *merger, interface_queueOfInts * splitter[], int shifter, int bitmask) { bool ignoreme ; // We only peek on queues with data, etc. while (merger->canPeek()) { // while there are runs to split int value = merger->peek(ignoreme); merger->dequeue(); int qid = (value >> shifter) & bitmask ; splitter[qid]->enqueue(value);}}

Base 2numbits code.

static void mergephase(interface_queueOfInts *merger, interface_queueOfInts * splitter[], int numqueues) { bool ignoreme ; for (int qtodrain = 0 ; qtodrain < numqueues ; qtodrain++) { while (splitter[qtodrain]->canPeek()) { merger->enqueue(splitter[qtodrain]->peek(ignoreme)); splitter[qtodrain]->dequeue(); } }}

MultiSet is a B+-tree Map over various storage subclasses

• DataMine/mset/MultiSet.h

Relational DBMS

• Flat, fixed width records (tuples) fit into contiguous memory locations & file blocks.

• Take the least common multiple (LCM) of the record size and block size, and allocate using that.

• Unix lseek and fcntl are the primary system calls. Windows has counterparts. The low-end cylinder allocation on the disk is managed by the operating system.

fixed0 fixed1 fixed2 fixed2 fixed3 fixed4 fixed5 fixed6 fixed7 fixed8

B-trees and index files

• B-trees are balanced binary trees; typically degree D > 2, as determined by block size.

• How many B-tree node records fit into a disk block?• Each node holds between D/2 and D entries.

• The root is an exception. It can hold < D/2 entries.• In B+-trees the leaves hold the actual data, typically as seek

indices into the contiguous database file. B+-trees also link the leaves into a sorted chain for range-based serial access.

• B+-tree interior nodes hold pointers to children.• A B-tree grows from the bottom up whenever an insertion

causes a node to split.

MultiSet

• MultiSet uses sets of keys or multisets of keys (duplicate keys allowed) to map to application data elements.

• Search includes ==, <, <=, > or >= key.• First or last key occurrence for actual multisets.• Greatest lesser value when key is not present.• Also supports least greater value.• Serial linked list at leaves (B+ tree) supports duplicate

keys & iterating over results, including following operations.• A result is a MultiSet amenable to union, intersection

and set difference with other MultiSet objects.

MultiSet.htypedef unsigned long location ; const unsigned BTREEDEGREE = 16 ;template <class KeyType> struct treenode { // basic internal node location parent ; // type treenode location child[BTREEDEGREE] ; // treenodes or leafnodes KeyType key[BTREEDEGREE] ;} // min keys for those childrentemplate <class KeyType> struct leafnode { // leaf connects treenodes to treeelems location parent ; // type treenode location prev, next ; // siblings or cousins location child[BTREEDEGREE] ; // treeelems KeyType key[BTREEDEGREE] ;} // keys for those children

MultiSet.htemplate <class ElementType> struct treeelem { /* a btree leaf element */ ElementType element ; // main contents location next ; // next avail. if needed for a free list } ;template <class ElementType, class KeyType>class MultiSet { /* ElementType is the type of the set's elements. KeyType is the type,

typically part or possibly all of an ElementType object, thatconstitutes a search key.

... */

Abstract location

• Location is an unsigned long that is either a seek offset (file) or a cast of an object pointer.

• MultiSet records the depth of the tree.• Interior nodes are type treenode.• Leaf nodes are type leafnode.• Leaf nodes point to treeelem application data.• Those may be app data or may be record

indices into another file of flat data records.

Searching through the B-tree

• findleaf uses findsubtree on interior nodes.• Uses O(log n) binary search on interior node.• Calls findsubtree in a loop until hitting the

leaves.• Multi-key version finds first or last instance.• findslot returns the array index inside a node.

Insertion in the B-tree

• randominsert starts at the root• Initial element is a special case.

• Root node is always special because it may contain fewer than degree / 2 entries.

• Otherwise, if it fits into a node, put it there.• When the node is full, split into two nodes of

size N/2 and N/2 + 1.• Add the new node to its parent.• If that parent node was already full, split it recursively.

Deletion from a B-tree

• There is no problem when number of entries remains >= N/2. Just slide entries above the deleted entry down to cover the deleted one.

• When entries < N/2, try merging with neighbors in a serial chain (siblings or cousins).

• If there are too few for that, merge into one node. Other is empty, delete it from its parent.

• If parent entries < N/2, perform recursive delete.• See deleteelem and deleteleaf in MultiSet.

http://en.wikipedia.org/wiki/B-tree

• keeps keys in sorted order for sequential traversing• uses a hierarchical index to minimize the number of

disk reads• uses partially full blocks to speed insertions and

deletions• keeps the index balanced with an elegant recursive

algorithm• minimizes waste by making sure the interior nodes

are at least half full

CoreMultiSet & FileMultiSet

• storage read / write via abstract location• readnode / readleaf / readelem• writenode / writeleaf / writeelem• allocnode / allocleaf / allocelem• freenode / freeleaf / freeelem• FileMultiSet maintains free lists of the above.• Flat file structure in main relational file makes

storage management of free list “easy.”

Other Indexing – Skip Lists

• Skip lists support probabilistic log(N) lookup, insertion & deletion of a key -> value mapping.

• We will review Pugh’s paper from the 1990’s.• A skip list links each mapping in a series of key-sorted

linked lists, in which higher-order lists contain fewer members, acting as “highways” & “boulevards” in locating keys.

• Skip lists provide better support for concurrency than some balanced tree algorithms by avoiding global tree restructuring on rebalancing.

Other Indexing – Hashing

• Hashing approaches O(1) (constant-time) lookup for an ideal hash function.

• The ideal hash function preserves all of the bits of distinguishing information is a search key, while folding them into fewer bits to use as a lookup index into an array, an index file, or a flat file of fixed-width records.

• Hash tables disambiguate key collisions either by storing colliding elements in a linked list (chained hashing) or by rehashing to a new bucket (open address hashing).

• Hashing supports only == tests, not <, <=, >, >=.

Other Indexing – Sorting

• When a sequence of fixed-size records or indices are sorted in a file of contiguous records, binary search is a viable option for locating a key.

• Hashing and sorting are particularly appropriate for indexing on-the-fly, temporary result sets to be combined via intersection, union or set difference.

• Approximate O(1) hashing is fast when combining result sets based only on equality.

• Sort-based search and merging are appropriate when a query requests a result set sorted on an attribute.

Query Processing (Elmasri & Navathe chapter 19)

• Translate SQL statements into abstract syntax tree using basic compilation techniques.

• Interpret the abstract syntax tree.• ORDER-BY and elimination of duplicate tuples in a

PROJECT are supported by external sort.• Duplicate elimination can use low-level memcmp byte

comparison when comparing fixed-size records.

• If SORT is based in an index key or composite index key, or if query does not entail PROJECT, then a B+-tree index avoids need for a sort.

SELECT Processing

• Use indexed fields as primary search keys.• Use index-based set intersection, union and

difference operations to support AND, OR and NOT.• Use slow (O(n)) sequential search, or external sorting

(O(n log(n)) where appropriate (ORDER-BY or duplicate elimination) combined with binary search.

• Use hashing for == matching.• Use composite search key indexing or hashing.• Utilize selectivity where possible.

JOIN Processing

• Nested-loop is brute force approach (O(NK)).• Single-loop when join attributes are indexed.• Sort-merge join requires records to be sorted

on the join attributes, and then merged.• Partition-hash join hashes smaller of two

contributing relations into chained hash table.• Larger relation is then hashed on join

attributes to retrieve tuples from the smaller.

PROJECT Processing

• If projection includes a distinguishing key, the projected tuples are already unique. Just select a subset of the query results.

• Otherwise, DISTINCT projections require sorting based on entire tuple to eliminate duplicates.

• It is also possible to hash on entire tuples to eliminate duplicates.

Approaches to Query Optimization

• Pipelined or stream-based processing of query stages across multiple thread / memory units.

• Compiler optimization techniques such as common subexpression elimination.

• Functional approaches such as lazy evaluation.• Use meta-data and heuristics such as size of

contributing relations (smaller is faster and may fit into memory), key distribution data.

Costs to consider

• Access cost (disk I/O), e.g., NFS vs. local disk.• Disk storage cost for intermediate files.• Computation costs (O(?)) cost.• Memory usage cost (avoid thrashing).• Communication cost for distributed systems.• Maintenance cost in terms of resources &

availability of the database.

Physical Database Design

• Chapter 20 in Elmasri & Navathe textbook.• Accumulate query statistics & data mine

them.• What attributes to index?• When to use a clustered index on a non-key.• Actual dataset can be organized on only 1 key.• Hashing works well on equality-only joins.• Dynamic hashing for volatile files.

Physical Database Techniques

• CREATE [ UNIQUE ] INDEX <index name> ON <table name> (<column name> [ <order> ] { , <column name> [ <order> ] } ) [ CLUSTER ] ;

• Denormalization demotes normalized tables to weaker forms for increased speed.

• Vertical partitioning splits relations over attributes to speed projection dynamics.

• Horizontal partitioning splits relations over indexed tuples to speed selection dynamics.

Collect statistics

• Storage statistics include data about table spaces, index spaces, buffer pools. DBMS may require preallocation and sizing / tuning for clustering.

• I/O and device performance statistics include read / write (paging) on disk intents, hot spots, and thrashing for core memory.

• NFS/local/in-core, number of network interface cards, amount of core memory, cache, memory topology.

• Query / transaction statistics help determine attributes to index & query distributions.

Tuning queries• Precompiled queries offer opportunities for profiling and speed

improvement.• Avoid generating larger intermediate result sets when smaller ones are

available.• Avoid nested queries that generate large cross-products in favor of

sequential queries. P. 737 example potential to search all of M for each tuple from E.

SELECT Ssn SELECT MAX(Salary) AS High_salary, Dno INTO TEMPFROM EMPLOYEE E FROM EMPLOYEE GROUP BY Dno ;WHERE SELECT EMPLOYEE.Ssn FROM EMPLOYEE, TEMPSalary = SELECT MAX(Salary) WHERE EMPLOYEE.Salary =

TEMP.High_salaryFROM EMPLOYEE AS M AND EMPLOYEE.Dno = TEMP.Dno ;WHERE M.Dno = E.Dno ;

csc 556 – dbms ii, spring 2013

Documents